/*
 * Copyright (c) Meta Platforms, Inc. and affiliates.
 * All rights reserved.
 *
 * This source code is licensed under both the BSD-style license (found in the
 * LICENSE file in the root directory of this source tree) and the GPLv2 (found
 * in the COPYING file in the root directory of this source tree).
 * You may select, at your option, one of the above-listed licenses.
 */

#ifndef ZSTD_ZDICT_H
#define ZSTD_ZDICT_H


/*======  Dependencies  ======*/
#include <stddef.h>  /* size_t */

#if defined (__cplusplus)
extern "C" {
#endif

/* =====   ZDICTLIB_API : control library symbols visibility   ===== */
#ifndef ZDICTLIB_VISIBLE
   /* Backwards compatibility with old macro name */
#  ifdef ZDICTLIB_VISIBILITY
#    define ZDICTLIB_VISIBLE ZDICTLIB_VISIBILITY
#  elif defined(__GNUC__) && (__GNUC__ >= 4) && !defined(__MINGW32__)
#    define ZDICTLIB_VISIBLE __attribute__ ((visibility ("default")))
#  else
#    define ZDICTLIB_VISIBLE
#  endif
#endif

#ifndef ZDICTLIB_HIDDEN
#  if defined(__GNUC__) && (__GNUC__ >= 4) && !defined(__MINGW32__)
#    define ZDICTLIB_HIDDEN __attribute__ ((visibility ("hidden")))
#  else
#    define ZDICTLIB_HIDDEN
#  endif
#endif

#if defined(ZSTD_DLL_EXPORT) && (ZSTD_DLL_EXPORT==1)
#  define ZDICTLIB_API __declspec(dllexport) ZDICTLIB_VISIBLE
#elif defined(ZSTD_DLL_IMPORT) && (ZSTD_DLL_IMPORT==1)
#  define ZDICTLIB_API __declspec(dllimport) ZDICTLIB_VISIBLE /* It isn't required but allows generating better code, saving a function pointer load from the IAT and an indirect jump. */
#else
#  define ZDICTLIB_API ZDICTLIB_VISIBLE
#endif

/*******************************************************************************
 * Zstd dictionary builder
 *
 * FAQ
 * ===
 * Why should I use a dictionary?
 * ------------------------------
 *
 * Zstd can use dictionaries to improve compression ratio of small data.
 * Traditionally small files don't compress well because there is very little
 * repetition in a single sample, since it is small. But, if you are compressing
 * many similar files, like a bunch of JSON records that share the same
 * structure, you can train a dictionary ahead of time on some samples of
 * these files. Then, zstd can use the dictionary to find repetitions that are
 * present across samples. This can vastly improve compression ratio.
 *
 * When is a dictionary useful?
 * ----------------------------
 *
 * Dictionaries are useful when compressing many small files that are similar.
 * The larger a file is, the less benefit a dictionary will have. Generally,
 * we don't expect dictionary compression to be effective past 100KB. And the
 * smaller a file is, the more we would expect the dictionary to help.
 *
 * How do I use a dictionary?
 * --------------------------
 *
 * Simply pass the dictionary to the zstd compressor with
 * `ZSTD_CCtx_loadDictionary()`. The same dictionary must then be passed to
 * the decompressor, using `ZSTD_DCtx_loadDictionary()`. There are other
 * more advanced functions that allow selecting some options; see zstd.h for
 * complete documentation.
 *
 * What is a zstd dictionary?
 * --------------------------
 *
 * A zstd dictionary has two pieces: its header and its content. The header
 * contains a magic number, the dictionary ID, and entropy tables. These
 * entropy tables allow zstd to save on header costs in the compressed file,
 * which really matters for small data. The content is just bytes, which are
 * repeated content that is common across many samples.
 *
 * What is a raw content dictionary?
 * ---------------------------------
 *
 * A raw content dictionary is just bytes. It doesn't have a zstd dictionary
 * header, a dictionary ID, or entropy tables. Any buffer is a valid raw
 * content dictionary.
 *
 * How do I train a dictionary?
 * ----------------------------
 *
 * Gather samples from your use case. These samples should be similar to each
 * other. If you have several use cases, you could try to train one dictionary
 * per use case.
 *
 * Pass those samples to `ZDICT_trainFromBuffer()` and that will train your
 * dictionary. There are a few advanced versions of this function, but this
 * is a great starting point.
 * If you want to further tune your dictionary
 * you could try `ZDICT_optimizeTrainFromBuffer_cover()`. If that is too slow
 * you can try `ZDICT_optimizeTrainFromBuffer_fastCover()`.
 *
 * If the dictionary training function fails, that is likely because you
 * either passed too few samples, or a dictionary would not be effective
 * for your data. Look at the messages that the dictionary trainer printed:
 * if they don't mention too few samples, then a dictionary would not be effective.
 *
 * How large should my dictionary be?
 * ----------------------------------
 *
 * A reasonable dictionary size, the `dictBufferCapacity`, is about 100KB.
 * The zstd CLI defaults to a 110KB dictionary. You likely don't need a
 * dictionary larger than that. But, most use cases can get away with a
 * smaller dictionary. The advanced dictionary builders can automatically
 * shrink the dictionary for you, and select the smallest size that doesn't
 * hurt compression ratio too much. See the `shrinkDict` parameter.
 * A smaller dictionary can save memory, and potentially speed up
 * compression.
 *
 * How many samples should I provide to the dictionary builder?
 * ------------------------------------------------------------
 *
 * We generally recommend passing ~100x the size of the dictionary
 * in samples. A few thousand should suffice. Having too few samples
 * can hurt the dictionary's effectiveness. Having more samples will
 * only improve the dictionary's effectiveness. But having too many
 * samples can slow down the dictionary builder.
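 *
 * As a worked example of the ~100x rule of thumb above (a sketch only:
 * `recommended_sample_bytes` is a hypothetical helper, not part of zdict.h,
 * and it only performs the arithmetic):
 *
```c
#include <assert.h>
#include <stddef.h>

// Hypothetical helper: total sample bytes suggested by the ~100x rule of
// thumb for a dictionary of `dictBufferCapacity` bytes.
static size_t recommended_sample_bytes(size_t dictBufferCapacity)
{
    assert(dictBufferCapacity <= (size_t)-1 / 100);  // guard against overflow
    return dictBufferCapacity * 100;
}
```
 *
 * For the CLI default of 110KB (taken here as 110 * 1024 bytes), this gives
 * 11264000 bytes, i.e. roughly 11 MB of samples.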
 *
 * How do I determine if a dictionary will be effective?
 * -----------------------------------------------------
 *
 * Simply train a dictionary and try it out. You can use zstd's built-in
 * benchmarking tool to test the dictionary effectiveness.
 *
 *   # Benchmark levels 1-3 without a dictionary
 *   zstd -b1e3 -r /path/to/my/files
 *   # Benchmark levels 1-3 with a dictionary
 *   zstd -b1e3 -r /path/to/my/files -D /path/to/my/dictionary
 *
 * When should I retrain a dictionary?
 * -----------------------------------
 *
 * You should retrain a dictionary when its effectiveness drops. Dictionary
 * effectiveness drops as the data you are compressing changes. Generally, we do
 * expect dictionaries to "decay" over time, as your data changes, but the rate
 * at which they decay depends on your use case. Internally, we regularly
 * retrain dictionaries, and if the new dictionary performs significantly
 * better than the old dictionary, we will ship the new dictionary.
 *
 * I have a raw content dictionary, how do I turn it into a zstd dictionary?
 * -------------------------------------------------------------------------
 *
 * If you have a raw content dictionary, e.g. by manually constructing it, or
 * using a third-party dictionary builder, you can turn it into a zstd
 * dictionary by using `ZDICT_finalizeDictionary()`. You'll also have to
 * provide some samples of the data.
 * It will add the zstd header to the
 * raw content. The header contains a dictionary ID and entropy tables, which
 * will improve compression ratio, and allow zstd to write the dictionary ID
 * into the frame, if you so choose.
 *
 * Do I have to use zstd's dictionary builder?
 * -------------------------------------------
 *
 * No! You can construct dictionary content however you please; it is just
 * bytes. It will always be valid as a raw content dictionary. If you want
 * a zstd dictionary, which can improve compression ratio, use
 * `ZDICT_finalizeDictionary()`.
 *
 * What is the attack surface of a zstd dictionary?
 * ------------------------------------------------
 *
 * Zstd is heavily fuzz tested, including loading fuzzed dictionaries, so
 * zstd should never crash, or access out-of-bounds memory no matter what
 * the dictionary is. However, if an attacker can control the dictionary
 * during decompression, they can cause zstd to generate arbitrary bytes,
 * just like if they controlled the compressed data.
 *
 ******************************************************************************/


/*! ZDICT_trainFromBuffer():
 *  Train a dictionary from an array of samples.
 *  Redirects to ZDICT_optimizeTrainFromBuffer_fastCover(), single-threaded, with d=8, steps=4,
 *  f=20, and accel=1.
 *  Samples must be stored concatenated in a single flat buffer `samplesBuffer`,
 *  supplied with an array of sizes `samplesSizes`, providing the size of each sample, in order.
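 *
 *  The flat-buffer layout described above can be sketched as follows. This is
 *  only an illustration: `pack_samples` is a hypothetical helper, not part of
 *  zdict.h, and it assumes the samples start out as separate in-memory buffers.
 *
```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

// Hypothetical helper: copies `nbSamples` separate buffers back to back into
// one newly allocated flat buffer, in order. `samplesSizes[i]` is the size of
// `samples[i]`. Returns the flat buffer (the caller frees it) and stores its
// size in `*totalSize`; returns NULL if there is nothing to copy or if
// allocation fails.
static void* pack_samples(const void* const* samples, const size_t* samplesSizes,
                          unsigned nbSamples, size_t* totalSize)
{
    size_t total = 0;
    size_t offset = 0;
    unsigned i;
    char* flat;

    assert(samples != NULL && samplesSizes != NULL && totalSize != NULL);
    for (i = 0; i < nbSamples; i++) total += samplesSizes[i];
    *totalSize = total;
    if (total == 0) return NULL;

    flat = (char*)malloc(total);
    if (flat == NULL) return NULL;

    for (i = 0; i < nbSamples; i++) {
        // Sample i starts exactly where sample i-1 ended: no padding, no separators.
        memcpy(flat + offset, samples[i], samplesSizes[i]);
        offset += samplesSizes[i];
    }
    return flat;
}
```
 *
 *  The returned buffer would then be passed as `samplesBuffer`, together with
 *  the same `samplesSizes` array and `nbSamples`.
 *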
 *  The resulting dictionary will be saved into `dictBuffer`.
 * @return: size of dictionary stored into `dictBuffer` (<= `dictBufferCapacity`)
 *          or an error code, which can be tested with ZDICT_isError().
 *  Note:  Dictionary training will fail if there are not enough samples to construct a
 *         dictionary, or if most of the samples are too small (< 8 bytes being the lower limit).
 *         If dictionary training fails, you should use zstd without a dictionary, as the dictionary
 *         would have been ineffective anyway. If you believe your samples would benefit from a dictionary
 *         please open an issue with details, and we can look into it.
 *  Note: ZDICT_trainFromBuffer()'s memory usage is about 6 MB.
 *  Tips: In general, a reasonable dictionary has a size of ~ 100 KB.
 *        It's possible to select smaller or larger size, just by specifying `dictBufferCapacity`.
 *        In general, it's recommended to provide a few thousand samples, though this can vary a lot.
 *        It's recommended that the total size of all samples be about 100 times the target size of the dictionary.
 */
ZDICTLIB_API size_t ZDICT_trainFromBuffer(void* dictBuffer, size_t dictBufferCapacity,
                                    const void* samplesBuffer,
                                    const size_t* samplesSizes, unsigned nbSamples);

typedef struct {
    int      compressionLevel;   /**< optimize for a specific zstd compression level; 0 means default */
    unsigned notificationLevel;  /**< Write log to stderr; 0 = none (default); 1 = errors; 2 = progression; 3 = details; 4 = debug; */
    unsigned dictID;             /**< force dictID value; 0 means auto mode (32-bit random value)
                                  *   NOTE: The zstd format reserves some dictionary IDs for future use.
                                  *         You may use them in private settings, but be warned that they
                                  *         may be used by zstd in a public dictionary registry in the future.
                                  *         These dictionary IDs are:
                                  *           - low range  : <= 32767
                                  *           - high range : >= (2^31)
                                  */
} ZDICT_params_t;

/*! ZDICT_finalizeDictionary():
 * Given a custom content as a basis for dictionary, and a set of samples,
 * finalize dictionary by adding headers and statistics according to the zstd
 * dictionary format.
 *
 * Samples must be stored concatenated in a flat buffer `samplesBuffer`,
 * supplied with an array of sizes `samplesSizes`, providing the size of each
 * sample in order. The samples are used to construct the statistics, so they
 * should be representative of what you will compress with this dictionary.
 *
 * The compression level can be set in `parameters`. You should pass the
 * compression level you expect to use in production. The statistics for each
 * compression level differ, so tuning the dictionary for the compression level
 * can help quite a bit.
 *
 * You can set an explicit dictionary ID in `parameters`, or allow us to pick
 * a random dictionary ID for you, but we can't guarantee no collisions.
 *
 * The dstDictBuffer and the dictContent may overlap, and the content will be
 * appended to the end of the header.
 * If the header + the content doesn't fit in
 * maxDictSize, the beginning of the content is truncated to make room, because it
 * is presumed that the most profitable content is at the end of the dictionary,
 * which is the cheapest to reference.
 *
 * `maxDictSize` must be >= max(dictContentSize, ZDICT_DICTSIZE_MIN).
 *
 * @return: size of dictionary stored into `dstDictBuffer` (<= `maxDictSize`),
 *          or an error code, which can be tested by ZDICT_isError().
 * Note: ZDICT_finalizeDictionary() will push notifications into stderr if
 *       instructed to, using notificationLevel>0.
 * NOTE: This function currently may fail in several edge cases including:
 *         * Not enough samples
 *         * Samples are incompressible
 *         * Samples are all exactly the same
 */
ZDICTLIB_API size_t ZDICT_finalizeDictionary(void* dstDictBuffer, size_t maxDictSize,
                                const void* dictContent, size_t dictContentSize,
                                const void* samplesBuffer, const size_t* samplesSizes, unsigned nbSamples,
                                ZDICT_params_t parameters);


/*======   Helper functions   ======*/
ZDICTLIB_API unsigned ZDICT_getDictID(const void* dictBuffer, size_t dictSize);  /**< extracts dictID; @return zero if error (not a valid dictionary) */
ZDICTLIB_API size_t ZDICT_getDictHeaderSize(const void* dictBuffer, size_t dictSize);  /* returns dict header size; returns a ZSTD error code on failure */
ZDICTLIB_API unsigned ZDICT_isError(size_t errorCode);
ZDICTLIB_API const char* ZDICT_getErrorName(size_t errorCode);

#if defined (__cplusplus)
}
#endif

#endif   /* ZSTD_ZDICT_H */

#if defined(ZDICT_STATIC_LINKING_ONLY) && !defined(ZSTD_ZDICT_H_STATIC)
#define ZSTD_ZDICT_H_STATIC

#if defined (__cplusplus)
extern "C" {
#endif

/* This can be overridden externally to hide static symbols. */
#ifndef ZDICTLIB_STATIC_API
#  if defined(ZSTD_DLL_EXPORT) && (ZSTD_DLL_EXPORT==1)
#    define ZDICTLIB_STATIC_API __declspec(dllexport) ZDICTLIB_VISIBLE
#  elif defined(ZSTD_DLL_IMPORT) && (ZSTD_DLL_IMPORT==1)
#    define ZDICTLIB_STATIC_API __declspec(dllimport) ZDICTLIB_VISIBLE
#  else
#    define ZDICTLIB_STATIC_API ZDICTLIB_VISIBLE
#  endif
#endif

/* ====================================================================================
 * The definitions in this section are considered experimental.
 * They should never be used with a dynamic library, as they may change in the future.
 * They are provided for advanced usages.
 * Use them only in association with static linking.
 * ==================================================================================== */

#define ZDICT_DICTSIZE_MIN    256
/* Deprecated: Remove in v1.6.0 */
#define ZDICT_CONTENTSIZE_MIN 128

/*! ZDICT_cover_params_t:
 *  k and d are the only required parameters.
 *  For others, value 0 means default.
 */
typedef struct {
    unsigned k;                  /* Segment size : constraint: 0 < k : Reasonable range [16, 2048+] */
    unsigned d;                  /* dmer size : constraint: 0 < d <= k : Reasonable range [6, 16] */
    unsigned steps;              /* Number of steps : Only used for optimization : 0 means default (40) : Higher means more parameters checked */
    unsigned nbThreads;          /* Number of threads : constraint: 0 < nbThreads : 1 means single-threaded : Only used for optimization : Ignored if ZSTD_MULTITHREAD is not defined */
    double splitPoint;           /* Percentage of samples used for training: Only used for optimization : the first nbSamples * splitPoint samples will be used for training, the last nbSamples * (1 - splitPoint) samples will be used for testing, 0 means default (1.0), 1.0 when all samples are used for both training and testing */
    unsigned shrinkDict;         /* Train dictionaries to shrink in size starting from the minimum size and select the smallest dictionary that is at most shrinkDictMaxRegression% worse than the largest dictionary. 0 means no shrinking and 1 means shrinking */
    unsigned shrinkDictMaxRegression; /* Sets shrinkDictMaxRegression so that a smaller dictionary can be at worst shrinkDictMaxRegression% worse than the max dict size dictionary. */
    ZDICT_params_t zParams;
} ZDICT_cover_params_t;

typedef struct {
    unsigned k;                  /* Segment size : constraint: 0 < k : Reasonable range [16, 2048+] */
    unsigned d;                  /* dmer size : constraint: 0 < d <= k : Reasonable range [6, 16] */
    unsigned f;                  /* log of size of frequency array : constraint: 0 < f <= 31 : 0 means default (20) */
    unsigned steps;              /* Number of steps : Only used for optimization : 0 means default (40) : Higher means more parameters checked */
    unsigned nbThreads;          /* Number of threads : constraint: 0 < nbThreads : 1 means single-threaded : Only used for optimization : Ignored if ZSTD_MULTITHREAD is not defined */
    double splitPoint;           /* Percentage of samples used for training: Only used for optimization : the first nbSamples * splitPoint samples will be used for training, the last nbSamples * (1 - splitPoint) samples will be used for testing, 0 means default (0.75), 1.0 when all samples are used for both training and testing */
    unsigned accel;              /* Acceleration level: constraint: 0 < accel <= 10, higher means faster and less accurate, 0 means default (1) */
    unsigned shrinkDict;         /* Train dictionaries to shrink in size starting from the minimum size and select the smallest dictionary that is at most shrinkDictMaxRegression% worse than the largest dictionary. 0 means no shrinking and 1 means shrinking */
    unsigned shrinkDictMaxRegression; /* Sets shrinkDictMaxRegression so that a smaller dictionary can be at worst shrinkDictMaxRegression% worse than the max dict size dictionary. */

    ZDICT_params_t zParams;
} ZDICT_fastCover_params_t;

/*! ZDICT_trainFromBuffer_cover():
 *  Train a dictionary from an array of samples using the COVER algorithm.
 *  Samples must be stored concatenated in a single flat buffer `samplesBuffer`,
 *  supplied with an array of sizes `samplesSizes`, providing the size of each sample, in order.
 *  The resulting dictionary will be saved into `dictBuffer`.
 * @return: size of dictionary stored into `dictBuffer` (<= `dictBufferCapacity`)
 *          or an error code, which can be tested with ZDICT_isError().
 *          See ZDICT_trainFromBuffer() for details on failure modes.
 *  Note: ZDICT_trainFromBuffer_cover() requires about 9 bytes of memory for each input byte.
 *  Tips: In general, a reasonable dictionary has a size of ~ 100 KB.
 *        It's possible to select smaller or larger size, just by specifying `dictBufferCapacity`.
 *        In general, it's recommended to provide a few thousand samples, though this can vary a lot.
 *        It's recommended that the total size of all samples be about 100 times the target size of the dictionary.
 */
ZDICTLIB_STATIC_API size_t ZDICT_trainFromBuffer_cover(
          void *dictBuffer, size_t dictBufferCapacity,
    const void *samplesBuffer, const size_t *samplesSizes, unsigned nbSamples,
          ZDICT_cover_params_t parameters);

/*! ZDICT_optimizeTrainFromBuffer_cover():
 * The same requirements as above hold for all the parameters except `parameters`.
 * This function tries many parameter combinations and picks the best parameters.
 * `*parameters` is filled with the best parameters found, and the
 * dictionary constructed with those parameters is stored in `dictBuffer`.
 *
 * All of the parameters d, k, steps are optional.
 * If d is non-zero then we don't check multiple values of d, otherwise we check d = {6, 8}.
 * If steps is zero, its default value is used.
 * If k is non-zero then we don't check multiple values of k, otherwise we check `steps` values of k in [50, 2000].
 *
 * @return: size of dictionary stored into `dictBuffer` (<= `dictBufferCapacity`)
 *          or an error code, which can be tested with ZDICT_isError().
 *          On success `*parameters` contains the parameters selected.
 *          See ZDICT_trainFromBuffer() for details on failure modes.
 * Note: ZDICT_optimizeTrainFromBuffer_cover() requires about 8 bytes of memory for each input byte, plus another 5 bytes of memory per input byte for each thread.
 */
ZDICTLIB_STATIC_API size_t ZDICT_optimizeTrainFromBuffer_cover(
          void* dictBuffer, size_t dictBufferCapacity,
    const void* samplesBuffer, const size_t* samplesSizes, unsigned nbSamples,
          ZDICT_cover_params_t* parameters);

/*! ZDICT_trainFromBuffer_fastCover():
 *  Train a dictionary from an array of samples using a modified version of the COVER algorithm.
 *  Samples must be stored concatenated in a single flat buffer `samplesBuffer`,
 *  supplied with an array of sizes `samplesSizes`, providing the size of each sample, in order.
 *  d and k are required.
 *  All other parameters are optional and use default values if not provided.
 *  The resulting dictionary will be saved into `dictBuffer`.
 * @return: size of dictionary stored into `dictBuffer` (<= `dictBufferCapacity`)
 *          or an error code, which can be tested with ZDICT_isError().
 *          See ZDICT_trainFromBuffer() for details on failure modes.
 *  Note: ZDICT_trainFromBuffer_fastCover() requires 6 * 2^f bytes of memory.
 *  Tips: In general, a reasonable dictionary has a size of ~ 100 KB.
 *        It's possible to select smaller or larger size, just by specifying `dictBufferCapacity`.
 *        In general, it's recommended to provide a few thousand samples, though this can vary a lot.
 *        It's recommended that the total size of all samples be about 100 times the target size of the dictionary.
 */
ZDICTLIB_STATIC_API size_t ZDICT_trainFromBuffer_fastCover(void *dictBuffer,
                    size_t dictBufferCapacity, const void *samplesBuffer,
                    const size_t *samplesSizes, unsigned nbSamples,
                    ZDICT_fastCover_params_t parameters);

/*! ZDICT_optimizeTrainFromBuffer_fastCover():
 * The same requirements as above hold for all the parameters except `parameters`.
 * This function tries many parameter combinations (specifically, k and d combinations)
 * and picks the best parameters. `*parameters` is filled with the best parameters found, and the
 * dictionary constructed with those parameters is stored in `dictBuffer`.
 * All of the parameters d, k, steps, f, and accel are optional.
 * If d is non-zero then we don't check multiple values of d, otherwise we check d = {6, 8}.
 * If steps is zero, its default value is used.
 * If k is non-zero then we don't check multiple values of k, otherwise we check `steps` values of k in [50, 2000].
 * If f is zero, default value of 20 is used.
 * If accel is zero, default value of 1 is used.
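 *
 * As a worked example of the 6 * 2^f estimate quoted for
 * ZDICT_trainFromBuffer_fastCover() above (a sketch only:
 * `fastcover_memory_estimate` is a hypothetical helper, not part of zdict.h):
 *
```c
#include <assert.h>

// Hypothetical helper: evaluates the documented 6 * 2^f byte estimate.
// It does not query zstd; it only performs the arithmetic.
static unsigned long long fastcover_memory_estimate(unsigned f)
{
    assert(f > 0 && f <= 31);   // documented constraint: 0 < f <= 31
    return 6ULL << f;           // 6 * 2^f
}
```
 *
 * With the default f = 20 this evaluates to 6291456 bytes (6 MiB), in line
 * with the "about 6 MB" quoted for ZDICT_trainFromBuffer().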
 *
 * @return: size of dictionary stored into `dictBuffer` (<= `dictBufferCapacity`)
 *          or an error code, which can be tested with ZDICT_isError().
 *          On success `*parameters` contains the parameters selected.
 *          See ZDICT_trainFromBuffer() for details on failure modes.
 * Note: ZDICT_optimizeTrainFromBuffer_fastCover() requires about 6 * 2^f bytes of memory for each thread.
 */
ZDICTLIB_STATIC_API size_t ZDICT_optimizeTrainFromBuffer_fastCover(void* dictBuffer,
                    size_t dictBufferCapacity, const void* samplesBuffer,
                    const size_t* samplesSizes, unsigned nbSamples,
                    ZDICT_fastCover_params_t* parameters);

typedef struct {
    unsigned selectivityLevel;   /* 0 means default; larger => select more => larger dictionary */
    ZDICT_params_t zParams;
} ZDICT_legacy_params_t;

/*! ZDICT_trainFromBuffer_legacy():
 *  Train a dictionary from an array of samples.
 *  Samples must be stored concatenated in a single flat buffer `samplesBuffer`,
 *  supplied with an array of sizes `samplesSizes`, providing the size of each sample, in order.
 *  The resulting dictionary will be saved into `dictBuffer`.
 * `parameters` is optional and can be provided with values set to 0 to mean "default".
 * @return: size of dictionary stored into `dictBuffer` (<= `dictBufferCapacity`)
 *          or an error code, which can be tested with ZDICT_isError().
 *          See ZDICT_trainFromBuffer() for details on failure modes.
 *  Tips: In general, a reasonable dictionary has a size of ~ 100 KB.
 *        It's possible to select smaller or larger size, just by specifying `dictBufferCapacity`.
 *        In general, it's recommended to provide a few thousand samples, though this can vary a lot.
 *        It's recommended that the total size of all samples be about 100 times the target size of the dictionary.
 *  Note: ZDICT_trainFromBuffer_legacy() will send notifications into stderr if instructed to, using notificationLevel>0.
 */
ZDICTLIB_STATIC_API size_t ZDICT_trainFromBuffer_legacy(
    void* dictBuffer, size_t dictBufferCapacity,
    const void* samplesBuffer, const size_t* samplesSizes, unsigned nbSamples,
    ZDICT_legacy_params_t parameters);


/* Deprecation warnings */
/* It is generally possible to disable deprecation warnings from the compiler,
   for example with -Wno-deprecated-declarations for gcc
   or _CRT_SECURE_NO_WARNINGS in Visual Studio.
   Otherwise, it's also possible to manually define ZDICT_DISABLE_DEPRECATE_WARNINGS */
#ifdef ZDICT_DISABLE_DEPRECATE_WARNINGS
#  define ZDICT_DEPRECATED(message) /* disable deprecation warnings */
#else
#  define ZDICT_GCC_VERSION (__GNUC__ * 100 + __GNUC_MINOR__)
#  if defined (__cplusplus) && (__cplusplus >= 201402) /* C++14 or greater */
#    define ZDICT_DEPRECATED(message) [[deprecated(message)]]
#  elif defined(__clang__) || (ZDICT_GCC_VERSION >= 405)
#    define ZDICT_DEPRECATED(message) __attribute__((deprecated(message)))
#  elif (ZDICT_GCC_VERSION >= 301)
#    define ZDICT_DEPRECATED(message) __attribute__((deprecated))
#  elif defined(_MSC_VER)
#    define ZDICT_DEPRECATED(message) __declspec(deprecated(message))
#  else
#    pragma message("WARNING: You need to implement ZDICT_DEPRECATED for this compiler")
#    define ZDICT_DEPRECATED(message)
#  endif
#endif /* ZDICT_DISABLE_DEPRECATE_WARNINGS */

ZDICT_DEPRECATED("use ZDICT_finalizeDictionary() instead")
ZDICTLIB_STATIC_API
size_t ZDICT_addEntropyTablesFromBuffer(void* dictBuffer, size_t dictContentSize, size_t dictBufferCapacity,
                                  const void* samplesBuffer, const size_t* samplesSizes, unsigned nbSamples);

#if defined (__cplusplus)
}
#endif

#endif   /* ZSTD_ZDICT_H_STATIC */