LZMA SDK - How to Use

Warning! Some information on this page is older than 6 years now. I keep it for reference, but it probably doesn't reflect my current knowledge and beliefs.

Sat, 08 May 2010

What comes to mind when I say the word "compression"? If you currently study computer science, you probably think about the details of algorithms like RLE, Huffman coding or the Burrows-Wheeler transform. If not, then you surely associate compression with archive file formats such as ZIP and RAR. But there is something in between: libraries that implement a compression algorithm and let you compress some data, but do not provide a ready file format for packing multiple files and directories into one archive. Such a library is useful in gamedev for creating a VFS (Virtual File System). Probably the most popular one is zlib, a free C library that implements the Deflate algorithm. I've recently discovered another one: LZMA. Its SDK is also free (public domain) and the basic version is a small C library (C++, C# and Java APIs are also available, as well as some additional tools). The library uses the LZMA algorithm (Lempel–Ziv–Markov chain algorithm, the same as in the 7z archive format), which AFAIK has a better compression ratio than Deflate. So I've decided to start using it. Here is what I've learned:

If you decide to use only the C API, it's enough to add the .c and .h files from the LZMASDK\C directory (without subdirectories) to your project. Alternatively you can compile them as a static library.

There is a little bit of theory behind the LZMA SDK API. First, the term props means a 5-byte header where the library stores some settings. It must be saved along with the compressed data and passed back to the library before decompression.

Next, the dictionary size. It is the size of a temporary memory block used during compression and decompression. The dictionary size can be set during compression and is then saved inside props. The library uses a dictionary of the same size during decompression. The default dictionary size is 16 MB, so IMHO it's worth changing, especially as I haven't noticed any meaningful drop in compression ratio when I set it to 64 KB.
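To make the props header less of a black box, here is a minimal sketch that decodes its 5 bytes by hand. As far as I know, the first byte packs the lc, lp and pb settings as (pb * 5 + lp) * 9 + lc, and the next four bytes hold the dictionary size as a little-endian 32-bit number. LzmaPropsInfo and ParseLzmaProps are names I made up for this illustration, they are not part of the SDK (the library has its own decoding routine in LzmaDec.h).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper type for this sketch, not part of the LZMA SDK.
struct LzmaPropsInfo
{
  unsigned lc;            // number of literal context bits
  unsigned lp;            // number of literal position bits
  unsigned pb;            // number of position bits
  std::uint32_t dictSize; // dictionary size in bytes
};

// Decodes a 5-byte props header.
// Byte 0 is assumed to hold (pb * 5 + lp) * 9 + lc,
// bytes 1..4 the dictionary size in little-endian order.
// Returns false if the buffer is too short or byte 0 is out of range.
bool ParseLzmaProps(const unsigned char *props, std::size_t propsSize,
                    LzmaPropsInfo &out)
{
  if (props == nullptr || propsSize < 5)
    return false;

  unsigned d = props[0];
  if (d >= 9 * 5 * 5)
    return false;

  out.lc = d % 9;
  d /= 9;
  out.lp = d % 5;
  out.pb = d / 5;
  assert(out.lc <= 8 && out.lp <= 4 && out.pb <= 4);

  out.dictSize =
      static_cast<std::uint32_t>(props[1]) |
      (static_cast<std::uint32_t>(props[2]) << 8) |
      (static_cast<std::uint32_t>(props[3]) << 16) |
      (static_cast<std::uint32_t>(props[4]) << 24);
  return true;
}
```

A helper like this is handy mostly for debugging, for example to check that data you compressed really carries the 64 KB dictionary you asked for.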

And finally, an end mark can be saved at the end of compressed data. You can use it so that the decompression code can determine where the data ends. Alternatively you can decide not to use the end mark, but then you must store the exact size of the uncompressed data somewhere. I prefer the second method, because storing the data size takes only 4 bytes (for the 4 GB limit) and can be useful anyway, while compressed data finished with the end mark is about 6 bytes longer than without it.
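If you go without the end mark, the uncompressed size has to live somewhere. One possible way is a 4-byte little-endian prefix placed in front of props and the compressed data. The layout and the names WriteSizePrefix and ReadSizePrefix are only my assumptions for this sketch. The SDK does not define such a container.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical layout for this sketch (not defined by the LZMA SDK):
// [4 bytes: uncompressed size, little-endian][5 bytes: props][compressed data]
const std::size_t SIZE_PREFIX_LEN = 4;

// Appends the uncompressed size to buf as 4 little-endian bytes.
// The size must fit in 32 bits (the 4 GB limit).
void WriteSizePrefix(std::vector<unsigned char> &buf, std::size_t uncompressedSize)
{
  assert(static_cast<unsigned long long>(uncompressedSize) <= 0xFFFFFFFFull);
  for (std::size_t i = 0; i < SIZE_PREFIX_LEN; ++i)
    buf.push_back(static_cast<unsigned char>((uncompressedSize >> (8 * i)) & 0xFF));
}

// Reads the 4-byte prefix from the start of buf.
// Returns false if the buffer is too short to contain it.
bool ReadSizePrefix(const std::vector<unsigned char> &buf, std::uint32_t &uncompressedSize)
{
  if (buf.size() < SIZE_PREFIX_LEN)
    return false;
  uncompressedSize = 0;
  for (std::size_t i = 0; i < SIZE_PREFIX_LEN; ++i)
    uncompressedSize |= static_cast<std::uint32_t>(buf[i]) << (8 * i);
  return true;
}
```

With this layout, props would start at offset SIZE_PREFIX_LEN and the compressed data 5 bytes after that. The decompression code reads the prefix first and sizes its output buffer from it.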

Compressing a full block of data with a single call is simple. You can find the appropriate functions in the LzmaLib.h header. Here is how you can compress a vector of bytes using the LzmaCompress function:

void Compress1(
  std::vector<unsigned char> &outBuf,
  const std::vector<unsigned char> &inBuf)
{
  // LzmaCompress takes size_t pointers, so use size_t (matters on 64-bit builds).
  size_t propsSize = LZMA_PROPS_SIZE;
  size_t destLen = inBuf.size() + inBuf.size() / 3 + 128;
  outBuf.resize(propsSize + destLen);
  
  int res = LzmaCompress(
    &outBuf[LZMA_PROPS_SIZE], &destLen,
    &inBuf[0], inBuf.size(),
    &outBuf[0], &propsSize,
    -1, 0, -1, -1, -1, -1, -1);
  
  assert(propsSize == LZMA_PROPS_SIZE);
  assert(res == SZ_OK);
  
  outBuf.resize(propsSize + destLen);
}

Destination buffer - outBuf - will contain props in its first 5 bytes (LZMA_PROPS_SIZE) and compressed data in the remaining part (it's just my idea to store it this way). Starting size of outBuf (destLen) is set to some arbitrary quantity big enough to fit all compressed data. The buffer is then trimmed to its real size at the end. LzmaCompress function always saves compressed data without the end mark. Its last 7 parameters are some compression settings (including dictionary size), here set to special values meaning defaults.

A slightly more advanced example also compresses all data with one call, but with the props structure explicitly declared and filled. This example uses the LzmaEnc.h header and the LzmaEncode function.

void Compress2(
  std::vector<unsigned char> &outBuf,
  const std::vector<unsigned char> &inBuf)
{
  // LzmaEncode takes SizeT pointers, so use size_t (matters on 64-bit builds).
  size_t propsSize = LZMA_PROPS_SIZE;
  size_t destLen = inBuf.size() + inBuf.size() / 3 + 128;
  outBuf.resize(propsSize + destLen);

  CLzmaEncProps props;
  LzmaEncProps_Init(&props);
  props.dictSize = 1 << 16; // 64 KB
  props.writeEndMark = 1; // 0 or 1

  int res = LzmaEncode(
    &outBuf[LZMA_PROPS_SIZE], &destLen,
    &inBuf[0], inBuf.size(),
    &props, &outBuf[0], &propsSize, props.writeEndMark,
    &g_ProgressCallback, &SzAllocForLzma, &SzAllocForLzma);
  assert(res == SZ_OK && propsSize == LZMA_PROPS_SIZE);
  
  outBuf.resize(propsSize + destLen);
}

Here we first define a props structure of type CLzmaEncProps, initialize it with default values and then change some fields. We can change not only the dictionary size, but also the flag telling whether compressed data should end with the end mark. The LzmaEncode function encodes the props to &outBuf[0] and compresses the data to the location 5 bytes further (&outBuf[LZMA_PROPS_SIZE]).

g_ProgressCallback is a C "interface" (let's call it this way ;P), passed by pointer, where you can place your callback with progress notification.

SRes OnProgress(void *p, UInt64 inSize, UInt64 outSize)
{
  // Update progress bar.
  return SZ_OK;
}
static ICompressProgress g_ProgressCallback = { &OnProgress };

SzAllocForLzma is another interface which gives LZMA library pointers to the memory allocation and deallocation functions. To just use standard malloc and free functions, you can copy this code:

static void * AllocForLzma(void *p, size_t size) { return malloc(size); }
static void FreeForLzma(void *p, void *address) { free(address); }
static ISzAlloc SzAllocForLzma = { &AllocForLzma, &FreeForLzma };

Decompression with single call is also quite simple. You #include "LzmaLib.h" and use LzmaUncompress function.

static void Uncompress1(
  std::vector<unsigned char> &outBuf,
  const std::vector<unsigned char> &inBuf)
{
  outBuf.resize(UNCOMPRESSED_SIZE);
  // LzmaUncompress takes size_t pointers for both lengths.
  size_t dstLen = outBuf.size();
  size_t srcLen = inBuf.size() - LZMA_PROPS_SIZE;
  SRes res = LzmaUncompress(
    &outBuf[0], &dstLen,
    &inBuf[LZMA_PROPS_SIZE], &srcLen,
    &inBuf[0], LZMA_PROPS_SIZE);
  assert(res == SZ_OK);
  outBuf.resize(dstLen); // If uncompressed data can be smaller
}

You can request decompression of the exact number of bytes if you know the uncompressed data size or alternatively you can try to uncompress more bytes when compressed data are finished with the end mark. In the second case, you should check for new value of dstLen and trim the output buffer to that size. Another function for single-call decompression is LzmaDecode from LzmaDec.h file.

Incremental compression and decompression are a little more complicated. Unfortunately there is no push-like interface for incremental compression. To do that, you have to make a single call to the LzmaEnc_Encode function from LzmaEnc.h and pass your implementations of the ISeqInStream and ISeqOutStream "interfaces" that read uncompressed data and save compressed data. Here is an example:

typedef struct
{
  ISeqInStream SeqInStream;
  const std::vector<unsigned char> *Buf;
  unsigned BufPos;
} VectorInStream;

SRes VectorInStream_Read(void *p, void *buf, size_t *size)
{
  VectorInStream *ctx = (VectorInStream*)p;
  *size = min(*size, ctx->Buf->size() - ctx->BufPos);
  if (*size)
    memcpy(buf, &(*ctx->Buf)[ctx->BufPos], *size);
  ctx->BufPos += *size;
  return SZ_OK;
}

typedef struct
{
  ISeqOutStream SeqOutStream;
  std::vector<unsigned char> *Buf;
} VectorOutStream;

size_t VectorOutStream_Write(void *p, const void *buf, size_t size)
{
  VectorOutStream *ctx = (VectorOutStream*)p;
  if (size)
  {
    unsigned oldSize = ctx->Buf->size();
    ctx->Buf->resize(oldSize + size);
    memcpy(&(*ctx->Buf)[oldSize], buf, size);
  }
  return size;
}

static void CompressInc(
  std::vector<unsigned char> &outBuf,
  const std::vector<unsigned char> &inBuf)
{
  CLzmaEncHandle enc = LzmaEnc_Create(&SzAllocForLzma);
  assert(enc);

  CLzmaEncProps props;
  LzmaEncProps_Init(&props);
  props.writeEndMark = 1; // 0 or 1
  
  SRes res = LzmaEnc_SetProps(enc, &props);
  assert(res == SZ_OK);

  size_t propsSize = LZMA_PROPS_SIZE; // LzmaEnc_WriteProperties takes a SizeT pointer
  outBuf.resize(propsSize);

  res = LzmaEnc_WriteProperties(enc, &outBuf[0], &propsSize);
  assert(res == SZ_OK && propsSize == LZMA_PROPS_SIZE);

  VectorInStream inStream = { &VectorInStream_Read, &inBuf, 0 };
  VectorOutStream outStream = { &VectorOutStream_Write, &outBuf };

  res = LzmaEnc_Encode(enc,
    (ISeqOutStream*)&outStream, (ISeqInStream*)&inStream,
    0, &SzAllocForLzma, &SzAllocForLzma);
  assert(res == SZ_OK);

  LzmaEnc_Destroy(enc, &SzAllocForLzma, &SzAllocForLzma);
}

Incremental decompression is even more complicated, although here we have an interface to just process a piece of data. We use LzmaDec_DecodeToBuf function from LzmaDec.h. For example:

static void UncompressInc(
  std::vector<unsigned char> &outBuf,
  const std::vector<unsigned char> &inBuf)
{
  CLzmaDec dec;  
  LzmaDec_Construct(&dec);
  
  SRes res = LzmaDec_Allocate(&dec, &inBuf[0], LZMA_PROPS_SIZE, &SzAllocForLzma);
  assert(res == SZ_OK);

  LzmaDec_Init(&dec);

  outBuf.resize(UNCOMPRESSED_SIZE);
  // Positions and lengths are size_t because LzmaDec_DecodeToBuf takes SizeT pointers.
  size_t outPos = 0, inPos = LZMA_PROPS_SIZE;
  ELzmaStatus status;
  const size_t BUF_SIZE = 10240;
  while (outPos < outBuf.size())
  {
    size_t destLen = min(BUF_SIZE, outBuf.size() - outPos);
    size_t srcLen  = min(BUF_SIZE, inBuf.size() - inPos);
    size_t srcLenOld = srcLen, destLenOld = destLen;
    res = LzmaDec_DecodeToBuf(&dec,
      &outBuf[outPos], &destLen,
      &inBuf[inPos], &srcLen,
      (outPos + destLen == outBuf.size())
      ? LZMA_FINISH_END : LZMA_FINISH_ANY, &status);
    assert(res == SZ_OK);
    inPos += srcLen;
    outPos += destLen;
    if (status == LZMA_STATUS_FINISHED_WITH_MARK)
      break;
  }

  LzmaDec_Free(&dec, &SzAllocForLzma);
  outBuf.resize(outPos);
}

LzmaDec_DecodeToBuf has several possible status codes returned via its last parameter. LZMA_STATUS_FINISHED_WITH_MARK means decompression finished because the end mark was met inside the compressed data. Look for it if you use the end mark. LZMA_STATUS_MAYBE_FINISHED_WITHOUT_MARK means it's possible that the compressed data finished at this place, but it's your task to make sure you already have as much uncompressed data as you expect. Check for this code when you don't use the end mark. Finally, LZMA_STATUS_NOT_FINISHED and LZMA_STATUS_NEEDS_MORE_INPUT mean you are in the middle of decompression and need to call the function again to process more data. Of course you should first check that res == SZ_OK to ensure there is no error.
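The rules above can be written down as a small decision function. This is only a sketch of one possible policy: StepStatus is a stand-in enum I invented to mirror the four status codes (real code would switch on ELzmaStatus from LzmaDec.h), and NextAction and DecideNextAction are hypothetical names too. The extra "input ran out" case is my own addition to avoid looping forever on truncated data.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for the four status codes described above.
enum class StepStatus
{
  FinishedWithMark,
  NotFinished,
  NeedsMoreInput,
  MaybeFinishedWithoutMark
};

// What the calling loop should do after a successful (SZ_OK) decode step.
enum class NextAction { Done, CallAgain, InputTruncated };

// outPos       - number of uncompressed bytes produced so far
// expectedSize - known uncompressed size (used when there is no end mark)
// inputLeft    - number of compressed bytes not yet consumed
NextAction DecideNextAction(StepStatus status, std::size_t outPos,
                            std::size_t expectedSize, std::size_t inputLeft)
{
  assert(outPos <= expectedSize);

  // The end mark was found: nothing more to decode.
  if (status == StepStatus::FinishedWithMark)
    return NextAction::Done;

  // The stream may have ended here; trust it only if we have all expected bytes.
  if (status == StepStatus::MaybeFinishedWithoutMark && outPos == expectedSize)
    return NextAction::Done;

  // Still in the middle of the stream, but no input remains (my assumption:
  // treat this as truncated data instead of calling the decoder again).
  if (inputLeft == 0)
    return NextAction::InputTruncated;

  return NextAction::CallAgain;
}
```

In a loop like UncompressInc you would compute the action after each call and stop on Done or InputTruncated.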

That was just some test code from my learning and now it's time for the real one. It's not a standalone code snippet, but it's finished, well tested and will be included in the next version of my CommonLib library. You can preview it (and use it however you want) here:

LzmaUtils.hpp
LzmaUtils.cpp
