Researching file formats 20: gzip

12 Jan 2024

This blog post is part of a series on file formats research. See this introduction post for more information.

Update: The official format definition is now online here: GZIP. Comments welcome directly to the Library of Congress.

Am I running out of steam or is there not that much to say about gzip? I think the biggest struggle in writing about the sustainability of this format is ensuring there isn’t conflation between the format itself (compression algorithm, in this case) and the software tool of the same name that creates this format. They are tightly linked together, but different, so I had to make my research notes explicit in every case so it’s easier for when I turn it over to the writer/editor on this project.

A gzip file contains:

a 10-byte header. This header contains its magic number (“1F8B08”), a version number and a time stamp,
optional extra headers, such as the original file name,
a body, containing a DEFLATE-compressed payload,
and an 8-byte footer, containing a CRC-32 checksum and the length of the original uncompressed data.

I did come across this very thorough StackOverflow answer about gzip (and its relation/difference to other formats of similar names), and the comments are funny, because someone is like “It’d be great if this had cited sources” and the answer, which was written by one of the original authors of gzip, was “I am the reference, having been part of all of that.”

Ashley Blewer

Researching file formats 20: gzip