Transparent Decompression

If you have used Firefox on Android, you may be impressed to learn that it loads compressed code into memory on demand. Doing this, however, required a custom linker, which handles a SIGSEGV, retrieves the compressed data from disk, and then decompresses it into the area reserved for it.

For the past few days, I have been working on getting the filesystem to do this for me transparently. What this means is that we now just take the SIGSEGV and load uncompressed data directly. No messing around in userspace, no complex error handling. I am almost ready to post v1 of the patchset.

The first step was to pick a compression scheme that lends itself easily to seeking and has the other good properties we all like. Since one of our primary targets is Firefox, we decided to use the same compression scheme it uses. However, the implementation should not be tied to a single compression type.

The seekable zip format is just the file split into 16k chunks, each compressed individually with zlib, plus a custom header that stores all of this information. There is a tool available to generate these here. Just build it and use the szip tool to make compressed files.
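To make the chunking idea concrete, here is a minimal sketch in Python. It is not the real szip on-disk layout: the offset table, the names pack_chunks and unpack_chunk, and the assumption that the 16k chunk size refers to uncompressed data are all mine, purely for illustration.

```python
import zlib

CHUNK_SIZE = 16 * 1024  # assumed: 16k of *uncompressed* data per chunk


def pack_chunks(data: bytes, chunk_size: int = CHUNK_SIZE):
    """Compress data one chunk at a time (hypothetical layout, not szip's).

    Returns (offsets, blob). offsets[i] is where compressed chunk i starts
    inside blob, and offsets[-1] is the total compressed length, so chunk i
    occupies blob[offsets[i]:offsets[i + 1]].
    """
    offsets = [0]
    parts = []
    for start in range(0, len(data), chunk_size):
        compressed = zlib.compress(data[start:start + chunk_size])
        parts.append(compressed)
        offsets.append(offsets[-1] + len(compressed))
    return offsets, b"".join(parts)


def unpack_chunk(offsets, blob: bytes, index: int) -> bytes:
    """Decompress a single chunk without touching any of the others."""
    return zlib.decompress(blob[offsets[index]:offsets[index + 1]])
```

Because every chunk is an independent zlib stream, seeking only needs the offset table: any chunk can be decompressed on its own, which is the property that makes on-demand loading possible.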

The big challenge was how to modify the filesystem to handle these cases. For v1, complexity is evil, so I have only modified the read path. For an uncompressed file we do nothing different (except add another level of indirection). The fun starts once we know a file is compressed. First we check whether we already have the header information. If not, we retrieve it from disk and initialize it. Once it is set up, we correct the offsets to point to the right chunk, retrieve the data from disk, decompress it, and then copy it into the userspace buffers. Userspace is not aware that anything happened underneath.
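The read path described above can be sketched roughly as follows. This is a userspace illustration in Python, not the kernel code from the patchset: the CompressedFile class, its fields, and the in-memory "disk" are hypothetical, and the header is assumed to be a simple table of chunk offsets plus the chunk size and the uncompressed file size.

```python
import zlib


class CompressedFile:
    """Hypothetical stand-in for the per-file state the filesystem keeps."""

    def __init__(self, blob, offsets, chunk_size, size):
        self._blob = blob                            # compressed bytes "on disk"
        self._disk_header = (offsets, chunk_size, size)
        self._header = None                          # initialized on first read
        self.header_loads = 0                        # counts header fetches

    def _load_header(self):
        # Fetch the header only if we do not already have it.
        if self._header is None:
            self._header = self._disk_header
            self.header_loads += 1
        return self._header

    def read(self, offset, length):
        """Return uncompressed bytes [offset, offset + length), clipped to EOF."""
        offsets, chunk_size, size = self._load_header()
        if offset < 0 or offset >= size or length <= 0:
            return b""
        end = min(offset + length, size)
        first = offset // chunk_size                 # chunk holding the first byte
        last = (end - 1) // chunk_size               # chunk holding the last byte
        out = bytearray()
        for i in range(first, last + 1):
            out += zlib.decompress(self._blob[offsets[i]:offsets[i + 1]])
        skip = offset - first * chunk_size           # bytes to drop from chunk `first`
        return bytes(out[skip:skip + (end - offset)])
```

The caller asks for plain file offsets and gets plain bytes back; only the chunks that overlap the requested range are ever decompressed, and the header is fetched once and reused.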

The implementation is still tied to the szip format in this version, but making it compression-scheme agnostic is just a question of refactoring the code and designing the right interfaces.

The bigger issue imo is that we are too high in the stack. Ideally one would want to decompress soon after getting the data off the disk. One possible way is to map compressed and uncompressed data to different pages and expose only the uncompressed data. I will tackle these issues in the next version.