Comment
What was the size of the largest file in each of the input data sets? Since you don't parallelize compression of individual files, the distribution of file sizes (as well as luck regarding which executor gets which file, since there may be fewer executors than files) has a large effect on the compression duration. In the second data set especially, I suspect a single large file dominates the compression time.
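To make that concrete, here is a minimal, hypothetical sketch (not the post's actual code) of per-file compression with a process pool: because each file is handled by exactly one worker, the total wall-clock time can never drop below the time needed for the single largest file.

```python
# Hypothetical sketch: one file per worker, so the largest file sets the floor
# on total wall-clock time. Names and paths are placeholders.
import zlib
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path


def compress_one(path: Path) -> int:
    """Compress a single file entirely inside one worker process."""
    data = path.read_bytes()
    compressed = zlib.compress(data, level=9)
    path.with_name(path.name + ".zz").write_bytes(compressed)
    return len(compressed)


def compress_all(paths: list[Path], workers: int = 4) -> list[int]:
    # With fewer workers than files, some workers get several files, and
    # whichever worker draws the largest file determines when the whole
    # job finishes.
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(compress_one, paths))
```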
It would be interesting to see a speed comparison with tar + pigz (https://zlib.net/pigz/) and maybe tar + pixz (https://github.com/vasi/pixz)
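For reference, a rough sketch of how such a tar + pigz timing could be taken from Python, assuming pigz is installed and on PATH; the directory, output path, and thread count are placeholders.

```python
# Sketch: time "tar -cf - DIR | pigz -p N > OUT" from Python.
# Assumes tar and pigz are available on PATH.
import subprocess
import time


def time_tar_pigz(directory: str, out_path: str, threads: int = 8) -> float:
    start = time.perf_counter()
    with open(out_path, "wb") as out:
        tar = subprocess.Popen(["tar", "-cf", "-", directory],
                               stdout=subprocess.PIPE)
        pigz = subprocess.Popen(["pigz", "-p", str(threads)],
                                stdin=tar.stdout, stdout=out)
        tar.stdout.close()  # let pigz see EOF once tar finishes
        pigz.wait()
        tar.wait()
    return time.perf_counter() - start
```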
Replies
The largest file, in the largest .zip file, was 219MB.
That .zip file contained a few HUGE ones (see above); the mean file size was 3.2MB but the median was 195KB.
Will take a look at pigz but I'm probably not going to consider it seriously since it's not on PyPI.
Oh, pigz isn't a Python module. It compiles to a binary executable. So, without actually attempting to write any new benchmarks, I can fairly confidently predict that it won't be a success. The overheads of creating and managing subprocesses will just take too much precious time.
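A quick, hypothetical way to test that overhead claim would be to time pigz invoked once per file against plain in-process zlib compression. No results are implied here; the function names are made up for illustration, and pigz is assumed to be on PATH.

```python
# Sketch: compare per-file subprocess invocations of pigz with in-process
# zlib compression, to see how much the fork/exec overhead actually costs.
import subprocess
import time
import zlib
from pathlib import Path


def time_pigz_per_file(paths: list[Path]) -> float:
    start = time.perf_counter()
    for p in paths:
        # One subprocess per file: the spawn cost is paid every time.
        subprocess.run(["pigz", "--keep", "--force", str(p)], check=True)
    return time.perf_counter() - start


def time_zlib_per_file(paths: list[Path]) -> float:
    start = time.perf_counter()
    for p in paths:
        zlib.compress(p.read_bytes())
    return time.perf_counter() - start
```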