Comment

Peter Bengtsson

The largest file, in the largest .zip file, was 219MB.
That .zip file contained a few HUGE files (see above); the mean file size was 3.2MB but the median was 195KB.

Will take a look at pigz but I'm probably not going to consider it seriously since it's not on PyPI.

Parent comment

Tuure Laurinolli

What was the size of the largest file in each of the input data sets? Since you don't parallelize compression of individual files, the distribution of file sizes (as well as luck re: which executor gets which file, since there may be fewer executors than files) has a large effect on the compression duration. Especially in the second data set, I suspect a single large file dominates the compression time. It would be interesting to see a speed comparison with tar + pigz (https://zlib.net/pigz/) and maybe tar + pixz (https://github.com/vasi/pixz).
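
To illustrate the point about one large file dominating, here is a rough sketch, not the benchmark code from the post. It assumes compression time is proportional to file size and that each file goes to whichever executor frees up first. The function name `simulated_makespan` and the file sizes are hypothetical; only the 219MB figure comes from the reply in this thread.

```python
import heapq


def simulated_makespan(file_sizes, workers):
    """Return the finish time of the busiest worker.

    Assumption: compressing a file costs time proportional to its size,
    and each file is handed to the worker that becomes free first.
    """
    loads = [0] * workers
    heapq.heapify(loads)
    for size in file_sizes:
        earliest = heapq.heappop(loads)      # worker that is free soonest
        heapq.heappush(loads, earliest + size)
    return max(loads)


# Hypothetical data set: one 219MB file plus 100 files of 1MB each.
sizes = [219] + [1] * 100

for workers in (1, 2, 4, 8):
    print(workers, "workers ->", simulated_makespan(sizes, workers))
```

Under these assumptions, adding executors beyond two makes no difference: the worker stuck with the 219MB file sets the total time, which is why per-file parallelism (as in pigz) could matter for skewed data sets.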