Comment
Oh, pigz isn't a Python module. It compiles to a binary executable. So, without actually attempting to write any new benchmarks, I can fairly confidently predict that it won't be a success: the overhead of creating and managing subprocesses would just eat too much precious time.
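For illustration, a minimal sketch of what shelling out to the pigz binary from Python might look like (assuming pigz is on PATH; the file names and thread count are hypothetical). Every such call pays the subprocess start-up and pipe cost mentioned above.

    import subprocess

    def pigz_compress(src_path: str, dst_path: str, threads: int = 4) -> None:
        # pigz reads the file from stdin and writes gzip output to stdout;
        # -p sets the number of compression threads, -c keeps output on stdout.
        with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
            subprocess.run(
                ["pigz", "-p", str(threads), "-c"],
                stdin=src,
                stdout=dst,
                check=True,
            )

    pigz_compress("input.bin", "input.bin.gz")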
Parent comment
What was the size of the largest file in each of the input data sets? Since you don't parallelize compression of individual files, the distribution of file sizes (as well as luck regarding which executor gets which file, since there may be fewer executors than files) has a large effect on the compression duration. Especially in the second data set, I suspect a single large file dominates the compression time. It would be interesting to see a speed comparison with tar + pigz (https://zlib.net/pigz/) and maybe tar + pixz (https://github.com/vasi/pixz).
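One possible way to time that tar + pigz baseline from Python, as a rough sketch only (the directory name is hypothetical; assumes tar and pigz are installed):

    import subprocess
    import time

    def tar_pigz(src_dir: str, dst_path: str, threads: int = 8) -> float:
        # Equivalent of the shell pipeline: tar -cf - src_dir | pigz -p N > dst_path
        start = time.perf_counter()
        with open(dst_path, "wb") as dst:
            tar = subprocess.Popen(["tar", "-cf", "-", src_dir], stdout=subprocess.PIPE)
            pigz = subprocess.Popen(["pigz", "-p", str(threads)], stdin=tar.stdout, stdout=dst)
            tar.stdout.close()  # so pigz sees EOF once tar exits
            pigz.communicate()
            tar.wait()
        return time.perf_counter() - start

    print(f"tar + pigz: {tar_pigz('data_set_2', 'data_set_2.tar.gz'):.1f} s")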