ParallelGZIPInputStream #1

simlu · 2016-04-30T04:38:42Z

I'm currently working on parallelizing this in my own fork. Do you have any advice or resources to share? I haven't gotten very far, unfortunately.

How do you detect how many bytes to "chop off" and pass to a thread for decompression?
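One way to get safe chop points is to have the writer create them. This is a minimal sketch, assuming you control the compressor; it is not what this library does. `Deflater.FULL_FLUSH` byte-aligns the output and resets the dictionary, so a fresh raw `Inflater` can start at the recorded offset. The class and method names are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical class for illustration; not code from this repository.
final class FlushPointSketch {

    /** Raw-deflates two chunks with a FULL_FLUSH between them; cut[0] receives the flush offset. */
    static byte[] deflateTwo(byte[] first, byte[] second, int[] cut) {
        Deflater d = new Deflater(Deflater.DEFAULT_COMPRESSION, true); // nowrap = raw deflate
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        int n;
        d.setInput(first);
        do {
            // FULL_FLUSH ends on a byte boundary and forgets the history window.
            n = d.deflate(buf, 0, buf.length, Deflater.FULL_FLUSH);
            out.write(buf, 0, n);
        } while (n == buf.length);
        cut[0] = out.size(); // a decoder may start here with no prior state
        d.setInput(second);
        d.finish();
        while (!d.finished()) {
            n = d.deflate(buf);
            out.write(buf, 0, n);
        }
        d.end();
        return out.toByteArray();
    }

    /** Inflates the raw deflate data from offset to the end of the stream. */
    static byte[] inflateFrom(byte[] stream, int offset) {
        // One zero byte of padding, as the Inflater docs suggest for nowrap mode.
        byte[] segment = Arrays.copyOfRange(stream, offset, stream.length + 1);
        Inflater inf = new Inflater(true);
        inf.setInput(segment);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        try {
            while (!inf.finished()) {
                int n = inf.inflate(buf);
                if (n == 0 && inf.needsInput()) {
                    break; // truncated input; stop rather than spin
                }
                out.write(buf, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalStateException(e);
        } finally {
            inf.end();
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] first = "first chunk, first chunk, first chunk".getBytes(StandardCharsets.UTF_8);
        byte[] second = "second chunk, second chunk".getBytes(StandardCharsets.UTF_8);
        int[] cut = new int[1];
        byte[] stream = deflateTwo(first, second, cut);
        System.out.println(new String(inflateFrom(stream, cut[0]), StandardCharsets.UTF_8));
    }
}
```

The trade-off is that resetting the dictionary at every flush costs some compression ratio, and the offsets still have to be communicated to the reader somehow.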

simlu · 2016-05-02T04:12:31Z

Making progress. I'm thinking we can store the segment sizes needed to split up the inflation in the FEXTRA header field. The problem is that we can only write them once the file has been written. I wish there were a better way to append information.

My testing works nicely so far, just have to find a better way to communicate the block sizes. Any suggestions?

Edit: This article uses a similar method: http://www.ebaytechblog.com/2015/10/09/gzinga-seekable-and-splittable-gzip/
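A minimal sketch of the FEXTRA idea, assuming each segment is buffered and written as its own gzip member (RFC 1952 allows a file to be a series of members). Since the segment is compressed in memory first, its size is known before its header goes out. The subfield ID `SZ` and all names here are made up for illustration; this is not the GZinga format.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.zip.CRC32;
import java.util.zip.Deflater;
import java.util.zip.GZIPInputStream;

// Hypothetical class for illustration; not code from this repository.
final class SizedGzipMember {
    // Made-up subfield ID for this sketch; not a registered one.
    static final byte SI1 = 'S', SI2 = 'Z';

    /** Builds one complete gzip member whose FEXTRA subfield holds the member's total length. */
    static byte[] writeMember(byte[] block) {
        Deflater d = new Deflater(Deflater.DEFAULT_COMPRESSION, true); // raw deflate body
        d.setInput(block);
        d.finish();
        ByteArrayOutputStream body = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        while (!d.finished()) {
            int n = d.deflate(buf);
            body.write(buf, 0, n);
        }
        d.end();
        CRC32 crc = new CRC32();
        crc.update(block);

        int headerLen = 10 + 2 + 4 + 4; // fixed header, XLEN, subfield header, 4-byte payload
        int total = headerLen + body.size() + 8; // plus CRC32 and ISIZE trailer
        ByteBuffer out = ByteBuffer.allocate(total).order(ByteOrder.LITTLE_ENDIAN);
        out.put((byte) 0x1f).put((byte) 0x8b).put((byte) 8).put((byte) 0x04); // magic, CM, FLG=FEXTRA
        out.putInt(0).put((byte) 0).put((byte) 0xff); // MTIME, XFL, OS=unknown
        out.putShort((short) 8); // XLEN: one 8-byte subfield
        out.put(SI1).put(SI2).putShort((short) 4).putInt(total);
        out.put(body.toByteArray());
        out.putInt((int) crc.getValue()).putInt(block.length);
        return out.array();
    }

    /** Reads the total member length stored by writeMember for the member starting at offset. */
    static int memberLength(byte[] file, int offset) {
        ByteBuffer b = ByteBuffer.wrap(file).order(ByteOrder.LITTLE_ENDIAN);
        if (b.get(offset + 12) != SI1 || b.get(offset + 13) != SI2) {
            throw new IllegalArgumentException("no size subfield at offset " + offset);
        }
        return b.getInt(offset + 16);
    }

    /** Plain sequential decode, to check that ordinary gzip readers still accept the file. */
    static byte[] gunzip(byte[] gz) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(gz))) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }
}
```

A reader can then hop from member to member with `offset += memberLength(file, offset)` and hand each slice to a worker, while ordinary gzip tools skip the extra field.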

shevek · 2016-05-09T01:34:41Z

I don't actually know a good answer to your question. TBH, I wrote this as a tutorial exercise for some other programmers on how to write and profile good code with the JUC APIs. However, I have been using it in anger in various projects, and it's been a great boon. Those projects tend to involve very large data sets, so one option is just to hand off multi-megabyte blocks to something like an FJP and pay the cost of passing the leading/trailing bits between threads. But can a thread "resync" if it's handed the second megabyte at random?

Again, I confess, I didn't read the spec, I didn't do dictionary pre-seeding or anything. This was really about 30 minutes' work as a tutorial exercise which turned out to be publishable. I'll help as much as I can.
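For the hand-off part, here is a minimal sketch under a strong assumption: the input is a concatenation of independent gzip members whose start offsets are already known. That sidesteps the resync question rather than answering it, since an arbitrary cut into a deflate stream can land mid-block and mid-byte. It uses a plain fixed thread pool instead of an FJP, and all names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical class for illustration; not code from this repository.
final class ParallelMemberInflate {

    /** Compresses raw into a single standalone gzip member. */
    static byte[] gzip(byte[] raw) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    /** Decompresses the gzip member stored in file[start, start + length). */
    static byte[] gunzip(byte[] file, int start, int length) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(file, start, length))) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    /** Inflates every member on a worker thread, then joins the results in file order. */
    static byte[] inflateAll(byte[] file, int[] offsets, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<byte[]>> parts = new ArrayList<>();
            for (int i = 0; i < offsets.length; i++) {
                final int start = offsets[i];
                final int end = (i + 1 < offsets.length) ? offsets[i + 1] : file.length;
                parts.add(pool.submit(() -> gunzip(file, start, end - start)));
            }
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            for (Future<byte[]> part : parts) {
                byte[] bytes = part.get(); // blocks until that member is done
                out.write(bytes, 0, bytes.length);
            }
            return out.toByteArray();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

Collecting the futures in submission order keeps the output ordered without any coordination between workers; a streaming version would need a bounded queue so memory stays flat on very large inputs.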

simlu · 2016-05-09T04:15:56Z

Makes sense. I really like this paper on the topic:
http://prof.icc.skku.ac.kr/~jaewlee/pubs/lctes13_vld.pdf

shevek · 2016-05-09T20:17:13Z

Right now, gzip decompression is costing me 45 seconds per unit test in one of my products, and my system has 4 cores. The data might or might not have been compressed using this ParallelGZIPOutputStream, but either way, would I love to get that down to 12 seconds? Yes!

simlu · 2016-05-09T20:59:04Z

Same here; it would save many hours of run time on my project. I'll try to get the parallel part going next weekend. Boundary guessing is then a separate task.

shevek · 2016-05-10T16:34:54Z

Also, for linear scaling, the largest boxes I have are a 4-core (8 with HT) E5620 and a similar Core2. It doesn't seem to get much benefit from the HT cores. I'd be very interested in any results from larger boxes. I did get some suggestions from concurrency-interest at some point. That said, my primary use case is saving human time on 4-core laptops, not saving real time on 64-core servers.

axelfontaine · 2016-07-11T15:11:05Z

👍

marcadella · 2020-01-29T07:57:03Z

Hi, any progress on this task?

shevek · 2020-01-29T23:34:33Z

I have solid 64-core hardware now. That's all.
