ParallelGZIPInputStream #1
Making progress. I'm thinking we can store the segment sizes for splitting up the inflation in the FEXTRA header field. The problem is that we can only write them once the file has been written... I wish there were a better way to append information. My testing works nicely so far; I just have to find a better way to communicate the block sizes. Any suggestions?

Edit: This article uses a similar method: http://www.ebaytechblog.com/2015/10/09/gzinga-seekable-and-splittable-gzip/
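The FEXTRA mechanism mentioned here is defined in RFC 1952: set bit 2 of the FLG byte, then write a two-byte little-endian XLEN followed by subfields, each with a two-byte ID (SI1, SI2), a two-byte little-endian length, and data. A minimal sketch of building such a header, assuming a made-up subfield ID "PG" and 4-byte little-endian block sizes (both are illustrative choices, not anything this project has standardized):

```java
import java.io.ByteArrayOutputStream;

public class FextraHeader {
    // Hypothetical subfield ID "PG" for parallel-gzip block sizes.
    static final int SI1 = 'P', SI2 = 'G';

    /** Builds a gzip member header whose FEXTRA field lists block sizes,
     *  each encoded as a 4-byte little-endian integer. */
    static byte[] headerWithBlockSizes(int[] blockSizes) {
        ByteArrayOutputStream sub = new ByteArrayOutputStream();
        for (int s : blockSizes) {                // block sizes, little-endian
            sub.write(s & 0xff);
            sub.write((s >>> 8) & 0xff);
            sub.write((s >>> 16) & 0xff);
            sub.write((s >>> 24) & 0xff);
        }
        byte[] data = sub.toByteArray();
        int xlen = 4 + data.length;               // SI1 + SI2 + LEN(2) + data

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(0x1f); out.write(0x8b);         // gzip magic
        out.write(8);                             // CM = deflate
        out.write(4);                             // FLG: FEXTRA bit set
        out.write(0); out.write(0);               // MTIME = 0 (4 bytes)
        out.write(0); out.write(0);
        out.write(0);                             // XFL
        out.write(255);                           // OS = unknown
        out.write(xlen & 0xff); out.write((xlen >>> 8) & 0xff); // XLEN, LE
        out.write(SI1); out.write(SI2);
        out.write(data.length & 0xff); out.write((data.length >>> 8) & 0xff);
        out.write(data, 0, data.length);
        return out.toByteArray();
    }
}
```

A compliant gzip reader is required to skip unknown FEXTRA subfields, so a stock decompressor would still read such a file; only a boundary-aware splitter would interpret the "PG" data.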
I don't actually know a good answer to your question. TBH, I wrote this as a tutorial exercise for some other programmers in how to write and profile good code with the java.util.concurrent APIs. However, I have been using it in anger in various projects, and it's been a great boon. Those projects tend to be typified by very large data sets, so one option is just to hand off multi-megabyte blocks to something like a ForkJoinPool and pay the cost of passing the leading/trailing bits between threads... although can a thread "resync" if it's given the second megabyte at random? Again, I confess, I didn't read the spec, and I didn't do dictionary pre-seeding or anything. This was really about 30 minutes' work as a tutorial exercise that turned out to be publishable. I'll help as much as I can.
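One way to make the hand-off concrete, assuming the input is a concatenation of independent gzip members (the layout the Gzinga article above produces), is to split at member boundaries and inflate each member on its own thread; since members share no dictionary state, no thread ever has to resync mid-stream. A sketch using a fixed thread pool rather than an FJP:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.zip.GZIPInputStream;

public class ParallelMemberInflater {
    /** Inflates each gzip member on its own thread. Members must be
     *  independent (no shared dictionary), which holds for concatenated
     *  complete gzip streams. */
    static List<byte[]> inflateAll(List<byte[]> members) {
        ExecutorService pool = Executors.newFixedThreadPool(
                Math.max(1, Runtime.getRuntime().availableProcessors()));
        try {
            List<Future<byte[]>> futures = new ArrayList<>();
            for (byte[] m : members)
                futures.add(pool.submit(() -> inflateOne(m)));
            List<byte[]> out = new ArrayList<>();
            for (Future<byte[]> f : futures) {
                try {
                    out.add(f.get());             // preserves input order
                } catch (InterruptedException | ExecutionException e) {
                    throw new RuntimeException(e);
                }
            }
            return out;
        } finally {
            pool.shutdown();
        }
    }

    /** Inflates a single complete gzip member into a byte array. */
    static byte[] inflateOne(byte[] member) {
        try (GZIPInputStream in =
                     new GZIPInputStream(new ByteArrayInputStream(member))) {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            byte[] buf = new byte[8192];
            for (int n; (n = in.read(buf)) > 0; )
                bos.write(buf, 0, n);
            return bos.toByteArray();
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }
}
```

Collecting futures in submission order keeps the output ordered without any cross-thread coordination beyond the pool itself; this is a sketch, not how ParallelGZIPInputStream is actually implemented.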
Makes sense. I really like this paper on the topic:
Right now, gzip decompression is costing me 45 seconds per unit test in one of my products, and my system has 4 cores. The data might or might not have been compressed using ParallelGZIPOutputStream, but either way, would I love to get that down to 12 seconds? Yes!
Same; I will save many hours of run time on my project... I'll try to get the parallel part going next weekend. Boundary guessing is then a separate task.
Also, for linear scaling, the largest boxes I have are a 4-core (8 with hyper-threading) E5620 and a similar Core 2. It seems not to get much benefit from the HT cores. I'd be very interested in any results from larger boxes. I did get some suggestions from concurrency-interest at some point. That said, my primary use case is saving human time on 4-core laptops, not saving real time on 64-core servers.
👍 |
Hi, any progress on this task?
I have solid 64-core hardware now. That's all. |
I'm currently working on parallelizing this in my own fork. Do you have any advice or resources to share? Unfortunately, I didn't get very far.
How do you detect how many bytes to "chop off" and pass to a thread for decompression?
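There's no general answer for a single-member stream, since deflate block boundaries aren't byte-aligned; but if the file is written as concatenated complete members (or carries block sizes in FEXTRA, as discussed earlier in this thread), you can scan for candidate member headers. A heuristic sketch, keeping in mind that the magic bytes `1f 8b 08` can also occur inside compressed data, so every candidate still needs validation (e.g. by actually starting an inflater there):

```java
import java.util.ArrayList;
import java.util.List;

public class MemberBoundaryScanner {
    /** Returns offsets where a gzip member header plausibly starts:
     *  magic bytes 0x1f 0x8b followed by CM = 8 (deflate). Only reliable
     *  when the file is a concatenation of complete members, and even then
     *  candidates can be false positives that must be validated. */
    static List<Integer> findCandidateBoundaries(byte[] data) {
        List<Integer> offsets = new ArrayList<>();
        for (int i = 0; i + 2 < data.length; i++) {
            if ((data[i] & 0xff) == 0x1f
                    && (data[i + 1] & 0xff) == 0x8b
                    && data[i + 2] == 8) {
                offsets.add(i);
            }
        }
        return offsets;
    }
}
```

Each chunk handed to a worker would then run from one validated offset to the next; the FEXTRA block-size approach above avoids the scan entirely at the cost of a cooperating writer.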