ParallelGZIPInputStream #1

Open
simlu opened this issue Apr 30, 2016 · 9 comments

Comments

@simlu
Contributor

simlu commented Apr 30, 2016

I'm currently working on parallelizing this in my own fork. Do you have any advice or resources to share? Didn't get very far unfortunately.

How do you detect how many bytes to "chop off" and pass to a thread for decompression?

@simlu
Contributor Author

simlu commented May 2, 2016

Making progress. I'm thinking we could store the segment sizes needed to split up the inflation in the gzip FEXTRA header field. The problem is that the sizes are only known once the file has been written... I wish there were a better way to append information.

My testing works nicely so far, just have to find a better way to communicate the block sizes. Any suggestions?

Edit: This article uses a similar method: http://www.ebaytechblog.com/2015/10/09/gzinga-seekable-and-splittable-gzip/
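
One possible way around the "sizes are only known after writing" problem, sketched here under stated assumptions: compress each block in memory first, then write it as its own gzip member whose FEXTRA subfield records the deflated length. A reader can then hop from header to header without inflating anything. The subfield ID `SZ`, the class name, and the fixed 20-byte header layout are hypothetical choices for this illustration, not something this library or the linked article is confirmed to use.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;
import java.util.zip.Deflater;
import java.util.zip.GZIPInputStream;

public class FextraMember {
    // Hypothetical subfield ID for this sketch; not a registered identifier.
    static final byte SI1 = 'S', SI2 = 'Z';
    static final int HEADER_LEN = 20; // 10 fixed + 2 XLEN + 4 subfield header + 4 data
    static final int TRAILER_LEN = 8; // CRC32 + ISIZE

    /** Builds one complete gzip member whose FEXTRA field holds the deflated length. */
    static byte[] writeMember(byte[] block) {
        Deflater deflater = new Deflater(Deflater.DEFAULT_COMPRESSION, true); // raw deflate
        deflater.setInput(block);
        deflater.finish();
        ByteArrayOutputStream payload = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        while (!deflater.finished()) {
            payload.write(buf, 0, deflater.deflate(buf));
        }
        deflater.end();
        byte[] deflated = payload.toByteArray();

        CRC32 crc = new CRC32();
        crc.update(block);

        ByteBuffer out = ByteBuffer.allocate(HEADER_LEN + deflated.length + TRAILER_LEN)
                .order(ByteOrder.LITTLE_ENDIAN);
        out.put((byte) 0x1f).put((byte) 0x8b); // gzip magic
        out.put((byte) 8);                     // CM = deflate
        out.put((byte) 0x04);                  // FLG with only FEXTRA set
        out.putInt(0);                         // MTIME = unknown
        out.put((byte) 0);                     // XFL
        out.put((byte) 255);                   // OS = unknown
        out.putShort((short) 8);               // XLEN: one 8-byte subfield
        out.put(SI1).put(SI2);
        out.putShort((short) 4);               // subfield data length
        out.putInt(deflated.length);           // size is known because we compressed first
        out.put(deflated);
        out.putInt((int) crc.getValue());      // CRC32 of the uncompressed block
        out.putInt(block.length);              // ISIZE
        return out.array();
    }

    /** Total member length read from the header alone (assumes writeMember's layout). */
    static int memberLength(byte[] file, int offset) {
        ByteBuffer in = ByteBuffer.wrap(file).order(ByteOrder.LITTLE_ENDIAN);
        if (in.get(offset + 12) != SI1 || in.get(offset + 13) != SI2) {
            throw new IllegalArgumentException("no size subfield at offset " + offset);
        }
        return HEADER_LEN + in.getInt(offset + 16) + TRAILER_LEN;
    }

    /** Plain serial gunzip, used to check the output is still ordinary gzip. */
    static byte[] gunzip(byte[] data) {
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(data))) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[8192];
            for (int n; (n = in.read(buf)) != -1; ) {
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] a = writeMember("hello ".getBytes(StandardCharsets.UTF_8));
        byte[] b = writeMember("world".getBytes(StandardCharsets.UTF_8));
        byte[] file = new byte[a.length + b.length];
        System.arraycopy(a, 0, file, 0, a.length);
        System.arraycopy(b, 0, file, a.length, b.length);
        System.out.println("second member starts at " + memberLength(file, 0));
        System.out.println(new String(gunzip(file), StandardCharsets.UTF_8));
    }
}
```

The trade-off in this sketch is that every block pays for a full header and trailer and starts with an empty dictionary, so small blocks would cost compression ratio. The result should still be readable by any reader that skips FEXTRA and accepts concatenated members, as `java.util.zip.GZIPInputStream` does.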

@shevek
Owner

shevek commented May 9, 2016

I don't actually know a good answer to your question. TBH, I wrote this as a tutorial exercise for some other programmers in how to write and profile good code in the JUC APIs. However, I have been using it in anger in various projects, and it's been a great boon. Those projects tend to be typified by very large data sets, so one option is just to hand off multimegabyte blocks to something like an FJP and pay the cost of passing the leading/trailing bits between threads... although can a thread "resync" if it's given the second megabyte at random?

Again, I confess, I didn't read the spec, I didn't do dictionary pre-seeding or anything. This was really about 30 minutes' work as a tutorial exercise which turned out to be publishable. I'll help as much as I can.
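
On the "resync" question: a DEFLATE stream is not byte-aligned and can reference up to 32 KiB of earlier output, so starting at an arbitrary byte generally needs extra information. The sketch below sidesteps that instead of solving it. It assumes the file is a concatenation of independent gzip members and that their start offsets are already known (for example from an index or a header field). Neither assumption is confirmed for files written by this library, and the class and method names are hypothetical. It uses a fixed thread pool rather than an FJP to keep the example short.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class ParallelMemberInflate {
    /** Compresses one block as a standalone gzip member. */
    static byte[] gzip(byte[] block) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(block);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    /** Inflates the member(s) stored in file[start, start + length). */
    static byte[] gunzip(byte[] file, int start, int length) {
        try (GZIPInputStream in =
                 new GZIPInputStream(new ByteArrayInputStream(file, start, length))) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[8192];
            for (int n; (n = in.read(buf)) != -1; ) {
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Inflates each member on its own thread and joins the results in file order. */
    static byte[] inflateAll(byte[] file, int[] offsets, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<byte[]>> parts = new ArrayList<>();
            for (int i = 0; i < offsets.length; i++) {
                final int start = offsets[i];
                final int end = (i + 1 < offsets.length) ? offsets[i + 1] : file.length;
                Callable<byte[]> task = () -> gunzip(file, start, end - start);
                parts.add(pool.submit(task));
            }
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            for (Future<byte[]> part : parts) { // futures are consumed in submission order
                byte[] bytes = part.get();
                out.write(bytes, 0, bytes.length);
            }
            return out.toByteArray();
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        String[] blocks = {"first block, ", "second block, ", "third block"};
        ByteArrayOutputStream file = new ByteArrayOutputStream();
        int[] offsets = new int[blocks.length];
        for (int i = 0; i < blocks.length; i++) {
            offsets[i] = file.size(); // record where each member starts
            byte[] member = gzip(blocks[i].getBytes(StandardCharsets.UTF_8));
            file.write(member, 0, member.length);
        }
        byte[] plain = inflateAll(file.toByteArray(), offsets, 4);
        System.out.println(new String(plain, StandardCharsets.UTF_8));
    }
}
```

With multimegabyte members the per-task overhead should be small relative to the inflate work, but this sketch buffers every part in memory and would need a bounded queue or streaming hand-off for very large inputs.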

@simlu
Contributor Author

simlu commented May 9, 2016

Makes sense. I really like this paper on the topic:
http://prof.icc.skku.ac.kr/~jaewlee/pubs/lctes13_vld.pdf

@shevek
Owner

shevek commented May 9, 2016

Right now, gzip decompression is costing me 45 seconds per unit test in one of my products, and my system has 4 cores. The data might or might not have been compressed using this ParallelGZIPOutputStream, but either way, would I love to get that down to 12 seconds? Yes!

@simlu
Contributor Author

simlu commented May 9, 2016

Same, I will save many hours run time on my project... I'll try to get the parallel part going next weekend. Boundary guessing is then a separate task.

@shevek
Owner

shevek commented May 10, 2016

Also, for linear scaling, the largest boxes I have are a 4(8 w/HT) core E5620 and a similar Core2. It seems not to get much benefit from the HT cores. I'd be very interested in any results from larger boxes. I did get some suggestions from concurrency-interest at some point. That said, my primary use case is saving human time on 4-core laptops, not saving real-time on 64-core servers.

@axelfontaine

👍

@marcadella

Hi, Any progress on this task?

@shevek
Owner

shevek commented Jan 29, 2020

I have solid 64-core hardware now. That's all.
