Proof of Concept of new gunzip/gunzipOne impl
Needs more testing, review, and work:

1.  Update the test suite
2.  Research the issue of concatenating zlib streams
3.  Review various semantic considerations, especially in
    relation to the unix command "gunzip"

But, it works for me
lpsmith committed Jun 12, 2016
1 parent a0e88e1 commit f39178b
Showing 2 changed files with 99 additions and 39 deletions.
1 change: 1 addition & 0 deletions io-streams.cabal
@@ -128,6 +128,7 @@ Library
                      time >= 1.2 && <1.7,
                      transformers >= 0.2 && <0.6,
                      vector >= 0.7 && <0.12,
+                     zlib >= 0.6 && <0.7,
                      zlib-bindings >= 0.1 && <0.2

   if impl(ghc >= 7.2)
137 changes: 98 additions & 39 deletions src/System/IO/Streams/Zlib.hs
@@ -5,6 +5,7 @@
 module System.IO.Streams.Zlib
  ( -- * ByteString decompression
    gunzip
+ , gunzipOne
  , decompress
    -- * ByteString compression
  , gzip
@@ -18,18 +19,22 @@ module System.IO.Streams.Zlib
  ) where

 ------------------------------------------------------------------------------
+import Control.Exception (throwIO)
+import Control.Monad (join)
 import Data.ByteString (ByteString)
 import qualified Data.ByteString as S
 import Data.IORef (newIORef, readIORef, writeIORef)
+import Data.Word (Word8)
 import Prelude hiding (read)
 ------------------------------------------------------------------------------
-import Codec.Zlib (Deflate, Inflate, Popper, WindowBits (..), feedDeflate, feedInflate, finishDeflate, finishInflate, flushDeflate, flushInflate, initDeflate, initInflate)
+import Codec.Compression.Zlib.Internal (Format, DecompressParams, DecompressStream(..), decompressIO, zlibFormat, gzipFormat, defaultDecompressParams)
+import Codec.Zlib (Deflate, WindowBits (..), feedDeflate, finishDeflate, flushDeflate, initDeflate)
 import Data.ByteString.Builder (Builder, byteString)
 import Data.ByteString.Builder.Extra (defaultChunkSize, flush)
 import Data.ByteString.Builder.Internal (newBuffer)
 ------------------------------------------------------------------------------
 import System.IO.Streams.Builder (unsafeBuilderStream)
-import System.IO.Streams.Internal (InputStream, OutputStream, makeInputStream, makeOutputStream, read, write)
+import System.IO.Streams.Internal (InputStream, OutputStream, makeInputStream, makeOutputStream, read, write, unRead)


 ------------------------------------------------------------------------------
@@ -45,54 +50,108 @@ compressBits = WindowBits 15
 ------------------------------------------------------------------------------
 -- | Decompress an 'InputStream' of strict 'ByteString's from the @gzip@ format
 gunzip :: InputStream ByteString -> IO (InputStream ByteString)
-gunzip input = initInflate gzipBits >>= inflate input
+gunzip = inflateMulti 0x1F 0x8B gzipFormat defaultDecompressParams


 ------------------------------------------------------------------------------
+-- | Decompress a single gzip stream from an 'InputStream'.
+gunzipOne :: InputStream ByteString -> IO (InputStream ByteString)
+gunzipOne = inflateOne gzipFormat defaultDecompressParams
+
+
+------------------------------------------------------------------------------
 -- | Decompress an 'InputStream' of strict 'ByteString's from the @zlib@ format
 decompress :: InputStream ByteString -> IO (InputStream ByteString)
-decompress input = initInflate compressBits >>= inflate input
+decompress = inflateOne zlibFormat defaultDecompressParams


 ------------------------------------------------------------------------------
--- Note: bytes pushed back to this input stream are not propagated back to the
--- source InputStream.
-data IS = Input
-        | Popper Popper
-        | Done
-
-inflate :: InputStream ByteString -> Inflate -> IO (InputStream ByteString)
-inflate input state = do
-    ref <- newIORef Input
+-- | Decompress a single compressed stream
+inflateOne :: Format -> DecompressParams
+           -> InputStream ByteString -> IO (InputStream ByteString)
+inflateOne fmt params input = do
+    ref <- newIORef (return $ decompressIO fmt params)
     makeInputStream $ stream ref
-
   where
-    stream ref = go
+    stream ref = join (readIORef ref) >>= go
+      where
+        go st =
+            case st of
+              DecompressInputRequired feed -> do
+                  compressed <- readNonEmpty input
+                  feed (maybe S.empty id compressed) >>= go
+              DecompressOutputAvailable out next -> do
+                  writeIORef ref next
+                  return (Just out)
+              DecompressStreamEnd crumb -> do
+                  unRead crumb input
+                  return Nothing
+              DecompressStreamError err -> do
+                  throwIO err
+
+
+------------------------------------------------------------------------------
+-- | Decompress one or more compressed streams
+inflateMulti :: Word8 -> Word8 -> Format -> DecompressParams
+             -> InputStream ByteString -> IO (InputStream ByteString)
+inflateMulti magic0 magic1 fmt params input = do
+    ref <- newIORef undefined
+    writeIORef ref (initStream ref)
+    makeInputStream $ join (readIORef ref)
+  where
+    initStream ref = init
       where
-        go = readIORef ref >>= \st ->
-            case st of
-              Input -> read input >>= maybe eof chunk
-              Popper p -> pop p
-              Done -> return Nothing
-
-        eof = do
-            x <- finishInflate state
-            writeIORef ref Done
-            if (not $ S.null x)
-              then return $! Just x
-              else return Nothing
-
-        chunk s =
-            if S.null s
-              then do
-                  out <- flushInflate state
-                  return $! Just out
-              else feedInflate state s >>= \popper -> do
-                  writeIORef ref $ Popper popper
-                  pop popper
-
-        pop popper = popper >>= maybe backToInput (return . Just)
-        backToInput = writeIORef ref Input >> read input >>= maybe eof chunk
+        init = go (decompressIO fmt params)
+        go st =
+            case st of
+              DecompressInputRequired feed -> do
+                  compressed <- readNonEmpty input
+                  feed (maybe S.empty id compressed) >>= go
+              DecompressOutputAvailable out next -> do
+                  writeIORef ref (next >>= go)
+                  return (Just out)
+              DecompressStreamEnd crumb -> do
+                  unRead crumb input
+                  continue <- checkMagicBytes magic0 magic1 input
+                  if continue
+                    then init
+                    else return Nothing
+              DecompressStreamError err -> do
+                  throwIO err
+
+readNonEmpty :: InputStream ByteString -> IO (Maybe ByteString)
+readNonEmpty input = do
+    ma <- read input
+    case ma of
+      Just a | S.null a -> readNonEmpty input
+      _ -> return ma
+
+checkMagicBytes :: Word8 -> Word8 -> InputStream ByteString -> IO Bool
+checkMagicBytes magic0 magic1 input = do
+    ma <- readNonEmpty input
+    case ma of
+      Nothing -> return False
+      Just a
+        | S.length a > 1 ->
+            do
+              unRead a input
+              return (S.index a 0 == magic0 && S.index a 1 == magic1)
+        | otherwise {- S.length a == 1 -} ->
+            do
+              if S.index a 0 /= magic0
+                then do
+                  unRead a input
+                  return False
+                else do
+                  mb <- readNonEmpty input
+                  case mb of
+                    Nothing -> do
+                      unRead a input
+                      return False
+                    Just b -> do
+                      unRead b input
+                      unRead a input
+                      return (S.index b 0 == magic1)


 ------------------------------------------------------------------------------

8 comments on commit f39178b

@gregorycollins

Would we still need zlib-bindings if we merged that patch?

@lpsmith

I think zlib-bindings may still be needed for compression, but that can be fixed.

@lpsmith

Another thing I am unclear on is the intended meaning of null bytestrings (if allowed), and whether they should somehow be preserved. There was also some discussion of upgrading the zlib interface I am using to a public one, possibly with revisions.

@gregorycollins

@lpsmith: on compression, receiving a null bytestring means "flush the output". This is a convention of the library that we need to maintain; otherwise you'll break e.g. HTTP framing.
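
To make the convention concrete, here is a toy model in Haskell. It is only a sketch under stated assumptions: `bufferUntilFlush` is a hypothetical name, it works on a plain list of chunks instead of an io-streams `OutputStream`, and it does not compress anything. It only shows how an empty chunk acts as a flush signal that forces buffered data out.

```haskell
import qualified Data.ByteString as S
import Data.ByteString (ByteString)

-- | Toy stand-in for a compressing stage (hypothetical; it does NOT
-- compress). Non-empty chunks are buffered; buffered data is released only
-- when an empty chunk (the flush signal) arrives or the input ends.
bufferUntilFlush :: [ByteString] -> [ByteString]
bufferUntilFlush = go []
  where
    go acc []     = emit acc
    go acc (c:cs)
      | S.null c  = emit acc ++ go [] cs   -- flush: release what is buffered
      | otherwise = go (c : acc) cs        -- keep buffering

    -- Emit the buffered chunks (stored in reverse order) as one chunk,
    -- or nothing at all if the buffer is empty.
    emit []  = []
    emit acc = [S.concat (reverse acc)]
```

In this model, a stage that silently dropped empty chunks would never see the flush signal, so a consumer waiting for a complete frame could wait until the end of the input.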

@gregorycollins

@lpsmith: another data point -- I maintain zlib-bindings now, so if anything needs to be exposed there we can do it

@lpsmith commented on f39178b Aug 17, 2017

@gregorycollins, I really like the new zlib interface that this revised decompression code is using. So I'm not sure that there's really any reason to update zlib-bindings at least for the decompression side.

Another data point is this behavior of gunzip/zcat:

$ echo Hello | gzip > test.gz
$ echo World | gzip >> test.gz
$ echo Some Junk >> test.gz 
$ zcat test.gz 
Hello
World

gzip: test.gz: decompression OK, trailing garbage ignored

So, this revised Streams.gunzip is consistent-ish with the behavior of the command line gunzip, though it might be worth investigating gunzip's behavior in other cases of trailing junk. (e.g. what happens if the trailing junk starts with the magic bytes, but isn't a valid gz stream. Also, are there any other magic bytes that we might want to check for?)
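
The question of what follows a finished gzip member can be sketched as a pure function. This is an illustration only, not the patch's code: `Trailing` and `classifyTrailing` are hypothetical names, and the input is modelled as a plain list of chunks. The magic bytes 0x1F 0x8B are the ones the patch passes to inflateMulti. Unlike checkMagicBytes in the patch, which returns a single Bool, this sketch separates "nothing left" from "something else follows".

```haskell
import qualified Data.ByteString as S
import Data.ByteString (ByteString)
import Data.Word (Word8)

-- | What follows a completed gzip member (hypothetical type).
data Trailing
    = EndOfInput      -- ^ nothing but empty chunks remained
    | AnotherMember   -- ^ the next two bytes are the gzip magic number
    | Garbage         -- ^ other bytes, or a lone first magic byte, follow
    deriving (Eq, Show)

-- | Classify the chunks that remain after one gzip member has ended.
-- Empty chunks contribute no bytes, so the two magic bytes are found even
-- when they arrive in separate chunks.
classifyTrailing :: [ByteString] -> Trailing
classifyTrailing chunks =
    case firstTwo of
      []           -> EndOfInput
      [0x1F, 0x8B] -> AnotherMember
      _            -> Garbage
  where
    firstTwo :: [Word8]
    firstTwo = take 2 (concatMap S.unpack chunks)
```

Note that `AnotherMember` only means the magic bytes matched. Junk that starts with 0x1F 0x8B but is not a valid gzip stream would still be handed to the decompressor and would presumably fail there.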

Since this behavior of zcat/gunzip is relatively obscure, however, I'm wondering if maybe it would be better from a UX/expectations perspective to have a gunzip that is strict about trailing "garbage", and a gunzipMany that exhibits the same behavior implemented in this proof of concept.

Flushing is kind of what I expected. It seems that this code ought to be revised to preserve null bytestrings (or at least preserve one for every consecutive sequence of null bytestrings) instead of effectively filtering them out. I don't know if there are any further flushing-related integration issues with zlib that we need to consider on the decompression side. I seem to recall that @hvr was working on flushing issues with the new zlib interfaces that I'm using here, at least on the compression side.
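
The "preserve one per consecutive sequence" option could look roughly like the following. This is a sketch over a plain list of chunks, not over an `InputStream`, and `collapseEmpties` is a hypothetical name that does not appear in the patch. By contrast, readNonEmpty in the patch skips every empty chunk.

```haskell
import qualified Data.ByteString as S
import Data.ByteString (ByteString)

-- | Keep non-empty chunks unchanged, and keep exactly one empty chunk for
-- each run of consecutive empty chunks.
collapseEmpties :: [ByteString] -> [ByteString]
collapseEmpties (a : b : rest)
    -- Two empty chunks in a row: drop the first and look again.
    | S.null a && S.null b = collapseEmpties (b : rest)
collapseEmpties (a : rest) = a : collapseEmpties rest
collapseEmpties []         = []
```

Whether a surviving empty chunk should then trigger a flush of the decompressor is a separate question that this sketch does not address.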

@lpsmith commented on f39178b Aug 17, 2017

But really, it's worth keeping in mind that the only thing that snapframework#56 needs in order to be resolved is this patch (which may contain regressions, or may not be as correct as it should be).

HVR then brought up the possibility of eliminating the need for zlib-bindings entirely by also changing the compression code, which I personally haven't noticed any problems with (but I also haven't been using it directly myself very much.)

@lpsmith commented on f39178b Aug 18, 2017

Oh, zcat in the example above does return a non-zero return value:

$ zcat test.gz
Hello
World

gzip: test.gz: decompression OK, trailing garbage ignored

$ echo $?
2
