Possible way to improve Kosinski+'s speed #8
Re: Chameleon: It should be easy to adapt the framework for this case. I could add a pair of classes that are selected in the format adaptor ("using stream_kind = xxx") and control the behavior of LZSS(I|O)Stream.

LZSS(I|O)Stream already encapsulates a lot of this behavior: LZSSOstream, for example, has an internal buffer for non-descriptor data that is used until the current descriptor is filled; the descriptor is then written to the stream and the buffer is flushed onto it. This could easily be modified to write the descriptors into a buffer instead, deferring the flushes to the stream until after all data has been written. LZSSIstream would require slightly more work. But this work would be useful in any case: I am planning on adding some non-LZSS formats that would use LZSS(I|O)Stream (which has nothing exclusive to LZSS), and some of them could use a similar idea of splitting the streams. As a bonus, this could even be adapted for use by the S2 Special Stage mapping compression format (which could really use a name... and I don't want to put my name on it, despite being the one that figured it out).

Re: the file size limit due to the 2-byte offset: in the worst-case scenario, all descriptor bits correspond to literals, and the maximum file size is 64 kB. This is already equal to the total RAM size of the Mega Drive, and the file is actually bigger when compressed. For a more typical case, the maximum file size would be a lot bigger, because the descriptors correspond to a lot more data.

I am skeptical that this would give a significant speed boost for Kosinski+, because descriptor reads are infrequent compared to other reads: the best possible gains come from the worst possible scenario in terms of compression (every descriptor bit corresponds to a literal byte). Even then, 1 descriptor byte is read for every 8 literal bytes, so the savings from reading 2 descriptor bytes in one instruction are 16 cycles (for one fewer read). Comper, by contrast, benefits from 16-bit memory access because every memory operation it does is a word operation; its gains scale in direct proportion to the length of dictionary matches, instead of being diluted by it.
But there is one potential additional benefit that you did not consider: compression ratio. The tail end of the descriptor stream could potentially be equal to the start of the non-descriptor data stream. The starting offset could then point to somewhere before the end of the descriptor stream, saving a few bytes; it does not even have to point to a word, because the non-descriptor data is bytes anyway. I will study whether this gives a benefit in terms of compression ratio for Kosinski+. As for extending the framework to handle separate streams for descriptors and data: it is a good idea and I will definitely do it, possibly at the same time I optimize the whole thing for speed by getting rid of most of the C++ stream APIs for the internal compression classes.
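The overlap trick described above can be illustrated with a small helper (a hypothetical Python sketch, not part of the framework) that finds how many tail bytes of the descriptor stream also begin the data stream; those bytes could be stored once, with the data offset pointing back into the descriptor stream:

```python
def overlap(descriptors: bytes, data: bytes) -> int:
    """Return the largest n such that the last n bytes of `descriptors`
    equal the first n bytes of `data` (illustrative helper)."""
    best = 0
    for n in range(1, min(len(descriptors), len(data)) + 1):
        if descriptors[-n:] == data[:n]:
            best = n
    return best

# e.g. the descriptor tail 0xFE 0x00 also begins the data stream,
# so 2 bytes could be shared:
saved = overlap(b"\x12\x34\xfe\x00", b"\xfe\x00\x55\x66")
```

Note the savings do not need any alignment: since the data stream is byte-oriented, the shared region may start at an odd offset.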
One more advantage of splitting the descriptor and data streams which I did not realize before: because there is an explicit end-of-stream code in the format, the compressor can treat the descriptor stream as being composed of bytes, while the 68k decompressor can just pretend it is composed of words/longwords. This means the descriptor stream will not have wasted bytes even if there is no useful overlap between its end and the start of the data stream.
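A minimal Python model of the split-stream output side might look like this (names like `SplitLZSSOStream`, `put_bit`, and `finish` are illustrative, not the real LZSS(I|O)Stream API): descriptors and data are buffered separately, and all flushing is deferred until the end, when both buffers are emitted behind a 2-byte offset to the end of the descriptors:

```python
import struct

class SplitLZSSOStream:
    """Toy model of an LZSS output stream that keeps descriptor bits and
    non-descriptor data in separate buffers, concatenating them only at
    the end (sketch only; not the actual LZSSOstream class)."""
    def __init__(self):
        self.descriptors = bytearray()  # packed descriptor bits
        self.data = bytearray()         # literals and match tokens
        self.cur = 0                    # descriptor byte being filled
        self.nbits = 0                  # bits filled so far in `cur`

    def put_bit(self, bit):
        self.cur = (self.cur << 1) | (bit & 1)
        self.nbits += 1
        if self.nbits == 8:             # descriptor full: append, start anew
            self.descriptors.append(self.cur)
            self.cur = self.nbits = 0

    def put_byte(self, value):
        self.data.append(value)

    def finish(self):
        if self.nbits:                  # flush partial descriptor, zero-padded
            self.descriptors.append(self.cur << (8 - self.nbits))
        # 2-byte big-endian offset to the end of the descriptor stream,
        # then the descriptor stream, then the data stream
        return (struct.pack(">H", len(self.descriptors))
                + bytes(self.descriptors) + bytes(self.data))

out = SplitLZSSOStream()
for bit, byte in zip([1, 1, 0], [0x41, 0x42, 0x43]):
    out.put_bit(bit)
    out.put_byte(byte)
blob = out.finish()
```

This is where the deferred flushing pays off: nothing is interleaved, so the decompressor can walk the descriptor region with word or longword reads regardless of how the compressor packed it.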
Another way to increase speed without a format change is @vladikcomper's RLE mode from ComperX (pull #21): if the displacement is -1, cache the value in a register (say, d0). A tweak on that might be to use

```asm
	movep.w	0(a1),d0	; load bytes at 0(a1) and 2(a1) into the low word of d0
	move.b	(a1)+,d0	; replace the low byte, advancing a1
	move.w	d0,d1		; replicate the low word...
	swap	d0
	move.w	d1,d0		; ...into both halves of d0
```

And the unrolled loop could be something like:

```asm
	movep.l	d0,0(a1)	; fill even bytes 0,2,4,6
	movep.l	d0,1(a1)	; fill odd bytes 1,3,5,7
	movep.l	d0,8(a1)
	movep.l	d0,9(a1)
	movep.l	d0,16(a1)
	movep.l	d0,17(a1)
	movep.l	d0,24(a1)
	movep.l	d0,25(a1)
	lea	32(a1),a1	; 32 bytes written per iteration
```

with a suitable Duff's device before the loop body.
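For reference, `movep.l` writes the four bytes of a data register to every other byte address, high byte first, so each pair of writes at offsets n and n+1 fills 8 consecutive bytes. A quick Python model of one unrolled iteration (assuming d0 already holds the fill byte replicated into all four byte lanes):

```python
def movep_l(mem, base, value):
    """Model of 68k `movep.l dN,off(aN)`: write the register's four bytes
    (most significant first) to base, base+2, base+4, base+6."""
    for i in range(4):
        mem[base + 2 * i] = (value >> (8 * (3 - i))) & 0xFF

mem = bytearray(32)
d0 = 0xAAAAAAAA                   # fill byte 0xAA replicated four times
for disp in (0, 8, 16, 24):       # one unrolled iteration covers 32 bytes
    movep_l(mem, disp, d0)        # even addresses disp, disp+2, disp+4, disp+6
    movep_l(mem, disp + 1, d0)    # odd addresses disp+1, disp+3, disp+5, disp+7
```

After the loop, all 32 bytes hold the fill value, matching what the eight `movep.l` writes plus `lea` would produce.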
Blocked by #25

Also, I think I will make this into a new format and call it KosX. I will also be appropriating some ideas from ComperX, namely the recognition that copying from an offset of -1 can be done faster, and probably changing the way offset -1 is encoded for faster checking.
So I've been working on my own set of graph-based LZSS compressors, and recently added support for Chameleon - the compression used by Kid Chameleon and Sonic 2 Nick Arcade (only for GHZ's chunks).
Chameleon has a quirk that makes it impossible to support in your framework (at least it did a year or two ago, when I last tried): it stores its descriptors separately from the rest of the compressed data.
This gave me an idea: Comper uses 16-bit descriptors and can read each one with a single instruction, thanks to everything else being 16 bits long. This isn't the case with Kosinski, which has to read each descriptor byte separately. However, bunching the descriptors together would guarantee that each one starts on an even address, allowing faster reads.
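The alignment point can be made concrete with a toy layout calculation (hypothetical, purely to show the parity argument): with inline 1-byte descriptors, a descriptor's address parity depends on all the data written before it, while bunched word-sized descriptors sit at fixed even offsets:

```python
def interleaved_descriptor_offsets(chunk_lens):
    """Offsets of each 1-byte descriptor when it is followed inline by the
    variable-length data it controls (Kosinski-style interleaving)."""
    offsets, pos = [], 0
    for n in chunk_lens:
        offsets.append(pos)
        pos += 1 + n          # descriptor byte + its data
    return offsets

# Interleaved: an odd-length data chunk misaligns every later descriptor.
inter = interleaved_descriptor_offsets([3, 4, 2])
# Bunched at the start and read as words: descriptor i is at 2*i, always even.
bunched = [2 * i for i in range(3)]
aligned = all(off % 2 == 0 for off in bunched)
```

With the interleaved layout the third descriptor lands on an odd address, forcing byte reads; the bunched layout never does.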
The downside to this is needing to store an offset to the end of the descriptor data in the compressed file (assuming the file starts with the descriptors, like Chameleon does), which in turn creates a file-size limit.