Is there a compression algorithm without ZERO width char ? #126

Open
finscn opened this issue Jan 18, 2019 · 6 comments

@finscn

finscn commented Jan 18, 2019

In some Android browsers, LZString.decompress does not run correctly when the string includes zero-width characters (e.g. 0x80 0x86 ...).

Is there a way to have LZString compress without producing zero-width characters?
Thanks

@pieroxy
Owner

pieroxy commented Jan 18, 2019

Try LZString.compressToUTF16

@finscn
Author

finscn commented Jan 19, 2019

@pieroxy, I tried it, but it did not help.
I found that the real problem is not the zero-width characters.
It's \u2028 and \u2029.
If the string includes either of these two characters, JSON.parse runs into problems.

Is there any way to make LZString.compress produce output without \u2028 and \u2029?

Test case (in some Android browsers; in China there are many mobile phone brands, and they ship many custom browsers):

var compressed = LZString.compressToUTF16(str);
var json = {
    "data": compressed
};

var jsonStr = JSON.stringify(json);

// Fails to parse on the affected browsers
var obj = JSON.parse(jsonStr);

If the string contains no \u2028 or \u2029, everything is OK.
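
As an aside that nobody in this thread proposed: if the failure really is limited to raw \u2028 and \u2029 inside the JSON text, one possible workaround is to rewrite those two characters as JSON escape sequences after stringifying. The helper name stringifyJsonSafe below is hypothetical, and whether this helps on the affected browsers is an assumption that would need testing there.

```javascript
// Hypothetical helper: serialize to JSON, then turn raw U+2028 / U+2029
// into the six-character escape sequences \u2028 and \u2029.
// Both characters can only appear inside JSON string literals, where
// such escapes are valid, so the parsed result is unchanged.
function stringifyJsonSafe(value) {
  return JSON.stringify(value)
    .replace(/\u2028/g, '\\u2028')
    .replace(/\u2029/g, '\\u2029');
}

// Usage: the JSON text no longer contains the raw separators,
// but JSON.parse still restores them in the data.
const jsonStr = stringifyJsonSafe({ data: 'a\u2028b\u2029c' });
const obj = JSON.parse(jsonStr);
```

This leaves the compressed string itself untouched, so it only addresses the JSON.parse step and not any other place where the two characters might cause trouble.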

@pieroxy
Owner

pieroxy commented Feb 1, 2019

If you look at compressToUTF16, it is easy to adapt to your use case:

compressToUTF16 : function (input) {
    if (input == null) return "";
    return LZString._compress(input, 15, function(a){return f(a+32);}) + " ";
  }

You can add a simple check in the lambda provided (return f(a+32)) to map the two characters you want to avoid to other characters (for example 70001 and 70002).
You will then have to implement the reverse mapping in the lambda of the corresponding decompress method.
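
A minimal sketch of the two lambdas described above, assuming that f is String.fromCharCode and that the matching decompress lambda reads compressed.charCodeAt(index) - 32. The names encodeUnit and decodeUnit and the substitute values are hypothetical. One caveat: String.fromCharCode keeps only the low 16 bits of its argument, so this sketch uses 0x8020 and 0x8021 instead of values above 0xFFFF. With 15 bits per character plus 32, the unmodified lambda only produces 32 to 32799, so those two code units are otherwise unused.

```javascript
// Code units to avoid, and the hypothetical substitutes used in this sketch.
const AVOID = [0x2028, 0x2029];
const SUBSTITUTE = [0x8020, 0x8021]; // 32800 and 32801, just above 32799

// Forward lambda, in place of function(a){return f(a+32);}
function encodeUnit(a) {
  const code = a + 32;
  const i = AVOID.indexOf(code);
  return String.fromCharCode(i === -1 ? code : SUBSTITUTE[i]);
}

// Reverse lambda: recovers the 15-bit value stored at `index`.
function decodeUnit(compressed, index) {
  const code = compressed.charCodeAt(index);
  const i = SUBSTITUTE.indexOf(code);
  return (i === -1 ? code : AVOID[i]) - 32;
}
```

These would be passed to LZString._compress and the corresponding _decompress call in copies of compressToUTF16 and decompressFromUTF16.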

@Blanen

Blanen commented Apr 3, 2019

compressToBase64

@JobLeonard
Collaborator

Here you go: https://gist.github.com/JobLeonard/7a49b8e5adf17d9a3783ffcfa21eec3f

I also removed the string delimiter characters:

`
"
'

That way, copying and pasting a compressed string won't suddenly lead to strange broken-string behavior.

Did a quick test round here: https://observablehq.com/d/d223e05380aa85c9/

@cyfung1031

cyfung1031 commented Jan 8, 2024

You can use a few simple tricks.

For example, prepend a fixed-length marker M of 4 characters to every compressed string, chosen so that it does not occur anywhere in the compressed string.
Then replace each problem character, such as \u2028 and \u2029, with M followed by a shifted character: M\u1E28 and M\u1E29.
To decode, map M\u1E28 and M\u1E29 back to \u2028 and \u2029.

Choice of characters for M:

  • above 0xFF
  • drawn from a large range, so that random markers rarely collide
  • excluded from [\u2000-\u200A\u202F\u205F\u3000\u200B-\u200D\u2060\uFEFF\u180E\u2800\u3164]
  • do not conflict with the shifted characters produced by the offset

So \u3165 - \uFEFE should be a good range.
With a negative offset of 0x200, that becomes \u3165 - \uFCFE.

// Random integer in [a, b], inclusive.
function rndInt(a, b) {
  return Math.floor(Math.random() * (b - a + 1)) + a;
}

function compress(s) {
  let z = LZString.compress(s);
  let m = '';
  do {
    // Generate a fixed-length (4-character) marker.
    m = Array.from({ length: 4 }, () => String.fromCharCode(rndInt(0x3165, 0xFCFE))).join('');
  } while (z.includes(m)); // Ensure the marker does not occur in the compressed string.
  // Line-breaking: \u2028\u2029
  // Zero-width: \u200B-\u200D\u2060\uFEFF
  // Problem chars: \u180E\u2800\u3164
  // Other white space: \u2000-\u200A\u202F\u205F\u3000
  z = z.replace(/[\u2028\u2029\u200B-\u200D\u2060\uFEFF\u180E\u2800\u3164\u2000-\u200A\u202F\u205F\u3000]/g, e => m + String.fromCharCode(e.charCodeAt(0) - 0x200)); // Offset -0x200.
  return m + z;
}

function decompress(z) {
  const m = z.substring(0, 4); // The marker is the first 4 characters.
  z = z.substring(4);
  z = z.replace(new RegExp(`${m}(.)`, 'g'), (e, f) => String.fromCharCode(f.charCodeAt(0) + 0x200)); // Offset +0x200.
  return LZString.decompress(z);
}

Note:

  • Before offset: ["2028","2029","200B","200D","2060","FEFF","180E","2800","3164","2000","200A","202F","205F","3000"]
  • After offset: ["1E28","1E29","1E0B","1E0D","1E60","FCFF","160E","2600","2F64","1E00","1E0A","1E2F","1E5F","2E00"]
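
The table above can be re-derived with a short check. This sketch recomputes the shifted values and also tests two properties the scheme seems to rely on: no shifted character is itself a problem character, and none falls in the marker range \u3165 - \uFCFE. The variable names are illustrative only, and the list holds just the single characters and range endpoints from the note.

```javascript
const OFFSET = 0x200;
const before = ['2028', '2029', '200B', '200D', '2060', 'FEFF', '180E',
                '2800', '3164', '2000', '200A', '202F', '205F', '3000'];

// Shift each code unit down by the offset and format as 4 hex digits.
const shifted = before.map(h => parseInt(h, 16) - OFFSET);
const after = shifted.map(n => n.toString(16).toUpperCase().padStart(4, '0'));

// Same character class as in compress().
const problem = /[\u2028\u2029\u200B-\u200D\u2060\uFEFF\u180E\u2800\u3164\u2000-\u200A\u202F\u205F\u3000]/;

// Both lists should come out empty.
const stillProblem = shifted.filter(n => problem.test(String.fromCharCode(n)));
const inMarkerRange = shifted.filter(n => n >= 0x3165 && n <= 0xFCFE);

console.log(after.join(' '), stillProblem.length, inMarkerRange.length);
```

The largest shifted value is FCFF (from FEFF), which is why the marker range stops at FCFE.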

Another approach

This works like escape/unescape. Instead of a backslash, use the least frequent character t as the escape character, to keep the output as small as possible.

  1. Find the least frequent character t.

  2. Replace every occurrence of t with t0.

  3. Replace your unwanted characters with t1, t2, t3, ... t9, ta, tb, etc.

  4. Put the character t at the start as a prefix.

To decode, reverse the steps: the first character is t; replace t1, t2, t3, ... with the original characters, and replace t0 with t last.

This should be very efficient and reliable.

function findLeastFrequentChar(text) {
    const charCount = {};
    let minCount = Infinity;
    let leastFrequentChar = '';

    // Count the occurrences of each character
    for (let char of text) {
        if (!charCount[char]) {
            charCount[char] = 0;
        }
        charCount[char]++;
    }

    // Find the character with the least occurrences
    for (let char in charCount) {
        if (charCount[char] < minCount) {
            minCount = charCount[char];
            leastFrequentChar = char;
        }
    }

    return leastFrequentChar;
}
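
The thread gives only the frequency helper for this second approach, so here is one way the full encode and decode steps might look. It is a sketch under assumptions: the names UNWANTED, CODES, leastFrequentUnit, escapeEncode and escapeDecode are hypothetical, the unwanted list is limited to \u2028 and \u2029, characters are counted per UTF-16 code unit, and the escape character is never chosen from the unwanted list. Decoding is done in a single left-to-right pass rather than by successive replacements.

```javascript
// Hypothetical list of characters to avoid; extend as needed.
const UNWANTED = ['\u2028', '\u2029'];
// Codes for the unwanted characters; '0' is reserved for the escape char itself.
const CODES = '123456789abcdefghijklmnopqrstuvwxyz';

// Least frequent UTF-16 code unit that is not itself unwanted ('' if none).
function leastFrequentUnit(text) {
  const counts = new Map();
  for (let i = 0; i < text.length; i++) {
    counts.set(text[i], (counts.get(text[i]) || 0) + 1);
  }
  let best = '';
  let min = Infinity;
  for (const [c, n] of counts) {
    if (n < min && !UNWANTED.includes(c)) { min = n; best = c; }
  }
  return best;
}

function escapeEncode(text) {
  const t = leastFrequentUnit(text) || '!'; // fallback when nothing usable was found
  let out = t; // step 4: the escape character is the prefix
  for (let i = 0; i < text.length; i++) {
    const c = text[i];
    const k = UNWANTED.indexOf(c);
    if (c === t) out += t + '0';               // step 2: t becomes t0
    else if (k !== -1) out += t + CODES[k];    // step 3: unwanted becomes t1, t2, ...
    else out += c;
  }
  return out;
}

function escapeDecode(encoded) {
  const t = encoded[0];
  let out = '';
  for (let i = 1; i < encoded.length; i++) {
    const c = encoded[i];
    if (c !== t) { out += c; continue; }
    const code = encoded[++i]; // the character after t says what was escaped
    out += code === '0' ? t : UNWANTED[CODES.indexOf(code)];
  }
  return out;
}
```

In use, escapeEncode would wrap the output of LZString.compress, and escapeDecode would run before LZString.decompress. Avoiding regular expressions here sidesteps the case where t happens to be a regex metacharacter.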
