Util::normalizeLineEndings breaks UTF-8 #15

andrey-yantsen · 2021-12-15T16:51:58Z

Hi!

First of all, thank you for all the libraries you've built, they're magnificent! :)

I've encountered an error when working with \SebastianFeldmann\Git\Command\Diff\Compare\FullDiffList: it goes crazy when the diff has binary/UTF-8 data. After some digging I found out the reason: it's all because of \SebastianFeldmann\Cli\Output\Util::normalizeLineEndings. For example, the following text will be processed incorrectly: text: хо — the last two symbols are Cyrillic letters, but here's the result of the method:

php > var_dump(\SebastianFeldmann\Cli\Output\Util::normalizeLineEndings('text: хо'));
string(10) "text: �
о"

It could be enough to replace the regex pattern with ~(*BSR_ANYCRLF)\R~u; at least it fixes my case, but I'm not sure about the possible side effects.
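A minimal reproduction of the report, as a sketch. It assumes the library's pattern is essentially `\R` without the `u` modifier, since the real pattern is not quoted in this thread. The variable names are illustrative only. `(*BSR_ANYCRLF)` is the PCRE spelling of the option named in the suggestion above.

```php
<?php
// "хо" written as raw UTF-8 bytes: х = 0xD1 0x85, о = 0xD0 0xBE.
$text = "text: \xD1\x85\xD0\xBE";

// Without the u modifier PCRE matches byte by byte, and \R treats the
// single byte 0x85 (NEL, "next line") as a line break.
$broken = preg_replace('~\R~', "\n", $text);

// With the u modifier the subject is read as UTF-8, so 0xD1 0x85 is one
// character (U+0445) and nothing is replaced.
$fixed = preg_replace('~\R~u', "\n", $text);

// (*BSR_ANYCRLF) limits \R to CR, LF and CRLF, so 0x85 is left alone
// even without the u modifier.
$anycrlf = preg_replace('~(*BSR_ANYCRLF)\R~', "\n", $text);

var_dump(bin2hex($broken));   // string(20) "746578743a20d10ad0be"
var_dump($fixed === $text);   // bool(true)
var_dump($anycrlf === $text); // bool(true)
```

The `0a` in the hex dump sits where `85` used to be, which matches the broken `var_dump` output shown in the report.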


sebastianfeldmann · 2021-12-20T13:58:50Z

Oh, you are absolutely right, the u modifier should be there. If I'm not mistaken, this should not have any side effects.
I'll release a new version in a couple of minutes.

andrey-yantsen · 2021-12-20T14:20:32Z

Awesome, thanks! By side effects I mean how it will modify the behaviour of the code, and whether it should be treated as a breaking change or not.

sebastianfeldmann · 2021-12-20T14:35:56Z

And as always, it seems simple but it isn't, especially with encoding. Of course, adding the u breaks some other tests.

I'll investigate and see what I can find ;)

This fixes issue #15 in a very hacky way. The pure Cyrillic detection must be replaced with a more generic approach, but I could not figure one out yet, so this hack must do for now.

andrey-yantsen · 2021-12-20T15:11:20Z

Just in case: I had the same problem not only with Cyrillic, but with binary data as well. The first time I noticed the problem was when I ran FullDiffList over a vendored captainhook; the diff went crazy over this file: https://github.com/captainhookphp/captainhook/blob/main/tools/phive. But at that time I didn't really pay attention, because, well, it was binary data :)

sebastianfeldmann · 2021-12-20T15:29:03Z

The Unicode/UTF-8 stuff proves to be a pain in the *** again and again.
It seems PHP interprets some Cyrillic letters as line breaks if you don't specify the u modifier to tell PHP that it receives a UTF-8 string. But using this modifier all the time breaks ASCII and ISO handling.
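One plausible reason the `u` modifier cannot simply be switched on everywhere, shown as a sketch: with `u`, PCRE rejects any subject that is not valid UTF-8, and `preg_replace` returns `null` instead of a string. The ISO-8859-1 sample string is an assumption for illustration. The thread does not say which tests actually failed.

```php
<?php
// "café" followed by CRLF, encoded as ISO-8859-1: é is the single byte 0xE9,
// which is not a valid UTF-8 sequence on its own.
$latin1 = "caf\xE9\r\n";

// With the u modifier the whole call fails on the invalid subject.
$withU = preg_replace('~\R~u', "\n", $latin1);
$error = preg_last_error();

// Without the modifier the CRLF is normalized and the 0xE9 byte is kept.
$withoutU = preg_replace('~\R~', "\n", $latin1);

var_dump($withU);                          // NULL
var_dump($error === PREG_BAD_UTF8_ERROR);  // bool(true)
var_dump($withoutU === "caf\xE9\n");       // bool(true)
```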

I added a dirty hack to detect whether I have to use the modifier or not. For now that's fine, but I have to investigate further to come up with a more sustainable solution.
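A rough sketch of what a more generic detection could look like, not the code that shipped in 3.4.1. Instead of looking for Cyrillic characters specifically, it checks whether the whole input is valid UTF-8 and only then adds the `u` modifier. The function name `normalizeLineEndingsSketch` is hypothetical, and the `\R` pattern is an assumption about the library's regex.

```php
<?php
/**
 * Hypothetical helper: normalize line endings to "\n", using UTF-8 mode
 * only when the input actually is valid UTF-8.
 */
function normalizeLineEndingsSketch(string $text): string
{
    // An empty pattern with the u modifier matches only if the subject is
    // valid UTF-8; for invalid input preg_match returns false.
    $modifier = preg_match('//u', $text) === 1 ? 'u' : '';

    $result = preg_replace('~\R~' . $modifier, "\n", $text);

    // Fall back to the untouched input if PCRE reports an error.
    return $result ?? $text;
}

var_dump(normalizeLineEndingsSketch("a\r\nb\rc") === "a\nb\nc");      // bool(true)
var_dump(normalizeLineEndingsSketch("caf\xE9\r\n") === "caf\xE9\n");  // bool(true)
```

Input that is not valid UTF-8, such as binary data or Windows-1252 text, still goes through the byte-wise branch, where a lone 0x85 byte is treated as a line break. Prefixing the pattern with `(*BSR_ANYCRLF)` would avoid that, at the cost of no longer normalizing the less common line separators.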

Nevertheless, I released a new version, 3.4.1, with the fix.
Running composer update should fix the issue :)

If you need me to compile another PHAR for CaptainHook including the fix just let me know.

andrey-yantsen · 2021-12-20T15:31:33Z

That's enough for me right now, thanks! I hope you will be able to find a better solution later.

sebastianfeldmann self-assigned this Dec 20, 2021

sebastianfeldmann added the bug label Dec 20, 2021
