Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Util::normalizeLineEndings breaks UTF-8 #15

Open
andrey-yantsen opened this issue Dec 15, 2021 · 6 comments
Open

Util::normalizeLineEndings breaks UTF-8 #15

andrey-yantsen opened this issue Dec 15, 2021 · 6 comments
Assignees
Labels

Comments

@andrey-yantsen
Copy link

Hi!

First of all, thank you for all the libraries you've built, they're magnificent! :)

I've encountered an error when working with \SebastianFeldmann\Git\Command\Diff\Compare\FullDiffList: it goes crazy when the diff has binary/UTF-8 data. After some digging I found out the reason: it's all because of \SebastianFeldmann\Cli\Output\Util::normalizeLineEndings. For example, the following text will be processed incorrectly: text: хо — the last two symbols are Cyrillic letters, but here's the result of the method:

php > var_dump(\SebastianFeldmann\Cli\Output\Util::normalizeLineEndings('text: хо'));
string(10) "text: �
о"

It could be enough to replace the regex pattern with ~(BSR_ANYCRLF)*\R~u — at least it fixes my case, but I'm not sure about the possible side effects.

@sebastianfeldmann
Copy link
Owner

Oh you are absolutely right the u modifier should be there. If I'm not mistaken this should not have any side effects.
I'll release a new version in a couple of minutes

@andrey-yantsen
Copy link
Author

Awesome, thanks! I think by the side effects I mean how it will modify the behaviour of the code, and whether it should be treated as a breaking change or not.

@sebastianfeldmann
Copy link
Owner

And as always, it seems simple but it isn't especially with encoding. Of course adding the u breaks some other tests.

I'll investigate and see what I can find ;)

sebastianfeldmann added a commit that referenced this issue Dec 20, 2021
This fixes issue #15 in a very hacky way.
The pure cyrillic detection must be replaced with a
more generic approach. But for now I could not figure
it out. So this hack must do for now.
@andrey-yantsen
Copy link
Author

Just in case — I had the same problem not only with Cyrillic, but with binary data as well. The first time I noticed the problem was when I run the FullDiffList over a vendored captainhook — the diff went crazy over this file: https://github.com/captainhookphp/captainhook/blob/main/tools/phive. But that time I didn't really paid attention, because, well, it was binary data :)

@sebastianfeldmann
Copy link
Owner

The unicode/utf8 stuff proves to be a pain in the *** again and again.
It seems PHP interprets some cyrillic letters as line breaks if you don't specify the u modifier to tell PHP that it receives an UTF-8 string. But using this modifier all the time breaks ASCII and ISO handling.

I added a dirty hack to detect if I have to use the modifier or not. For now that's fine but I have to investigate that further to come up with a more sustainable solution.

Never the less I released a new version 3.4.1 with the fix.
composer update should fix the issue :)

If you need me to compile another PHAR for CaptainHook including the fix just let me know.

@andrey-yantsen
Copy link
Author

That's enough for me right now, thanks! I hope you will be able to find a better solution later.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants