Unsupported encodings? #59
You're right that the underlying reader is byte-centric and that the wrapper approach falls down on null bytes. None of CSV's control characters fall outside ASCII -- are you saying that UTF-16 encodings of these control characters include null bytes? (I've not encountered UTF-16 CSVs in my work.)
Yes. The UTF-16 encoding of an ASCII file is simply that ASCII file with a null byte inserted after each byte (or before, depending on whether it's UTF-16LE or UTF-16BE).
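A quick illustration of that point, using nothing but Python's built-in `str.encode` (the sample string is mine, not from the thread):

```python
# UTF-16LE represents each ASCII character as that byte followed by a NUL;
# UTF-16BE puts the NUL first.
text = "a,b"
le = text.encode("utf-16-le")
be = text.encode("utf-16-be")
assert le == b"a\x00,\x00b\x00"
assert be == b"\x00a\x00,\x00b"
# Even the comma delimiter itself carries a NUL byte in UTF-16,
# which is why a byte-oriented CSV parser misreads the stream.
```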
... I'm a bit surprised I haven't heard complaints about this before. :) You're right that the approach would need to change to fix this -- namely, wrapping the given file in a decoder before handing it to the underlying reader.
Does this only affect the reader, or does it also affect the writer? I think the "right" solution is to create a backport of the Python 3 csv module, which works only on unicode, and wrap the file being read in a decoder. However, that is, at best, a ways off. One possible, but terrible-performing, approach would be to wrap the file in a decoder, and also in an encoder to some acceptable encoding.
Turns out that the Python 2 csv module documentation actually has an example doing exactly that at the end: https://docs.python.org/2/library/csv.html#examples I've now written a pure-Python backport of the Python 3 csv module, so we could choose that approach to solving the problem: it would just use the same code as the Python 3 version of unicodecsv whenever we detected one of these encodings. But because the implementation is pure Python, I think it could well be slower than the special encoding wrapper. @jdunck: I'm interested in writing up a solution to this problem, but I'm not sure which approach would be better. Is it better to use the decoder, or to use the backport?
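A minimal sketch of that decode-then-re-encode idea using only the stdlib `codecs` module (the function name and test data here are illustrative, not code from this project):

```python
import codecs
import io

def recode(fileobj, src_encoding, dst_encoding="utf-8"):
    """Lazily transcode a binary file object from src_encoding to
    dst_encoding, chunk by chunk.  The incremental decoder handles
    chunk boundaries that fall mid-character, and the downstream
    byte-oriented CSV parser never sees NUL bytes."""
    decoded = codecs.iterdecode(fileobj, src_encoding)
    return codecs.iterencode(decoded, dst_encoding)

# Simulate a UTF-16 file (with BOM) as an in-memory byte stream.
raw = io.BytesIO("x,y\n1,2\n".encode("utf-16"))
utf8_bytes = b"".join(recode(raw, "utf-16"))
assert utf8_bytes == b"x,y\n1,2\n"
assert b"\x00" not in utf8_bytes
```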
I wanted to parse a UTF-16 CSV file, so I did something like this:
Unfortunately, this just raises an exception when I try to read from it. I looked at the unicodecsv source code, and I don't think the unicodecsv approach can ever work for this case. It tries loading the input stream as 8-bit characters, and then decodes each cell value. Python's 'csv' module can't handle NUL bytes, which are common in UTF-16, so this fails.
I think the answer to this is that the unicodecsv library only works for encodings like UTF-8 or Latin-1, which are supersets of ASCII and don't use 0x00 bytes. Is this true? If so, we should put it in the documentation.
(Also, I think this means I should really upgrade to Python 3!)
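In Python 3 this does become straightforward, since the stdlib csv module operates on text: wrap the byte stream in a decoder and the NUL bytes never reach the parser. A small illustration with made-up sample data:

```python
import csv
import io

# Simulate a UTF-16 file on disk (with BOM) as an in-memory byte stream.
raw = io.BytesIO("name,city\nJosé,Zürich\n".encode("utf-16"))

# TextIOWrapper decodes the bytes before csv ever sees them,
# so the NUL bytes in the UTF-16 stream are no longer a problem.
reader = csv.reader(io.TextIOWrapper(raw, encoding="utf-16"))
rows = list(reader)
assert rows == [["name", "city"], ["José", "Zürich"]]
```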