I'm writing a MapReduce script (and am thus working with input/output streams).
If I use the unicodecsv module:
#!/usr/bin/python
import sys
import unicodecsv as csv

def mapper():
    reader = csv.reader(sys.stdin, delimiter='\t')
    # writer = csv.writer(sys.stdout, delimiter='\t', quotechar='"', quoting=csv.QUOTE_ALL)
    for line in reader:
        print line
Then I get the error:
Traceback (most recent call last):
File "scripts/streaming/adwords/mapper.py", line 30, in <module>
mapper()
File "scripts/streaming/adwords/mapper.py", line 10, in mapper
for line in reader:
File "/usr/local/lib/python2.7/dist-packages/unicodecsv/py2.py", line 117, in next
row = self.reader.next()
_csv.Error: line contains NULL byte
If I read the file with pandas
data = pandas.read_csv(input_file, encoding='utf-16', sep='\t', skiprows=5, skip_footer=1, engine='python')
then everything works like a charm.
I don't know how to resolve this issue. I have tried almost everything, even opening and saving the file (in UTF-8) with LibreOffice, but that can't be a solution because my CSV files are too big for LibreOffice.
If I open/save the file with LibreOffice in UTF-8 and run the script again, the strings in the lines are prefixed with u. I know this has something to do with encodings, but it's not clear to me how it works.
Preferably I want to read the (Unicode, I guess) input stream, map it line by line (encoding it to UTF-8), and write it like writer.writerow((line[0] + line[2], line[5])) so that my reducer.py doesn't have to hassle with encodings.
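For reference, the write side I have in mind would look roughly like this (a sketch only; the sample row and column indices are made up). As far as I can tell, unicodecsv's writer takes an encoding argument, so the mapped rows would go out as UTF-8 bytes and reducer.py would only ever see plain UTF-8 text:

#!/usr/bin/python
# Sketch of the intended mapper output. The sample row below is
# hypothetical; it just stands in for one parsed line of the input.
import sys
import unicodecsv as csv

writer = csv.writer(sys.stdout, delimiter='\t', encoding='utf-8')

line = [u'campaign', u'2015-01-01', u'adgroup', u'keyword', u'exact', u'42']
writer.writerow((line[0] + line[2], line[5]))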
Any help would be deeply appreciated.
The first issue is that you're not using the right encoding. The unicodecsv reader requires a binary-opened file, and the reader function takes an encoding argument, which defaults to utf-8; that default is wrong for your data.
Unfortunately, unicodecsv doesn't yet support utf-16, because the underlying reader doesn't allow any NULL bytes. We have discussed a couple of ideas for fixing it, but none of them has been implemented yet.
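In the meantime, one thing that might work (an untested sketch, assuming your stdin really is UTF-16 and reusing the column indices from your example) is to decode the raw stream yourself with codecs and hand UTF-8 encoded lines to the reader, so the underlying _csv module never sees NULL bytes:

#!/usr/bin/python
# Workaround sketch: decode the UTF-16 byte stream manually and feed
# UTF-8 encoded lines to unicodecsv. The column indices are taken from
# the example in the question and are otherwise arbitrary.
import codecs
import sys

import unicodecsv as csv


def utf16_lines_as_utf8(stream):
    # codecs.getreader('utf-16') handles the BOM and yields unicode lines
    for line in codecs.getreader('utf-16')(stream):
        yield line.encode('utf-8')


def mapper():
    reader = csv.reader(utf16_lines_as_utf8(sys.stdin), delimiter='\t')
    writer = csv.writer(sys.stdout, delimiter='\t', encoding='utf-8')
    for row in reader:
        writer.writerow((row[0] + row[2], row[5]))


if __name__ == '__main__':
    mapper()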