working with streams #73

Open
willembressers opened this issue Jan 25, 2016 · 1 comment
Comments

@willembressers

I'm writing a MapReduce script (and am thus working with input/output streams).

If I use the unicodecsv module:

#!/usr/bin/python
import sys
import unicodecsv as csv


def mapper():
    reader = csv.reader(sys.stdin, delimiter='\t')
    # writer = csv.writer(sys.stdout, delimiter='\t', quotechar='"', quoting=csv.QUOTE_ALL)

    for line in reader:
        print line

Then I get the error:

Traceback (most recent call last):
  File "scripts/streaming/adwords/mapper.py", line 30, in <module>
    mapper()
  File "scripts/streaming/adwords/mapper.py", line 10, in mapper
    for line in reader:
  File "/usr/local/lib/python2.7/dist-packages/unicodecsv/py2.py", line 117, in next
    row = self.reader.next()
_csv.Error: line contains NULL byte

If I read the file with pandas:

data = pandas.read_csv(input_file, encoding='utf-16', sep='\t', skiprows=5, skip_footer=1, engine='python')

then everything works like a charm.

I don't know how to resolve this issue. I tried almost everything, even opening the file in LibreOffice and saving it as UTF-8, but that can't be a solution because my CSV files are too big for LibreOffice.

If I open and save the file with LibreOffice in UTF-8 and run the script again, the strings in each line are prefixed with u. I know this has something to do with encodings, but it's not clear to me how it works.

Preferably, I want to read the input stream (Unicode, I guess), map it line by line (encoding it to UTF-8), and write it with something like writer.writerow((line[0] + line[2], line[5])) so that my reducer.py doesn't have to deal with encodings.

Any help would be deeply appreciated.

@ryanhiebert
Collaborator

The first issue is that you're not using the right encoding. The unicodecsv reader requires a file opened in binary mode. The reader function takes an encoding argument, which defaults to utf-8, and that is incorrect for your file.

Unfortunately, unicodecsv doesn't yet support utf-16, because the underlying reader doesn't allow any null bytes. We have talked about a couple of ideas for fixing it, but none has been implemented yet.
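
In the meantime, one possible workaround is to decode the UTF-16 stream yourself before the CSV parser sees it, so the parser never encounters the null bytes. The following is only a rough sketch, and it assumes Python 3 and the standard-library csv module rather than unicodecsv. The names read_utf16_tsv and mapper are made up for illustration, and the rows are passed through unchanged instead of being mapped.

```python
import csv
import io


def read_utf16_tsv(binary_stream):
    """Yield rows from a tab-separated, UTF-16 encoded binary stream.

    Hypothetical helper for illustration, not part of unicodecsv.
    """
    # Decode incrementally so a large input is never loaded whole.
    # newline='' leaves line-ending handling to the csv module.
    text_stream = io.TextIOWrapper(binary_stream, encoding='utf-16', newline='')
    for row in csv.reader(text_stream, delimiter='\t'):
        yield row


def mapper(binary_in, text_out):
    """Copy every row from the UTF-16 input to a tab-separated text output."""
    writer = csv.writer(text_out, delimiter='\t', lineterminator='\n')
    for row in read_utf16_tsv(binary_in):
        # Replace this pass-through with the real mapping logic.
        writer.writerow(row)
```

In a streaming job this would be called as mapper(sys.stdin.buffer, sys.stdout). The encoding of the bytes actually written depends on how sys.stdout is configured, so check that it is UTF-8 if the reducer expects that.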
