Bag of Words + Unicode Decode Unicode cruft returns #87

Open
roomthily opened this issue Apr 30, 2015 · 3 comments

@roomthily
Contributor

Honestly not sure if this ran before/after implementation but it's fantastic either way.

[image: unicode_cruft_fail]

And also really want to know where this came from re: devil donuts.

@roomthily roomthily added the bug label Apr 30, 2015
@roomthily roomthily self-assigned this Apr 30, 2015
@roomthily
Contributor Author

And whatever we want to call this:

[image: solr_cruft_fail]

Update: this is called trying to embed math formulas in an email listserv.

@roomthily
Contributor Author

Fun fact: the cruft in the first image is emoji-related, and we don't handle that (or Nutch doesn't).

The source file:

[image: we_cannot_handle_emoji]
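
For anyone hitting the same thing, here is a minimal sketch of one way emoji can turn into cruft in a bag of words, and one way to drop them before tokenizing. This is an illustration only, not what the pipeline or Nutch actually does: the helper names (`strip_non_bmp`, `bag_of_words`), the sample string, and the guess that a wrong-codec decode is involved are all assumptions.

```python
import re
from collections import Counter

# Hypothetical helper, not from the project: matches every code point outside
# the Basic Multilingual Plane, which is where most emoji live.
_NON_BMP = re.compile(r"[\U00010000-\U0010FFFF]")


def strip_non_bmp(text):
    """Replace each non-BMP character (e.g. an emoji) with a space."""
    return _NON_BMP.sub(" ", text)


def bag_of_words(text):
    """Naive whitespace bag of words, computed after dropping non-BMP characters."""
    return Counter(strip_non_bmp(text).lower().split())


# Made-up sample: two emoji encoded as UTF-8 take four bytes each.
raw = "devil donuts \U0001F608\U0001F369".encode("utf-8")

# Assumed failure mode: decoding with the wrong codec turns every emoji byte
# into its own junk character, which then lands in the bag as a token.
mangled = raw.decode("latin-1")

# Decoding as UTF-8 keeps each emoji as a single code point that can be stripped.
clean = raw.decode("utf-8")

print(len(mangled), len(clean))
print(bag_of_words(clean))
```

Stripping everything outside the BMP is deliberately blunt: it also drops legitimate non-emoji characters such as rarer CJK ideographs, so a real fix would probably want a narrower filter.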

@roomthily
Contributor Author

Should be okay with the unicode/decode here: b429761.

Honestly quite disappointed in the lack of emoji support :(.
