
Feature/port to py3 #184

Open

wants to merge 33 commits into develop
Conversation

@geron commented Nov 28, 2016

No description provided.

flavioamieiro and others added 27 commits November 23, 2016 14:02
It looks like we are not using it anymore.

Celery 4.0 won't automatically register tasks that inherit from Task. We need to solve this before we can use the newer version. Celery's documentation only mentions class-based tasks in a way that is very different from what we do here, though. (A sketch of the issue follows this list of commits.)

This test was changed by 2to3; that's why it broke.

This test now makes sure only the correct document is updated.

2to3 introduced an error because it couldn't know what StringIO was being used for.

The next few commits will change tests in a similar manner. This test is no longer touching the database, because we rely on the PyPLNTask test to test the fetch/save functionality.

List sorting changed in Python 3, and apparently string sorting did too.

This test is still not passing because cld is throwing an exception.

Address a possible exception raised by Magic.id_buffer and remove the superfluous text.decode('utf-8', 'replace') call, since decoding with the iso8859-1 codec will never raise a UnicodeDecodeError exception.
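For context on the Celery 4.0 registration problem mentioned above, here is a minimal sketch of the explicit registration Celery 4 expects for class-based tasks. The names are illustrative assumptions, not this project's actual code:

from celery import Celery, Task

# Illustrative app and task names only -- not the project's real ones.
app = Celery('pypln_sketch')

class PyPLNTask(Task):
    # Celery 3.x auto-registered Task subclasses at import time;
    # Celery 4.0 dropped that behavior, so this class on its own is
    # never registered with the app.
    name = 'pypln_sketch.PyPLNTask'

    def run(self, document_id):
        return document_id

# The explicit registration Celery 4 expects for class-based tasks:
pypln_task = app.register_task(PyPLNTask())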
@flavioamieiro (Member) left a comment:
Thanks for all the work, @geron. I know it's not completely finished, but aside from the small comments I made, I think everything is great. Check out what I said about the options we have for the \x96 byte issue, and let's see if @fccoelho has any other insight.

checker.set_text(document['text'])
errors = [[e.word, e.wordpos, e.suggest()] for e in checker]
except KeyError:
errors = None

@flavioamieiro (Member):
I really like the change in the way we treat the case where the specific dictionary does not exist (it wasn't a very good idea to handle that by catching an exception), and I understand why you kept the try/except block (because there is another possible KeyError there). But now that this change was made, I think we can get rid of the entire try/except. That KeyError is possible in every Task (it represents a pre-requisite not being there when the task runs). I think I prefer to let this exception trickle up so we can see later that there was an error (as happens everywhere else). In short: you found a bug that meant we were silencing errors that shouldn't be silenced :)
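For illustration, a hedged sketch of what that could look like (the spellcheck name and the document shape are assumptions here, not the PR's actual code): the missing-dictionary case is handled explicitly, and a missing pre-requisite key is left to raise:

import enchant
from enchant.checker import SpellChecker

def spellcheck(document):
    # A missing 'language' or 'text' key is a pre-requisite failure;
    # let the KeyError trickle up instead of silencing it.
    language = document['language']
    if not enchant.dict_exists(language):
        # No dictionary for this language is an expected condition,
        # not an error worth an exception.
        return None
    checker = SpellChecker(language)
    checker.set_text(document['text'])
    return [[err.word, err.wordpos, err.suggest()] for err in checker]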

# Decoding with iso8859-1 doesn't raise UnicodeDecodeError, so this is
# a last resort.
result = text.decode('iso8859-1')
return result

@flavioamieiro (Member):
From what I understand here, the behavior didn't change because we were never executing result = text.decode('utf-8', 'replace') (since decoding with iso8859-1 never raises a UnicodeDecodeError), right? If that's the case, perfect. If not, I think it would be a good idea to keep the forced decoding.

Do you know why decoding with iso8859-1 never raises this exception? (I'm not doubting it, just didn't understand the reason :) )

@geron (Author):
Because iso8859-1 is a single-byte encoding and every byte from \x00 to \xff is a "valid" char in it. If, for instance, you were to erroneously decode chars that were encoded in UTF-8, you would just obtain a sequence of strange chars (a.k.a. mojibake) for each multi-byte UTF-8 char. If you can provide a test case that proves otherwise, I'd be glad to change my mind, though.
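To make the mojibake point concrete, a quick interpreter check (not from the PR itself):

>>> 'é'.encode('utf-8')
b'\xc3\xa9'
>>> # Decoding those two UTF-8 bytes as iso8859-1 succeeds, but
>>> # produces mojibake instead of raising an error:
>>> 'é'.encode('utf-8').decode('iso8859-1')
'Ã©'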

A sort of proof for what I said about every byte being a valid iso8859-1 char:

In [1]: import struct

In [2]: for i in range(256):
   ...:     byte = struct.pack('B', i)
   ...:     char = byte.decode('iso8859-1')
   ...:     print(i, repr(char), char)
   ...:     
0 '\x00' 
1 '\x01'
2 '\x02'
3 '\x03'
4 '\x04'
5 '\x05'
6 '\x06'
7 '\x07'
8 '\x08' 
9 '\t' 	
10 '\n' 

11 '\x0b' 

12 '\x0c' 

13 '\r' 
14 '\x0e' 
15 '\x0f' 
16 '\x10'
17 '\x11'
18 '\x12'
19 '\x13'
20 '\x14'
21 '\x15'
22 '\x16'
23 '\x17'
24 '\x18'
25 '\x19'
26 '\x1a'
27 '\x1b'
28 '\x1c'
29 '\x1d'
30 '\x1e'
31 '\x1f'
32 ' '
33 '!' !
34 '"' "
35 '#' #
36 '$' $
37 '%' %
38 '&' &
39 "'" '
40 '(' (
41 ')' )
42 '*' *
43 '+' +
44 ',' ,
45 '-' -
46 '.' .
47 '/' /
48 '0' 0
49 '1' 1
50 '2' 2
51 '3' 3
52 '4' 4
53 '5' 5
54 '6' 6
55 '7' 7
56 '8' 8
57 '9' 9
58 ':' :
59 ';' ;
60 '<' <
61 '=' =
62 '>' >
63 '?' ?
64 '@' @
65 'A' A
66 'B' B
67 'C' C
68 'D' D
69 'E' E
70 'F' F
71 'G' G
72 'H' H
73 'I' I
74 'J' J
75 'K' K
76 'L' L
77 'M' M
78 'N' N
79 'O' O
80 'P' P
81 'Q' Q
82 'R' R
83 'S' S
84 'T' T
85 'U' U
86 'V' V
87 'W' W
88 'X' X
89 'Y' Y
90 'Z' Z
91 '[' [
92 '\\' \
93 ']' ]
94 '^' ^
95 '_' _
96 '`' `
97 'a' a
98 'b' b
99 'c' c
100 'd' d
101 'e' e
102 'f' f
103 'g' g
104 'h' h
105 'i' i
106 'j' j
107 'k' k
108 'l' l
109 'm' m
110 'n' n
111 'o' o
112 'p' p
113 'q' q
114 'r' r
115 's' s
116 't' t
117 'u' u
118 'v' v
119 'w' w
120 'x' x
121 'y' y
122 'z' z
123 '{' {
124 '|' |
125 '}' }
126 '~' ~
127 '\x7f'
128 '\x80'
129 '\x81'
130 '\x82'
131 '\x83'
132 '\x84'
133 '\x85'
134 '\x86'
135 '\x87'
136 '\x88'
137 '\x89'
138 '\x8a'
139 '\x8b'
140 '\x8c'
141 '\x8d'
142 '\x8e'
143 '\x8f'
144 '\x90'
145 '\x91'
146 '\x92'
147 '\x93'
148 '\x94'
149 '\x95'
150 '\x96'
151 '\x97'
152 '\x98'
153 '\x99'
154 '\x9a'
155 '\x9b'
156 '\x9c'
157 '\x9d'
158 '\x9e'
159 '\x9f'
160 '\xa0'
161 '¡' ¡
162 '¢' ¢
163 '£' £
164 '¤' ¤
165 '¥' ¥
166 '¦' ¦
167 '§' §
168 '¨' ¨
169 '©' ©
170 'ª' ª
171 '«' «
172 '¬' ¬
173 '\xad' ­
174 '®' ®
175 '¯' ¯
176 '°' °
177 '±' ±
178 '²' ²
179 '³' ³
180 '´' ´
181 'µ' µ
182 '¶' ¶
183 '·' ·
184 '¸' ¸
185 '¹' ¹
186 'º' º
187 '»' »
188 '¼' ¼
189 '½' ½
190 '¾' ¾
191 '¿' ¿
192 'À' À
193 'Á' Á
194 'Â' Â
195 'Ã' Ã
196 'Ä' Ä
197 'Å' Å
198 'Æ' Æ
199 'Ç' Ç
200 'È' È
201 'É' É
202 'Ê' Ê
203 'Ë' Ë
204 'Ì' Ì
205 'Í' Í
206 'Î' Î
207 'Ï' Ï
208 'Ð' Ð
209 'Ñ' Ñ
210 'Ò' Ò
211 'Ó' Ó
212 'Ô' Ô
213 'Õ' Õ
214 'Ö' Ö
215 '×' ×
216 'Ø' Ø
217 'Ù' Ù
218 'Ú' Ú
219 'Û' Û
220 'Ü' Ü
221 'Ý' Ý
222 'Þ' Þ
223 'ß' ß
224 'à' à
225 'á' á
226 'â' â
227 'ã' ã
228 'ä' ä
229 'å' å
230 'æ' æ
231 'ç' ç
232 'è' è
233 'é' é
234 'ê' ê
235 'ë' ë
236 'ì' ì
237 'í' í
238 'î' î
239 'ï' ï
240 'ð' ð
241 'ñ' ñ
242 'ò' ò
243 'ó' ó
244 'ô' ô
245 'õ' õ
246 'ö' ö
247 '÷' ÷
248 'ø' ø
249 'ù' ù
250 'ú' ú
251 'û' û
252 'ü' ü
253 'ý' ý
254 'þ' þ
255 'ÿ' ÿ

@flavioamieiro (Member):
@geron that makes perfect sense to me. As I said, I didn't doubt it, I just didn't understand it before. Now I am a little ashamed of not realizing that.

call('utf-8'),
call('iso8859-1')])


@flavioamieiro (Member):
Thanks! We really needed this!

u", ".join(unicode(item) for item in diff_set)))
self.assertEqual(refreshed_document['mimetype'], 'application/pdf')
", ".join(str(item) for item in diff_set)))
self.assertEqual(result['mimetype'], 'application/pdf')

@flavioamieiro (Member):
For some reason, when running this test on my machine, the date format is different. It might be a version of pdfinfo or something like that. For the test to pass on my machine I needed the following patch:

diff --git a/tests/test_worker_extractor.py b/tests/test_worker_extractor.py
index 4c94349..9f4d2e0 100644
--- a/tests/test_worker_extractor.py
+++ b/tests/test_worker_extractor.py
@@ -123,7 +123,7 @@ class TestExtractorWorker(TestCase):
                 'Author':         'Álvaro Justen',
                 'Creator':        'Writer',
                 'Producer':       'LibreOffice 3.5',
-                'CreationDate':   'Fri Jun  1 17:07:57 2012',
+                'CreationDate':   'Fri Jun  1 17:07:57 2012 BRT',
                 'Tagged':         'no',
                 'Pages':          '1',
                 'Encrypted':      'no',

This probably does not happen on your system, so we might need to check whether the problem really is with software versions, and make sure we are always using the same versions everywhere.
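One possible way around it, just a sketch of an alternative and not something this PR does (the metadata name is illustrative), would be to make the assertion tolerant of the trailing timezone that some pdfinfo versions append:

# Accept an optional timezone suffix such as 'BRT' after the known
# date instead of requiring an exact match.
self.assertTrue(
    metadata['CreationDate'].startswith('Fri Jun  1 17:07:57 2012'))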

result = Extractor().process(data)
self.assertEqual(result['text'], expected)
self.assertEqual(result['file_metadata'], {})
self.assertEqual(result['language'], 'en')
@flavioamieiro (Member):

This is the test that is still not passing because pycld2 is blowing up with the weird byte, right?

I don't know exactly what to do about this. I don't think any of the options is very good. From what I remember, you gave me three options (of course we can think of new ones):

  1. Try to use the old cld version
  2. Not detect language for these documents
  3. Try to remove the weird byte somehow

1 sounds terrible because it is old, unmaintained code that does not support Python 3. It's my last option!

2 has one big problem: if we don't detect the language, we won't run palavras for the documents that are in Portuguese. This means a lot of the important analysis won't happen.

3 sounds a bit more interesting if we can do it the right way (a sketch follows below). I'm worried about two things: we should only try to do it when we detect the weird byte in the document, and it also should be reasonably fast. This option has another limitation, though: if we find different, new weird bytes, we have to change our code to check for them as well.
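To sketch what option 3 could look like (all names here are illustrative assumptions, not this branch's code): strip the C1 control characters, the range where \x96 lands after an iso8859-1 decode, and only when one is actually present:

import re

import pycld2

# U+0080-U+009F is the C1 control range; an iso8859-1 decode maps the
# troublesome \x96 byte straight into it.
C1_CONTROLS = re.compile('[\x80-\x9f]')

def detect_language(text):
    # Only rewrite the text when a weird character is actually there,
    # so the common case stays fast.
    if C1_CONTROLS.search(text):
        text = C1_CONTROLS.sub('', text)
    is_reliable, _, details = pycld2.detect(text)
    # details looks like (('ENGLISH', 'en', 95, 1024.0), ...)
    return details[0][1] if is_reliable else None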

@fccoelho Any ideas on how to deal with this?

@fccoelho (Member) commented Dec 1, 2016

Minor comment: Mongodict is still in the requirements...

@geron (Author) commented Dec 1, 2016

@fccoelho The purpose of this branch is only to port to Python 3; migrating to another database will come in steps, in another branch, after this one is merged.

@flavioamieiro (Member)

Mongodict hasn't been used for about a year, so if it is in the requirements file, it should go. I didn't find it there, though.

@fccoelho (Member) commented Dec 1, 2016 via email

@fccoelho (Member) commented Dec 8, 2016

In the bigram implementation, the order of the bigrams does not matter, since they only represent the unique bigrams found in a document.
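A tiny sketch of that representation (illustrative names, not the project's implementation): membership, not order, is what matters, so a set fits naturally:

def unique_bigrams(tokens):
    # zip pairs each token with its successor; the set keeps only the
    # distinct pairs, discarding order and duplicates.
    return {(first, second) for first, second in zip(tokens, tokens[1:])}

unique_bigrams(['o', 'rato', 'roeu', 'a', 'roupa'])
# {('o', 'rato'), ('rato', 'roeu'), ('roeu', 'a'), ('a', 'roupa')}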
