
Feature/port to py3 #184

Open

wants to merge 33 commits into develop
Conversation

@geron commented Nov 28, 2016

No description provided.

flavioamieiro and others added 27 commits November 23, 2016 14:02
It looks like we are not using it anymore.

Celery 4.0 won't automatically register tasks that inherit from Task. We need to solve this before we can use the newer version. Celery's documentation only mentions class-based tasks in a way that is very different from what we do here, though. (A sketch of the issue follows this list of commits.)

This test was changed by 2to3; that's why it broke.

This test now makes sure only the correct document is updated.

2to3 introduced an error because it couldn't know what StringIO was being used for.

The next few commits will change tests in a similar manner. This test is no longer touching the database, because we rely on the PyPLNTask test to test the fetch/save functionality.

List sorting changed in Python 3, and apparently string sorting did too.

This test is still not passing because cld is throwing an exception.

Address a possible exception raised by Magic.id_buffer and remove the superfluous text.decode('utf-8', 'replace') call, since decoding with the iso8859-1 codec will never raise a UnicodeDecodeError exception.
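For context on the Celery 4.0 registration problem mentioned above, here is a minimal sketch of the explicit registration Celery 4 expects for class-based tasks. The names are illustrative assumptions, not this project's actual code:

from celery import Celery, Task

# Illustrative app and task names only -- not the project's real ones.
app = Celery('pypln_sketch')

class PyPLNTask(Task):
    # Celery 3.x auto-registered Task subclasses at import time;
    # Celery 4.0 dropped that behavior, so this class on its own is
    # never registered with the app.
    name = 'pypln_sketch.PyPLNTask'

    def run(self, document_id):
        return document_id

# The explicit registration Celery 4 expects for class-based tasks:
pypln_task = app.register_task(PyPLNTask())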
@flavioamieiro (Member) left a comment:
Thanks for all the work, @geron. I know it's not completely finished, but aside from the small comments I made, I think everything is great. Check out what I said about the options we have for the \x96 byte issue, and let's see if @fccoelho has any other insight.

checker.set_text(document['text'])
errors = [[e.word, e.wordpos, e.suggest()] for e in checker]
except KeyError:
errors = None

@flavioamieiro (Member):
I really like the change in the way we treat the case where the specific dictionary does not exist (it wasn't a very good idea to handle that by catching an exception), and I understand why you kept the try/except block (because there is another possible KeyError there). But now that this change was made, I think we can get rid of the entire try/except. That KeyError is possible in every Task (it represents a pre-requisite not being there when the task runs). I think I prefer to let this exception trickle up so we can see later that there was an error (as happens everywhere else). In short: you found a bug that meant we were silencing errors that shouldn't be silenced :)
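For illustration, a hedged sketch of what that could look like (the spellcheck name and the document shape are assumptions here, not the PR's actual code): the missing-dictionary case is handled explicitly, and a missing pre-requisite key is left to raise:

import enchant
from enchant.checker import SpellChecker

def spellcheck(document):
    # A missing 'language' or 'text' key is a pre-requisite failure;
    # let the KeyError trickle up instead of silencing it.
    language = document['language']
    if not enchant.dict_exists(language):
        # No dictionary for this language is an expected condition,
        # not an error worth an exception.
        return None
    checker = SpellChecker(language)
    checker.set_text(document['text'])
    return [[err.word, err.wordpos, err.suggest()] for err in checker]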

# Decoding with iso8859-1 doesn't raise UnicodeDecodeError, so this is
# a last resort.
result = text.decode('iso8859-1')
return result

@flavioamieiro (Member):
From what I understand here, the behavior didn't change because we were never executing result = text.decode('utf-8', 'replace') (since decoding with iso8859-1 never raises a UnicodeDecodeError), right? If that's the case, perfect. If not, I think it would be a good idea to keep the forced decoding.

Do you know why decoding with iso8859-1 never raises this exception? (I'm not doubting it, just didn't understand the reason :) )

@geron (Author):
Because iso8859-1 is a single-byte encoding and every byte from \x00 to \xff is a "valid" char in it. If, for instance, you were to erroneously decode chars that were encoded in UTF-8, you would just obtain a sequence of strange chars (a.k.a. mojibake) for each multi-byte UTF-8 char. If you can provide a test case that proves otherwise, I'd be glad to change my mind, though.
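To make the mojibake point concrete, a quick interpreter check (not from the PR itself):

>>> 'é'.encode('utf-8')
b'\xc3\xa9'
>>> # Decoding those two UTF-8 bytes as iso8859-1 succeeds, but
>>> # produces mojibake instead of raising an error:
>>> 'é'.encode('utf-8').decode('iso8859-1')
'Ã©'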

A sort of proof for what I said about every byte being a valid iso8859-1 char:

In [1]: import struct

In [2]: for i in range(256):
   ...:     byte = struct.pack('B', i)
   ...:     char = byte.decode('iso8859-1')
   ...:     print(i, repr(char), char)
   ...:     
0 '\x00' 
1 '\x01'
2 '\x02'
3 '\x03'
4 '\x04'
5 '\x05'
6 '\x06'
7 '\x07'
8 '\x08' 
9 '\t' 	
10 '\n' 

11 '\x0b' 

12 '\x0c' 

13 '\r' 
14 '\x0e' 
15 '\x0f' 
16 '\x10'
17 '\x11'
18 '\x12'
19 '\x13'
20 '\x14'
21 '\x15'
22 '\x16'
23 '\x17'
24 '\x18'
25 '\x19'
26 '\x1a'
27 '\x1b'
28 '\x1c'
29 '\x1d'
30 '\x1e'
31 '\x1f'
32 ' '
33 '!' !
34 '"' "
35 '#' #
36 '$' $
37 '%' %
38 '&' &
39 "'" '
40 '(' (
41 ')' )
42 '*' *
43 '+' +
44 ',' ,
45 '-' -
46 '.' .
47 '/' /
48 '0' 0
49 '1' 1
50 '2' 2
51 '3' 3
52 '4' 4
53 '5' 5
54 '6' 6
55 '7' 7
56 '8' 8
57 '9' 9
58 ':' :
59 ';' ;
60 '<' <
61 '=' =
62 '>' >
63 '?' ?
64 '@' @
65 'A' A
66 'B' B
67 'C' C
68 'D' D
69 'E' E
70 'F' F
71 'G' G
72 'H' H
73 'I' I
74 'J' J
75 'K' K
76 'L' L
77 'M' M
78 'N' N
79 'O' O
80 'P' P
81 'Q' Q
82 'R' R
83 'S' S
84 'T' T
85 'U' U
86 'V' V
87 'W' W
88 'X' X
89 'Y' Y
90 'Z' Z
91 '[' [
92 '\\' \
93 ']' ]
94 '^' ^
95 '_' _
96 '`' `
97 'a' a
98 'b' b
99 'c' c
100 'd' d
101 'e' e
102 'f' f
103 'g' g
104 'h' h
105 'i' i
106 'j' j
107 'k' k
108 'l' l
109 'm' m
110 'n' n
111 'o' o
112 'p' p
113 'q' q
114 'r' r
115 's' s
116 't' t
117 'u' u
118 'v' v
119 'w' w
120 'x' x
121 'y' y
122 'z' z
123 '{' {
124 '|' |
125 '}' }
126 '~' ~
127 '\x7f'
128 '\x80'
129 '\x81'
130 '\x82'
131 '\x83'
132 '\x84'
133 '\x85'
134 '\x86'
135 '\x87'
136 '\x88'
137 '\x89'
138 '\x8a'
139 '\x8b'
140 '\x8c'
141 '\x8d'
142 '\x8e'
143 '\x8f'
144 '\x90'
145 '\x91'
146 '\x92'
147 '\x93'
148 '\x94'
149 '\x95'
150 '\x96'
151 '\x97'
152 '\x98'
153 '\x99'
154 '\x9a'
155 '\x9b'
156 '\x9c'
157 '\x9d'
158 '\x9e'
159 '\x9f'
160 '\xa0'
161 '¡' ¡
162 '¢' ¢
163 '£' £
164 '¤' ¤
165 '¥' ¥
166 '¦' ¦
167 '§' §
168 '¨' ¨
169 '©' ©
170 'ª' ª
171 '«' «
172 '¬' ¬
173 '\xad' ­
174 '®' ®
175 '¯' ¯
176 '°' °
177 '±' ±
178 '²' ²
179 '³' ³
180 '´' ´
181 'µ' µ
182 '¶' ¶
183 '·' ·
184 '¸' ¸
185 '¹' ¹
186 'º' º
187 '»' »
188 '¼' ¼
189 '½' ½
190 '¾' ¾
191 '¿' ¿
192 'À' À
193 'Á' Á
194 'Â' Â
195 'Ã' Ã
196 'Ä' Ä
197 'Å' Å
198 'Æ' Æ
199 'Ç' Ç
200 'È' È
201 'É' É
202 'Ê' Ê
203 'Ë' Ë
204 'Ì' Ì
205 'Í' Í
206 'Î' Î
207 'Ï' Ï
208 'Ð' Ð
209 'Ñ' Ñ
210 'Ò' Ò
211 'Ó' Ó
212 'Ô' Ô
213 'Õ' Õ
214 'Ö' Ö
215 '×' ×
216 'Ø' Ø
217 'Ù' Ù
218 'Ú' Ú
219 'Û' Û
220 'Ü' Ü
221 'Ý' Ý
222 'Þ' Þ
223 'ß' ß
224 'à' à
225 'á' á
226 'â' â
227 'ã' ã
228 'ä' ä
229 'å' å
230 'æ' æ
231 'ç' ç
232 'è' è
233 'é' é
234 'ê' ê
235 'ë' ë
236 'ì' ì
237 'í' í
238 'î' î
239 'ï' ï
240 'ð' ð
241 'ñ' ñ
242 'ò' ò
243 'ó' ó
244 'ô' ô
245 'õ' õ
246 'ö' ö
247 '÷' ÷
248 'ø' ø
249 'ù' ù
250 'ú' ú
251 'û' û
252 'ü' ü
253 'ý' ý
254 'þ' þ
255 'ÿ' ÿ

@flavioamieiro (Member):
@geron that makes perfect sense to me. As I said, I didn't doubt it, I just didn't understand it before. Now I am a little ashamed of not realizing that.

call('utf-8'),
call('iso8859-1')])


@flavioamieiro (Member):
Thanks! We really needed this!

u", ".join(unicode(item) for item in diff_set)))
self.assertEqual(refreshed_document['mimetype'], 'application/pdf')
", ".join(str(item) for item in diff_set)))
self.assertEqual(result['mimetype'], 'application/pdf')

@flavioamieiro (Member):
For some reason, when running this test on my machine, the date format is different. It might be a version of pdfinfo or something like that. For the test to pass on my machine I needed the following patch:

diff --git a/tests/test_worker_extractor.py b/tests/test_worker_extractor.py
index 4c94349..9f4d2e0 100644
--- a/tests/test_worker_extractor.py
+++ b/tests/test_worker_extractor.py
@@ -123,7 +123,7 @@ class TestExtractorWorker(TestCase):
                 'Author':         'Álvaro Justen',
                 'Creator':        'Writer',
                 'Producer':       'LibreOffice 3.5',
-                'CreationDate':   'Fri Jun  1 17:07:57 2012',
+                'CreationDate':   'Fri Jun  1 17:07:57 2012 BRT',
                 'Tagged':         'no',
                 'Pages':          '1',
                 'Encrypted':      'no',

This probably does not happen on your system, so we might need to check whether the problem really is with software versions, and make sure we are always using the same versions everywhere.
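One possible way around it, just a sketch of an alternative and not something this PR does (the metadata name is illustrative), would be to make the assertion tolerant of the trailing timezone that some pdfinfo versions append:

# Accept an optional timezone suffix such as 'BRT' after the known
# date instead of requiring an exact match.
self.assertTrue(
    metadata['CreationDate'].startswith('Fri Jun  1 17:07:57 2012'))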

result = Extractor().process(data)
self.assertEqual(result['text'], expected)
self.assertEqual(result['file_metadata'], {})
self.assertEqual(result['language'], 'en')
@flavioamieiro (Member):

This is the test that is still not passing because pycld2 is blowing up with the weird byte, right?

I don't know exactly what to do about this. I don't think any of the options is very good. From what I remember, you gave me three options (of course we can think of new ones):

  1. Try to use the old cld version
  2. Not detect language for these documents
  3. Try to remove the weird byte somehow

1 sounds terrible because it is old, unmaintained code that does not support Python 3. It's my last option!

2 has one big problem: if we don't detect the language, we won't run palavras for the documents that are in Portuguese. This means a lot of the important analysis won't happen.

3 sounds a bit more interesting if we can do it the right way (a sketch follows below). I'm worried about two things: we should only try to do it when we detect the weird byte in the document, and it also should be reasonably fast. This option has another limitation, though: if we find different, new weird bytes, we have to change our code to check for them as well.
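To sketch what option 3 could look like (all names here are illustrative assumptions, not this branch's code): strip the C1 control characters, the range where \x96 lands after an iso8859-1 decode, and only when one is actually present:

import re

import pycld2

# U+0080-U+009F is the C1 control range; an iso8859-1 decode maps the
# troublesome \x96 byte straight into it.
C1_CONTROLS = re.compile('[\x80-\x9f]')

def detect_language(text):
    # Only rewrite the text when a weird character is actually there,
    # so the common case stays fast.
    if C1_CONTROLS.search(text):
        text = C1_CONTROLS.sub('', text)
    is_reliable, _, details = pycld2.detect(text)
    # details looks like (('ENGLISH', 'en', 95, 1024.0), ...)
    return details[0][1] if is_reliable else None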

@fccoelho Any ideas on how to deal with this?

@fccoelho (Member) commented Dec 1, 2016

Minor comment: Mongodict is still in the requirements...

@geron (Author) commented Dec 1, 2016

@fccoelho The purpose of this branch is only to port to Python 3; migrating to another database will come in steps, in another branch, after this one is merged.

@flavioamieiro (Member)

Mongodict hasn't been used for about a year, so if it is in the requirements file, it should go. I didn't find it there, though.

@fccoelho (Member) commented Dec 1, 2016 via email

@fccoelho (Member) commented Dec 8, 2016

In the bigram implementation, the order of the bigrams does not matter, since they only represent the unique bigrams found in a document.
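A tiny sketch of that representation (illustrative names, not the project's implementation): membership, not order, is what matters, so a set fits naturally:

def unique_bigrams(tokens):
    # zip pairs each token with its successor; the set keeps only the
    # distinct pairs, discarding order and duplicates.
    return {(first, second) for first, second in zip(tokens, tokens[1:])}

unique_bigrams(['o', 'rato', 'roeu', 'a', 'roupa'])
# {('o', 'rato'), ('rato', 'roeu'), ('roeu', 'a'), ('a', 'roupa')}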
