<!DOCTYPE html>
<html>
<head>
<title>Project Naptha</title>
<meta charset="utf-8">
<link rel="chrome-webstore-item" href="https://chrome.google.com/webstore/detail/molncoemjfmpgdkbdlbjmhlcgniigdnf">
<style>
img.nomargin {
margin: 0;
box-shadow: none;
}
html {
/*background-color: #e6e9e9;*/
background-color: #F5F0E8;
}
body {
font-family: Lato, "Avenir", Helvetica, Arial, sans-serif;
font-size: 16pt;
font-weight: 300;
margin: 0;
min-width: 1000px;
line-height: 1.5em;
color: rgba(92, 92, 92, 1);
-webkit-font-smoothing: antialiased;
}
.mainimg {
box-shadow: 0 0 40px rgba(0, 0, 0, 0.51);
background: white;
float: right;
margin: 50px;
margin-right: 70px;
}
.content {
width: 950px;
margin-left: auto;
margin-right: auto;
background-color: #ffffff;
-webkit-box-shadow: 0 0 2px rgba(0, 0, 0, 0.06);
-moz-box-shadow: 0 0 2px rgba(0, 0, 0, 0.06);
box-shadow: 0 0 2px rgba(0, 0, 0, 0.06);
padding: 1em 2em 1em;
margin-bottom: 40px;
}
.header {
width: 950px;
margin-left: auto;
margin-right: auto;
margin-top: 5px;
margin-bottom: 15px;
}
h1, h2, h3, h4, h5, h6 {
color: #222;
font-weight: 600;
line-height: 1.3em;
}
h2 {
margin-top: 1.3em;
}
hr {
border: 0;
border-top: 1px solid rgba(0, 0, 0, 0.2);
}
.jumbo {
width: 100%;
background-color: rgb(199, 95, 0);
color: white;
box-shadow: 0 0 50px rgba(0,0,0,0.3) inset;
}
.jumbo.blue {
background-color: rgb(0, 127, 199);
}
.jumbo .internal {
/*width: 400px;*/
padding: 80px;
color: rgb(247, 247, 247);
padding-top: 30px;
padding-bottom: 30px;
font-size: 30pt;
line-height: 1.3em;
}
.jumbo input, .jumbo button {
border-radius: 5px;
-webkit-appearance: none;
font-size: 20px;
padding: 10px 16px;
border: none;
}
.jumbo input[type=submit], .jumbo button {
background-color: rgb(255, 204, 0);
}
.jumbo button {
font-size: 40px;
}
a {
text-decoration: none
}
</style>
<link rel="stylesheet" href="naptha.css">
<link rel="shortcut icon" href="favicon.ico">
<link href='https://fonts.googleapis.com/css?family=Lato:300,400,700' rel='stylesheet' type='text/css'>
</head>
<body ocr="custom">
<div style="overflow: auto" class="header">
<h1 style="margin-bottom: 0; float: left">Project <span style="color: rgb(255, 133, 0)">Naptha</span> </h1>
<h3 style="color: gray; float: right; font-weight: normal; margin-top: 1.7em">highlight, copy, and translate text from <u>any</u> image.</h3>
</div>
<div style="overflow:auto">
<div class="jumbo">
<div class="mainimg" style="margin-top: 70px">
<img src="img/tyger.jpg" class="main" style="width: 420px; padding: 20px;" id="tyger">
</div>
<div class="internal" style="padding-bottom: 70px">
<p>
<b>Project Naptha</b> automatically applies state-of-the-art computer vision algorithms on <b>every image</b> you see while browsing the web. The result is a <b>seamless and intuitive</b> experience, where you can <b>highlight</b> as well as <b>copy and paste</b> and even <b>edit and translate</b> the text formerly trapped within an image.
</p>
<div id="install-button" style="display: none">
<button onclick="window.location='https://chrome.google.com/webstore/detail/molncoemjfmpgdkbdlbjmhlcgniigdnf'">Add to <b>Chrome</b></button>
<a href="#" onclick="switch_consolation();return false" style="font-size: 16px; color: white; font-weight: normal">Other browser?</a>
</div>
<form method="POST" action="https://sky-lighter.appspot.com/form/email" id="consolation-prize">
<div style="font-size: 20px; line-height: 1em">Unfortunately, your browser is <b>not yet supported</b>: currently only Google Chrome is. Feel free to play around with this page anyway; it shows off most of the features and works in most modern browsers. Type in your email below to sign up for updates on this project. Depending on the number of sign-ups, a Firefox version may be released in a few weeks. If you're interested in Naptha for other browsers, <a href="mailto:[email protected]">email me</a>. </div>
<input type="email" name="email" style="width: 300px" placeholder="Enter email address">
<input type="submit" value="Sign me up">
</form>
<script>
if(window.chrome){
document.getElementById('install-button').style.display = ''
document.getElementById("consolation-prize").style.display = 'none'
}
function switch_consolation(){
document.getElementById('install-button').style.display = 'none'
document.getElementById("consolation-prize").style.display = ''
}
</script>
</div>
</div>
<div class="content">
<p>
<b>Words on the web exist in two forms</b>: there’s the text of articles, emails, tweets, chats and blogs— which can be copied, searched, translated, edited and selected— and then there’s the text which is shackled to images, found in comics, document scans, photographs, posters, charts, diagrams, screenshots and memes.
Interaction with this second type of text <b>has always been a second class experience</b>: the only way to search or copy a sentence from an image has been to do as the ancient monks did, manually transcribing regions of interest.
</p>
<p>
<b>This entire webpage is a live demo</b>. You can watch as moving your cursor over a block of words changes it into the little I-beam. You can drag over a few lines and watch as a semitransparent blue box highlights the text, helping you keep track of where you are and what you’re reading. Hit <b>Ctrl+C to copy the text</b>, then paste it into a search bar, a Word document, an email or a chat window. Right-click and you can <b>erase</b> the words from an image, <b>edit</b> the words, or even <b>translate them into a different language</b>.
</p>
<p>
This was made by <a href="http://twitter.com/antimatter15">@antimatter15</a> (<a href="http://google.com/+KevinKwok">+KevinKwok</a> on Google+), and <a href="http://omrelli.ug/">Guillermo Webster</a>.
</p>
</div>
</div>
<div class="header">
<h1 style="margin-top: 0">Animated GIF <span style="color: rgb(255, 133, 0)">tl;dr</span></h1>
</div>
<div class="content">
<!-- <iframe width="960" height="500" src="//www.youtube.com/v/scIflsrf92c?playlist=scIflsrf92c&autoplay=1&controls=0&loop=1" frameborder="0" allowfullscreen></iframe> -->
<center>
<img src="img/bradbury-paste.gif" style="width: 300px">
<img src="img/fast-translate.gif" style="width: 300px">
<img src="img/super-serenity.gif" style="width: 300px">
</center>
<p>If you stare at these three animated gifs long and hard enough, you might not need to read anything.</p>
</div>
<div style="overflow:auto">
<div class="jumbo">
<div class="mainimg">
<img src="img/hilighting.png" class="main" style="width: 420px; padding: 10px;">
</div>
<div class="internal">
Example: <b>Comics</b>
</div>
</div>
<div class="content">
<p>
Early in <b>October 2013</b>, coincidentally less than a week before I developed the first prototype of this extension, <a href="http://xkcd.com/1271/">xkcd</a> published a comic (shown on the right) which somewhat ironically depicts the impetus for the extension.
</p>
<p>
The comic decries websites which arbitrarily hinder users from absentmindedly selecting random blocks of text— but the irony is that <b>xkcd should count himself among the long list of offenders</b> because up until now, it simply wasn't possible to select text inside a comic.
</p>
<p>
Interestingly, the language-agnostic nature of Project Naptha's underlying SWT algorithm (see the technical details by scrolling down a bit more) means that it detects the little squiggles as text as well. Depending on how you look at it, this can be seen as a <a href="http://4.bp.blogspot.com/-PWWww4RfHr4/UfFs7NuyUaI/AAAAAAAABZQ/qQaIP3kDpAs/s1600/bug_feature.jpg">bug, or a feature</a>.
</p>
<p>
Also, handwriting recognition is particularly difficult. The main issue is character segmentation: it's quite hard to separate letters which are smushed so close together as to be connected. As a result, if you try to copy and paste text from a comic, it ends up jumbled. This might improve in the future, because certain parts of the Naptha stack do lag behind the present state-of-the-art by a few years.
</p>
</div>
</div>
<div style="overflow:auto">
<div class="jumbo">
<div class="mainimg">
<img src="img/irate.png" class="main" style="width: 700px">
</div>
<div class="internal">
Example: <b>Scans</b>
</div>
</div>
<div class="content">
<p>
It usually takes some <b>special software to convert a scan into a PDF</b> document that you can highlight and copy from, and this extra step means that a lot of the time, you aren't dealing with a nicely formatted and processed PDF, but a raw scan distributed as a TIFF or JPEG.
</p>
<p>
Usually, that just meant suffering through the document, or in the worst case, printing it out so that I could scribble along with a pen while I read. But with this extension, it's possible to just select text from a picture, whether it's attached to an email or linked from a class action lawsuit overview.
</p>
<p>
It's even possible for files you have locally on your computer. Simply drag the image file over to your browser window. Note that you might have to go to <tt>chrome://extensions</tt> and check the "Allow access to file URLs" checkbox.
</p>
</div>
</div>
<div style="overflow:auto">
<div class="jumbo">
<img src="img/nobody.jpg" class="mainimg" style="width: 500px">
<div class="internal">
Example: <b>Photos</b>
</div>
</div>
<div class="content">
<p>
The algorithm used by Project Naptha (Stroke Width Transform) was actually designed for detecting text in natural scenes and photographs (a more technically challenging and general problem than most regular images).
</p>
<p>
Naptha <b>also supports rotated text</b>, which took a really long time to implement (though it is still absolutely hopeless if the text is rotated by more than 30 degrees or so; sorry vertical text, I'll figure you out later!).
</p>
<p>
But with these types of images, the actual text recognition becomes somewhat of a crapshoot. While it's quite possible that the <b>quality could improve in future versions</b>, with better trained models and algorithms, and the inclusion of human-aided transcription services, you should probably calibrate your expectations fairly low to avoid disappointment.
<!-- It might be my unique situation as its creator, but I feel slightly giddy, whenever I find out that it works with one of these types of images. -->
</p>
</div>
</div>
<div style="overflow:auto">
<div class="jumbo">
<img src="img/thiel.png" class="mainimg" style="width: 400px; padding: 10px">
<div class="internal">
Example: <b>Diagrams</b>
</div>
</div>
<div class="content">
<p>
Diagrams are cool. There are charts and diagrams all over the web, and sometimes you'll want to look up one of the chart axes, and it's pretty convenient to be able to do that without needing to type it up again. Maybe there's a circuit diagram and you want to check out where a certain component can be bought— just highlight its label and copy and paste it into the search bar.
</p>
<p>
This particular diagram was found on <a href="http://blakemasters.com/post/23435743973/peter-thiels-cs183-startup-class-13-notes-essay">Blake Masters' course notes</a> for Peter Thiel's Stanford class. I haven't actually read it, but it was on Hacker News so it just happened to be one of the tabs that I currently have open.
</p>
</div>
</div>
<div style="overflow:auto">
<div class="jumbo">
<img src="img/equis.jpg" class="mainimg">
<div class="internal">
Example: <b>Internet Memes</b>
</div>
</div>
<div class="content">
<p>
The truth is that I've spent way too much time on reddit and 4chan in search of test images for the text detection and layout analysis algorithms. Time really does go by when you can rationalize procrastination as something "productive". The result is that my test corpus is something on the order of 50% internet meme (In particular, I'm a fan of Doge, in part because Comic Sans is interpreted remarkably well by the built-in Ocrad text recognizer).
</p>
<p>
It's actually a bit difficult to recognize the text of the standard-template internet meme (mad props to CaptionBot, bro). Bold Impact font is actually notoriously hard to recognize with general-purpose text recognizers because a lot of what distinguishes letters isn't the overall shape, but rather the subtle rounding of corners (compare D, 0, O) or relatively short protrusions (the stubby little tail for L that differentiates it from an I).
</p>
<p>
I started building a text recognizer algorithm specifically designed for Impact font, and it was actually working pretty well, but I kind of misplaced the code somewhere. So, until I find it or replace it, you'll have to use Tesseract configured with the "Internet Meme" language.
</p>
</div>
</div>
<div style="overflow:auto">
<div class="jumbo">
<img src="img/protobowl.png" class="mainimg" style="width: 750px">
<div class="internal">
Example: <b>Screenshots</b>
</div>
</div>
<div class="content">
<p>
Screenshots are a nice way to save things in a state that you can recall later in a more or less complete form— the only caveat being the fact that you would have to re-type the text later if you find a need for it. On the other hand, copying and saving just the text of something ends up losing the spatial context of its origin.
</p>
<p>
Project Naptha kind of <b>transforms static screenshots into something more akin to an interactive snapshot</b> of the computer as it was when the screen was captured. While clicking on buttons won't submit forms or upload documents, the cursor changes when hovering over different parts, and blocks of text become selectable, just like they were before frozen in carbonite.
</p>
<p>
It's not a perfect substitute: the text recognition screws up every once in a while, so the reconstruction isn't reliably perfect. Still, it has a rather significant and profound effect.
</p>
</div>
</div>
<div style="overflow:auto">
<div class="jumbo blue">
<img src="img/russia.jpg" class="mainimg" style="width: 700px">
<div class="internal">
Sneak Peek: <b>Translation</b>
</div>
</div>
<div class="content">
<p>
There's always been the dream of a universal translation machine— something like the Babel fish out of Hitchhiker's Guide that will allow anyone to magically communicate with anyone else and to fully appreciate the art and culture of any society (Vogon poetry notwithstanding).
</p>
<p>
I'm here to say that it's still a ways off, but at least I have enough to do a pretty impressive demo.
</p>
<p>
<b>Try it out:</b> Highlight some region of text on that image. Right click on it and navigate to the "Translate" menu. Select whatever language you want.
</p>
</div>
</div>
<div style="overflow:auto">
<div class="jumbo blue">
<img src="img/chelsea.jpg" class="mainimg">
<div class="internal">
Sneak Peek: <b>Erase Text</b>
</div>
</div>
<div class="content">
<p>
This is actually the first step in translating an image: to erase the text from the image so new words can be put on top of it. This is done by something called "Inpainting" and these types of algorithms are most famously deployed as Adobe Photoshop's "Content-Aware Fill" feature.
</p>
<p>
It extrapolates solid colors from the regions surrounding the text, and propagates the colors inwards until the entire area is covered. From a distance, it usually does a pretty good job, but it's hardly a substitute for a true original.
</p>
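<p>
The inward propagation described above can be sketched in a few lines. This is a deliberately simplified stand-in and not the Fast Marching Method that Naptha actually uses: it works on a grayscale grid, and every name in it (<tt>fillInward</tt>, <tt>pixels</tt>, <tt>mask</tt>) is made up for illustration.
</p>

```javascript
// Simplified illustration only: fills masked pixels layer by layer,
// each one taking the average of its already-known 4-neighbors.
// `pixels` is a 2D array of grayscale values, `mask` a 2D array of
// booleans that are true wherever text was detected (hypothetical inputs).
function fillInward(pixels, mask) {
  const h = pixels.length, w = pixels[0].length;
  const out = pixels.map(row => row.slice());
  const unknown = mask.map(row => row.slice());
  const neighbors = [[0, 1], [1, 0], [0, -1], [-1, 0]];
  let remaining = unknown.flat().filter(Boolean).length;

  while (remaining > 0) {
    // Collect this layer's updates first, so a layer only reads pixels
    // that were known before the layer started.
    const updates = [];
    for (let y = 0; y < h; y++) {
      for (let x = 0; x < w; x++) {
        if (!unknown[y][x]) continue;
        let sum = 0, n = 0;
        for (const [dy, dx] of neighbors) {
          const ny = y + dy, nx = x + dx;
          if (ny < 0 || ny >= h || nx < 0 || nx >= w || unknown[ny][nx]) continue;
          sum += out[ny][nx];
          n++;
        }
        if (n > 0) updates.push([y, x, sum / n]);
      }
    }
    if (updates.length === 0) break; // the whole image is masked, nothing to copy from
    for (const [y, x, value] of updates) {
      out[y][x] = value;
      unknown[y][x] = false;
    }
    remaining -= updates.length;
  }
  return out;
}
```

<p>
A real inpainting algorithm orders the fill by distance from the boundary and weights neighbors by direction, which is why its results look far less smeared than this sketch would.
</p>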
<p>
<b>Try it out:</b> Highlight the text over the cat's face. Right click on the selection squig and click "Erase Text", which can be found under the "Translate" menu.
</p>
</div>
</div>
<div style="overflow:auto">
<div class="jumbo blue">
<img src="img/skid.jpg" class="mainimg">
<div class="internal">
Sneak Peek: <b>Change Text</b>
</div>
</div>
<div class="content">
<p>
With the same trick that Translation uses— it's possible to substitute in your own text. This will probably work better in the future, once there's some actual font detecting logic besides <tt style="font-size: 12pt">if uppercase and super bold, then Impact font, if uppercase otherwise then XKCD font, and for everything else, Helvetica Neue</tt>.
</p>
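<p>
Written out as code, that rule is about as simple as it sounds. This is a sketch of the rule as stated above rather than the extension's real source; the function name, the <tt>isSuperBold</tt> flag and the returned font labels are placeholders.
</p>

```javascript
// Sketch of the font-guessing rule quoted above (names are hypothetical).
// `isSuperBold` is assumed to come from some earlier stroke-weight estimate.
function guessFont(text, isSuperBold) {
  // Uppercase means: contains letters, and none of them are lowercase.
  const isUppercase = text === text.toUpperCase() && text !== text.toLowerCase();
  if (isUppercase && isSuperBold) return "Impact";
  if (isUppercase) return "XKCD"; // placeholder label for the comic-style font
  return "Helvetica Neue";
}
```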
<p>
I don't know where else to mention this, because it's one of those little things that simultaneously applies to everything and nothing at once— but it's also possible to select multiple regions by holding the shift key. I spent way too long writing the algorithms to merge multiple selection regions when appropriate.
</p>
<p>
<b>Try it out:</b> Highlight some meme text. Right click on the selection squig and click "Reprint Text", which can be found under the "Translate" menu. After that, select the text on one region that you'd like to edit and click "Modify Text" which should appear in the context menu.
</p>
</div>
</div>
<div class="header">
<h1 style="margin-top: 0">Chrono<span style="color: rgb(255, 133, 0)">logy</span></h1>
</div>
<img src="img/panoram.jpg" style="margin-right: auto; margin-left: auto; display: block; width: 1035px">
<div class="content">
<p>
During May 2012, I was reading about seam carving, an interesting and almost magical algorithm which could rescale images without apparently squishing them. After playing with the little seams that the seam carver tended to generate, I noticed that they tended to arrange themselves in a way that cut through the spaces in between letters (dynamic programming approaches are actually fairly common when it comes to letter segmentation, but I didn't know that). It was then, while reading a particularly verbose <a href="http://www.smbc-comics.com/?id=2614">smbc</a> comic, that I thought it should be possible to come up with something which would read images (with &lt;canvas&gt;), figure out where lines and letters were, and draw little selection overlays to assuage a pervasive text-selection habit.
</p>
<p>
My first attempt was simple. It projected the image onto the side, forming a vertical pixel histogram. The significant valleys of the resulting histograms served as a signature for the ends of text lines. Once horizontal lines were found, it cropped each line, and repeated the histogram process, but vertically this time, in order to determine the letter positions. It only worked for strictly horizontal machine printed text, because otherwise the projection histograms would end up too noisy. For one reason or another, I decided that the problem either wasn't worth tackling, or that I wasn't ready to.
</p>
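<p>
For the curious, that first attempt can be reconstructed roughly like this. It is a sketch from memory of the idea, not the original code: it assumes the image has already been reduced to a grid of booleans (<tt>true</tt> for ink), it treats any row or column with no ink at all as a valley, and all the function names are invented here.
</p>

```javascript
// Count ink pixels in each row: the "vertical pixel histogram".
function rowProfile(isInk) {
  return isInk.map(row => row.reduce((n, on) => n + (on ? 1 : 0), 0));
}

// Count ink pixels in each column, restricted to rows top..bottom.
function colProfile(isInk, top, bottom) {
  const profile = new Array(isInk[0].length).fill(0);
  for (let y = top; y <= bottom; y++) {
    for (let x = 0; x < profile.length; x++) {
      if (isInk[y][x]) profile[x]++;
    }
  }
  return profile;
}

// Split a profile into runs separated by valleys (counts <= minInk).
function findRuns(profile, minInk = 0) {
  const runs = [];
  let start = -1;
  profile.forEach((count, i) => {
    if (count > minInk && start < 0) start = i;
    if (count <= minInk && start >= 0) {
      runs.push({ start, end: i - 1 });
      start = -1;
    }
  });
  if (start >= 0) runs.push({ start, end: profile.length - 1 });
  return runs;
}

// Find text lines first, then letters within each line.
function segment(isInk) {
  return findRuns(rowProfile(isInk)).map(line => ({
    line,
    letters: findRuns(colProfile(isInk, line.start, line.end)),
  }));
}
```

<p>
It's easy to see from this why rotated or hand-drawn text breaks it: as soon as lines overlap vertically, the valleys in the row profile disappear.
</p>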
<p>
Fast forward a year and a half, I'm a freshman at MIT during my second month of school. There's a hackathon that I think I might have signed up for months in advance, marketed as MIT's biggest. I slept late the night before for absolutely no particular reason, and woke up at 7am because I wanted to make sure that my registration went through. I walked into the unfrozen ice rink, where over 1,000 people were claiming tables and splaying laptop cables on the ground— so this is what my first ever hackathon is going to look like.
</p>
<p>
Everyone else was "plugged in" or something; big headphones, staring intently at dozens of windows of Sublime Text. Fair enough, it was pretty loud. I had no idea what I would have ended up doing, and I wasn't able to meet anyone else who was both willing to collaborate and had an idea interesting enough for me to want to. So I decided to walk back to my dorm and take a nap.
</p>
<p>
I woke up from that nap feeling only slightly <i>more</i> tired, and nowhere closer to figuring out what I was going to do. I decided to make my way back to the hackathon, because there's free food there or something.
</p>
<!-- TODO: finish this
what happened?
still no idea wat to do
looking though old abandoned projects
tried several different ones
decided to continue on whatever i had the most progress on
worked through the night
fell asleep on table
forgot to sign up to get judged
guillermo to da rescue
omg wow its really cinder block and echoey behind an auditorium
technical malfunction during demo
woot 2nd place and false promises of release in 2 weeks
winter break
iap
february
march
april
hey, look! we're at present now!
-->
<!-- I’m currently a freshman at MIT, and I’ve been working on this project for essentially the past 6 months. The first version was prototyped over a 20 hour period during HackMIT (October 5-6 2013), MIT’s largest annual hackathon, and went on to win 2nd place and a $2000 prize. I scrapped all the code and started from scratch some time around winter break, and its present form is mostly based on that.
-->
</div>
<div class="header">
<h1 id="privacy">Security <span style="color: rgb(255, 133, 0)">&amp;</span> Privacy</h1>
</div>
<div class="content">
<p>
If you paid attention to the permissions requested in the installation dialog, you might have wondered why exactly this extension requires such sweeping access to your information. Project Naptha operates at a very low level; ideally, it's the kind of functionality that gets built into browsers and operating systems natively. In order to allow you to highlight and interact with images <i>everywhere</i>, it needs the ability to read images located everywhere.
</p>
<p>
One of the more impressive things about this project is the fact that it's almost entirely written in client-side JavaScript. That means that it's pretty much totally functional without access to a remote server. That does come with a bit of a caveat: online translation running offline is an oxymoron, and lacking access to a cached OCR service running in the cloud means reduced performance and lower transcription accuracy.
</p>
<p>
So there is a trade-off that has to happen between privacy and user experience. And I think the default settings strike a delicate balance between having all the functionality made available and respecting user privacy. I've heard complaints on both sides (roughly equal in quantity, actually, which is kind of intriguing): lots of people want high quality transcription to be the default, and others want no server communication whatsoever as the default.
<!--On the extreme end, sacrificing all user privacy (note something that Naptha very much <i>doesn't</i> do) it would be possible to do all of this with very little computational power on your laptop, tablet or phone with high quality recognition and . -->
</p>
<p>
By default, when you begin selecting text, it sends a secure HTTPS request containing the URL of the specific image and literally nothing else (no user tokens, no website information, no cookies or analytics) and the requests are not logged. The server responds with a list of existing translations and OCR languages that have been done. This allows you to recognize text from an image with much more accuracy than otherwise possible. However, this can be disabled simply by checking the "Disable Lookup" item under the Options menu.
</p>
<p>
The translation feature is currently in limited rollout, due to scalability issues. The online OCR service also has per-user metering, and so such requests include a unique identifier token. However, the token is entirely anonymous and is not linked with any personally identifiable information (it is handled entirely separately from the lookup requests).
</p>
<!-- todo, talk about the new architecture
Prelookup Requests
- sends a sha256 hash of the URL of the image
- hashing means that it is difficult to determine what specific image was being looked up if it is not already in the database
- server responds with Y or N, indicating if client should continue to full lookup request
- this is intended to be high volume traffic, and the vast majority of the time it will respond with N (should not continue)
- the request lacks any information whatsoever about the user and has no ability to connect requests to particular users
Lookup Requests
- send full url to the server
- server returns list of translations and chunks
- lacks identifying information
Chunk Read Request
- given a chunk id, read the contents of a chunk
- lacks identifying information
Recognize/Upload
- includes a user token for rate control
Translate
- includes user token for access control and metering
-->
</div>
<div class="header">
<h1 >About this <span style="color: rgb(255, 133, 0)">Demo</span>...</h1>
</div>
<div class="content">
<p>
So actually, the thing that is running on this page isn't the fully fledged Project Naptha. It's essentially just the front-end, so it lacks all of the computational heavy lifting that actually makes it cool. All the text metrics and layout analyses were precomputed. Before you raise your pitchforks, there's actually a good reason this demo page runs what amounts to a <a href="http://spongebob.wikia.com/wiki/Weenie_Hut_Franchise">Weenie Hut Jr.</a> version of the script.
</p>
<p>
The computationally expensive backend uses WebWorkers extensively. Although they have fairly good support in modern browsers, there are subtle differences between platforms. Safari has some weird behavior when it comes to sending around ImageData instances, and transferable typed arrays are slightly different in Firefox and Chrome. Most importantly though, the current stable version (34) of Google Chrome, at time of writing, actually suffers from a <a href="https://code.google.com/p/chromium/issues/detail?id=361792">debilitatingly broken WebWorkers</a> implementation. Rather fortunately, Chrome extensions don't seem to suffer from the same problem.
</p>
</div>
<div class="header">
<h1 >How it <span style="color: rgb(255, 133, 0)">Works</span></h1>
</div>
<div class="content">
<p>
The dichotomy between words expressed as text and those trapped within images is so firmly engrained into the browsing experience, that you might not even recognize it as counter-intuitive. For a technical crowd, the limitation is natural, lying in the fact that images are fundamentally “raster” entities, devoid of the semantic information necessary to indicate which regions should be selectable and what text is contained.
</p>
<img src="img/swt-letters.png" style="width: 800px">
<p>
Computer vision is an active field of research essentially about teaching computers how to actually “see” things, recognizing letters, shapes and objects, rather than simply pushing copies of pixels around.
</p>
<img src="img/swt-lines.png" style="width: 800px">
<p>
In fact, optical character recognition (OCR) is nothing new. It has been used by libraries and law firms to digitize books and documents for at least 30 years. More recently, it has been combined with text detection algorithms to read words off photographs of street signs, house numbers and business cards.
</p>
<img src="img/swt-refined.png" style="width: 800px">
<p>
The primary feature of Project Naptha is actually the text detection, rather than optical character recognition. It runs an algorithm called the <a href="http://www.math.tau.ac.il/~turkel/imagepapers/text_detection.pdf">Stroke Width Transform</a>, invented by Microsoft Research in 2008, which is capable of identifying regions of text in a language-agnostic manner. In a sense that’s kind of like what a human can do: we can recognize that a sign bears written language without knowing what language it's written in, never mind what it means.
</p>
<p>
Detection is fast but not instant, and a delay of half a second is still quite noticeable: studies have shown that users not only discern, but feel readily annoyed by, delays as short as a hundred milliseconds. To get around that, Project Naptha continually watches cursor movements and extrapolates half a second into the future so that it can kick off the processing in advance, which makes it feel instantaneous.
</p>
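<p>
To give a flavor of the cursor trick, here is a minimal sketch that assumes plain linear extrapolation from recent mouse samples. The real predictor may well be smarter; the function name, the sample format and the 500&nbsp;ms default are assumptions made for this example.
</p>

```javascript
// `samples` is a short history of cursor positions, oldest first,
// each shaped like { x, y, t } with t in milliseconds (assumed format).
function predictCursor(samples, lookaheadMs = 500) {
  if (samples.length === 0) return null;
  const first = samples[0];
  const last = samples[samples.length - 1];
  const dt = last.t - first.t;
  // With a single sample (or no elapsed time) there is no velocity to use.
  if (dt <= 0) return { x: last.x, y: last.y };
  const vx = (last.x - first.x) / dt; // pixels per millisecond
  const vy = (last.y - first.y) / dt;
  return { x: last.x + vx * lookaheadMs, y: last.y + vy * lookaheadMs };
}
```

<p>
The predicted point can then be tested against the images on the page, and any image it lands on can start processing before the cursor actually arrives.
</p>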
<p>
In conjunction with other algorithms, like connected components analysis (identifying distinct letters), Otsu thresholding (determining word spacing), and disjoint set forests (identifying lines of text), Project Naptha can very quickly build a model of text regions, words, and letters, all while completely unaware of which specific letters they are.
</p>
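<p>
As an illustration of the word-spacing step, here is a small sketch of Otsu's method applied to the gaps between neighboring letters on one line. It assumes letters arrive as sorted boxes with <tt>x0</tt> and <tt>x1</tt> edges, and the function names are invented; the extension's actual data structures are not shown on this page.
</p>

```javascript
// Otsu's method on a plain list of numbers: choose the split point that
// maximizes the between-class variance of the two resulting groups.
function otsuThreshold(values) {
  const sorted = [...values].sort((a, b) => a - b);
  const total = sorted.reduce((sum, v) => sum + v, 0);
  let bestScore = -1, threshold = null, lowSum = 0;
  for (let i = 1; i < sorted.length; i++) {
    lowSum += sorted[i - 1];
    if (sorted[i] === sorted[i - 1]) continue; // cannot split equal values
    const nLow = i, nHigh = sorted.length - i;
    const diff = lowSum / nLow - (total - lowSum) / nHigh;
    const score = nLow * nHigh * diff * diff;
    if (score > bestScore) {
      bestScore = score;
      threshold = (sorted[i - 1] + sorted[i]) / 2;
    }
  }
  return threshold; // null when there is nothing to split
}

// letters: [{ x0, x1 }] sorted left to right (assumed shape).
function groupIntoWords(letters) {
  if (letters.length === 0) return [];
  const gaps = letters.slice(1).map((letter, i) => letter.x0 - letters[i].x1);
  const threshold = otsuThreshold(gaps);
  const words = [[letters[0]]];
  letters.slice(1).forEach((letter, i) => {
    // A gap wider than the threshold starts a new word.
    if (threshold !== null && gaps[i] > threshold) words.push([]);
    words[words.length - 1].push(letter);
  });
  return words;
}
```

<p>
One caveat of this sketch: Otsu's method always finds <i>some</i> split, so a line containing a single word would still get cut in two unless an extra sanity check is added.
</p>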
<p>
Once a user begins to select some text, however, it scrambles to run character recognition algorithms in order to determine what exactly is being selected. This recognition process happens on a per-region basis, so starting it before the user has finished making the final selection wastes no effort.
</p>
<p>
The recognition process involves blowing up the region of interest so that each line is on the order of 100 pixels tall, which can be as large as a 5x magnification. It then applies an intelligent color masking filter before sending the result to a built-in <a href="https://github.com/antimatter15/ocrad.js">pure-JavaScript port of the open source Ocrad OCR engine</a>.
</p>
<p>
Because this process is relatively computationally expensive, it makes sense to do this type of “lazy” recognition, staving it off until the last possible moment. It can take as much as five to ten seconds to complete, depending on the size of the image and selection. So there’s a good chance that by the time you hit Ctrl+C and the text gets copied into your clipboard, the OCR engine still isn’t done processing the text.
</p>
<p>
That’s all okay though, because in place of the text which is still getting processed, it inserts a little flag describing where the selection is and which part of the image to read from. For the next 60 seconds, Naptha tracks that flag and substitutes it with the final, recognized text as soon as it can.
</p>
<p>
Sometimes, the built-in OCR engine isn’t good enough. It only supports languages with the Latin alphabet and a limited number of diacritics, and it doesn’t contain a language model, which would let it weigh each letter by its probability in context (for instance, the algorithm may decide that “he1|o” is a better match than “hello” because it only looks at the letter shape). So there’s the option of sending the selected region over to a cloud based text recognition service powered by Tesseract, Google’s (formerly HP’s) award-winning open-source OCR engine which supports dozens of languages, and uses an advanced language model.
</p>
<p>
If anyone triggers the Tesseract engine on a public image, the recognition result is saved, so that future users who stumble upon the same image will instantaneously load the cached version of the text.
</p>
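<p>
A minimal sketch of that caching behavior is below. It assumes results are keyed by the image URL and uses an in-memory <code>Map</code> as a stand-in for whatever store the real service uses; the function name and the <code>recognize</code> callback are hypothetical.
</p>
<pre>
```javascript
// Hypothetical sketch: the cache key and function names are assumptions.
function cachedRecognize(cache, imageUrl, recognize) {
  // Reuse a stored result when the same image has been recognized before.
  if (cache.has(imageUrl)) return cache.get(imageUrl);
  // Otherwise run the (expensive) recognizer once and remember its output.
  var text = recognize(imageUrl);
  cache.set(imageUrl, text);
  return text;
}
```
</pre>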
<p>
There is a class of algorithms for something called “inpainting”, which is about reconstructing pictures or videos in spite of missing pieces. This is widely used for film restoration, and is available in Adobe Photoshop as the “Content-Aware Fill” feature.
</p>
<p>
Project Naptha uses the regions detected as text as a mask for a particular inpainting algorithm developed in 2004 based on the <a href="https://github.com/antimatter15/inpaint.js">Fast Marching Method by Alexandru Telea</a>. This mask can be used to fill in the spots where the text is taken from, creating a blank slate on which new content can be printed.
</p>
<p>
With some rudimentary layout analysis and text metrics, Project Naptha can figure out the alignment parameters of the text (centered, justified, right or left aligned), the font size and font weight (bold, light or normal). With that information, it can reprint the text in a similar font, in the same place. Or, you can even change the text to say whatever you want it to say.
</p>
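<p>
One simple way to guess alignment from line positions is sketched below. This is not necessarily how Project Naptha does it: the function names, the input shape (a <code>left</code> and <code>right</code> edge per line, in pixels), and the tolerance parameter are all assumptions.
</p>
<pre>
```javascript
// Hypothetical sketch: names, input shape, and tolerance are assumptions.
function spread(values) {
  // Difference between the largest and smallest value.
  return Math.max.apply(null, values) - Math.min.apply(null, values);
}

function guessAlignment(lines, tolerance) {
  var lefts = lines.map(function (l) { return l.left; });
  var rights = lines.map(function (l) { return l.right; });
  var centers = lines.map(function (l) { return (l.left + l.right) / 2; });

  // An edge counts as "aligned" when it varies by no more than the tolerance.
  var leftOk = tolerance >= spread(lefts);
  var rightOk = tolerance >= spread(rights);

  if (leftOk) return rightOk ? 'justified' : 'left';
  if (rightOk) return 'right';
  if (tolerance >= spread(centers)) return 'centered';
  return 'unknown';
}
```
</pre>
<p>
A real implementation would need more care. The last line of a justified paragraph is usually short, for example, so it would have to be excluded before comparing the right edges.
</p>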
<p>
It can even be chained to an online translation service (Google Translate, Microsoft Translator, or Yandex Translate) in order to do automatic document translations. With Tesseract’s advanced OCR engine, this means it’s possible to read text in languages with different scripts (Chinese, Japanese, or Arabic) which you might not be able to type into a translation engine.
</p>
</div>
<!--
<div class="header">
<h1 >Fut<span style="color: rgb(255, 133, 0)">ure</span></h1>
</div>
<div class="content">
<p>
This really depends on whether this catches on fire.
</p>
</div> -->
<div class="header">
<h1 >What's in a <span style="color: rgb(255, 133, 0)">Name</span>?</h1>
</div>
<div class="content">
<img src="img/rose.jpg" style="float: right; width: 300px; margin: 30px">
<p>
The prototype, which was demonstrated at HackMIT 2013 and went on to win 2nd place, was rather blandly dubbed "Images as Text". Sure, it pretty aptly summed up the precise function of the extension, but it really lacked that little spark of life.
</p>
<p>
So from then on, I set forth searching for a new name, something that would be rife with puntastic possibilities. One of the possibilities was "Pyranine", the chemical used in making the ink for fluorescent highlighters (my roommate, a chemistry major, happened to be rather fond of the name). I slept on that idea for a few nights, and realized that I had totally forgotten how to spell it, and so it was crossed off the candidate list.
</p>
<p>
Naptha, its current name, is drawn from an even more tenuous association. See, it comes from the fact that "highlighter" kind of sounds like "lighter", and that naptha is a type of fuel often used for lighters. It was in fact one of the earliest codenames of the project, and gave rise to a <b>rather fun little easter egg</b> which you can play with by quickly clicking about a dozen times over some block of text inside a picture.
</p>
</div>
<div class="header" style="margin-bottom: 40px">
With apologies to the color <span style="color: rgb(255, 133, 0)">orange</span> <br>Made by <a href="http://antimatter15.com/">antimatter15</a> (<a href="mailto:[email protected]">email</a> / <a href="http://twitter.com/antimatter15">twttr</a> / <a href="http://google.com/+KevinKwok">g+</a>)
</div>
<script src="naptha-demo.js"></script>
<!-- Global site tag (gtag.js) - Google Analytics -->
<script async src="https://www.googletagmanager.com/gtag/js?id=UA-50196519-1"></script>
<script>
window.dataLayer = window.dataLayer || [];
function gtag(){dataLayer.push(arguments);}
gtag('js', new Date());
gtag('config', 'UA-50196519-1');
</script>
</body>
</html>