Showing 1 changed file with 57 additions and 24 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,24 +1,57 @@ | ||
tessdata - Tesseract Language Trained Data | ||
========================================== | ||
|
||
# Accessible URLs | ||
|
||
### 4.0.0 | ||
|
||
- [https://tessdata.projectnaptha.com/4.0.0/eng.traineddata.gz](http://tessdata.projectnaptha.com/4.0.0/eng.traineddata.gz) | ||
- [https://raw.githubusercontent.com/naptha/tessdata/gh-pages/4.0.0/eng.traineddata.gz](https://raw.githubusercontent.com/naptha/tessdata/gh-pages/4.0.0/eng.traineddata.gz) | ||
|
||
### 4.0.0-best (Higher OCR accuracy) | ||
|
||
- [https://tessdata.projectnaptha.com/4.0.0_best/eng.traineddata.gz](http://tessdata.projectnaptha.com/4.0.0_best/eng.traineddata.gz) | ||
- [https://raw.githubusercontent.com/naptha/tessdata/gh-pages/4.0.0_best/eng.traineddata.gz](https://raw.githubusercontent.com/naptha/tessdata/gh-pages/4.0.0_best/eng.traineddata.gz) | ||
|
||
### 4.0.0-fast (Shorter OCR time) | ||
|
||
- [https://tessdata.projectnaptha.com/4.0.0_fast/eng.traineddata.gz](http://tessdata.projectnaptha.com/4.0.0_fast/eng.traineddata.gz) | ||
- [https://raw.githubusercontent.com/naptha/tessdata/gh-pages/4.0.0_fast/eng.traineddata.gz](https://raw.githubusercontent.com/naptha/tessdata/gh-pages/4.0.0_fast/eng.traineddata.gz) | ||
|
||
### 3.02 | ||
|
||
- [https://tessdata.projectnaptha.com/3.02/eng.traineddata.gz](http://tessdata.projectnaptha.com/3.02/eng.traineddata.gz) | ||
- [https://raw.githubusercontent.com/naptha/tessdata/gh-pages/3.02/eng.traineddata.gz](https://raw.githubusercontent.com/naptha/tessdata/gh-pages/3.02/eng.traineddata.gz) | ||
# Overview | ||
This repo contains various sets of `.traineddata` that can be used by Tesseract.js. This includes the files used by Tesseract.js by default, as well as alternative versions. The contents of the files, and how to use them with Tesseract.js, are explained below. | ||
|
||
## Language Data | ||
Each set of files is described below, along with its source where one exists. The copy hosted here may lag behind the latest version in the linked source repo. | ||
|
||
- `4.0.0_best_int` - Integerized Version of "Tessdata Best" | ||
- OEM: LSTM only | ||
- Used by Tesseract.js by default: Yes. | ||
- This is the default data used when OEM is set to LSTM only, which is the default. | ||
- Published to NPM package: Yes. | ||
- `4.0.0` - "Tessdata" | ||
- OEM: LSTM + Legacy | ||
- An integerized version of "Tessdata Best" for the LSTM engine is included, in addition to data for the Legacy engine. | ||
- Used by Tesseract.js by default: Yes. | ||
- This is the default data used when OEM is set to Legacy or LSTM with Legacy fallback. | ||
- Published to NPM package: Yes. | ||
- Source: https://github.com/tesseract-ocr/tessdata | ||
- `4.0.0_fast` - "Tessdata Fast" | ||
- OEM: LSTM only | ||
- Used by Tesseract.js by default: No. | ||
- Published to NPM package: No. | ||
- Source: https://github.com/tesseract-ocr/tessdata_fast | ||
- `4.0.0_best` - "Tessdata Best" | ||
- OEM: LSTM only | ||
- Used by Tesseract.js by default: No. | ||
- This data can be *significantly* larger than the integerized version and can result in longer runtimes. Results may be more accurate, but the difference usually ranges from negligible to marginal. | ||
- Before using this data, developers should review file sizes and run accuracy/performance tests to confirm that switching is worthwhile. | ||
- Published to NPM package: No. | ||
- Source: https://github.com/tesseract-ocr/tessdata_best | ||
- `3.02` - Historic Tessdata files from Tesseract v3 | ||
- OEM: Legacy only | ||
- Used by Tesseract.js by default: No. | ||
- These are old files and may be removed from this repo at some point. | ||
- Published to NPM package: No. | ||
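The relationship between engine mode and default data set described in the list above can be summarized in a few lines. This is a minimal sketch, not code from Tesseract.js: the `EngineMode` labels and the `defaultDataDir` function are hypothetical names introduced only for illustration.

```typescript
// Hypothetical labels for the three OCR engine modes (OEM) described above.
type EngineMode = "lstm_only" | "legacy_only" | "lstm_with_legacy_fallback";

// Returns the directory in this repo that holds the default data for a mode.
function defaultDataDir(mode: EngineMode): string {
  // LSTM only is the default mode and uses the integerized "Tessdata Best" files.
  if (mode === "lstm_only") {
    return "4.0.0_best_int";
  }
  // Legacy, and LSTM with Legacy fallback, both need the LSTM + Legacy files.
  return "4.0.0";
}

console.log(defaultDataDir("lstm_only")); // prints "4.0.0_best_int"
console.log(defaultDataDir("legacy_only")); // prints "4.0.0"
```

The other directories (`4.0.0_fast`, `4.0.0_best`, `3.02`) are never selected by default, so they do not appear in this mapping.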
|
||
## NPM Packages | ||
The `4.0.0` and `4.0.0_best_int` files for each language are published in a language-specific NPM package. Each language has its own package since combining into a single package would lead to an enormous download. The packages are named `@tesseract.js-data/{lang}`. For example, the English package is named `@tesseract.js-data/eng`. | ||
|
||
# Using Language Data with Tesseract.js | ||
See the Tesseract.js documentation for instructions on how to set `langPath` manually. Details regarding where the files in this repo can be found are below. | ||
## CDNs | ||
These files can be accessed using any CDN that automatically mirrors NPM. Popular examples are below. | ||
### JSDelivr (Default) | ||
By default, Tesseract.js uses the JSDelivr CDN. The link for the default English data on JSDelivr is below. | ||
https://cdn.jsdelivr.net/npm/@tesseract.js-data/[email protected]/4.0.0_best_int/eng.traineddata.gz | ||
### Unpkg | ||
Unpkg is another CDN that mirrors NPM. In most regions, unpkg appears to be slightly less reliable than JSDelivr (although still usable). However, users have reported that unpkg is accessible in parts of China where JSDelivr is blocked, so it may be the better choice for users in those regions. Discussion of this issue, along with example code that switches from JSDelivr to unpkg, can be found [here](https://github.com/naptha/tesseract.js/issues/899#issuecomment-1975051720). | ||
|
||
The link for the default English data on unpkg is below. | ||
https://unpkg.com/@tesseract.js-data/eng/4.0.0_best_int/eng.traineddata.gz | ||
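Both CDN links above follow the same pattern: CDN base, package name, data directory, file name. The sketch below builds such URLs for any language and data set. The function names, the `Cdn` type, and the optional `version` parameter are hypothetical and introduced only for illustration. It also assumes that `langPath` should be the directory containing the `.traineddata.gz` file, which should be checked against the Tesseract.js documentation.

```typescript
// CDNs discussed above; both mirror NPM packages.
type Cdn = "jsdelivr" | "unpkg";

const CDN_BASE: Record<Cdn, string> = {
  jsdelivr: "https://cdn.jsdelivr.net/npm",
  unpkg: "https://unpkg.com",
};

// Directory URL for one language and data set (assumed to be what `langPath` expects).
// `version` is optional; when omitted, the package name is used without a version suffix.
function cdnLangPath(cdn: Cdn, lang: string, dataDir: string, version?: string): string {
  const pkg = `@tesseract.js-data/${lang}` + (version ? `@${version}` : "");
  return `${CDN_BASE[cdn]}/${pkg}/${dataDir}`;
}

// Full URL of the gzipped trained data file inside that directory.
function trainedDataUrl(cdn: Cdn, lang: string, dataDir: string, version?: string): string {
  return `${cdnLangPath(cdn, lang, dataDir, version)}/${lang}.traineddata.gz`;
}

console.log(trainedDataUrl("unpkg", "eng", "4.0.0_best_int"));
```

Called with `"unpkg"`, `"eng"`, and `"4.0.0_best_int"`, this reproduces the unpkg link shown above. Pinning a version is generally safer for production use than relying on whatever release the CDN resolves by default.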
## Local Copy | ||
Users are free to use their own local copy of these files rather than relying on a remote CDN. For Node.js, you can simply add the relevant NPM packages as a dependency, or download the file and include it as a project resource. For the browser version, simply download the relevant files and host them yourself on your website. | ||
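For the Node.js case, the directory to use as `langPath` can be derived from where the NPM package is installed. The sketch below is illustrative only: `localLangPath` is a hypothetical helper, and it assumes the package sits in the project's top-level `node_modules` folder with the same internal layout seen in the CDN links (`{dataDir}/{lang}.traineddata.gz`).

```typescript
// Builds the local directory holding the trained data for one language.
// Assumes a POSIX-style project root and a top-level node_modules install.
function localLangPath(projectRoot: string, lang: string, dataDir: string): string {
  // Drop any trailing slashes so the joined path has no doubled separators.
  const root = projectRoot.replace(/\/+$/, "");
  return `${root}/node_modules/@tesseract.js-data/${lang}/${dataDir}`;
}

console.log(localLangPath("/srv/app", "eng", "4.0.0_best_int"));
```

Package managers that hoist or relocate dependencies may place the files elsewhere, so verify the resolved path before relying on it.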
## GitHub Pages Site (Deprecated) | ||
**The `tessdata.projectnaptha.com` site is deprecated and is no longer updated. Do not point new code to this site.** | ||
|
||
In old versions of Tesseract.js, the default `langPath` location was a simple GitHub Pages site that hosted this repo. However, users reported that GitHub Pages was unreliable, and this repo now exceeds the GitHub Pages size limit. Therefore, that site is no longer updated. It is being left as-is to avoid breaking old code, but developers are encouraged to switch. |