Version 1.0
The Renmin OCR/NER Collection supports evaluation of named entity recognition (NER) over optical character recognition (OCR) output. Once constructed, the collection consists of a set of pdf newspaper pages from Renmin Ribao, together with character-tokenized versions of the articles in those pages annotated with fifteen named entity types.
To create the collection, you will need to download the source pdf files from Renmin, then replace the coded tokens in the annotation files with actual tokens recovered from the downloaded pdf files. This somewhat tortuous process is necessary because we do not have the rights to distribute the Renmin source data. Fortunately, the attached scripts should do all of the work for you.
To recreate the OCR/NER data you will need:
- This distribution, which contains:
  - This README
  - Annotation files (`train.encoded.txt`, `dev.encoded.txt`, and `test.encoded.txt`) in which token strings have been encoded using keys in the original pdf file. This ensures that the text cannot be recreated without access to the pdfs. These encoded annotation files are in the `data` directory.
  - The `create-renmin-collection.py` python3 program that converts encoded tokens back to actual token strings. This program, together with the two other scripts it calls (`renmin-downloader.py` and `renmin-reconstructor.py`), is in the `code` directory.
- The Renmin source data, which are available from the Renmin Ribao Web Site. You will need HTTP connectivity to this site.
- Python 3.6 or later to run the scripts in this distribution.
- To generate images of the newspaper pages, the ability to convert from pdf to the image format of your choice. We assume a DPI of 216; if you use a different DPI, you will need to convert the offsets in the output files accordingly.
The `create-renmin-collection.py` program will download the Renmin pdf pages used by the collection and use them to decode the tokens listed in the encoded train, dev, and test files. This script takes a single argument: the name of the directory into which the collection is to be placed. The scripts use relative paths to find the location of the encoded files and the other scripts, so they should not be moved from the directory in which they were created.
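For example, assuming you run the program from the root of this distribution and want the collection placed in a new directory (the directory name below is only a placeholder), the invocation might look like this:

```
python3 code/create-renmin-collection.py renmin-collection
```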
The CoNLL files created by `create-renmin-collection.py` should have the following MD5 checksums:
File | MD5 Checksum |
---|---|
train.conll.txt | 4ba2f972324ad9e11ee81f4991b41492 |
dev.conll.txt | 670e50e9d71e76f0a1154f542892f7ff |
test.conll.txt | 6ee5a3a892900b15f41a411973aad112 |
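If you want to verify the checksums programmatically, the following minimal sketch (not part of the distribution) uses Python's standard `hashlib` module; the directory argument is whatever directory you passed to `create-renmin-collection.py`:

```python
import hashlib
import sys

EXPECTED = {
    "train.conll.txt": "4ba2f972324ad9e11ee81f4991b41492",
    "dev.conll.txt": "670e50e9d71e76f0a1154f542892f7ff",
    "test.conll.txt": "6ee5a3a892900b15f41a411973aad112",
}

def md5sum(path):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()

if __name__ == "__main__":
    collection_dir = sys.argv[1]  # directory passed to create-renmin-collection.py
    for name, expected in EXPECTED.items():
        actual = md5sum(f"{collection_dir}/{name}")
        status = "OK" if actual == expected else "MISMATCH"
        print(f"{name}: {status} ({actual})")
```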
We leave it to the user of the collection to convert the pdf files to images in a form that meets their needs; this ensures that the image file format, DPI, etc., work with the user's OCR system.
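As one entirely optional approach, the sketch below uses the third-party `pdf2image` package (which requires Poppler and is not included in this distribution) to render a downloaded page to PNG, and shows how the annotation offsets would be rescaled if you render at a DPI other than 216. The pdf file name is only a placeholder:

```python
from pdf2image import convert_from_path  # third-party package; also requires Poppler

SOURCE_DPI = 216   # DPI assumed by the offsets in the CoNLL files
TARGET_DPI = 300   # example alternative resolution

# Render each page of one downloaded Renmin pdf (placeholder file name) to PNG.
pages = convert_from_path("renmin-page.pdf", dpi=TARGET_DPI)
for i, page in enumerate(pages, start=1):
    page.save(f"renmin-page-{i}.png")

def rescale(offset, source_dpi=SOURCE_DPI, target_dpi=TARGET_DPI):
    """Convert an xmin/ymin/xmax/ymax offset from the source DPI to the target DPI."""
    return offset * target_dpi / source_dpi

print(rescale(276.995563))  # an offset from the CoNLL files, expressed at 300 DPI
```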
The files created through this process (`train.conll.txt`, `dev.conll.txt`, and `test.conll.txt`) are tab-separated, UTF-8-encoded text files. Sentences are separated by blank lines. The columns, from left to right, are:
- Token (e.g., '合' or ','). The data are tokenized by character, so a word might span more than one token.
- Tag (e.g., `B-PER` or `O`). Tags are in what Wikipedia calls "IOB2" format; a short sketch for decoding these tags appears after the entity type table below.
- Unique token ID (e.g., `Renmin_2018_06_01_02-128-9`). This ID encodes the year, month, day, page number, box number, and token number; token numbers count left to right (or top to bottom) within the box.
- ID of the previous token in reading order (e.g., `Renmin_2018_06_01_02-130-17` or `None`). Because the layout is drawn from a printed newspaper, the previous token is not always the immediately preceding token in the file.
- xmin (e.g., "276.995563"). This and the following three fields are the coordinates of the box tightly surrounding the line containing the token on the page, assuming a DPI of 216. (This means several tokens may share the same location; the token number indicates the token's offset within the line.)
- ymin
- xmax
- ymax
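As an illustration of the format, the following minimal sketch (not part of the distribution) reads one of these files into sentences of per-token field dictionaries using only the standard library:

```python
def read_conll(path):
    """Yield sentences from a Renmin CoNLL file as lists of field dictionaries."""
    fields = ["token", "tag", "token_id", "prev_id", "xmin", "ymin", "xmax", "ymax"]
    sentence = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:              # blank line ends a sentence
                if sentence:
                    yield sentence
                    sentence = []
                continue
            sentence.append(dict(zip(fields, line.split("\t"))))
    if sentence:                      # final sentence if the file lacks a trailing blank line
        yield sentence

# Example: count sentences and tokens in the training file.
sentences = list(read_conll("train.conll.txt"))
print(len(sentences), sum(len(s) for s in sentences))
```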
Note that although we refer to these files as CoNLL files (after the format used in the CoNLL 2002 and 2003 shared tasks), the term CoNLL file has been overloaded in the literature and now is used to refer to a family of related formats. In particular, the original CoNLL files do not conform to the specification given here.
All named entities are tagged with a type in this dataset, with MISC serving as the type for all entities that do not fit into one of the other categories. Any token that is not part of a named entity mention is tagged with the 'O' tag.
Type | Description | Examples |
---|---|---|
PER | Person | Enrico Rastelli, Beyoncé |
ORG | Organization | International Jugglers Association |
COMM | Commercial Org. | Penguin Magic, Papermoon Diner |
POL | Political Organization | Green Party, United Auto Workers |
GPE | Geo-political Entity | Morocco, Carlisle, Côte d'Ivoire |
LOC | Natural Location | Kartchner Caverns, Lake Erie |
FAC | Facility | Stieff Silver Building, Hoover Dam |
GOVT | Government Building | White House, 10 Downing St. |
AIR | Airport | Ninoy Aquino International, BWI |
EVNT | Named Event | WWII, Operation Desert Storm |
VEH | Vehicle | HMS Titanic, Boeing 747 |
COMP | Computer Hard/Software | Nvidia GeForce RTX 2080 Ti, Reunion |
WEAP | Weapon | Fat Man, 13"/35 caliber gun |
CHEM | Chemical | Iron, NaCl, hydrochloric acid |
MISC | Other named entity | Dark Star, Lord of the Rings |
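For readers unfamiliar with IOB2 tagging, the following minimal sketch (not part of the distribution) shows one way to recover typed entity mentions from a sentence's tag sequence:

```python
def iob2_spans(tags):
    """Return (start, end, type) triples for an IOB2 tag sequence; end is exclusive."""
    spans, start, etype = [], None, None
    for i, tag in enumerate(tags):
        if tag.startswith("B-"):
            if start is not None:          # close any open span
                spans.append((start, i, etype))
            start, etype = i, tag[2:]
        elif tag.startswith("I-") and etype == tag[2:]:
            continue                       # mention continues
        else:                              # 'O' or an I- tag that does not continue the open span
            if start is not None:
                spans.append((start, i, etype))
            start, etype = None, None
    if start is not None:                  # span running to the end of the sentence
        spans.append((start, len(tags), etype))
    return spans

print(iob2_spans(["B-GPE", "I-GPE", "O", "B-PER"]))  # [(0, 2, 'GPE'), (3, 4, 'PER')]
```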
If anyone cares to create a Chinese version of this README that is better than the output of Google Translate, we will add it to the repository.
If you use this collection, please cite:
Dawn Lawrie, James Mayfield and David Etter, "Building OCR/NER test collections." In Proceedings of LREC 2020, Marseille, 2020.