Max Thomas committed May 8, 2015
1 parent 8746123, commit 32947d9
Showing 1 changed file with 39 additions and 2 deletions.

@@ -1,12 +1,49 @@
# annotated-nyt
-Utilities for reading the Annotated NYT corpus
+Utilities for reading the [Annotated NYT corpus](https://catalog.ldc.upenn.edu/LDC2008T19).

Latest Maven dependency
---
```xml
<dependency>
<groupId>edu.jhu.hlt</groupId>
<artifactId>annotated-nyt</artifactId>
-<version>1.1.1</version>
+<version>1.1.2</version>
</dependency>
```

## Quick start
Create a `NYTCorpusDocumentParser` object:
```java
NYTCorpusDocumentParser parser = new NYTCorpusDocumentParser();
```

Read a single `.xml` document from the annotated NYT corpus:
```java
Path p = Paths.get("/your/path/.xml");
byte[] bytes = Files.readAllBytes(p);
NYTCorpusDocument ncd = parser.fromByteArray(bytes, false);
AnnotatedNYTDocument and = new AnnotatedNYTDocument(ncd);
```

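To process more than one document, the same calls can be repeated for every `.xml` file under a directory. The following is a minimal sketch that uses only the JDK to collect those files. The class `XmlFileLister` and its method `listXmlFiles` are hypothetical names and are not part of annotated-nyt, and the sketch assumes the corpus documents have already been unpacked to plain `.xml` files on disk.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper for illustration; not part of annotated-nyt.
final class XmlFileLister {
  // Recursively collects every regular file whose name ends in ".xml"
  // under the given root directory, sorted by path.
  static List<Path> listXmlFiles(Path root) throws IOException {
    // Files.walk holds open directory handles, so close the stream when done.
    try (Stream<Path> paths = Files.walk(root)) {
      return paths
          .filter(Files::isRegularFile)
          .filter(p -> p.getFileName().toString().endsWith(".xml"))
          .sorted()
          .collect(Collectors.toList());
    }
  }
}
```

Each returned `Path` can then be read with `Files.readAllBytes` and handed to `parser.fromByteArray(bytes, false)`, exactly as in the single-document snippet.
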
## API
The API is guaranteed not to pass along any `null` fields.

Many of the fields in the corpus can be empty or `null` in the
documents themselves. These fields are represented in the wrapper
object, `AnnotatedNYTDocument`, as `Optional` fields.

Convenience methods convert naturally list-based items (e.g.,
the body as a `List` of paragraphs). Many of these sections, however,
can also be `null` in the documents. In these cases, the API returns an
empty `List` object. These lists will never be `null`.

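The null-handling contract described above can be illustrated with a small stand-alone sketch. It does not reproduce the real `AnnotatedNYTDocument`: the class `NullSafeDocument`, its two fields, and its method names are hypothetical and chosen only to show the pattern of returning `Optional` for scalar fields and an empty `List` for list-like fields.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Optional;

// Hypothetical wrapper illustrating the pattern; not the real AnnotatedNYTDocument.
final class NullSafeDocument {
  private final String headline; // may be null in the raw document
  private final String body;     // may be null in the raw document

  NullSafeDocument(String headline, String body) {
    this.headline = headline;
    this.body = body;
  }

  // A missing scalar field surfaces as Optional.empty(), never as null.
  Optional<String> getHeadline() {
    return Optional.ofNullable(headline);
  }

  // A missing list-like field surfaces as an empty list, never as null.
  List<String> getBodyAsParagraphList() {
    if (body == null) {
      return Collections.emptyList();
    }
    // Assumption for this sketch: paragraphs are separated by newlines.
    return Arrays.asList(body.split("\n"));
  }
}
```

With this shape, callers can write `doc.getHeadline().ifPresent(...)` or iterate over `doc.getBodyAsParagraphList()` without any `null` checks.
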
## Running the integration test
The integration test can be executed with the following command:

```sh
mvn clean verify -Pitest -DanytDataPath=/path/to/your/LDC/corpus/data/dir
```

The `anytDataPath` property should point to your `data` directory
from the extracted ANYT corpus. This directory contains many folders
with numbers as names, representing years of annotated NYT data.