Commit

Update README.md
Max Thomas committed May 8, 2015
1 parent 8746123 commit 32947d9
Showing 1 changed file with 39 additions and 2 deletions.
41 changes: 39 additions & 2 deletions README.md
@@ -1,12 +1,49 @@
# annotated-nyt
Utilities for reading the Annotated NYT corpus
Utilities for reading the [Annotated NYT corpus](https://catalog.ldc.upenn.edu/LDC2008T19).

Latest Maven dependency
---
```xml
<dependency>
<groupId>edu.jhu.hlt</groupId>
<artifactId>annotated-nyt</artifactId>
<version>1.1.1</version>
<version>1.1.2</version>
</dependency>
```

## Quick start
Create a `NYTCorpusDocumentParser` object:
```java
NYTCorpusDocumentParser parser = new NYTCorpusDocumentParser();
```

Read a single `.xml` document from the annotated NYT corpus:
```java
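// JDK imports needed by this snippet
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

Path p = Paths.get("/your/path/.xml");
byte[] bytes = Files.readAllBytes(p); // may throw IOException
NYTCorpusDocument ncd = parser.fromByteArray(bytes, false);
AnnotatedNYTDocument and = new AnnotatedNYTDocument(ncd);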
```

## API
The API is guaranteed not to pass along any `null` fields.

Many of the fields in the corpus can be empty or `null` in the
documents themselves. These fields are represented in the wrapper
object, `AnnotatedNYTDocument`, as `Optional` fields.

Convenience methods expose naturally list-based items as lists (e.g.,
the body as a `List` of paragraphs). Many of these sections, however,
can also be `null` in the documents. In these cases, the API returns an
empty `List` object; these lists are never `null`.
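
The null-handling contract described above can be sketched with plain JDK types. The class `NullSafeDoc`, its fields, and its method names are hypothetical, invented for illustration only; they are not part of `annotated-nyt`, and the real `AnnotatedNYTDocument` accessors may be named and implemented differently.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Optional;

// Hypothetical wrapper illustrating the pattern; NOT the library's class.
final class NullSafeDoc {
  private final String headline; // may be null in the raw document
  private final String body;     // may be null in the raw document

  NullSafeDoc(String headline, String body) {
    this.headline = headline;
    this.body = body;
  }

  // A possibly-missing scalar field is exposed as an Optional, never null.
  Optional<String> getHeadline() {
    return Optional.ofNullable(headline);
  }

  // A possibly-missing list-like field is exposed as an empty List, never null.
  List<String> getBodyAsParagraphs() {
    if (body == null || body.isEmpty()) {
      return Collections.emptyList();
    }
    // Assumption for this sketch: paragraphs are separated by newlines.
    return Arrays.asList(body.split("\n"));
  }
}
```

With this shape, callers can chain `getHeadline().orElse("(untitled)")` or iterate over `getBodyAsParagraphs()` without any `null` checks.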

## Running the integration test
The integration test can be executed with the following command:

```sh
mvn clean verify -Pitest -DanytDataPath=/path/to/your/LDC/corpus/data/dir
```

The `anytDataPath` property should point to your `data` directory
from the extracted ANYT corpus. This directory contains many folders
with numbers as names, representing years of annotated NYT data.
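
To check that a path is a plausible `data` directory before running the test, one option is to list the `.xml` files beneath it. The following is a minimal sketch using only the JDK; the class `CorpusFiles` and the method `listXmlFiles` are hypothetical names, not part of `annotated-nyt`, and the sketch assumes the documents have already been unpacked as `.xml` files somewhere under the year folders.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper; not part of the annotated-nyt library.
final class CorpusFiles {
  // Recursively collects every regular file ending in ".xml" under dataDir.
  static List<Path> listXmlFiles(Path dataDir) throws IOException {
    // Files.walk holds open directory handles, so close the stream when done.
    try (Stream<Path> paths = Files.walk(dataDir)) {
      return paths
          .filter(Files::isRegularFile)
          .filter(p -> p.getFileName().toString().endsWith(".xml"))
          .sorted()
          .collect(Collectors.toList());
    }
  }
}
```

An empty result suggests the path points at the wrong directory or that the corpus has not been extracted yet.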
