A collection of tools for applying word senses to large corpora
fozziethebeat/C-Cat
The C-Cat library provides libraries for large-scale text processing using the Hadoop framework. Its ultimate goal is to provide tools and libraries for automatically customizing a WordNet ontology based on the contents of a particular corpus. It is structured into three sub-modules:

1) extendOntology-core: This provides a core set of text and collection-like classes.
2) extendOntology-wordnet: This provides core WordNet libraries for reading, writing, and modifying the WordNet hierarchy, along with several synset similarity metrics and several word sense disambiguation algorithms.
3) extendOntology: This is the complete package that includes both the core and wordnet submodules. On top of those two modules, it includes a text pre-processing framework for documents stored in HBase. This is the most unstable module and is under heavy development.

This project uses Maven as its build system. Most of the library dependencies are handled via Maven, but a few jars come from libraries that have not been mavenized yet. To install these jars into Maven, run

    ./add_non_maven_jars.sh

Then, build the entire project with

    mvn package

This will create two jars in target: extendOntology-1.0.jar and extendOntology-1.0-jar-with-dependencies.jar. To run any of the mains provided without Maven, include both of these jars in the classpath.
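As a rough illustration of the last step, the snippet below sketches one way to put both jars on the classpath once the build has finished. The jar names and the target directory come from the text above. The run_main helper is a hypothetical convenience and not part of C-Cat, and the main class to run is deliberately left as an argument because this README does not name one.

```shell
#!/bin/sh
# Jar names as listed in the README; `mvn package` writes them to target/.
CORE_JAR="target/extendOntology-1.0.jar"
DEPS_JAR="target/extendOntology-1.0-jar-with-dependencies.jar"

# Join both jars into one classpath (':' is the separator on Linux and macOS).
CP="$CORE_JAR:$DEPS_JAR"

# run_main is a hypothetical helper, not something shipped with C-Cat.
# Usage: run_main <fully.qualified.MainClass> [args...]
run_main() {
    # Refuse to start if either jar is missing, i.e. the build has not run yet.
    for jar in "$CORE_JAR" "$DEPS_JAR"; do
        if [ ! -f "$jar" ]; then
            echo "missing $jar; run ./add_non_maven_jars.sh and mvn package first" >&2
            return 1
        fi
    done
    # Launch the requested main class with both jars on the classpath.
    java -cp "$CP" "$@"
}
```

After sourcing this from the project root, a main would be started as run_main followed by the fully qualified class name and its arguments. On Windows the classpath separator would be ';' rather than ':'.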