Initial release of Generalised Brown source code.

sean-chester · Nov 14, 2015 · f3ee0ad · f3ee0ad
commit f3ee0ad
Showing 54 changed files with 4,773 additions and 0 deletions.
diff --git a/README.md b/README.md
@@ -0,0 +1,165 @@
+## generalised-brown
+version 1.0
+© 2015 Sean Chester and Leon Derczynski
+
+-------------------------------------------
+### Table of Contents 
+
+  * [Introduction](#introduction)
+  * [Requirements](#requirements)
+  * [Installation](#installation)
+  * [Usage](#usage)
+  * [License](#license)
+  * [Contact](#contact)
+
+
+------------------------------------
+### Introduction
+<a name="introduction" ></a>
+
+The *generalised-brown* software suite clusters word types by 
+distributional similarity in two phases. It first generates a list 
+of merges based on the well-known Brown clustering algorithm and 
+then recalls historical states to vary the granularity of the
+clusters. For example, given the following corpus:
+
+> Alice likes dogs and Bob likes cats while Alice hates snakes and Bob hates spiders
+
+Greedily clustering word types based on *average mutual information* 
+(i.e., running the *C++ merge generator*) produces the following 
+merge list (assuming _a_ = _|V|_ = 10):
+
+> snakes spiders 8
+> dogs cats 7
+> Alice Bob 6
+> and while 5
+> likes hates 4
+> dogs snakes 3
+> dogs and 2
+> dogs Alice 1
+> dogs likes 0
+
+One can then recall any historical state of the computation in order to 
+produce a set of clusters (i.e., run the *python cluster generator*).
+For example, with _c_ = 5, we recall the state _c_ - 1 = 4 to produce 
+the following clusters:
+
+> {snakes, spiders}
+> {dogs, cats}
+> {Alice, Bob}
+> {likes, hates}
+> {and, while}
+
+We refer to this approach (setting separate values of _a_ and _c_) as 
+*Roll-up feature generation*. By contrast, traditional Brown clustering 
+would produce the following five clusters (equivalent to running the 
+*C++ merge generator* with _a_ = 5 **and** the *python cluster generator* 
+with _c_ = 5):
+
+> {likes, hates}
+> {snakes, spiders, cats, dogs}
+> {and, while}
+> {Alice}
+> {Bob}
+
+For details about the concepts implemented in this software, please 
+read our recent AAAI paper:
+
+> L. Derczynski and S. Chester. 2016. "Generalised Brown Clustering 
+>   and Roll-up Feature Generation." In: Proceedings of the 
+>   Thirtieth AAAI Conference on Artificial Intelligence (AAAI-16). 
+>		7 pages. To appear.
+
+For details about traditional Brown clustering, consult the article 
+in which it was introduced:
+
+> PF Brown et al. 1992. "Class-based n-gram models of natural language."
+>   Computational Linguistics 18(4): 467--479.
+
+or the implementation that our *C++ merge generator* forked:  
+
+> [wcluster](https://github.com/percyliang/brown-cluster).
+
+
+------------------------------------
+### Requirements
+<a name="requirements" ></a>
+
+*generalised-brown* relies on the following applications:
+
+ + For compiling the *C++ merge generator*: A C++ compiler that 
+ supports C++11 and OpenMP (e.g., a recent 
+ [GNU compiler](https://gcc.gnu.org/)) and the *make* program
+
+ + For running the *python cluster generator*: A *python* 
+ interpreter
+
+------------------------------------
+### Installation
+<a name="installation" ></a>
+
+The *python cluster generator* does not need to be compiled.
+To compile the *C++ merge generator*, navigate to the 
+*merge_generator/* subdirectory of the project and type:
+
+>make
+
+------------------------------------
+### Usage
+<a name="usage" ></a>
+
+To produce a set of features for a corpus, you will first want to use 
+Generalised Brown (i.e., the *C++ merge generator*) to create a merge list. 
+Then, you can create _c_ clusters by running the *python cluster generator* 
+on the merge list. This second step can be repeated for as many values of _c_ 
+as you like, but we recommend that each value of _c_ be no larger than the 
+value of _a_ used to generate the merge list.
+
+To run the *C++ merge generator*, type:
+
+>./merge_generator/wcluster --text [input_file] --a [active_set_size]
+
+The resultant merges will be recorded in:
+
+>./[input_file]-c[active_set_size]-p1.out/merges
+
+To run the *python cluster generator*, type:
+
+>python ./cluster_generator/cluster.py -in ./[input_file]-c[active_set_size]-p1.out/merges -c 3
+
+Each word type will be printed to *stdout* with its cluster id.
+
+The *C++ merge generator* runs in _O(|V| a^2)_ time, where _|V|_ is the number 
+of distinct word types in the corpus (i.e., the size of the vocabulary) and 
+_a_ is a bound on the algorithm's search space. The *python cluster generator* 
+runs in _O(|V|)_ time.
+
+
+------------------------------------
+### License
+<a name="license"></a>
+
+This software consists of two sub-modules, each released under a 
+different license: 
+
+ + The *python cluster generator* is subject to the terms of 
+[The MIT License](http://opensource.org/licenses/MIT) 
+
+ + The *C++ merge generator* follows the original licensing terms  
+of [wcluster](https://github.com/percyliang/brown-cluster). 
+
+See the relevant sub-directories of this repository for the 
+specific details of each license.
+
+
+
+------------------------------------
+### Contact
+<a name="contact"></a>
+
+This software suite will undergo a major revision, so please check that 
+this is still the latest version. Do not hesitate to contact the authors 
+if you have comments, questions, or bugs to report.
+>[generalised-brown on GitHub](https://github.com/sean-chester/generalised-brown) 
+
+------------------------------------
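
The roll-up example in the README above can be reproduced with a short sketch. This is not the authors' implementation: `rollup_clusters` and `MERGES` are hypothetical names introduced here, the merge list is copied from the README, and the sketch only illustrates the idea of recalling a historical state (apply every merge except the last _c_ - 1). It ignores the `--depth` option of `cluster.py`.

```python
# Merge list from the README example, in the order it was generated.
# Each pair is (merge_into, merge_from); the trailing merge id is omitted.
MERGES = [
    ("snakes", "spiders"),
    ("dogs", "cats"),
    ("Alice", "Bob"),
    ("and", "while"),
    ("likes", "hates"),
    ("dogs", "snakes"),
    ("dogs", "and"),
    ("dogs", "Alice"),
    ("dogs", "likes"),
]


def rollup_clusters(merges, c):
    """Return the clusters left after applying all but the last c - 1 merges."""
    # Every word type starts as the representative of its own singleton cluster.
    members = {}
    for merge_into, merge_from in merges:
        members.setdefault(merge_into, {merge_into})
        members.setdefault(merge_from, {merge_from})

    # Recalling a historical state means stopping the replay early.
    keep = max(len(merges) - (c - 1), 0)
    for merge_into, merge_from in merges[:keep]:
        members[merge_into] |= members.pop(merge_from)

    return sorted(sorted(cluster) for cluster in members.values())


if __name__ == "__main__":
    for cluster in rollup_clusters(MERGES, 5):
        print("{" + ", ".join(cluster) + "}")
```

With _c_ = 5 the sketch stops after the merge with id 4, which yields the same five two-word clusters listed in the README.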
diff --git a/cluster_generator/LICENSE.md b/cluster_generator/LICENSE.md
@@ -0,0 +1,19 @@
+Copyright (c) 2015 Sean Chester and Leon Derczynski
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
diff --git a/cluster_generator/cluster.py b/cluster_generator/cluster.py
@@ -0,0 +1,70 @@
+#!/usr/bin/env python
+#                                       cluster.py
+#                       &copy; Sean Chester ([email protected])
+#                                      22 July 2015
+
+import csv
+import argparse
+
+# Input parsing
+parser = argparse.ArgumentParser(
+	description='Prints out a tree with a specified number of leaves, given an ' + \
+		'input file with an ordered list of merges. Each unique path identifies ' + \
+		'one leaf. All word types that have the same path as each other belong to the ' + \
+		'same leaf (and correspond to one Brown cluster).', \
+	epilog='If the output is to be read by humans, consider piping results to ' + \
+		'the sort command to print the leaves in depth-first order. (Then ' + \
+		'similar leaves/clusters will appear nearer each other in the output.)')
+parser.add_argument(
+	'-in', '--input-file', \
+	help="Input file containing ordered merges", \
+	required=True, \
+	dest='input', \
+	metavar='INPUT_FILE')
+parser.add_argument(
+	'-c', '--num-classes', \
+	type=int, \
+	help="Number of leaves/classes/clusters to produce", \
+	required=True, \
+	dest='leaves', \
+	metavar='NUM_CLASSES')
+parser.add_argument(
+	'-d', '--depth', \
+	type=int, \
+	help="Truncation depth for paths (i.e., no leaf appears farther than d-1 hops from the " + \
+		"root). Note: setting this parameter likely results in fewer than NUM_CLASSES leaves, " + \
+		"because the --num-classes filter is (logically) applied first.", \
+	required=False, \
+	dest='depth')
+args = parser.parse_args()
+
+# If depth wasn't passed as a parameter, default it to the value of
+# --num-classes.
+if args.depth is None:
+	args.depth = args.leaves
+
+# Actual processing -- read merge list in reverse and map each encountered
+# word type onto a tree path in a dictionary.
+tree = {}
+with open( args.input ) as tsv:
+	for line in reversed(list(csv.reader(tsv, delimiter="\t", quotechar=None))):
+		merge_into = line[0]
+		merge_from = line[1]
+		if merge_into not in tree:
+			tree[merge_into] = "0"
+			tree[merge_from] = "1"
+			args.leaves = args.leaves - 2
+		elif args.leaves > 0:
+			parent = tree[merge_into]
+			if len( parent ) < args.depth:
+				tree[merge_from] = parent + "1"
+				tree[merge_into] = parent + "0"
+			else:
+				tree[merge_from] = parent
+			args.leaves = args.leaves - 1
+		else: 
+			tree[merge_from] = tree[merge_into]
+
+for (cluster, path) in tree.items():
+	print( path + "\t" + cluster )
+
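
Because `cluster.py` prints one `path<TAB>word` line per word type, downstream code usually has to regroup those lines into clusters. The following is a minimal sketch of that step, not part of the commit: `group_by_path` and `SAMPLE` are hypothetical names, and the sample lines come from tracing the script by hand on the README merge list with `-c 5`, assuming a tab-separated merges file.

```python
from collections import defaultdict

# Hand-traced output for the README example with -c 5 (an assumption,
# not captured from a real run). Dictionary order in the script is arbitrary.
SAMPLE = [
    "0000\tdogs",
    "1\tlikes",
    "01\tAlice",
    "001\tand",
    "0001\tsnakes",
    "1\thates",
    "001\twhile",
    "01\tBob",
    "0000\tcats",
    "0001\tspiders",
]


def group_by_path(lines):
    """Group word types by tree path, given 'path<TAB>word' lines."""
    clusters = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        path, word = line.split("\t", 1)
        clusters[path].append(word)
    # Sorting the paths lists the leaves in depth-first order, which is
    # what piping the script's output through sort achieves.
    return [(path, sorted(words)) for path, words in sorted(clusters.items())]


if __name__ == "__main__":
    for path, words in group_by_path(SAMPLE):
        print(path + "\t" + " ".join(words))
```

Keeping the path alongside each group preserves the hierarchy, so prefixes of a path can still be used as coarser features.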
diff --git a/merge_generator/CHANGE_LOG.md b/merge_generator/CHANGE_LOG.md
@@ -0,0 +1,22 @@
+# Change Log
+
+--------------------
+
+## 1.3.1: [Sean Chester](https://github.com/sean-chester)
+ + Added conceptual generalisation whereby every merge is logged so that 
+ historical states can be recalled with ../cluster_generator/cluster.py.
+ + Added more parallelism (courtesy of 
+ [Kenneth S Bøgh](https://dk.linkedin.com/in/kenneth-sejdenfaden-bøgh-58915524)).
+ + Aliased the input parameter _c_ as _a_ to fit the conceptual generalisation 
+ (while maintaining backwards compatibility).
+
+## 1.3: [Percy Liang](https://github.com/percyliang)
+ + compatibility updates for newer versions of g++ (courtesy of Chris Dyer).
+
+## 1.2: [Percy Liang](https://github.com/percyliang)
+ + make compatible with MacOS (replaced timespec with timeval and changed order of linking).
+
+## 1.1: [Percy Liang](https://github.com/percyliang) 
+ + Removed deprecated operators so it works with GCC 4.3.
+
+--------------------
diff --git a/merge_generator/LICENSE.md b/merge_generator/LICENSE.md
@@ -0,0 +1,15 @@
+(C) Copyright 2015 [Sean Chester](https://github.com/sean-chester) 
+and [Leon Derczynski](http://derczynski.com/)
+(C) Copyright 2007-2012, Percy Liang
+
+http://cs.stanford.edu/~pliang
+
+Permission is granted for anyone to copy, use, or modify these programs and
+accompanying documents for purposes of research or education, provided this
+copyright notice is retained, and note is made of any changes that have been
+made.
+
+These programs and documents are distributed without any warranty, express or
+implied.  As the programs were written for research purposes only, they have
+not been tested to the degree that would be advisable in any important
+application.  All use of these programs is entirely at the user's own risk.
diff --git a/merge_generator/Makefile b/merge_generator/Makefile
@@ -0,0 +1,16 @@
+# 1.2: need to make sure opt.o goes in the right order to get the right scope on the command-line arguments
+# Use this for Linux
+ifeq ($(shell uname),Linux)
+	files=$(subst .cc,.o,basic/logging.cc $(shell /bin/ls *.cc) $(shell /bin/ls basic/*.cc | grep -v logging.cc))
+else
+	files=$(subst .cc,.o,basic/opt.cc $(shell /bin/ls *.cc) $(shell /bin/ls basic/*.cc | grep -v opt.cc))
+endif
+
+wcluster: $(files)
+	g++ -Wall -g -std=c++0x -O3 -fopenmp -o wcluster $(files) -lpthread
+
+%.o: %.cc
+	g++ -Wall -g -O3 -fopenmp -std=c++0x -o $@ -c $< 
+
+clean:
+	rm -f wcluster basic/*.o *.o