ParaQuery is a tool that helps a user interactively explore and characterize a given pivoted paraphrase collection, analyze its utility for a particular domain, and compare it to other popular lexical similarity resources – all within a single interface.
For running ParaQuery with the bundled databases, you need:
However, if you need to generate paraphrase rules from your own bilingual data, then you also need:
- Java 1.5 or higher
- Apache Hadoop
ParaQuery works by taking a pivoted paraphrase database and converting it into an SQLite database which can then be queried. ParaQuery provides an index command that can take a gzipped file containing pivoted paraphrases and write out an SQLite database to disk. The gzipped file needs to contain paraphrase rules in the following format on each line:
[X] ||| source ||| target ||| features ||| pivots
where source is the source English phrase being paraphrased, target is its paraphrase, features is a whitespace-separated list of features associated with the paraphrase pair, and pivots is a list of the foreign-language pivots that were used in generating this particular paraphrase pair. The format is based on the format used by the Joshua paraphrase extractor since that is what is currently used to produce the compressed paraphrase files (see the section on generating paraphrase rules below).
The features currently produced by the Joshua paraphrase extractor are as follows:
- 0 (always 0; indicates whether the rule is a glue rule, which it never is for paraphrases)
- Ignored by ParaQuery.
- 1 if source and target are identical
- -log p(target|source). This is what's currently used by ParaQuery as the score for a paraphrase pair.
- -log p(source|target)
- Ignored by ParaQuery.
- Ignored by ParaQuery.
- Number of words in source
- Number of words in target
- Difference in number of words between target and source
- Is the rule purely lexical? (Right now, ParaQuery works best with purely lexical rules so this is always 1. However, in the future it might be useful for cases where this is not true, e.g., hierarchical or syntactic paraphrases.)
- Ignored by ParaQuery.
- Ignored by ParaQuery.
- Ignored by ParaQuery.
- Ignored by ParaQuery.
- Ignored by ParaQuery.
- Ignored by ParaQuery.
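As a rough illustration of how the feature list maps onto a rule, the sketch below reads the list above as an ordered description of the 17 whitespace-separated values and pulls out the ones ParaQuery uses. The dictionary keys and the function name are made up for this example and are not identifiers from ParaQuery; the feature string is the one from the French-pivot example rule given in this document. It also assumes the logarithm is the natural log.

```python
import math

# 0-based positions of the features that are not ignored, following the
# ordered feature list. The names are illustrative, not ParaQuery identifiers.
FEATURE_INDEX = {
    "is_identity": 2,
    "neg_log_p_target_given_source": 3,
    "neg_log_p_source_given_target": 4,
    "source_len": 7,
    "target_len": 8,
    "len_diff": 9,
    "is_lexical": 10,
}


def named_features(feature_field):
    """Map a whitespace-separated feature string to the named subset."""
    values = [float(v) for v in feature_field.split()]
    return {name: values[i] for name, i in FEATURE_INDEX.items()}


example = (
    "0.0 1.0 0.0 2.6342840508626013 2.5230584157523763 "
    "14.266144534541269 6.260026717543335 3.0 2.0 -1.0 1.0 "
    "0.0 0.0 3.833333333333333 0.1353352832366127 0.0 0.0"
)
feats = named_features(example)

# The score is a negative log probability, so exp(-score) recovers
# p(target|source), assuming a natural logarithm.
prob = math.exp(-feats["neg_log_p_target_given_source"])
print(feats["source_len"], feats["target_len"], feats["len_diff"], prob)
```

For "accidents at sea" and "maritime accidents", the word counts come out as 3 and 2 with a difference of -1, which is consistent with reading the list in order.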
Note that we have made some modifications to the Joshua paraphrase extractor (e.g., some additional filtering options) and, therefore, we bundle our modified version with ParaQuery. Several fields produced by the paraphrase extractor are currently ignored by ParaQuery but may be useful in the future.
The pivots field is a list of the pivot phrases, each paired with the score that pivot contributed to the paraphrase pair.
Here's an example of a paraphrase rule generated using French as the pivot language:
[X] ||| accidents at sea ||| maritime accidents ||| 0.0 1.0 0.0 2.6342840508626013 2.5230584157523763 14.266144534541269 6.260026717543335 3.0 2.0 -1.0 1.0 0.0 0.0 3.833333333333333 0.1353352832366127 0.0 0.0 ||| ["accidents maritimes:0.07177033492822964"]
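To make the line format concrete, here is a minimal parsing sketch for a rule like the one above. It is not ParaQuery's own parser: the `ParaphraseRule` type and `parse_rule` function are hypothetical names, and the sketch assumes that the pivots field is always a JSON-style list of "phrase:score" strings, as it is in this one example.

```python
import json
from typing import NamedTuple


class ParaphraseRule(NamedTuple):
    source: str
    target: str
    features: list
    pivots: list  # (pivot phrase, contributed score) pairs


def parse_rule(line):
    """Split one '|||'-delimited rule line into its typed parts."""
    fields = [f.strip() for f in line.rstrip("\n").split("|||")]
    if len(fields) != 5:
        raise ValueError(f"expected 5 fields, got {len(fields)}")
    _lhs, source, target, feature_field, pivot_field = fields
    features = [float(v) for v in feature_field.split()]
    pivots = []
    # Assumption: the pivot list is valid JSON, one "phrase:score" string each.
    for entry in json.loads(pivot_field):
        # Split on the last colon in case the phrase itself contains one.
        phrase, _, score = entry.rpartition(":")
        pivots.append((phrase, float(score)))
    return ParaphraseRule(source, target, features, pivots)


EXAMPLE_LINE = (
    "[X] ||| accidents at sea ||| maritime accidents ||| "
    "0.0 1.0 0.0 2.6342840508626013 2.5230584157523763 "
    "14.266144534541269 6.260026717543335 3.0 2.0 -1.0 1.0 "
    "0.0 0.0 3.833333333333333 0.1353352832366127 0.0 0.0 ||| "
    '["accidents maritimes:0.07177033492822964"]'
)
rule = parse_rule(EXAMPLE_LINE)
print(rule.source, "->", rule.target, "via", rule.pivots)
```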
If you would like to use ParaQuery right out of the box, four databases are already available. These databases are generated from the European Parliament bilingual corpora. The four ParaQuery databases available use the following languages as pivots:
Note that each of the above links is to a file called .paradb, which is the SQLite database generated from the respective bilingual corpus for use with ParaQuery. Since they are all named .paradb, you should put them in separate directories. To generate your own databases from your own data, please read on.
The code to generate the gzipped paraphrase rules in the format that's currently readable by ParaQuery is also included here. To generate pivoted paraphrases, you need three files: the file containing the foreign language sentences (sentences.fr), the file containing the corresponding English sentences (sentences.en), and, finally, a file (sentences.align) containing the word alignments between the sentences. To generate the word alignments, you can use a word aligner such as the Berkeley Word Aligner. The alignments need to be in the following format:
0-0 1-1 1-2 2-3
where the first number in each pair is a word index into the foreign language sentence and the second number is a word index into the English sentence. Note also that both sentences.fr and sentences.en must be tokenized.
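Before running the extraction pipeline, it can help to sanity-check that each alignment line is consistent with its sentence pair. The sketch below is one way to do that; it is not part of ParaQuery, the function names are hypothetical, and it assumes zero-based word indices and whitespace-separated tokens. The French/English sentence pair is invented for illustration.

```python
def parse_alignment(line):
    """Turn '0-0 1-1 1-2 2-3' into [(0, 0), (1, 1), (1, 2), (2, 3)]."""
    pairs = []
    for token in line.split():
        foreign_idx, english_idx = token.split("-")
        pairs.append((int(foreign_idx), int(english_idx)))
    return pairs


def check_alignment(foreign, english, alignment):
    """Return True if every aligned index falls inside its tokenized sentence."""
    n_foreign = len(foreign.split())
    n_english = len(english.split())
    return all(
        0 <= f < n_foreign and 0 <= e < n_english
        for f, e in parse_alignment(alignment)
    )


# Invented sentence pair: "viendra" (index 1) aligns to both "will" and "come".
ok = check_alignment("il viendra demain", "he will come tomorrow", "0-0 1-1 1-2 2-3")
print(ok)
```

A check like this catches the common case where the alignment file and the sentence files have drifted out of sync, for example after re-tokenizing one side only.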
Please note that if you want to generate paraphrases using one of the other languages in the Europarl corpus, you do not need to do much work. Chris Callison-Burch has the files from each of the 13 languages nicely processed and available for download here as part of his paraphrasing software.
Once these files are ready, paraphrase rule files can be created as follows:
- Prepare data to run through the Thrax offline grammar extractor (create_thrax_data.sh is bundled with ParaQuery under scripts/): create_thrax_data.sh sentences.fr sentences.en sentences.align > sentences.input
- Run Thrax to extract the paraphrase grammar: hadoop jar thrax.jar hiero.conf <outdir> >& thrax.log, where the library thrax.jar comes bundled with ParaQuery under the lib/ directory and so does the configuration file hiero.conf. The only option you should need to modify is input-file in hiero.conf, which should point to sentences.input. If you want to modify the other options, read more about Thrax here. <outdir> is your desired output directory.
- Get the final Hadoop output in the current directory: hadoop fs -getmerge <outdir>/final ./rules.gz
- Sort the generated paraphrase rules by the source side: zcat rules.gz | sort -t'|' -k1,4 | gzip > rules-sorted.gz
- Run the paraphrase grammar builder (note that joshua.jar is bundled with ParaQuery under lib/): (java -Dfile.encoding=UTF8 -Xmx8g -classpath joshua.jar joshua.tools.BuildParaphraseGrammarWithPivots -g rules-sorted.gz | gzip > para-grammar.gz) 2>build_para.log
- Sort by both source and target side: zcat para-grammar.gz | sort -t'|' -k4,7 | gzip > para-grammar-sorted.gz
- Aggregate paraphrase rules (sum duplicate rules that you might get from different pivots): java -Dfile.encoding=UTF8 -Xmx8g -classpath joshua.jar joshua.tools.AggregateParaphraseGrammarWithPivots -g para-grammar-sorted.gz | gzip > final-para-grammar.gz
- Sort by the source side: zcat final-para-grammar.gz | sort -t'|' -k1,4 | gzip > final-para-grammar-sorted.gz
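The sort before the aggregation step matters because a streaming aggregator can only merge duplicates that sit next to each other. The sketch below illustrates that idea with toy data; it is not the implementation of joshua.tools.AggregateParaphraseGrammarWithPivots, and the row layout, pivot names, and scores are all invented for the example.

```python
from itertools import groupby
from operator import itemgetter

by_pair = itemgetter(0, 1)  # key on (source, target)


def aggregate(rows):
    """Sum the scores of adjacent rows that share (source, target)."""
    merged = []
    for (source, target), group in groupby(rows, key=by_pair):
        group = list(group)
        total = sum(row[3] for row in group)
        pivots = [row[2] for row in group]
        merged.append((source, target, total, pivots))
    return merged


# Toy rows: (source, target, pivot, score contributed by that pivot).
rows = [
    ("accidents at sea", "maritime accidents", "pivot A", 0.5),
    ("accidents at sea", "sea accidents", "pivot A", 0.125),
    ("accidents at sea", "maritime accidents", "pivot B", 0.25),
]

# Without sorting, the two "maritime accidents" rows are not adjacent,
# so they are left as separate entries.
unsorted_result = aggregate(rows)

# After sorting on (source, target), duplicates are adjacent and get summed.
sorted_result = aggregate(sorted(rows, key=by_pair))
print(len(unsorted_result), len(sorted_result))
```

Processing sorted input in a single pass keeps memory use independent of the file size, which is presumably why the pipeline sorts on disk with sort rather than loading all rules at once.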
Once the gzipped paraphrase file has been generated, it can be easily converted to the SQLite database from inside ParaQuery:
- Run paraquery (the launching script provided).
- At the resulting prompt, run the following command, which will create a .paradb file in the current directory: index final-para-grammar-sorted.gz
- If a .paradb file exists in the current directory, paraquery will automatically attach it and output a message when starting up. Otherwise, the path to the .paradb file must be provided as an argument.
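For readers curious about what this kind of conversion involves, the following sketch loads a gzipped rule file into SQLite and queries it by source phrase. It is only an illustration: the table name, column names, and `build_index` function are hypothetical and do not reflect the actual schema of a .paradb file or the code behind the index command. It stores the fourth feature, -log p(target|source), as the score.

```python
import gzip
import os
import sqlite3
import tempfile


def build_index(gz_path, db_path):
    """Load source, target and score from a gzipped rule file into SQLite."""
    conn = sqlite3.connect(db_path)
    # Hypothetical schema, chosen for this sketch only.
    conn.execute("CREATE TABLE paraphrase (source TEXT, target TEXT, score REAL)")
    rows = []
    with gzip.open(gz_path, "rt", encoding="utf-8") as handle:
        for line in handle:
            fields = [f.strip() for f in line.split("|||")]
            if len(fields) != 5:
                continue  # skip malformed lines
            score = float(fields[3].split()[3])  # -log p(target|source)
            rows.append((fields[1], fields[2], score))
    conn.executemany("INSERT INTO paraphrase VALUES (?, ?, ?)", rows)
    conn.execute("CREATE INDEX idx_source ON paraphrase (source)")
    conn.commit()
    return conn


EXAMPLE_LINE = (
    "[X] ||| accidents at sea ||| maritime accidents ||| "
    "0.0 1.0 0.0 2.6342840508626013 2.5230584157523763 "
    "14.266144534541269 6.260026717543335 3.0 2.0 -1.0 1.0 "
    "0.0 0.0 3.833333333333333 0.1353352832366127 0.0 0.0 ||| "
    '["accidents maritimes:0.07177033492822964"]'
)

with tempfile.TemporaryDirectory() as tmp:
    gz_path = os.path.join(tmp, "toy-rules.gz")
    with gzip.open(gz_path, "wt", encoding="utf-8") as out:
        out.write(EXAMPLE_LINE + "\n")
    conn = build_index(gz_path, ":memory:")
    # Lower -log p means a more probable paraphrase, so sort ascending.
    hits = conn.execute(
        "SELECT target, score FROM paraphrase WHERE source = ? ORDER BY score",
        ("accidents at sea",),
    ).fetchall()
    conn.close()

print(hits)
```

Indexing the source column is what makes per-phrase lookups fast once the table holds millions of rules.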
Once you have a database loaded, you can use all the commands that ParaQuery supports. Please read the user manual for a detailed explanation of how to use ParaQuery.
We would like to thank Juri Ganitkevitch, Jonny Weese, and Chris Callison-Burch for all their help and guidance during the development of ParaQuery.