Clustering Tutorial
- Have crawl segments from Apache Nutch in a consistent format.
- Have the latest version of autoext-spark-xx-SNAPSHOT.jar. Visit Build Instructions for building the sources to obtain either an executable or spark-submit jar.
The -master local argument will be used in the following steps. To run these jobs in cluster mode, start the job with the spark-submit command instead of java -jar, and omit the -master local argument, as shown in the sketch below.
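For instance, the partition job described later in this tutorial could be launched either way. The spark-submit line below is only a sketch: the main class name and cluster master are placeholders that depend on your build and cluster setup.
# Local mode, with an embedded Spark master:
java -jar autoext-spark-0.2-SNAPSHOT.jar partition -in segments/content/ -out out -master local
# Cluster mode (sketch): same goal and arguments, launched via spark-submit,
# with -master local dropped; <main-class> and <cluster-master> are placeholders:
spark-submit --class <main-class> --master <cluster-master> \
    autoext-spark-0.2-SNAPSHOT.jar partition -in segments/content/ -out out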
Available goals
$ java -jar autoext-spark-0.2-SNAPSHOT.jar help
Usage
Commands::
similarity - Computes similarity between documents.
createseq - Creates a sequence file (compatible with Nutch Segment) from raw HTML files.
partition - Partitions Nutch Content based on host names.
grep - Greps for the records which contain url and content type filters.
help - Prints this help message.
merge - Merges (smaller) part files into one large sequence file.
dedup - Removes duplicate documents (exact url matches).
d3export - Exports clusters into the most popular d3js format.
keydump - Dumps all the keys of sequence files(s).
sncluster - Cluster using Shared near neighbor algorithm.
simcombine - Combines two similarity measures on a linear scale.
This step is optional: if you are clustering the output of Apache Nutch, which produces SequenceFiles, or your data is already in SequenceFiles, skip this step.
This tool is designed to work on a Hadoop/Spark backend, where SequenceFiles are used to store data efficiently. If you have a bunch of raw HTML files, use java -jar autoext-spark-0.2-SNAPSHOT.jar createseq to convert them into a sequence file.
Usage
java -jar autoext-spark/target/autoext-spark-0.2-SNAPSHOT.jar createseq
-in VAL : path to directory having html pages
-out VAL : path to output Sequence File
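For example, a local run might look like the following; the directory names are placeholders:
java -jar autoext-spark/target/autoext-spark-0.2-SNAPSHOT.jar createseq \
    -in raw-html-pages/ \
    -out data/pages.seq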
Clustering is a computationally expensive job! So it is better to partition the dataset down to the documents that are interesting to cluster. For instance, there is no need to cluster images and other non-HTML web pages by DOM structure and style. This step partitions data based on domain names and content types.
Usage
$ java -jar autoext-spark-0.2-SNAPSHOT.jar partition
Option "-out" is required
-app (--app-name) VAL : Name for spark context. (default: ContentPartitioner)
-in VAL : path to a file/folder having input data
-list VAL : path to a file which contains many input paths (one
path per line).
-locallist : When this flag is set the -list is forced to treat as
local file. By default the list is read from
distributed filesystem when applicable (default:
false)
-master (--master) VAL : Spark master. This is not required when job is
started with spark-submit
-out VAL : Path to file/folder where the output shall be stored
Example 1: To partition a single segment:
java -jar autoext-spark-0.2-SNAPSHOT.jar partition \
    -in nutch-segments/20151013204832/content/ \
    -out partition1 -master local
Example 2: To partition multiple segments:
Note: To do this, all the segment paths need to be kept in a text file, which is then supplied as -list input.txt instead of -in. The expected format is one segment path per line.
java -jar autoext-spark-0.2-SNAPSHOT.jar partition \
    -list input.txt \
    -out partition2 -master local
Note: In case you missed it, the /content/ suffix is required on the segment paths.
Pick all the paths you are interested in clustering from the above partition step and put them into a text file, say paths.txt. This can be done easily with:
find ../partition1/ -regex '.*ml' -type d > paths.txt
Usage
java -jar autoext-spark-0.2-SNAPSHOT.jar similarity
Option "-func" is required
-app (--app-name) VAL : Name for spark context. (default:
ContentSimilarityComputer)
-func VAL : Similarity function. Valid function names =
{structure, style}
-in VAL : path to a file/folder having input data
-list VAL : path to a file which contains many input paths (one
path per line).
-locallist : When this flag is set the -list is forced to treat as
local file. By default the list is read from
distributed filesystem when applicable (default:
false)
-master (--master) VAL : Spark master. This is not required when job is
started with spark-submit
-out VAL : Path to file/folder where the output shall be stored
Example 1: Compute style similarity
java -jar autoext-spark-0.2-SNAPSHOT.jar similarity -func style \
-list paths.txt -out results/style -master local
Note: This step takes a long time to complete. Pick a small dataset for testing in local mode. If you have a large dataset, then distributed mode is the way to go!
Example 2: Compute structure similarity
java -jar autoext-spark-0.2-SNAPSHOT.jar similarity -func structure \
-list paths.txt -out results/structure -master local
Usage
$ java -jar autoext-spark-0.2-SNAPSHOT.jar simcombine
Option "-in1" is required
-app (--app-name) VAL : Name for spark context. (default: SimilarityCombiner)
-in1 VAL : Path to similarity Matrix 1 (Expected : saved
MatrixEntry RDD).
-in2 VAL : Path to Similarity Matrix 2 (Expected : saved
MatrixEntry RDD)
-master (--master) VAL : Spark master. This is not required when job is
started with spark-submit
-out VAL : Path to output file/folder where the result
similarity matrix shall be stored.
-weight N : Weight/Scale for combining the similarities. The
expected range is [0.0, 1.0]. The combining step is
out = in1 * weight + (1.0 - weight) * in2
Example
java -jar autoext-spark-0.2-SNAPSHOT.jar simcombine \
-in1 results/structure -in2 results/style \
-weight 0.5 -out results/combined -master local
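As a quick sanity check of the formula: with -weight 0.5, a pair of documents whose structure similarity (in1) is 0.8 and style similarity (in2) is 0.6 gets a combined similarity of 0.5 * 0.8 + 0.5 * 0.6 = 0.7. The similarity values here are made up for illustration.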
Usage
java -jar autoext-spark-0.2-SNAPSHOT.jar sncluster
Option "-out" is required
-app (--app-name) VAL : Name for spark context. (default:
SharedNeighborCuster)
-in VAL : path to a file/folder having input data
-list VAL : path to a file which contains many input
paths (one path per line).
-locallist : When this flag is set the -list is forced to
treat as local file. By default the list is
read from distributed filesystem when
applicable (default: false)
-master (--master) VAL : Spark master. This is not required when job
is started with spark-submit
-out VAL : Path to file/folder where the output shall be
stored
-share (--sharingThreshold) N : if the percent of similar neighbors in
clusters exceeds this value, then those
clusters will be collapsed/merged into same
cluster. Range:[0.0, 1.0] (default: 0.8)
-sim (--similarityThreshold) N : if two items have similarity above this
value, then they will be treated as
neighbors. Range[0.0, 1.0] (default: 0.7)
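To make the two thresholds concrete with the default values: two documents with similarity 0.75 are treated as neighbors (0.75 is above -sim 0.7), and two clusters are collapsed into one only when the fraction of neighbors they share exceeds -share 0.8.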
Example : Style clusters
java -jar target/autoext-spark-0.2-SNAPSHOT.jar sncluster \
-in results/style \
-out results/clusters -master local \
-share 0.8 -sim 0.8
Example : Structural clusters
java -jar target/autoext-spark-0.2-SNAPSHOT.jar sncluster \
-in results/structure \
-out results/clusters -master local \
-share 0.8 -sim 0.8
Example : Structure and style combined clusters
java -jar target/autoext-spark-0.2-SNAPSHOT.jar sncluster \
-in results/combined \
-ids results/sim-ids \
-out results/clusters -master local \
-share 0.8 -sim 0.8
Example : Export the clusters to d3js JSON
java -jar autoext-spark-0.2-SNAPSHOT.jar d3export \
-in results/clusters/ -out results/clusters.d3.json -master local
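The exact schema of the export is not shown here, but the d3js circle-packing chart used in the next step typically consumes a hierarchical name/children JSON, so the exported file presumably resembles the following sketch (names and sizes are made up):
{
  "name": "clusters",
  "children": [
    {"name": "http://example.com/page1.html", "size": 12},
    {"name": "http://example.com/page2.html", "size": 7}
  ]
}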
Use the JSON file generated in the previous step and visualize it using the sample d3js charts in visuals/webapp/circles-tooltip.html
To load the charts, you may simply launch google-chrome $PWD/visuals/webapp/circles-tooltip.html from the root of the project.
Once the web page is loaded, use the file chooser dialogue to choose your clusters JSON file.
The following actions are supported in the UI:
+ Left click on a circle: zooms inside the cluster
+ Hover on a circle: shows a tooltip
+ Click on the outer circle: zooms out to the upper level
+ Right click on a circle: opens a web page if the cluster name is an HTTP URL (in this case, yes)
---
There are a few more tools developed to test and debug the above tasks. Hopefully they will be useful for anyone experimenting with this clustering toolkit.
This tool filters content in sequence files, matching records against the specified -urlfilter and/or -contentfilter patterns.
Usage
java -jar autoext-spark-0.2-SNAPSHOT.jar grep
Option "-out" is required
-app (--app-name) VAL : Name for spark context. (default: ContentGrep)
-contentfilter VAL : Content type filter substring
-in VAL : path to a file/folder having input data
-list VAL : path to a file which contains many input paths (one
path per line).
-locallist : When this flag is set the -list is forced to treat as
local file. By default the list is read from
distributed filesystem when applicable (default:
false)
-master (--master) VAL : Spark master. This is not required when job is
started with spark-submit
-out VAL : Path to file/folder where the output shall be stored
-urlfilter VAL : Url filter substring
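For instance, a run like the following (the filter values and paths are placeholders) would keep only the records whose URL contains example.com and whose content type contains text/html:
java -jar autoext-spark-0.2-SNAPSHOT.jar grep \
    -in partition1/ \
    -urlfilter example.com -contentfilter text/html \
    -out results/grepped -master local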
Merges multiple sequence files into one large sequence file with a configurable number of parts (use the -numparts argument below).
Usage
java -jar autoext-spark-0.2-SNAPSHOT.jar merge
Option "-out" is required
-app (--app-name) VAL : Name for spark context. (default: ContentMerge)
-in VAL : path to a file/folder having input data
-list VAL : path to a file which contains many input paths (one
path per line).
-locallist : When this flag is set the -list is forced to treat as
local file. By default the list is read from
distributed filesystem when applicable (default:
false)
-master (--master) VAL : Spark master. This is not required when job is
started with spark-submit
-numparts N : Number of parts in the output. Ex: 1, 2, 3....
Optional => default
-out VAL : Path to file/folder where the output shall be stored
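For example, to merge all the part files listed in paths.txt into a single output (the output path is a placeholder):
java -jar autoext-spark-0.2-SNAPSHOT.jar merge \
    -list paths.txt -numparts 1 \
    -out results/merged -master local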
This tool outputs only the unique records from a sequence file. Uniqueness is determined by the keys (i.e. URLs) only.
Usage
java -jar target/autoext-spark-0.2-SNAPSHOT.jar dedup
Option "-out" is required
-app (--app-name) VAL : Name for spark context. (default: DeDuplicator)
-in VAL : path to a file/folder having input data
-list VAL : path to a file which contains many input paths (one
path per line).
-locallist : When this flag is set the -list is forced to treat as
local file. By default the list is read from
distributed filesystem when applicable (default:
false)
-master (--master) VAL : Spark master. This is not required when job is
started with spark-submit
-out VAL : Path to file/folder where the output shall be stored
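For example, assuming a merged sequence file from the previous tool (paths are placeholders):
java -jar target/autoext-spark-0.2-SNAPSHOT.jar dedup \
    -in results/merged -out results/deduped -master local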
This tool dumps the keys of sequence file(s) into a plain text file.
Usage
java -jar target/autoext-spark-0.2-SNAPSHOT.jar keydump
Option "-out" is required
-app (--app-name) VAL : Name for spark context. (default: KeyDumper)
-in VAL : path to a file/folder having input data
-list VAL : path to a file which contains many input paths (one
path per line).
-locallist : When this flag is set the -list is forced to treat as
local file. By default the list is read from
distributed filesystem when applicable (default:
false)
-master (--master) VAL : Spark master. This is not required when job is
started with spark-submit
-out VAL : Path to file/folder where the output shall be stored
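For example, to dump the URLs of the deduplicated records (paths are placeholders):
java -jar target/autoext-spark-0.2-SNAPSHOT.jar keydump \
    -in results/deduped -out results/keys -master local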