Note: This repo is deprecated. An updated version can be found in the cc-notebooks repo.
The goal of this repo is to allow users to easily and interactively view statistics about the Common Crawl data set.
Python environment (3.x)
Jupyter Notebook
Clone the cc-webgraph repository and build the Java tools with `mvn package`.
You may then clone this repository into the `cc-webgraph` root, or choose your own location. If you choose your own location, remember to update any path-related variables.
Once this repository is cloned, download and run the `webgraph_commands.sh` script. This script downloads the files that describe the web graph.
A deeper description of these files may be found in the blog. If you cloned this repository somewhere other than the `cc-webgraph` root, you may need to change the value of the `WG` variable to the path where you cloned `cc-webgraph`.
The script runs many WebGraph commands; their documentation may be found [here](http://webgraph.di.unimi.it/docs/).
Once the `webgraph_commands.sh` script has run, many new files should appear, such as the `indegree` and `outdegree` files. These files may also be found in the blog.
You are now ready to run the notebook! Simply load it and take a look.
For any existing code, remember to double check any path-related variables and change them to match your own directory structure, if required.
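One way to catch a wrong path early is to resolve it once and fail with a clear message before any processing starts. The following is a minimal sketch, not part of the notebook: the helper `resolve_graph_file` and its parameters `base_dir`, `graph_name`, and `suffix` are hypothetical names chosen for illustration.

```python
from pathlib import Path


def resolve_graph_file(base_dir, graph_name, suffix):
    """Return base_dir/graph_name.suffix, or raise if the file is missing.

    All three parameters are hypothetical; substitute the directory and
    base name that your own run of the download script produced.
    """
    path = Path(base_dir).expanduser() / f"{graph_name}.{suffix}"
    if not path.is_file():
        raise FileNotFoundError(
            f"Expected {path} but it does not exist; check your path variables."
        )
    return path
```

The returned path could then be assigned to the notebook's `file_name` variable, so a mistyped directory is reported immediately instead of surfacing later as an empty plot.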
The existing notebook code processes and plots an indegree frequency scatter plot using the `.indegree` file, in which the i-th line contains the number of pay-level domains with indegree i.
If you would like to analyze another graph characteristic (such as outdegrees), simply change the `file_name` variable so that it points to the correct file.
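The processing described above can be sketched roughly as follows. This is an illustrative outline rather than the notebook's actual code: the function names are hypothetical, and it assumes the first line of the file corresponds to degree 0 (adjust `first_degree` if your files are indexed differently). Plotting assumes matplotlib is installed.

```python
def read_degree_distribution(file_name, first_degree=0):
    """Read a degree frequency file into (degree, count) pairs.

    Assumption: line i of the file holds the number of nodes with degree
    first_degree + i. Degrees with a count of zero are skipped.
    """
    pairs = []
    with open(file_name) as handle:
        for offset, line in enumerate(handle):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines, e.g. a trailing newline
            count = int(line)
            if count > 0:
                pairs.append((first_degree + offset, count))
    return pairs


def plot_degree_distribution(pairs, label):
    """Scatter-plot degree frequencies on log-log axes."""
    import matplotlib.pyplot as plt  # imported here so parsing works without it

    # A logarithmic axis cannot display degree 0, so drop that point.
    visible = [(d, c) for d, c in pairs if d > 0]
    plt.scatter([d for d, _ in visible], [c for _, c in visible], s=4)
    plt.xscale("log")
    plt.yscale("log")
    plt.xlabel(label)
    plt.ylabel("number of domains")
    plt.show()
```

Switching from indegrees to outdegrees then only requires passing a different file, for example `plot_degree_distribution(read_degree_distribution(file_name), "outdegree")` with `file_name` pointing at the `.outdegree` file.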
Common Errors:
- For Windows users, the `run_webgraph.sh` file from `cc-webgraph` may not work due to Windows/Linux syntactical differences. See the Windows-friendly `run_webgraph.sh` script attached in this repo.