Note: This repo is deprecated. An updated version can be found in the cc-notebooks repo.

Common Crawl statistics notebooks

Introduction

The goal of this repo is to allow users to easily and interactively view statistics about the Common Crawl data set.

Requirements

Python environment (3.x)

Jupyter Notebook

Quick Start

Clone the cc-webgraph repository and build its Java tools with mvn package.

You may then clone this repository into the cc-webgraph root or into a location of your own choosing. If you choose your own location, remember to update any path variables accordingly. Once this repository is cloned, download and run the following .sh script, which downloads the files that describe the web graph. A more detailed description of these files can be found in the blog. If you cloned this repository somewhere other than the cc-webgraph root, you may need to change the value of the WG variable to the path where you cloned cc-webgraph.

The script runs many WebGraph commands; their documentation can be found [here](http://webgraph.di.unimi.it/docs/).

Once the webgraph_commands.sh script has run, many new files should appear, such as the indegree and outdegree files. These files can also be found in the blog.

You are now ready to run the notebook! Simply load it and take a look. Double-check any path variables before running it.

Notebook

For any existing code, remember to double-check any path variables and change them to match your own directory structure if required.

The existing notebook code processes the .indegree file and draws an indegree frequency scatter plot. In this file, the i-th line contains the number of pay-level domains with an indegree of i.
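The file format described above can be read with a few lines of Python. The following is a minimal sketch, not the notebook's actual code: the function names are hypothetical, and it assumes one integer per line with the first line corresponding to a degree of 0 (adjust the offset if your file differs). Plotting needs matplotlib, which is imported only when the plot function is called.

```python
def read_degree_frequencies(path):
    """Return (degree, count) pairs for every degree with a non-zero count.

    Assumption: one integer per line, and the first line is degree 0.
    """
    pairs = []
    with open(path, encoding="utf-8") as handle:
        for degree, line in enumerate(handle):
            line = line.strip()
            if not line:
                continue  # ignore empty lines, e.g. at the end of the file
            count = int(line)
            if count > 0:
                pairs.append((degree, count))
    return pairs


def plot_degree_frequencies(pairs, title="Indegree frequency"):
    """Draw a log-log scatter plot of the pairs (requires matplotlib)."""
    import matplotlib.pyplot as plt

    # A degree of 0 cannot be shown on a logarithmic axis, so it is dropped.
    positive = [(degree, count) for degree, count in pairs if degree > 0]
    fig, ax = plt.subplots()
    ax.scatter([d for d, _ in positive], [c for _, c in positive], s=4)
    ax.set_xscale("log")
    ax.set_yscale("log")
    ax.set_xlabel("degree")
    ax.set_ylabel("number of pay-level domains")
    ax.set_title(title)
    return fig
```

In a notebook cell, this could be used as plot_degree_frequencies(read_degree_frequencies(file_name)), where file_name points to your .indegree file.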

If you would like to analyze the metrics of another graph characteristic (such as outdegrees), it is as simple as changing the file_name variable so that it points to the correct file.
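As a rough illustration of switching files, the sketch below builds the path for either degree file from a common graph basename. The helper name, the example basename, and the ".outdegree" extension are assumptions; check the names of the files that the script actually produced in your directory.

```python
from pathlib import Path


def degree_file(graph_basename, kind="indegree"):
    """Return the path of the degree-frequency file for a graph.

    Assumption: the files are named "<basename>.indegree" and
    "<basename>.outdegree".
    """
    if kind not in ("indegree", "outdegree"):
        raise ValueError(f"unknown degree kind: {kind!r}")
    # Plain concatenation, because the basename itself may contain dots.
    return Path(f"{graph_basename}.{kind}")


# Hypothetical basename; replace it with the path to your own graph files.
file_name = degree_file("graphs/example-domain", kind="outdegree")
```

Setting kind back to "indegree" points file_name at the indegree file again.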

FAQ

Common Errors:

For Windows users, the run_webgraph.sh file from cc-webgraph may not work due to syntactical differences between Windows and Linux. See the Windows-friendly run_webgraph.sh script included in this repo.
