commoncrawl/commoncrawl_notebooks


Note: This repo is deprecated. An updated version can be found in the cc-notebooks repo.

Common Crawl statistics notebooks

Introduction

The goal of this repo is to allow users to easily and interactively view statistics about the Common Crawl data set.

Requirements

Python environment (3.x)

Jupyter Notebook

Quick Start

Clone the cc-webgraph repository and build the Java tools with mvn package.

You may then clone this repository into the cc-webgraph root, or choose your own location. If you choose your own location, remember to update any path variables. Once this repository is cloned, download and run the webgraph_commands.sh script, which downloads the files that describe the web graph. A more detailed description of these files may be found in the blog. If you cloned this repository somewhere other than the cc-webgraph root, you may need to set the WG variable to the path where you cloned cc-webgraph.

The script runs many WebGraph commands. Their documentation may be found at http://webgraph.di.unimi.it/docs/.

Once the webgraph_commands.sh script has run, several new files should appear, such as the indegree and outdegree files. These files may also be found in the blog.

You are now ready to run the notebook! Simply load it and take a look. Double-check any path variables first.

Notebook

Before running any existing code, double-check every path variable and change it to match your own directory structure, if required.
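A quick way to catch path mistakes before running the notebook cells is to check that the expected input files exist. The sketch below is not part of the notebook; the function name, the directory, and the graph base name are all hypothetical placeholders, and the only file extensions assumed are the .indegree and .outdegree files mentioned in this README.

```python
from pathlib import Path


def missing_inputs(graph_dir, graph_name, extensions=(".indegree", ".outdegree")):
    """Return the expected degree files that are not present under graph_dir."""
    base = Path(graph_dir).expanduser()
    expected = [base / f"{graph_name}{ext}" for ext in extensions]
    return [p for p in expected if not p.is_file()]


if __name__ == "__main__":
    # Hypothetical values: replace with your own directory and graph base name.
    for path in missing_inputs("~/cc-webgraph", "my-graph"):
        print(f"missing: {path}")
```

An empty result means every expected file was found at the configured location.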

The existing notebook code processes the .indegree file and plots an indegree frequency scatter plot. Line i of this file contains the number of pay-level domains with indegree i.
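The file layout described above can be read and plotted along the following lines. This is a minimal sketch, not the notebook's actual code: the function names are made up for illustration, it assumes one integer per line with the zero-based line index as the degree, and the log-log axes are an assumption commonly made for heavy-tailed degree distributions.

```python
def read_degree_frequencies(path):
    """Return (degrees, counts) for every degree with a nonzero count.

    Assumes one integer per line, with the zero-based line index as the degree.
    """
    degrees, counts = [], []
    with open(path) as fh:
        for degree, line in enumerate(fh):
            line = line.strip()
            if not line:
                continue
            count = int(line)
            if count > 0:
                degrees.append(degree)
                counts.append(count)
    return degrees, counts


def plot_degree_frequencies(degrees, counts, title="Indegree frequency"):
    """Scatter-plot degree frequencies on log-log axes and return the figure."""
    # Imported here so the parser above can be used without matplotlib installed.
    import matplotlib.pyplot as plt

    # Degree 0 cannot be placed on a logarithmic axis, so it is dropped here.
    xs = [d for d in degrees if d > 0]
    ys = [c for d, c in zip(degrees, counts) if d > 0]

    fig, ax = plt.subplots()
    ax.scatter(xs, ys, s=4)
    ax.set_xscale("log")
    ax.set_yscale("log")
    ax.set_xlabel("degree")
    ax.set_ylabel("number of pay-level domains")
    ax.set_title(title)
    return fig
```

In a notebook cell, calling plot_degree_frequencies(*read_degree_frequencies(file_name)) would render the figure, where file_name points at your own .indegree file.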

If you would like to analyze the metrics of another graph characteristic (such as outdegrees), it is as simple as changing the file_name variable so that it points to the correct file.
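To illustrate that only the file path needs to change, here is a small sketch that summarizes any degree-frequency file. The function name and the example file names are hypothetical, and it again assumes that the zero-based line index is the degree.

```python
def summarize_degree_file(file_name):
    """Return node count, maximum degree, and mean degree for a frequency file."""
    nodes = 0
    total_degree = 0
    max_degree = 0
    with open(file_name) as fh:
        for degree, line in enumerate(fh):
            line = line.strip()
            if not line:
                continue
            count = int(line)
            nodes += count
            total_degree += degree * count
            if count > 0:
                max_degree = degree
    mean_degree = total_degree / nodes if nodes else 0.0
    return {"nodes": nodes, "max_degree": max_degree, "mean_degree": mean_degree}


# Hypothetical file names; only file_name changes between the two analyses.
# file_name = "my-graph.indegree"
# file_name = "my-graph.outdegree"
# print(summarize_degree_file(file_name))
```

Because the indegree and outdegree files share the same layout, the same function works for both without modification.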

FAQ

Common Errors:

  • For Windows users, the run_webgraph.sh file from cc-webgraph may not work due to syntactical differences between Windows and Linux. See the Windows-friendly run_webgraph.sh script included in this repo.
