[do not merge] add support for the columnar index #23

Open: wants to merge 11 commits into base: main
1 change: 1 addition & 0 deletions .gitignore
@@ -3,3 +3,4 @@ dist/
__pycache__
cdx_toolkit.egg-info
.coverage
.eggs/
44 changes: 44 additions & 0 deletions ATHENA.md
@@ -0,0 +1,44 @@
# Using cdx_toolkit's columnar index with Athena

## Installing

```
$ pip install cdx_toolkit[athena]
```

## Credentials and Configuration

In addition to AWS credentials, a few more configuration items are needed.

Credentials can be supplied in several ways. One is a `[profile cdx_athena]` section in `~/.aws/config` containing:

aws_access_key_id
aws_secret_access_key

s3_staging_dir= must point to a writable S3 location (TODO: explain how to clear this bucket)
schema_name= defaults to 'ccindex'; this is the database name, not the table name

region=us-east-1 # this is the default, and this is where CC's data is stored
# "When specifying a Region inline during client initialization, this property is named region_name."
s3_staging_dir=s3://dshfjhfkjhdfshjgdghj/staging
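
The settings above can be read with nothing more than the standard library. The following is a minimal sketch, not cdx_toolkit's actual loader: the helper name `load_athena_settings`, the example key values, and the bucket name are all hypothetical, and it assumes the settings live in the same `[profile cdx_athena]` section as the credentials. The two defaults (`us-east-1` and `ccindex`) come from the notes above.

```python
import configparser

# Defaults taken from the notes above; everything else must be configured.
DEFAULTS = {"region": "us-east-1", "schema_name": "ccindex"}


def load_athena_settings(config_text, profile="cdx_athena"):
    """Return the settings for one profile from AWS-config-style text.

    Hypothetical helper for illustration only.
    """
    # interpolation=None keeps '%' characters in values literal
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(config_text)
    section = f"profile {profile}"
    if not parser.has_section(section):
        raise KeyError(f"no [{section}] section found")
    settings = dict(DEFAULTS)
    settings.update(parser[section])  # explicit values override the defaults
    if "s3_staging_dir" not in settings:
        raise ValueError("s3_staging_dir is required and has no default")
    return settings


# Placeholder values, not real credentials or a real bucket.
EXAMPLE = """
[profile cdx_athena]
aws_access_key_id = EXAMPLEKEYID
aws_secret_access_key = EXAMPLESECRET
s3_staging_dir = s3://example-bucket/staging
"""

settings = load_athena_settings(EXAMPLE)
print(settings["region"], settings["schema_name"], settings["s3_staging_dir"])
```

To read a real file, pass the contents of `~/.aws/config` as `config_text`.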


## Initializing the database

asetup
asummary
get_all_crawls

## Arbitrary queries

asql
explain the partitions
explain how to override the safety-belt LIMIT
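
As an illustration of the two notes above, here is a rough sketch of a partition-restricted query and of one way a safety-belt LIMIT could work. It is not the code behind `asql`: the function `add_safety_limit` and the default of 1000 are hypothetical. The query assumes the table is named `ccindex` and uses column names from Common Crawl's published columnar index schema, where `crawl` and `subset` are the partition columns. Filtering on them keeps Athena from scanning every crawl, which matters because Athena bills by data scanned.

```python
import re

DEFAULT_LIMIT = 1000  # hypothetical default, not taken from cdx_toolkit


def add_safety_limit(sql, limit=DEFAULT_LIMIT):
    """Append a LIMIT clause unless the query already ends with one.

    Passing limit=None switches the safety belt off.
    """
    stripped = sql.strip().rstrip(";").rstrip()
    if limit is None or re.search(r"\blimit\s+\d+$", stripped, re.IGNORECASE):
        return stripped
    return f"{stripped} LIMIT {limit}"


# Restricting both partition columns limits the scan to one crawl and subset.
query = """
SELECT url, warc_filename, warc_record_offset, warc_record_length
FROM ccindex
WHERE crawl = 'CC-MAIN-2022-05'
  AND subset = 'warc'
  AND url_host_registered_domain = 'example.com'
"""

print(add_safety_limit(query))
```

A query that already ends in its own `LIMIT n` is passed through unchanged, so an explicit limit acts as the override in this sketch.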

## Iterating similar to the CDX index

## Generating subset WARCs from an sql query or iteration

## Clearing the staging directory

configure rclone
3 changes: 3 additions & 0 deletions CHANGELOG.md
@@ -1,3 +1,6 @@
- 0.9.34 (not yet tagged)
+ experimental support for CC's columnar index

- 0.9.33
+ rename master to main
+ drop python 3.5 testing because of setuptools-scm
6 changes: 5 additions & 1 deletion README.md
@@ -12,6 +12,10 @@ hides these differences as best it can. cdx_toolkit also knits
together the monthly Common Crawl CDX indices into a single, virtual
index.

Common Crawl also has a non-CDX "columnar index" hosted on AWS,
accessible via the (paid) Amazon Athena service. This index can be
queried using SQL, and has a few columns not present in the CDX index.

Finally, cdx_toolkit allows extracting archived pages from CC and IA
into WARC files. If you're looking to create subsets of CC or IA data
and then process them into WET or WAT files, this is a feature you'll
@@ -218,7 +222,7 @@ cdx_toolkit has reached the beta-testing stage of development.

## License

- Copyright 2018-2020 Greg Lindahl and others
+ Copyright 2018-2022 Greg Lindahl and others

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this software except in compliance with the License.
4 changes: 2 additions & 2 deletions azure-pipelines.yml
@@ -37,7 +37,7 @@ jobs:
echo "Note: using 3.6 because packages are deprecating 3.5 support"
pip install -r requirements.txt
fi
- pip --use-feature=in-tree-build install . .[test]
+ pip --use-feature=in-tree-build install . .[test] .[athena]
displayName: 'Install dependencies'

- script: |
@@ -74,7 +74,7 @@ jobs:

- script: |
python -m pip install --upgrade pip
- pip --use-feature=in-tree-build install . .[test]
+ pip --use-feature=in-tree-build install . .[test] .[athena]
displayName: 'Install dependencies'

- script: |