A package of tools for scraping websites via morph.io into the Data Together pipeline.
Copyright (C) 2017 Data Together
This program is free software: you can redistribute it and/or modify it under the terms of the GNU Affero General Public License as published by the Free Software Foundation, version 3.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the LICENSE file for details.
We would love involvement from more people! If you notice any errors or would like to submit changes, please see our Contributing Guidelines.
We use GitHub issues to track bugs and feature requests, and Pull Requests (PRs) to submit changes.
Refer to these resources to learn more about using GitHub, best practices for writing Python code, code linters that analyze your code for errors, and writing docstrings, which describe what functions do.
- How to use Git Version Control
- GitHub's "Hello, World" Guides
- PEP8 Python style guide
- Python linters, which automatically check your code for style errors
- Function documentation should follow the Google-style Python docstring format (a short example follows this list)
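
For example, a Google-style docstring for a hypothetical helper function (`fetch_page` is not part of this package, just an illustration) might look like:

```python
import requests

def fetch_page(url):
    """Download a page and return its HTML.

    Args:
        url (str): Address of the page to fetch.

    Returns:
        str: The raw HTML of the page.
    """
    return requests.get(url).text
```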
Tested with Python 2.7 and 3.6.
Install via pip:

```
pip install archivertools
```
In order to authenticate to the Data Together servers, make sure the environment variable MORPH_DT_API_KEY is set. To do this in morph.io, see the morph documentation on secret values.

For testing on your local system, you can set an environment variable within Python using the os package:

```python
import os
os.environ['MORPH_DT_API_KEY'] = 'the_text_of_your_dt_api_key'
```
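
A small guard like the following (not part of archivertools, just a defensive sketch) fails fast with a clear message if the key was never configured:

```python
import os

# Fail early if the API key is missing rather than partway through a run.
if 'MORPH_DT_API_KEY' not in os.environ:
    raise RuntimeError('MORPH_DT_API_KEY is not set; see the morph.io docs on secret values')
```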
The Archiver class provides the interface for saving data that will be ingested into Data Together. All data and files are stored in a local SQLite database called data.sqlite. Be sure to call Archiver.commit() at the end of your run to ensure that the data is ingested.
```python
from archivertools import Archiver

url = 'http://example.org'
UUID = '0000'
a = Archiver(url, UUID)
```
Register URLs found on the current page that should be ingested by the Data Together crawler:
```python
url = 'http://example.org/links'
a.addURL(url)
```
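
In practice you will usually collect these links with an HTML parser. This sketch assumes requests and BeautifulSoup, neither of which is required by archivertools:

```python
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin  # on Python 2.7: from urlparse import urljoin

base = 'http://example.org'
page = requests.get(base)
soup = BeautifulSoup(page.text, 'html.parser')
for link in soup.find_all('a', href=True):
    # Resolve relative hrefs to absolute URLs before handing them to the crawler.
    a.addURL(urljoin(base, link['href']))
```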
Add a local file to be uploaded to the Data Together pipeline; the file's hash is computed automatically:
```python
filename = 'example_file.csv'
comments = 'information about the file, such as encoding, metadata, etc.'  # optional
a.addFile(filename, comments)
```
Run this function at the end of your scraper to let Data Together know that your scraper has finished running. It authenticates with Data Together and begins the upload process:
```python
a.commit()
```
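
Putting it all together, a complete scraper run might look like this sketch (the URL, UUID, file name, and comment are placeholders carried over from the examples above):

```python
from archivertools import Archiver

# Assumes MORPH_DT_API_KEY is already set (as a morph.io secret value or locally).
a = Archiver('http://example.org', '0000')
a.addURL('http://example.org/links')                   # queue a child page for the crawler
a.addFile('example_file.csv', 'UTF-8 CSV with header')  # register a scraped file with an optional comment
a.commit()                                              # authenticate and upload everything
```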