A package of tools for scraping websites via morph.io into the Data Together pipeline.
Copyright (C) 2017 Data Together
This program is free software: you can redistribute it and/or modify it under the terms of the GNU Affero General Public License as published by the Free Software Foundation, version 3.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the LICENSE file for details.
We would love involvement from more people! If you notice any errors or would like to submit changes, please see our Contributing Guidelines.
We use GitHub issues to track bugs and feature requests, and Pull Requests (PRs) to submit changes.
Refer to these resources to learn more about using GitHub, best practices for writing Python code, code linters that analyze your code for errors, and writing docstrings, which describe what functions do.
- How to use Git Version Control
- GitHub's "Hello, World" Guides
- PEP8 Python style guide
- Python linters, which automatically check your code for style errors
- Function documentation should follow the Google-style Python docstring format (a short example follows this list)
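
For example, a Google-style docstring for a hypothetical helper function (`fetch_page` is not part of this package, just an illustration) might look like:

```python
import requests

def fetch_page(url):
    """Download a page and return its HTML.

    Args:
        url (str): Address of the page to fetch.

    Returns:
        str: The raw HTML of the page.
    """
    return requests.get(url).text
```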
Tested with Python 2.7 and 3.6.
Install via pip:

```
pip install archivertools
```
In order to authenticate to the Data Together servers, make sure the environment variable MORPH_DT_API_KEY is set. To do this in morph.io, see the morph documentation on secret values.

For testing on your local system, you can set an environment variable within Python using the os package:

```python
import os
os.environ['MORPH_DT_API_KEY'] = 'the_text_of_your_dt_api_key'
```
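
A small guard like the following (not part of archivertools, just a defensive sketch) fails fast with a clear message if the key was never configured:

```python
import os

# Fail early if the API key is missing rather than partway through a run.
if 'MORPH_DT_API_KEY' not in os.environ:
    raise RuntimeError('MORPH_DT_API_KEY is not set; see the morph.io docs on secret values')
```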
The Archiver class provides the interface for saving data that will be ingested into Data Together. All data and files are stored in a local SQLite database called data.sqlite. Be sure to call Archiver.commit() at the end of your run to ensure that the data is ingested.
```python
from archivertools import Archiver

url = 'http://example.org'
UUID = '0000'
a = Archiver(url, UUID)
```
Register URLs found on the current page that should be ingested by the Data Together crawler:
```python
url = 'http://example.org/links'
a.addURL(url)
```
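
In practice you will usually collect these links with an HTML parser. This sketch assumes requests and BeautifulSoup, neither of which is required by archivertools:

```python
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin  # on Python 2.7: from urlparse import urljoin

base = 'http://example.org'
page = requests.get(base)
soup = BeautifulSoup(page.text, 'html.parser')
for link in soup.find_all('a', href=True):
    # Resolve relative hrefs to absolute URLs before handing them to the crawler.
    a.addURL(urljoin(base, link['href']))
```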
Add a local file to be uploaded to the Data Together pipeline; the file's hash is computed automatically:
```python
filename = 'example_file.csv'
comments = 'information about the file, such as encoding, metadata, etc.'  # optional
a.addFile(filename, comments)
```
Run this function at the end of your scraper to let Data Together know that your scraper has finished running. It authenticates with Data Together and begins the upload process:
```python
a.commit()
```
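
Putting it all together, a complete scraper run might look like this sketch (the URL, UUID, file name, and comment are placeholders carried over from the examples above):

```python
from archivertools import Archiver

# Assumes MORPH_DT_API_KEY is already set (as a morph.io secret value or locally).
a = Archiver('http://example.org', '0000')
a.addURL('http://example.org/links')                   # queue a child page for the crawler
a.addFile('example_file.csv', 'UTF-8 CSV with header')  # register a scraped file with an optional comment
a.commit()                                              # authenticate and upload everything
```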