Disclaimer

A tool to extract historical data from the StackOverflow archive maintained by the Internat Archive Wayback Machine at https://web.archive.org/web/*/stackoverflow.com

Disclaimer

This project is basically a big hack that started as a simple script using wget. I don't make any guarantees whatsoever either regarding the program work, or the data provided here.

Use at your own risks. Read the LICENCE file.

Requirements

This requires Python 3.5 with Request, BeautifulSoup and Sqlite3 installed

It was tested on Linux Debian. The tool makes extensive use of the standard Python multiprocessing library, so depending on you OS it may--or may not--work.

How to run

python3.5 master.py

The master.py file spawns a couple of processes to:

download the cached file list from the Internet Archive CDX server,
download the individual files from the WaybackMachine,
parse them with BieautifulSoup to gather data
and store them in SQLite database

Which data?

StackOverflow, Inc openly provides several ways to access their raw DB data. Either interactively through SEDE, or by downloading their quarterly DB dump.

But there is one missing data: the page views. This tool was designed to extract the page view count from cached data on the Internet Archive.

Currently, the tool gathers the retrieval data, view count, question id, and associated tags for all cached StackOverflow web pages.

A dump of the DB as CSV is updated form my home server in the data directory of this repository.

Name		Name	Last commit message	Last commit date
Latest commit History 68 Commits
config		config
data		data
utils		utils
workers		workers
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
download.py		download.py
extract.py		extract.py
master.py		master.py
monitor.py		monitor.py
snapshots		snapshots

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Disclaimer

Requirements

How to run

Which data?

About

Releases

Packages

Languages

License

YesIKnowIT/SODump

Folders and files

Latest commit

History

Repository files navigation

Disclaimer

Requirements

How to run

Which data?

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages