RomaScraper

This scraper, written in Python 3, was intended to scrape an article site and download its articles for offline viewing in HTML format. This repository also includes scripts to convert the downloaded files to .docx. The main script (roma_scraper.py) can be easily modified to fetch specific content from any site.
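
As a rough illustration of what "fetching specific content" can look like, here is a minimal sketch that pulls the text out of every occurrence of one tag using only the standard library. It is not the code in roma_scraper.py; the names TagTextExtractor and extract_text are hypothetical, and the sample HTML is made up for the example.

```python
from html.parser import HTMLParser


class TagTextExtractor(HTMLParser):
    """Collect the text found inside every occurrence of one tag (hypothetical helper)."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.depth = 0    # how many target tags are currently open
        self.chunks = []  # text pieces collected so far

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == self.tag and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that sits inside the target tag.
        if self.depth > 0 and data.strip():
            self.chunks.append(data.strip())


def extract_text(html, tag):
    """Return the stripped text pieces found inside <tag> elements."""
    parser = TagTextExtractor(tag)
    parser.feed(html)
    parser.close()
    return parser.chunks


if __name__ == "__main__":
    sample = "<html><body><h1>Title</h1><p>One</p><p>Two <b>bold</b></p></body></html>"
    print(extract_text(sample, "p"))
```

Changing the tag passed to extract_text is the kind of small edit that retargets the extraction to different content.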

The results differ depending on which conversion method is used:

PowerShell: Slower to convert, and it sometimes appears to hang. It produces smaller files because image hrefs (links) remain links in the converted .docx instead of being embedded. It also preserves all the CSS formatting and colors, so the result is more faithful to the original page.
Pandoc: You can use Pandoc (binary not included in this repository; download it from the Pandoc site) to convert HTML to DOCX. This method converts faster and embeds the images, which results in bigger files, but you don't need to be online to view them. However, it removes all CSS and colors from the resulting files. To use it, place pandoc.exe in this same directory.
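
For reference, the Pandoc route can also be driven from Python. The sketch below is not the contents of "pandoc convert.bat"; the function names pandoc_command and convert are hypothetical. It assumes a pandoc executable is reachable under the name passed in, and relies on Pandoc inferring the input and output formats from the file extensions.

```python
import subprocess
from pathlib import Path


def pandoc_command(html_path, pandoc="pandoc"):
    """Build the argument list for converting one HTML file to .docx (hypothetical helper)."""
    html_path = Path(html_path)
    docx_path = html_path.with_suffix(".docx")  # same name, .docx extension
    return [pandoc, str(html_path), "-o", str(docx_path)]


def convert(html_path, pandoc="pandoc"):
    """Run Pandoc on one file; raises CalledProcessError if Pandoc fails."""
    subprocess.run(pandoc_command(html_path, pandoc), check=True)


if __name__ == "__main__":
    # Print the command instead of running it, so this works without Pandoc installed.
    print(pandoc_command("article.html"))
```

With pandoc.exe placed in the working directory as described above, passing pandoc="./pandoc.exe" should point the call at that copy.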

TODO Note: Right now the .py and .bat scripts used for downloading and conversion don't check for existing files, so they will overwrite any already downloaded or converted files that have the same name. There is also no way to resume where you left off if a run is stopped.
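
One possible way to address this TODO is to filter out targets that already exist before downloading or converting. This is only a sketch of the idea, not something the current scripts do; the function name pending and its overwrite parameter are hypothetical.

```python
from pathlib import Path


def pending(targets, overwrite=False):
    """Return the target paths that still need to be written (hypothetical helper)."""
    if overwrite:
        # Caller explicitly wants everything regenerated.
        return [Path(t) for t in targets]
    # Skip anything already on disk so an interrupted run can pick up where it stopped.
    return [Path(t) for t in targets if not Path(t).exists()]
```

Note that an existence check alone cannot tell a complete file from one that was cut off mid-write, so a more careful version might write to a temporary name and rename on success.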

resultUTFSig
.gitignore
README.md
bin_bom.py
convert to word.bat
html2docx.ps1
links roma.txt
pandoc convert.bat
roma_scraper.py

SilverKnightVGM/RomaScraper