GitHub - mediastandardstrust/publicity_machine: Project to grab press releases and feed them up to a churnalism engine

mediastandardstrust / publicity_machine Public

Notifications You must be signed in to change notification settings
Fork 3
Star 9

Project to grab press releases and feed them up to a churnalism engine

Notifications

Name		Name	Last commit message	Last commit date
Latest commit History 62 Commits
churn		churn
wikiscraper		wikiscraper
.gitignore		.gitignore
README		README
acga.py		acga.py
churnalism.cfg.EXAMPLE		churnalism.cfg.EXAMPLE
congressional_leadership.py		congressional_leadership.py
digitalspy.py		digitalspy.py
eurekalert		eurekalert
generic_releases.py		generic_releases.py
job_list.py		job_list.py
marketwire		marketwire
prnewswire		prnewswire
util.py		util.py

Repository files navigation

publicity_machine
=================

a python framework to grab press releases and load them
into churnalism.

The "churn" package provides the framework. In general, you derive
a scraper from churn.BaseScraper, then invoke it's main() method.

Derived scrapers need to provide:

    name          - name of the scraper (eg 'prnewswire')
    doc_type      - numeric type of documents uploaded to the server (usually
                    each scraper will be assigned it's own doc_type)
    find_latest() - to grab a list of the latest press releases (usually
                    from an rss feed). Returns an iterable object of urls to
                    scrape, most likely a simple list, but you could return
                    a generator object if you wanted.
    extract()     - parse html data to pull out the various text and metadata
                    of the press release, returning it as a dictionary.
                    all strings should be returned as unicode.
                    The required fields are:

                     url
                     title: a single line of text
                     text: main text of press release (stripped of html tags)
                     date: time of issue, as a unix timestamp
                     source: eg name of the company that issued the press release
                    Any other fields can also be included. Some which are
                    reasonably standard so far:

                     location: eg "London, England"
                     language: eg 'fr_CA', 'es', 'en', 'en_us' etc...
                     topics: comma-separated list eg "Technology, Health"
                     id: internal id of press release from website CMS
                         (for when there's an obvious ID in the url)


The marketwire scraper is probably the simplest example.
The prnewswire scraper is an example of a scraper with some custom options (to
scrape historical press releases).

a quick bare-bones outline:
-------------------------------------------------------------------------
#!/usr/bin/env python

from churn.basescraper import BaseScraper

class BlahBlahScraper(BaseScraper):
    name = 'blahblah'
    doc_type = 42

    def find_latest(self):
        urls = []
        ...blah blah blah...
        return urls

    def extract(self,html, url):
        press_release = {}
        ...blah blah blah...
        return press_release

if __name__ == '__main__':
    scraper = BlahBlahScraper()
    scraper.main()
-------------------------------------------------------------------------