-
Notifications
You must be signed in to change notification settings - Fork 3
mediastandardstrust/publicity_machine
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
publicity_machine ================= a python framework to grab press releases and load them into churnalism. The "churn" package provides the framework. In general, you derive a scraper from churn.BaseScraper, then invoke it's main() method. Derived scrapers need to provide: name - name of the scraper (eg 'prnewswire') doc_type - numeric type of documents uploaded to the server (usually each scraper will be assigned it's own doc_type) find_latest() - to grab a list of the latest press releases (usually from an rss feed). Returns an iterable object of urls to scrape, most likely a simple list, but you could return a generator object if you wanted. extract() - parse html data to pull out the various text and metadata of the press release, returning it as a dictionary. all strings should be returned as unicode. The required fields are: url title: a single line of text text: main text of press release (stripped of html tags) date: time of issue, as a unix timestamp source: eg name of the company that issued the press release Any other fields can also be included. Some which are reasonably standard so far: location: eg "London, England" language: eg 'fr_CA', 'es', 'en', 'en_us' etc... topics: comma-separated list eg "Technology, Health" id: internal id of press release from website CMS (for when there's an obvious ID in the url) The marketwire scraper is probably the simplest example. The prnewswire scraper is an example of a scraper with some custom options (to scrape historical press releases). a quick bare-bones outline: ------------------------------------------------------------------------- #!/usr/bin/env python from churn.basescraper import BaseScraper class BlahBlahScraper(BaseScraper): name = 'blahblah' doc_type = 42 def find_latest(self): urls = [] ...blah blah blah... return urls def extract(self,html, url): press_release = {} ...blah blah blah... return press_release if __name__ == '__main__': scraper = BlahBlahScraper() scraper.main() -------------------------------------------------------------------------
About
Project to grab press releases and feed them up to a churnalism engine
Resources
Stars
Watchers
Forks
Releases
No releases published
Packages 0
No packages published