This repository has been archived by the owner on Mar 6, 2023. It is now read-only.

stalkerg/python-readability

 
 


This code is under the Apache License 2.0. http://www.apache.org/licenses/LICENSE-2.0

This is a Python 3 fork of https://github.com/buriy/python-readability and https://github.com/ftzeng/python-readability (which added some Python 3 support). It supports only Python 3; Python 2.x support has been dropped. It is more than a port: it also adds new features and fixes, such as "lead" and "main_image_url".

Installation:

pip install git+https://github.com/stalkerg/python-readability

Usage:

from readability.readability import Document
import urllib.request

url = "https://example.com/article"  # placeholder: the page to extract
html = urllib.request.urlopen(url).read()
doc = Document(html)
doc.parse(["summary", "short_title"])  # request the parts read below
readable_article = doc.summary()
readable_title = doc.short_title()
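
urlopen(url).read() returns bytes, while the input argument is described as text. One option is to decode the response before passing it to Document. The following is a minimal sketch, assuming the charset is declared in the Content-Type header; decode_html is a hypothetical helper and not part of this library.

```python
from email.message import Message


def decode_html(raw: bytes, content_type: str = "", fallback: str = "utf-8") -> str:
    """Decode response bytes using the charset named in a Content-Type value.

    Hypothetical helper for illustration; not part of python-readability.
    """
    msg = Message()
    msg["Content-Type"] = content_type or "text/html"
    # get_content_charset() returns None when no charset parameter is present.
    charset = msg.get_content_charset() or fallback
    try:
        return raw.decode(charset, errors="replace")
    except LookupError:
        # The header named a charset Python does not know: use the fallback.
        return raw.decode(fallback, errors="replace")
```

With urllib, the header value is available as response.headers.get("Content-Type", "") on the object returned by urlopen.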

Document() __init__ arguments:

  • input: input html as text
  • base_url: will allow adjusting links to be absolute
  • debug: output debug messages
  • min_text_length: minimum text size
  • retry_length: acceptable length of the text
  • positive_keywords: the list of positive search patterns in classes and ids, for example: ["news-item", "block"]
  • negative_keywords: the list of negative search patterns in classes and ids, for example: ["mysidebar", "related", "ads"]
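
To illustrate how positive and negative patterns on classes and ids can influence which blocks are kept, here is a minimal sketch. It is not the library's implementation: the function names, the substring matching, and the step of 25 points are all assumptions made for the example.

```python
import re


def compile_keywords(keywords):
    """Build one case-insensitive regex that matches any of the keywords."""
    if not keywords:
        return None
    return re.compile("|".join(re.escape(k) for k in keywords), re.IGNORECASE)


def keyword_weight(class_attr, id_attr, positive=None, negative=None, step=25):
    """Score an element from its class and id attribute strings.

    Illustrative only: `step` is an arbitrary value chosen for this sketch.
    """
    pos = compile_keywords(positive)
    neg = compile_keywords(negative)
    weight = 0
    for feature in (class_attr, id_attr):
        if not feature:
            continue  # attribute missing or empty
        if pos and pos.search(feature):
            weight += step
        if neg and neg.search(feature):
            weight -= step
    return weight


positive = ["news-item", "block"]
negative = ["mysidebar", "related", "ads"]
article_score = keyword_weight("news-item featured", "", positive, negative)
sidebar_score = keyword_weight("mysidebar", "related-posts", positive, negative)
```

Because this sketch matches substrings, a short pattern such as "ads" would also match class names like "downloads", so more specific patterns are safer.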

Document() parse arguments:

  • params_list: list of items to parse. Accepted values: ["content", "title", "short_title", "summary", "lead", "first_image_url", "main_image_url"]
  • html_partial: if True, produce HTML without the html/body tags.

About

Extract the essence of a web page with Python 3.
