content-extraction

Here are 37 public repositories matching this topic...

currentslab / extractnet

A fork of Dragnet that also extracts the author, headline, date, and keywords from context, with built-in metadata extraction, all in one package

python machine-learning text-mining news web-scraping webscraping news-articles news-extractor content-extraction news-extraction text-cleaning date-extraction author-extraction

Updated Dec 25, 2023
HTML

mvasilkov / readability2

Readability2 converts HTML to plain text.

javascript html readability plaintext content-extraction

Updated Dec 12, 2018
TypeScript

tuffstuff9 / nextjs-pdf-parser

Next.js template for seamless PDF parsing using pdf2json and FilePond. Ideal for developers seeking a ready-to-use solution for PDF content extraction in Next.js projects.

nextjs content-extraction pdf-parsing react-pdf pdf-parser pdf2json filepond pdf-upload pdf-parse nextjs-pdf-parser nextjs-pdf react-pdf-parser nextjs-pdf-parse nextjs-pdf-parsing

Updated Dec 8, 2023
TypeScript

gregors / boilerpipe-ruby

Pure Ruby implementation of the Boilerpipe content extraction algorithm, tuned for online articles

news webscraping content-extraction boilerpipe boilerpipe-algorithm

Updated Feb 21, 2021
Ruby

vrknetha / mcp-server-firecrawl

FireCrawl MCP Server is a powerful web scraping integration for Claude and other LLMs. It provides JavaScript rendering, batch processing, and search capabilities through a Model Context Protocol (MCP) interface. It now supports self-hosted instances and advanced features such as parallel processing, automatic retries, and content filtering.

web-crawler web-scraping data-collection batch-processing content-extraction search-api claude llm-tools firecrawl model-context-protocol mcp-server firecrawl-ai javascript-rendering

Updated Jan 6, 2025
JavaScript

nikitautiu / learnhtml

Web content extraction using machine learning

html deep-learning content-extraction

Updated Mar 3, 2021
HTML

pdfix / pdfix_sdk_example_cpp

Make PDF Files Accessible, Extract Data from PDF, Convert PDF to HTML, Fill-in PDF Form, Stamp PDF and more...

Updated Jan 14, 2025
C++

gdamdam / sumo

Tool that extracts the text from web article URLs and provides word frequencies, entity recognition, automatic summaries, and more

nlp nltk automatic-summarization content-extraction semantic-analysis sentence-extraction entity-recognition

Updated Jan 15, 2019
Python

oiwn / dom-content-extraction

DOM Based Content Extraction via Text Density

scraping content-extraction dom-based

Updated Jan 17, 2025
Rust

timoteostewart / benson

Benson turns a list of URLs into mp3s of the contents of each web page - take control over your reading backlog!

productivity web-scraping content-extraction boilerplate-removal

Updated Oct 30, 2024
Python

LandWhale2 / TD-Spider

Simple web crawler in Go based on text density

golang data-mining opensource dom web-crawler scraping content-extraction keyword-search text-density

Updated Mar 19, 2023
Go

bencmc / youtube_video_summarizer

This repository houses a Python application for extracting YouTube video transcripts and summarizing their content.

python natural-language-processing youtube-api video-processing openai text-summarization text-processing natural content-extraction streamlit transcript-analysis gpt-35-turbo langchain-python

Updated Sep 29, 2023
Python

peremenov / seize

Seize is a lightweight web-page content extractor for Node or the browser, inspired by Arc90 Readability and Safari Reader

dom extract reader readability content-extraction text-score

Updated May 20, 2017
HTML

zeoagency / mobile-first-indexing-tool

Mobile First Indexing Tool

aws-lambda seo mfi content-extraction lighthouse seo-tool aws-layers

Updated Sep 8, 2022
Python

minarc / godensity

This repository is an implementation of 📄 DOM-based content extraction via text density. Tested on Korean web pages.

content-extraction web-content-extractor

Updated Sep 7, 2024
Go

arman-bd / www2any

A web application that scrapes web pages, extracts main content, and uses OpenLLaMA to convert the content into specified formats.

flask transformer webscraping content-extraction playwright llm openllama

Updated Dec 9, 2024
HTML

spences10 / mcp-jinaai-reader

🔍 Model Context Protocol (MCP) tool for parsing websites using the Jina.ai Reader

mcp documentation-tool text-extraction web-scraping content-extraction web-content jinaai llm-tools model-context-protocol

Updated Jan 20, 2025
JavaScript

leroyanders / acrticle-scrapper

This Python-based repository hosts a sophisticated service designed for scraping web articles and converting them into Markdown format. The core functionality of this service includes extracting the main content of articles, such as headlines, key paragraphs, and associated images, and then seamlessly transforming this content into well-structured…

python web-scraping content-extraction metadata-extraction article-parser markdown-conversion image-downloading data-archiving html-to-markdown-converter content-creation-tools