Organizations

@deutschestextarchiv

adbar/README.md

Hi there! 👋


⚡  Web  ✍  Blog  ☕  Coffee

I'm a data engineer and scientist specializing in natural language processing. On GitHub, I'm the author and maintainer of projects like Trafilatura, a popular open-source package for gathering and extracting text data, used by researchers and the AI industry.


Pinned

  1. trafilatura Public

    Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML

    Python · 3.9k stars · 272 forks

  2. htmldate Public

    Fast and robust date extraction from web pages, with Python or on the command-line

    Python · 121 stars · 26 forks

  3. simplemma Public

    Simple multilingual lemmatizer for Python, especially useful for speed and efficiency

    Python · 151 stars · 13 forks

  4. py3langid Public

    Forked from saffsd/langid.py

    Faster, modernized fork of the language identification tool langid.py

    Python · 50 stars · 9 forks

  5. courlan Public

    Clean, filter and sample URLs to optimize data collection – Python & command-line – Deduplication, spam, content and language filters

    Python · 133 stars · 9 forks

  6. German-NLP Public

    Curated list of open-access/open-source/off-the-shelf resources and tools developed with a particular focus on German

    458 stars · 67 forks

519 contributions in the last year

Contribution activity

January 2025

Created 3 repositories

Created a pull request in lichess-bot-devs/lichess-bot that received 7 comments

timer: import submodules only and add tests

Description: This PR updates the timer class to make the code more concise and adds tests. Related…

+126 −28 lines changed 7 comments

Opened 5 other pull requests in 3 repositories

    lichess-bot-devs/lichess-bot: 1 closed, 1 merged
    niklasf/python-chess: 1 open, 1 merged
    deepset-ai/haystack-integrations: 1 merged

Reviewed 4 pull requests in 3 repositories

    lichess-bot-devs/lichess-bot: 2 pull requests
    adbar/trafilatura: 1 pull request
    niklasf/python-chess: 1 pull request