Skip to content

A Python scraper designed to pull data from HLTV.org and tabulate it into a series of CSV files.

Notifications You must be signed in to change notification settings

pulkitgupta2k/HLTV-Scraper

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

49 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Working Update

HLTV Scraper

This is a multi-threaded Python scraper designed to pull data from HLTV.org and tabulate it into a series of CSV files. It is written in pure Python, so it should run on any system that can run Python 3. It is not compatible with Python 2, so you may need to install the latest Python release from here.

Installation

Since this is written in pure Python, there are no dependencies to install. Simply clone the repository or download the zip file, then cd to the directory and run python3 start.py. There is demonstration here.

Getting New Matches

This works by scraping several pieces of data from the HLTV webpages. First, it will paginate through the match results page and determine which Match IDs are not yet in the database. If there are new IDs to tabulate, it will append them to the matchIDs.csv file. These new matches are stored in an array called matchesToCheck.

Matching Matches to Events

Once these have been added, it compares matchIDs.csv to joinMatchEvent.csv file. getMatchEvents.py will parse matchesToCheck to find their respective Event IDs and append them to the joinMatchEvent.csv file.

Getting New Events

From there, getEventNames.py compares eventIDs.csv to joinMatchEvent.csv and scrapes the respective event results page to determine various data about the event. The data is then appended to eventIDs.csv.

Getting Match Results

Once the new events have been accounted for, the script takes matchesToCheck and sends them to getMatchInfo.py to scrape the necessary information.

Handling Multiple Maps

Since this returns multidimensional arrays for matches with more than one map, the script calls fixArray() thrice to remove any extra dimensions. The method turns an array like this:

[[1, 2, 3], [3, 4, 5], [['a', 'b', 'c'], ['c', 'd', 'e']], [5, 6, 7]]

To an array like this:

[[1, 2, 3], [3, 4, 5], [5, 6, 7], ['a', 'b', 'c'], ['c', 'd', 'e']]

After that, it tabulates the new information to matchResults.csv.

Getting Match Lineups

Next, the script parses the same new matches stored in matchesToCheck and find the respective team lineups and tabulates the new information to matchLineups.csv.

Getting player stats

Each match has player stats for each map. The script looks for these statistics using the matches stored in matchesToCheck and appends the new data to playerStats.csv.

Updating Players and Teams

Each player and team on HLTV has a unique identification number that increases as new players are added to the database. To find new players and teams, we get the maximum identifier value form the respective .csv file and iterate over it using getIterableItems. From there the relevant pages are scraped and tabulated to players.csv and teams.csv.

Starting Over

If you made an entry of a more recent event and want to go beyond that, remove clear the csv : matchIDs.csv to restart.(do not remove first row, ID and Title)

About

A Python scraper designed to pull data from HLTV.org and tabulate it into a series of CSV files.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • Python 100.0%