This code [code here] scrapes data from selected Telegram channels, groups or chats using the Telethon library, and writes the results to a Google spreadsheet in real time through Google's gspread library.
In summary, you can set a period (dates), keywords (search) and an ID (channel, group or chat) to scrape the desired content. Each scraped message returns the following fields: 'Scraping ID', 'Group', 'Author ID', 'Content', 'Date', 'Message ID', 'Author', 'Views', 'Reactions', 'Shares', 'Media', 'Comments'.
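As an illustration of how one message could be flattened into the twelve fields listed above, here is a minimal sketch. It is not the repository's actual code: the function name `message_to_row` is hypothetical, and the attribute names (`sender_id`, `message`, `post_author`, `views`, `forwards`, `replies`, `reactions`, `media`) are assumptions modeled on Telethon's message objects. The choice of attribute for each column is also an assumption.

```python
from datetime import datetime, timezone
from types import SimpleNamespace

# Column order as listed in this README.
COLUMNS = ['Scraping ID', 'Group', 'Author ID', 'Content', 'Date', 'Message ID',
           'Author', 'Views', 'Reactions', 'Shares', 'Media', 'Comments']


def message_to_row(scraping_id, group, message):
    """Flatten one message-like object into a list ordered like COLUMNS.

    Attribute names are assumptions modeled on Telethon's message objects;
    missing or empty attributes fall back to '' or 0.
    """
    replies = getattr(message, 'replies', None)
    reactions = getattr(message, 'reactions', None)
    # Sum the counters of every reaction type, if any reactions exist.
    reaction_total = sum(r.count for r in reactions.results) if reactions else 0
    return [
        scraping_id,
        group,
        getattr(message, 'sender_id', None) or '',
        getattr(message, 'message', None) or '',
        message.date.strftime('%Y-%m-%d %H:%M:%S'),
        message.id,
        getattr(message, 'post_author', None) or '',
        getattr(message, 'views', None) or 0,
        reaction_total,
        getattr(message, 'forwards', None) or 0,
        'yes' if getattr(message, 'media', None) else 'no',
        replies.replies if replies else 0,
    ]


if __name__ == '__main__':
    # A stand-in object, so the sketch runs without a Telegram connection.
    demo = SimpleNamespace(
        id=7, sender_id=None, message='hello', post_author=None,
        date=datetime(2022, 1, 1, 12, 0, tzinfo=timezone.utc),
        views=10, forwards=2, media=None, replies=None, reactions=None,
    )
    print(dict(zip(COLUMNS, message_to_row(1, 'demo group', demo))))
```

Keeping the column list and the row builder side by side makes it easier to check that both stay in the same order.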
To limit the impact of a crash during scraping, each scraped message is inserted into the spreadsheet as soon as it is collected, one by one, instead of scraping everything first and writing the output to the spreadsheet only at the end.
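The row-by-row idea can be sketched as follows. This is a simplified illustration rather than the repository's code: `write_incrementally` is a hypothetical helper, and the sink is passed in as a plain callable. With gspread, that callable could be a worksheet's `append_row` method. Here a list stands in for the spreadsheet so the example runs offline.

```python
def write_incrementally(rows, append_row):
    """Pass each row to the sink as soon as it is produced.

    `rows` may be a lazy iterator, so nothing is buffered. If it raises
    midway, every row already handed to `append_row` has been stored.
    Returns the number of rows written.
    """
    written = 0
    for row in rows:
        append_row(row)
        written += 1
    return written


if __name__ == '__main__':
    def flaky_source():
        # Simulates a scrape that breaks after two messages.
        yield [1, 'first post']
        yield [2, 'second post']
        raise RuntimeError('connection lost')

    saved = []  # stands in for the spreadsheet
    try:
        write_incrementally(flaky_source(), saved.append)
    except RuntimeError as error:
        print(f'scrape interrupted: {error}')
    print(f'{len(saved)} rows were already saved')
```

The trade-off is one write request per message, which is slower than a single batch write but keeps partial results when a long scrape fails.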
As an example, the code was used to scrape Jair Bolsonaro's Telegram channel between January 1st, 2019 and January 1st, 2023, and it returned 5241 posts.
The result can be accessed at: [Worksheet of Scraped Telegram Bolsonaro's Posts].
Its use is encouraged for academic and scientific research, including content, sentiment and speech analysis. It is free and open. Responsible use is the sole responsibility of those who adapt the code and manipulate the data.
pip install telethon
pip install google-auth google-auth-oauthlib google-auth-httplib2 google-api-python-client
Attention: If you don't have the necessary credentials, you can create them for free on the official Telegram for Developers website: https://my.telegram.org/apps. There you can get your 'api_id' and 'api_hash'.
# setup / change only the first time you use it
username = 'username' # here you put your username from your telegram account
phone = '+5511999999999' # here you put your phone number from your telegram account
api_id = '12345678' # here you put your api_id from https://my.telegram.org/apps
api_hash = '12ab12ab12ab12ab12ab12ab12ab12ab' # here you put your api_hash from https://my.telegram.org/apps
# setup / change these every time to define a new scraping run
channel = '@jairbolsonarobrasil' # here you put the name of the channel or group that you want to scrape (ex: '@jairbolsonarobrasil' or 'https://t.me/jairbolsonarobrasil/' / not: 'https://web.telegram.org/z/#-1273465589' or '-1273465589')
worksheet_name = 'Telegram Teste' # here you put the name of the output file; it will be created on the home screen of your Google Drive
d_min = 1 # start day / this date will be included
m_min = 1 # start month
y_min = 2022 # start year
d_max = 2 # final day / this date is excluded: messages are collected up to the day before it
m_max = 1 # final month
y_max = 2022 # final year
key_search = '' # only if you want to search a keyword, if not, keep as ''
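To make the parameters above concrete, the sketch below shows one way they could be interpreted. It is an illustration under stated assumptions, not the repository's implementation: the helper names `scrape_window`, `in_window` and `normalize_channel` are hypothetical, and using UTC for the boundaries is an assumption (Telethon reports message dates as timezone-aware UTC values). The window is half-open, so the start date is included and the final date is excluded, as the comments describe.

```python
from datetime import datetime, timezone


def scrape_window(d_min, m_min, y_min, d_max, m_max, y_max):
    """Return (start, end) as UTC datetimes; start is inclusive, end is exclusive."""
    start = datetime(y_min, m_min, d_min, tzinfo=timezone.utc)
    end = datetime(y_max, m_max, d_max, tzinfo=timezone.utc)
    if end <= start:
        raise ValueError('the final date must be after the start date')
    return start, end


def in_window(moment, start, end):
    """True when start <= moment < end (half-open interval)."""
    return start <= moment < end


def normalize_channel(channel):
    """Reduce '@name' or 'https://t.me/name/' to the bare name.

    Numeric IDs and web.telegram.org links are rejected, matching the
    'not:' examples in the comment on `channel`.
    """
    name = channel.strip().rstrip('/')
    if name.startswith('https://t.me/'):
        name = name[len('https://t.me/'):]
    name = name.lstrip('@')
    if not name or name.lstrip('-').isdigit() or '/' in name or '#' in name:
        raise ValueError(f'unsupported channel reference: {channel!r}')
    return name


if __name__ == '__main__':
    start, end = scrape_window(1, 1, 2022, 2, 1, 2022)
    late = datetime(2022, 1, 1, 23, 59, tzinfo=timezone.utc)
    print(in_window(late, start, end))   # inside: still January 1st
    print(in_window(end, start, end))    # outside: the final date itself
    print(normalize_channel('https://t.me/jairbolsonarobrasil/'))
```

With the values shown above (January 1st to January 2nd, 2022), only messages dated January 1st fall inside the window.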
01.) You will be asked 'Allow this notebook to access your Google credentials?' This will allow code running in this notebook to access your Google Drive and Google Cloud data. Review the code before allowing access, then click 'Allow':
02.) Choose an account to proceed to Colaboratory Runtimes. To continue, Google will share your name, email address, preferred language, and profile picture with the Colaboratory Runtimes app. Please review the Colaboratory Runtimes app's Privacy Policy and Terms of Service before using it:
03.) You will then be prompted to configure Telegram; enter your phone number:
04.) You will receive a code:
05.) Come back and enter the code you received:
06.) Enter the password of your Telegram account:
07.) You will be notified that the login was successful:
08.) The scraping will start with the parameters you entered earlier; note that the progress is also updated in the panel:
09.) Your file will be automatically generated on the home screen of the Google Drive account you are logged in to:
10.) At the end, you will receive a message stating how many messages were scraped, based on the loop performed:
The output can be found in this format:
Ergon Cugler de Moraes Silva, from Brazil, mailto: [email protected] / Master's Program in Public Administration and Government, Getulio Vargas Foundation (FGV) / Researcher funded by the National Council for Scientific and Technological Development (CNPq) / Center of Bureaucratic Studies (NEB) / Núcleo de Estudos da Burocracia (NEB).
SILVA, Ergon Cugler de Moraes. Web Scraping Telegram Posts and Content. (feb) 2023. Available at: https://github.com/ergoncugler/web-scraping-telegram/.