This is a mini R package for cleaning and preprocessing YouTube comment data collected with the R packages tuber or vosonSML.
The package is a collection of functions that were developed during several workshops on collecting and analyzing YouTube data at GESIS - Leibniz Institute for the Social Sciences. The main function of the package is parse_yt_comments(), which takes a dataframe containing YouTube comments collected with tuber or vosonSML as input and outputs a processed dataframe in which URLs/links, video timestamps, user mentions, emoticons, and emoji have been extracted from the comments into separate columns. In addition, the function creates a column containing textual descriptions of the emoji and another column containing a cleaned version of the comment, from which the elements listed above as well as numbers and punctuation have been removed.
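To make the idea of extracting elements into separate columns more concrete, here is a minimal sketch in base R. It is not the implementation used by parse_yt_comments(): the example comments, the helper extract_all(), the regular expressions, and the column names are all assumptions made for illustration, and emoji and emoticon handling is left out entirely.

```r
# Illustrative sketch only: NOT the code used by parse_yt_comments().
# The comments below are made up for this example.
comments <- c(
  "Great video! See https://example.com/info and skip to 1:23 @SomeUser",
  "No extras here",
  "@Alice @Bob compare 0:45 with 12:03:10"
)

# Hypothetical helper: return all regex matches per comment as a list.
extract_all <- function(text, pattern) {
  regmatches(text, gregexpr(pattern, text, perl = TRUE))
}

# Simplified patterns (assumptions; real-world data needs more care).
url_pattern       <- "https?://[^\\s]+"
timestamp_pattern <- "\\b\\d{1,2}:\\d{2}(?::\\d{2})?\\b"
mention_pattern   <- "@\\w+"

# Store the extracted elements in list columns, one entry per comment.
parsed <- data.frame(comment = comments, stringsAsFactors = FALSE)
parsed$urls       <- extract_all(comments, url_pattern)
parsed$timestamps <- extract_all(comments, timestamp_pattern)
parsed$mentions   <- extract_all(comments, mention_pattern)

# Build a cleaned text version: drop the extracted elements first,
# then numbers and punctuation, then collapse repeated whitespace.
cleaned <- comments
for (p in c(url_pattern, timestamp_pattern, mention_pattern)) {
  cleaned <- gsub(p, " ", cleaned, perl = TRUE)
}
cleaned <- gsub("[[:punct:][:digit:]]+", " ", cleaned)
parsed$cleaned <- trimws(gsub("\\s+", " ", cleaned))

print(parsed$cleaned)
```

The elements are removed before punctuation is stripped because a URL or timestamp would otherwise be broken into fragments that no longer match its pattern.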
Please note: The functions in this package are heavily dependent on the structure of the data exports from tuber and vosonSML, and, by extension, the structure of the YouTube API.
If you are interested in becoming a maintainer of this package, feel free to contact us.
# GitHub version
library(remotes)
remotes::install_github("gesiscss/tubecleanR")
We have created some simulated YouTube comment data in the tuber and vosonSML formats. These example datasets are bundled with the package:
# attaching package
library(tubecleanR)
# Checking example comments bundled with the package
View(tuberComments)
View(vosonComments)
# Parsing comments
tuber_parsed <- parse_yt_comments(tuberComments)
voson_parsed <- parse_yt_comments(vosonComments)
# Checking parsed versions of example comments
View(tuber_parsed)
View(voson_parsed)
The parse_yt_comments() function is meant to be used for YouTube comment data collected with the get_all_comments() function from tuber or the Collect() function from vosonSML. Both of those require access credentials for the YouTube API. Check the documentation of those two packages for further details.
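As a rough sketch of how collection and parsing could fit together for the tuber route, the wrapper below authenticates, downloads the comments for one video, and passes them to parse_yt_comments(). The wrapper name collect_and_parse_tuber() and its arguments are hypothetical and not part of any package. Calling it requires the tuber package, tubecleanR, and valid OAuth credentials for the YouTube API.

```r
# Sketch only: the wrapper and its argument names are hypothetical.
# Defining it is safe; calling it needs tuber, tubecleanR, and API credentials.
collect_and_parse_tuber <- function(video_id, app_id, app_secret) {
  # Authenticate against the YouTube API via OAuth.
  tuber::yt_oauth(app_id = app_id, app_secret = app_secret)

  # Download the comments for a single video as a dataframe.
  raw_comments <- tuber::get_all_comments(video_id = video_id)

  # Hand the raw tuber export to the parser from this package.
  tubecleanR::parse_yt_comments(raw_comments)
}

# Example call (not run here; replace the placeholders with your own values):
# parsed <- collect_and_parse_tuber("<video id>", "<app id>", "<app secret>")
```

For data collected with vosonSML, the same pattern would apply with that package's authentication and Collect() steps in place of the tuber calls. See the vosonSML documentation for the exact arguments.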
If you want to learn more about getting access to the YouTube API, collecting comment (and other) data from the API using R, and processing and exploring the resulting data, you can also check out the materials from our workshop.
If you are using this package in your research, please cite it as follows:
> citation("tubecleanR")
To cite package ‘tubecleanR’ in publications use:
Kohne, J., & Breuer, J. (2024). tubecleanR: Parsing and Preprocessing YouTube Comment
Data. R package version 0.1.0. <https://gesiscss.github.io/tubecleanR/>.
A BibTeX entry for LaTeX users is
@Manual{,
title = {tubecleanR: Parsing and Preprocessing YouTube Comment Data},
author = {Julian Kohne and Johannes Breuer},
year = {2024},
note = {R package version 0.1.0},
url = {https://gesiscss.github.io/tubecleanR/},
}