Using gtrendsR for daily hits #412

marcyshieh · 2022-04-26T22:11:27Z

Using the gtrendsR package and a modified version of Alex Dyachenko’s tutorial, I’ve been trying to query estimated Google Trends daily hits. I noticed that my modified version of Alex’s code doesn’t allow me to stop mid-month. In my modified version, all the days past the last day of the previous month show up as NA. Is there a way to resolve the issue?

In essence, I am trying to replicate the steps in this Medium article, but instead of querying a single month, I want to cover an entire date range.

Here's some replication code and the sample.xlsx file.

# daily estimates 

library(gtrendsR)
library(tidyverse)
library(lubridate)
library(readxl)
library(here)
library(stringr)
library(curl)

get_daily_gtrend <- function(keyword = c('Taylor Swift', 'Kim Kardashian'), geo = 'US', from = '2004-01-01', to = '2004-11-02') {
  if (ymd(to) >= floor_date(Sys.Date(), 'month')) {
    to <- floor_date(ymd(to), 'month') - days(1)
    
    if (to < from) {
      stop("Specifying \'to\' date in the current month is not allowed")
    }
  }
  
  aggregated_data <- gtrends(keyword = keyword, geo = geo, time = paste(from, to))
  if(is.null(aggregated_data$interest_over_time)) {
    print('There is no data in Google Trends!')
    return()
  }
  
  mult_m <- aggregated_data$interest_over_time %>%
    mutate(hits = as.integer(ifelse(hits == '<1', '0', hits))) %>%
    group_by(month = floor_date(date, 'month'), keyword) %>%
    dplyr::summarise(hits = sum(hits)) %>%
    ungroup() %>%
    mutate(ym = format(month, '%Y-%m'),
           mult = hits / max(hits)) %>%
    dplyr::select(month, ym, keyword, mult) %>%
    as_tibble()
  
  pm <- tibble(s = seq(ymd(from), ymd(to), by = 'month'), 
               e = seq(ymd(from), ymd(to), by = 'month') + months(1) - days(1))
  
  raw_trends_m <- tibble()
  
  for (i in seq(1, nrow(pm), 1)) {
    curr <- gtrends(keyword, geo = geo, time = paste(pm$s[i], pm$e[i]))
    if(is.null(curr$interest_over_time)) next
    print(paste('for', pm$s[i], pm$e[i], 'retrieved', count(curr$interest_over_time), 'days of data (all keywords)'))
    raw_trends_m <- rbind(raw_trends_m,
                          curr$interest_over_time)
  }
  
  trend_m <- raw_trends_m %>%
    dplyr::select(date, keyword, hits) %>%
    mutate(ym = format(date, '%Y-%m'),
           hits = as.integer(ifelse(hits == '<1', '0', hits))) %>%
    as_tibble()
  
  
  trend_res <- trend_m %>%
    left_join(mult_m) %>%
    mutate(est_hits = hits * mult) %>%
    dplyr::select(date, keyword, est_hits) %>%
    as_tibble() %>%
    mutate(date = as.Date(date))
  
  
  return(trend_res)
}


all <- read_excel("sample.xlsx")

all$Name <- trimws(all$Name)

all <- distinct(all)

all$surname <- str_extract(all$Name, '[^ ]+$')

all$surname <- trimws(all$surname)

all_j <- all %>%
  dplyr::select(Year, Folder) %>%
  distinct()

#####

cand2004 <- all %>% 
  arrange(Folder, str_count(Name, "\\w+"), nchar(Name)) %>%
  group_by(Folder, Year) %>%
  mutate(order = row_number()) %>%
  ungroup() 

cand2004 <- cand2004 %>%
  dplyr::select(Year, Folder, Name, order) %>%
  distinct() %>%
  separate(Folder, c("state", "name"), sep="\\-", extra = "merge")

cand2004_grp1 <- cand2004 %>%
  filter(Year == 2004, order == 1)

cand2004_grp1a <- split(cand2004_grp1,rep(1:20,each=5))

l <- cand2004_grp1a$`1` %>% dplyr::pull(Name) 

l <- as.list(unique(l))

r <- tibble()


for(k in l) {
  r <- r %>%
    rbind(get_daily_gtrend(keyword = k, geo = 'US', from = '2004-01-01', to = '2004-11-02'))
}

r %>% view()

JBleher · 2022-04-26T23:14:10Z

The problem is in how you loop over the dates. You can only download daily data for at most 270 days.

The code you have builds one query per month:

  pm <- tibble(s = seq(ymd(from), ymd(to), by = 'month'), 
               e = seq(ymd(from), ymd(to), by = 'month') + months(1) - days(1))
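
To see what these windows look like for a `to` date that falls mid-month, here is a minimal base-R sketch. `month_windows` is a hypothetical helper, not part of gtrendsR, and it assumes `from` is the first day of a month. It builds the same start and end dates as `pm` and then clamps the final end date to `to`, which the original `pm` does not do (its last window runs to 2004-11-30). Whether clamping alone removes the NA rows is not confirmed in this thread.

```r
# Hypothetical helper (not part of gtrendsR): build month-sized query windows
# and clamp the last window so it never runs past `to`.
# Assumes `from` is the first day of a month.
month_windows <- function(from, to) {
  from <- as.Date(from)
  to <- as.Date(to)
  # Window starts: `from`, then the first day of each following month up to `to`.
  s <- seq(from, to, by = "month")
  # One extra start date, so that each window can end the day before the next.
  nxt <- seq(from, by = "month", length.out = length(s) + 1)
  e <- nxt[-1] - 1
  # Clamp: the final window stops at `to` instead of at the end of the month.
  data.frame(s = s, e = pmin(e, to))
}

windows <- month_windows("2004-01-01", "2004-11-02")
print(tail(windows, 2))
```

Each row could then be passed to `gtrends()` as `time = paste(windows$s[i], windows$e[i])`, in the same way the loop over `pm` does.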

Also note that what you are doing makes the resulting time series hardly useful, since the queries are not comparable over time. You are stitching together daily hits that are each standardized within the time frame of the query that returned them.

See our paper: https://www.sciencedirect.com/science/article/abs/pii/S2452306221001210
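
A toy illustration of the comparability problem, using made-up numbers rather than real Google Trends data. It assumes a simplified model in which each query window is rescaled so that its own maximum equals 100. The names `true_interest`, `window`, and `scale_window` are hypothetical.

```r
# Synthetic "true" daily interest (made-up numbers, not Google Trends data),
# split into two separately queried windows, A and B.
true_interest <- c(10, 20, 40, 40, 80, 20)
window <- rep(c("A", "B"), each = 3)

# Simplified model of per-query standardization: each window is rescaled
# so that its own maximum becomes 100.
scale_window <- function(x) round(100 * x / max(x))
scaled <- ave(true_interest, window, FUN = scale_window)

print(data.frame(window, true_interest, scaled))
```

Under this model the same underlying value of 40 is reported as 100 in window A and as 50 in window B, so the stitched `scaled` column cannot be read as one consistent series without some further rescaling step.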

This has nothing to do with the package; it is just how your code is written.

PMassicotte · 2022-04-27T12:54:52Z

This is also what Google Trends returns:

https://trends.google.com/trends/explore?date=2019-12-31%202020-11-01&geo=US&q=charles%20jones
