Using gtrendsR for daily hits #412

Open
marcyshieh opened this issue Apr 26, 2022 · 2 comments

@marcyshieh

Using the gtrendsR package and a modified version of Alex Dyachenko’s tutorial, I’ve been trying to query estimated Google Trends daily hits. I noticed that my modified version of Alex’s code doesn’t let me stop mid-month: every day after the last day of the previous month shows up as NA. Is there a way to resolve this?

In essence, I am trying to replicate the steps in this Medium article, but for an arbitrary date range rather than whole months.

Here's some replication code and the sample.xlsx file.

# daily estimates 

library(gtrendsR)
library(tidyverse)
library(lubridate)
library(readxl)
library(here)
library(stringr)
library(curl)

get_daily_gtrend <- function(keyword = c('Taylor Swift', 'Kim Kardashian'), geo = 'US', from = '2004-01-01', to = '2004-11-02') {
  if (ymd(to) >= floor_date(Sys.Date(), 'month')) {
    to <- floor_date(ymd(to), 'month') - days(1)
    
    if (to < from) {
      stop("Specifying \'to\' date in the current month is not allowed")
    }
  }
  
  aggregated_data <- gtrends(keyword = keyword, geo = geo, time = paste(from, to))
  if(is.null(aggregated_data$interest_over_time)) {
    print('There is no data in Google Trends!')
    return()
  }
  
  mult_m <- aggregated_data$interest_over_time %>%
    mutate(hits = as.integer(ifelse(hits == '<1', '0', hits))) %>%
    group_by(month = floor_date(date, 'month'), keyword) %>%
    dplyr::summarise(hits = sum(hits)) %>%
    ungroup() %>%
    mutate(ym = format(month, '%Y-%m'),
           mult = hits / max(hits)) %>%
    dplyr::select(month, ym, keyword, mult) %>%
    as_tibble()
  
  pm <- tibble(s = seq(ymd(from), ymd(to), by = 'month'), 
               e = seq(ymd(from), ymd(to), by = 'month') + months(1) - days(1))
  
  raw_trends_m <- tibble()
  
  for (i in seq(1, nrow(pm), 1)) {
    curr <- gtrends(keyword, geo = geo, time = paste(pm$s[i], pm$e[i]))
    if(is.null(curr$interest_over_time)) next
    print(paste('for', pm$s[i], pm$e[i], 'retrieved', count(curr$interest_over_time), 'days of data (all keywords)'))
    raw_trends_m <- rbind(raw_trends_m,
                          curr$interest_over_time)
  }
  
  trend_m <- raw_trends_m %>%
    dplyr::select(date, keyword, hits) %>%
    mutate(ym = format(date, '%Y-%m'),
           hits = as.integer(ifelse(hits == '<1', '0', hits))) %>%
    as_tibble()
  
  
  trend_res <- trend_m %>%
    left_join(mult_m) %>%
    mutate(est_hits = hits * mult) %>%
    dplyr::select(date, keyword, est_hits) %>%
    as_tibble() %>%
    mutate(date = as.Date(date))
  
  
  return(trend_res)
}


all <- read_excel("sample.xlsx")

all$Name <- trimws(all$Name)

all <- distinct(all)

all$surname <- str_extract(all$Name, '[^ ]+$')

all$surname <- trimws(all$surname)

all_j <- all %>%
  dplyr::select(Year, Folder) %>%
  distinct()

#####

cand2004 <- all %>% 
  arrange(Folder, str_count(Name, "\\w+"), nchar(Name)) %>%
  group_by(Folder, Year) %>%
  mutate(order = row_number()) %>%
  ungroup() 

cand2004 <- cand2004 %>%
  dplyr::select(Year, Folder, Name, order) %>%
  distinct() %>%
  separate(Folder, c("state", "name"), sep="\\-", extra = "merge")

cand2004_grp1 <- cand2004 %>%
  filter(Year == 2004, order == 1)

cand2004_grp1a <- split(cand2004_grp1,rep(1:20,each=5))

l <- cand2004_grp1a$`1` %>% dplyr::pull(Name) 

l <- as.list(unique(l))

r <- tibble()


for(k in l) {
  r <- r %>%
    rbind(get_daily_gtrend(keyword = k, geo = 'US', from = '2004-01-01', to = '2004-11-02'))
}

r %>% view()
@JBleher
Contributor

JBleher commented Apr 26, 2022

The problem is how you loop over the dates. You can only download daily data for at most 270 days.
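
To make the limit concrete, here is a minimal base R sketch that splits a date range into consecutive windows of at most 270 days. The helper name `split_into_windows` and its `max_days` argument are hypothetical (they are not part of gtrendsR), and the 270-day figure is simply taken from the statement above, counted inclusively.

```r
# Hypothetical helper, not part of gtrendsR: split [from, to] into
# consecutive windows that each span at most max_days calendar days.
# The 270-day default is an assumption taken from the comment above.
split_into_windows <- function(from, to, max_days = 270) {
  from <- as.Date(from)
  to <- as.Date(to)
  stopifnot(from <= to, max_days >= 1)
  # Window starts are spaced max_days apart.
  starts <- seq(from, to, by = paste(max_days, "days"))
  # Each window ends max_days - 1 days after its start, but never after `to`.
  ends <- pmin(starts + max_days - 1, to)
  data.frame(s = starts, e = ends)
}

w <- split_into_windows("2004-01-01", "2004-11-02")
print(w)
```

Each row could then be passed to a query as `paste(w$s[i], w$e[i])`. Note that every window would still be scaled on its own, so the pieces are not directly comparable.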

The code you have builds a separate query for each month:

  pm <- tibble(s = seq(ymd(from), ymd(to), by = 'month'), 
               e = seq(ymd(from), ymd(to), by = 'month') + months(1) - days(1))
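
With `to = '2004-11-02'`, the last window built this way runs to 2004-11-30, past the requested end date. A minimal sketch of one way to clamp the final window, written in base R and assuming `from` falls on the first day of a month as in the example call:

```r
from <- as.Date("2004-01-01")  # assumed to be the first day of a month
to <- as.Date("2004-11-02")

# One start date per month, as in the original pm table.
starts <- seq(from, to, by = "month")
# Each window ends the day before the next one starts;
# the last window ends at `to` instead of at the end of its month.
ends <- c(starts[-1] - 1, to)

pm <- data.frame(s = starts, e = ends)
print(tail(pm, 2))
```

This only keeps the monthly queries inside the requested range. It does not change how the monthly multipliers are computed, so on its own it may not remove the NA values.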

Also note that what you are doing makes the resulting time series hardly useful, since the queries are not comparable over time. You are stitching together daily hits that are each standardized within the time frame of their own download.
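
A small illustration of why separately downloaded windows are not comparable. The numbers are made up, and the scaling rule (each query rescaled so that its own peak equals 100) is an assumption used for this sketch:

```r
# Made-up underlying daily search volumes for two adjacent windows.
raw_jan <- c(10, 20, 40)
raw_feb <- c(40, 80, 160)

# Assumed scaling rule: each window is rescaled so its own peak is 100.
scale_window <- function(x) round(100 * x / max(x))

hits_jan <- scale_window(raw_jan)
hits_feb <- scale_window(raw_feb)

print(hits_jan)
print(hits_feb)
```

Both windows yield the same scaled series, although the second has four times the volume. The same raw volume of 40 appears as 100 in the first window and as 25 in the second.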

See our paper: https://www.sciencedirect.com/science/article/abs/pii/S2452306221001210

This has nothing to do with the package, it is just how your code is written.
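
One way the NA values could arise from the code alone is sketched below with hypothetical data. It assumes the aggregated query returns no observation dated in the final, partial month, so the multiplier table has no row for that month and the left join leaves those days unmatched. Base R `merge()` stands in for `left_join()` here.

```r
# Hypothetical daily rows, keyed by year-month as in trend_m.
trend_m <- data.frame(
  date = as.Date(c("2004-10-30", "2004-10-31", "2004-11-01", "2004-11-02")),
  hits = c(50L, 100L, 80L, 40L)
)
trend_m$ym <- format(trend_m$date, "%Y-%m")

# Hypothetical multipliers with no row for 2004-11 (this is the assumption).
mult_m <- data.frame(ym = "2004-10", mult = 0.5)

# Left join: days in an unmatched month get mult = NA, so est_hits is NA too.
res <- merge(trend_m, mult_m, by = "ym", all.x = TRUE)
res$est_hits <- res$hits * res$mult
print(res[, c("date", "est_hits")])
```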

@PMassicotte
Owner
