Lecture on web scraping and (perhaps) text #87
Comments
Useful gadget for analyzing HTML |
Something I think could be common for a class of websites. |
Yeah, I think there are a large number of sites where you really need to run Selenium: it both handles cookies and runs the JavaScript (at which point the page can be scraped by the other tools). It would be great if we could show a very minimal example of RSelenium, if it is relatively easy to show. Do you guys want to grab the R scraping textbooks from my office? |
Relevant repos I have found in Github so far: Useful R packages for data cleaning:
Fun examples (not necessarily economics):
|
Jasmine and I had some discussion about how the lecture can be delivered:
|
I really want to emphasize the web scraping rather than tidyverse transformations. The goal should be building people's confidence that they can (1) scrape numerical data from the web and (2) work with text as data. It is more important for me to show the tools than anything else. |
If all we did was give a 1.25-hour presentation on how to scrape a couple of websites, I would be happy. To be clear, we do not need an economic application for the data, just that we should be scraping data that could be applied to economic problems. For example, you could even take a World Bank (or whatever) page that has a "download" button and say "let's pretend it didn't have that button; I will show you how you could have gotten the data anyway." |
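To make the "pretend there is no download button" idea concrete, here is a minimal sketch of pulling a table out of a page's HTML. It is written in Python with only the standard library so it runs anywhere; in R, `rvest::html_table()` plays a similar role. The `SAMPLE_HTML` string, its placeholder values, and the names `TableExtractor` and `extract_table` are all made up for illustration; a real page would be downloaded first and would likely need more careful handling (nested tables, merged cells).

```python
from html.parser import HTMLParser
from typing import List, Optional

# Hypothetical page content: a stand-in for a data page with no download button.
# The values are placeholders, not real statistics.
SAMPLE_HTML = """
<html><body>
<table>
  <tr><th>Country</th><th>GDP growth (%)</th></tr>
  <tr><td>Country A</td><td>2.1</td></tr>
  <tr><td>Country B</td><td>3.4</td></tr>
</table>
</body></html>
"""


class TableExtractor(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self) -> None:
        super().__init__()
        self.rows: List[List[str]] = []
        self._row: Optional[List[str]] = None   # cells of the row being read
        self._cell: Optional[List[str]] = None  # text pieces of the current cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def extract_table(html: str) -> List[List[str]]:
    """Return all table rows found in `html` as lists of cell strings."""
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    return parser.rows


rows = extract_table(SAMPLE_HTML)
header, data = rows[0], rows[1:]
```

From here, `header` and `data` can be handed to whatever data-frame tool people already use.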
Just so it's not forgotten, I wanted to link to the notes Jasmine Yang produced on this from a few months ago. If the issue has evolved since then, please feel free to disregard. https://github.com/ubcecon/computing_and_datascience/blob/master/python_sandbox/Web-Scraping.md |
Simple tutorial on using rvest I wrote yesterday: https://github.com/chiyahn/notes/blob/master/programming/data-mining/rvest/text-mining-with-rvest.md |
Relatedly, I have attached code that scrapes the AER website to look at programming language usage. Patrick independently did the same thing, and his results are going to be part of the next AER annual report. His code is perhaps a bit nicer: https://github.com/pbaylis/econ-program-usage
|
@pbaylis Can you give these guys your code to prepare as an Rmd file? I think it would be a nice example code to give people. That said, I want to stress in class an example with data that is not "inside economics" so they don't think of this stuff as just a novelty. |
I don't think it's actually all that clean but sure. Here's the repo. One note - I keep this code in a private repo because I don't want to be seen as encouraging people all over the internet to hammer the AER website (although thanks to Paul, it's considerably more gentle than it could have been). So it's important to talk about being a good scraping citizen when you do this sort of thing: test on a small subset until you know it works, don't parallelize downloading code, and include sleep time when downloading lots of large files or a bunch of websites (which, honestly, my code should do more of). |
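The "good scraping citizen" rules above (small subset first, no parallel downloads, sleep between requests) can be sketched as a single helper. This is an illustrative Python sketch, not the code from the repo mentioned above; the names `polite_download` and `fetch_url`, the 2-second default delay, and the User-Agent string are all assumptions chosen for the example.

```python
import time
from typing import Callable, Dict, Iterable, Optional
from urllib.request import Request, urlopen


def fetch_url(url: str, timeout: float = 30.0) -> bytes:
    """Download one URL, identifying ourselves with a (hypothetical) User-Agent."""
    request = Request(
        url,
        headers={"User-Agent": "scraping-lecture-demo (contact: you@example.com)"},
    )
    with urlopen(request, timeout=timeout) as response:
        return response.read()


def polite_download(
    urls: Iterable[str],
    delay_seconds: float = 2.0,         # assumed default; tune to the site
    limit: Optional[int] = None,        # set a small number while testing
    fetch: Callable[[str], bytes] = fetch_url,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, bytes]:
    """Fetch URLs one at a time, pausing between requests."""
    selected = list(urls)
    if limit is not None:
        selected = selected[:limit]     # test on a small subset first

    results: Dict[str, bytes] = {}
    for index, url in enumerate(selected):
        if index > 0:
            sleep(delay_seconds)        # pause before every request after the first
        results[url] = fetch(url)       # strictly sequential, no parallelism
    return results
```

A first run might look like `polite_download(all_urls, limit=5)`, removing `limit` only once the parsing code works on that subset.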
@pbaylis Alright Debbie Downer. You environmental economists spend too much time thinking about ethics and the tragedy-of-the-commons. The optimal non-cooperative strategy here is slash-and-burn webscraping. But we will pass on your bleeding heart messages of being good scraping citizens along with the code! |
`rsDriver` seems to have a connection issue, so to deal with cookies it looks like we need to install Docker to run RSelenium |
We have given the students basic Docker instruction, so we could conceivably pass the RSelenium example on to them... But I don't think we should use that in the core demo in class (just supplementary links if they want to go further). Let's keep things simple. Also, it is more important to me that we show clean, simple examples than fancy stuff, if that stuff is tricky to set up. Also @chiyahn and @jasminefish000 I want to make sure you guys are talking and planning things out together. If you are both off doing your own things for this lecture, there might be a lot of replication of effort. |
For what it's worth, I've had no problem using RSelenium without Docker on Linux.
|
The lecture is planned for November 14th
My goal is primarily to help people realize that scraping the web and doing text analysis is not scary! I don't want fear of it to be a reason they are unwilling to get creative in building new sources of data.
You guys can play around with the directory https://github.com/ubcecon/computing_and_datascience/tree/master/R_sandbox etc.