Literature Search helper #38
@LesleyAtwood do you have thoughts on why there are duplicates? @nathanhwangbo it sounds like automatically searching the API--either with
@swood-ecology & @nathanhwangbo , the 'duplicates' are a product of the four separate reviews. These papers satisfied the criteria for more than one review, meaning that they had, for example, both tillage treatments and cover crop treatments.
Should we manage that by updating the literature search for each review separately, or as one big search?
For the reviewers' sake it makes more sense to update each review separately, but we will need to figure out a way to match papers across reviews so we don't double count the papers.
@nathanhwangbo , to help match references I added a refs_all_expanded.csv to google drive. Paper titles may be the best way to match, provided the matching process isn't case sensitive. Also, we had to manually add a majority of the DOIs because they weren't included in the references.
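A case-insensitive title match could be done by normalizing titles before joining. Here's a minimal base-R sketch; the column names (`title`) are assumptions, not the actual schema of refs_all_expanded.csv:

```r
# Normalize titles so matching ignores case, punctuation, and stray spaces.
normalize_title <- function(x) {
  x <- tolower(x)                    # case-insensitive
  x <- gsub("[[:punct:]]+", " ", x)  # ignore punctuation differences
  gsub("[[:space:]]+", " ", trimws(x))
}

# refs_a / refs_b stand in for two of the review reference tables.
refs_a$title_key <- normalize_title(refs_a$title)
refs_b$title_key <- normalize_title(refs_b$title)
matched <- merge(refs_a, refs_b, by = "title_key")
```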
Thanks for adding the file! Sorry if my original post wasn't clear. We will be able to automatically search the API using the R package. My current process to match the query with the doc is as follows:
After this process, there are only 3 papers left unmatched. The other two are indexed by the API, so the problem has something to do with the query, not the fact that we're using the lite API. To confirm that the query is the issue, I tried passing the query into the Web of Science website, and was unable to find either of these two papers. I'm wondering if maybe these papers have changed keywords since the last time you ran the query. Questions:
@nathanhwangbo , will you please send me a list of the titles with typos. I'd like to fix those before the database is freely accessible.
@nathanhwangbo , I can't access any of the three links you sent because I'm off campus. Can you send me the paper titles? From the image above, it looks like you searched WoS like we did. I'm not sure why there are two papers excluded from the query list. Once you send me those papers I can investigate. We narrowed the query results in two stages. First, we reviewed all titles and abstracts and excluded ones that gave any indication that the paper would not match our criteria. If the paper passed the title and abstract review, then we downloaded the entire PDF and read the paper to determine if it matched our criteria. Data were extracted from the papers that met our criteria. It was a long process. I don't think expanding the query to include the two rogue papers is worth it; there will already be quite a few papers to filter through, and we don't want to add to that part of the process.
Here's the file with the names I was able to fuzzy-match: https://github.com/Science-for-Nature-and-People/Midwest-Agriculture-Synthesis/blob/master/title_fuzzymatch.csv.
Here are the paper titles for the three links, in the same order as above: "Impact of corn residue removal on soil aggregates and particulate organic matter" (doi: 10.1007/s12155-014-9413-0); "Site-specific nitrogen management of irrigated maize: yield and soil residual nitrate effects" (doi: 10.2136/sssaj2002.0544)
Based on the keywords, "Impact of corn residue removal on soil aggregates and particulate organic matter" wouldn't show up in the search. However, it does fit our criteria so we'll keep it. I'm surprised "Site-specific nitrogen management of irrigated maize: yield and soil residual nitrate effects" doesn't show up in your search because it includes "variable rate application" as a keyword, which is one of the Nutrient Management search terms. It also fits our criteria, so we'll keep it in the database. When you run the fuzzy-match search do you get the same number of "Papers returned from initial search" as I report in the table I sent you?
No, they're slightly lower. I originally assumed that was just because we're not looking at the entire web of science library, but maybe something's going on with the query. For reference, here's how the numbers compare:
I just found a new feature within Colandr that calculates the number of unique papers included in the search. While it didn't change much for 3 of the reviews, the # of papers included in the cover crop review now matches your number (351). I think the discrepancy comes down to how I initially searched the papers. If you recall, I accidentally excluded Illinois from my list of states and then had to run a search specifically for Illinois at a later date. When I merged the bib files in Colandr it didn't always remove duplicate papers (possibly due to slight differences in title cases or spaces). The numbers I report in the table are based on the paper counts in Colandr. @Steve, do you think we're okay to move forward when there are a few papers missing in @nathanhwangbo's search? I honestly think it's something with Colandr.
On a slightly different note, Julien made a function in the past to find DOIs using article titles. The only concern is papers where the function links to a totally different paper; it'll be hard to catch these when we are looking at new papers. I modified Julien's function to include a tolerance parameter (i.e., letting us choose how close two titles have to match before we decide they're the same paper), and I've been playing around with trying to find a pretty "safe" parameter value without having to manually search for all the DOIs.
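For reference, a tolerance parameter along those lines might look like this. This is a sketch using the stringdist package, not Julien's actual function; the function and column names are illustrative:

```r
library(stringdist)

# Match each new title to its closest known title; accept the match only if
# the normalized Jaro-Winkler distance is within `tolerance` (0 = identical).
match_titles <- function(new_titles, known_titles, tolerance = 0.1) {
  d    <- stringdistmatrix(tolower(new_titles), tolower(known_titles),
                           method = "jw")
  best <- apply(d, 1, which.min)
  dist <- apply(d, 1, min)
  data.frame(new_title  = new_titles,
             best_match = known_titles[best],
             distance   = dist,
             accepted   = dist <= tolerance)
}
```

Raising the tolerance catches more typo-level differences but also raises the risk of linking to a totally different paper, so the accepted/rejected flag is worth spot-checking by hand.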
@LesleyAtwood and @nathanhwangbo I think moving forward is fine when it's just a few papers. The tillage search looked like it had a pretty big difference (1130 to 1004). Can that be chalked up to the Illinois search issue? If you think Colandr is the issue, could we use the .bib file that came directly from WoS to match, rather than the .csv generated by Colandr?
Also, @nathanhwangbo I think what matters the most is getting the new search to match the old, whether that's by DOI or title. You shouldn't worry about correcting DOIs (unless it improves search matching) because it's not totally essential that I be able to use
I just ran the search through WoS for Tillage. Here are the results, which are much more similar to what I got back in 2018 than what @nathanhwangbo 's table shows. I just copied and pasted the keywords I sent you into WoS and included or excluded Illinois. @nathanhwangbo , maybe rerun your queries like I did and we can see if your results match mine. If they don't then double check that you included all the keywords for each topic.
What do you think accounts for the difference between your run and @nathanhwangbo's? Something to do with the automated vs. manual search? That level of difference seems pretty good to me.
@swood-ecology I'm really not sure why our results differ. Clearly the automated search is dropping results, but I'm even more perplexed why his manual search results don't match mine. Hopefully we'll know more once @nathanhwangbo runs the queries again.
Found the culprit. The difference between the two manual search results is a small difference in queries: the first term of the Tillage Specific Query in the doc has quotes around it in one version but not the other. Should I just stick with the old version? The difference is that the unquoted version is equivalent to a broader search, so it matches more records.
So that explains the difference between our WoS manual searches (1112 vs 1029). I double checked, and the difference between my WoS manual search and my WoS automated search (1029 -> 1004) is a direct result of querying using the Lite API (it searches fewer collections).
Nice catch! Is it hard to make a list of the papers that were included in the original search but not the new one? My hunch is that we want
For tillage:
For Early Season Pest Management:
I think we're good on tillage: those two papers made it into the final reference list because they're papers about
Ah, I didn't think about that! A quick check shows that you're right: both of the papers also show up in the cover crop query. I'll go ahead and keep the quoted versions of the queries, then.
I just realized that the Web of Science API doesn't give us a full citation (let alone in Bibtex format), so I'm imagining the following workflow:
The alternative to using
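One way to fill the citation gap (an assumption about the workflow, not a settled plan) is to ask Crossref for BibTeX by DOI with rcrossref::cr_cn(); the two DOIs below are the ones from earlier in the thread:

```r
library(rcrossref)

dois <- c("10.1007/s12155-014-9413-0", "10.2136/sssaj2002.0544")

# cr_cn() uses DOI content negotiation; wrap it so one bad DOI
# doesn't kill the whole run.
bib_entries <- vapply(dois, function(d) {
  tryCatch(cr_cn(d, format = "bibtex"), error = function(e) NA_character_)
}, character(1))

writeLines(bib_entries[!is.na(bib_entries)], "new_papers.bib")
```

The output filename is illustrative. Note that Crossref's BibTeX also generally lacks the abstract field, so this solves the citation problem but not the screening one.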
Started to implement this workflow. FYI, there are 289 different papers that show up if we run the queries for 2018-2019. Question: what's the plan for filtering through the query results? Is Colandr going to be involved?
Colandr could be involved, but it's not necessary. The benefit of Colandr is that it helps keep the reviewer organized. I found, however, that the machine learning component of Colandr didn't really speed up the review process. Because we don't have another option lined up, let's plan to use Colandr. Is there any way we can get both the bib files and PDFs to automatically upload into Colandr? They will have to be loaded by review so that the filtering criteria don't become too overwhelming.
@nathanhwangbo those 289 papers are ones that didn't show up in the original search but do show up when you run the search to the present day? That's a lot! Sounds like I'll have to carve out some time to go through those. @LesleyAtwood I'm fine with using Colandr for this, but I agree that it really just gives us a home base for going through things, rather than a useful machine learning tool. I'm not sure how you'd auto-upload stuff to Colandr since you can't interact with it using code (as far as I know). I think we'd have to have the script auto-run in the background somewhere, ping us once there were a certain number of papers, and then we'd take that .bib and upload it manually into Colandr.
I agree, 289 papers is a lot. I guess the topics are more popular than ever!
Yup, these 289 are completely new papers (it's actually 290 now 😄). Out of these 290 papers, I wasn't able to find DOIs for 3 of them; the titles for those three are saved separately. The rest are in the repo. I'm starting to play with BibScan, but haven't had a high success rate for downloading PDFs yet.
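For the papers missing DOIs, a Crossref title lookup is one possible fallback. This is a sketch using rcrossref::cr_works(), with a distance check to guard against the wrong-paper matches discussed above; the function name and threshold are illustrative:

```r
library(rcrossref)
library(stringdist)

# Look up a DOI by title; return NA unless the top hit's title is very close.
find_doi <- function(title, max_dist = 0.1) {
  hit <- cr_works(query = title, limit = 1)$data
  if (is.null(hit) || nrow(hit) == 0) return(NA_character_)
  d <- stringdist(tolower(title), tolower(hit$title), method = "jw")
  if (is.na(d) || d > max_dist) return(NA_character_)
  hit$doi
}
```

Papers that come back NA would still need a manual search, but the check should keep clearly wrong hits out of the database.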
Oh geez! Well I should get on reviewing those soon. @LesleyAtwood do you think we could load those into Colandr now and I could start reviewing?
@swood-ecology, yes, the Colandr reviews are cleared and ready for the next batch of papers. It will probably be easiest to use the same framework I used because the selection criteria are already created and described. I can send you the selection criteria protocol by the end of the day. It's ready to go; I just want to read over it again.
Great. Should we go over the first couple together just to make sure I have it right? Maybe on Monday? @nathanhwangbo do you have the searches saved to .bib files that we can load into Colandr? Also, do you think it would be possible to have a
@swood-ecology , Tuesday would be better for me. I'm free between 8-11 and after our SOC market meeting.
I have the .bib files saved in the repo sorted by date of query (see here: https://github.com/Science-for-Nature-and-People/Midwest-Agriculture-Synthesis/tree/master/auto_pubsearch/Bibfiles). However, I tried to import one of them into Colandr and wasn't having any luck (I would try to import the file and nothing would happen). I'm totally new to Colandr, so it might just be that I'm missing a step. @brunj7 also tried with a different bibfile and got the same result, though he was able to get the import to work with a different file type. A few other notes:
@nathanhwangbo this paper describes a pretty cool workflow for regularly updating data, like our .bib files. |
@nathanhwangbo are the .bib files separated out by literature review? They didn't seem to be in the repo. You'll see that Lesley has 4 reviews, each of which corresponds to the search criteria for that review. I hope this isn't too complicated, but we'd need different .bib files for each review, rather than one overall .bib.
It's not a problem, but the reason I put all the reviews together was so that I could easily remove papers that are duplicated across reviews. Should I just leave the duplicates in there?
I see what you mean. I would keep them separate, though, the way Lesley has done it. We'd have to totally re-do Colandr to be able to read in only one file.
Ok, the files split by review are here (look for the 20191007 files): https://github.com/Science-for-Nature-and-People/Midwest-Agriculture-Synthesis/tree/master/auto_pubsearch/Bibfiles
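To keep one .bib per review while still catching cross-review duplicates, the files could be read back in and flagged rather than de-duplicated. A sketch assuming the bib2df package and the 20191007 file naming (column names follow bib2df's uppercase convention):

```r
library(bib2df)

files <- list.files("auto_pubsearch/Bibfiles",
                    pattern = "^20191007.*\\.bib$", full.names = TRUE)

refs <- do.call(rbind, lapply(files, function(f) {
  df <- as.data.frame(bib2df(f))
  data.frame(title = df$TITLE, review = basename(f))
}))

# Flag (not drop) titles that appear in more than one review,
# so nothing gets double counted later.
key <- tolower(trimws(refs$title))
refs$dup_across_reviews <- duplicated(key) | duplicated(key, fromLast = TRUE)
```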
Thanks @nathanhwangbo. I tried uploading the files to Colandr and was having the same problem with the references not being added. At first I wondered if it was a permissions issue, so I tried to upload the .bib to a Colandr review for which I'm the owner (@LesleyAtwood owns the AgEvidence reviews). That didn't work. Then I wondered if there might be something different about the .bib files that you're writing from R vs. what's downloaded directly from WoS. One thing I noticed is that the .bib files you generated don't have all of the information that the WoS .bib files do, which includes things like the full abstract, which is needed for the initial screening. Do you think this is reconcilable, or do you think we should be thinking about a workflow where perhaps the
Also, @LesleyAtwood identified some differences between the references your search generated vs. the original approach. Here's your reference:
Thanks for doing that testing. Can you do similar testing to show how you imported your Wisconsin test file? In general, though, we are able to get most of the information in the original references; the ones we don't have are Abstract and EISSN. That being said, if abstracts are required in the
That's too bad you can't get abstracts, because those are definitely a must-have. They pop up in Colandr and allow us to screen the papers within Colandr. So maybe we should think about the manual workflow. Do you think there's anything we could automate, like doing the search part? I'm not sure what's up with Colandr not importing those .bib files. Let me quickly email the creator and see if she has any idea.
Started work on this in
wos_pubsearch.R
rwos
to access the Web of Science APIwosr
no longer works.references_for_app.csv
, so I wanted to make sure that all the papers in here were present in our giant data frame.csv
only has the citation (which isn't available in the big data frame), so I thought the best thing to do would be to pull out the doi from the citation and try to match the doisreferences_for_app
were in the giant data frame, which is pretty bad. The reason for a lot of it is doi missingness in the results data framereferences_for_app.csv
that have the same title, first author, and year.Question: Is there are a reason these duplicates are in
references_for_app.csv
? See the Paper_id pairs: (129, 204) , (45, 310), (142, 95) for exact matches, and (27, 315), (55,189), (231,128) for almost exact matchesThe text was updated successfully, but these errors were encountered: