sbtools get #272

Merged
merged 9 commits into from
Jul 18, 2017
Changes from 6 commits
2 changes: 1 addition & 1 deletion content/usgs-packages/sbtools_Discovery.Rmd
@@ -1,6 +1,6 @@
---
title: "sbtools - Data discovery"
-date: "9999-07-01"
+date: "9999-07-25"
author: "Lindsay R. Carr"
slug: "sbtools-discovery"
image: "img/main/intro-icons-300px/r-logo.png"
181 changes: 87 additions & 94 deletions content/usgs-packages/sbtools_Discovery.md

Large diffs are not rendered by default.

179 changes: 179 additions & 0 deletions content/usgs-packages/sbtools_Get.Rmd
@@ -0,0 +1,179 @@
---
title: "sbtools - Download Data"
date: "9999-07-01"
author: "Lindsay R. Carr"
slug: "sbtools-get"
image: "img/main/intro-icons-300px/r-logo.png"
output: USGSmarkdowntemplates::hugoTraining
parent: Introduction to USGS R Packages
weight: 2
draft: true
---

```{r setup, include=FALSE, warning=FALSE, message=FALSE}
library(knitr)

knit_hooks$set(plot=function(x, options) {
sprintf("<img src='../%s%s-%d.%s'/ title='%s'/>",
options$fig.path, options$label, options$fig.cur, options$fig.ext, options$fig.cap)

})

opts_chunk$set(
echo=TRUE,
fig.path="static/sbtools-get/",
fig.width = 6,
fig.height = 6,
fig.cap = "TODO"
)

set.seed(1)
```

This lesson will describe the basic functions to manage ScienceBase authenticated sessions and view or download ScienceBase items. If you aren't sure what a ScienceBase item is, head back to the [previous lesson on `sbitems`](/sbtools-sbitem).

Don't forget to load the library if you're in a new R session!

```{r sbtools-library, message=FALSE, warning=FALSE}
library(sbtools)
```

```{r sbtools-auth, echo=FALSE}
# run vizlab::storeSBcreds() once before this can work
home <- path.expand('~')
sbCreds <- file.path(home, ".vizlab/sbCreds")
credList <- readRDS(sbCreds)
un <- rawToChar(credList$username)
pw <- rawToChar(credList$password)
sbtools::authenticate_sb(un, pw)
Member

this looks like a good solution already, but is secrets the way of the future? you mentioned that you and JW got it working for you...are you ready to switch over here?

Contributor Author

I don't think it's ready thanks to the windows complications. I'll make an issue for later

Contributor Author

See #277

```

## Authentication

This section is specific to authentication with ScienceBase. If you don't have a ScienceBase account, skip to the next section. Just know that you will only be able to download public data.

The first step to authenticating (or logging in) to ScienceBase is to use the function `authenticate_sb`. The arguments are your username and password. Alternatively, you can use the function interactively by not supplying any arguments. It will prompt you for your username in the R console and then your password in a pop-up window. Be very cautious when using the username and password arguments - don't include these in any scripts! To be safe, you can leave out the arguments and use the interactive login. Try interactively logging in:

```{r sbtools-login, eval=FALSE}
authenticate_sb()
```
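
For runs where an interactive prompt is not possible, one way to keep credentials out of a script is to read them from environment variables. The sketch below is an assumption rather than part of the lesson: the variable names `SB_USER` and `SB_PASS` and the helper `login_from_env` are hypothetical, and only `authenticate_sb` comes from `sbtools`.

```r
# Hypothetical helper: log in only when both environment variables are set.
# SB_USER and SB_PASS are made-up names; pick whatever suits your setup.
login_from_env <- function(user_var = "SB_USER", pass_var = "SB_PASS") {
  user <- Sys.getenv(user_var, unset = "")
  pass <- Sys.getenv(pass_var, unset = "")

  # Do nothing if either value is missing or sbtools is not installed
  if (!nzchar(user) || !nzchar(pass) ||
      !requireNamespace("sbtools", quietly = TRUE)) {
    return(invisible(FALSE))
  }

  sbtools::authenticate_sb(user, pass)
  invisible(TRUE)
}
```

The helper returns `FALSE` without contacting ScienceBase when either variable is unset, so the script itself never contains the password.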

To double check that your authentication was successful, use the function `is_logged_in`. It will return a logical to let you know whether you are logged in. No arguments are needed.

```{r sbtools-verifylogin}
is_logged_in()
```

Each user has a specific ScienceBase id associated with their account. The user ids can be used to inspect what top-level items are saved under your account (discussed in the next section). To determine your user id, use the function `user_id` in an authenticated session. No arguments are necessary.
Member

i know the function is user_id, but it bothers me to call a user's top level item an id. what about home item?

Each user has a home ScienceBase item associated with their account. You can inspect the items and
 files attached to your home item (discussed in next section) and even add new items and files
 (discussed in next lesson). To determine the ScienceBase ID of your home item, use the function
 `user_id` in an authenticated session. No arguments are necessary.

or maybe personal if you don't like home?

Contributor Author

I think home item works, and that really helps clear up some of the initial confusion I had with working on this function. I like it 👍


```{r sbtools-userid}
user_id()
```

When you're done with your session, you can actively log out using the `session_logout` function. No arguments are required. If you do not do this, you will be automatically logged out after a certain amount of time or when you close R.
Member

"the session_logout" -> "session_logout" or "the session_logout function"


## Inspect and download items

The first inspection step for ScienceBase items is to determine whether the item exists. To do this, use the function `identifier_exists`. The only required argument is `sb_id`, which can be either a character string of the item id or an `sbitem`. It will return a logical to indicate whether the item exists.
Member

based on http://www.quickanddirtytips.com/education/grammar/if-versus-whether, i'd recommend "It will return a logical to indicate whether the item exists or not."

Contributor Author

📖 learning so much!


```{r sbtools-identifierexists}
identifier_exists("4f4e4acae4b07f02db67d22b")
identifier_exists("thisisnotagoodid")
```
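
Before querying ScienceBase, a quick local check can catch obvious typos in an id. The sketch below assumes that item ids are 24-character lowercase hexadecimal strings, like the ids used in this lesson. That format is inferred from those examples and is not a documented guarantee, and `looks_like_sb_id` is a hypothetical helper that is not part of `sbtools`.

```r
# Hypothetical format check; the 24-hex-character pattern is an assumption
looks_like_sb_id <- function(x) {
  is.character(x) & grepl("^[0-9a-f]{24}$", x)
}

looks_like_sb_id("4f4e4acae4b07f02db67d22b")
looks_like_sb_id("thisisnotagoodid")
```

A passing check only says the string has a plausible shape; `identifier_exists` is still needed to confirm that the item is really there.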

ScienceBase items can be described by alternative identifiers, e.g. digital object identifiers, IPDS codes, etc. They are defined on ScienceBase with a scheme, type, and key. For examples of identifiers, see the "Additional Information | Identifiers" section of [Differential Heating](https://www.sciencebase.gov/catalog/item/580587a2e4b0824b2d1c1f23).
Member

noticing that this section is partially redundant with line 203 of PR #271 (https://github.com/USGS-R/training-curriculum/pull/271/files#diff-41534c65bd2e578b204b437073f26e2dR203). could possibly move the formal, full definition of identifiers (with examples) into https://github.com/USGS-R/training-curriculum/blob/master/content/usgs-packages/sbtools_sbitem.Rmd. it'll probably still be good to have a reminder here and in sbtools_Modify.Rmd, but these could become shorter definitions that refer back to sbtools_sbitem.Rmd


You can use the function `item_exists` to check whether or not a scheme-type-key tuple already exists. The function has three required arguments - `scheme`, `type`, and `key`. Note that the table of alternative identifiers on ScienceBase is in a different order than this function accepts. On ScienceBase: type, scheme, key. For `item_exists`: scheme, type, key.
Member

good tip!


```{r sbtools-itemexists}
# test a made up tuple
item_exists(scheme = "made", type = "this", key = "up")

# test a tuple from the SB item "4f4e4acae4b07f02db67d22b"
item_exists(scheme = "State Inventory", type = "UniqueKey", key = "P1281")

# test the same scheme & type with a made up key
item_exists(scheme = "State Inventory", type = "UniqueKey", key = "1234")
```
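
Because the two orderings differ, passing the arguments by name avoids mix-ups. As an illustration only, the hypothetical helper `sb_row_to_args` below takes values in the order shown on the ScienceBase page (type, scheme, key) and returns a named list in the order that `item_exists` expects. The helper is not part of `sbtools`.

```r
# Hypothetical helper: reorder a ScienceBase table row into named arguments
sb_row_to_args <- function(type, scheme, key) {
  list(scheme = scheme, type = type, key = key)
}

args <- sb_row_to_args(type = "UniqueKey", scheme = "State Inventory", key = "P1281")
names(args)

# With sbtools loaded, the named list could be passed along with do.call:
# do.call(item_exists, args)
```

Since every element is named, the call is correct regardless of the order in which the values were copied from the web page.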

You can create sbitems from just the ScienceBase id. To do this use `as.sbitem`. *why you would use it*
Member

i think you could safely leave out as.sbitem entirely and just show item_get here instead. see comment on item_get below.


```{r sbtools-as-sbitem}
antarctica_sbitem <- as.sbitem("4f4e4b24e4b07f02db6aea14")
class(antarctica_sbitem)
antarctica_sbitem
```

Let's inspect various ScienceBase items. There are functions to look at the parent item, metadata fields, sub-items, and associated files. Each of these functions requires the id of the sbitem as the first argument. For all of these examples, we are going to use the same sbitem id, "4f4e479de4b07f02db491e34".

First, let's inspect the parent item. The function to use is `item_get_parent`, and the item id is the only necessary argument.

```{r sbtools-parent}
ex_id <- "4f4e479de4b07f02db491e34"
ex_id_parent <- item_get_parent(ex_id)
ex_id_parent$title
```

Now, let's see if this item has any children by using the `item_list_children` function, which returns a list of sbitems. Notice that this function says "list" and not "get" as the previous one did. The prefix does not reliably indicate what kind of object a function returns, so check each function's documentation.
Member

i really like that you've attempted a high-level explanation here. that said, i don't think item_get_parent and item_get_fields actually pull down files, do they? maybe only item_get_wfs (if and when we get that one to work).

Contributor Author

Oh hmmm good point. I think I was going for the fact that there is more info. What's your take? Any higher-level difference here worth noting?

Member

well...hmm...no, i'm not seeing any consistent patterns here...

  • item_list_children returns a list of sbitems
  • item_list_files returns a data.frame
  • item_get_parent returns a single sbitem
  • item_get_fields returns a list of info or drops from list to vector
  • item_get_wfs returns an sp object, e.g., SpatialPolygonsDataFrame
  • item_get returns a single sbitem

you could almost claim that item_get* expects one of each of whatever it's looking for, except that item_get_fields(x, 'identifiers') can return more than one set of identifiers, so...still nope.

Contributor Author

Ok, I'll just leave this out completely.


```{r sbtools-children}
ex_id_children <- item_list_children(ex_id)
length(ex_id_children)
sapply(ex_id_children, function(item) item$title)
```
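
`sapply` works here, but it returns a list rather than a character vector if any child lacks a title. A slightly more defensive sketch uses `vapply` with a fallback value. The mock `children` list below stands in for the result of `item_list_children`; its ids and titles are invented for illustration.

```r
# Mock stand-in for item_list_children() output (invented values)
children <- list(
  list(id = "child-1", title = "First child"),
  list(id = "child-2")  # no title on purpose
)

# Return NA instead of NULL when a title is missing
get_title <- function(item) {
  if (is.null(item$title)) NA_character_ else item$title
}

titles <- vapply(children, get_title, character(1))
titles
```

`vapply` guarantees a character vector with one element per child, which is easier to use in later steps such as building a table of titles.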

Let's check to see if this item has any files attached to it using `item_list_files`. This will return a data frame with three columns: `fname` (filename), `size` (file size in bytes), and `url` (the URL to the file on ScienceBase).

```{r sbtools-files}
ex_id_files <- item_list_files(ex_id)
nrow(ex_id_files)
ex_id_files$fname
```

To actually get the files into R as data, you need to use their URLs and the appropriate parsing function. Both of the files returned for this item are XML, so you can use the `xml2` function, `read_xml`. As practice, we will download the first XML file.

```{r sbtools-filedownload}
xml2::read_xml(ex_id_files$url[1])
```
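
When an item has files of several types, the file list can be filtered before parsing. The sketch below uses a mock data frame with the same three columns that `item_list_files` returns (`fname`, `size`, `url`). The file names, sizes, and `example.com` URLs are invented for illustration.

```r
# Mock stand-in for item_list_files() output (invented values)
files <- data.frame(
  fname = c("metadata.xml", "readme.txt", "extra.XML"),
  size  = c(1024, 200, 2048),
  url   = c("https://example.com/metadata.xml",
            "https://example.com/readme.txt",
            "https://example.com/extra.XML"),
  stringsAsFactors = FALSE
)

# Keep only XML files, ignoring the case of the extension
is_xml <- tolower(tools::file_ext(files$fname)) == "xml"
xml_urls <- files$url[is_xml]
xml_urls

# Each URL could then be parsed with, for example:
# lapply(xml_urls, xml2::read_xml)
```

Filtering on the extension is a heuristic; a file with an unusual name may still need to be inspected by hand.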

You can also inspect specific metadata fields of ScienceBase items. To do this, use the `item_get_fields` function. In addition to the item id, this function requires a second argument, `fields`, which is a character vector of the fields you want to retrieve. See the [developer documentation for a SB item model](https://my.usgs.gov/confluence/display/sciencebase/ScienceBase+Item+Core+Model) for a list of potential fields. When only one field is requested, the argument `drop` controls whether the returned object remains a list (`drop=FALSE`) or becomes a vector (`drop=TRUE`). The default is `drop=TRUE`.
Member

💯 for the link


```{r sbtools-fields}
# request multiple fields
multi_fields <- item_get_fields(ex_id, c("summary", "tags"))
length(multi_fields)
names(multi_fields)

# single field, drop=TRUE
single_field_drop <- item_get_fields(ex_id, "summary")
names(single_field_drop)
class(single_field_drop)

# single field, drop=FALSE
single_field <- item_get_fields(ex_id, "summary", drop=FALSE)
single_field
class(single_field)
```
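
The effect of `drop` can be seen without a ScienceBase connection by mimicking it on a plain list. The sketch below is not the `sbtools` implementation: `pick_fields` is a hypothetical stand-in that only reproduces the behavior described above, and the mock `item` values are invented.

```r
# Mock item with two fields (invented values)
item <- list(summary = "A short summary", tags = list("water", "ice"))

# Hypothetical stand-in for the drop behavior of item_get_fields()
pick_fields <- function(item, fields, drop = TRUE) {
  out <- item[fields]
  # Unwrap the list only when a single field was requested and drop is TRUE
  if (drop && length(out) == 1) out[[1]] else out
}

class(pick_fields(item, "summary"))                # "character"
class(pick_fields(item, "summary", drop = FALSE))  # "list"
names(pick_fields(item, c("summary", "tags")))
```

Using `drop=FALSE` keeps the return type the same no matter how many fields are requested, which can simplify code that loops over many items.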

If a field is empty, it will return `NULL`.

```{r sbtools-fields-empty}
# request nonexistent fields
Member

"request a nonexistent fields" -> "request nonexistent fields"

item_get_fields(ex_id, c("dates", "citation"))
```
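
Empty fields can be separated from populated ones before further processing. The sketch below assumes the result is a named list in which empty fields are `NULL`; the mock `fields` list is invented and only stands in for what `item_get_fields` might return.

```r
# Mock result with two empty fields (invented values)
fields <- list(summary = "A short summary", dates = NULL, citation = NULL)

# Keep only the fields that actually have content
present <- Filter(Negate(is.null), fields)
names(present)

# Record which requested fields came back empty
missing_fields <- setdiff(names(fields), names(present))
missing_fields
```

Tracking the empty fields explicitly makes it clear whether a value is absent on ScienceBase or was simply never requested.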

Now that we've inspected the item, let's actually pull the item down with `item_get`. It returns the complete item, so there are a number of extra fields to inspect beyond the subset you request with `item_get_fields`.

```{r sbtools-get}
ex_id_item <- item_get(ex_id)
Member

i think item_get pulls down a subset of the info you could pull using item_get_fields, though i see that your above examples rightly pick and choose the info of interest. still, if i'm remembering that right, then maybe it'd make sense to show item_get up top rather than down here. i'd do it by replacing the as.sbitem example above with this item_get example. in my workflows i often use item_get first, to get my bearings, and then augment with item_get_fields.

Contributor Author

hmmm not sure I really follow. So item_get is not returning everything associated with an item?

Member

hmm. well, maybe i was wrong. i just tried item_get on a complicated item and can't find any fields that aren't already present. maybe it used to pull down a subset in an earlier sbtools version, or maybe i'm just waaaay confused and it's a good thing you're reading my comments with a critical eye =)

Contributor Author

Hmm was just peering through the history for the function and don't see anything too obvious, but I haven't ever touched the code so who knows.

Contributor Author

So do you still suggest this getting moved up top and deleting the as.sbitem stuff?

Member

@aappling-usgs Jul 18, 2017

Hmm...in light of all this re-education I've just had, what would you think about deleting both as.sbitem and item_get_fields? If item_get_fields is essentially a less-capable little sister of item_get...why would someone ever choose the function that requires them to dig around in documentation until they find the right field name, when they could just look through their complete item for it? What do you think?

Contributor Author

Hmm yeah that's a good point. Although, I could see how using that rather than searching through a list could be more readable. E.g. if you've got a a bunch of items and you're doing an lapply:

lapply(items, function(item) {
    list(item$summary, item$tags)
})

# versus

lapply(list(ex_id_item, l), item_get_fields, c("summary", "tags"))

Member

I suppose so. And you can avoid downloading some additional amount of text using item_get_fields, which could add up if you're requesting info on a whole lot of items. Well, so, maybe just leave item_get_fields and item_get as-is. I still don't see much point in introducing as.sbitem, but up to you whether you keep or delete it.

Contributor Author

I'll add a note about item_get_fields giving you a subset of what item_get will, and delete as.sbitem. I struggled with a use-case for a non-power user, so I think it would be best to avoid it all together.

names(ex_id_item)
```

## Web feature services to visualize spatial data???
Member

yeah...i'll work on this next

Member

ok, @lindsaycarr, my training-oriented notes are at #276 and i made a new sbtools issue at DOI-USGS/sbtools#244. it won't be super impressive, but I did identify three sb_ids whose WFSes can be retrieved.

Member

per luke's comments on DOI-USGS/sbtools#244, you could introduce item_get_wfs sorta like this: "the developers thought this could be a cool feature. here's what it might look like. the developers didn't want to invest too much if there wouldn't be demand, but if you'd use it a lot, submit or thumbs-up a GitHub issue"


*Need to pick a different item. This one errs since there is "no ScienceBase WFS Service available".*

```{r sbtools-wfs}
# ex_id_wfs <- item_get_wfs(ex_id)
# names(ex_id_item)
```