From 8d45c28921a896314375c05bb920feed6fe72830 Mon Sep 17 00:00:00 2001
From: camilavargasp
Date: Thu, 19 Oct 2023 14:37:49 -0700
Subject: [PATCH 1/2] streamlining transitions on data publishing lesson

---
 .../sections/publishing-data-knb-edi.qmd | 152 ++++++++++++------
 1 file changed, 105 insertions(+), 47 deletions(-)

diff --git a/materials/sections/publishing-data-knb-edi.qmd b/materials/sections/publishing-data-knb-edi.qmd
index 3687bff7..fe480a0c 100644
--- a/materials/sections/publishing-data-knb-edi.qmd
+++ b/materials/sections/publishing-data-knb-edi.qmd
@@ -5,7 +5,7 @@
 - Demonstrate how data and code can be documented and published in open
 data archives
 
-## The Data Life Cycle - A Recap
+## The Data Life Cycle
 
 The Data Life Cycle gives you an overview of meaningful steps data
 goes through in a research project, from planning to archival. This
@@ -82,34 +82,38 @@ The data center can help come up with a strategy to tile data structures by subs
 
 :::
 
+## Metadata
 
-## Data sharing and preservation
+Within the data life cycle you can be collecting data (creating new data) or integrating data that has already been collected. Either way, **metadata** plays a major role in moving successfully through the cycle, because it enables data reuse long after the original collection.
 
-![](images/WhyManage.png)
+Imagine that you're writing your metadata for a typical researcher (who might even be you!) 30+ years from now - what will they need to understand what's inside your data files?
 
-### Data repositories: built for data (and code)
+The goal is to have enough information for the researcher to **understand the data**, **interpret the data**, and then **reuse the data** in another study.
 
-- GitHub is not an archival location
-- Dedicated data repositories: KNB, Arctic Data Center, Zenodo,
-  FigShare
-  - Rich metadata
-  - Archival in their mission
-- Data papers, e.g., Scientific Data
-- List of data repositories: http://re3data.org
-![](images/RepoLogos.png)
+One way to think about metadata is to answer the following questions with the documentation:
+
+- What was measured?
+- Who measured it?
+- When was it measured?
+- Where was it measured?
+- How was it measured?
+- How is the data structured?
+- Why was the data collected?
+- Who should get credit for this data (researcher AND funding agency)?
+- How can this data be reused (licensing)?
 
-### Metadata
-Metadata are documentation describing the content, context, and
-structure of data to enable future interpretation and reuse of the data.
-Generally, metadata describe who collected the data, what data were
-collected, when and where it was collected, and why it was collected.
+It is also important to keep in mind how computers will organize and integrate this information. A number of **metadata standards** make your metadata machine readable, and therefore easier for data curators to publish your data (Ecological Metadata Language, Geospatial Metadata Standards, Biological Data Profile, Darwin Core, Metadata Encoding and Transmission Standard, etc.).
+
+Today we are going to focus on the Ecological Metadata Language, also known as EML, which is widely used in the earth and environmental sciences.
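Because EML is machine readable, EML records do not have to be written by hand - they can also be assembled and validated programmatically. A minimal sketch of that idea in R, assuming the rOpenSci `EML` package is installed (the person and title below are made up for illustration):

```{r}
#| eval: false
library(EML)

# Hypothetical creator/contact for the data set
me <- list(individualName = list(givenName = "Jane", surName = "Doe"))

# A minimal EML document: a dataset needs at least a title, a creator, and a contact
doc <- list(dataset = list(
  title = "Daily stream temperature, Example Creek, 2020-2023",
  creator = me,
  contact = me
))

write_eml(doc, "example-metadata.xml")  # serialize the list to EML XML
eml_validate("example-metadata.xml")    # check the record against the EML schema
```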
+
+> "The Ecological Metadata Language (EML) defines a comprehensive vocabulary and a readable XML markup syntax for documenting research data" ()
+
+
+
+
-
 
 
``` xml
 
 
@@ -138,16 +142,75 @@ here's an excerpt of the metadata for a sockeye salmon data set:
 
 ```
 
-That same metadata document can be converted to HTML format and
-displayed in a much more readable form on the web:
-https://knb.ecoinformatics.org/#view/doi:10.5063/F1F18WN4
-![](images/knb-metadata.png) And as you can see, the whole data set or
-its components can be downloaded and reused.
-Also note that the repository tracks how many times each file has been
-downloaded, which gives great feedback to researchers on the activity
-for their published data.
+
+
+## Data Identifiers & Citation
+
+Many journals require that a DOI - a digital object identifier - be assigned to the published data before the paper can be accepted for publication. This requirement exists so that the data can easily be found and easily linked to.
+
+Keep in mind that, generally, if the data package needs to be updated (which happens in many cases), each version of the package will get its own identifier. This way, researchers can - and should - cite the exact version of the data set that they used in their analysis. Identifying the data in this manner also allows dataset usage metrics to be tracked accurately.
+
+
+
+Finally, we stress that researchers should get in the habit of citing the data that they use (even if it's their own data!) in each publication that uses that data. This is important for correct attribution, for the provenance of your work, and ultimately for transparency in the scientific process.
+
+## Provenance and Computational Workflow
+
+::: column-margin
+![](images/comp-repro.png)
+:::
+
+While the [Knowledge Network for Biocomplexity](https://knb.ecoinformatics.org) and similar repositories do focus on preserving data, we set our sights much more broadly on preserving entire computational workflows that are instrumental to advances in science. A computational workflow represents the sequence of computational tasks that are performed from raw data acquisition through data quality control, integration, analysis, modeling, and visualization.
+
+
+For example, a data acquisition and cleaning workflow often creates a derived and integrated data product that is then picked up and used by multiple downstream analytical workflows that produce specific scientific findings. These workflows can each be archived as distinct data packages, with the output of the first workflow becoming the input of the second and subsequent workflows.
+
+![](images/comp-workflow-2.png)
+
+
+Adding provenance within your work makes it more reproducible and compliant with the [FAIR](https://www.go-fair.org/fair-principles/) principles. It is also useful for building on the work of others; you can produce similar visualizations for another location, for example, using the same code.
+
+Tools like Quarto can serve as provenance tools as well - by starting with the raw data and cleaning it programmatically, rather than manually, you preserve the steps that you went through and your workflow is reproducible.
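As a small illustration of that idea (the file paths and column names here are hypothetical), keeping the cleaning step in a script rather than editing files by hand preserves the path from raw file to derived product:

```{r}
#| eval: false
library(readr)
library(dplyr)

# The raw file is never edited by hand; it stays exactly as it was acquired
raw <- read_csv("data/raw/stream_temperature_raw.csv")

# Every cleaning decision is recorded as a step in the script
clean <- raw |>
  filter(!is.na(temp_c)) |>        # drop records missing the measurement
  mutate(date = as.Date(date)) |>  # standardize the date column
  distinct()                       # remove accidental duplicate rows

# The derived product can be archived as its own data package, with this
# script included as the record of how it was produced
write_csv(clean, "data/derived/stream_temperature_clean.csv")
```

Each archived package can then point to the script, and to the upstream raw data package, as its provenance.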
+
+
+## Preserving your data
+
+
+::: column-margin
+![](images/WhyManage.png)
+:::
+
+
+### Data repositories: built for data (and code)
+
+- GitHub is not an archival location
+- Dedicated data repositories: KNB, Arctic Data Center, Zenodo,
+  FigShare
+  - Rich metadata
+  - Archival in their mission
+- Data papers, e.g., Scientific Data
+- List of data repositories: http://re3data.org
+
+![](images/RepoLogos.png)
+
+DataONE is a federation of dozens of data repositories that work
+together to make their systems interoperable and to provide a single
+unified search system that spans the repositories. DataONE aims to make
+it simpler for researchers to publish data to one of its member
+repositories, and then to discover and download that data for reuse in
+synthetic analyses.
+
+DataONE can be searched on the web (https://search.dataone.org/), which
+effectively allows a single search to find data from the dozens of
+members of DataONE, rather than visiting each of the currently 43
+repositories one at a time.
+
+![](images/DataONECNs.png)
+
+
 
 ### Structure of a data package
@@ -170,29 +233,24 @@ other files gets an internal identifier, often a UUID that is globally
 unique. In the example above, the package can be cited with the DOI
 `doi:10.5063/F1F18WN4`.
 
-### DataONE Federation
-DataONE is a federation of dozens of data repositories that work
-together to make their systems interoperable and to provide a single
-unified search system that spans the repositories. DataONE aims to make
-it simpler for researchers to publish data to one of its member
-repositories, and then to discover and download that data for reuse in
-synthetic analyses.
-
-DataONE can be searched on the web (https://search.dataone.org/), which
-effectively allows a single search to find data form the dozens of
-members of DataONE, rather than visiting each of the currently 43
-repositories one at a time.
-
-![](images/DataONECNs.png)
 
 ## Publishing data from the web
 
 Each data repository tends to have its own mechanism for submitting data
 and providing metadata. With repositories like the KNB Data Repository
-and the Arctic Data Center, we provide some easy to use web forms for
-editing and submitting a data package. Let's walk through a web
-submission to see what you might expect.
+and the EDI, they provide some easy to use web forms for
+editing and submitting a data package.
+
+
+## Publishing Data to EDI
+
+Short explanations and DEMO
+
+## Publishing Data to KNB
+
+Short Explanation and Practice
+
 
 ::: callout-tip
 ## Setup
@@ -257,7 +315,7 @@ your dataset. It should be descriptive but succinct, lack acronyms, and
 include some indication of the temporal and geospatial coverage of the
 data.
 
-The abstract should be sufficently descriptive for a general scientific
+The abstract should be sufficiently descriptive for a general scientific
 audience to understand your dataset at a high level. 
It should provide an overview of the scientific context/project/hypotheses, how this data package fits into the larger context, a synopsis of the experimental or From 5c8e243052a8523a600e875f51bb0a78f1ac6a21 Mon Sep 17 00:00:00 2001 From: camilavargasp Date: Thu, 19 Oct 2023 15:28:28 -0700 Subject: [PATCH 2/2] fixing header format in parallel processing and publishing to the web --- materials/sections/parallel-computing-in-r.qmd | 18 +++++++++--------- materials/sections/publishing-data-knb-edi.qmd | 12 ++++++------ materials/session_09.qmd | 1 - materials/session_11.qmd | 2 ++ 4 files changed, 17 insertions(+), 16 deletions(-) diff --git a/materials/sections/parallel-computing-in-r.qmd b/materials/sections/parallel-computing-in-r.qmd index b0834d18..647f715b 100644 --- a/materials/sections/parallel-computing-in-r.qmd +++ b/materials/sections/parallel-computing-in-r.qmd @@ -1,5 +1,5 @@ -### Learning Outcomes +## Learning Objectives {.unnumbered} - Understand what parallel computing is and when it may be useful - Understand how parallelism can work @@ -7,7 +7,7 @@ - Understand and use the `parallel` package multicore functions - Understand and use the `foreach` package functions -### Introduction +## Introduction Processing large amounts of data with complex models can be time consuming. New types of sensing means the scale of data collection today is massive. And modeled outputs can be large as well. For example, here's a 2 TB (that's Terabyte) set of modeled output data from [Ofir Levy et al. 2016](https://doi.org/10.5063/F1Z899CZ) that models 15 environmental variables at hourly time scales for hundreds of years across a regular grid spanning a good chunk of North America: @@ -20,7 +20,7 @@ Alternatively, think of remote sensing data. Processing airborne hyperspectral ![NEON Data Cube](images/DataCube.png) -### Why parallelism? +## Why parallelism? Much R code runs fast and fine on a single processor. But at times, computations can be: @@ -32,7 +32,7 @@ can be: To help with **cpu-bound** computations, one can take advantage of modern processor architectures that provide multiple cores on a single processor, and thereby enable multiple computations to take place at the same time. In addition, some machines ship with multiple processors, allowing large computations to occur across the entire cluster of those computers. Plus, these machines also have large amounts of memory to avoid **memory-bound** computing jobs. -### Processors (CPUs) and Cores +## Processors (CPUs) and Cores A modern CPU (Central Processing Unit) is at the heart of every computer. While traditional computers had a single CPU, modern computers can ship with mutliple @@ -85,7 +85,7 @@ However, maybe one of these NSF-sponsored high performance computing clusters (H Note that these clusters have multiple nodes (hosts), and each host has multiple cores. So this is really multiple computers clustered together to act in a coordinated fashion, but each node runs its own copy of the operating system, and is in many ways independent of the other nodes in the cluster. One way to use such a cluster would be to use just one of the nodes, and use a multi-core approach to parallelization to use all of the cores on that single machine. But to truly make use of the whole cluster, one must use parallelization tools that let us spread out our computations across multiple host nodes in the cluster. -### When to parallelize +## When to parallelize It's not as simple as it may seem. 
While in theory each added processor would linearly increase the throughput of a computation, there is overhead that reduces that efficiency. For example, the code and, importantly, the data need to be copied to each additional CPU, and this takes time and bandwidth. Plus, new processes and/or threads need to be created by the operating system, which also takes time. This overhead reduces the efficiency enough that realistic performance gains are much less than theoretical, and usually do not scale linearly as a function of processing power. For example, if the time that a computation takes is short, then the overhead of setting up these additional resources may actually overwhelm any advantages of the additional processing power, and the computation could potentially take longer! @@ -116,7 +116,7 @@ ggplot(cpu_perf, aes(cpus, speedup, color=prop)) + So, its important to evaluate the computational efficiency of requests, and work to ensure that additional compute resources brought to bear will pay off in terms of increased work being done. With that, let's do some parallel computing... -### Loops and repetitive tasks using lapply +## Loops and repetitive tasks using lapply When you have a list of repetitive tasks, you may be able to speed it up by adding more computing power. If each task is completely independent of the others, then it is a prime candidate for executing those tasks in parallel, each on its own core. For example, let's build a simple loop that uses sample with replacement to do a bootstrap analysis. In this case, we select `Sepal.Length` and `Species` from the `iris` dataset, subset it to 100 observations, and then iterate across 10,000 trials, each time resampling the observations with replacement. We then run a logistic regression fitting species as a function of length, and record the coefficients for each trial to be returned. @@ -153,7 +153,7 @@ system.time({ }) ``` -### Approaches to parallelization +## Approaches to parallelization When parallelizing jobs, one can: - Use the multiple cores on a local computer through `mclapply` @@ -268,11 +268,11 @@ stopImplicitCluster() ``` -### Summary +## Summary In this lesson, we showed examples of computing tasks that are likely limited by the number of CPU cores that can be applied, and we reviewed the architecture of computers to understand the relationship between CPU processors and cores. Next, we reviewed the way in which traditional `for` loops in R can be rewritten as functions that are applied to a list serially using `lapply`, and then how the `parallel` package `mclapply` function can be substituted in order to utilize multiple cores on the local computer to speed up computations. Finally, we installed and reviewed the use of the `foreach` package with the `%dopar` operator to accomplish a similar parallelization using multiple cores. 
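To make that summary concrete, here is a compact sketch of the same toy job written all three ways; the function is a stand-in for real work, the core count is illustrative, and `mclapply()` forking is not available on Windows:

```{r}
library(parallel)
library(foreach)
library(doParallel)

slow_square <- function(i) {  # stand-in for an expensive computation
  Sys.sleep(0.05)
  i^2
}

# 1) Serial baseline with lapply
res_serial <- lapply(1:20, slow_square)

# 2) Multicore drop-in replacement from the parallel package
#    (on Windows, set mc.cores = 1, since forking is unavailable)
res_mc <- mclapply(1:20, slow_square, mc.cores = 4)

# 3) foreach with a registered parallel backend
registerDoParallel(4)
res_foreach <- foreach(i = 1:20) %dopar% slow_square(i)
stopImplicitCluster()
```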
-### Readings and tutorials +## Readings and tutorials - [Multicore Data Science with R and Python](https://blog.dominodatalab.com/multicore-data-science-r-python/) - [Beyond Single-Core R](https://ljdursi.github.io/beyond-single-core-R/#/) by Jonoathan Dursi (also see [GitHub repo for slide source](https://github.com/ljdursi/beyond-single-core-R)) diff --git a/materials/sections/publishing-data-knb-edi.qmd b/materials/sections/publishing-data-knb-edi.qmd index fe480a0c..2c7d1290 100644 --- a/materials/sections/publishing-data-knb-edi.qmd +++ b/materials/sections/publishing-data-knb-edi.qmd @@ -1,4 +1,4 @@ -## Learning Objectives {.unnmbered} +## Learning Objectives {.unnumbered} - Overview best practices for organizing data for publication\ - Review what science metadata is and how it can be used @@ -157,9 +157,8 @@ Finally, stressed that researchers should get in the habit of citing the data th ## Provenance and Computational Workflow -::: column-margin ![](images/comp-repro.png) -::: + While the [Knowledge Network for Biocomplexity](https://knb.ecoinformatiocs.org), and similar repositories do focus on preserving data, we really set our sights much more broadly on preserving entire computational workflows that are instrumental to advances in science. A computational workflow represents the sequence of computational tasks that are performed from raw data acquisition through data quality control, integration, analysis, modeling, and visualization. @@ -243,11 +242,12 @@ and the EDI, they provide some easy to use web forms for editing and submitting a data package. -## Publishing Data to EDI +### Publishing Data to EDI Short explanations and DEMO -## Publishing Data to KNB +### Publishing Data to KNB + Short Explanation and Practice @@ -268,7 +268,7 @@ folder. ![](images/hatfield-knb-01.png) ::: -## Login via ORCID +#### Login via ORCID We will walk through web submission on https://demo.nceas.ucsb.edu, and start by logging in with an ORCID account. ORCID provides a common diff --git a/materials/session_09.qmd b/materials/session_09.qmd index 00dcf041..9225ce2a 100644 --- a/materials/session_09.qmd +++ b/materials/session_09.qmd @@ -6,4 +6,3 @@ title-block-banner: true {{< include /sections/parallel-computing-in-r.qmd >}} - diff --git a/materials/session_11.qmd b/materials/session_11.qmd index 02388845..4c213148 100644 --- a/materials/session_11.qmd +++ b/materials/session_11.qmd @@ -4,6 +4,8 @@ title-block-banner: true --- + + {{< include /sections/publishing-data-knb-edi.qmd >}}