diff --git a/vignettes/dataset-investigation.qmd b/vignettes/dataset-investigation.qmd
new file mode 100644
index 0000000..f05286a
--- /dev/null
+++ b/vignettes/dataset-investigation.qmd
@@ -0,0 +1,79 @@
+---
+title: "Dataset investigation: What to do when you get your data"
+format: html
+editor: visual
+minimal: true
+date: 'Compiled: `r format(Sys.Date(), "%B %d, %Y")`'
+vignette: >
+  %\VignetteIndexEntry{Dataset investigation: What to do when you get your data}
+  %\VignetteEngine{quarto::html}
+  %\VignetteEncoding{UTF-8}
+---
+
+## Under Construction
+
+```{r setup, include=FALSE}
+library(knitr)
+library(quarto)
+knitr::opts_knit$set(root.dir = './')
+```
+
+# Introduction
+
+So, you (or your amazing lab mate) have finally finished the data acquisition,
+and now you have a dataset in hand. What's next? Unfortunately, the work isn't
+over yet. Before diving into any analysis, it's crucial to understand the
+dataset itself. This is the first step in any data analysis workflow, ensuring
+that the data is of good quality and is well-prepared for preprocessing and any
+downstream analysis you plan to perform.
+
+In this vignette, we present the dataset used throughout the different vignettes
+of this website. It's far from a *perfect* dataset, which actually mirrors the
+reality of most datasets you'll encounter in research.
+
+Some issues will be specific to the dataset described here. However, the
+purpose of this vignette is to encourage you to think critically about your data
+and guide you through steps that can help you avoid spending hours on an
+analysis, only to realize later that some samples or features should have been
+removed or flagged earlier on.
+
+# Dataset Description
+
+In this workflow, two datasets are used:
+
+1. An LC-MS-based (MS1 level only) untargeted metabolomics dataset to quantify
+   small polar metabolites in human plasma samples.
+2. An additional LC-MS/MS dataset of selected samples from the same study for
+   the identification and annotation of significant features.
+
+The samples were randomly selected from a larger study aimed at identifying
+metabolites with varying abundances between individuals suffering from
+cardiovascular disease (CVD) and healthy controls (CTR). The subset analyzed
+here includes data for three CVD patients, three CTR individuals, and four
+quality control (QC) samples. The QC samples, representing a pooled serum sample
+from a large cohort, were measured repeatedly throughout the experiment to
+monitor signal stability.
+
+The data and metadata for this workflow are available on the MetaboLights
+database under the ID: MTBLS8735.
+
+The detailed materials and methods used for the sample analysis are also
+available in the MetaboLights entry. This is particularly important for
+understanding the analysis and the parameters used. It should be noted that the
+samples were analyzed using ultra-high-performance liquid chromatography (UHPLC)
+coupled to a Q-TOF mass spectrometer (TripleTOF 5600+), and chromatographic
+separation was achieved using hydrophilic interaction liquid chromatography
+(HILIC).
+
+- Consider moving visualizations from the end-to-end vignette to here for a
+  clearer understanding of the dataset.
+- Provide more in-depth visualizations to explore and understand the dataset
+  quality.
+- Compare the pooled-sample LC-MS and LC-MS/MS runs and show that we get better
+  separation in the second run.
+
+```{r}
+getwd()
+list.files()
+list.dirs()
+```