LSPAN24-practical1.qmd

---
title: "Handling different data types, quality checking and preparing data files"
author: "Barbara Lazzari, Matilde Passamonti, Paolo Cozzi, Stefano Capomaccio"
css: styles.css
format:
  revealjs:
    logo: "img/Logo UC orizzontale POSITIVO 540C.png"
    slide-number: c/t
    chalkboard: true
    footer: "Livestock pangenomes 2024 - Practical 1 - 2024/07/22"
---

## Practical introduction

To follow this and other practicals, a working *NIX environment is required.
We suggest to use a Linux distribution, MacOS or Windows with
[WSL2](https://learn.microsoft.com/en-us/windows/wsl/about).

If you don't have a working environment, you can use [GitPod](https://www.gitpod.io/),
a cloud-based IDE that allows you to work on your projects from anywhere. We suggest:

*   A [GitHub](https://github.com/) account to manage authentication.
*   A [Linkedin](https://www.linkedin.com/) account to get access to 50 hours of
    free usage per month.

::: notes
Introducing GitPod: is not mandatory for the practicals, but it is a good way to
work on the practicals without the need to install anything on your local machine.
This can be useful if you have a slow internet connection, or if you are not allowed
to install software on your machine.
In the GitPod environment, we manage the software installation and the environment
but you can install additional software if needed.
:::

## Introduction to GitPod

Go to <https://www.gitpod.io/> and select the **Try for free** button

![](img/gitpod_home.png){fig-align="center" .r-stretch}

## Link your GitHub account to GitPod (recommended)

Add your *linkedin* account to get 50 hours of free usage per month

![](img/gitpod_login.png){fig-align="center" .r-stretch}

::: notes
Select your preferred editor (VS code is recommended),
theme, and profile details, click continue and your
account will be created and ready to use.
:::

## GitPod pricing plans

![](img/gitpod_pricing.png){fig-align="center" .r-stretch}

::: notes
GitPod has a free tier that allows you to use the service for 50 hours per month.
You don't need to provide any payment details to follow these practicals.
:::

See your resource usage at <https://gitpod.io/billing>

## GitPod Workspaces

<https://gitpod.io/workspaces>

![](img/gitpod_workspace.png){fig-align="center" .r-stretch}

::: notes
A workspace is a cloud-based development environment that allows you to work on your
projects from anywhere. You can create a workspace from a GitHub repository, or you
can create an empty workspace and clone a repository from the terminal.
Here you can see a list of your active workspaces, the status, the time of the last
activity, and the instance size (if any).
:::

## Create a workspace

![](img/gitpod_create_workspace.png){fig-align="center" .r-stretch}

Click [here](https://gitpod.io/#https://github.com/bunop/LSPAN24-practical1)
to create a new workspace for this project!

::: notes
when creating a workspace, you can select a repository from GitHub, the default
editor and the instance size. If you use the *Large* instance, you will consume
your free credits **faster**
:::

## VS Code in GitPod

![](img/gitpod_vscode.png){fig-align="center" .r-stretch}

Take a look at VS Code tutorial, for example [this one](https://www.youtube.com/watch?v=B-s71n0dHUk)

::: notes
Please note the 4 different sections of the VS Code interface: at the bottom is
the terminal, on the left is the file explorer, in the middle is the editor, and
on the left-up corner is the hamburger menu that allows you to access the settings
and the extensions.
:::

## Turning off Workspaces

![](img/gitpod_stopworkspace.1.png){.absolute top=100 left=0 width="auto" height="500"}
![](img/gitpod_stopworkspace.2.png){.absolute bottom=0 right=0 width="700" height="auto"}

::: notes
Remember to stop your workspace when you are not using it, to save your free credits.
You can stop your current workspace by clicking on the hamburger menu from VScode
or from the hamburger menu in the workspace list
:::

## GitPod CLI

You can install the [GitPod CLI ](https://www.gitpod.io/docs/references/gitpod-cli)
to manage your workspaces from the terminal

```{.bash}
# login to GitPod
gitpod login

# list available workspaces
gitpod workspace list

# start a workspace
gitpod workspace start <workspace ID>

# open a linux terminal into workspace
gitpod workspace ssh <workspace ID>
```


## Workspaces lifecycle

::: incremental
*   Workspaces are automatically stopped after 30 minutes of inactivity or after
    8 hours (using a free plan)
    +   Only files and directories inside `/workspace` directory are preserved.
*   Workspaces are deleted after **14 days**
    +   remember to save your work
*   Please see
    [Workspace Lifecycle](https://www.gitpod.io/docs/configure/workspaces/workspace-lifecycle)
    for more information
:::

## Using the Command Line

For scientific computing, a *command-line interface* (CLI) is often essential.
This means typing out commands instead of using a *graphical user interface* (GUI).

::: incremental
*   Numerous tools are exclusively available as command-line utilities.
*   Once familiar, the CLI can be faster and more precise.
*   Tasks can be easily automated through scripting.
*   *High-performance computing* (HPC) environments often rely solely on CLI access.
:::

## Filesystem Hierarchy {.smaller}

::: columns

::: {.column width="50%"}
The filesystem is organized in a tree-like structure, with the root directory
`/` at the top.

::: incremental
*   `/` is the root directory.
*   `/home` contains user directories.
*   `/usr` contains user programs.
*   `/bin` contains essential binaries.
*   `/etc` contains system configuration files.
*   `/var` contains variable data.
*   `/tmp` contains temporary files.
*   `/workspace` specific to GitPod.
:::

:::

::: {.column width="50%" .absolute top=200 right=0}
![Dhanusha Dhananjaya, [CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0), via Wikimedia Commons](https://upload.wikimedia.org/wikipedia/commons/4/4b/Linux_file_system_foto_no_exif_%281%29.jpg){fig-width=100% fig-cap-location=bottom}
:::

:::

## Absolute vs. Relative Paths {.smaller}

::: incremental
*   **Absolute paths** start from the root directory `/`.
    +   e.g., `/`, `/home/user/file.txt`, `/home/user/data/`.
*   **Relative paths** start from the current directory.
    +   e.g., `file.txt`, `user/data/file.txt`.
    +   `.` refers to the current directory (e.g., `./file.txt`)
    +   `..` refers to the parent directory (e.g., `../file.txt`).
        Can be chained multiple times (e.g., `../../file.txt`)
:::

. . .

Consider this example:

```{.txt}
├── dir1
│   ├── dir2
│   │   └── file2
│   └── file1
└── dir3
```

`../file1` is relative to `file2` current position (`dir2`)

`../../dir3` is relative to `file2` current position (`dir2`)

## Navigate the Filesystem {.smaller}

Here are some basic commands to navigate the filesystem: each command can
accept a **path** and **additional option(s)** as an argument. The general
rule is:

`command [option(s)] [path]`

::: incremental
*   `pwd`: Print current working directory (absolute path).
*   `ls`: List files and directories.
*   `cd`: Change directory.
*   `mkdir`: Make directory.
*   `rmdir`: Remove directory (safer).
*   `touch`: Create an empty file - set the timestamp of a file to the current time.
*   `rm`: Remove files.
*   `mv`: Move or rename files.
*   `cp`: Copy files.
:::

## `ls` useful options {.smaller}

`ls` shows files and directories in the current directory. You can provide a
**path** to list files in a different directory. Here are some useful options:

*   `ls -l`: List files with details.
*   `ls -a`: List all files, including hidden ones.
*   `ls -lh`: List files with human-readable sizes.
*   `ls -t`: List files sorted by modification time.
*   `ls -S`: List files sorted by size.
*   `ls -R`: List files recursively.
*   `ls -r`: List sorted files in reverse order.
*   `ls -1`: List files in a single column.

You can combine options, e.g., `ls -lhSr /home/user/data` to list files
in `/home/user/data` folder with *human-readable sizes*
ordered by *size* in *ascending* order (bigger files on *bottom*).

## Useful options for other commands {.smaller}

*   Without a path:
    +   `cd`: (without any arguments) change to **your home** directory
    +   `cd -`: change to the previous directory
*   With a path (*relative or absolute*)
    +   `mkdir -p`: create a directory with its parents if they do not exist
    +   `rm -r`: remove directories and their contents recursively (**use with caution**)
    +   `rm -i`: prompt before removing files
    +   `mv -i`: prompt before overwriting files
    +   `cp -r`: copy directories and their contents recursively
    +   `cp -i`: prompt before overwriting files

. . .

Can you guess the difference between `rm -r` and `rmdir`?

## Navigate the Filesystem (exercise) {.smaller}

1.  Open a terminal (if not yet open)
2.  Print the current working directory
3.  List the files and directories
4.  Change to your *home* directory
5.  Create a new directory called `mydir`
6.  Change to the `mydir` directory
7.  Create a new file called `myfile.txt`
8.  List the files and directories
9.  Move `myfile.txt` to the *home* directory
10.  Remove the `mydir` directory
11.  List the files and directories

## Wildcards {.smaller}

Wildcards are characters that help match file names based on patterns. Ex:

::: incremental
-   `*`: Matches any number of characters (`file*` matches `file1`, `file2`, `fileverylong` etc.)
-   `?`: Matches a single character (`file?` matches `file1`, `file2`, but not `fileverylong`)
-   `[ ]`: Matches any character within the brackets (`file[12]` matches `file1`, `file2`, but
    not `file3` or `fileverylong`. `file[1-9]` to match file with any digit)
-   `{ }`: Matches any of the comma-separated word (`file{1,2,verylong}`
    matches `file1`, `file2` and `fileverylong`)
:::

## Special characters {.smaller}

Some characters have special meanings in the shell:

::: incremental
*   `~`: expands to the home directory (`cd ~`, `cd ~/data`)
*   `$`: refers to an environment variable (`echo $HOME`, `echo $PWD`)
*   `;`: separates commands in one line (execute commands with this order e.g.
    `cd /tmp; ls` first change to `/tmp` then list files)
*   `\`: escapes the next character (`ls file\ with\ spaces.txt`)
*   `'`: preserves the literal value of all characters enclosed
    (`echo 'Today is $(date)'` will print `Today is $(date)`)
*   `"`: preserves the literal value of all characters enclosed, but
    allow for variable expansion, command substitution, and escape sequences
    (`echo "Today is $(date)"` will print `Today is <current date>`)
*   `#`: comments the rest of the line (`# this is a comment`). Not executed.
:::

## File Permissions {.smaller}

Each file has three types of permissions: **read**, **write**, and **execute**.
These permissions are set for three types of users: **owner**, **group**, and **others**.

::: incremental
*   `r`: read permission
*   `w`: write permission
*   `x`: execute permission
*   `-`: no permission
:::

. . .

You can inspect permissions using `ls -l`, near the file name. For example, `rwxr-xr--` means:

*   Owner has read, write, and execute permissions.
*   Group has read and execute permissions.
*   Others have only read permission.

## Change File Permissions (UGO) {.smaller}

You can change file permissions using `chmod` command. The general syntax is:

`chmod [options] mode file`

where `mode` can be:

::: incremental
*   `u`: user (owner)
*   `g`: group
*   `o`: others
*   `a`: all (u, g, o)
*   `+`: add permission
*   `-`: remove permission
*   `=`: set permission
:::

. . .

For example, to give execute permission to the owner of a file:

`chmod u+x file`

## Stdin, Stdout, Stderr {.smaller}

Every process in Unix has three standard streams:

::: incremental
*   **Standard Input (stdin)**: Input from the keyboard or another process.
*   **Standard Output (stdout)**: Output to the terminal or another process.
*   **Standard Error (stderr)**: Error messages to the terminal or another process.
:::

. . .

By default, stdin is the keyboard, and stdout and stderr are the terminal. You can redirect these streams:

::: incremental
*   `>`: Redirect stdout to a file (`ls > files.txt`)
*   `>>`: Append stdout to a file (`ls >> files.txt`)
*   `2>`: Redirect stderr to a file (`ls non_existent_file 2> errors.txt`)
*   `&>`: Redirect both stdout and stderr to a file (`ls non_existent_file &> output.txt`).
    Its equivalent to `> output.txt 2>&1`
*   `<`: Redirect stdin from a file (`wc -l < files.txt`)
:::

## Pipes {.smaller}

Pipes (`|`) connect the stdout of one command to the stdin of another. For example:

`ls -l | wc -l`

This command lists files in the current directory and counts the number of lines in the output.

. . .

You can chain multiple commands using pipes:

`ls -l | grep myfile | wc -l`

This command lists files in the current directory, filters lines containing `myfile`, and counts the number of lines.

## Environment Variables {.smaller}

Environment variables are key-value pairs that store information about the environment. Some common environment variables:

::: incremental
*   `HOME`: Home directory
*   `PATH`: List of directories to search for executable files
*   `PWD`: Present working directory
*   `OLDPWD`: Previous working directory
*   `USER`: Current user
*   `SHELL`: Current shell
:::

. . .

You can access environment variables using `$`, for example `echo $HOME` prints
the home directory (to *stdout*).

You can use environment variables in scripts or commands, for example:
`cd $HOME` or `cp $OLDPWD/file.txt .`

## Aliases {.smaller}

Aliases are shortcuts for commands. You can define aliases in the shell configuration file (e.g. `~/.bashrc`). For example:

```bash
alias ll='ls -l'
```

This command creates an alias `ll` for `ls -l`. You can use `ll` instead of `ls -l`.

. . .

You can list all aliases using `alias` command. To remove an alias, use `unalias` command:

```bash
unalias ll
```
### Exercise

* How many *aliases* are defined in your terminal?
* How many commands do you know?

## Access file content {.smaller}

You can access the content of a file using `cat`, `less`, `more`, or `head` and `tail` commands.

::: incremental
*   `cat`: Concatenate and display file content
*   `less`: Display file content page by page
*   `more`: Display file content page by page
*   `head`: Display the first lines of a file
*   `tail`: Display the last lines of a file
:::

. . .

For example, to display the first 10 lines of a file:

`head file.txt`

## Search file content {.smaller}

You can search for text in files using `grep` command. The general syntax is:

`grep [options] pattern file`

where `pattern` is the text to search for. For example:

`grep 'pattern' file.txt`

This command searches for `pattern` in `file.txt` and prints matching lines.

. . .

You can use regular expressions in `grep` to search for more complex patterns. For example:

`grep -E 'pattern1|pattern2' file.txt`

This command searches for `pattern1` or `pattern2` in `file.txt`.

## Find files {.smaller}

You can find files in the filesystem using `find` command. The general syntax is:

`find [path] [options]`

where `path` is the directory to search in. For example:

`find /tmp -iname '*.txt'`

This command searches for files with `.txt` extension in `/tmp` directory. More options:

::: incremental
*   `-name`: Search by exact name
*   `-type`: Search by file type (e.g. `f` for file, `d` for directory)
*   `-size`: Search by file size (`+` for larger, `-` for smaller. e.g. `+1M`)
*   `-exec`: Execute a command on found files (requires `{}` as a placeholder and `\;` at the end of the command)
:::

## Utilities {.smaller}

There are many utilities available in Unix-like systems. Some common utilities:

::: incremental
*   `awk`: A powerful text processing tool
*   `sed`: A stream editor for filtering and transforming text
*   `cut`: Extract columns from each line of files
*   `sort`: Sort lines of text files
*   `uniq`: Report or omit repeated lines
*   `wc`: Print newline, word, and byte counts for each file
*   `diff`: Compare files line by line
*   `file`: Determine file type
*   `du`: Estimate file space usage
*   `man`: Display manual pages
*   `which`: Locate a command
:::

## Utilities exercise {.smaller}

*   List current directory contents using `ls` command and redirect the output to `files.txt`.
*   Count the number of files in the current directory using `ls` and `wc` commands.
*   Find all files with `.txt` extension in the current directory using `find` command.
*   Display the first 5 lines of `files.txt` using `head` command.
*   Rename `files.txt` to `all_files.txt` using `mv` command.
*   `ls` a non-existent file and redirect the error to `errors.txt`.
*   Concatenate `all_files.txt` and `errors.txt` using `cat` command and redirect
    the output to `all_files_errors.txt`.
*   Display manual pages for `bash` (`man bash`). Search for `Commands for Moving` section:
    How I can move to the beginning of the line? How I can move to the next word?
*   GitPod users: locate file `LSPAN24-practical1.qmd`, `grep` for `##` characters (mind `#`
    is a comment character), then order titles alphabetically using `sort`

## Curl vs Wget {.smaller}

`curl` and `wget` are command-line tools for transferring data with URLs. Some differences:

::: incremental
*   `curl`: Supports multiple protocols (HTTP, HTTPS, FTP, etc.), more flexible, but less user-friendly.
*   `wget`: Supports HTTP and FTP, more user-friendly, but less flexible.
*   `curl` is more suitable for scripting and automation.
:::

. . .

You can use `curl` and `wget` to download files from the web. For example:

`curl -O https://example.com/file.txt`

This command downloads `file.txt` from `https://example.com` to the current directory.

## Compression {.smaller}

You can compress and decompress files using `gzip`, `bzip2`, and `xz` commands. For example:

`gzip file.txt`

This command compresses `file.txt` to `file.txt.gz`. To decompress:

`gunzip file.txt.gz` (or `gzip -d file.txt.gz`)

. . .

You can use `bzip2` and `xz` commands similarly. For example to compress:

`bzip2 file.txt`

`xz file.txt`

To decompress:

`bunzip2 file.txt.bz2` (or `bzip2 -d file.txt.bz2`)

`unxz file.txt.xz` (or `xz -d file.txt.xz`)

## Tar {.smaller}

You can create (`-c`) and extract (`-x`) tar archives using `tar` command. For example:

`tar -cvf archive.tar file1 file2`

This command creates `archive.tar` containing `file1` and `file2`.
**This archive is not compressed** and have the same size of the sum of
`file1` and `file2` sizes. To extract:

`tar -xvf archive.tar`

. . .

You can compress and extract compressed tar archives in one step. For example:

`tar -czvf archive.tar.gz file1 file2`

This command creates `archive.tar.gz` containing `file1` and `file2`. To extract:

`tar -xzvf archive.tar.gz`

Mind to the `z` option for gzip, `j` for bzip2, and `J` for xz. `f` need to be
followed by the archive name.

## Tar (remember) {.smaller}

![Credits [programmerhumor.io](https://programmerhumor.io/programming-memes/for-the-programmers-who-use-tar/)](https://programmerhumor.io/wp-content/uploads/2023/11/programmerhumor-io-programming-memes-4afd15c9908ddc5-758x809.jpg){fig-align="center" .r-stretch fig-cap-location=bottom}

## Tar (remember /2) {.smaller}

![Credits [xkcd.com](https://xkcd.com/1168/)](https://imgs.xkcd.com/comics/tar_2x.png){fig-align="center" .r-stretch fig-cap-location=bottom}

## Editors

There are many text editors available in Unix-like systems.
Here some common editors available in terminal:

::: incremental
*   `nano`: Simple and user-friendly text editor
*   `vim`: Powerful and highly configurable text editor
*   `emacs`: Extensible and customizable text editor
:::

. . .

::: {.callout-tip}
GitPod users have access to [Visual Studio Code](https://code.visualstudio.com/),
a powerful and highly configurable text editor.
:::

## Introduction to virtual environment

Virtual environments are isolated environments for software development.
They allow you to install dependencies and packages without affecting the system-wide installation.
Some common tools for creating virtual environments:

-   Python: [virtualenv](https://virtualenv.pypa.io/en/latest/),
    [pyenv](https://github.com/pyenv/pyenv), [poetry](https://python-poetry.org/)
-   R: [packrat](https://rstudio.github.io/packrat/), [renv](https://rstudio.github.io/renv/articles/renv.html)
-   Perl: [local::lib](https://metacpan.org/pod/local::lib), [perlbrew](https://perlbrew.pl/)
-   Ruby: [rvm](https://rvm.io/)

::: notes
something on reproducibility and the need for virtual environments.
:::

## Introduction to Conda

-   [Conda](https://docs.anaconda.com/anaconda/install/): A package manager and environment management system for installing and managing software packages and dependencies, particularly for Python and R.

::: incremental
-   [Miniconda](https://docs.anaconda.com/miniconda/): A minimal installer for Conda that includes Conda, Python, and other essential packages.
-   [Mamba](https://mamba.readthedocs.io/en/latest/installation/mamba-installation.html): A faster, drop-in replacement for Conda. You can swap almost all commands between `conda` & `mamba`
:::

::: notes
conda to install dependencies and softwares
Conda: has navigators, softwares for installing packages using GUI and editors like spyder, jupyter notebook, etc. Suggested on local machines. Miniconda: is a lightweight version of Conda, suggested on servers or where the GUI is not available. Mamba: implemented the package resolvers in C++ and is faster than Conda. However conda has introduced a new package resolver last year. It has some additional features like `repoquery`.
:::

## Create an environment (from scratch)

We use `conda` command to manage environments with `conda` / `miniconda`:

-   replace with `mamba` if you have `mamba` installed.
-   `--name`: required to specify the name of the environment.
-   specifying versions helps the package resolver

``` {.bash code-line-numbers="1-2|4-5|7-8"}
# create an empty environment
conda create --name empty

# create an environment with some software
conda create --name python python ipython

# create an environment with a specific software version
conda create --name python3.10 python=3.10 ipython=8.25
```

## Create an environment (from file)

We can also create an environment from a file. Note the `env` before the `create` command:

-   `--file`: specify the file containing the environment specifications
-   `--name`: override the name of the environment in the file

``` {.bash code-line-numbers="1-2|4-5"}
# create an environment from a file (use name in file)
conda env create --file environment.yml

# create an environment from a file (override name in file)
conda env create --file environment.yml --name myname
```

::: notes
without the `--file` option, the current `environment.yml` file will be used.
:::

## List and activate environments {.smaller}

-   `conda env list`: list all environments
-   `conda activate <env_name>`: activate an environment

Normally the prompt will change to show the active environment. Default installation
have the `base` environment active at login

```bash
# Activate an environment
(base) gitpod ~ $ conda activate ognigenoma
(ognigenoma) gitpod ~ $
```

. . .

To deactivate an environment, use `conda deactivate`:

```bash
# Deactivate the current environment
(ognigenoma) gitpod ~ $ conda deactivate
(base) gitpod ~ $
```

You can activate more than one environment at a time.
The last activated environment will be active.
When you deactivate an environment, the previous environment will be activated.

## Channels and repositories {.smaller}

-   `defaults`: the default channel for conda packages
-   `R`: a channel for R packages
-   `bioconda`: a channel for bioinformatics software
-   `conda-forge`: a community-driven collection of conda packages

``` {.bash}
# add a channel to the list of channels
conda config --add channels conda-forge

# add a channel to the list of channels
conda config --add channels bioconda
```

::: {.callout-tip}
GitPod users: we have set up channels required for the practical with
the suggested *priority* settings. You can see the settings in the
`.condarc` file in your home directory. See more information on
conda [Managing Channels](https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-channels.html)
:::

## Search for packages {.smaller}

Use `conda search` to search for packages in the configured channels.
Wildcards can be used in the search:

``` {.bash code-line-numbers="1-2|4-5"}
# search for a package in the configured channels
conda search samtools

# search for a package in a specific channel
conda search --channel bioconda samtools
```
::: {.callout-tip}
GitPod users: the channels are already configured for the practicals.
You can search any package in the configured channels without specifying
the channel.
:::

. . .

### Exercise

Search for the package `samtools` in the `bioconda` channel:

1. What versions are available? What is the latest version? In which channel?
2. Do the same for `r-base` package.

## Install packages {.smaller}

*   `conda install`: install packages in the active environment
    +   `--name`: specify the environment to install the package
    +   `--file`: specify a file with the list of packages to install

``` {.bash code-line-numbers="1-3|5-6|8-10"}
# install a package in the active environment
# base environment is read-only for GitPod users!
conda install pandas

# install a package in a specific environment
conda install --name python3.10 pandas

# install packages from a file in the active environment
# (python format)
conda install --file requirements.txt
```

### Exercise

Create an environment with `samtools`, `tabix` and `bcftools` packages.
Activate the environment and check if the packages are installed. Then
install `seqkit` package in the same environment.

::: notes
We are not creating an environment, the environment is already
present when I install a package
:::

## Understanding the environment (exercise) {.smaller}

1.    Ensure the environment created in the previous exercise is active. Can
      you find where the `samtools` executable is located?
2.    Deactivate the environment and check if the `samtools` executable is
      available. Why is it not available?
3.    Can you figure out how conda makes the executables available in the
      active environment?

. . .

::: {.callout-tip}
-   Use `which` to find the location of the executable.
-   Take a look to the environment variables in the active environment, for
    example `$PATH`.
:::

## Exporting an environment {.smaller}

-   `conda env export`: export the environment to a file
-   `--name`: specify the environment to export
-   `--file`: specify the file to export the environment to
-   `--no-builds`: exclude the build string from the exported file

``` {.bash code-line-numbers="1-2|4-5"}
# export the active environment to a file (using STDOUT)
conda env export > environment.yml

# export a specific environment to a specific file
conda env export --name python3.10 --file environment.yml
```

. . .

### Exercise

Export an environment to a file; then export the same environment
with `--no-builds` option to another file. Compare the two files
with `diff` (try `diff -y` for a more readable output).

::: notes
does people know STDOUT/STDERR/STDIN?
:::

## Some tips {.smaller}

::: incremental
-   `conda list`: list all packages installed in the active environment
-   Install packages with versions: `conda install pandas=1.3.3`
-   Create environments by scope: don't pull all packages in the same environment
-   Limit the `conda-forge` channel usage if possible
-   `conda list --revision`: list all revisions of the environment
-   `conda install --revision 1`: revert environment to a previous revision
-   `conda env remove --name <env_name>`: remove an environment
-   `conda create --clone <env_name> --name <new_env>`: clone an environment
-   `conda clean --all`: clean the cache and unused packages
:::

## Practical with conda: get genome data {.smaller}

Let's collect some genome data to make an example. We will use the CLI tools
made available by NCBI, `datasets` and `dataformat`, to collect ARS-UCD1.2
data from the NCBI database. Next, we will use `seqkit` to manipulate file
headers and then we will `bgzip` to pack the sequence files.

1.    Create one conda environment with `ncbi-datasets-cli`, `jq`, `seqkit` and `tabix`
      packages

2.    Go the [NCBI Datasets](https://www.ncbi.nlm.nih.gov/datasets/) page
      and search for *cow* (`Bos taurus (cattle)` will be suggested). You
      will open the new page for
      [NCBI Taxonomy ID: `9913`](https://www.ncbi.nlm.nih.gov/datasets/taxonomy/9913/).
3.    Click on [ARS-UCD2.0](https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_002263795.3/)
      link, below the *Genome* section an over the *Download* button (don't
      download the genome from this page)

## Genome assembly and legacy versions {.smaller}

In the practical we will use the ARS-UCD1.2 (`GCF_002263795.1`) version of the genome,
however the latest version is ARS-UCD2.0 (`GCF_002263795.3`):

1.    Go to the bottom of the page and search for the
      [GCF_002263795.1](https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_002263795.1/)
      accession for the ARS-UCD1.2 version (will be in the refseq column)
      and click on it.
2.    In the top of the page, near the *Download* button we have some
      different ways to collect data. We don't want to download the genome
      on on our laptop, so click on the *datasets* tab, just near the download
      button.
3.    A pop-up window will appear with the command to download the genome
      data. Copy the command by clicking on the *Copy* button.

## NCBI Datasets CLI {.smaller}

`datasets` is a command-line tool that is used to query and download biological
sequence data across all domains of life from NCBI databases. See the
[documentation](https://www.ncbi.nlm.nih.gov/datasets/docs/v2/reference-docs/command-line/datasets/)
for more information.

For example, retrieve the same information as before using the accession number:
*pipe* the result to [jq](https://jqlang.github.io/jq/manual/) to format the output:

```{.bash}
datasets summary genome accession GCF_002263795.1 | jq
```

The same command can be use to download the genome data: paste the command you've
copied from the NCBI Taxonomy page and add two additional options:

```{.bash}
datasets download genome accession GCF_002263795.1 \
  --include gff3,rna,cds,protein,genome,seq-report \
  --dehydrated --filename ARS-UCD1.2.zip
```

::: {.callout-note}
The `\` (escape character) is used to break the command in multiple lines:
it is not necessary if you paste the command in one line: but if you paste
the command in multiple lines it prevents the command from being executed
before you finish typing it.
:::

## NCBI Datasets CLI (2) {.smaller}

* `--dehydrated`: download the data in a dehydrated format: the data is
  downloaded in a format that can be rehydrated using the `datasets` tool.
  This is required if you need to download a lot of data. See
  [Download large genome data packages](https://www.ncbi.nlm.nih.gov/datasets/docs/v2/how-tos/genomes/large-download/)
  for more information
* `--filename`: specify the name of the file to download the data to
* `--include`: specify the data types to download: if we are not interested
  in all data types we can exclude some of them

Now unzip the downloaded archive in a new directory (since the archive
will place stuff in the current directory):

```{.bash}
unzip ARS-UCD1.2.zip -d ARS-UCD1.2
```

* `-d`: specify the directory to extract the files to

## Rehydrate the data {.smaller}

Take a look to see the downloaded data: we have not any data! the only
data we have are some metadata and the URL were files can be downloaded.

Is now time to rehydrate data:

```{.bash}
datasets rehydrate --directory ARS-UCD1.2
```

* `--directory`: this is the top level directory in which you have
  decompressed data with `unzip`

Now the data should be downloaded in the same directory where the
metadata is located. You can check the data with `ls` or `tree`:

```{.bash}
tree -phC ARS-UCD1.2
```

* `tree`: list the directory structure in a tree-like format
  + `-p`: show permissions
  + `-h`: show sizes in human readable format
  + `-C`: colorize the output

## Deal with genome sequence {.smaller}

Take a minute to look at the files you have downloaded, especially the
genome sequence: how many sequences are in the file? what is the format
of the file?

We can use grep `'>'` to extract the sequence names from a fasta file
and then counting the number of lines, however
we can use a fasta/fastq manipulation program like [seqkit](https://bioinf.shenwei.me/seqkit/),
which can do this and much more:

```{.bash}
seqkit stats GCF_002263795.1_ARS-UCD1.2_genomic.fna
```

* `seqkit stats`: show statistics for the input file

There are a lot of sequences in the file, you can inspect the sequence
names with

```{.bash}
seqkit seq -n GCF_002263795.1_ARS-UCD1.2_genomic.fna
```

*   `seqkit seq`: transform sequences (extract ID, filter by length,
    remove gaps, reverse complement...)
      + `-n`: only print names/sequence headers
      + `-i`: print IDs instead of full headers

## Sequence report {.smaller}

The `sequence_report.jsonl` file contains information about the sequences in the
genome assembly. We can use `jq` to inspect the file, or we can use the
[dataformat](https://www.ncbi.nlm.nih.gov/datasets/docs/v2/reference-docs/command-line/dataformat/)
command to transform this information in a table:

```{.bash}
dataformat tsv genome-seq --inputfile sequence_report.jsonl
```

* `dataformat tsv`: transform the input data in a tab-separated format
  + `genome-seq`: the type of source data to transform
  + `--inputfile`: the input file to transform
  + `--fields`: specify the fields to include in the output

::: {.callout-note}
The `dataformat` command requires JSON-lines formatted NCBI Datasets reports
into other formats. It can work in a pipe with the `datasets` command, however
it requires the `--as-json-lines` option:

```{.bash}
datasets summary genome accession GCF_002263795.1 --as-json-lines | dataformat tsv genome | less -S
```
:::

## Extract chromosomes from genome {.smaller}

Suppose we need to get rid of *un-assembled sequences* from the genome:
we can use `seqkit` to extract sequences by name.
First, extract the sequences we want *by ids*:

```{.bash}
jq -c 'select(.role == "assembled-molecule")' sequence_report.jsonl \
  | dataformat tsv genome-seq --fields refseq-seq-acc > ids.txt
```

* `-c`: compact output (required by `dataformat` as input)
* `select(.role == "assembled-molecule")`: select only sequences that are
  assembled molecules by their role
* `refseq-seq-acc`: the field to extract from the input data

Now we can use `seqkit` to extract the sequences by ids:

```{.bash}
seqkit grep -f ids.txt GCF_002263795.1_ARS-UCD1.2_genomic.fna > ARS-UCD1.2_chromosomes.fna
```

### Exercise

* Can you create the *sequence ids* file using `awk`, `cut` or `grep` instead of `jq`?

## Rename sequences {.smaller}

Now the sequence names are the RefSeq accession numbers:
Let's rename the sequences to include the chromosome name as *id*: this
could be done again with `seqkit`, with a text file with the
old name as key and the new name as value:

```{.bash}
jq -c 'select(.role == "assembled-molecule")' sequence_report.jsonl \
  | dataformat tsv genome-seq --fields refseq-seq-acc,chr-name > alias.txt
```

This is the same command we use before, but now we are extracting also the
chromosome number. Now we can use `seqkit` to rename the sequences:

```{.bash}
seqkit replace -p '^(\S+)(.+?)$' -r '{kv} \$1\${2}' -k alias.txt \
  -o ARS-UCD1.2_chromosomes_renamed.fna ARS-UCD1.2_chromosomes.fna
```

* `seqkit replace`: replace patterns in sequences
  + `-p`: the pattern to search for
  + `-r`: the replacement pattern
  + `-k`: the key-value file with the replacements
  + `-o`: the output file


## Compress the genome sequence {.smaller}

The last operation can be to compress the genome sequence file:
we can do this with [`bgzip`](https://www.htslib.org/doc/bgzip.html)
in order to have a compressed file that can be indexed with
other software like `samtools`:

```{.bash}
bgzip ARS-UCD1.2_chromosomes_renamed.fna
```

Check that sequence names are in the proper format

```{.bash}
seqkit stats ARS-UCD1.2_chromosomes_renamed.fna.gz
seqkit seq -n ARS-UCD1.2_chromosomes_renamed.fna.gz
```

### Exercise

Can you optimize the previous commands to avoid creating intermediate files?

:::{.callout-tip}
You can use pipes to connect the output of a command to the input of another command.
Ideally, the original file downloaded from NCBI should not be modified (but can
be compressed)
:::