-
Notifications
You must be signed in to change notification settings - Fork 1
/
LSPAN24-practical1.qmd
1101 lines (788 loc) · 36.4 KB
/
LSPAN24-practical1.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
---
title: "Handling different data types, quality checking and preparing data files"
author: "Barbara Lazzari, Matilde Passamonti, Paolo Cozzi, Stefano Capomaccio"
css: styles.css
format:
revealjs:
logo: "img/Logo UC orizzontale POSITIVO 540C.png"
slide-number: c/t
chalkboard: true
footer: "Livestock pangenomes 2024 - Practical 1 - 2024/07/22"
---
## Practical introduction
To follow this and other practicals, a working *NIX environment is required.
We suggest to use a Linux distribution, MacOS or Windows with
[WSL2](https://learn.microsoft.com/en-us/windows/wsl/about).
If you don't have a working environment, you can use [GitPod](https://www.gitpod.io/),
a cloud-based IDE that allows you to work on your projects from anywhere. We suggest:
* A [GitHub](https://github.com/) account to manage authentication.
* A [Linkedin](https://www.linkedin.com/) account to get access to 50 hours of
free usage per month.
::: notes
Introducing GitPod: is not mandatory for the practicals, but it is a good way to
work on the practicals without the need to install anything on your local machine.
This can be useful if you have a slow internet connection, or if you are not allowed
to install software on your machine.
In the GitPod environment, we manage the software installation and the environment
but you can install additional software if needed.
:::
## Introduction to GitPod
Go to <https://www.gitpod.io/> and select the **Try for free** button
![](img/gitpod_home.png){fig-align="center" .r-stretch}
## Link your GitHub account to GitPod (recommended)
Add your *linkedin* account to get 50 hours of free usage per month
![](img/gitpod_login.png){fig-align="center" .r-stretch}
::: notes
Select your preferred editor (VS code is recommended),
theme, and profile details, click continue and your
account will be created and ready to use.
:::
## GitPod pricing plans
![](img/gitpod_pricing.png){fig-align="center" .r-stretch}
::: notes
GitPod has a free tier that allows you to use the service for 50 hours per month.
You don't need to provide any payment details to follow these practicals.
:::
See your resource usage at <https://gitpod.io/billing>
## GitPod Workspaces
<https://gitpod.io/workspaces>
![](img/gitpod_workspace.png){fig-align="center" .r-stretch}
::: notes
A workspace is a cloud-based development environment that allows you to work on your
projects from anywhere. You can create a workspace from a GitHub repository, or you
can create an empty workspace and clone a repository from the terminal.
Here you can see a list of your active workspaces, the status, the time of the last
activity, and the instance size (if any).
:::
## Create a workspace
![](img/gitpod_create_workspace.png){fig-align="center" .r-stretch}
Click [here](https://gitpod.io/#https://github.com/bunop/LSPAN24-practical1)
to create a new workspace for this project!
::: notes
when creating a workspace, you can select a repository from GitHub, the default
editor and the instance size. If you use the *Large* instance, you will consume
your free credits **faster**
:::
## VS Code in GitPod
![](img/gitpod_vscode.png){fig-align="center" .r-stretch}
Take a look at VS Code tutorial, for example [this one](https://www.youtube.com/watch?v=B-s71n0dHUk)
::: notes
Please note the 4 different sections of the VS Code interface: at the bottom is
the terminal, on the left is the file explorer, in the middle is the editor, and
on the left-up corner is the hamburger menu that allows you to access the settings
and the extensions.
:::
## Turning off Workspaces
![](img/gitpod_stopworkspace.1.png){.absolute top=100 left=0 width="auto" height="500"}
![](img/gitpod_stopworkspace.2.png){.absolute bottom=0 right=0 width="700" height="auto"}
::: notes
Remember to stop your workspace when you are not using it, to save your free credits.
You can stop your current workspace by clicking on the hamburger menu from VScode
or from the hamburger menu in the workspace list
:::
## GitPod CLI
You can install the [GitPod CLI ](https://www.gitpod.io/docs/references/gitpod-cli)
to manage your workspaces from the terminal
```{.bash}
# login to GitPod
gitpod login
# list available workspaces
gitpod workspace list
# start a workspace
gitpod workspace start <workspace ID>
# open a linux terminal into workspace
gitpod workspace ssh <workspace ID>
```
## Workspaces lifecycle
::: incremental
* Workspaces are automatically stopped after 30 minutes of inactivity or after
8 hours (using a free plan)
+ Only files and directories inside `/workspace` directory are preserved.
* Workspaces are deleted after **14 days**
+ remember to save your work
* Please see
[Workspace Lifecycle](https://www.gitpod.io/docs/configure/workspaces/workspace-lifecycle)
for more information
:::
## Using the Command Line
For scientific computing, a *command-line interface* (CLI) is often essential.
This means typing out commands instead of using a *graphical user interface* (GUI).
::: incremental
* Numerous tools are exclusively available as command-line utilities.
* Once familiar, the CLI can be faster and more precise.
* Tasks can be easily automated through scripting.
* *High-performance computing* (HPC) environments often rely solely on CLI access.
:::
## Filesystem Hierarchy {.smaller}
::: columns
::: {.column width="50%"}
The filesystem is organized in a tree-like structure, with the root directory
`/` at the top.
::: incremental
* `/` is the root directory.
* `/home` contains user directories.
* `/usr` contains user programs.
* `/bin` contains essential binaries.
* `/etc` contains system configuration files.
* `/var` contains variable data.
* `/tmp` contains temporary files.
* `/workspace` specific to GitPod.
:::
:::
::: {.column width="50%" .absolute top=200 right=0}
![Dhanusha Dhananjaya, [CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0), via Wikimedia Commons](https://upload.wikimedia.org/wikipedia/commons/4/4b/Linux_file_system_foto_no_exif_%281%29.jpg){fig-width=100% fig-cap-location=bottom}
:::
:::
## Absolute vs. Relative Paths {.smaller}
::: incremental
* **Absolute paths** start from the root directory `/`.
+ e.g., `/`, `/home/user/file.txt`, `/home/user/data/`.
* **Relative paths** start from the current directory.
+ e.g., `file.txt`, `user/data/file.txt`.
+ `.` refers to the current directory (e.g., `./file.txt`)
+ `..` refers to the parent directory (e.g., `../file.txt`).
Can be chained multiple times (e.g., `../../file.txt`)
:::
. . .
Consider this example:
```{.txt}
├── dir1
│ ├── dir2
│ │ └── file2
│ └── file1
└── dir3
```
`../file1` is relative to `file2` current position (`dir2`)
`../../dir3` is relative to `file2` current position (`dir2`)
## Navigate the Filesystem {.smaller}
Here are some basic commands to navigate the filesystem: each command can
accept a **path** and **additional option(s)** as an argument. The general
rule is:
`command [option(s)] [path]`
::: incremental
* `pwd`: Print current working directory (absolute path).
* `ls`: List files and directories.
* `cd`: Change directory.
* `mkdir`: Make directory.
* `rmdir`: Remove directory (safer).
* `touch`: Create an empty file - set the timestamp of a file to the current time.
* `rm`: Remove files.
* `mv`: Move or rename files.
* `cp`: Copy files.
:::
## `ls` useful options {.smaller}
`ls` shows files and directories in the current directory. You can provide a
**path** to list files in a different directory. Here are some useful options:
* `ls -l`: List files with details.
* `ls -a`: List all files, including hidden ones.
* `ls -lh`: List files with human-readable sizes.
* `ls -t`: List files sorted by modification time.
* `ls -S`: List files sorted by size.
* `ls -R`: List files recursively.
* `ls -r`: List sorted files in reverse order.
* `ls -1`: List files in a single column.
You can combine options, e.g., `ls -lhSr /home/user/data` to list files
in `/home/user/data` folder with *human-readable sizes*
ordered by *size* in *ascending* order (bigger files on *bottom*).
## Useful options for other commands {.smaller}
* Without a path:
+ `cd`: (without any arguments) change to **your home** directory
+ `cd -`: change to the previous directory
* With a path (*relative or absolute*)
+ `mkdir -p`: create a directory with its parents if they do not exist
+ `rm -r`: remove directories and their contents recursively (**use with caution**)
+ `rm -i`: prompt before removing files
+ `mv -i`: prompt before overwriting files
+ `cp -r`: copy directories and their contents recursively
+ `cp -i`: prompt before overwriting files
. . .
Can you guess the difference between `rm -r` and `rmdir`?
## Navigate the Filesystem (exercise) {.smaller}
1. Open a terminal (if not yet open)
2. Print the current working directory
3. List the files and directories
4. Change to your *home* directory
5. Create a new directory called `mydir`
6. Change to the `mydir` directory
7. Create a new file called `myfile.txt`
8. List the files and directories
9. Move `myfile.txt` to the *home* directory
10. Remove the `mydir` directory
11. List the files and directories
## Wildcards {.smaller}
Wildcards are characters that help match file names based on patterns. Ex:
::: incremental
- `*`: Matches any number of characters (`file*` matches `file1`, `file2`, `fileverylong` etc.)
- `?`: Matches a single character (`file?` matches `file1`, `file2`, but not `fileverylong`)
- `[ ]`: Matches any character within the brackets (`file[12]` matches `file1`, `file2`, but
not `file3` or `fileverylong`. `file[1-9]` to match file with any digit)
- `{ }`: Matches any of the comma-separated word (`file{1,2,verylong}`
matches `file1`, `file2` and `fileverylong`)
:::
## Special characters {.smaller}
Some characters have special meanings in the shell:
::: incremental
* `~`: expands to the home directory (`cd ~`, `cd ~/data`)
* `$`: refers to an environment variable (`echo $HOME`, `echo $PWD`)
* `;`: separates commands in one line (execute commands with this order e.g.
`cd /tmp; ls` first change to `/tmp` then list files)
* `\`: escapes the next character (`ls file\ with\ spaces.txt`)
* `'`: preserves the literal value of all characters enclosed
(`echo 'Today is $(date)'` will print `Today is $(date)`)
* `"`: preserves the literal value of all characters enclosed, but
allow for variable expansion, command substitution, and escape sequences
(`echo "Today is $(date)"` will print `Today is <current date>`)
* `#`: comments the rest of the line (`# this is a comment`). Not executed.
:::
## File Permissions {.smaller}
Each file has three types of permissions: **read**, **write**, and **execute**.
These permissions are set for three types of users: **owner**, **group**, and **others**.
::: incremental
* `r`: read permission
* `w`: write permission
* `x`: execute permission
* `-`: no permission
:::
. . .
You can inspect permissions using `ls -l`, near the file name. For example, `rwxr-xr--` means:
* Owner has read, write, and execute permissions.
* Group has read and execute permissions.
* Others have only read permission.
## Change File Permissions (UGO) {.smaller}
You can change file permissions using `chmod` command. The general syntax is:
`chmod [options] mode file`
where `mode` can be:
::: incremental
* `u`: user (owner)
* `g`: group
* `o`: others
* `a`: all (u, g, o)
* `+`: add permission
* `-`: remove permission
* `=`: set permission
:::
. . .
For example, to give execute permission to the owner of a file:
`chmod u+x file`
## Stdin, Stdout, Stderr {.smaller}
Every process in Unix has three standard streams:
::: incremental
* **Standard Input (stdin)**: Input from the keyboard or another process.
* **Standard Output (stdout)**: Output to the terminal or another process.
* **Standard Error (stderr)**: Error messages to the terminal or another process.
:::
. . .
By default, stdin is the keyboard, and stdout and stderr are the terminal. You can redirect these streams:
::: incremental
* `>`: Redirect stdout to a file (`ls > files.txt`)
* `>>`: Append stdout to a file (`ls >> files.txt`)
* `2>`: Redirect stderr to a file (`ls non_existent_file 2> errors.txt`)
* `&>`: Redirect both stdout and stderr to a file (`ls non_existent_file &> output.txt`).
Its equivalent to `> output.txt 2>&1`
* `<`: Redirect stdin from a file (`wc -l < files.txt`)
:::
## Pipes {.smaller}
Pipes (`|`) connect the stdout of one command to the stdin of another. For example:
`ls -l | wc -l`
This command lists files in the current directory and counts the number of lines in the output.
. . .
You can chain multiple commands using pipes:
`ls -l | grep myfile | wc -l`
This command lists files in the current directory, filters lines containing `myfile`, and counts the number of lines.
## Environment Variables {.smaller}
Environment variables are key-value pairs that store information about the environment. Some common environment variables:
::: incremental
* `HOME`: Home directory
* `PATH`: List of directories to search for executable files
* `PWD`: Present working directory
* `OLDPWD`: Previous working directory
* `USER`: Current user
* `SHELL`: Current shell
:::
. . .
You can access environment variables using `$`, for example `echo $HOME` prints
the home directory (to *stdout*).
You can use environment variables in scripts or commands, for example:
`cd $HOME` or `cp $OLDPWD/file.txt .`
## Aliases {.smaller}
Aliases are shortcuts for commands. You can define aliases in the shell configuration file (e.g. `~/.bashrc`). For example:
```bash
alias ll='ls -l'
```
This command creates an alias `ll` for `ls -l`. You can use `ll` instead of `ls -l`.
. . .
You can list all aliases using `alias` command. To remove an alias, use `unalias` command:
```bash
unalias ll
```
### Exercise
* How many *aliases* are defined in your terminal?
* How many commands do you know?
## Access file content {.smaller}
You can access the content of a file using `cat`, `less`, `more`, or `head` and `tail` commands.
::: incremental
* `cat`: Concatenate and display file content
* `less`: Display file content page by page
* `more`: Display file content page by page
* `head`: Display the first lines of a file
* `tail`: Display the last lines of a file
:::
. . .
For example, to display the first 10 lines of a file:
`head file.txt`
## Search file content {.smaller}
You can search for text in files using `grep` command. The general syntax is:
`grep [options] pattern file`
where `pattern` is the text to search for. For example:
`grep 'pattern' file.txt`
This command searches for `pattern` in `file.txt` and prints matching lines.
. . .
You can use regular expressions in `grep` to search for more complex patterns. For example:
`grep -E 'pattern1|pattern2' file.txt`
This command searches for `pattern1` or `pattern2` in `file.txt`.
## Find files {.smaller}
You can find files in the filesystem using `find` command. The general syntax is:
`find [path] [options]`
where `path` is the directory to search in. For example:
`find /tmp -iname '*.txt'`
This command searches for files with `.txt` extension in `/tmp` directory. More options:
::: incremental
* `-name`: Search by exact name
* `-type`: Search by file type (e.g. `f` for file, `d` for directory)
* `-size`: Search by file size (`+` for larger, `-` for smaller. e.g. `+1M`)
* `-exec`: Execute a command on found files (requires `{}` as a placeholder and `\;` at the end of the command)
:::
## Utilities {.smaller}
There are many utilities available in Unix-like systems. Some common utilities:
::: incremental
* `awk`: A powerful text processing tool
* `sed`: A stream editor for filtering and transforming text
* `cut`: Extract columns from each line of files
* `sort`: Sort lines of text files
* `uniq`: Report or omit repeated lines
* `wc`: Print newline, word, and byte counts for each file
* `diff`: Compare files line by line
* `file`: Determine file type
* `du`: Estimate file space usage
* `man`: Display manual pages
* `which`: Locate a command
:::
## Utilities exercise {.smaller}
* List current directory contents using `ls` command and redirect the output to `files.txt`.
* Count the number of files in the current directory using `ls` and `wc` commands.
* Find all files with `.txt` extension in the current directory using `find` command.
* Display the first 5 lines of `files.txt` using `head` command.
* Rename `files.txt` to `all_files.txt` using `mv` command.
* `ls` a non-existent file and redirect the error to `errors.txt`.
* Concatenate `all_files.txt` and `errors.txt` using `cat` command and redirect
the output to `all_files_errors.txt`.
* Display manual pages for `bash` (`man bash`). Search for `Commands for Moving` section:
How I can move to the beginning of the line? How I can move to the next word?
* GitPod users: locate file `LSPAN24-practical1.qmd`, `grep` for `##` characters (mind `#`
is a comment character), then order titles alphabetically using `sort`
## Curl vs Wget {.smaller}
`curl` and `wget` are command-line tools for transferring data with URLs. Some differences:
::: incremental
* `curl`: Supports multiple protocols (HTTP, HTTPS, FTP, etc.), more flexible, but less user-friendly.
* `wget`: Supports HTTP and FTP, more user-friendly, but less flexible.
* `curl` is more suitable for scripting and automation.
:::
. . .
You can use `curl` and `wget` to download files from the web. For example:
`curl -O https://example.com/file.txt`
This command downloads `file.txt` from `https://example.com` to the current directory.
## Compression {.smaller}
You can compress and decompress files using `gzip`, `bzip2`, and `xz` commands. For example:
`gzip file.txt`
This command compresses `file.txt` to `file.txt.gz`. To decompress:
`gunzip file.txt.gz` (or `gzip -d file.txt.gz`)
. . .
You can use `bzip2` and `xz` commands similarly. For example to compress:
`bzip2 file.txt`
`xz file.txt`
To decompress:
`bunzip2 file.txt.bz2` (or `bzip2 -d file.txt.bz2`)
`unxz file.txt.xz` (or `xz -d file.txt.xz`)
## Tar {.smaller}
You can create (`-c`) and extract (`-x`) tar archives using `tar` command. For example:
`tar -cvf archive.tar file1 file2`
This command creates `archive.tar` containing `file1` and `file2`.
**This archive is not compressed** and have the same size of the sum of
`file1` and `file2` sizes. To extract:
`tar -xvf archive.tar`
. . .
You can compress and extract compressed tar archives in one step. For example:
`tar -czvf archive.tar.gz file1 file2`
This command creates `archive.tar.gz` containing `file1` and `file2`. To extract:
`tar -xzvf archive.tar.gz`
Mind to the `z` option for gzip, `j` for bzip2, and `J` for xz. `f` need to be
followed by the archive name.
## Tar (remember) {.smaller}
![Credits [programmerhumor.io](https://programmerhumor.io/programming-memes/for-the-programmers-who-use-tar/)](https://programmerhumor.io/wp-content/uploads/2023/11/programmerhumor-io-programming-memes-4afd15c9908ddc5-758x809.jpg){fig-align="center" .r-stretch fig-cap-location=bottom}
## Tar (remember /2) {.smaller}
![Credits [xkcd.com](https://xkcd.com/1168/)](https://imgs.xkcd.com/comics/tar_2x.png){fig-align="center" .r-stretch fig-cap-location=bottom}
## Editors
There are many text editors available in Unix-like systems.
Here some common editors available in terminal:
::: incremental
* `nano`: Simple and user-friendly text editor
* `vim`: Powerful and highly configurable text editor
* `emacs`: Extensible and customizable text editor
:::
. . .
::: {.callout-tip}
GitPod users have access to [Visual Studio Code](https://code.visualstudio.com/),
a powerful and highly configurable text editor.
:::
## Introduction to virtual environment
Virtual environments are isolated environments for software development.
They allow you to install dependencies and packages without affecting the system-wide installation.
Some common tools for creating virtual environments:
- Python: [virtualenv](https://virtualenv.pypa.io/en/latest/),
[pyenv](https://github.com/pyenv/pyenv), [poetry](https://python-poetry.org/)
- R: [packrat](https://rstudio.github.io/packrat/), [renv](https://rstudio.github.io/renv/articles/renv.html)
- Perl: [local::lib](https://metacpan.org/pod/local::lib), [perlbrew](https://perlbrew.pl/)
- Ruby: [rvm](https://rvm.io/)
::: notes
something on reproducibility and the need for virtual environments.
:::
## Introduction to Conda
- [Conda](https://docs.anaconda.com/anaconda/install/): A package manager and environment management system for installing and managing software packages and dependencies, particularly for Python and R.
::: incremental
- [Miniconda](https://docs.anaconda.com/miniconda/): A minimal installer for Conda that includes Conda, Python, and other essential packages.
- [Mamba](https://mamba.readthedocs.io/en/latest/installation/mamba-installation.html): A faster, drop-in replacement for Conda. You can swap almost all commands between `conda` & `mamba`
:::
::: notes
conda to install dependencies and softwares
Conda: has navigators, softwares for installing packages using GUI and editors like spyder, jupyter notebook, etc. Suggested on local machines. Miniconda: is a lightweight version of Conda, suggested on servers or where the GUI is not available. Mamba: implemented the package resolvers in C++ and is faster than Conda. However conda has introduced a new package resolver last year. It has some additional features like `repoquery`.
:::
## Create an environment (from scratch)
We use `conda` command to manage environments with `conda` / `miniconda`:
- replace with `mamba` if you have `mamba` installed.
- `--name`: required to specify the name of the environment.
- specifying versions helps the package resolver
``` {.bash code-line-numbers="1-2|4-5|7-8"}
# create an empty environment
conda create --name empty
# create an environment with some software
conda create --name python python ipython
# create an environment with a specific software version
conda create --name python3.10 python=3.10 ipython=8.25
```
## Create an environment (from file)
We can also create an environment from a file. Note the `env` before the `create` command:
- `--file`: specify the file containing the environment specifications
- `--name`: override the name of the environment in the file
``` {.bash code-line-numbers="1-2|4-5"}
# create an environment from a file (use name in file)
conda env create --file environment.yml
# create an environment from a file (override name in file)
conda env create --file environment.yml --name myname
```
::: notes
without the `--file` option, the current `environment.yml` file will be used.
:::
## List and activate environments {.smaller}
- `conda env list`: list all environments
- `conda activate <env_name>`: activate an environment
Normally the prompt will change to show the active environment. Default installation
have the `base` environment active at login
```bash
# Activate an environment
(base) gitpod ~ $ conda activate ognigenoma
(ognigenoma) gitpod ~ $
```
. . .
To deactivate an environment, use `conda deactivate`:
```bash
# Deactivate the current environment
(ognigenoma) gitpod ~ $ conda deactivate
(base) gitpod ~ $
```
You can activate more than one environment at a time.
The last activated environment will be active.
When you deactivate an environment, the previous environment will be activated.
## Channels and repositories {.smaller}
- `defaults`: the default channel for conda packages
- `R`: a channel for R packages
- `bioconda`: a channel for bioinformatics software
- `conda-forge`: a community-driven collection of conda packages
``` {.bash}
# add a channel to the list of channels
conda config --add channels conda-forge
# add a channel to the list of channels
conda config --add channels bioconda
```
::: {.callout-tip}
GitPod users: we have set up channels required for the practical with
the suggested *priority* settings. You can see the settings in the
`.condarc` file in your home directory. See more information on
conda [Managing Channels](https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-channels.html)
:::
## Search for packages {.smaller}
Use `conda search` to search for packages in the configured channels.
Wildcards can be used in the search:
``` {.bash code-line-numbers="1-2|4-5"}
# search for a package in the configured channels
conda search samtools
# search for a package in a specific channel
conda search --channel bioconda samtools
```
::: {.callout-tip}
GitPod users: the channels are already configured for the practicals.
You can search any package in the configured channels without specifying
the channel.
:::
. . .
### Exercise
Search for the package `samtools` in the `bioconda` channel:
1. What versions are available? What is the latest version? In which channel?
2. Do the same for `r-base` package.
## Install packages {.smaller}
* `conda install`: install packages in the active environment
+ `--name`: specify the environment to install the package
+ `--file`: specify a file with the list of packages to install
``` {.bash code-line-numbers="1-3|5-6|8-10"}
# install a package in the active environment
# base environment is read-only for GitPod users!
conda install pandas
# install a package in a specific environment
conda install --name python3.10 pandas
# install packages from a file in the active environment
# (python format)
conda install --file requirements.txt
```
### Exercise
Create an environment with `samtools`, `tabix` and `bcftools` packages.
Activate the environment and check if the packages are installed. Then
install `seqkit` package in the same environment.
::: notes
We are not creating an environment, the environment is already
present when I install a package
:::
## Understanding the environment (exercise) {.smaller}
1. Ensure the environment created in the previous exercise is active. Can
you find where the `samtools` executable is located?
2. Deactivate the environment and check if the `samtools` executable is
available. Why is it not available?
3. Can you figure out how conda makes the executables available in the
active environment?
. . .
::: {.callout-tip}
- Use `which` to find the location of the executable.
- Take a look to the environment variables in the active environment, for
example `$PATH`.
:::
## Exporting an environment {.smaller}
- `conda env export`: export the environment to a file
- `--name`: specify the environment to export
- `--file`: specify the file to export the environment to
- `--no-builds`: exclude the build string from the exported file
``` {.bash code-line-numbers="1-2|4-5"}
# export the active environment to a file (using STDOUT)
conda env export > environment.yml
# export a specific environment to a specific file
conda env export --name python3.10 --file environment.yml
```
. . .
### Exercise
Export an environment to a file; then export the same environment
with `--no-builds` option to another file. Compare the two files
with `diff` (try `diff -y` for a more readable output).
::: notes
does people know STDOUT/STDERR/STDIN?
:::
## Some tips {.smaller}
::: incremental
- `conda list`: list all packages installed in the active environment
- Install packages with versions: `conda install pandas=1.3.3`
- Create environments by scope: don't pull all packages in the same environment
- Limit the `conda-forge` channel usage if possible
- `conda list --revision`: list all revisions of the environment
- `conda install --revision 1`: revert environment to a previous revision
- `conda env remove --name <env_name>`: remove an environment
- `conda create --clone <env_name> --name <new_env>`: clone an environment
- `conda clean --all`: clean the cache and unused packages
:::
## Practical with conda: get genome data {.smaller}
Let's collect some genome data to make an example. We will use the CLI tools
made available by NCBI, `datasets` and `dataformat`, to collect ARS-UCD1.2
data from the NCBI database. Next, we will use `seqkit` to manipulate file
headers and then we will `bgzip` to pack the sequence files.
1. Create one conda environment with `ncbi-datasets-cli`, `jq`, `seqkit` and `tabix`
packages
2. Go the [NCBI Datasets](https://www.ncbi.nlm.nih.gov/datasets/) page
and search for *cow* (`Bos taurus (cattle)` will be suggested). You
will open the new page for
[NCBI Taxonomy ID: `9913`](https://www.ncbi.nlm.nih.gov/datasets/taxonomy/9913/).
3. Click on [ARS-UCD2.0](https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_002263795.3/)
link, below the *Genome* section an over the *Download* button (don't
download the genome from this page)
## Genome assembly and legacy versions {.smaller}
In the practical we will use the ARS-UCD1.2 (`GCF_002263795.1`) version of the genome,
however the latest version is ARS-UCD2.0 (`GCF_002263795.3`):
1. Go to the bottom of the page and search for the
[GCF_002263795.1](https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_002263795.1/)
accession for the ARS-UCD1.2 version (will be in the refseq column)
and click on it.
2. In the top of the page, near the *Download* button we have some
different ways to collect data. We don't want to download the genome
on on our laptop, so click on the *datasets* tab, just near the download
button.
3. A pop-up window will appear with the command to download the genome
data. Copy the command by clicking on the *Copy* button.
## NCBI Datasets CLI {.smaller}
`datasets` is a command-line tool that is used to query and download biological
sequence data across all domains of life from NCBI databases. See the
[documentation](https://www.ncbi.nlm.nih.gov/datasets/docs/v2/reference-docs/command-line/datasets/)
for more information.
For example, retrieve the same information as before using the accession number:
*pipe* the result to [jq](https://jqlang.github.io/jq/manual/) to format the output:
```{.bash}
datasets summary genome accession GCF_002263795.1 | jq
```
The same command can be use to download the genome data: paste the command you've
copied from the NCBI Taxonomy page and add two additional options:
```{.bash}
datasets download genome accession GCF_002263795.1 \
--include gff3,rna,cds,protein,genome,seq-report \
--dehydrated --filename ARS-UCD1.2.zip
```
::: {.callout-note}
The `\` (escape character) is used to break the command in multiple lines:
it is not necessary if you paste the command in one line: but if you paste
the command in multiple lines it prevents the command from being executed
before you finish typing it.
:::
## NCBI Datasets CLI (2) {.smaller}
* `--dehydrated`: download the data in a dehydrated format: the data is
downloaded in a format that can be rehydrated using the `datasets` tool.
This is required if you need to download a lot of data. See
[Download large genome data packages](https://www.ncbi.nlm.nih.gov/datasets/docs/v2/how-tos/genomes/large-download/)
for more information
* `--filename`: specify the name of the file to download the data to
* `--include`: specify the data types to download: if we are not interested
in all data types we can exclude some of them
Now unzip the downloaded archive in a new directory (since the archive
will place stuff in the current directory):
```{.bash}
unzip ARS-UCD1.2.zip -d ARS-UCD1.2
```
* `-d`: specify the directory to extract the files to
## Rehydrate the data {.smaller}
Take a look to see the downloaded data: we have not any data! the only
data we have are some metadata and the URL were files can be downloaded.
Is now time to rehydrate data:
```{.bash}
datasets rehydrate --directory ARS-UCD1.2
```
* `--directory`: this is the top level directory in which you have
decompressed data with `unzip`
Now the data should be downloaded in the same directory where the
metadata is located. You can check the data with `ls` or `tree`:
```{.bash}
tree -phC ARS-UCD1.2
```
* `tree`: list the directory structure in a tree-like format
+ `-p`: show permissions
+ `-h`: show sizes in human readable format
+ `-C`: colorize the output
## Deal with genome sequence {.smaller}
Take a minute to look at the files you have downloaded, especially the
genome sequence: how many sequences are in the file? what is the format
of the file?
We can use grep `'>'` to extract the sequence names from a fasta file
and then counting the number of lines, however
we can use a fasta/fastq manipulation program like [seqkit](https://bioinf.shenwei.me/seqkit/),
which can do this and much more:
```{.bash}
seqkit stats GCF_002263795.1_ARS-UCD1.2_genomic.fna
```
* `seqkit stats`: show statistics for the input file
There are a lot of sequences in the file, you can inspect the sequence
names with
```{.bash}
seqkit seq -n GCF_002263795.1_ARS-UCD1.2_genomic.fna
```
* `seqkit seq`: transform sequences (extract ID, filter by length,
remove gaps, reverse complement...)
+ `-n`: only print names/sequence headers
+ `-i`: print IDs instead of full headers
## Sequence report {.smaller}
The `sequence_report.jsonl` file contains information about the sequences in the
genome assembly. We can use `jq` to inspect the file, or we can use the
[dataformat](https://www.ncbi.nlm.nih.gov/datasets/docs/v2/reference-docs/command-line/dataformat/)