update
update to learning resources\ update to getting started\ disable package mode
Christian Ramirez committed Jan 4, 2025
1 parent 601d427 commit b4045c9
Showing 3 changed files with 239 additions and 28 deletions.
13 changes: 12 additions & 1 deletion getting_started.md
Original file line number Diff line number Diff line change
Expand Up @@ -117,7 +117,7 @@ ssh HOSTNAME

# SSH

## Fixing double loging issue
## Fixing double login issue
As you may have noticed, you needed to enter your password + 2FA twice!

This is because you are logging in twice
Expand Down Expand Up @@ -160,6 +160,14 @@ If you access multiple machines on the internal network frequently (e.g. z014 &
```

# Installing Software

## Environment management
### Virtual environments
A virtual environment is a collection of dependencies (often language-specific) that lets the application or program of interest run in isolation from global, system-wide dependencies. In bioinformatics applications, conda is a widely used virtual environment manager and dependency resolver.

### Containers
A different technology altogether, 'containerization' isolates the application or program of interest in a virtual process. This method usually offers a higher level of abstraction and isolation, in which each virtual process can have its own space, file system, network space, etc. Two widely used programs for containerization are Docker and Singularity.

## `conda`
Conda is a great way to quickly install software and create separate environments for projects requiring different, and potentially conflicting, pieces of software. The conda (Miniforge3) installation instructions have been adapted from default installation instructions so it is available across all our machines.

Expand Down Expand Up @@ -309,5 +317,8 @@ When you go to the github link, you should be prompted to authenticate with the

Now, go to the VSCode IDE on your client machine and open the command palette with `CMD + SHIFT + P` (macOS) and type `Remote-Tunnels: Connect to Tunnel`. Select the GitHub authentication option. Wait a bit, and you should see one remote resource "online." Once you've added the remote connection and opened a remote directory, you should be all set!

## Reconnecting tunnel
After you close VSCode, the tunnel will automatically close. However, the server will still be running on the remote machine. To reconnect the tunnel, you will need to `ssh` back into that machine.

## Usage of jupyter notebooks and kernels
If you want to make use of jupyter notebooks, you'll need to install the `Jupyter` extension (on both local and remote machine). After this you should be able to select from the available kernels (python, bash or R).
253 changes: 226 additions & 27 deletions learning_resources.md
Original file line number Diff line number Diff line change
Expand Up @@ -51,49 +51,248 @@ You can find a more in-depth tutorial [here](https://github.com/iggredible/Learn
## Version Control
### Git
The official Git documentation includes a tutorial [here](https://git-scm.com/docs/gittutorial).

# Data Types and File Formats

## Sequence Data Formats

### FASTQ (.fastq, .fq)
- Official spec: https://www.ncbi.nlm.nih.gov/sra/docs/submitformats/#fastq-files
- Illumina format: https://emea.support.illumina.com/help/BaseSpace_OLH_009008/Content/Source/Informatics/BS/FileFormat_FASTQ-files_swBS.htm

Text-based format for storing both biological sequence data (usually nucleotide sequences) and their corresponding quality scores.

*Each sequence entry contains four lines*:
1. Sequence identifier with description
2. Raw sequence letters
3. Plus sign (optional description)
4. Quality scores encoded in ASCII characters

| Element | Requirements | Description |
| --- | --- | --- |
| `@` | @ | Each sequence identifier line starts with @ |
| `<instrument>` | Characters allowed: a–z, A–Z, 0–9 and underscore | Instrument ID |
| `<run number>` | Numerical | Run number on instrument |
| `<flowcell ID>` | Characters allowed: a–z, A–Z, 0–9 | Flowcell ID |
| `<lane>` | Numerical | Lane number |
| `<tile>` | Numerical | Tile number |
| `<x_pos>` | Numerical | X coordinate of cluster |
| `<y_pos>` | Numerical | Y coordinate of cluster |
| `<read>` | Numerical | Read number: 1 for a single read or the first read of a pair, 2 for the second read of a paired-end run |
| `<is filtered>` | Y or N | Y if the read is filtered (did not pass), N otherwise |
| `<control number>` | Numerical | 0 when none of the control bits are on, otherwise it is an even number. On HiSeq X systems, control specification is not performed and this number is always 0 |
| `<sample number>` | Numerical | Sample number from sample sheet |

#### Example
``` {code} text
:filename: example.fastq
@071112_SLXA-EAS1_s_7:5:1:817:345
GGGTGATGGCCGCTGCCGATGGCGTC
AAATCCCACC
+
IIIIIIIIIIIIIIIIIIIIIIIIII
IIII9IG9IC
@071112_SLXA-EAS1_s_7:5:1:801:338
GTTCAGGGATACGACGTTTGTATTTTAAGAATCTGA
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII6IBI
```
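The four-line layout described above can be read with a few lines of Python. This is a minimal sketch, not a full FASTQ reader: the names `FastqRecord` and `parse_fastq` are illustrative, the `read_1` record is made up, and the quality decoding assumes Phred+33 ASCII encoding, which the text above does not state.

```python
from typing import Iterator, NamedTuple


class FastqRecord(NamedTuple):
    identifier: str        # header line without the leading '@'
    sequence: str
    qualities: list[int]   # Phred scores, ASSUMING Phred+33 encoding


def parse_fastq(lines: list[str]) -> Iterator[FastqRecord]:
    """Yield records from FASTQ text that uses exactly four lines per entry."""
    stripped = [line.rstrip("\n") for line in lines if line.strip()]
    if len(stripped) % 4 != 0:
        raise ValueError("expected a multiple of four non-empty lines")
    for i in range(0, len(stripped), 4):
        header, sequence, plus, quality = stripped[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record starting at line {i + 1}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header}")
        yield FastqRecord(header[1:], sequence, [ord(c) - 33 for c in quality])


# Hypothetical record, made up for illustration only.
example = "@read_1\nGATTACA\n+\nIIII6I9\n"
records = list(parse_fastq(example.splitlines()))
print(records[0].identifier, records[0].sequence, records[0].qualities)
```

Reading by position (four lines at a time) rather than by searching for `@` matters because `@` is also a valid quality character and can appear at the start of a quality line.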

### SAM/BAM/CRAM (.sam, .bam, .cram)
- Official spec: https://samtools.github.io/hts-specs/
- SAM: https://samtools.github.io/hts-specs/SAMv1.pdf
- CRAM: https://samtools.github.io/hts-specs/CRAMv3.pdf
- BAM: https://samtools.github.io/hts-specs/SAMv1.pdf
- SAM (Sequence Alignment/Map): Text format for storing sequence alignments against a reference genome
- BAM: Binary version of SAM, compressed and indexed for faster processing
- CRAM: Highly compressed reference-based alternative to BAM, designed for long-term storage

### FASTA (.fasta, .fa)
- NCBI spec: https://www.ncbi.nlm.nih.gov/genbank/fastaformat/
- UniProt guide: https://www.uniprot.org/help/fasta-headers

Simple text-based format for representing nucleotide or peptide sequences. Typically, FASTA is used to store reference data (such as those from curated databases).

*Each entry consists of*:
- A description line (starts with '>')
- The sequence data on subsequent lines

#### Example
``` {code} text
:filename: example.fasta
>Mus_musculus_tRNA-Ala-AGC-1-1 (chr13.trna34-AlaAGC)
GGGGGTGTAGCTCAGTGGTAGAGCGCGTGCTTAGCATGCACGAGGcCCTGGGTTCGATCC
CCAGCACCTCCA
>Mus_musculus_tRNA-Ala-AGC-10-1 (chr13.trna457-AlaAGC)
GGGGGATTAGCTCAAATGGTAGAGCGCTCGCTTAGCATGCAAGAGGtAGTGGGATCGATG
CCCACATCCTCCA
```
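Because a sequence may wrap across several lines, as in the example above, a reader has to join every line up to the next `>`. A minimal sketch, where the function name `parse_fasta` and the two short entries are made up for illustration:

```python
def parse_fasta(text: str) -> dict[str, str]:
    """Map each description line (without '>') to its joined sequence."""
    chunks: dict[str, list[str]] = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            current = line[1:]
            if current in chunks:
                raise ValueError(f"duplicate description line: {current}")
            chunks[current] = []
        elif current is None:
            raise ValueError("sequence data found before the first '>' line")
        else:
            chunks[current].append(line)
    return {name: "".join(parts) for name, parts in chunks.items()}


# Hypothetical entries, made up for illustration only.
example = ">seq1 first example\nACGT\nTTGA\n>seq2\nGGCC\n"
sequences = parse_fasta(example)
print(sequences)
```

For real reference files, which can be very large, a streaming reader or an established library is usually a better fit than loading everything into a dictionary.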

## Variant Formats

### VCF/BCF (.vcf, .bcf)
- Official spec: https://samtools.github.io/hts-specs/VCFv4.3.pdf
- Format guide: https://www.internationalgenome.org/wiki/Analysis/vcf4.0/
- VCF (Variant Call Format): Text file format for storing gene sequence variations
- BCF: Binary version of VCF

*Contains information about*:
- Genomic position of variants
- Reference and alternative alleles
- Quality scores
- Filter statuses
- Additional annotations
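The items listed above correspond to the eight fixed, tab-separated columns of a VCF data line (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO) as defined in the VCF specification linked above. The following is a rough sketch of splitting one such line; the function name `parse_vcf_line` and the returned dictionary layout are illustrative choices, and header lines and per-sample columns are not handled.

```python
def parse_vcf_line(line: str) -> dict:
    """Split one VCF data line into its eight fixed columns."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 8:
        raise ValueError("a VCF data line needs at least 8 tab-separated columns")
    chrom, pos, variant_id, ref, alt, qual, filt, info = fields[:8]
    info_dict = {}
    if info != ".":
        for item in info.split(";"):
            key, sep, value = item.partition("=")
            # INFO keys without '=' are flags; store them as True.
            info_dict[key] = value if sep else True
    return {
        "chrom": chrom,
        "pos": int(pos),                      # 1-based position
        "id": variant_id,
        "ref": ref,
        "alt": alt.split(","),                # several alternative alleles are comma-separated
        "qual": None if qual == "." else float(qual),
        "filter": filt,
        "info": info_dict,
    }


# Data line in the style of the example in the VCF specification.
variant = parse_vcf_line("20\t14370\trs6054257\tG\tA\t29\tPASS\tNS=3;DP=14;AF=0.5;DB;H2")
print(variant["chrom"], variant["pos"], variant["ref"], variant["alt"], variant["info"])
```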

## Genome Annotation Formats

### BED (.bed)
- UCSC spec: https://genome.ucsc.edu/FAQ/FAQformat.html#format1
- Extended BED: https://genome.ucsc.edu/FAQ/FAQformat.html#format1.7

Browser Extensible Data format for defining genomic features.

*Contains*:
- Chromosome name
- Start position
- End position
- Optional fields (name, score, strand, etc.)

Commonly used for displaying data tracks in genome browsers.

| Column number | Title | Definition | Required |
| --- | --- | --- | --- |
| **1** | **chrom** | Chromosome (e.g. chr3, chrY, chr2_random) or scaffold (e.g. scaffold10671) name | Yes |
| **2** | **chromStart** | Start coordinate on the chromosome or scaffold for the sequence considered (the first base on the chromosome is numbered 0 i.e. the number is zero-based) | Yes |
| **3** | **chromEnd** | End coordinate on the chromosome or scaffold for the sequence considered. Unlike chromStart, this position is non-inclusive: the feature covers bases chromStart through chromEnd - 1 in zero-based numbering, so chromStart=0, chromEnd=100 spans the first 100 bases | Yes |
| **4** | **name** | Name of the line in the BED file | No |
| **5** | **score** | Score between 0 and 1000 | No |
| **6** | **strand** | DNA strand orientation (positive ["+"] or negative ["-"] or "." if no strand) | No |
| **7** | **thickStart** | Starting coordinate from which the annotation is displayed in a thicker way on a graphical representation (e.g.: the start codon of a gene) | No |
| **8** | **thickEnd** | End coordinates from which the annotation is no longer displayed in a thicker way on a graphical representation (e.g.: the stop codon of a gene) | No |
| **9** | **itemRgb** | RGB value in the form R, G, B (e.g. 255,0,0) determining the display color of the annotation contained in the BED file | No |
| **10** | **blockCount** | Number of blocks (e.g. exons) on the line of the BED file | No |
| **11** | **blockSizes** | List of values separated by commas corresponding to the size of the blocks (the number of values must correspond to that of the "blockCount") | No |
| **12** | **blockStarts** | List of values separated by commas corresponding to the starting coordinates of the blocks, coordinates calculated relative to those present in the chromStart column (the number of values must correspond to that of the "blockCount") | No |
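The zero-based start and non-inclusive end are a common source of off-by-one errors, so a short sketch may help. It assumes tab-separated columns and only looks at the first six; the function name `parse_bed_line` and the example feature are made up for illustration.

```python
def parse_bed_line(line: str) -> dict:
    """Parse the first six columns of a tab-separated BED line."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 3:
        raise ValueError("a BED line needs at least chrom, chromStart and chromEnd")
    start, end = int(fields[1]), int(fields[2])
    if start < 0 or end < start:
        raise ValueError("expected 0 <= chromStart <= chromEnd")
    record = {
        "chrom": fields[0],
        "start": start,                 # zero-based, inclusive
        "end": end,                     # non-inclusive
        "length": end - start,          # no +1 needed for half-open intervals
        "one_based": (start + 1, end),  # same feature in 1-based, inclusive coordinates
    }
    # Optional columns 4-6, kept as strings when present.
    for key, value in zip(("name", "score", "strand"), fields[3:6]):
        record[key] = value
    return record


# Hypothetical feature covering the first 100 bases of chr3.
feature = parse_bed_line("chr3\t0\t100\tfeature_a\t500\t+")
print(feature["length"], feature["one_based"])
```

The `one_based` pair is the form expected by formats that number the first base as 1 and include the end position.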

### GFF/GTF (.gff, .gtf)
- GFF3 spec: https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md
- GTF spec: http://mblab.wustl.edu/GTF22.html
- Ensembl GTF: https://www.ensembl.org/info/website/upload/gff.html
- GFF (General Feature Format): Describes genes and other features of DNA, RNA, and protein sequences
- GTF (Gene Transfer Format): More specialized version of GFF

*Contains*:
- Feature coordinates
- Feature types
- Score
- Strand information
- Frame
- Attribute-value pairs

| Column number | Title | Definition | Required |
| --- | --- | --- | --- |
| **1** | **seqname** | Name of the chromosome or scaffold; chromosome names can be given with or without the 'chr' prefix. Must be a standard chromosome name or an Ensembl identifier such as a scaffold ID, without additional content like species or assembly | Yes |
| **2** | **source** | Name of the program that generated this feature, or the data source (database or project name) | Yes |
| **3** | **feature** | Feature type name, e.g. Gene, Variation, Similarity | Yes |
| **4** | **start** | Start position of the feature, with sequence numbering starting at 1 | Yes |
| **5** | **end** | End position of the feature, with sequence numbering starting at 1 | Yes |
| **6** | **score** | A floating point value | Yes (use '.' if empty) |
| **7** | **strand** | Defined as + (forward) or - (reverse) | Yes (use '.' if empty) |
| **8** | **frame** | One of '0', '1' or '2'. '0' indicates that the first base of the feature is the first base of a codon, '1' that the second base is the first base of a codon, and so on | Yes (use '.' if empty) |
| **9** | **attribute** | A semicolon-separated list of tag-value pairs, providing additional information about each feature | Yes |

## Expression Data
### GCT (.gct)
- Broad Institute spec: https://software.broadinstitute.org/cancer/software/gsea/wiki/index.php/Data_formats

### WIG (.wig)
The WIG (wiggle) format is used for displaying continuous-valued data in track format. It is particularly useful for showing expression data, probability scores, and GC percentage.

WIG files can be formatted in two main ways:
#### 1. Fixed Step
```
fixedStep chrom=chrN start=pos step=stepInterval [span=windowSize]
dataValue
dataValue
dataValue
```
##### Required Fields
| Field | Description |
| --- | --- |
| **chrom** | Chromosome name (e.g., chr1) |
| **start** | Starting position |
| **step** | Distance between starts of adjacent windows |
| **dataValue** | Numerical data value for each position |

##### Optional Fields
| Field | Description |
| --- | --- |
| **span** | Size of window (defaults to 1) |

#### 2. Variable Step
```
variableStep chrom=chrN [span=windowSize]
chromStart dataValue
chromStart dataValue
chromStart dataValue
```
##### Required Fields
| Field | Description |
| --- | --- |
| **chrom** | Chromosome name |
| **chromStart** | Start position of each window |
| **dataValue** | Numerical data value for each position |

##### Optional Fields
| Field | Description |
| --- | --- |
| **span** | Size of window (defaults to 1) |

#### Common Use Cases
1. Gene expression levels
2. ChIP-seq signal intensity
3. RNA-seq coverage
4. Conservation scores
5. GC content
6. Probability scores
7. Transcription factor binding signals

#### Example
``` {code} text
:filename: example.wig
# Fixed-step example showing expression values
fixedStep chrom=chr3 start=400601 step=100 span=100
11.0
22.0
33.0
# Variable-step example showing conservation scores
variableStep chrom=chr3 span=100
500701 5.0
500801 3.0
500901 8.0
```
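To make the `start`, `step` and `span` parameters concrete, the sketch below expands a fixed-step block into explicit intervals. The function name `expand_fixed_step` is illustrative, coordinates are treated as 1-based and inclusive, a missing `span` is treated as 1 (an assumption based on the UCSC wiggle description), and variable-step blocks are deliberately not handled.

```python
def expand_fixed_step(lines: list[str]) -> list[tuple[str, int, int, float]]:
    """Expand fixedStep WIG blocks into (chrom, start, end, value) tuples."""
    intervals = []
    chrom = None
    position = step = span = 0
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line.startswith("fixedStep"):
            params = dict(token.split("=", 1) for token in line.split()[1:])
            chrom = params["chrom"]
            position = int(params["start"])
            step = int(params["step"])
            span = int(params.get("span", 1))  # ASSUMPTION: missing span means 1
        elif line.startswith(("variableStep", "track", "browser")):
            raise ValueError("this sketch only handles fixedStep blocks")
        elif chrom is None:
            raise ValueError("data value found before a fixedStep declaration")
        else:
            # Each value covers `span` bases; the next window starts `step` bases later.
            intervals.append((chrom, position, position + span - 1, float(line)))
            position += step
    return intervals


wig_text = """fixedStep chrom=chr3 start=400601 step=100 span=100
11.0
22.0
33.0
"""
intervals = expand_fixed_step(wig_text.splitlines())
for interval in intervals:
    print(interval)
```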

## Phylogenetic Formats

### Newick (.nwk)
- Format spec: http://evolution.genetics.washington.edu/phylip/newicktree.html
- Extended spec: https://doi.org/10.1093/bioinformatics/btg190

Standard format for representing phylogenetic trees using nested parentheses.

*Contains*:
- Tree topology
- Branch lengths
- Node labels
- Bootstrap values
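The nested-parentheses structure can be read with a small recursive parser. This is a simplified sketch under stated assumptions: it handles plain labels and branch lengths only (no quoted labels or bracketed comments), and the names `parse_newick` and `leaf_names` as well as the nested-dictionary layout are illustrative choices, not part of any specification.

```python
def parse_newick(text: str) -> dict:
    """Parse a simple Newick string into nested dicts with name, length, children."""
    text = text.strip()
    if not text.endswith(";"):
        raise ValueError("a Newick tree must end with ';'")
    pos = 0

    def parse_node() -> dict:
        nonlocal pos
        children = []
        if text[pos] == "(":
            pos += 1  # consume '('
            children.append(parse_node())
            while text[pos] == ",":
                pos += 1  # consume ','
                children.append(parse_node())
            if text[pos] != ")":
                raise ValueError(f"expected ')' at offset {pos}")
            pos += 1  # consume ')'
        start = pos
        while text[pos] not in ",():;":
            pos += 1
        name = text[start:pos]
        length = None
        if text[pos] == ":":
            pos += 1  # consume ':'
            start = pos
            while text[pos] not in ",();":
                pos += 1
            length = float(text[start:pos])
        return {"name": name, "length": length, "children": children}

    root = parse_node()
    if text[pos] != ";":
        raise ValueError(f"unexpected character at offset {pos}")
    return root


def leaf_names(node: dict) -> list[str]:
    """Collect the names of all nodes without children, left to right."""
    if not node["children"]:
        return [node["name"]]
    return [name for child in node["children"] for name in leaf_names(child)]


# Hypothetical tree with four leaves and two labelled internal nodes.
tree = parse_newick("(A:0.1,B:0.2,(C:0.3,D:0.4)E:0.5)F;")
print(tree["name"], leaf_names(tree))
```

Internal node labels such as `E` are where bootstrap values are often stored in practice, so a caller would need to decide whether to interpret them as names or as support values.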

### NEXUS (.nex)
- Original paper: http://dx.doi.org/10.1093/sysbio/46.4.590
- Format guide: http://wiki.christophchamp.com/index.php?title=NEXUS_file_format

Rich format for storing multiple types of biological data:
- Character matrices
- Trees
- Distance matrices
- Analysis assumptions

Supports multiple data blocks and commands.

## Index Formats

### BAI/CSI (.bai, .csi)
- Specs included in SAMtools: https://samtools.github.io/hts-specs/
- CSI spec: https://samtools.github.io/hts-specs/CSIv1.pdf
- BAI: Index format for BAM files
- CSI: Coordinate-sorted index format

These index files are typically required by common [CLI tools]()

*Enables*:
- Random access to compressed files
- Quick retrieval of alignments
- Efficient genome browsing

## Genome Browser Formats

### bigWig/bigBed (.bw, .bb)
- UCSC spec: https://genome.ucsc.edu/goldenPath/help/bigWig.html
- Format guide: https://genome.ucsc.edu/FAQ/FAQformat.html
- bigWig: Binary format of `.wig` files
- bigBed: Binary format of `.bed` files

*Advantages*:
- Efficient random access
- Reduced memory usage
- Fast display in genome browsers

# Pipeline Development
## Snakemake
Snakemake is a workflow management system that helps automate data analysis pipelines. The functionality of snakemake cannot be covered here in meaningful detail. Please read through the [snakemake documentation](https://snakemake.readthedocs.io/en/stable/) and play around with example pipelines to gain familiarity with this powerful tool.
Expand All @@ -104,7 +303,7 @@ Snakemake is a workflow management system that helps automate data analysis pipe
```
project_name/
├── README.md # Project overview, setup instructions, and usage
├── LICENSE # Project license (e.g., MIT, GPL)
├── LICENSE.md # Project license (e.g., MIT, GPL)
├── .gitignore # Git ignore rules
├── environment.yml # Conda environment specification
├── requirements.txt # Python package dependencies
Expand Down
1 change: 1 addition & 0 deletions pyproject.toml
Original file line number Diff line number Diff line change
Expand Up @@ -4,6 +4,7 @@ version = "0.1.1"
description = ""
authors = ["Christian S. Ramirez"]
readme = "README.md"
package-mode = false

[tool.poetry.dependencies]
python = "^3.13"
Expand Down
