diff --git a/getting_started.md b/getting_started.md
index 6b88901..b807d8f 100644
--- a/getting_started.md
+++ b/getting_started.md
@@ -117,7 +117,7 @@ ssh HOSTNAME
 # SSH
-## Fixing double loging issue
+## Fixing double login issue
 As you may have noticed, you needed to enter your password + 2FA twice! This is because you are logging-in twice
@@ -160,6 +160,14 @@ If you access multiple machines on the internal network frequently (e.g. z014 &
 ```
 # Installing Software
+
+## Environment management
+### Virtual environments
+A virtual environment is typically a collection of dependencies (possibly language-specific) that lets the application or program of interest run in isolation from global, system-wide dependencies. In bioinformatics applications, conda is a widely used virtual environment manager and dependency resolver.
+
+### Containers
+A different technology altogether, 'containerization' isolates the application or program of interest in a virtual process. This method usually offers a higher level of abstraction/isolation, in which each virtual process can have its own process space, file system, network space, etc. Two widely used programs for containerization are Docker and Singularity.
+
 ## `conda`
 Conda is a great way to quickly install software and create separate environments for projects requiring different, and potentially conflicting, pieces of software. The conda (Miniforge3) installation instructions have been adapted from default installation instructions so it is available across all our machines.
@@ -309,5 +317,8 @@ When you go to the github link, you should be prompted to authenticate with the
 Now, go to the VSCode IDE on your client machine and open the command palette with `CMD + SHIFT + P` (macos) and type `Remote-Tunnels: Connect to Tunnel`. Select the Github authentication option. Wait a bit, and you should see one remote resource "online." Once you've added the remote connection and opened a remote directory, you should be all set!
+## Reconnecting tunnel
+After you close VSCode, the tunnel will automatically close. However, the server will still be running on the remote machine. To reconnect the tunnel, you will need to `ssh` back into that machine.
+
 ## Usage of jupyter notebooks and kernels
 If you want to make use of jupyter notebooks, you'll need to install the `Jupyter` extension (on both local and remote machine). After this you should be able to select from the available kernels (python, bash or R).
diff --git a/learning_resources.md b/learning_resources.md
index 720f9d4..dc5a183 100644
--- a/learning_resources.md
+++ b/learning_resources.md
@@ -51,49 +51,248 @@ You can find a more in-depth tutorial [here](https://github.com/iggredible/Learn
 ## Version Control
 ### Git
 The official Git documentation includes a tutorial [here](https://git-scm.com/docs/gittutorial).
+
 # Data Types and File Formats
+
 ## Sequence Data Formats
+
 ### FASTQ (.fastq, .fq)
-- Official spec: https://www.ncbi.nlm.nih.gov/sra/docs/submitformats/#fastq-files
-- Illumina format: https://emea.support.illumina.com/help/BaseSpace_OLH_009008/Content/Source/Informatics/BS/FileFormat_FASTQ-files_swBS.htm
+Text-based format for storing both biological sequence data (usually nucleotide sequences) and their corresponding quality scores.
+
+*Each sequence entry contains four lines*:
+1. Sequence identifier with description
+2. Raw sequence letters
+3. Plus sign (optionally followed by a description)
+4. Quality scores encoded in ASCII characters
+
+| Element | Requirements | Description |
+| --- | --- | --- |
+| `@` | @ | Each sequence identifier line starts with @ |
+| `` | Characters allowed: a–z, A–Z, 0–9 and underscore | Instrument ID |
+| `` | Numerical | Run number on instrument |
+| `` | Characters allowed: a–z, A–Z, 0–9 | Flowcell ID |
+| `` | Numerical | Lane number |
+| `` | Numerical | Tile number |
+| `` | Numerical | X coordinate of cluster |
+| `` | Numerical | Y coordinate of cluster |
+| `` | Numerical | Read number: 1 for a single read or the first read of a pair, 2 for the second read of a paired-end run |
+| `` | Y or N | Y if the read is filtered (did not pass), N otherwise |
+| `` | Numerical | 0 when none of the control bits are on, otherwise it is an even number. On HiSeq X systems, control specification is not performed and this number is always 0 |
+| `` | Numerical | Sample number from sample sheet |
+
+#### Example
+``` {code} text
+:filename: example.fastq
+@071112_SLXA-EAS1_s_7:5:1:817:345
+GGGTGATGGCCGCTGCCGATGGCGTC
+AAATCCCACC
++
+IIIIIIIIIIIIIIIIIIIIIIIIII
+IIII9IG9IC
+@071112_SLXA-EAS1_s_7:5:1:801:338
+GTTCAGGGATACGACGTTTGTATTTTAAGAATCTGA
++
+IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII6IBI
+```
+
 ### SAM/BAM/CRAM (.sam, .bam, .cram)
-- Official spec: https://samtools.github.io/hts-specs/
-- SAM: https://samtools.github.io/hts-specs/SAMv1.pdf
-- CRAM: https://samtools.github.io/hts-specs/CRAMv3.pdf
-- BAM: https://samtools.github.io/hts-specs/SAMv1.pdf
+- SAM (Sequence Alignment/Map): Text format for storing sequence alignments against a reference genome
+- BAM: Binary version of SAM, compressed and indexed for faster processing
+- CRAM: Highly compressed reference-based alternative to BAM, designed for long-term storage
+
 ### FASTA (.fasta, .fa)
-- NCBI spec: https://www.ncbi.nlm.nih.gov/genbank/fastaformat/
-- UniProt guide: https://www.uniprot.org/help/fasta-headers
+Simple text-based format for representing nucleotide or peptide sequences. Typically, FASTA is used to store reference data (such as those from curated databases).
+
+*Each entry consists of*:
+- A description line (starts with '>')
+- The sequence data on subsequent lines
+
+#### Example
+``` {code} text
+:filename: example.fasta
+>Mus_musculus_tRNA-Ala-AGC-1-1 (chr13.trna34-AlaAGC)
+GGGGGTGTAGCTCAGTGGTAGAGCGCGTGCTTAGCATGCACGAGGcCCTGGGTTCGATCC
+CCAGCACCTCCA
+>Mus_musculus_tRNA-Ala-AGC-10-1 (chr13.trna457-AlaAGC)
+GGGGGATTAGCTCAAATGGTAGAGCGCTCGCTTAGCATGCAAGAGGtAGTGGGATCGATG
+CCCACATCCTCCA
+```
+
 ## Variant Formats
+
 ### VCF/BCF (.vcf, .bcf)
-- Official spec: https://samtools.github.io/hts-specs/VCFv4.3.pdf
-- Format guide: https://www.internationalgenome.org/wiki/Analysis/vcf4.0/
+- VCF (Variant Call Format): Text file format for storing gene sequence variations
+- BCF: Binary version of VCF
+
+*Contains information about*:
+- Genomic position of variants
+- Reference and alternative alleles
+- Quality scores
+- Filter statuses
+- Additional annotations
+
 ## Genome Annotation Formats
+
 ### BED (.bed)
-- UCSC spec: https://genome.ucsc.edu/FAQ/FAQformat.html#format1
-- Extended BED: https://genome.ucsc.edu/FAQ/FAQformat.html#format1.7
+Browser Extensible Data format for defining genomic features.
+
+*Contains*:
+- Chromosome name
+- Start position
+- End position
+- Optional fields (name, score, strand, etc.)
+Commonly used for displaying data tracks in genome browsers.
+
+| Column number | Title | Definition | Required |
+| --- | --- | --- | --- |
+| **1** | **chrom** | Chromosome (e.g. chr3, chrY, chr2_random) or scaffold (e.g. scaffold10671) name | Yes |
+| **2** | **chromStart** | Start coordinate on the chromosome or scaffold for the sequence considered (the first base on the chromosome is numbered 0 i.e. the number is zero-based) | Yes |
+| **3** | **chromEnd** | End coordinate on the chromosome or scaffold for the sequence considered. Unlike chromStart, this position is non-inclusive (the interval is half-open), so its value equals the one-based position of the last base of the feature | Yes |
+| **4** | **name** | Name of the line in the BED file | No |
+| **5** | **score** | Score between 0 and 1000 | No |
+| **6** | **strand** | DNA strand orientation (positive ["+"] or negative ["-"] or "." if no strand) | No |
+| **7** | **thickStart** | Starting coordinate from which the annotation is displayed in a thicker way on a graphical representation (e.g.: the start codon of a gene) | No |
+| **8** | **thickEnd** | End coordinate from which the annotation is no longer displayed in a thicker way on a graphical representation (e.g.: the stop codon of a gene) | No |
+| **9** | **itemRgb** | RGB value in the form R,G,B (e.g. 255,0,0) determining the display color of the annotation contained in the BED file | No |
+| **10** | **blockCount** | Number of blocks (e.g. exons) on the line of the BED file | No |
+| **11** | **blockSizes** | List of values separated by commas corresponding to the size of the blocks (the number of values must correspond to that of the "blockCount") | No |
+| **12** | **blockStarts** | List of values separated by commas corresponding to the starting coordinates of the blocks, coordinates calculated relative to those present in the chromStart column (the number of values must correspond to that of the "blockCount") | No |
+
 ### GFF/GTF (.gff, .gtf)
-- GFF3 spec: https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md
-- GTF spec: http://mblab.wustl.edu/GTF22.html
-- Ensembl GTF: https://www.ensembl.org/info/website/upload/gff.html
-## Expression Data
-### GCT (.gct)
-- Broad Institute spec: https://software.broadinstitute.org/cancer/software/gsea/wiki/index.php/Data_formats
+- GFF (General Feature Format): Describes genes and other features of DNA, RNA, and protein sequences
+- GTF (Gene Transfer Format): More specialized version of GFF
+
+*Contains*:
+- Feature coordinates
+- Feature types
+- Score
+- Strand information
+- Frame
+- Attribute-value pairs
+
+| Column number | Title | Definition | Required |
+| --- | --- | --- | --- |
+| **1** | **seqname** | Name of the chromosome or scaffold; chromosome names can be given with or without the 'chr' prefix. Must be a standard chromosome name or an Ensembl identifier such as a scaffold ID, without additional content like species or assembly | Yes |
+| **2** | **source** | Name of the program that generated this feature, or the data source (database or project name) | Yes |
+| **3** | **feature** | Feature type name, e.g. Gene, Variation, Similarity | Yes |
+| **4** | **start** | Start position of the feature, with sequence numbering starting at 1 | Yes |
+| **5** | **end** | End position of the feature, with sequence numbering starting at 1 | Yes |
+| **6** | **score** | A floating point value | Yes (use '.' if empty) |
+| **7** | **strand** | Defined as + (forward) or - (reverse) | Yes (use '.' if empty) |
+| **8** | **frame** | One of '0', '1' or '2'. '0' indicates that the first base of the feature is the first base of a codon, '1' that the second base is the first base of a codon, and so on | Yes (use '.' if empty) |
+| **9** | **attribute** | A semicolon-separated list of tag-value pairs, providing additional information about each feature | Yes |
+
+### WIG (.wig)
+The WIG (wiggle) format is used for displaying continuous-valued data in track format. It is particularly useful for showing expression data, probability scores, and GC percentage.
+
+WIG files can be formatted in two main ways:
+#### 1. Fixed Step
+```
+fixedStep chrom=chrN start=pos step=stepInterval [span=windowSize]
+dataValue
+dataValue
+dataValue
+```
+##### Required Fields
+| Field | Description |
+| --- | --- |
+| **chrom** | Chromosome name (e.g., chr1) |
+| **start** | Starting position |
+| **step** | Distance between starts of adjacent windows |
+| **dataValue** | Numerical data value for each position |
+
+##### Optional Fields
+| Field | Description |
+| --- | --- |
+| **span** | Size of window (defaults to 1) |
+
+#### 2. Variable Step
+```
+variableStep chrom=chrN [span=windowSize]
+chromStart dataValue
+chromStart dataValue
+chromStart dataValue
+```
+##### Required Fields
+| Field | Description |
+| --- | --- |
+| **chrom** | Chromosome name |
+| **chromStart** | Start position of each window |
+| **dataValue** | Numerical data value for each position |
+
+##### Optional Fields
+| Field | Description |
+| --- | --- |
+| **span** | Size of window (defaults to 1) |
+
+#### Common Use Cases
+1. Gene expression levels
+2. ChIP-seq signal intensity
+3. RNA-seq coverage
+4. Conservation scores
+5. GC content
+6. Probability scores
+7. Transcription factor binding signals
+
+#### Example
+``` {code} text
+:filename: example.wig
+# Fixed-step example showing expression values
+fixedStep chrom=chr3 start=400601 step=100 span=100
+11.0
+22.0
+33.0
+
+# Variable-step example showing conservation scores
+variableStep chrom=chr3 span=150
+500701 5.0
+500801 3.0
+500901 8.0
+```
+
 ## Phylogenetic Formats
+
 ### Newick (.nwk)
-- Format spec: http://evolution.genetics.washington.edu/phylip/newicktree.html
-- Extended spec: https://doi.org/10.1093/bioinformatics/btg190
+Standard format for representing phylogenetic trees using nested parentheses.
+
+*Contains*:
+- Tree topology
+- Branch lengths
+- Node labels
+- Bootstrap values
+
 ### NEXUS (.nex)
-- Original paper: http://dx.doi.org/10.1093/sysbio/46.4.590
-- Format guide: http://wiki.christophchamp.com/index.php?title=NEXUS_file_format
+Rich format for storing multiple types of biological data:
+- Character matrices
+- Trees
+- Distance matrices
+- Analysis assumptions
+Supports multiple data blocks and commands
+
 ## Index Formats
+
 ### BAI/CSI (.bai, .csi)
-- Specs included in SAMtools: https://samtools.github.io/hts-specs/
-- CSI spec: https://samtools.github.io/hts-specs/CSIv1.pdf
+- BAI: Index format for BAM files
+- CSI: Coordinate-sorted index format
+
+These index files are typically required by common [CLI tools]()
+
+*Enables*:
+- Random access to compressed files
+- Quick retrieval of alignments
+- Efficient genome browsing
+
 ## Genome Browser Formats
+
 ### bigWig/bigBed (.bw, .bb)
-- UCSC spec: https://genome.ucsc.edu/goldenPath/help/bigWig.html
-- Format guide: https://genome.ucsc.edu/FAQ/FAQformat.html
+- bigWig: Binary format of `.wig` files
+- bigBed: Binary format of `.bed` files
+
+*Advantages*:
+- Efficient random access
+- Reduced memory usage
+- Fast display in genome browsers
+
 # Pipeline Development
 ## Snakemake
 Snakemake is a workflow management system that helps automate data analysis pipelines. The functionality of snakemake cannot be covered here in meaningful detail. Please read through the [snakemake documentation](https://snakemake.readthedocs.io/en/stable/) and play around with example pipelines to gain familiarity with this powerful tool.
@@ -104,7 +303,7 @@ Snakemake is a workflow management system that helps automate data analysis pipe
 ```
 project_name/
 ├── README.md # Project overview, setup instructions, and usage
-├── LICENSE # Project license (e.g., MIT, GPL)
+├── LICENSE.md # Project license (e.g., MIT, GPL)
 ├── .gitignore # Git ignore rules
 ├── environment.yml # Conda environment specification
 ├── requirements.txt # Python package dependencies
diff --git a/pyproject.toml b/pyproject.toml
index bb01a82..9b4dcd1 100644
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -4,6 +4,7 @@ version = "0.1.1"
 description = ""
 authors = ["Christian S. Ramirez"]
 readme = "README.md"
+package-mode = false
 [tool.poetry.dependencies]
 python = "^3.13"