Releases · CGATOxford/UMI-tools

Adds options to remove a delimiter from a cell barcode or UMI so that its parts are concatenated, plus options to split the cell barcode or UMI on a string and keep only the first part. Note that these options only apply when the barcodes are stored in a BAM tag; if they were appended to the read name using umi_tools extract, these options are not needed. See #217 for motivation:
- --umi-tag-delimiter=[STRING] = remove the delimiter STRING from the UMI. Defaults to None
- --umi-tag-split=[STRING] = split UMI by STRING and take only the first portion. Defaults to None
- --cell-tag-delimiter=[STRING] = remove the delimiter STRING from the cell barcode. Defaults to None
- --cell-tag-split=[STRING] = split cell barcode by STRING and take only the first portion. Defaults to - to deal with 10X GEMs
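The effect of these options on a tag value can be sketched in Python as follows. This is a minimal illustration, not the UMI-tools implementation: the helper name `clean_tag` and its parameters are hypothetical, and applying the split before the delimiter removal is an assumption made for the example.

```python
from typing import Optional


def clean_tag(value: str,
              delimiter: Optional[str] = None,
              split: Optional[str] = None) -> str:
    """Hypothetical helper mimicking the described tag options."""
    if split is not None:
        # Keep only the portion before the first occurrence of `split`.
        value = value.split(split)[0]
    if delimiter is not None:
        # Remove every occurrence of `delimiter`, concatenating the parts.
        value = "".join(value.split(delimiter))
    return value


# A UMI stored in two parts joined by "_" is concatenated.
print(clean_tag("AACC_GGTT", delimiter="_"))
# A 10X-style cell barcode with a "-1" suffix keeps only the barcode.
print(clean_tag("ACGTACGT-1", split="-"))
```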
Reduced memory requirements for count --wide-format-cell-counts: #222
Fixes issues with --bc-pattern2: #201, #221
Updates documentation: #204, #210, #211 - Thanks @kohlkopf, @hy09 & @cbrueffer

16 Oct 12:45

TomSmithCGAT

0.5.1

616b9d0

0.5.1

Minor update. Improves detection of duplicate reads with paired end reads, reduces run time with dedup --output-stats and includes a few simple bug fixes.

Improved identification of duplicate reads from paired end reads - will now use the position of the FIRST splice junction in the read (in reference coords) (#187)
Speeds up dedup when running with --output-stats - (#184)
Fixes bugs:
- whitelist raised an unwanted error when --set-cell-number was combined with --plot-prefix
- dedup gave non-informative error when input contains zero valid reads/read pairs. Now raises a warning but exits with status 0 (#190, #195)
- count errored if gene identifier contained a ":" (#198)
Renames the --whole-contig option to --buffer-whole-contig to avoid confusion with the --per-contig option. The --whole-contig option will still work but will not be visible in the documentation (#196)

18 Aug 15:38

TomSmithCGAT

0.5.0

e15fe8f

0.5.0

Version 0.5.0 introduces new commands to support single-cell RNA-Seq and reduces run-time. The underlying methods have not changed, hence the minor release number uptick.

UMI-tools goes single cell

New commands for single cell RNA-Seq (scRNA-Seq):

whitelist - Extract cell barcodes (CB) from droplet-based scRNA-Seq fastqs and estimate the number of "true" CBs. Outputs a flatfile listing the true cell barcodes and 'error' barcodes within a set distance. See #97 for a motivating example. Thanks to @Hoohm for input and patience in testing. Thanks to @k3yavi for input in discussions about implementing a 'knee' method.
count - Count the number of reads per cell per gene after de-duplication. This tool uses the same underlying methods as group and dedup and acts to simplify scRNA-Seq read-counting with umi_tools. See #114, #131
count_tab - As per count but works from a flatfile input from e.g. featureCounts. See #44, #121, #125

In the process of creating these commands, the options for dealing with UMIs on a "per-gene" basis have been re-jigged to make their purpose clearer. See e.g. #127 for a motivating example.

To perform group, dedup or count on a per-gene basis, the --per-gene option should be provided. This must be combined with either --gene-tag if the BAM contains gene assignments in a tag, or --per-contig if the reads have been aligned to a transcriptome. In the latter case, if each contig is a transcript, the option --gene-transcript-map can be used to operate at the gene level. These options are standardised across all tools such that one can easily change e.g. a count command into a dedup command.
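The three ways of assigning a read to a gene described above can be sketched as a small Python function. This only illustrates the decision logic under stated assumptions; the function name `gene_for_read` and its arguments are hypothetical and do not correspond to UMI-tools internals.

```python
from typing import Dict, Optional


def gene_for_read(contig: str,
                  gene_tag_value: Optional[str] = None,
                  transcript_to_gene: Optional[Dict[str, str]] = None) -> str:
    """Hypothetical sketch of how a gene identifier could be chosen."""
    if gene_tag_value is not None:
        # Analogous to --gene-tag: the BAM tag already names the gene.
        return gene_tag_value
    if transcript_to_gene is not None:
        # Analogous to --per-contig with --gene-transcript-map:
        # the contig is a transcript, mapped up to its gene.
        return transcript_to_gene[contig]
    # Analogous to --per-contig alone: the contig itself is the unit.
    return contig


mapping = {"tx1": "geneA", "tx2": "geneA"}
print(gene_for_read("chr1", gene_tag_value="geneB"))
print(gene_for_read("tx2", transcript_to_gene=mapping))
print(gene_for_read("tx1"))
```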

Updated options:

extract - Can now accept regex patterns to describe UMI +/- CB encoding in read(s). See --extract-method=regex option.
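As a rough illustration of describing barcode layout with a regex, the sketch below uses Python's standard `re` module with named groups to pull a cell barcode and a UMI from the start of a read. The pattern, the barcode lengths and the helper name `extract_barcodes` are assumptions chosen for the example; consult the --extract-method=regex documentation for the group names and regex library the tool actually expects.

```python
import re

# Assumed layout for this example: 6 bases of cell barcode, then 4 bases of UMI.
PATTERN = re.compile(r"(?P<cell_1>.{6})(?P<umi_1>.{4})")


def extract_barcodes(sequence):
    """Return (cell barcode, UMI, remaining read), or None if no match."""
    match = PATTERN.match(sequence)
    if match is None:
        # Read is too short to contain both barcodes.
        return None
    remainder = sequence[match.end():]
    return match.group("cell_1"), match.group("umi_1"), remainder


print(extract_barcodes("AAAAAACCCCGGGTTT"))
```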

We have written a guide for how to use UMI-tools for scRNA-Seq analysis including estimation of the number of true CBs, flexible extraction of cell barcodes and UMIs and per-cell read-counting as well as common workflow variations.

Reduced run-time (#156)

Introduced a hashing step to limit the scope of the edit-distance comparisons required to build the networks. Big thanks to @mparker2 for this!
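The release note does not spell out the hashing scheme, so the following is only a sketch of one common way to limit edit-distance comparisons, not necessarily the approach taken in #156. It relies on the pigeonhole principle: two equal-length UMIs within `threshold` mismatches must agree exactly on at least one of `threshold + 1` contiguous slices, so only UMIs sharing a slice need to be compared. All function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def slice_bounds(umi_length, n_slices):
    """Split range(umi_length) into n_slices contiguous (start, end) chunks."""
    base, extra = divmod(umi_length, n_slices)
    bounds, start = [], 0
    for i in range(n_slices):
        end = start + base + (1 if i < extra else 0)
        bounds.append((start, end))
        start = end
    return bounds


def candidate_pairs(umis, threshold=1):
    """Pairs of UMIs that share at least one identical slice."""
    umis = sorted(set(umis))
    bounds = slice_bounds(len(umis[0]), threshold + 1)
    index = defaultdict(set)
    for umi in umis:
        for i, (start, end) in enumerate(bounds):
            # Hash each UMI under every (slice number, slice sequence) key.
            index[(i, umi[start:end])].add(umi)
    pairs = set()
    for bucket in index.values():
        pairs.update(combinations(sorted(bucket), 2))
    return pairs


def neighbours(umis, threshold=1):
    """Pairs within `threshold` mismatches, checking only the candidates."""
    return {(a, b) for a, b in candidate_pairs(umis, threshold)
            if hamming(a, b) <= threshold}


print(sorted(neighbours(["AAAA", "AAAT", "TTTT", "TTTA", "CCGG"])))
```

In this toy input only two candidate pairs are compared instead of all ten possible pairs, which is the kind of saving that grows with the number of distinct UMIs.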

Simplified installation ( #145 )

Previously, extensions were cythonized and compiled on the fly using pyximport, requiring users to have access to the install directory the first time the extension was required. Now the cythonized extension is provided, and is compiled at install-time.

08 May 09:10

TomSmithCGAT

0.4.4

9de0290

0.4.4

Tweaks the way group handles paired end BAMs. To simplify the process and ensure all reads are written out, the paired end read (read 2) is now output without a group or UMI tag (#115).
Introduces the --skip-tags-regex option to enable users to skip descriptive gene tags, such as "Unassigned", when using the --gene-tag option. See #108.
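The idea of skipping reads by gene tag can be sketched in Python as below. The regex, the example tag values and the helper name `keep_read` are illustrative assumptions, as is the use of `re.search`; the tool's own default pattern and matching behaviour are described in its documentation.

```python
import re

# Example pattern: skip tags that start with "__" or "Unassigned".
SKIP_REGEX = r"^(__|Unassigned)"


def keep_read(gene_tag_value, skip_regex=SKIP_REGEX):
    """Return True if the gene tag does not match the skip pattern."""
    return re.search(skip_regex, gene_tag_value) is None


for tag in ["ENSG00000139618", "Unassigned_NoFeatures", "__no_feature"]:
    print(tag, keep_read(tag))
```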
Bugfixes:
- If the --gene-transcript-map included transcripts not observed in the BAM, this caused an error when trying to retrieve reads aligned to the transcript. This has been resolved. See #109
- Allow output to zipped file with extract using Python 3 (#104)
Improved test coverage (--chrom and --gene-tag options). Thanks @MarinusVL for kindly sharing a BAM with gene tags.

28 Mar 09:40

TomSmithCGAT

0.4.3

8997cb2

0.4.3

Improves run time for large networks (see #94, #31).

Thanks to @gpratt for identifying the issue and implementing the solution

22 Mar 13:35

TomSmithCGAT

0.4.2

949b7c1

0.4.2

When using the directional method with the group command, the 'top' UMI within each group was not always the most abundant (see comments in #96). This has now been resolved.

16 Mar 08:25

TomSmithCGAT

0.4.1

5772885

0.4.1

Due to a bug in pysam.fetch(), paired end files with a large number of contigs could take a long time to process (see #93). This has now been resolved.

Thanks to @gpratt for spotting and resolving this.

09 Mar 16:34

TomSmithCGAT

0.4.0

7805eaf

0.4.0

Added functionality:

Deduplicating on gene ids (see #44 for motivation):
The user can now group/dedup according to the gene to which the read aligns. This is useful for single cell RNA-Seq methods such as CEL-Seq, where the position of the read on a transcript may differ between reads generated from the same initial molecule. The following options may be used to define the gene_id for each read:
--per-gene
--gene-transcript-map
--gene-tag
Working with BAM tags (#73, #76, #89):
UMIs can now be extracted from BAM tags, and group will add a tag to each read describing the read group and UMI. See the following options for controlling this behaviour:
--extract-umi-method
--umi-tag
--umi-group-tag
Output unmapped reads (#78)
The group command will now output unmapped reads if the --output-unmapped option is supplied. These reads will not be assigned to any group.

+ bug fixes for group command (#67, #81) and updated documentation (#77, #79 )
