Skip to content

Commit

Permalink
Merge pull request #97 from BUStools/devel
Browse files Browse the repository at this point in the history
Devel
  • Loading branch information
Yenaled authored Nov 1, 2023
2 parents 04570a3 + c5e5254 commit a7b3e81
Show file tree
Hide file tree
Showing 12 changed files with 516 additions and 309 deletions.
140 changes: 86 additions & 54 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -5,11 +5,11 @@ RNA-Seq datasets. It can be used to error correct barcodes, collapse UMIs, produ

If you use __bustools__ please cite

Melsted, Páll, Booeshaghi, A. Sina et al. [Modular and efficient pre-processing of single-cell RNA-seq.](https://www.biorxiv.org/content/10.1101/673285v2) BioRxiv (2019): 673285, doi.org/10.1101/673285.
Melsted, Páll, Booeshaghi, A. Sina et al. [Modular, efficient and constant-memory single-cell RNA-seq preprocessing.](https://doi.org/10.1038/s41587-021-00870-2) Nature Biotechnology, 2021.

For some background on the design and motivation for the __BUS__ format and __bustools__ see

Melsted, Páll, Ntranos, Vasilis and Pachter, Lior [The Barcode, UMI, Set format and BUStools](https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btz279/5487510), Bioinformatics, btz279, 2019.
Melsted, Páll, Ntranos, Vasilis and Pachter, Lior [The Barcode, UMI, Set format and BUStools](https://doi.org/10.1093/bioinformatics/btz279), Bioinformatics, 2019.


## BUS format
Expand Down Expand Up @@ -56,46 +56,57 @@ Usage: bustools <CMD> [arguments] ..
Where <CMD> can be one of:
capture Capture records from a BUS file
correct Error correct a BUS file
count Generate count matrices from a BUS file
inspect Produce a report summarizing a BUS file
linker Remove section of barcodes in BUS files
project Project a BUS file to gene sets
sort Sort a BUS file by barcodes and UMIs
correct Error correct a BUS file
count Generate count matrices from a sorted BUS file
inspect Produce a report summarizing a sorted BUS file
allowlist Generate an on-list from a sorted BUS file
capture Capture records from a BUS file
text Convert a binary BUS file to a tab-delimited text file
whitelist Generate a whitelist from a BUS file
fromtext Convert tab-delimited text file to a binary BUS file
extract Extracts reads from input FASTQ files based on BUS file
umicorrect Use a UMI correction algorithm
compress Compress a sorted BUS file
decompress Decompress a compressed BUS file
Running bustools <CMD> without arguments prints usage information for <CMD>
~~~

### capture
`bustools capture` can separate BUS files into multiple files according to the capture criteria.

### sort

Raw BUS output from pseudoalignment programs may be unsorted. To simply and accelerate downstream processing BUS files can be sorted using `bustools sort`

~~~
Usage: bustools capture [options] bus-files
> bustools sort
Usage: bustools sort [options] bus-files
Options:
-o, --output Directory for output
-c, --capture List of transcripts to capture
-e, --ecmap File for mapping equivalence classes to transcripts
-t, --txnames File with names of transcripts
-t, --threads Number of threads to use
-m, --memory Maximum memory used
-T, --temp Location and prefix for temporary files
required if using -p, otherwise defaults to output
-o, --output File for sorted output
-p, --pipe Write to standard output
~~~

This will create a new BUS file where the BUS records are sorted by barcode first, UMI second, and equivalence class third.


### correct
BUS files can be barcode error corrected with respect to a technology-specific whitelist of barcodes using `bustools correct`.
BUS files can be barcode error corrected with respect to a technology-specific "on list" of barcodes using `bustools correct`.

~~~
> bustools correct
Usage: bustools correct [options] bus-files
Options:
-o, --output File for corrected bus output
-w, --whitelist File of whitelisted barcodes to correct to
-w, --onlist File of on-listed barcodes to correct to
-p, --pipe Write to standard output
~~~


### count
BUS files can be converted into a barcode-feature matrix, where the feature can be TCCs (Transcript Compatibility Counts) or genes using `bustools count`.

Expand All @@ -109,8 +120,10 @@ Options:
-e, --ecmap File for mapping equivalence classes to transcripts
-t, --txnames File with names of transcripts
--genecounts Aggregate counts to genes only
--cm Count multiplicites instead of UMIs
~~~


### inspect
A report summarizing the contents of a sorted BUS file can be output either to standard out or to a JSON file for further analysis using `bustools inspect`.

Expand All @@ -121,11 +134,11 @@ Usage: bustools inspect [options] sorted-bus-file
Options:
-o, --output File for JSON output (optional)
-e, --ecmap File for mapping equivalence classes to transcripts
-w, --whitelist File of whitelisted barcodes to correct to
-w, --onlist File of on-listed barcodes to correct to
-p, --pipe Write to standard output
~~~

`--ecmap` and `--whitelist` are optional parameters; `bustools inspect` is much faster without them, especially without the former.
`--ecmap` and `--onlist` are optional parameters; `bustools inspect` is much faster without them, especially without the former.

Sample output (to stdout):
~~~
Expand All @@ -151,60 +164,55 @@ Number of reads with singleton target: 1233940
Estimated number of new targets at 2x seuqencing depth: 6168
Number of barcodes in agreement with whitelist: 92889 (57.211752%)
Number of reads with barcode in agreement with whitelist: 3281671 (95.623992%)
Number of barcodes in agreement with on-list: 92889 (57.211752%)
Number of reads with barcode in agreement with on-list: 3281671 (95.623992%)
~~~

### linker
`bustools linker` removes specified section of barcode in BUS files.
### allowlist
`bustools allowlist` generates an on-list based on the barcodes in a sorted BUS file.

~~~
Usage: bustools linker [options] bus-files
Usage: bustools allowlist [options] sorted-bus-file
Options:
-s, --start Start coordinate for section of barcode to remove (0-indexed, inclusive)
-e, --end End coordinate for section of barcode to remove (0-indexed, exclusive)
-p, --pipe Write to standard output
-o, --output File for the on-list
-f, --threshold Minimum number of times a barcode must appear to be included in on-list
~~~

If `--start` is -1, the removed section begins at beginning of barcode. Likewise, if `--end` is -1, the removed section ends at the end of the barcode. BUS files should contain barcodes of the same length.
`--threshold` is a (highly) optional parameter. If not provided, `bustools allowlist` will determine a threshold based on the first 200 to 100,200 records.


### project
The `kallisto bus` command maps reads to a set of transcripts. `bustools project` takes as input kallisto's (sorted) output and a transcript to gene map (tr2g file), and outputs a BUS file, a matrix.ec file, and a list of genes, which collectively map each read to a set of genes.
### capture
`bustools capture` can separate BUS files into multiple files according to the capture criteria.

~~~
Usage: bustools project [options] sorted-bus-file
Usage: bustools capture [options] bus-files
Options:
-o, --output File for project bug output and list of genes (no extension)
-g, --genemap File for mapping transcripts to genes
-o, --output Directory for output
-c, --capture List of transcripts to capture
-e, --ecmap File for mapping equivalence classes to transcripts
-t, --txnames File with names of transcripts
-p, --pipe Write to standard output
~~~

### sort

Raw BUS output from pseudoalignment programs may be unsorted. To simply and accelerate downstream processing BUS files can be sorted using `bustools sort`

### text

BUS files can be converted to a tab-separated format for easy inspection and processing using shell scripts or high level languages with `bustools text`.

~~~
> bustools sort
Usage: bustools sort [options] bus-files
> bustools text
Usage: bustools text [options] bus-files
Options:
-t, --threads Number of threads to use
-m, --memory Maximum memory used
-T, --temp Location and prefix for temporary files
required if using -p, otherwise defaults to output
-o, --output File for sorted output
-p, --pipe Write to standard output
-o, --output File for text output
~~~

This will create a new BUS file where the BUS records are sorted by barcode first, UMI second, and equivalence class third.

### text
### fromtext

BUS files can be converted to a tab-separated format for easy inspection and processing using shell scripts or high level languages with `bustools text`.
Converts a plaintext tab-separated representation of a BUS file to a binary BUS file.

~~~
> bustools text
Expand All @@ -214,15 +222,39 @@ Options:
-o, --output File for text output
~~~

### whitelist
`bustools whitelist` generates a whitelist based on the barcodes in a sorted BUS file.

### compress

Compresses a sorted BUS file.

~~~
Usage: bustools whitelist [options] sorted-bus-file
> bustools compress
Usage: bustools compress [options] sorted-bus-file
Note: BUS file should be sorted by barcode-umi-ec
Options:
-o, --output File for the whitelist
-f, --threshold Minimum number of times a barcode must appear to be included in whitelist
-N, --chunk-size Number of rows to compress as a single block
-o, --output File to write compressed output
-p, --pipe Write to standard output.
~~~

`--threshold` is a (highly) optional parameter. If not provided, `bustools whitelist` will determine a threshold based on the first 200 to 100,200 records.
Reference for bustools compression:

Einarsson, P and Melsted, Páll [BUSZ: compressed BUS files](https://doi.org/10.1093/bioinformatics/btad295), Bioinformatics, 2023.


### decompress

Decompresses (inflates) a compressed BUS file.

~~~
> bustools compress
Usage: bustools compress [options] sorted-bus-file
Note: BUS file should be sorted by barcode-umi-ec
Options:
-o, --output File to write decompressed output
-p, --pipe Write to standard output.
~~~


12 changes: 11 additions & 1 deletion src/BUSData.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -5,6 +5,7 @@
#include <unordered_map>
#include <sstream>
#include <iostream>
#include <limits>

uint64_t stringToBinary(const std::string &s, uint32_t &flag) {
return stringToBinary(s.c_str(), s.size(), flag);
Expand Down Expand Up @@ -312,7 +313,16 @@ bool parseBcUmiCaptureList(const std::string &filename, std::unordered_set<uint6
std::string inp;
uint32_t flag; // Unused
while (getline(inf, inp)) {
captures.insert(stringToBinary(inp, flag));
uint64_t binary_val;
if (inp.back() == '*' && inp.length() == 17) { // 16-bp barcode that ends in *
inp.pop_back(); // Remove * from end of string
binary_val = stringToBinary(inp, flag);
binary_val |= (static_cast<uint64_t>(1) << 63); // Set MSB to 1 if * found at end of string (to label the barcode as "prefix")
captures.insert(std::numeric_limits<uint64_t>::max()); // Tells us to look for prefix barcodes
} else {
binary_val = stringToBinary(inp, flag);
}
captures.insert(binary_val);
}

return true;
Expand Down
4 changes: 3 additions & 1 deletion src/Common.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -12,7 +12,7 @@
#include "roaring.hh"
#include "hash.hpp"

#define BUSTOOLS_VERSION "0.43.0"
#define BUSTOOLS_VERSION "0.43.1"

#define u_map_ std::unordered_map
enum CAPTURE_TYPE : char
Expand Down Expand Up @@ -80,12 +80,14 @@ struct Bustools_opt
bool count_gen_hist = false;
double count_downsampling_factor = 1.0;
bool count_raw_counts = false;
bool sort_noflag = false;

/* correct */
std::string dump;
bool dump_bool = false;
bool split_correct = false;
bool barcode_replacement = false;
bool parse_error = false;

/* predict */
std::string predict_input; //specified the same way as the output for count - count and histogram filenames will be created from this
Expand Down
18 changes: 16 additions & 2 deletions src/bustools_capture.cpp
Original file line number Diff line number Diff line change
@@ -1,6 +1,7 @@
#include <iostream>
#include <fstream>
#include <algorithm>
#include <limits>

#include "Common.hpp"
#include "BUSData.h"
Expand All @@ -13,6 +14,7 @@ void bustools_capture(Bustools_opt &opt) {
std::unordered_set<uint64_t> captures;
std::vector<std::vector<int32_t>> ecmap;
u_map_<std::vector<int32_t>, int32_t, SortedVectorHasher> ecmapinv;
bool capture_prefixes = false;

if (opt.type == CAPTURE_TX) {
// parse ecmap and capture list
Expand All @@ -35,6 +37,9 @@ void bustools_capture(Bustools_opt &opt) {
std::cerr << "done" << std::endl;
} else if (opt.type == CAPTURE_UMI || opt.type == CAPTURE_BC) {
parseBcUmiCaptureList(opt.capture, captures);
if (opt.type == CAPTURE_BC && captures.count(std::numeric_limits<uint64_t>::max()) > 0) {
capture_prefixes = true;
}
} else if (opt.type == CAPTURE_F) {
parseFlagsCaptureList(opt.capture, captures);
} else { // Should never happen
Expand All @@ -61,7 +66,6 @@ void bustools_capture(Bustools_opt &opt) {

size_t nr = 0, nw = 0;
size_t N = 100000;
uint32_t bclen = 0;
BUSData* p = new BUSData[N];
BUSData bd;

Expand All @@ -77,6 +81,8 @@ void bustools_capture(Bustools_opt &opt) {
}
std::istream in(inbuf);
parseHeader(in, h);
if (h.bclen == 32) capture_prefixes = false;
uint64_t len_mask = ((1ULL << (2*h.bclen)) - 1);

if (!outheader_written) {
writeHeader(o, h);
Expand Down Expand Up @@ -107,7 +113,15 @@ void bustools_capture(Bustools_opt &opt) {
}

} else if (opt.type == CAPTURE_BC) {
capt = captures.count(bd.barcode) > 0;
if (capture_prefixes) {
uint64_t bitmask = (1ULL << (2*(32-h.bclen))) - 1;
uint64_t potential_prefix_barcode = (bd.barcode >> (2*h.bclen)) & bitmask;
// Now we have the sequence preceding the actual cell barcode, let's find it in the captures set
capt = captures.count((static_cast<uint64_t>(1) << 63) | potential_prefix_barcode) > 0; // If prefix exists in the "capture prefix" set
if (!capt) capt = captures.count(bd.barcode & len_mask) > 0; // If not, then check the barcode as-is
} else {
capt = captures.count(bd.barcode & len_mask) > 0;
}
} else if (opt.type == CAPTURE_UMI) {
capt = captures.count(bd.UMI) > 0;
} else if (opt.type == CAPTURE_F) {
Expand Down
8 changes: 4 additions & 4 deletions src/bustools_correct.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -149,7 +149,7 @@ void bustools_split_correct(Bustools_opt &opt)
}
wf.close();

std::cerr << "Found " << wbc_12.size() << "," << wbc_34.size() << " barcodes in the half whitelists" << std::endl;
std::cerr << "Found " << wbc_12.size() << "," << wbc_34.size() << " barcodes in the half on-lists" << std::endl;

len_1 = len_12 / 2; // = 3, 3, 4
len_2 = len_12 - len_1; // =4, 4, 4
Expand Down Expand Up @@ -247,8 +247,8 @@ void bustools_split_correct(Bustools_opt &opt)

if (bclen != len_12 + len_34)
{
std::cerr << "Error: barcode length and whitelist length differ, barcodes = " << bclen << ", whitelist = " << len_12 + len_34 << std::endl
<< " check that your whitelist matches the technology used" << std::endl;
std::cerr << "Error: barcode length and on-list length differ, barcodes = " << bclen << ", on-list = " << len_12 + len_34 << std::endl
<< " check that your on-list matches the technology used" << std::endl;

exit(1);
}
Expand Down Expand Up @@ -399,7 +399,7 @@ void bustools_split_correct(Bustools_opt &opt)
}

std::cerr << "Processed " << nr << " BUS records" << std::endl
<< "In whitelist = " << stat_white << std::endl
<< "In on-list = " << stat_white << std::endl
<< "Corrected 1 = " << stat_corr << std::endl
<< "Corrected 2 = " << stat_corr_2 << std::endl
<< "Uncorrected = " << stat_uncorr << std::endl;
Expand Down
Loading

0 comments on commit a7b3e81

Please sign in to comment.