All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
- Ill-advised parsing of the RG ID field has been extended to additionally allow for protocol_run_id style (uuid) Run IDs, as well as standalone acquisition_id (sha1) Run IDs.
- Parsing of an RG ID field containing a modified base model now returns only the core basecaller model.
- Workaround samtools "bug" where RG ID suffix is not fixed width.
- Segfault on SAM-style tags without values in the FASTQ header.
- Bug causing segfault on unlikely RG SAM tags in FASTQ header comments.
- SAM parsing of FASTQ header not enabled if only one of the RG or RD tags is present and at the beginning of the header comment.
- 'run_id' instead of 'basecaller' as column name in bamstats basecaller summary output header line.
- 'run_id' instead of 'basecaller' as column name in basecaller summary output header line.
- `(null)` in FASTQ header comments when run with `-H` on files that had `basecall_model_version_id=...` as only header comment.
- Basecaller summary information similar to runid summary.
- RNA poly-A tail length histogram output.
- Random output for runid when not found in header.
- `--runids` option to `bamstats` for enumerating detected run identifiers.
- `--reads_per_file` option can split inputs into batched files when demultiplexing. Users should use Unix `split` with piped output.
- `--runids` option to output a file enumerating detected run identifiers.
- Per-file read statistics now relate to filtered reads only.
- Link `fastcat` against zlib-ng for an even faster cat.
- `fastcat` reverts to using a space separator (introduced in v0.16.0) between the Read ID and comment when outputting FASTQ comments that are not SAM tags.
- Modification of BAM record with strtok when inferring Run ID from RG aux tag causing missing NM tag
- Additional spurious "contains non-integer 'NM' tag type" errors by checking EINVAL only when NM appears to be zero, and clearing errno first
- Spurious "contains non-integer 'NM' tag type" errors by checking EINVAL only when NM appears to be zero
- `bamstats` now saves histograms for unmapped reads when `--unmapped` is provided.
- Incorrect sanity check of NM.
- Prevent reads with implausible NM tag leading to illegal memory access in add_qual_count
- Extended FASTQ SAM tag parsing to comment lines that include the RD tag (as well as RG).
- Support for reading SAM tags from FASTQ headers.
- `fastcat` will output a tab between the Read ID and the SAM tags rather than a space to match samtools convention.
- `bamstats` uses `bam_get_tag_caseinsensitive` wrapper to get SAM tags with case insensitivity.
- `fastcat` and `bamstats` will infer a Run ID from the `RG` tag if `RD` is not available.
- Bumped version of htslib used to 1.19.
- Incorrectly capitalised ONT SAM tags are now output in lowercase by fastcat: `ch`, `rn`, `st`.
- Duplicated recipe name in Makefile.
- Section explaining `bamstats` output columns to README.
- Decimal precision of histogram outputs.
- Calculation of read length and quality histograms to `fastcat` and `bamstats`.
- Calculation of alignment accuracy and alignment read coverage to `bamstats`.
- Missing compilation of conda aarch64 package
- `bamstats --duplex` option to count the number of duplex reads and duplex-forming reads.
- Bug writing long reads to demultiplexed gzipped outputs.
- Bug writing `UINTMAX_MAX` for `min_length` and `nan` for `mean_quality` of a file in fastcat per-file stats if there were no reads in that file.
- Column with start time from MinKNOW header to `bamstats` output.
- `bamstats` now prints `mean_quality`, `iden`, and `acc` values with 2 decimal places instead of 3 (the reason being that `fastcat` already uses 2 decimal places for `mean_quality` and more precision is unnecessary).
- Column with run ID from MinKNOW header to `fastcat` per-read stats and `bamstats` output.
- Reverted the change of the default value of the `start_time` field to an empty string (it had been set to `"2000-01-01T00:00:00Z"` in v0.11.1).
- Bug in `fastcat` per-read summary stats.
- Bamstats can now be run without a BAM index.
- `fastcat -H` now wraps all known header fields into SAM tags regardless of whether the header was "valid" (i.e. all expected fields were present) or not.
- Linux and macOS ARM conda packages.
- bamindex program missing from conda package.
- Create bamindex program to index unaligned BAMs for horizontal-parallel processing.
- Ensure reheadered fastq is indeed formatted as valid SAM tag(s).
- Option to bamstats to add 'sample_name' column equivalent to fastcat.
- Option to report unmapped alignments in per read and summary files.
- Min read length in per-file statistics.
- `mean_quality` column to bamstats output, equivalent to that from fastcat.
- Optional per-reference summary file for bamstats similar to samtools flagstats.
- Behaviour of `-x/--recurse`. Top-level directory input will always be searched for data. Turning on recursion now exclusively refers to descending into child (and subsequent) directories.
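
The `-x/--recurse` behaviour described above can be illustrated with a minimal Python sketch. This is not the program's own code: the function name `find_inputs` and the suffix list are hypothetical, chosen only to show the rule that the top-level directory is always searched while child directories are visited only when recursion is on.

```python
import os

# Assumed list of recognised input suffixes; illustrative only.
FASTQ_SUFFIXES = (".fastq", ".fq", ".fastq.gz", ".fq.gz")


def find_inputs(path, recurse=False):
    """Yield input files under `path`.

    Files directly inside a directory input are always yielded.
    Child directories are descended into only when `recurse` is True.
    """
    if os.path.isfile(path):
        yield path
        return
    for entry in sorted(os.scandir(path), key=lambda e: e.name):
        if entry.is_file() and entry.name.endswith(FASTQ_SUFFIXES):
            yield entry.path
        elif entry.is_dir() and recurse:
            yield from find_inputs(entry.path, recurse=True)
```

With a directory holding `a.fastq` and `sub/b.fastq`, the default call yields only `a.fastq`; passing `recurse=True` yields both.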
- Updated kseq.h to allow exit on broken fastq/a stream.
- `fastcat` will exit non-zero if an input file (named or recursed) cannot be opened.
- Use of uninitialized memory in thread pool init, leading to memory leak.
- Handle BAM_CEQUAL and BAM_CDIFF that some aligners like to use.
- Doubled tab in output header.
- Build conda package using bioconda's htslib.
- Occasional hanging on exit.
- Missing tab character in output header.
- Pin openssl version in conda build to one that works across Python versions.
- Removed libdeflate from conda build which caused issues with threading.
- Only multithread BAM decompression.
- Multithreading to `bamstats` for improved throughput.
- Improved performance of `bamstats` for many-target bams.
- `bamstats` program for summarising (primary) alignment information.
- Reformatted header tags were space separated, fixed to tab separated.
- Option to reformat fastq headers as SAM-style tags for minimap2 passthrough.
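
As a rough illustration of reformatting `key=value` FASTQ header comments into tab-separated SAM-style tags, here is a minimal Python sketch. The tag names `RD`, `rn`, `ch` and `st` are those mentioned elsewhere in this changelog; the key names, the type codes and the function `header_to_sam_tags` are assumptions made for the example and may not match the program's actual output.

```python
# Assumed mapping from header keys to (SAM tag, type); illustrative only.
TAG_MAP = {
    "runid": ("RD", "Z"),
    "read": ("rn", "i"),
    "ch": ("ch", "i"),
    "start_time": ("st", "Z"),
}


def header_to_sam_tags(header):
    """Rewrite '<read_id> key=value ...' as the read ID followed by
    tab-separated TAG:TYPE:value fields. Unknown keys are dropped."""
    read_id, _, comment = header.partition(" ")
    tags = []
    for field in comment.split():
        key, sep, value = field.partition("=")
        if sep and key in TAG_MAP:
            name, typ = TAG_MAP[key]
            tags.append(f"{name}:{typ}:{value}")
    return "\t".join([read_id] + tags)
```

A header with no comment is returned unchanged, so the sketch is safe to apply to reads that carry no MinKNOW fields.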
- Per-file summary file created with broken header.
- Per-read summary file created incorrectly when `-s` option provided.
- Program hang when directory input given without trailing `/`.
- Transpose read number, channel, and start time from fastq headers to summary.
- Additional columns in per-read summary file as above. These will be present, regardless of whether header information is present or not.
- Changed erroneously small MAX_BARCODE define; added runtime check to avoid invalid memory access.
- Updated CI release scripts.
- Parsing Guppy/MinKNOW fastq key=value header comments.
- Ability to demultiplex inputs based on "barcode" key in headers.
- Per-read and per-file summary files now optional.
- Read length and read quality output filtering.
- Average qualities computed with Kahan summation.
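
For readers unfamiliar with Kahan (compensated) summation, the following Python sketch shows the technique. How it is applied to qualities here is an assumption: `mean_quality` averages per-base error probabilities and converts back to a Phred score, which may differ from the program's exact definition.

```python
import math


def kahan_sum(values):
    """Sum floats while carrying a running compensation for lost low-order bits."""
    total = 0.0
    compensation = 0.0
    for value in values:
        y = value - compensation          # apply the correction from the last step
        t = total + y                     # low-order bits of y may be lost here
        compensation = (t - total) - y    # recover what was lost
        total = t
    return total


def mean_quality(quals):
    """Mean Phred quality via the mean error probability (assumed definition)."""
    probs = [10 ** (-q / 10) for q in quals]
    return -10 * math.log10(kahan_sum(probs) / len(probs))
```

With the inputs `[1e16, 1.0, 1.0]`, a naive left-to-right sum loses both small terms, while the compensated sum keeps them.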
- Program hang when input file was non-existent or a directory.
- Ability to traverse a directory input.
- Ability to read input files from stdin.
- Moved output files to optional arguments.
- `-s` option to add in a `sample_name` column to outputs.
- No end-user changes.
- Per-read and per-file summarising of fastq data.