Merge pull request #11 from KirillKryukov/develop
Version 1.3.0
KirillKryukov authored May 17, 2021
2 parents 372f161 + 1198b8a commit 042a210
Showing 57 changed files with 269 additions and 92 deletions.
8 changes: 8 additions & 0 deletions CHANGELOG.md
@@ -1,5 +1,13 @@
# NAF Changelog

## Current

## 1.3.0 - 2021-05-17
- Added `--long` option to _ennaf_ for setting sequence window size.
- Added `--binary` shortcut option to _unnaf_.
- Added support for empty sequences.
- Updated zstd to v1.5.0.

## 1.2.0 - 2020-09-01
- Added `--sequences` option to _unnaf_.
- Added `--binary-stdout` option to _unnaf_.
22 changes: 18 additions & 4 deletions Compress.md
@@ -10,7 +10,7 @@

`ennaf file.fq -o file.naf` - Compress a FASTQ file (format is detected automatically).

`ennaf -22 file.fa -o file.naf` - Use maximum compression level.
`ennaf -22 --long 31 file.fa -o file.naf` - Use maximum compression level.

`gzip -dc file.gz | ennaf -o file.naf` - Recompress from gzip to NAF on the fly.

@@ -27,6 +27,13 @@ Maximum level is 22, however take care as levels above 19 are slow and use signi
**--level #** - Use compression level #.
Same with `-#`, but also supports even faster negative levels, down to -131072.

**--long N** - Use a window of size 2^N for the sequence stream.
The range is currently from 10 to 31.
If not specified, the default window size depends on the compression level.
`--long 31` can improve compression of large repetitive data.
A large window increases memory consumption of both compression and decompression,
so be careful with this option if you plan to share compressed files with others.
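
As a rough illustration of what N means here, the following C sketch computes the window size implied by `--long N`. The helper names `window_bytes` and `describe_window` are hypothetical and not part of ennaf, and the 10 to 31 range is taken from the text above rather than from any particular zstd build. The result is the window size only; it does not measure the total memory used by compression or decompression.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Range quoted in the documentation text above (assumed, not queried from zstd). */
enum { WINDOW_LOG_MIN = 10, WINDOW_LOG_MAX = 31 };

/* Hypothetical helper, not part of ennaf: window size in bytes for --long N,
   or 0 when N is outside the documented range. */
static uint64_t window_bytes(int window_log)
{
    if (window_log < WINDOW_LOG_MIN || window_log > WINDOW_LOG_MAX) { return 0; }
    return (uint64_t)1 << window_log;
}

/* Hypothetical helper: writes a short description such as "2048 MiB" into buf.
   Returns the value returned by snprintf. */
static int describe_window(int window_log, char *buf, size_t buf_size)
{
    assert(buf != NULL && buf_size > 0);
    uint64_t bytes = window_bytes(window_log);
    if (bytes >= (1u << 20))
    {
        return snprintf(buf, buf_size, "%llu MiB", (unsigned long long)(bytes >> 20));
    }
    return snprintf(buf, buf_size, "%llu bytes", (unsigned long long)bytes);
}
```

For example, `window_bytes(31)` is 2^31 bytes (2048 MiB), while `window_bytes(10)` is 1024 bytes.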

**--temp-dir DIR** - Use DIR for temporary files.
If omitted, uses directory specified in environment variable `TMPDIR`.
If there's no such variable, tries environment variable `TMP`.
@@ -110,6 +117,10 @@ while network transfer and decompression may be performed thousands of times by
Optimizing user experience is more important in such cases.
So, `ennaf -22` is the best option for sequence databases.

On some data `ennaf -22 --text` can compress better than the default `--dna` mode.
For maximum compression of large datasets you can add `--long 31`,
but use it carefully, as it increases memory consumption of both compression and decompression.

## Specifying input format

Input format (FASTA or FASTQ) is automatically detected from the actual input data, so there's no need to specify it.
@@ -196,8 +207,11 @@ you have to switch to text mode (`--text`).
## Using text mode for DNA data

Since both `--dna` and `--text` modes can be used for DNA data, which is better?
Short answer: `--dna` is faster and has stronger compression.
For details, see [this benchmark page](http://kirill-kryukov.com/study/naf/benchmark-text-vs-dna-Spur.html).
Normally `--dna` should be preferred, as it's much faster than `--text`, and compression strength is similar.
For strongest possible compression, the choice depends on data.
With less repetitive data such as assembled genomes, `--dna` seems to give stronger compression
([example benchmark](http://kirill-kryukov.com/study/naf/benchmark-text-vs-dna-Spur.html)).
With repetitive data, `--text` is often better.

## Can it compress multiple files into single archive?

@@ -207,7 +221,7 @@ First you combine individual FASTA files into a single Multi-Multi-FASTA stream,
Example commands:

Compressing:<br>
`mumu.pl --dir 'Helicobacter' 'Helicobacter pylori*' | ennaf -22 --text -o Hp.nafnaf`
`mumu.pl --dir 'Helicobacter' 'Helicobacter pylori*' | ennaf -22 --long 31 --text -o Hp.nafnaf`

Decompressing and unpacking:<br>
`unnaf Hp.nafnaf | mumu.pl --unpack --dir 'Helicobacter'`
2 changes: 2 additions & 0 deletions Decompress.md
@@ -68,6 +68,8 @@ Supported only for DNA and RNA sequences.

**--binary-stdout** - Set stdout stream to binary mode. Useful for piping decompressed sequences to md5sum on Windows.

**--binary** - Shortcut for `--binary-stdout --binary-stderr`.

**-h**, **--help** - Show usage help.

**-V**, **--version** - Show version.
2 changes: 1 addition & 1 deletion LICENSE
@@ -1,4 +1,4 @@
Copyright (c) 2018-2020 Kirill Kryukov
Copyright (c) 2018-2021 Kirill Kryukov

This software is provided 'as-is', without any express or implied
warranty. In no event will the authors be held liable for any damages
5 changes: 4 additions & 1 deletion Makefile
@@ -3,7 +3,7 @@

export prefix = /usr/local

.PHONY: default all test clean install uninstall
.PHONY: default all test test-large clean install uninstall

default:
	$(MAKE) -C zstd/lib ZSTD_LEGACY_SUPPORT=0 ZSTD_LIB_DEPRECATED=0 ZSTD_LIB_DICTBUILDER=0 libzstd.a
@@ -15,6 +15,9 @@ all: default
test:
	$(MAKE) -C tests

test-large:
	$(MAKE) -C tests large

clean:
	$(MAKE) -C ennaf clean
	$(MAKE) -C unnaf clean
16 changes: 11 additions & 5 deletions ennaf/src/compressor.c
@@ -1,21 +1,27 @@
/*
* NAF compressor
* Copyright (c) 2018-2020 Kirill Kryukov
* Copyright (c) 2018-2021 Kirill Kryukov
* See README.md and LICENSE files of this repository
*/


static ZSTD_CStream* create_zstd_cstream(int level)
static ZSTD_CStream* create_zstd_cstream(int level, int window_size_log)
{
    ZSTD_CStream *s = ZSTD_createCStream();
    if (s == NULL) { die("ZSTD_createCStream() error\n"); }

    if (window_size_log != 0)
    {
        ZSTD_TRY(ZSTD_CCtx_setParameter(s, ZSTD_c_enableLongDistanceMatching, 1));
        ZSTD_TRY(ZSTD_CCtx_setParameter(s, ZSTD_c_windowLog, window_size_log));
    }

    size_t const initResult = ZSTD_initCStream(s, level);
    if (ZSTD_isError(initResult)) { die("ZSTD_initCStream() error: %s\n", ZSTD_getErrorName(initResult)); }
    return s;
}


static void compressor_init(compressor_t *w, const char *name)
static void compressor_init(compressor_t *w, const char *name, int window_size_log)
{
    assert(w != NULL);
    assert(w->allocated == 0);
@@ -35,7 +41,7 @@ static void compressor_init(compressor_t *w, const char *name)

    w->allocated = COMPRESSED_BUFFER_SIZE;
    w->buf = (unsigned char *) malloc_or_die(w->allocated);
    w->cstream = create_zstd_cstream(compression_level);
    w->cstream = create_zstd_cstream(compression_level, window_size_log);
    w->path = (char *) malloc_or_die(temp_path_length + 1);
    snprintf(w->path, temp_path_length, "%s/%s.%s", temp_dir, temp_prefix, name);
    if (verbose) { msg("Temp %s file: \"%s\"\n", name, w->path); }
2 changes: 1 addition & 1 deletion ennaf/src/encoders.c
@@ -1,6 +1,6 @@
/*
* NAF compressor
* Copyright (c) 2018-2020 Kirill Kryukov
* Copyright (c) 2018-2021 Kirill Kryukov
* See README.md and LICENSE files of this repository
*/

2 changes: 1 addition & 1 deletion ennaf/src/encoders.h
@@ -1,6 +1,6 @@
/*
* NAF compressor
* Copyright (c) 2018-2020 Kirill Kryukov
* Copyright (c) 2018-2021 Kirill Kryukov
* See README.md and LICENSE files of this repository
*/

55 changes: 44 additions & 11 deletions ennaf/src/ennaf.c
@@ -1,12 +1,12 @@
/*
* NAF compressor
* Copyright (c) 2018-2020 Kirill Kryukov
* Copyright (c) 2018-2021 Kirill Kryukov
* See README.md and LICENSE files of this repository
*/

#define VERSION "1.2.0"
#define DATE "2020-09-01"
#define COPYRIGHT_YEARS "2018-2020"
#define VERSION "1.3.0"
#define DATE "2021-05-17"
#define COPYRIGHT_YEARS "2018-2021"

#include "platform.h"
#include "encoders.h"
@@ -34,6 +34,7 @@ static bool force_stdout = false;
static bool created_output_file = false;

static int compression_level = 1;
static int sequence_window_size_log = 0;

static char *temp_dir = NULL;
static char *dataset_name = NULL;
@@ -243,6 +244,35 @@ static void set_line_length(char *str)
}


static void set_sequence_window_size_log(char *str)
{
    assert(str != NULL);

    char *end;
    long long a = strtoll(str, &end, 10);
    if (*end != '\0') { die("can't parse the value of --long argument\n"); }

    char test_str[21];
    int nc = snprintf(test_str, 21, "%lld", a);
    if (nc < 1 || nc > 20 || strcmp(test_str, str) != 0) { die("can't parse the value of --long argument\n"); }

    if (a < ZSTD_WINDOWLOG_MIN)
    {
        warn("--long value of %lld is smaller than the lowest supported value %d, using %d instead\n", a, ZSTD_WINDOWLOG_MIN, ZSTD_WINDOWLOG_MIN);
        sequence_window_size_log = ZSTD_WINDOWLOG_MIN;
    }
    else if (a > ZSTD_WINDOWLOG_MAX)
    {
        warn("--long value of %lld is larger than the largest supported value %d, using %d instead\n", a, ZSTD_WINDOWLOG_MAX, ZSTD_WINDOWLOG_MAX);
        sequence_window_size_log = ZSTD_WINDOWLOG_MAX;
    }
    else
    {
        sequence_window_size_log = (int) a;
    }
}
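
The parse-then-clamp pattern used for `--long` can be sketched as a stand-alone helper, shown here only as an illustration. `parse_clamped_int` is a hypothetical name and not part of ennaf. Unlike the function above, this sketch returns `false` instead of calling `die()`, clamps silently instead of calling `warn()`, and detects overflow through `errno` rather than the `snprintf` round trip, so it accepts non-canonical spellings such as `+27` or `027` that the round-trip check would reject.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-alone helper, not part of ennaf.
   Parses str as a base-10 integer and clamps the result to [lo, hi].
   Returns false if str is empty, has trailing characters, or overflows long long. */
static bool parse_clamped_int(const char *str, int lo, int hi, int *out)
{
    assert(str != NULL && out != NULL && lo <= hi);

    char *end;
    errno = 0;
    long long a = strtoll(str, &end, 10);
    if (end == str || *end != '\0' || errno == ERANGE) { return false; }

    if (a < lo) { a = lo; }
    else if (a > hi) { a = hi; }

    *out = (int) a;
    return true;
}
```

With the 10 to 31 range from the documentation, `parse_clamped_int("5", 10, 31, &v)` stores 10 in `v`, and `parse_clamped_int("31x", 10, 31, &v)` returns `false`.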


static int parse_input_format(const char *str)
{
assert(str != NULL);
@@ -306,6 +336,7 @@ static void show_help(void)
" -o FILE - Write compressed output to FILE\n"
" -c - Write to standard output\n"
" -#, --level # - Use compression level # (from %d to %d, default: 1)\n"
" --long N - Use window of size 2^N for sequence stream (from %d to %d)\n"
" --temp-dir DIR - Use DIR as temporary directory\n"
" --name NAME - Use NAME as prefix for temporary files\n"
" --title TITLE - Store TITLE as dataset title\n"
@@ -322,7 +353,7 @@ static void show_help(void)
" --no-mask - Don't store mask\n"
" -h, --help - Show help\n"
" -V, --version - Show version\n",
min_level, max_level);
min_level, max_level, ZSTD_WINDOWLOG_MIN, ZSTD_WINDOWLOG_MAX);
}


@@ -343,6 +374,7 @@ static void parse_command_line(int argc, char **argv)
if (!strcmp(argv[i], "--title")) { i++; set_dataset_title(argv[i]); continue; }
if (!strcmp(argv[i], "--level")) { i++; set_compression_level(argv[i]); continue; }
if (!strcmp(argv[i], "--line-length")) { i++; set_line_length(argv[i]); continue; }
if (!strcmp(argv[i], "--long")) { i++; set_sequence_window_size_log(argv[i]); continue; }

// Deprecated, undocumented.
if (!strcmp(argv[i], "--out")) { i++; set_output_file_path(argv[i]); continue; }
@@ -465,12 +497,13 @@ int main(int argc, char **argv)
}

make_temp_prefix();
compressor_init(&IDS, "ids");
compressor_init(&COMM, "comments");
compressor_init(&LEN, "lengths");
if (store_mask) { compressor_init(&MASK, "mask"); }
compressor_init(&SEQ, "sequence");
if (store_qual) { compressor_init(&QUAL, "quality"); }

compressor_init(&IDS, "ids", 0);
compressor_init(&COMM, "comments", 0);
compressor_init(&LEN, "lengths", 0);
if (store_mask) { compressor_init(&MASK, "mask", 0); }
compressor_init(&SEQ, "sequence", sequence_window_size_log);
if (store_qual) { compressor_init(&QUAL, "quality", 0); }

process();
close_input_file();
2 changes: 1 addition & 1 deletion ennaf/src/files.c
@@ -1,6 +1,6 @@
/*
* NAF compressor
* Copyright (c) 2018-2020 Kirill Kryukov
* Copyright (c) 2018-2021 Kirill Kryukov
* See README.md and LICENSE files of this repository
*/

2 changes: 1 addition & 1 deletion ennaf/src/platform.h
@@ -1,6 +1,6 @@
/*
* NAF compressor
* Copyright (c) 2018-2020 Kirill Kryukov
* Copyright (c) 2018-2021 Kirill Kryukov
* See README.md and LICENSE files of this repository
*/
