A biological sequence file parser for Ruby
Austin G. Davis-Richardson
Features
- Supported formats: FASTA, FASTQ, QSEQ (submit a format request)
- Automatic format detection
- Lazy iteration
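To illustrate what lazy iteration means here, the following is a minimal sketch in plain Ruby. It is not the gem's implementation: `each_fasta_record` is a hypothetical helper that yields one FASTA record at a time, so only the lines needed for the records you actually consume are read.

```ruby
require 'stringio'

# Hypothetical helper (not part of the dna gem): yields [header, sequence]
# pairs from a FASTA stream, one record at a time.
def each_fasta_record(io)
  # Without a block, return an Enumerator so callers can use first, take, etc.
  return enum_for(:each_fasta_record, io) unless block_given?

  header = nil
  sequence = String.new
  io.each_line do |line|
    line = line.chomp
    if line.start_with?('>')
      # A new header closes the previous record, if there was one.
      yield header, sequence unless header.nil?
      header = line[1..-1]
      sequence = String.new
    else
      sequence << line
    end
  end
  # Emit the final record once the stream is exhausted.
  yield header, sequence unless header.nil?
end

io = StringIO.new(">1\nGAGAGATC\nTCATG\n>2\nGAGACAUAUCCNNNAA\n")
each_fasta_record(io).first # => ["1", "GAGAGATCTCATG"]
```

Because `first` stops the enumerator after one record, the rest of the stream is never parsed.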
Tested on Ruby 1.9.3 and 2.0.0
$ (sudo) gem install dna
require 'dna'
# Automatic Format Detection
File.open('sequences.fasta') do |handle|
  records = Dna.new handle
  records.each do |record|
    puts record.length
  end
end
File.open('sequences.fastq') do |handle|
  records = Dna.new handle
  records.each do |record|
    puts record.quality
  end
end
File.open('sequences.qseq') do |handle|
  records = Dna.new handle
  puts records.first.inspect
end
# **caveat:** If you are reading from a compressed file
# or `stdin` you MUST specify the sequence format:
require 'zlib'
Zlib::GzipReader.open('sequences.fasta.gz') do |handle|
  records = Dna.new handle, :format => :fasta
  records.each do |record|
    puts record.length
  end
end
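As background for the caveat above, here is a minimal sketch of how format detection by peeking at the start of a stream might work. This is an assumption for illustration, not the gem's actual logic: `guess_format` is a hypothetical helper, and the rule "anything else is QSEQ" is a simplification. Peeking requires rewinding afterwards, which a pipe such as `stdin` cannot do, so an approach like this needs the format to be supplied explicitly in that case.

```ruby
require 'stringio'

# Hypothetical helper (not the gem's API): guess the format from the first
# character of the stream, then rewind so parsing can start from the top.
def guess_format(io)
  first_line = io.gets.to_s
  io.rewind # raises on non-seekable streams such as pipes
  case first_line[0]
  when '>' then :fasta
  when '@' then :fastq
  else :qseq # simplification: treat everything else as QSEQ
  end
end

guess_format(StringIO.new(">1\nGAGA\n"))           # => :fasta
guess_format(StringIO.new("@1\nGAGA\n+1\n5888\n")) # => :fastq
```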
# Illumina (> 1.3)
record.illumina_qualities # => [31, ..., 37]
# Error probabilities
record.illumina_probabilities
# => [1.0, 0.7943282347242815, ..., 0.3981071705534972]
# Solexa and Illumina (<= 1.3)
record.solexa_qualities
record.solexa_probabilities
# Sanger
record.sanger_qualities
record.sanger_probabilities
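For readers unfamiliar with these encodings, the sketch below shows the standard conversions behind quality strings, written as hypothetical stand-alone helpers rather than the gem's own methods. It assumes the usual conventions: Sanger qualities are ASCII codes minus 33, Illumina and Solexa qualities are ASCII codes minus 64, Phred-style scores map to error probabilities as 10^(-Q/10), and Solexa scores as 1 / (10^(Q/10) + 1).

```ruby
# Hypothetical helpers (not the gem's API).

# Decode a quality string into integer scores.
# offset is 33 for Sanger, 64 for Illumina and Solexa.
def decode_scores(quality, offset)
  quality.each_byte.map { |byte| byte - offset }
end

# Phred-style score (Sanger, Illumina) to error probability.
def phred_error_probability(q)
  10.0**(-q / 10.0)
end

# Solexa-style score to error probability.
def solexa_error_probability(q)
  1.0 / (10.0**(q / 10.0) + 1.0)
end

decode_scores('5888', 33)   # => [20, 23, 23, 23]
phred_error_probability(0)  # => 1.0
solexa_error_probability(0) # => 0.5
```

The two probability formulas agree closely for high scores and diverge for low ones, which is why the Solexa variants are kept separate.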
The dna gem also ships a command-line tool with grep-like capabilities: it prints every record whose header matches a (Ruby) regular expression.
$ dna spec/data/input.fastq "[1-2]"
@1
TGAAACTTATTGATCACCCCGCTTGGCGTTGGGGAGAAATTCAGAAAAGAGTGCTTGATGGGGCGCCACATGCCGTGCAACCCACTCTCTTTCACGCAGCGCGCCCCA
+1
5888.6778888650/-//&,(,./*-11'//0&,-0.(.,,,,/2/&-,,,,,.(.,(,..&---&-,,,((*-----*+.&,,,,,(//&,,,-(,,+(,,,--&(
@2
GTCGCGGCTTACCACCCAACGATTTTTTTTAGAGGTGCTGGTTTCA
+2
2550//*-1./4.--/'+.2.,,,,,,,,&(/00.11426554+13
$ dna spec/data/test.fasta "\d"
>1
GAGAGATCTCATGACACAGCCGAAG
>2
GAGACAUAUCCNNNAA
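The header-matching behaviour shown above can be sketched in a few lines of plain Ruby. This is an illustration under assumptions, not the tool's source: `grep_fasta` is a hypothetical helper that handles FASTA input only and writes matching records to the given output.

```ruby
require 'stringio'

# Hypothetical helper (not the gem's code): copy FASTA records whose header
# matches the pattern from io to out.
def grep_fasta(io, pattern, out = $stdout)
  regexp = Regexp.new(pattern)
  printing = false
  io.each_line do |line|
    if line.start_with?('>')
      # Decide once per record, based on the header text after '>'.
      printing = !(line[1..-1].chomp =~ regexp).nil?
    end
    out.print line if printing
  end
end

input = StringIO.new(">1\nGAGAGATC\n>control\nTTTT\n>2\nGAGACAUA\n")
grep_fasta(input, '\d')
# prints the records with headers "1" and "2", skipping "control"
```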