Sequence Alignment

Implementation of the classic Dynamic Programming problem using the Needleman–Wunsch algorithm which requires quadratic space & time complexity.

Problem Statement

Given 2 sequences, find the minimum cost of aligning the 2 sequences (case insensitive).
Gaps can be inserted to 1 sequence or the other, but incur a penalty.

2 = Gap Penalty (δ)
If 2 characters are aligned with each other, there may be a mismatch penalty (α_{i j})

0 = aligning identical letters
1 = aligning a vowel with a vowel, or a consonant with a consonant
3 = aligning a vowel with a consonant

Minimum cost = sum of mismatch & gap penalties (the optimal alignment)

Optimal Substructure

There may be multiple optimal paths, but this only finds 1 of them

Runtime

O(M*N)
Storage: also O(M*N)

Pseudocode

Usage

Requirements & Caveats

Alphanumeric characters only (all others including spaces are sanitized out)
Case insensitive
_ (Underscore character) represents gaps but only when displaying results (it will be removed & ignored if it's present in a sequence)
View results in a fixed-width font for the 2 sequences to be lined up
predecessorIndexes is calculated when creating memoTable but not used to find the actual alignment, only to show where the values in memoTable came from
- For example: if predecessorIndexes[4][4] contains the array [4, 3], it means the value of memoTable[4][4] (which is 6) came from memoTable[4][3] (i.e. case3, a character in seq2 was aligned with a gap so it came from the left)
- predecessorIndexes[0][0] contains the array [-1, -1] because the upper left corner has no predecessor

Setup

Provide 2 strings (edit testSequences 2D array) & run calculateAndPrintOptimalAlignment()
Optionally change the penalties by passing in arguments to the non-default constructor
Currently vowelVowelMismatchPenalty, consonantConsonantMismatchPenalty & numberNumberMismatchPenalty are the same, but all are arbitrary

Example: aligning "mean" with "name"

Code Notes

seq1 & seq2 have a leading space added (after sanitizing illegal & whitespace characters)
This is so that seq1.charAt(i) & seq2.charAt(j) work in the loops & the string indexes match up
It causes some adjustments for the array sizes.
memoTable = new int[seq1.length()][seq2.length()] uses the exact value of length() which includes the space
Loops like for(int i=0; i < seq1.length(); i++) use i < seq1.length() to not go out of bounds of the string indexes
In findAlignment(), int i = seq1.length() - 1; & int j = seq2.length() - 1; since the 2 sequences have leading spaces & arrays indexes start from 0
findAlignment() retraces each calculation in the memoTable to see where it came from

References

String Alignment using Dynamic Programming - Gina M. Cannarozzi to retrace the memo table & find the alignment

Name		Name	Last commit message	Last commit date
Latest commit History 33 Commits
images		images
.classpath		.classpath
.gitignore		.gitignore
.project		.project
README.md		README.md
SequenceAlignment.java		SequenceAlignment.java

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Sequence Alignment

Problem Statement

Optimal Substructure

Runtime

Pseudocode

Usage

Requirements & Caveats

Setup

Example: aligning "mean" with "name"

Code Notes

References

About

Releases

Packages

Contributors 2

Languages

SleekPanther/sequence-alignment

Folders and files

Latest commit

History

Repository files navigation

Sequence Alignment

Problem Statement

Optimal Substructure

Runtime

Pseudocode

Usage

Requirements & Caveats

Setup

Example: aligning "mean" with "name"

Code Notes

References

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Packages