BLAST (Basic Local Alignment Search Tool)






1°

The abbreviation for primary. 1° sequence refers to the letters of DNA, RNA, or protein. 1° transcript refers to an unprocessed RNA that still contains its introns.

2°

The abbreviation for secondary. Most frequently used for generalizing protein and RNA structures; for example, the α-helix and hairpin are common 2° structures.

3´

The end of a nucleic acid (DNA or RNA) sequence; often used in conjunction with UTR (e.g., 3´UTR).

5´

The start of a nucleic acid (DNA or RNA) sequence; often used in conjunction with UTR (e.g., 5´UTR). Nucleotide sequences are conventionally written with the 5´ end at the left. DNA molecules are usually double-stranded, but usually only the 5´ to 3´ strand is written. The complementary strand has reversed polarity (3´ to 5´).



aa

The abbreviation for amino acid; often used when describing the length of a protein (e.g., the average protein is about 300 aa long).


allele

A form of a gene. Typically, the most common form is called wild-type, and each allele is given a specific (and often obscure) name.

amino acid

The basic building block for all proteins. There are 20 common amino acids.

Arabidopsis thaliana

Known by its common name, thale cress, this mustard weed is a favorite organism for plant genetics and molecular biology. It was the first plant with a complete genomic sequence. For more information, see


bit

The contraction for binary digit. The base-2 logarithm of a number is in units of bits.


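BLOSUM

The abbreviation for a blocks substitution matrix. Matrix names are followed by a number (e.g., BLOSUM62) that indicates the percent identity at or above which sequences were clustered together when the matrix was built. A lower number therefore suits more distantly related sequences.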


bp

The abbreviation for base pair. The length of DNA is usually given in bp or nt. Common measures include Kb, Mb, and Gb for thousands, millions, and billions of bp, respectively.


C-terminus

The end of a protein. In text form, the C-terminus of the protein is always at the right.

Caenorhabditis elegans

A nematode (also called a roundworm) that is about 1 mm long and has about 1,000 cells as an adult. C. elegans was the first animal to have its complete genome sequenced. See


CDS

The abbreviation for a coding sequence. CDS isn't synonymous with exon, since exons may contain noncoding sequence.


codon

Three contiguous letters of DNA or RNA. Each of the 64 codons specifies either an amino acid or a translation stop.


complement

The complement of a DNA sequence is the sequence on the other strand. For example, the complement of ACCCGT is TGGGCA. To complement a sequence in Perl, use either of the following:

# 4-letter alphabet

$dna =~ tr/ACGT/TGCA/;

# 15-letter alphabet

$dna =~ tr/ACGTRYMKWSBDHVN/TGCAYRKMWSVHDBN/;

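Complementing alone yields the other strand read 3´ to 5´. To write that strand in the conventional 5´ to 3´ orientation, the letters must also be reversed. The following is a minimal sketch; the sub name revcomp is hypothetical, and it assumes the 4-letter alphabet only.

```perl
use strict;
use warnings;

# Hypothetical helper: return the reverse complement of a DNA string.
# Only the 4-letter alphabet (upper- or lowercase) is handled.
sub revcomp {
    my ($dna) = @_;
    my $rc = reverse $dna;          # scalar context: reverse the letters
    $rc =~ tr/ACGTacgt/TGCAtgca/;   # then complement each letter
    return $rc;
}

print revcomp("ACCCGT"), "\n";   # prints ACGGGT
```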

Drosophila melanogaster

The common fruit fly. This is one of the most famous organisms for genetic research and was one of the first animals whose complete genomic sequence was determined. See

dynamic programming

A common technique that reduces the computational complexity of a problem by finding and extending a partial optimization.

E. coli

Escherichia coli. A common bacterium normally found in your gut and a favorite organism for molecular biology research. Some variants cause food poisoning.

effective length

Karlin-Altschul statistics assume sequences of infinite length. To adjust for edge effects in real sequences, the search space is reduced by adjusting the true lengths of the sequences to effective lengths.


entropy

Randomness; disorder; unpredictability.
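
For sequences, unpredictability is commonly measured as Shannon entropy. The sketch below computes it in bits per letter from the letter composition of a single sequence; the sub name entropy_bits is hypothetical, and treating each letter as independent is a simplifying assumption.

```perl
use strict;
use warnings;

# Hypothetical helper: Shannon entropy, in bits per letter, of the
# letter composition of a sequence. Returns 0 for an empty string.
sub entropy_bits {
    my ($seq) = @_;
    my $n = length($seq);
    return 0 unless $n;
    my %count;
    $count{$_}++ for split //, $seq;
    my $h = 0;
    for my $c (values %count) {
        my $p = $c / $n;
        $h -= $p * log($p) / log(2);   # log() is the natural log in Perl
    }
    return $h;
}

printf "%.2f\n", entropy_bits("ACGTACGT");   # 2.00: four equally likely letters
printf "%.2f\n", entropy_bits("AAAAAAAA");   # 0.00: perfectly predictable
```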


eukaryote

Organisms with intracellular membranous organelles such as the nucleus and mitochondria are called eukaryotes.

frame-shift mutation

A mutation that causes an insertion or deletion of nucleotides that isn't a multiple of three, and therefore causes the reading frame to change.


gene

A functional unit of the genome. When not specifically stated, "gene" is usually considered a "protein-coding" gene, but many genes don't contain the instructions for proteins (e.g., various RNA genes).

genetic code

The mapping of codons to amino acids. See Table 2-3.
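
As an illustration, the standard genetic code can be built from the conventional 64-letter amino acid string, in which the bases cycle in the order T, C, A, G with the third position changing fastest. This is a sketch only: the sub name translate is hypothetical, only the standard code is covered, and unknown codons are reported as X by assumption.

```perl
use strict;
use warnings;

# Standard genetic code, one letter per codon; * marks a translation stop.
my @bases = qw(T C A G);
my $aas   = 'FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG';
my %code;
my $i = 0;
for my $b1 (@bases) {
    for my $b2 (@bases) {
        for my $b3 (@bases) {
            $code{"$b1$b2$b3"} = substr($aas, $i++, 1);
        }
    }
}

# Hypothetical helper: translate DNA starting at its first letter.
# Codons not in the table (e.g., containing N) become X.
sub translate {
    my ($dna) = @_;
    $dna = uc $dna;
    my $protein = '';
    for (my $p = 0; $p + 3 <= length($dna); $p += 3) {
        my $codon = substr($dna, $p, 3);
        $protein .= exists($code{$codon}) ? $code{$codon} : 'X';
    }
    return $protein;
}

print translate("ATGGCCTAA"), "\n";   # prints MA*
```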

genetic drift

The tendency of sequences to change over time by accumulating random mutations.


genome

The complete genetic material for an organism. For eukaryotes, the genome refers to the nuclear genome and doesn't include organelles.

global alignment

An alignment algorithm that requires every letter of each sequence to appear in the alignment. Globally aligning sequences of different lengths may lead to very strange alignments.
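
The following sketch computes the optimal global alignment score by dynamic programming, keeping only two rows of the matrix. The sub name global_score, the linear gap cost, and the example scores are assumptions chosen for illustration; real programs usually use substitution matrices and affine gap costs.

```perl
use strict;
use warnings;

# Hypothetical helper: optimal global alignment score with a linear gap
# cost. $gap is the (negative) score for each letter aligned to a gap.
sub global_score {
    my ($s, $t, $match, $mismatch, $gap) = @_;
    my @a = split //, $s;
    my @b = split //, $t;
    my @prev = map { $_ * $gap } 0 .. scalar(@b);   # row 0: gaps only
    for my $i (1 .. scalar(@a)) {
        my @cur = ($i * $gap);                      # column 0: gaps only
        for my $j (1 .. scalar(@b)) {
            my $best = $prev[$j-1]
                     + ($a[$i-1] eq $b[$j-1] ? $match : $mismatch);
            my $up   = $prev[$j]  + $gap;
            my $left = $cur[$j-1] + $gap;
            $best = $up   if $up   > $best;
            $best = $left if $left > $best;
            $cur[$j] = $best;
        }
        @prev = @cur;
    }
    return $prev[-1];   # every letter of both sequences is accounted for
}

print global_score("ACGT", "ACGT", 1, -1, -2), "\n";   # prints 4
print global_score("ACGT", "AGT",  1, -1, -2), "\n";   # prints 1
```

The second call shows the cost of forcing unequal lengths into one alignment: three matches are offset by one mandatory gap.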



homologous

In sequence analysis, homologous means derived from a common ancestor. Sequences are either homologous or they aren't. It is incorrect to say that sequences are 80 percent homologous unless you mean that there is an 80 percent chance of common ancestry. Use percent identity to describe the similarity of alignments.
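
Percent identity can be computed directly from two aligned strings. In this sketch, the sub name percent_identity is hypothetical, gaps are assumed to be written as '-', and dividing by the full alignment length (gap columns included) is one common convention rather than the only one.

```perl
use strict;
use warnings;

# Hypothetical helper: percent identity of two aligned strings of equal
# length. A column counts as identical only if both letters match and
# neither is a gap character.
sub percent_identity {
    my ($top, $bottom) = @_;
    die "aligned strings must have equal length\n"
        unless length($top) == length($bottom);
    my $cols = length($top);
    return 0 unless $cols;
    my $same = 0;
    for my $i (0 .. $cols - 1) {
        my $x = substr($top, $i, 1);
        my $y = substr($bottom, $i, 1);
        $same++ if $x eq $y && $x ne '-';
    }
    return 100 * $same / $cols;
}

printf "%.1f\n", percent_identity("ACGT-A", "ACCTTA");   # prints 66.7
```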


hydrophilic

Literally, "likes water." Water is a polar molecule that mixes well with other polar molecules. The charged amino acids K, R, D, and E are examples of hydrophilic amino acids.


hydrophobic

Literally, "fears water." Nonpolar molecules (like those in oils) don't mix well with water. The amino acids L, I, V, and F are particularly hydrophobic.


Karlin-Altschul statistics

The standard local alignment theory is often called Karlin-Altschul statistics after its founding authors.

lambda, λ

The Karlin-Altschul statistical parameter that converts a raw score to a normalized score.
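
Multiplying a raw score S by lambda gives a score in nats. The bit score commonly reported by BLAST also folds in the Karlin-Altschul parameter K: (lambda * S - ln K) / ln 2. The sketch below shows both conversions; the sub names are hypothetical, and the parameter values are illustrative assumptions, since real values depend on the scoring matrix and gap costs.

```perl
use strict;
use warnings;

# Hypothetical helper: raw score scaled by lambda, in nats.
sub nat_score {
    my ($raw, $lambda) = @_;
    return $lambda * $raw;
}

# Hypothetical helper: bit score, (lambda * S - ln K) / ln 2.
sub bit_score {
    my ($raw, $lambda, $k) = @_;
    return ($lambda * $raw - log($k)) / log(2);
}

# Illustrative parameter values only (assumed, not taken from the text).
my ($example_lambda, $example_k) = (0.267, 0.041);
printf "%.1f nats\n", nat_score(100, $example_lambda);
printf "%.1f bits\n", bit_score(100, $example_lambda, $example_k);
```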

local alignment

An alignment algorithm that finds the optimal subsequence alignment. The alignment may include all letters of each sequence, but it isn't required to do so.
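
A local alignment score can be sketched with the same kind of dynamic programming used for global alignment, with two changes: a cell never drops below zero, and the answer is the best cell anywhere in the matrix. The sub name local_score, the linear gap cost, and the example scores are assumptions for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: optimal local alignment score with a linear gap
# cost. Letters outside the best-scoring region are simply ignored.
sub local_score {
    my ($s, $t, $match, $mismatch, $gap) = @_;
    my @a = split //, $s;
    my @b = split //, $t;
    my @prev = (0) x (scalar(@b) + 1);
    my $best = 0;
    for my $i (1 .. scalar(@a)) {
        my @cur = (0);
        for my $j (1 .. scalar(@b)) {
            my $score = $prev[$j-1]
                      + ($a[$i-1] eq $b[$j-1] ? $match : $mismatch);
            my $up   = $prev[$j]  + $gap;
            my $left = $cur[$j-1] + $gap;
            $score = $up   if $up   > $score;
            $score = $left if $left > $score;
            $score = 0     if $score < 0;      # restart the alignment here
            $cur[$j] = $score;
            $best = $score if $score > $best;
        }
        @prev = @cur;
    }
    return $best;
}

print local_score("TTTACGTTTT", "GGACGGG", 1, -1, -2), "\n";   # prints 3
```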

low-complexity sequence

Regions of sequences that are highly predictable—for example, a region that is 90 percent A or T.


methionine

One of the 20 common amino acids. Methionine is abbreviated as M or Met, and is especially important because translation of a protein begins with a methionine. There is only one codon for this amino acid: ATG.


mutation

Any change in the sequence of a DNA molecule.


N-terminus

The start of a protein. In text form, a protein's N-terminus is always at the left.


nat

Contraction for natural log digits. The base-e logarithm of a number is in units of nats.
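
Nats and bits differ only by a constant factor of ln 2. A short sketch, with hypothetical sub names, of the two logarithms and the conversion between them:

```perl
use strict;
use warnings;

# Perl's built-in log() is the natural logarithm, so it returns nats.
sub log2         { my ($x)    = @_; return log($x) / log(2); }   # bits
sub nats_to_bits { my ($nats) = @_; return $nats / log(2);   }
sub bits_to_nats { my ($bits) = @_; return $bits * log(2);   }

printf "%.3f bits\n", log2(8);   # prints 3.000 bits
printf "%.3f nats\n", log(8);    # prints 2.079 nats
```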

natural selection

A theory founded by Charles Darwin that explains how organisms change over time to better fit their environment. It is based on the principles of variation, heritability, and differential reproduction.


ncRNA

The abbreviation for noncoding RNA. Some RNAs, like tRNAs or rRNAs, don't contain information for protein sequences.


Needleman-Wunsch

Global alignment is often called Needleman-Wunsch after the authors who first described the algorithm.


nucleotide

The basic building block of nucleic acid sequences (DNA and RNA). DNA is made from A, C, G, or T, while RNA contains A, C, G, or U.


nt

The abbreviation for nucleotide.


O notation

The computational complexity of an algorithm is often described by its asymptotic behavior. O(n) problems grow linearly with the size of the input. O(log₂ n) problems grow much more slowly, and O(n²) problems grow much more quickly.


ORF

The abbreviation for open reading frame. Each strand of DNA has three frames. Any subsequence that doesn't contain stop codons in a particular frame is an open reading frame.
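
Following that definition, the sketch below reports the longest stop-free run of codons in each of the three frames of one strand. The sub name longest_orfs is hypothetical, and no start codon is required, which is an assumption that matches the definition above rather than every gene finder. To cover the other strand, the same sub could be run on the reverse complement.

```perl
use strict;
use warnings;

# Hypothetical helper: for frames 0, 1, and 2 of the given strand, return
# the longest run of whole codons that contains no stop codon.
sub longest_orfs {
    my ($dna) = @_;
    $dna = uc $dna;
    my %stop = map { $_ => 1 } qw(TAA TAG TGA);
    my @longest;
    for my $frame (0 .. 2) {
        my ($best, $run) = ('', '');
        for (my $p = $frame; $p + 3 <= length($dna); $p += 3) {
            my $codon = substr($dna, $p, 3);
            if ($stop{$codon}) { $run = '';       }   # a stop ends the run
            else               { $run .= $codon;  }
            $best = $run if length($run) > length($best);
        }
        push @longest, $best;
    }
    return @longest;
}

my @orfs = longest_orfs("ATGAAATAGCCC");
print "frame $_: $orfs[$_]\n" for 0 .. 2;
# frame 0: ATGAAA
# frame 1: AATAGC
# frame 2: GAAATAGCC
```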


orthologs

Genes that are separated by speciation (i.e., the same gene in different species). This is often approximated as the best reciprocal match between two complete genomes or proteomes.


palindrome

A palindrome in DNA is a sequence that is read the same on the plus and minus strands. For example, the sequence GAATTC is a palindrome. Palindromes and near-palindromes are often sites for DNA-protein interaction. Proteins scanning along DNA "see" a palindrome as the same sequence regardless of which direction they are moving.
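
Put another way, a DNA palindrome is a sequence equal to its own reverse complement. A minimal check, assuming the 4-letter alphabet and using the hypothetical sub name is_palindrome:

```perl
use strict;
use warnings;

# Hypothetical helper: true (1) if the sequence reads the same on both
# strands, i.e., it equals its own reverse complement; otherwise 0.
sub is_palindrome {
    my ($dna) = @_;
    $dna = uc $dna;
    my $rc = reverse $dna;
    $rc =~ tr/ACGT/TGCA/;
    return $rc eq $dna ? 1 : 0;
}

print is_palindrome("GAATTC"), "\n";   # prints 1
print is_palindrome("ACCCGT"), "\n";   # prints 0
```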


PAM

An acronym for Percent or Point Accepted Mutation. PAM scoring matrix names are usually followed by a number (e.g., PAM200), which indicates how many iterations of multiplication were used starting with the PAM1 matrix. The higher number indicates a more distant similarity.


paralogs

Genes that are duplicated within a single genome. Duplication sometimes allows one of the genes to take on a specialized function.


phylogeny

The study of evolutionary relationships among organisms.


prokaryote

Organisms that don't contain intracellular organelles. All bacteria are prokaryotes.


proteome

The complete set of all proteins produced by a particular organism. Many proteins undergo post-translational modifications that add or subtract features from a protein. Therefore, a particular mRNA might have many different protein isoforms.


pseudogene

A sequence that looks like a gene but isn't. Most pseudogenes are derived from mRNAs that have been reverse-transcribed back to DNA and inserted into the genome. They have the hallmarks of RNA processing, notably a poly-A tail and no introns.

relative entropy

The average number of bits (or nats) per aligned letter for a given scoring scheme.


repeat

Any class of sequence that appears multiple times in a genome. Usually, gene families aren't called repeats and the term is used for junk DNA. Some of the most common repeats in the human genome include the Alu and LINE families.

reverse transcriptase

A protein that creates DNA from an RNA template.


RNA

Ribonucleic acid. RNA is chemically similar to DNA but not used strictly for storage. Many RNA molecules have important functions in the cell and may even have enzymatic properties. Some of the most common functional RNA molecules include rRNAs and tRNAs.

RNA polymerase

A protein or multiprotein complex that creates RNA from a DNA template.


ribosome

A complex macromolecule made up of proteins and rRNAs. Ribosomes are responsible for translating mRNAs into proteins.


rRNA

Ribosomal RNA. The ribosome is composed of many specific RNA molecules, and these components are called rRNAs. rRNAs are some of the most abundant RNAs in a cell.


Smith-Waterman

Local alignment is often referred to as Smith-Waterman, after the authors who first described the algorithm.

start codon

ATG. Codes for the amino acid methionine. Many proteins have N-terminal post-translational modifications, and the first amino acid of the mature protein may therefore not be methionine.

stop codon

TAA, TGA, and TAG are the three codons that terminate translation.

sum statistics

A method that determines the aggregate statistical significance of multiple local alignments.

target frequency

The expected frequencies of individual letter pairings. For nucleotide scoring matrices, the target frequency is often summarized by the expected percent identity in sequences with unbiased composition.
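
For nucleotides with unbiased composition (each letter at frequency 0.25), an expected percent identity fixes the target frequencies, and from them the log-odds scores and the relative entropy. The sketch below works this through under stated assumptions: the sub name nucleotide_scores is hypothetical, all 12 mismatch pairings are assumed equally likely, and scores are left as unscaled real numbers in bits.

```perl
use strict;
use warnings;

# Hypothetical helper: given the expected fraction of identical columns
# (0 < identity < 1), return the match score, the mismatch score, and the
# relative entropy, all in bits.
sub nucleotide_scores {
    my ($identity) = @_;
    die "identity must be between 0 and 1, exclusive\n"
        unless $identity > 0 && $identity < 1;
    my $p          = 0.25;                    # background letter frequency
    my $q_match    = $identity / 4;           # each of 4 identical pairings
    my $q_mismatch = (1 - $identity) / 12;    # each of 12 mismatched pairings
    my $s_match    = log($q_match    / ($p * $p)) / log(2);
    my $s_mismatch = log($q_mismatch / ($p * $p)) / log(2);
    # Relative entropy: the expected score per aligned letter pair.
    my $h = 4 * $q_match * $s_match + 12 * $q_mismatch * $s_mismatch;
    return ($s_match, $s_mismatch, $h);
}

my ($m, $mm, $h) = nucleotide_scores(0.90);
printf "match %.2f, mismatch %.2f, relative entropy %.2f bits\n", $m, $mm, $h;
```

At 25 percent identity the target frequencies equal the background frequencies, so both scores and the relative entropy fall to zero: such alignments carry no information.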


transcriptome

The complete set of transcripts for a particular genome. This term is often used to mean the mRNAs of protein-coding genes and their alternatively spliced variants.


tRNA

The abbreviation for transfer RNA. tRNAs transfer individual amino acids to the ribosome. Each tRNA molecule has an anticodon that matches the reverse complement of a codon for the amino acid it carries.


UTR

The abbreviation for an untranslated region. The 5´ and 3´ ends of an mRNA have untranslated regions. These regions sometimes play regulatory roles that change the mRNA's stability, translatability, or localization.