Introduction
See also Bioinformatics, Informationists & Medical informatics
What is a nucleotide?
Nucleotides are the molecules that make up the structural units of RNA and DNA. Nucleotide searching uses Entrez's Nucleotide database, where gene and nucleotide sequences are freely searchable. Gene and nucleotide sequences are used to find:
- homologous nucleotide sequences across species and in model organisms
- common ancestors amongst species to determine evolutionary relationships
- the location of a gene or sequence within a genomic region, which can be visualized through gene mapping
- the amino acid sequence encoded by a gene, which can then be used to study protein folding
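The last use listed above, deriving an amino acid sequence from a nucleotide sequence, can be sketched in a few lines of Python. This is only a minimal illustration: it assumes the standard genetic code, reads the sequence in a single frame starting at the first base, and stops at the first stop codon. The function name `translate` is chosen for this example and is not part of any NCBI tool.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate a coding DNA sequence codon by codon until a stop codon."""
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        # "X" marks a codon containing letters other than A, C, G, T.
        amino_acid = CODON_TABLE.get(dna[i:i + 3], "X")
        if amino_acid == "*":  # stop codon
            break
        protein.append(amino_acid)
    return "".join(protein)
```

For example, `translate("ATGGCCTAA")` reads the codons ATG, GCC and TAA, giving methionine (M), alanine (A) and then a stop.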
You can search for nucleotides from PubMed’s main page, near the bottom under "Popular", or by switching between databases at the main page. Above the search box, there is a pull-down menu from which you can select Nucleotide.
BLAST search tool
In bioinformatics, the Basic Local Alignment Search Tool (BLAST) is used to compare primary biological sequence data, such as the amino-acid sequences of proteins or the nucleotides of DNA sequences. BLAST finds regions of similarity between biological sequences and calculates the statistical significance of the matches.
The BLAST search tool can be used to:
- identify a sequence
- find related sequences
- infer function
- infer species relatedness
- perform phylogenetic analysis
BLAST works by breaking a query into a series of "words", each a short run of letters. These words are compared to words from sequences in the database. Once a match is found, the sequences are aligned and scored. The number of letters per word can be changed using the "Algorithm parameters" link on the BLAST screen. Insertions and deletions in sequences result in gaps when the sequences are aligned. Each gap is assigned a score penalty for its existence and a further penalty for its extension. These penalties can also be changed from their defaults using the "Algorithm parameters" link on the BLAST screen.
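The word-matching and gap-scoring ideas described above can be illustrated with a short Python sketch. It is a simplified teaching example, not BLAST's actual implementation: it looks only for exact word matches, skips the step in which matches are extended into alignments, and scores an alignment that has already been made. The function names and the score values (match, mismatch, gap existence, gap extension) are assumptions chosen for illustration and are not necessarily BLAST's defaults.

```python
from collections import defaultdict


def words(seq, size):
    """Yield (offset, word) for every overlapping word of `size` letters."""
    for i in range(len(seq) - size + 1):
        yield i, seq[i:i + size]


def seed_matches(query, subject, size=4):
    """Return (query_offset, subject_offset) pairs where a word matches exactly."""
    index = defaultdict(list)
    for j, word in words(subject, size):
        index[word].append(j)
    return [(i, j) for i, word in words(query, size) for j in index.get(word, [])]


def score_alignment(aligned_query, aligned_subject,
                    match=1, mismatch=-2, gap_open=-5, gap_extend=-2):
    """Score two equal-length aligned strings in which '-' marks a gap.

    Each gap is charged `gap_open` once for existing, plus `gap_extend`
    for every position it covers (illustrative values only).
    """
    if len(aligned_query) != len(aligned_subject):
        raise ValueError("aligned sequences must have the same length")
    score = 0
    gap_in = None  # which sequence the current gap is in, if any
    for q, s in zip(aligned_query, aligned_subject):
        if q == "-" or s == "-":
            side = "query" if q == "-" else "subject"
            if side != gap_in:  # a new gap starts here
                score += gap_open
                gap_in = side
            score += gap_extend
        else:
            gap_in = None
            score += match if q == s else mismatch
    return score
```

With a word size of 4, `seed_matches("ACGTAC", "TTACGTTT")` finds the shared word ACGT at query offset 0 and subject offset 2. Scoring `"ACGT--ACGT"` against `"ACGTTTACGT"` gives eight matches (+8) and one gap of two positions (-5 for existence, -4 for extension), for a total of -1. Raising the word size makes seeds rarer and more specific, while harsher gap penalties favour alignments with fewer insertions and deletions.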
What is being searched?
Nucleotide searches retrieve results from three databases:
- Genome Survey Sequence (GSS) database
- Expressed Sequence Tag (EST) database: 300-500 bp pieces of complementary DNA (cDNA) derived from mRNA, used to map where a gene is physically located in the genome
- Nucleotide Core: contains sequences not available in the GSS and EST databases; it is referred to as Core Nucleotide when using PubMed
Sources for the sequences
- GenBank: available through NIH and part of the International Nucleotide Sequence Database Collaboration; consists of annotated DNA sequences made publicly available
- EMBL: European Molecular Biology Laboratory, part of the International Nucleotide Sequence Database Collaboration
- DDBJ: DNA DataBank of Japan, part of the International Nucleotide Sequence Database Collaboration
- RefSeq: the Reference Sequence database, available through NCBI; it contains annotated, publicly available sequences for DNA, RNA and proteins derived from various organisms (e.g. viruses, bacteria, eukaryotes)
- PDB: The Protein Data Bank
How to search Nucleotide
- Overview of specialized search fields
- How to use tagged searches to find genes, organisms, etc.
- Using "Preview/Index" to appropriately tag the search
- Most useful tags
- Building searches using Limits
- Results
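As a rough illustration of tagged searching, the Python sketch below joins search terms and their field tags into a single query string, in the `term[Tag]` form used by Entrez. It also builds (but does not send) a request URL for NCBI's E-utilities ESearch service, which is a programmatic alternative to the web interface described on this page. The helper names `tagged_query` and `esearch_url` are hypothetical, and the choice of the `Gene Name` and `Organism` tags is an assumption made for the example.

```python
from urllib.parse import urlencode

# NCBI E-utilities search endpoint; "nuccore" is the Nucleotide database.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def tagged_query(terms, operator="AND"):
    """Join (term, field_tag) pairs into one Entrez-style query string."""
    parts = [f"{term}[{tag}]" for term, tag in terms]
    return f" {operator} ".join(parts)


def esearch_url(query, db="nuccore", retmax=20):
    """Build an ESearch URL for the query (no network request is made)."""
    return ESEARCH_URL + "?" + urlencode({"db": db, "term": query, "retmax": retmax})


query = tagged_query([("BRCA1", "Gene Name"), ("Homo sapiens", "Organism")])
print(query)               # BRCA1[Gene Name] AND Homo sapiens[Organism]
print(esearch_url(query))
```

The same query string can be pasted directly into the Nucleotide search box; tagging each term restricts it to one field instead of searching every field in the record.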
How search results are displayed
By default, results are displayed from newest to oldest, according to the date each sequence was entered into the database. Searchers can choose to sort results by accession number, organism name, taxonomy ID, or the date the entry was modified or released. Click on the "Sort By" drop-down menu to make your selection.
How to read a sequence record
Records can be viewed in a number of formats, such as GenBank, FASTA, Graphics, ASN.1, Revision History and GenBank (Full). The following is a description of the GenBank flat file, the default display.
- Locus: consists of a locus name (often the same as the accession number), the length of the sequence, the molecule type, a three-letter division code signifying the group of organisms the sequence was derived from, and the date the record was last updated
- Definition: description of the sequence
- Accession number: a unique identifier permanently linked to a particular sequence record
- Version: changes to a sequence are shown by a version number appended to the accession number after a decimal point
- Organism name: scientific name of organism, including taxonomic information
- References: section consists of one (or more) entries that include citation information and commentaries
- Features: annotations of the biological information in the record (such as genes and coding regions), presented in a consistent format that is shared across databases
- Origin: the entire nucleotide sequence for the record
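To show how the fields described above sit in a flat file, here is a minimal Python sketch that pulls a few of them out of a record. The sample record is invented for illustration (the accession `XX000000` and its sequence are not a real GenBank entry), and the parser is deliberately simple: it handles only single-line fields and ignores the Features and References sections. The function name `parse_flat_file` is hypothetical.

```python
# A made-up record in GenBank flat-file layout, for illustration only.
SAMPLE_RECORD = """\
LOCUS       XX000000                  24 bp    DNA     linear   SYN 01-JAN-2000
DEFINITION  Hypothetical example sequence.
ACCESSION   XX000000
VERSION     XX000000.1
SOURCE      synthetic construct
  ORGANISM  synthetic construct
FEATURES             Location/Qualifiers
     source          1..24
ORIGIN
        1 atggccattg taatgggccg ctga
//
"""

WANTED = ("LOCUS", "DEFINITION", "ACCESSION", "VERSION", "ORGANISM")


def parse_flat_file(text):
    """Extract selected single-line fields and the sequence from one record."""
    record = {"sequence": ""}
    in_origin = False
    for line in text.splitlines():
        if line.startswith("//"):  # end-of-record marker
            break
        if in_origin:
            # Sequence lines hold a position number, then blocks of bases.
            record["sequence"] += "".join(line.split()[1:])
            continue
        keyword, _, value = line.strip().partition(" ")
        if keyword == "ORIGIN":
            in_origin = True
        elif keyword in WANTED:
            record[keyword.lower()] = value.strip()
    return record


record = parse_flat_file(SAMPLE_RECORD)
accession, _, version = record["version"].partition(".")
print(accession, version)        # XX000000 1
print(record["locus"].split())   # locus name, length, molecule type, ...
print(len(record["sequence"]))   # 24
```

For real work, an established parser such as the one in Biopython is a safer choice, since actual records contain multi-line fields and nested feature qualifiers that this sketch does not handle.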