Nucleotide searching

From HLWIKI Canada

Nucleotide searching at NCBI - http://www.ncbi.nlm.nih.gov/nucleotide/

Introduction

See also Bioinformatics, Informationists & Medical informatics

What is a nucleotide?

Nucleotides are the molecules that make up the structural units of RNA and DNA. Nucleotide searching uses Entrez's Nucleotide database, where gene and nucleotide sequences are freely searchable. Gene and nucleotide sequences are used to find:

  • homologous nucleotide sequences across species and in model organisms
  • common ancestors amongst species to determine evolutionary relationships
  • the location of a gene or sequence within a genomic region, which can be visualized through gene mapping
  • the amino acid sequence, which can then be used to study protein folding

You can search for nucleotides from PubMed’s main page, near the bottom under "Popular", or switch between databases at the main page: above the search box, a pull-down menu lets you select Nucleotide.

BLAST search tool

In bioinformatics, the Basic Local Alignment Search Tool (BLAST) is used to compare primary biological sequence data, such as the amino acid sequences of proteins or the nucleotides of DNA sequences. BLAST finds regions of similarity between biological sequences and calculates the statistical significance of the matches.

The BLAST search tool can be used to:

  • identify a sequence
  • find related sequences
  • infer function
  • infer species relatedness
  • perform phylogenetic analysis

BLAST works by breaking queries into a series of “words”, or short sets of letters. These words are compared to words from sequences in the database. Once a match is found, the sequences are aligned and scored. The number of letters per word can be changed by using the “Algorithm parameters” link on the BLAST screen. Insertions and deletions in sequences result in gaps when the sequences are aligned. Gaps are assigned score penalties, both for their existence and for their extension. These penalties can also be changed from their defaults by using the “Algorithm parameters” link on the BLAST screen.
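The word-splitting and gap-penalty ideas described above can be sketched in a few lines of Python. This is only an illustration, not BLAST's actual implementation: the function names, the word size and all score values are assumptions chosen for the example, and consecutive gap positions are simply treated as one gap.

```python
def split_into_words(query, word_size=11):
    """Return every overlapping word of length word_size in the query.

    The word size here is an illustrative value, not a claim about
    the defaults of any particular BLAST program.
    """
    return [query[i:i + word_size] for i in range(len(query) - word_size + 1)]


def score_alignment(aligned_a, aligned_b, match=1, mismatch=-2,
                    gap_open=-5, gap_extend=-2):
    """Score two equal-length aligned strings in which '-' marks a gap.

    The first position of a gap costs gap_open (the penalty for its
    existence); each further consecutive gap position costs gap_extend.
    All score values are hypothetical.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    score = 0
    in_gap = False
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            score += gap_extend if in_gap else gap_open
            in_gap = True
        else:
            score += match if a == b else mismatch
            in_gap = False
    return score


words = split_into_words("ACGTAC", word_size=4)
print(words)                                   # ['ACGT', 'CGTA', 'GTAC']
print(score_alignment("ACG--T", "ACGTTT"))     # 3 matches, 1 gap of length 2, 1 match
```

Raising the gap penalties makes alignments with insertions or deletions score lower, which is the effect of changing those values under “Algorithm parameters”.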

What is being searched?

Nucleotide searches retrieve results from three databases:

  • Genome Survey Sequence (GSS) database
  • Expressed Sequence Tag (EST) database: 300-500 bp pieces of complementary DNA (cDNA) derived from mRNA, used to map where a gene is physically located in the genome
  • Nucleotide Core (referred to as Core Nucleotide when using PubMed): contains sequences not available in the GSS and EST databases

Sources for the sequences

  • GenBank: available through NIH and part of the International Nucleotide Sequence Database Collaboration; consists of annotated DNA sequences made publicly available
  • EMBL: European Molecular Biology Laboratory, part of the International Nucleotide Sequence Database Collaboration
  • DDBJ: DNA DataBank of Japan, part of the International Nucleotide Sequence Database Collaboration
  • Reference Sequence database: available through NCBI; contains annotated, publicly available sequences for DNA, RNA and proteins derived from various organisms (e.g. viruses, bacteria, eukaryotes)
  • PDB: The Protein Data Bank

How to search Nucleotide

  • Overview of specialized search fields
  • How to use tagged searches to find genes, organisms, etc.
  • Using "Preview/Index" to appropriately tag the search
  • Most useful tags
  • Building searches using Limits
  • Results
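As a rough illustration of tagged searching, the sketch below assembles a query string in which each term is followed by a field tag in square brackets. The helper name is hypothetical, the gene and organism are arbitrary examples, and the field names used ("Organism", "Gene Name") are assumed to match the fields listed in the Nucleotide search interface; check them there before relying on them.

```python
def build_tagged_query(terms, operator="AND"):
    """Join (value, field) pairs into one tagged query string.

    Multi-word values are wrapped in double quotes so they are
    searched as a phrase. This helper is illustrative only.
    """
    parts = []
    for value, field in terms:
        value = value.strip()
        if " " in value:
            value = f'"{value}"'
        parts.append(f"{value}[{field}]")
    return f" {operator} ".join(parts)


# Example terms chosen only for illustration.
query = build_tagged_query([("Homo sapiens", "Organism"),
                            ("BRCA1", "Gene Name")])
print(query)
```

The resulting string can be pasted into the Nucleotide search box; combining terms with OR instead broadens the search.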

How search results are displayed

By default, results are displayed from newest to oldest, according to the date each sequence was entered into the database. Searchers can instead sort results by accession number, organism name, taxonomy ID, or the date the entry was modified or released. Click on the “Sort By” drop-down menu to make your selection.

How to read a sequence record

Records can be viewed in a number of formats, including GenBank, FASTA, Graphics, ASN.1, Revision History and GenBank (Full). The following is a description of the GenBank flat file, the default display.

  • Locus: consists of an accession number, the length of the sequence, the molecule type and a code signifying the organism the sequence was derived from; includes the date the record was last updated
  • Definition: description of the sequence
  • Accession number: a unique identifier permanently linked to a particular sequence record
  • Version: changes to a sequence are shown by a version number appended to the accession number after a decimal point
  • Organism name: scientific name of organism, including taxonomic information
  • References: section consists of one (or more) entries that include citation information and commentaries
  • Features: contains biological information in the record, presented in a consistent manner that operates across databases
  • Origin: entire base-pair sequence for the record
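To make the fields above concrete, here is a minimal Python sketch that pulls them out of a flat file. The record below is an invented toy example (the accession XX000000 and its sequence are hypothetical, not a real database entry), and the parser is a simplification: it assumes keywords occupy the first 12 columns and ignores values that continue onto further lines.

```python
SAMPLE_RECORD = """\
LOCUS       XX000000                  12 bp    DNA     linear   SYN 01-JAN-2000
DEFINITION  Hypothetical example sequence.
ACCESSION   XX000000
VERSION     XX000000.2
SOURCE      synthetic construct
  ORGANISM  synthetic construct
ORIGIN
        1 acgtacgtac gt
//
"""

WANTED = {"LOCUS", "DEFINITION", "ACCESSION", "VERSION", "ORGANISM"}


def parse_flat_file(text):
    """Extract a few header fields and the sequence from one record."""
    fields = {}
    sequence = []
    in_origin = False
    for line in text.splitlines():
        if line.startswith("//"):        # end-of-record marker
            break
        if in_origin:
            # Sequence lines hold a position number and blocks of bases;
            # keep only the letters.
            sequence.append("".join(ch for ch in line if ch.isalpha()))
            continue
        keyword = line[:12].strip()
        value = line[12:].strip()
        if keyword == "ORIGIN":
            in_origin = True
        elif keyword in WANTED:
            fields[keyword.lower()] = value
    fields["sequence"] = "".join(sequence)
    # The version number follows the accession after a decimal point.
    _, _, version = fields.get("version", "").partition(".")
    fields["version_number"] = int(version) if version else None
    return fields


record = parse_flat_file(SAMPLE_RECORD)
print(record["accession"], record["version_number"], record["sequence"])
```

For real records, an established parser such as the one in a bioinformatics library is a safer choice than hand-written column slicing.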
