Creating a database means assembling a coherent collection of data with inherent meaning for future use. A database is a general repository of voluminous information or records to be processed by a program.
Databases are broadly classified as generalized databases and specialized databases. Databases on the structural organisation of DNA, proteins and carbohydrates are included under generalized databases. Databases of Expressed Sequence Tags (ESTs), Genome Survey Sequences (GSS), Single Nucleotide Polymorphisms (SNPs) and Sequence Tagged Sites (STSs), as well as RNA databases, are included under specialized databases.
Generalized databases comprise sequence databases and structure databases.
a. Sequence databases hold the sequence records of either nucleotides or amino acids. The former are the nucleic acid databases and the latter are the protein sequence databases.
b. Structure databases hold the individual records of macromolecular structures.
The nucleic acid databases are further classified into primary databases and secondary databases.
Primary databases contain data in their original form, taken as such from the source, e.g., GenBank (NCBI, USA), SWISS-PROT (Switzerland) for proteins, and protein 3D structure databases.
Secondary databases, also called value-added databases, contain annotated data and information, e.g., OMIM (Online Mendelian Inheritance in Man) and GDB (Genome Database, human).
The European Molecular Biology Laboratory (EMBL), the National Center for Biotechnology Information (NCBI) and the DNA Data Bank of Japan (DDBJ) are the three premier institutes considered the authorities on nucleotide sequence databases. EMBL can be reached at www.ebi.ac.uk/embl.
The protein sequence databases provide high-level annotations such as descriptions of protein function, domain structure (configuration), amino acid sequence, post-translational modifications and variants. The SWISS-PROT groups at the SIB (Swiss Institute of Bioinformatics) and the EBI (European Bioinformatics Institute) have developed the protein sequence databases. SWISS-PROT is available at http://www.expasy.ch/sprot-top.html.
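The kinds of annotation listed above can be pictured as fields of a single record. The following is only an illustrative sketch: the class name, field names and the example entry are hypothetical and do not reproduce the actual SWISS-PROT entry format.

```python
from dataclasses import dataclass, field

@dataclass
class ProteinRecord:
    """Hypothetical container for one annotated protein entry."""
    accession: str                      # identifier of the entry
    description: str                    # free-text description of function
    sequence: str                       # amino acid sequence, one-letter codes
    domains: list = field(default_factory=list)        # (name, start, end)
    modifications: list = field(default_factory=list)  # post-translational
    variants: list = field(default_factory=list)       # known variants

    def length(self):
        """Number of residues in the sequence."""
        return len(self.sequence)

# Made-up entry, for illustration only.
example = ProteinRecord(
    accession="HYPO001",
    description="hypothetical example protein",
    sequence="MKTAYIAK",
    domains=[("example domain", 2, 6)],
)
```

Keeping the annotations next to the sequence is what lets a search program filter entries by function or domain as well as by sequence.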
The DNA fragments of an organism's genome can be separated by size using a technique called electrophoresis. When the DNA of an organism is subjected to electrophoresis, the fragments migrate towards the positive electrode because DNA is a negatively charged molecule. Smaller DNA fragments move faster than longer ones. By comparing the distances that the DNA fragments migrate, their lengths in bases can be estimated. The sequence of bases in the DNA fragments can be identified by chemical or biochemical methods. Nowadays automated sequencing machines called sequenators can read hundreds of bases in the DNA. The DNA sequence data are then stored in a computer-accessible form.
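The size comparison described above is usually made against fragments of known length run on the same gel. The following is a minimal sketch, assuming that migration distance falls roughly linearly with the logarithm of fragment length (a common approximation that the text does not state). The function name and the ladder values are hypothetical.

```python
import math

def estimate_length(distance_mm, ladder):
    """Estimate fragment length (bases) from its migration distance.

    ladder: list of (length_in_bases, distance_mm) pairs for known fragments.
    Assumes distance is linear in log10(length) between neighbouring markers.
    """
    # Sort markers by distance so neighbouring pairs bracket the unknown.
    points = sorted((d, math.log10(n)) for n, d in ladder)
    for (d1, l1), (d2, l2) in zip(points, points[1:]):
        if d1 <= distance_mm <= d2:
            # Linear interpolation on the log10(length) scale.
            frac = (distance_mm - d1) / (d2 - d1)
            return 10 ** (l1 + frac * (l2 - l1))
    raise ValueError("distance lies outside the ladder range")

# Hypothetical ladder: longer fragments migrate a shorter distance.
LADDER = [(1000, 20.0), (500, 35.0), (100, 70.0)]
```

With this ladder, a band found between the 20 mm and 35 mm markers is assigned a length between 500 and 1000 bases.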
A DNA library is a collection of DNA fragments that together contain all the sequences of a single organism.
In a cDNA library, DNA copies of messenger RNA are made using the enzyme reverse transcriptase. cDNA libraries are smaller than genomic libraries and contain only DNA molecules for genes, since they are copied from mRNA.
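The copying step can be illustrated in a few lines of code. This is a minimal sketch of the base pairing only, assuming an mRNA written 5' to 3' with the four standard bases; the function name is hypothetical and the enzyme chemistry is not modelled.

```python
# Watson-Crick pairing of each RNA base with the DNA base of the new strand.
RNA_TO_DNA_COMPLEMENT = {"A": "T", "U": "A", "G": "C", "C": "G"}

def first_strand_cdna(mrna):
    """Return the first-strand cDNA (written 5'->3') for an mRNA given 5'->3'."""
    mrna = mrna.upper()
    # Complement each base and reverse the order, because the two strands
    # are antiparallel.
    return "".join(RNA_TO_DNA_COMPLEMENT[base] for base in reversed(mrna))

print(first_strand_cdna("AUGGCC"))  # GGCCAT
```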
IUB/IUPAC standards are followed when representing either nucleotides or proteins. The accepted amino acid codes for proteins are given below.
A - Alanine       B - Aspartate / Asparagine   C - Cysteine     D - Aspartate
E - Glutamate     F - Phenylalanine            G - Glycine      H - Histidine
I - Isoleucine    K - Lysine                   L - Leucine      M - Methionine
N - Asparagine    P - Proline                  Q - Glutamine    R - Arginine
S - Serine        T - Threonine                V - Valine       W - Tryptophan
Y - Tyrosine      Z - Glutamate / Glutamine    X - any          * - Translation stop
- (hyphen): gap of indeterminate length.
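A code table of this kind is typically used to check sequence records before they are stored. The following is a minimal sketch that checks a protein sequence against the one-letter codes (the 20 standard residues plus B, Z, X, * and -); the names `AMINO_ACID_CODES` and `invalid_residues` are hypothetical.

```python
# One-letter codes: the 20 standard residues plus ambiguity and special symbols.
AMINO_ACID_CODES = {
    "A": "Alanine", "B": "Aspartate or Asparagine", "C": "Cysteine",
    "D": "Aspartate", "E": "Glutamate", "F": "Phenylalanine",
    "G": "Glycine", "H": "Histidine", "I": "Isoleucine", "K": "Lysine",
    "L": "Leucine", "M": "Methionine", "N": "Asparagine", "P": "Proline",
    "Q": "Glutamine", "R": "Arginine", "S": "Serine", "T": "Threonine",
    "V": "Valine", "W": "Tryptophan", "Y": "Tyrosine",
    "Z": "Glutamate or Glutamine", "X": "any",
    "*": "translation stop", "-": "gap of indeterminate length",
}

def invalid_residues(sequence):
    """Return the sorted symbols in `sequence` that are not in the table."""
    return sorted(set(sequence.upper()) - set(AMINO_ACID_CODES))
```

An empty result means every symbol in the sequence is one of the accepted codes.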
The nucleic acid codes are as follows (FASTA format):
A - adenosine     C - cytidine           G - guanosine               T - thymidine
U - uridine       R - purines (G, A)     Y - pyrimidines (T, C)      K - G or T
M - A or C        S - G or C             W - A or T                  B - G, T or C
D - G, A or T     H - A, C or T          V - G, C or A               N - A, G, C or T (any)
- (hyphen): gap of indeterminate length.
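The ambiguity codes are most useful when searching a sequence for a pattern that allows more than one base at a position. The following is a minimal sketch that translates an IUB/IUPAC-coded pattern, including the two-base codes K, M, S and W, into a regular expression; the function names are hypothetical and gaps are not handled.

```python
import re

# Each code maps to the set of bases it can stand for.
IUPAC_NUCLEOTIDES = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "U",
    "R": "AG", "Y": "CT", "K": "GT", "M": "AC", "S": "CG", "W": "AT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def to_regex(pattern):
    """Translate an IUPAC-coded pattern into a regular expression string."""
    return "".join("[" + IUPAC_NUCLEOTIDES[c] + "]" for c in pattern.upper())

def find_matches(pattern, dna):
    """Return 0-based start positions where `pattern` matches `dna`."""
    # The lookahead lets overlapping matches be reported as well.
    regex = re.compile("(?=" + to_regex(pattern) + ")")
    return [m.start() for m in regex.finditer(dna.upper())]

print(find_matches("GAR", "GAAGAGGAT"))  # [0, 3]
```

Here `GAR` matches both `GAA` and `GAG`, because R stands for either purine.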
To specialize in bioinformatics, knowledge of both biology and information and computer technology is required. A biologist needs to know programming, code optimization and cluster analysis, as these are bioinformatics methods. Biologists should also be familiar with key algorithms (sets of steps). Programming languages that help in bioinformatics include C, C++, Java and FORTRAN, and familiarity with the LINUX and UNIX operating systems is also useful. Knowledge of database systems such as ORACLE and Sybase is essential. On the mathematical side, knowledge of calculus and statistical techniques is needed, as is knowledge of CGI (Common Gateway Interface) scripts. With these skills, a bioinformaticist can collect, organize, search and analyze biological data, viz., nucleic acid and protein sequences.
1. Bioinformatics helps to understand gene structure and protein synthesis.
2. It helps to know more about diseases.
3. It helps to understand more about fundamental biology and the thread of life, DNA.
4. It paves the way for medical and bioengineering applications.
5. It helps to apply biophysical and biotechnological principles to biological studies. In turn, this will help to design new drugs and new chemical compounds for use in health and environmental management respectively.