Gene Prediction Ppt

GENE PREDICTION

TOPICS
• INTRODUCTION
• TWO APPROACHES FOR GENE PREDICTION
• CLASSIFICATION OF GENE PREDICTION
• METHODOLOGY FOR GENE PREDICTION
• TOOLS AND SERVERS FOR GENE PREDICTION
• CONCLUSION
• REFERENCES

INTRODUCTION
• Gene finding is the area of computational biology concerned with algorithmically identifying stretches of sequence, usually genomic DNA, that are biologically functional. This especially includes protein-coding genes, but may also include other functional elements such as RNA genes and regulatory regions.
• Gene finding is one of the first and most important steps in understanding the genome of a species once it has been sequenced.

• Gene prediction, in the narrower sense, means identifying the regions of genomic DNA that encode proteins.

IDENTIFICATION
• mRNA: mRNA is isolated from the organism after the introns have been spliced out, and is then reverse transcribed into a cDNA copy. Mature mRNA contains only exon sequence (no introns).
• EST (expressed sequence tag): a 200 to 500 base fragment of the mRNA sequence of a gene, sequenced from a random collection of mRNA fragments, often from the 5’ or 3’ end.

[Figure: flow of genetic information. DNA → RNA → protein → phenotype, with cDNA derived from RNA. Steps: [1] transcription, [2] RNA processing (splicing), [3] RNA export, [4] RNA surveillance.]

[Figure: Signals in pre-mRNA splicing. Genomic DNA is transcribed into pre-mRNA (Cap- ... -Poly(A)), which is spliced into mRNA (Cap- ... -Poly(A)) and translated into protein. Labeled features: start codon, stop codon, exons, introns, and the splice sites (donor site GT, acceptor site AG).]

Overview of gene prediction strategies

What sequence signals can be used?
• Transcription: TF binding sites, promoter, initiation site, terminator
• Processing signals: splice donors/acceptors, polyA signal
• Translation: start (AUG = Met) and stop (UGA, UAA, UAG) codons, ORFs, codon usage

What other types of information can be used?
• cDNAs and ESTs (experimental data, pairwise alignment)
• Homology (sequence comparison, BLAST)

Finding Eukaryotic Genes Computationally
• Gene finding based on homology evidence: BLAST, FASTA, BLAT etc.
• Content-based methods: CpG islands, GC content, hexamer repeats, composition statistics, codon frequencies
• Feature-based methods: donor sites, acceptor sites, promoter sites, start/stop codons, polyA signals, feature lengths
• Similarity-based methods: sequence homology, EST searches
• Pattern-based methods: HMMs, artificial neural networks

Most effective is a combination of all of the above!

Scheme of a eukaryotic gene

Gene prediction approaches focused on individual features
• Coding regions (ORFs)
• Splice sites
• Promoters
• Codon bias
• CpG islands
• GC content

Six Frames in a DNA Sequence
Forward strand (5’ to 3’), read in frames +1, +2 and +3:
CTGCAGACGAAACCTCTTGATGTAGTTGGCCTGACACCGACAATAATGAAGACTACCGTCTTACTAACAC
Complementary strand (shown 3’ to 5’), read in frames -1, -2 and -3:
GACGTCTGCTTTGGAGAACTACATCAACCGGACTGTGGCTGTTATTACTTCTGATGGCAGAATGATTGTG

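• Start codon: ATG
• Stop codons: TAA, TAG, TGA
• Stop codons make up 3 of the 64 codons, so in random sequence roughly 1 codon in 20 is a stop. Long stretches without a stop codon are therefore candidate coding regions.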
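The six-frame idea can be sketched in a few lines of Python. This is a minimal illustration, not the method used by any of the tools in this deck: the function names and the `min_codons` parameter are hypothetical, and only ATG starts and the three standard stop codons listed above are considered.

```python
STOPS = {"TAA", "TAG", "TGA"}   # stop codons from the slide
START = "ATG"                   # start codon from the slide
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement, i.e. the other strand read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def orfs_in_frame(seq, frame, min_codons=1):
    """Yield (start, end) for each ATG...stop ORF in one frame (0, 1 or 2).

    `end` is exclusive and includes the stop codon; `min_codons` counts the
    codons before the stop (hypothetical filter, not from the slides).
    """
    start = None
    for i in range(frame, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if start is None and codon == START:
            start = i                      # first ATG opens the ORF
        elif start is not None and codon in STOPS:
            if (i - start) // 3 >= min_codons:
                yield (start, i + 3)
            start = None                   # look for the next ORF


def six_frame_orfs(seq, min_codons=1):
    """Map frame labels (+1..+3, -1..-3) to the ORFs found in that frame.

    Coordinates for the minus frames refer to the reverse-complemented strand.
    """
    seq = seq.upper()
    result = {}
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            result[f"{strand}{frame + 1}"] = list(orfs_in_frame(s, frame, min_codons))
    return result
```

Calling `six_frame_orfs(sequence, min_codons=30)` on a genomic sequence would keep only the longer open reading frames; the threshold of 30 is an arbitrary example value.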

CpG Islands
• CpG islands are regions of the genome with a higher frequency of CG dinucleotides (not base pairs!) than the rest of the genome.
• CpG islands often occur near the beginning of genes, which may be related to the binding of the transcription factor Sp1.
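The enrichment of CG dinucleotides described above can be quantified with an observed/expected ratio. The sketch below is an assumption-laden illustration: the function name is hypothetical, and the ratio (observed CG count divided by the count expected from the C and G frequencies) is a commonly used measure that the slides themselves do not define.

```python
def cpg_stats(seq):
    """Return (GC content, observed/expected CG-dinucleotide ratio) for a sequence.

    expected CG count = (#C * #G) / length, assuming C and G occur independently.
    """
    seq = seq.upper()
    n = len(seq)
    if n == 0:
        return 0.0, 0.0
    c = seq.count("C")
    g = seq.count("G")
    observed = seq.count("CG")        # "CG" cannot overlap itself, so count() is exact
    gc_content = (c + g) / n
    expected = c * g / n
    ratio = observed / expected if expected else 0.0
    return gc_content, ratio
```

A window is often flagged as a candidate island when the GC content exceeds about 0.5 and the ratio exceeds about 0.6 over at least 200 bases; these thresholds are conventional values, not taken from this deck.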

Splice sites are conserved (can be an important signal)

Gene prediction: eukaryotes vs prokaryotes
Gene prediction is easier in microbial genomes. Why?

• Smaller genomes
• Simpler gene structures
• More sequenced genomes (for comparative approaches)

Methods?
• Previously: mostly HMM-based
• Now: similarity-based methods, because so many genomes are available

PROCEDURE FOR GENE PREDICTION
1. Obtain the new genomic DNA sequence.
2. Translate it in all six reading frames and compare the translations to a protein sequence database.
3. Perform a database similarity search against an EST database of the same organism, or against cDNA sequences if available.
4. Use a gene prediction program to locate genes.
5. Analyze the regulatory sequences of the genes.

Integrated methods: Hidden Markov Models
• Fully probabilistic, so proper statistics can be done
• Parameters can be estimated from labeled data
• Can give confidence values

• Semi- or generalized HMMs: a state explains a whole subsequence (e.g. an entire exon) rather than a single base, with transitions between states placed at features detected by other methods (e.g. a splice site consensus)

Hidden Markov Models
• Hidden Markov Models (HMMs) allow us to model complex sequences in which the character emission probabilities depend upon the state.
• Think of an HMM as a probabilistic or stochastic sequence generator; what is hidden is the current state of the model.

HMM Details
• An HMM is completely defined by its state-to-state transition matrix, its emission matrix (H) and its state vector (x).

• We want to determine the probability that a specific (query) sequence was generated by the model.
• Two algorithms are typically used for the likelihood calculation: Viterbi and Forward.
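To make the two likelihood calculations concrete, here is a minimal sketch of the Forward and Viterbi algorithms on a toy two-state model. Everything numeric is an assumption for illustration: the states ("noncoding", "coding") and all start, transition and emission probabilities are invented and are not parameters of any real gene finder.

```python
import math

# Toy model: all numbers below are made up for illustration only.
STATES = ("noncoding", "coding")
START_P = {"noncoding": 0.5, "coding": 0.5}
TRANS_P = {
    "noncoding": {"noncoding": 0.9, "coding": 0.1},
    "coding": {"noncoding": 0.1, "coding": 0.9},
}
EMIT_P = {
    "noncoding": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},  # AT-rich
    "coding": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},     # GC-rich
}


def forward(seq):
    """Forward algorithm: P(seq | model), summed over all state paths."""
    f = {s: START_P[s] * EMIT_P[s][seq[0]] for s in STATES}
    for ch in seq[1:]:
        f = {s: EMIT_P[s][ch] * sum(f[r] * TRANS_P[r][s] for r in STATES)
             for s in STATES}
    return sum(f.values())


def viterbi(seq):
    """Viterbi algorithm: most probable state path and its log-probability."""
    v = {s: math.log(START_P[s]) + math.log(EMIT_P[s][seq[0]]) for s in STATES}
    back = []                                  # one back-pointer table per step
    for ch in seq[1:]:
        ptr, nxt = {}, {}
        for s in STATES:
            best = max(STATES, key=lambda r: v[r] + math.log(TRANS_P[r][s]))
            ptr[s] = best
            nxt[s] = v[best] + math.log(TRANS_P[best][s]) + math.log(EMIT_P[s][ch])
        back.append(ptr)
        v = nxt
    last = max(STATES, key=lambda s: v[s])
    path = [last]
    for ptr in reversed(back):                 # trace the best path backwards
        path.append(ptr[path[-1]])
    path.reverse()
    return path, v[last]
```

Forward sums over every possible state path, while Viterbi keeps only the single best one, so the Viterbi path probability can never exceed the Forward probability. The plain-probability Forward pass shown here would underflow on long sequences; scaling or log-space arithmetic would be needed in practice.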

GRAIL
• Gene Recognition and Analysis Internet Link.
• Introduced by Uberbacher & Mural, 1991.
• One of the first techniques developed for gene prediction.
• GRAIL uses a neural network (NN) to recognize coding potential in fixed-length windows of about 100 bases, without looking for additional features such as splice junctions or start and stop codons; it depends on the sequence itself.
• The improved version, GRAIL 2, looks for additional features and makes its predictions taking genomic context into account.
• XGRAIL is the client-server application; it runs on the Unix platform.
• URL: http://compbio.ornl.gov/tools/index.html

FGENEH/FGENES
• Developed by Victor Solovyev and colleagues.
• Predicts internal exons by looking for structural features such as donor and acceptor splice sites.
• The method uses linear discriminant analysis, a mathematical technique that allows data from multiple experiments to be combined.
• Server: Sanger Centre web server.
• URL: http://genomic.sanger.ac.uk/gf/gf.html
• Example: human BAC clone RG346p16 of chromosome 7 (GenBank accession no. AC002416).
• The protein product is output in FASTA format.

MZEF
• Michael Zhang’s Exon Finder.
• From Cold Spring Harbor Laboratory.
• Based on quadratic discriminant analysis (QDA).
• MZEF predicts internal coding exons and does not give any other information.
• The QDA combines two types of prediction: 1. splice sites, 2. exon length.

• Prediction uses exon length and exon-intron boundaries.
• The program can be downloaded from the CSHL FTP site for Unix, or accessed through a web front end.
• URL: http://www.cshl.org/genefinder

GENSCAN
• Developed by Chris Burge & Sam Karlin.
• Predicts complete gene structures.
• Its high-probability predictions are mostly used in the design of PCR primers for cDNA amplification.
• GENSCAN relies on a probabilistic model; the algorithm can assign “optimal exons” as well as “suboptimal exons”.
• Optimal exons: the sequences with the highest probability (0.99, i.e. 97.5%).
• Suboptimal exons: sequences with an acceptable probability (0.56, i.e. 62%).
• URL: http://genes.mit.edu/GENSCAN.html

GENEID
• Finds exons based on coding potential.
• Introduced by Guigó et al., 1992.
• GENEID uses position weight matrices to assess whether a stretch of sequence represents a splice site or a start or stop codon.
• It is more selective: the output can be restricted according to need.
• Output of internal exons only
• Output of terminal exons only
• Output of all exons
• URL: http://www.imim.es/geneid.html
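The position weight matrix idea can be sketched as follows. This is not GENEID's actual implementation: the function names, the pseudocount, the uniform background of 0.25 and the four example donor sites are all hypothetical, chosen only to show how a matrix is built from aligned sites and used to score a window.

```python
import math


def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log-odds position weight matrix from equal-length aligned sites."""
    length = len(sites[0])
    total = len(sites) + 4 * pseudocount
    pwm = []
    for pos in range(length):
        column = [site[pos] for site in sites]
        pwm.append({
            base: math.log2(((column.count(base) + pseudocount) / total) / background)
            for base in "ACGT"
        })
    return pwm


def score(pwm, window):
    """Sum the per-position log-odds scores of a window of the same length as the PWM."""
    return sum(col[base] for col, base in zip(pwm, window))


def best_hit(pwm, seq):
    """Return (score, start index) of the highest-scoring window in seq."""
    width = len(pwm)
    return max((score(pwm, seq[i:i + width]), i)
               for i in range(len(seq) - width + 1))


# Invented training examples resembling donor sites (exon end + GT + intron start).
EXAMPLE_DONOR_SITES = ["AGGTAAGT", "AGGTGAGT", "CGGTAAGT", "AGGTAAGA"]
```

For example, `best_hit(build_pwm(EXAMPLE_DONOR_SITES), genomic_sequence)` returns the best-matching position; a real predictor would instead keep every window above a score threshold and combine it with other evidence.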

BEST METHOD OF PREDICTION
• Cold Spring Harbor Laboratory compared gene prediction tools to find the best performer.
• The benchmark website is called “Banbury Cross”.
• For each tool, every prediction has four possible outcomes: true positive (TP), true negative (TN), false positive (FP) or false negative (FN).
• Sensitivity: the fraction of actual coding regions that are correctly predicted as coding.
• Specificity: the overall fraction of the predictions that are correct.
• To combine sensitivity and specificity into a single value, a correlation coefficient (CC) is computed. CC ranges from -1 (every prediction wrong) through 0 (no better than chance) to +1 (every prediction right).

• Sensitivity: $Sn = TP / (TP + FN)$
• Specificity: $Sp = TP / (TP + FP)$
• Correlation coefficient: $CC = \dfrac{TP \cdot TN - FP \cdot FN}{\sqrt{PP \cdot PN \cdot AP \cdot AN}}$, where PP and PN are the predicted positives and negatives, and AP and AN are the actual positives and negatives.

Result: the best overall exon finder was MZEF (CC = 0.79) and the best gene structure predictor was GENSCAN (CC = 0.86).
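The three measures can be computed directly from the four outcome counts. The sketch below uses the standard form of the correlation coefficient, with a minus sign in the numerator and a square root in the denominator; the function name is hypothetical and the counts in the usage note are invented, not benchmark results.

```python
import math


def accuracy_metrics(tp, tn, fp, fn):
    """Return (sensitivity, specificity, correlation coefficient).

    Specificity is TP / (TP + FP), as defined on the slide.
    Assumes no denominator is zero.
    """
    sn = tp / (tp + fn)
    sp = tp / (tp + fp)
    pp, pn = tp + fp, tn + fn        # predicted positives / negatives
    ap, an = tp + fn, tn + fp        # actual positives / negatives
    cc = (tp * tn - fp * fn) / math.sqrt(pp * pn * ap * an)
    return sn, sp, cc
```

With invented counts such as `accuracy_metrics(tp=80, tn=900, fp=10, fn=10)`, sensitivity and specificity both come out at 80/90 and CC at 71900/81900. A perfect predictor gives CC = 1 and a predictor that is always wrong gives CC = -1.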

CONCLUSION
• Gene prediction means identifying the regions of genomic DNA that encode proteins.

• Gene finding based on homology evidence: BLAST, FASTA, BLAT etc.
• Content-based methods: CpG islands, GC content, hexamer repeats, composition statistics, codon frequencies
• Feature-based methods: donor sites, acceptor sites, promoter sites, start/stop codons, polyA signals, feature lengths
• Similarity-based methods: sequence homology, EST searches
• Pattern-based methods: HMMs, artificial neural networks
• Tools built on these methods: GRAIL, FGENES, GENSCAN, MZEF, GENEID.
• The best performers were MZEF and GENSCAN.

REFERENCES BIOINFORMATICS (A PRACTICAL GUIDE TO THE ANALYSIS OF GENE AND PROTIENS) BY ANDRES D. BAXEVANIS BIOINFORMATICS( SEQUENCE AND GENOME ANALYSIS) BY DAVID W. MOUNT GOOGLE SEARCH TOOL WIKEPAEDIA SEARCH TOOL

THANK YOU FOR YOUR PATIENCE!
