View Translation Initiation Site Pred (public)

2011-09-15 18:46 by kidzik | Version 3

Summary: Used to find the Translation Initiation Site (TIS), at which the translation from mRNA to proteins initiates
License: unknown (from UCI repository)
Dependencies
Tags: biomedical Initiation Prediction Regression Site Translation
Attribute Types: Integer,String
Instances: 3312 / Attributes: 928
Download formats: HDF5 (49.3 MB), XML, CSV, ARFF, LibSVM, Matlab, Octave

Original Data Format

Name

'fromDNA

Version mldata

Comment

Names

UP_A,DOWN_A,UP_R,DOWN_R,UP_N,DOWN_N,UP_D,DOWN_D,UP_C,DOWN_C,

Types

numeric
numeric
numeric
numeric
numeric
numeric
numeric
numeric
numeric
numeric

Data (first 10 data points)

UP_A	DOWN_A	UP_R	DOWN_R	UP_N	DOWN_N	UP_D	DOWN_D	UP_C	DOWN_C	...
2	2	4	2	0	2	0	4	0	1	...
0	0	3	2	0	2	2	1	3	0	...
6	2	5	1	0	2	0	2	1	0	...
0	6	1	1	0	1	0	0	0	2	...
2	3	3	2	0	1	1	2	2	0	...
3	6	1	0	1	1	0	0	1	0	...
0	2	0	3	0	0	0	2	0	0	...
4	2	1	0	1	3	0	1	0	0	...
1	1	0	1	0	1	0	4	0	0	...
0	3	0	0	1	1	0	0	1	1	...
...	...	...	...	...	...	...	...	...	...	...

Description

This data set is converted from sequence data. The original data consists of a selected set of vertebrate genomic sequences extracted from GenBank. It is used to find the Translation Initiation Site (TIS), at which translation from mRNA to protein initiates. Since only sequences with an annotated TIS are included in the data set, a classification model can be built to distinguish true (positive) TISs from false (negative) ones. Because the data set is processed DNA, the TIS is an ATG. In total, there are 3312 sequences (i.e. 3312 true ATGs).

There are various ways to extract sequences and build a feature space. Here, we provide one approach: for each ATG, a window centered at that ATG is generated, extending 100 bases upstream and 100 bases downstream. Each window therefore contains 203 bases (100 + 3 + 100), denoted by A, T, C and G. Where the sequence ends before the window does, the missing bases are denoted by "?". With this strategy, we obtained 3312 true ATGs and 10063 false ATGs.

To build the feature space for classification, we mapped every 3 nucleotides to 1 amino acid and counted the frequency of each amino acid. We label each amino acid as upstream or downstream according to whether it appears before or after the centered ATG. Besides single amino acids, we also considered the frequency of each pair of amino acids. Thus, the number of amino-acid-based features is (21 + 21 × 21) × 2 = 924.

Furthermore, to our knowledge, a true ATG often has G at position 1 of its downstream side, A or G at position 3 of its upstream side, and no other ATG upstream (for mRNA). We added these 3 features to the feature space as well, giving a final feature space of 927 features.
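The feature construction described above can be sketched in Python. This is only an illustration under stated assumptions, not the original authors' code: the 21 symbols are assumed to be the 20 standard amino acids plus a stop-codon symbol, codons are assumed to be read in frame with the centered ATG, pairs are assumed to be adjacent amino acids, the "no upstream ATG" check is assumed to look only inside the upstream half of the window, and the function and feature names (`window`, `translate`, `features`, `UP_NO_ATG`, ...) are hypothetical.

```python
from itertools import product

# Standard genetic code with codons ordered T, C, A, G; '*' marks a stop codon.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}
SYMBOLS = sorted(set(AMINO))  # assumption: 20 amino acids + stop = 21 symbols


def window(seq, atg_pos, flank=100):
    """Return the (2 * flank + 3)-base window centered on the ATG at atg_pos.

    Positions that fall outside the sequence are padded with '?'.
    """
    positions = range(atg_pos - flank, atg_pos + 3 + flank)
    return "".join(seq[i] if 0 <= i < len(seq) else "?" for i in positions)


def translate(bases):
    """Translate consecutive codons; codons containing '?' become None."""
    return [CODON_TABLE.get(bases[i:i + 3]) for i in range(0, len(bases) - 2, 3)]


def count_features(aas, prefix):
    """Count single symbols (21) and adjacent symbol pairs (21 * 21)."""
    feats = {f"{prefix}_{a}": 0 for a in SYMBOLS}
    feats.update({f"{prefix}_{a}{b}": 0 for a in SYMBOLS for b in SYMBOLS})
    for a in aas:
        if a is not None:
            feats[f"{prefix}_{a}"] += 1
    for a, b in zip(aas, aas[1:]):
        if a is not None and b is not None:
            feats[f"{prefix}_{a}{b}"] += 1
    return feats


def features(seq, atg_pos, flank=100):
    """Build the 924 count features plus the 3 context features for one ATG."""
    w = window(seq, atg_pos, flank)
    up, down = w[:flank], w[flank + 3:]
    # Assumption: drop leading bases so upstream codons are in frame with the ATG.
    feats = count_features(translate(up[len(up) % 3:]), "UP")
    feats.update(count_features(translate(down), "DOWN"))
    feats["DOWN_POS1_G"] = int(down[0] == "G")    # G right after the ATG
    feats["UP_POS3_AorG"] = int(up[-3] in "AG")   # A or G three bases before it
    feats["UP_NO_ATG"] = int("ATG" not in up)     # no ATG in the upstream window
    return feats


if __name__ == "__main__":
    toy = "GCCACCATGGCGTAA"  # made-up sequence with its ATG at index 6
    f = features(toy, 6)
    print(len(window(toy, 6)), len(f))
```

With the default 100-base flanks, the window has 203 characters and the feature dictionary has (21 + 21 × 21) × 2 + 3 = 927 entries, matching the counts given in the description.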

URLs

http://datam.i2r.a-star.edu.sg/datasets/krbd/SequenceData/TIS.html

revision 2: by kidzik on 2011-09-05 18:18
revision 3: by kidzik on 2011-09-15 18:46

This item was downloaded 18434 times and viewed 6326 times.

Acknowledgements

This project is supported by PASCAL (Pattern Analysis, Statistical Modelling and Computational Learning), http://www.pascal-network.org/.
