View statlib-20050214 tecator (public)

2010-11-06 10:00 by mldata | Version 1
License
unknown (from Weka repository)
Tags
arff slurped Weka
Attribute Types
Floating Point
Download
# Instances: 240 / # Attributes: 125
HDF5 (256.2 KB) XML CSV ARFF LibSVM Matlab Octave

Original Data Format
arff
Name
tecator
Version mldata
0
Comment

This is the Tecator data set: The task is to predict the fat content of a meat sample on the basis of its near infrared absorbance spectrum.

  1. Statement of permission from Tecator (the original data source)

These data are recorded on a Tecator Infratec Food and Feed Analyzer working in the wavelength range 850 - 1050 nm by the Near Infrared Transmission (NIT) principle. Each sample contains finely chopped pure meat with different moisture, fat and protein contents.

If results from these data are used in a publication we want you to mention the instrument and company name (Tecator) in the publication. In addition, please send a preprint of your article to

Karin Thente, Tecator AB, 
Box 70, S-263 21 Hoganas, Sweden

The data are available in the public domain with no responsibility from the original data source. The data can be redistributed as long as this permission note is attached. For more information about the instrument, call Perstorp Analytical's representative in your area.

  2. Description of the data file

For each meat sample the data consists of a 100 channel spectrum of absorbances and the contents of moisture (water), fat and protein. The absorbance is -log10 of the transmittance measured by the spectrometer. The three contents, measured in percent, are determined by analytic chemistry.
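The absorbance definition above can be written as a one-line conversion. The following is a minimal sketch, assuming the transmittance is expressed as a fraction in (0, 1]; the function name `absorbance` is illustrative and not part of the data set.

```python
import math

def absorbance(transmittance: float) -> float:
    """Return -log10(T) for a transmittance T given as a fraction in (0, 1]."""
    if not 0.0 < transmittance <= 1.0:
        raise ValueError("transmittance must be in (0, 1]")
    return -math.log10(transmittance)

# A sample that transmits 0.1% of the incident light has an absorbance of about 3,
# which is the order of magnitude of the values stored in this data set.
print(absorbance(0.001))
```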

There are 240 samples which are divided into 5 data sets for the purpose of model validation and extrapolation studies. The data sets, further described in reference 1, are:

Data set  Use               Samples
C         Training              129
M         Monitoring             43
T         Testing                43
E1        Extrapolation, Fat      8
E2        Extrapolation, Protein 17

The data for all 240 samples appear at the end of this file - 25 lines per sample. The data sets appear in the order of the table above. The spectra are preprocessed using a principal component analysis on the data set C, and the first 22 principal components (scaled to unit variance) are included for each sample. Thus, if you want to use the data for a standard (interpolation) test of your algorithm, use samples 1-172 for training and samples 173-215 for testing (ignoring the last 25 samples), and use the first 13 or so principal components to predict the fat content.
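Because the five data sets are stored back to back, their row ranges follow from the sizes in the table. The sketch below derives them as 0-based, half-open slices; the names `SET_SIZES` and `set_slices` are illustrative assumptions, not identifiers from the data file.

```python
# Sizes of the five data sets, in the order they appear in the file.
SET_SIZES = [("C", 129), ("M", 43), ("T", 43), ("E1", 8), ("E2", 17)]

def set_slices(sizes=SET_SIZES):
    """Map each data set name to a 0-based, half-open slice over the sample rows."""
    slices, start = {}, 0
    for name, count in sizes:
        slices[name] = slice(start, start + count)
        start += count
    return slices

slices = set_slices()
train_rows = slice(slices["C"].start, slices["M"].stop)  # C + M, i.e. samples 1-172
test_rows = slices["T"]                                   # samples 173-215
extrapolation_rows = slice(slices["E1"].start, slices["E2"].stop)  # last 25 samples
```

Given a list `samples` holding the 240 rows, `samples[train_rows]` and `samples[test_rows]` then give the standard interpolation split.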

Each line contains the 100 absorbances followed by the 22 principal components and finally the contents of moisture, fat and protein.
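The column layout described above (100 absorbances, 22 principal components, 3 contents) can be unpacked as follows. This is a sketch that assumes each sample has already been read into a flat list of 125 numbers; `split_row` and the placeholder row are hypothetical and only make the column positions visible.

```python
N_ABSORBANCE, N_PC, N_CONTENTS = 100, 22, 3

def split_row(row):
    """Split one 125-value row into absorbances, principal components and contents."""
    expected = N_ABSORBANCE + N_PC + N_CONTENTS
    if len(row) != expected:
        raise ValueError(f"expected {expected} values, got {len(row)}")
    absorbances = row[:N_ABSORBANCE]
    components = row[N_ABSORBANCE:N_ABSORBANCE + N_PC]
    moisture, fat, protein = row[N_ABSORBANCE + N_PC:]
    contents = {"moisture": moisture, "fat": fat, "protein": protein}
    return absorbances, components, contents

# Placeholder row (not real data): the values 0..124 equal their column index.
row = list(range(125))
absorbances, components, contents = split_row(row)
inputs = components[:13]   # the first 13 principal components, as suggested above
target = contents["fat"]   # the quantity to predict
```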

Preceding the data lines, the following lines appear:

real_in=122 real_out=3 training_examples=172 test_examples=43 extrapolation_examples=25
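The header is a sequence of whitespace-separated `key=value` pairs, so it can be read with a few lines of code. A minimal sketch, assuming all values are integers as in the line shown; `parse_header` is an illustrative name.

```python
def parse_header(line: str) -> dict:
    """Parse whitespace-separated key=value pairs with integer values."""
    fields = {}
    for token in line.split():
        key, _, value = token.partition("=")
        fields[key] = int(value)
    return fields

header = parse_header(
    "real_in=122 real_out=3 training_examples=172 "
    "test_examples=43 extrapolation_examples=25"
)
# The three example counts add up to the 240 samples in the file,
# and real_in + real_out gives the 125 values stored per sample.
total_examples = (
    header["training_examples"]
    + header["test_examples"]
    + header["extrapolation_examples"]
)
values_per_sample = header["real_in"] + header["real_out"]
```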

  3. More details on how to use the data

The data are made available as a benchmark for regression models. In order to compare models, it is practical to use the data set as follows:

C and M combined are used to tune (estimate, train) the model. (Some approaches set aside some training data to control overfitting. These data should be a subset of C+M. In (1) the subset M was used for this purpose.)

T is used to test the model once it has been tuned. If each model has an element of randomness (as is the case for neural networks) the most reliable measure of performance of a single model is obtained by selecting a handful of models on the basis of C+M and quoting the average of the performances on T. In the presence of randomness it is bad practice to train a lot of models on C+M and then select the best of these on the basis of T.
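The reporting protocol recommended above can be sketched as: rank the trained models by an error measured on C+M only, keep a handful, and quote the average of their errors on T. The dictionary keys `val_sep` and `test_sep`, the function name, and all numbers below are made up for illustration; they are not results on the Tecator data.

```python
from statistics import mean

def report_test_sep(models, n_selected=5):
    """Average the test SEP of the n_selected models with the lowest validation SEP.

    Each model is a dict with a "val_sep" (measured on held-out C+M data)
    and a "test_sep" (measured on T). Selection never looks at "test_sep".
    """
    chosen = sorted(models, key=lambda m: m["val_sep"])[:n_selected]
    return mean(m["test_sep"] for m in chosen)

# Illustrative runs of a randomly initialised model.
runs = [
    {"val_sep": 0.9, "test_sep": 1.0},
    {"val_sep": 0.7, "test_sep": 0.8},
    {"val_sep": 0.8, "test_sep": 0.6},
    {"val_sep": 1.2, "test_sep": 0.5},
]
reported = report_test_sep(runs, n_selected=2)

# Bad practice, shown only for contrast: picking the best run on T.
optimistic = min(m["test_sep"] for m in runs)
```

In this toy example the reported value is the average over the two runs with the lowest validation error, whereas selecting on T would quote the lower, optimistically biased figure.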

C, M and T are drawn from the same pool of data, so T is used to test the ability of the models to interpolate. The data sets E1 and E2 contain more fat and protein respectively and are intended to be used to test the ability of the models to extrapolate.

  4. Performance of neural network models

The performance is measured as the Standard Error of Prediction (SEP), which is the root mean square of the difference between the true and the predicted content.
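The SEP as defined here is a root mean square error and can be computed directly. A minimal sketch; the function name `sep` and the example values are illustrative and not taken from the data set.

```python
import math

def sep(true_values, predicted_values):
    """Root mean square of the differences between true and predicted contents."""
    if len(true_values) != len(predicted_values) or not true_values:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(t - p) ** 2 for t, p in zip(true_values, predicted_values)]
    return math.sqrt(sum(squared) / len(squared))

# Made-up fat percentages: the errors are 1, 1 and 2, so SEP = sqrt(6 / 3) = sqrt(2).
print(sep([10.0, 20.0, 30.0], [11.0, 19.0, 32.0]))
```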

For the prediction of fat on the data set T, the following results were obtained:

Reference SEP   method (see the papers for details)
(1)       0.65  10-6-1 network, early stopping
(2)       0.52  10-3-1 network, Bayesian
(3)       0.36  13-X-1 network, Bayesian, Automatic Relevance Determination

A linear model with 10 inputs yields SEP=2.78.

  5. References

(1) C. Borggaard and H. H. Thodberg, "Optimal Minimal Neural Interpretation of Spectra", Analytical Chemistry 64 (1992), p 545-551.

(2) H. H. Thodberg, "Ace of Bayes: Application of Neural Networks with Pruning", Manuscript 1132, Danish Meat Research Institute (1993), available by anonymous ftp in the file: pub/neuroprose/thodberg.ace-of-bayes.ps.Z on the Internet node archive.cis.ohio-state.edu (128.146.8.52).

(3) Revised and extended version of (2), in preparation, to be submitted to IEEE Trans. Neural Networks (1995) available by anonymous ftp in the file: pub/neuroprose/thodberg.bayesARD.ps.Z on the Internet node archive.cis.ohio-state.edu (128.146.8.52).

Hans Henrik Thodberg
Danish Meat Research Institute
Maglegaardsvej 2, Postboks 57
DK-4000 Roskilde, Denmark
Email: thodberg@nn.dmri.dk
Phone: (+45) 42 36 12 00
Fax: (+45) 42 36 48 36

real_in=122 real_out=3 training_examples=172 test_examples=43 extrapolation_examples=25

Note: all 240 samples are included in the same order as mentioned above

Information about the dataset
CLASSTYPE: numeric
CLASSINDEX: none specific

Names
absorbance_1,absorbance_2,absorbance_3,absorbance_4,absorbance_5,absorbance_6,absorbance_7,absorbance_8,absorbance_9,absorbance_10,
Types
  1. numeric
  2. numeric
  3. numeric
  4. numeric
  5. numeric
  6. numeric
  7. numeric
  8. numeric
  9. numeric
  10. numeric
Data (first 10 data points)
    abso... abso... abso... abso... abso... abso... abso... abso... abso... abso... ...
    2.61... 2.61... 2.61... 2.61... 2.61... 2.62... 2.62... 2.62... 2.62... 2.62... ...
    2.83... 2.83... 2.84... 2.84... 2.85... 2.85... 2.8606 2.86... 2.87... 2.87... ...
    2.58... 2.58... 2.58... 2.58... 2.58... 2.59... 2.59... 2.59... 2.59... 2.60... ...
    2.82... 2.8246 2.8263 2.82... 2.83... 2.83... 2.83... 2.83... 2.83... 2.84... ...
    2.78... 2.78... 2.79... 2.7935 2.79... 2.79... 2.79... 2.80... 2.80... 2.8089 ...
    3.00... 3.0154 3.02... 3.02... 3.0319 3.03... 3.04... 3.04... 3.05... 3.06... ...
    2.98... 2.99... 2.9998 3.0054 3.01... 3.01... 3.02... 3.03... 3.03... 3.04... ...
    2.52... 2.52... 2.52... 2.52... 2.52... 2.53... 2.53... 2.53... 2.53... 2.5371 ...
    3.27... 3.27... 3.28... 3.293 3.29... 3.30... 3.3131 3.32... 3.32... 3.33... ...
    3.39... 3.40... 3.41... 3.42... 3.42... 3.43... 3.44... 3.45... 3.45... 3.46... ...
    ... ... ... ... ... ... ... ... ... ... ...
Description

A gzip'ed tar containing StatLib datasets (statlib-20050214.tar.gz, 12,785,582 Bytes)

Publications
    Data Source
    http://lib.stat.cmu.edu/datasets/
