View uci-20070111 kdd_coil_test-5 (public)

2011-09-14 15:15 by mldata | Version 1
Summary

(No information yet)

License
unknown (from Weka repository)
Dependencies
Tags
arff slurped Weka
Attribute Types
Floating Point,String
Download
# Instances: 133 / # Attributes: 12
HDF5 (43.4 KB) XML CSV ARFF LibSVM Matlab Octave

Original Data Format
arff
Name
coil-test-5
Version mldata
0
Comment

%%%%%%%%%%%%%%%%%% Data-Description % %%%%%%%%%%%%%%%%%%

                      COIL 1999 Competition Data

Data Type

multivariate

Abstract

This data set is from the 1999 Computational Intelligence and Learning (COIL) competition. The data contains measurements of river chemical concentrations and algae densities.

Sources

Original Owner

[1]ERUDIT European Network for Fuzzy Logic and Uncertainty Modelling in Information Technology

Donor

Jens Strackeljan
Technical University Clausthal
Institute of Applied Mechanics
Graupenstr. 3, 38678 Clausthal-Zellerfeld, Germany
[2]tmjs@itm.tu-clausthal.de

Date Donated: September 9, 1999

Data Characteristics

This data comes from a water quality study in which samples were taken from sites on different European rivers over a period of approximately one year. These samples were analyzed for various chemical substances, including nitrogen in the form of nitrates, nitrites and ammonia, phosphate, pH, oxygen and chloride. In parallel, algae samples were collected to determine the algae population distributions.

Other Relevant Information

The competition involved the prediction of algal frequency distributions on the basis of the measured concentrations of the chemical substances and the global information concerning the season when the sample was taken, the river size and its flow velocity. The competition [3]instructions contain additional information on the prediction task.

Data Format

There are a total of 340 examples, each containing 18 values. The first 11 values of each example are the season, the river size, the fluid velocity and 8 chemical concentrations which should be relevant for the algae population distribution. The last 7 values of each example are the frequencies of different kinds of algae. These kinds are only a very small part of the whole community; for the competition their number was limited to 7. The value 0.0 means that the frequency is very low. The data set also contains some empty fields, which are labeled with the string XXXXX.

The training data are saved in the file: analysis.data (ASCII format).

Table 1: Structure of the file analysis.data

A          ...  K           a          ...  g
CC[1,1]    ...  CC[1,11]    AG[1,1]    ...  AG[1,7]
...        ...  ...         ...        ...  ...
CC[200,1]  ...  CC[200,11]  AG[200,1]  ...  AG[200,7]

Explanation:
CC[i,j]: Chemical concentration or river characteristic
AG[i,j]: Algal frequency

The chemical parameters are labeled A, ..., K. The algae columns are labeled a, ..., g.
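The record layout in Table 1 can be sketched in Python as follows. This is a minimal sketch, not the competition's own tooling: it assumes that fields are separated by whitespace (the description only says "ASCII format"), and the names `parse_value` and `parse_record` are hypothetical. The `XXXXX` marker for empty fields is taken from the description above.

```python
from typing import List, Tuple, Union

Value = Union[float, str, None]

MISSING = "XXXXX"   # marker for empty fields, as stated in the description
N_DESCRIPTORS = 11  # columns A..K: season, river size, velocity, 8 concentrations
N_ALGAE = 7         # columns a..g: algal frequencies


def parse_value(token: str) -> Value:
    """Map the missing marker to None, numbers to float, anything else to str."""
    if token == MISSING:
        return None
    try:
        return float(token)
    except ValueError:
        return token  # nominal value such as a season name


def parse_record(line: str) -> Tuple[List[Value], List[Value]]:
    """Split one training line into (descriptors, algal frequencies).

    Assumption: fields are whitespace-separated.
    """
    tokens = line.split()
    expected = N_DESCRIPTORS + N_ALGAE
    if len(tokens) != expected:
        raise ValueError(f"expected {expected} fields, got {len(tokens)}")
    values = [parse_value(t) for t in tokens]
    return values[:N_DESCRIPTORS], values[N_DESCRIPTORS:]


# Made-up example line (not taken from the data set).
example = "winter small_ medium 8.0 9.8 XXXXX 6.2 578.0 105.0 170.0 50.0 0.0 0.0 0.0 0.0 34.2 8.3 0.0"
descriptors, algae = parse_record(example)
```

Returning `None` for missing fields keeps them distinguishable from the value 0.0, which the description reserves for a very low frequency.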

Past Usage

[4]The Third (1999) International COIL Competition Home Page

[5]The UCI KDD Archive
[6]Information and Computer Science
[7]University of California, Irvine
Irvine, CA 92697-3425

Last modified: October 13, 1999

References

  1. http://www.erudit.de/
  2. mailto:tmjs@itm.tu-clausthal.de
  3. file://localhost/research/ml/datasets/uci/raw/data/ucikdd/coil/instructions.txt
  4. http://www.erudit.de/erudit/activities/ic-99/index.htm
  5. http://kdd.ics.uci.edu/
  6. http://www.ics.uci.edu/
  7. http://www.uci.edu/

%%%%%%%%%%%%%%%%%% Task-Description % %%%%%%%%%%%%%%%%%%

Third International Competition

Protecting rivers and streams by monitoring chemical concentrations and algae communities.

Intelligent Techniques for Monitoring Water Quality using chemical indicators and algae population

Recent years have been characterised by increasing concern at the impact man is having on the environment.
The impact on the environment of toxic waste, from a wide variety of manufacturing processes, is well known. More recently, however, it has become clear that the more subtle effects of nutrient level and chemical balance changes arising from farming land run-off and sewage water treatment also have a serious, but indirect, effect on the states of rivers, lakes and even the sea. In temperate climates across the world, summers are characterized by numerous reports of excessive algae growth resulting in poor water clarity, mass deaths of river fish from reduced oxygen levels and the closure of recreational water facilities on account of the toxic effects of this annual algal bloom. Reducing the impact of these man-made changes in river nutrient levels has stimulated much biological research with the aim of identifying the crucial chemical control variables for the biological processes.

The data used in this problem comes from one such study. During the research study, water quality samples were taken from sites on different European rivers over a period of approximately one year. These samples were analyzed for various chemical substances, including nitrogen in the form of nitrates, nitrites and ammonia, phosphate, pH, oxygen and chloride. In parallel, algae samples were collected to determine the algae population distributions. It is well known that the dynamics of the algae community are determined by the external chemical environment, with one or more factors being predominant. While the chemical analysis is cheap and easily automated, the biological part involves microscopic examination, requires trained manpower and is therefore both expensive and slow.

Diatoms like Cymbella are major contributors to primary production throughout the world. The diatom reacts very sensitively to even small changes in acidity.

Over a three-and-a-half-billion-year history, algae have evolved and adapted as primary plant colonizers of almost every known habitat in terrestrial and aquatic environments. They respond very rapidly to man-made environmental changes.

The relationship between the chemical and biological features is complex and can be expected to need the application of advanced techniques. Typical of such real-life problems, the particular data set for the problem contains a mixture of (fuzzy) qualitative variables and numerical measurement values, with much of the data being incomplete.

The competition task is the prediction of algal frequency distributions on the basis of the measured concentrations of the chemical substances and the global information concerning the season when the sample was taken, the river size and its flow velocity. The last two variables are given as linguistic variables.

340 data sets were taken, and each contains 18 values. The first 11 values of each data set are the season, the river size, the fluid velocity and 8 chemical concentrations which should be relevant for the algae population distribution. The last 7 values of each data set are the frequencies of different kinds of algae. These kinds are only a very small part of the whole community; for the competition their number was limited to 7. The value 0.0 means that the frequency is very low. The data set also contains some empty fields, which are labeled with the string XXXXX.

Each participant in the competition receives 200 complete data sets (training data) and 140 data sets (evaluation data) containing only the 11 values of the river descriptions and the chemical concentrations.

This training data is to be used to obtain a 'model' that predicts the algal distributions associated with the evaluation data.

The training data are saved in the file:

analysis.txt (ASCII format).

Structure of the file analysis.txt

A        ...  K         a        ...  g
CC1,1    ...  CC1,11    AG1,1    ...  AG1,7
...      ...  ...       ...      ...  ...
CC200,1  ...  CC200,11  AG200,1  ...  AG200,7

Explanation:
CCi,j: Chemical concentration, j=1,...,11
AGi,k: Algal frequency, k=1,...,7

The chemical parameters are labeled A, ..., K. The algae columns are labeled a, ..., g.

Evaluation data are saved in file eval.txt (ASCII format).

Table 2: Structure of the file eval.*

A        ...  K
CC1,1    ...  CC1,11
...      ...  ...
CC140,1  ...  CC140,11


Objective

The objective of the competition is to provide a prediction model on the basis of the training data. Having obtained this prediction model, each participant must provide the solution in the form of the results of applying this model to the evaluation data. The results obtained in this way should correspond to the true results for the evaluation data (which are known to the organizer). The criterion used to evaluate the results is given below. All 7 algae frequency distributions must be determined. For this purpose any number of partial models may be developed.
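To make the shape of a submission concrete, here is a deliberately trivial baseline in Python: one "partial model" per algae column that always predicts the mean of that column in the training data. It is only an illustrative sketch and not a method proposed by the organizers; the function names are hypothetical, and skipping missing targets (represented as `None`) is an assumption.

```python
from typing import List, Optional, Sequence


def column_means(train_targets: Sequence[Sequence[Optional[float]]]) -> List[float]:
    """Mean of each algae column over the training rows, ignoring None entries."""
    n_cols = len(train_targets[0])
    means = []
    for j in range(n_cols):
        observed = [row[j] for row in train_targets if row[j] is not None]
        # Fall back to 0.0 if a column has no observed value at all.
        means.append(sum(observed) / len(observed) if observed else 0.0)
    return means


def predict_baseline(means: Sequence[float], n_samples: int) -> List[List[float]]:
    """Build an n_samples x len(means) result matrix that repeats the column means."""
    return [list(means) for _ in range(n_samples)]


# Toy training targets with 2 columns instead of 7, for brevity.
toy_targets = [[1.0, None], [3.0, 4.0]]
toy_means = column_means(toy_targets)
toy_results = predict_baseline(toy_means, n_samples=3)
```

With the real data, the targets would have 7 columns and `n_samples` would be 140, giving a result matrix of the size the competition asks for.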


Judgment of the results

To judge the results, the sum of squared errors will be calculated. The following table describes the results of a particular participant.

Matrix of results

a         ...  g
Res1,1    ...  Res1,7
...       ...  ...
Res140,1  ...  Res140,7

All solutions that achieve the smallest total error will be regarded as winners of the contest.
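The judging criterion can be sketched in Python as below. The text does not say whether the errors are weighted or normalized per algae column, so this sketch assumes a plain sum of squared differences over every entry of the result matrix; the function name `total_squared_error` is hypothetical.

```python
from typing import Sequence


def total_squared_error(
    predicted: Sequence[Sequence[float]],
    actual: Sequence[Sequence[float]],
) -> float:
    """Sum of (predicted - actual)^2 over all rows and all algae columns."""
    if len(predicted) != len(actual):
        raise ValueError("row counts differ")
    total = 0.0
    for p_row, a_row in zip(predicted, actual):
        if len(p_row) != len(a_row):
            raise ValueError("column counts differ")
        total += sum((p - a) ** 2 for p, a in zip(p_row, a_row))
    return total


# Toy 2 x 2 example: (1-0)^2 + (2-2)^2 + (0-0)^2 + (0-3)^2 = 10.0
error = total_squared_error([[1.0, 2.0], [0.0, 0.0]], [[0.0, 2.0], [0.0, 3.0]])
```

For a competition entry, `predicted` would be the 140 x 7 matrix of results and `actual` the measured frequencies held by the organizer.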

Information about the dataset
CLASSTYPE: numeric
CLASSINDEX: last

ALGAE #: 5/7

Names
season,river_size,fluid_velocity,concentration_1,concentration_2,concentration_3,concentration_4,concentration_5,concentration_6,concentration_7,
Types
  1. nominal:autumn,spring,summer,winter
  2. nominal:large_,medium,small_
  3. nominal:high__,low___,medium
  4. numeric
  5. numeric
  6. numeric
  7. numeric
  8. numeric
  9. numeric
  10. numeric
Data (first 10 data points)
    season rive... flui... conc... conc... conc... conc... conc... conc... conc... ...
    summer small_ medium 7.95 5.7 57.333 2.46 273.... 295.... 380.0 ...
    winter small_ medium 7.98 8.8 59.333 7.392 286.... 33.333 138.0 ...
    summer small_ medium 8.0 7.2 80.0 1.957 174.... 47.857 113.... ...
    spring small_ high__ 8.35 8.4 68.0 3.026 458.0 45.2 111.8 ...
    spring small_ medium 8.1 13.2 19.0 0.0 130.0 6.0 40.0 ...
    summer small_ medium 8.37 12.1 12.85 0.84 15.0 5.0 10.507 ...
    spring small_ high__ 7.31 9.9 6.0 1.395 58.75 6.0 16.0 ...
    autumn small_ high__ 7.91 11.2 5.0 1.383 6.0 24.333 30.0 ...
    summer small_ high__ 7.99 10.7 4.0 1.368 117.0 17.25 44.75 ...
    autumn small_ high__ 7.82 11.5 8.18 1.488 39.0 16.0 139.5 ...
    ... ... ... ... ... ... ... ... ... ... ...
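The nominal labels listed under Types appear to be padded with trailing underscores to a fixed width (`large_`, `small_`, `high__`, `low___`). A small Python sketch of how they might be cleaned and one-hot encoded before modelling is given below; the helper names are hypothetical and the padding interpretation is an assumption drawn from the labels themselves.

```python
from typing import List, Sequence

# Category lists as they appear under "Types", with the padding removed.
SEASONS = ["autumn", "spring", "summer", "winter"]
RIVER_SIZES = ["large", "medium", "small"]
VELOCITIES = ["high", "low", "medium"]


def clean_label(label: str) -> str:
    """Strip the trailing underscore padding, e.g. 'low___' -> 'low'."""
    return label.rstrip("_")


def one_hot(value: str, categories: Sequence[str]) -> List[int]:
    """Encode a cleaned label as a 0/1 indicator vector over the categories."""
    cleaned = clean_label(value)
    if cleaned not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if cleaned == c else 0 for c in categories]


# First row of the preview: summer, small_, medium
encoded = (
    one_hot("summer", SEASONS)
    + one_hot("small_", RIVER_SIZES)
    + one_hot("medium", VELOCITIES)
)
```

One-hot encoding is only one option; since river size and flow velocity are ordered, an ordinal encoding would be an equally reasonable choice.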
Description

A gzip'ed tar containing UCI and UCI KDD datasets (uci-20070111.tar.gz, 17,952,832 Bytes)

URLs
(No information yet)
Publications
    Data Source
    http://www.ics.uci.edu/~mlearn/MLRepository.html http://kdd.ics.uci.edu/
    Measurement Details
    Usage Scenario

    This item was downloaded 2994 times and viewed 2135 times.
