View RODRIGO (public)

2012-03-07 13:08 by nserrano | Version 6

Summary: RODRIGO is the result of digitising and annotating an 853-page bound volume entitled "Historia de España del arçobispo Don Rodrigo", written entirely in old Castilian (Spanish) by a single author.
License: unknown
Dependencies: GERMANA
Tags: Handwriting-Recognition Historical-Documents
Download: tar.gz (1.2 GB)

Original Data Format

tar.gz

Description

RODRIGO is the result of digitizing and annotating a Spanish manuscript from 1545. It is an 853-page bound volume divided into 307 chapters, which describe a series of chronicles from Spanish history. Its first part was copied from an older (15th-century) manuscript, and later chronicles were then added to it.

RODRIGO has several notable peculiarities:

The writer tended to embellish the writing, especially in broad white spaces, so that some descenders (and ascenders) extend across the whole word.
The writer frequently omitted natural spaces between words, e.g. the words "por su estudio" were written as the single word "porsuestudio", or even added artificial spaces within a word, e.g. "llegaron se" was written instead of the word "llegaronse".
Drop caps at the beginning of each chapter are missing. Experts agree that the RODRIGO manuscript was never brought to the artist (or press) in charge of adding them. In most cases, the drop cap letter is annotated in small script in its reserved space.

The manuscript was carefully scanned by experts from the "Ministerio de Cultura" at 300 dpi in true color; the scanned version is available for download (see Data Source). As with historical documents in general, the scanned pages show noise effects such as spots, tears, ink fading, show-through from the reverse side and non-uniform dewarping. They also show slight warping due to the book binding. Nevertheless, the manuscript can be read easily, so we decided not to apply any preprocessing (apart from de-saturation) to it for the purpose of annotating ground truth.

Ground-truth annotation of RODRIGO consisted of two parts. On the one hand, all text blocks were marked with enclosing rectangles and, within each text block, each text line was marked by its (straight) baseline. This was done semi-automatically by means of the GiDOC prototype. All blocks and baselines detected automatically were also manually supervised, and corrected when needed. On the other hand, the whole manuscript was transcribed line by line, by palaeography experts.

URLs

http://prhlt.iti.upv.es/w/rodrigo.php

Publications

http://www.lrec-conf.org/proceedings/lrec2010/pdf/477_Paper.pdf

Annotation of digitized pages from historical document collections is very important to research on automatic extraction of text blocks, lines, and handwriting recognition. We have recently introduced a new handwritten text database, GERMANA, which is based on a Spanish manuscript from 1891. To our knowledge, GERMANA is the first publicly available database mostly written in Spanish and comparable in size to standard databases. In this paper, we present another handwritten text database, RODRIGO, completely written in Spanish and comparable in size to GERMANA. However, RODRIGO comes from a much older manuscript, from 1545, where the typical difficult characteristics of historical documents are more evident. In particular, the writing style, which has clear Gothic influences, is significantly more complex than that of GERMANA. We also provide baseline results of handwriting recognition for reference in future studies, using standard techniques and tools for preprocessing, feature extraction, HMM-based image modelling, and language modelling.

Data Source

The dataset is an annotated version of the scanned manuscript held at the "Biblioteca Virtual del Patrimonio Bibliográfico".

Measurement Details

The typical task in RODRIGO is to transcribe the manuscript from beginning to end. Transcription quality is expressed in terms of Word Error Rate (WER).
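As an illustration of the metric, here is a minimal sketch of WER under its usual definition: the word-level edit distance (substitutions, insertions and deletions) between a reference and a hypothesis transcription, divided by the number of reference words. The function name `word_error_rate` and whitespace tokenization are assumptions made for this example; they are not taken from the RODRIGO tools or the accompanying paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Tokenization by whitespace is an assumption of this sketch.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # i deletions turn i reference words into an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Word segmentation alone can dominate the score: if the reference is
# "por su estudio" and a system outputs the joined form, every word counts
# as an error.
print(word_error_rate("por su estudio", "porsuestudio"))
```

Because the denominator is the reference length, WER can exceed 1.0 when the hypothesis contains many extra words, so it is a rate rather than a percentage bounded by 100.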

Usage Scenario

RODRIGO annotation includes line transcriptions, as well as their corresponding baselines, and text block coordinates. This annotation can be used to test methods for: document layout analysis, line detection, and handwritten text recognition.
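The three kinds of annotation can be pictured with a small in-memory model. This is only a sketch: the class names `TextLine` and `TextBlock`, the pixel coordinate convention and the example values are hypothetical, and the on-disk format of the RODRIGO archive is not described on this page.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Point = Tuple[int, int]  # (x, y) in pixels; this convention is assumed


@dataclass
class TextLine:
    baseline: Tuple[Point, Point]  # straight baseline given by its two end points
    transcription: str             # line-level transcription


@dataclass
class TextBlock:
    # Enclosing rectangle of the text block (hypothetical field names).
    left: int
    top: int
    right: int
    bottom: int
    lines: List[TextLine] = field(default_factory=list)

    def contains(self, point: Point) -> bool:
        x, y = point
        return self.left <= x <= self.right and self.top <= y <= self.bottom

    def stray_lines(self) -> List[TextLine]:
        """Lines with a baseline end point outside the block rectangle."""
        return [
            line for line in self.lines
            if not all(self.contains(p) for p in line.baseline)
        ]


# Made-up values, for illustration only.
block = TextBlock(left=100, top=200, right=1500, bottom=2200)
block.lines.append(TextLine(((120, 260), (1480, 262)), "porsuestudio"))
print(len(block.stray_lines()))
```

A consistency check such as `stray_lines` is the kind of test a layout-analysis or line-detection method could run against the ground truth before comparing its own output to it.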

revision 2: by nserrano on 2012-02-21 09:54
revision 3: by nserrano on 2012-03-07 12:53
revision 4: by nserrano on 2012-03-07 12:55
revision 5: by nserrano on 2012-03-07 13:01
revision 6: by nserrano on 2012-03-07 13:08


This item was downloaded 2096 times and viewed 5111 times.

Acknowledgements

This project is supported by PASCAL (Pattern Analysis, Statistical Modelling and Computational Learning)

http://www.pascal-network.org/.