View RODRIGO (public)
RODRIGO is the result of digitising and annotating an 853-page bound volume, entitled "Historia de España del arçobispo Don Rodrigo", written entirely in old Castilian (Spanish) by a single author.
- Handwriting-Recognition Historical-Documents
- Attribute Types
tar.gz (1.2 GB)
- Original Data Format
- Version mldata
"RODRIGO" is the result of digitizing and annotating a Spanish manuscript from 1545. It is an 853-page bound volume divided into 307 chapters, which describe a series of chronicles of Spanish history. Its first part was copied from an older (15th-century) manuscript, followed by an addition of later chronicles.
RODRIGO presents a series of important singularities:
- The writer tended to embellish the writing, especially in broad white spaces, which results in some descenders (and ascenders) extending across the whole word.
- The writer frequently omitted natural spaces between words, e.g. the words "por su estudio" were written as the single word "porsuestudio"; or even added artificial spaces within a word, e.g. "llegaron se" was written instead of the word "llegaronse".
- Drop caps at the beginning of each chapter are missing. Experts agree that the RODRIGO manuscript was never brought to the artist (or press) in charge of adding them. In most cases, the drop-cap letter is annotated in small script in its reserved space.
The manuscript was carefully scanned by experts from the "Ministerio de Cultura" at 300 dpi in true colour; this version can be downloaded from here. As with historical documents in general, the scanned pages have noise effects such as spots, tears, ink fading, transparency of the back side and non-uniform dewarping. They also show a slight warping due to the book binding. Nevertheless, the manuscript can be read easily, and thus we decided not to apply any preprocessing to it (apart from de-saturation) for the purpose of annotating ground truth.
Ground-truth annotation of RODRIGO consisted of two parts. First, all text blocks were marked with enclosing rectangles and, within each text block, each text line was marked by its (straight) baseline. This was done semi-automatically by means of the GiDOC prototype; all automatically detected blocks and baselines were also manually reviewed and corrected when needed. Second, the whole manuscript was transcribed line by line by palaeography experts.
Annotation of digitized pages from historical document collections is very important to research on automatic extraction of text blocks, lines, and handwriting recognition. We have recently introduced a new handwritten text database, GERMANA, which is based on a Spanish manuscript from 1891. To our knowledge, GERMANA is the first publicly available database mostly written in Spanish and comparable in size to standard databases. In this paper, we present another handwritten text database, RODRIGO, completely written in Spanish and comparable in size to GERMANA. However, RODRIGO comes from a much older manuscript, from 1545, where the typical difficult characteristics of historical documents are more evident. In particular, the writing style, which has clear Gothic influences, is significantly more complex than that of GERMANA. We also provide baseline results of handwriting recognition for reference in future studies, using standard techniques and tools for preprocessing, feature extraction, HMM-based image modelling, and language modelling.
- Data Source
- The dataset is an annotated version of the scanned manuscript available at the "Biblioteca Virtual del Patrimonio Bibliográfico".
- Measurement Details
The typical task in RODRIGO is to transcribe it from beginning to end. Transcription quality is expressed in terms of word error rate (WER).
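As an illustration of the metric, the following is a minimal sketch of WER computed as the word-level edit distance (substitutions, insertions and deletions) divided by the number of reference words. The function name `word_error_rate` and the whitespace tokenisation are assumptions made for this example; the evaluation protocol used for the published baseline results may differ in tokenisation and normalisation.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: words are separated by whitespace and compared exactly
    (no case folding or punctuation normalisation).
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing in hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# A missing word boundary is costly at the word level: if the reference
# is "por su estudio" and the hypothesis is "porsuestudio", all three
# reference words count as errors.
print(word_error_rate("por su estudio", "porsuestudio"))
```

Because WER is normalised by the reference length, it can exceed 1.0 when the hypothesis contains many inserted words.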
- Usage Scenario
RODRIGO annotation includes line transcriptions, as well as their corresponding baselines, and text block coordinates. This annotation can be used to test methods for: document layout analysis, line detection, and handwritten text recognition.
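To make the three annotation levels concrete, here is a hypothetical in-memory representation of one text block with its lines. The class names, field names and coordinate convention (pixel coordinates with the origin at the top-left corner) are assumptions for illustration only; they do not describe the actual file format of the distributed archive.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Point = Tuple[int, int]  # (x, y) in pixels; assumed origin at top-left


@dataclass
class TextLine:
    baseline: Tuple[Point, Point]  # straight baseline: start and end points
    transcription: str


@dataclass
class TextBlock:
    # Enclosing rectangle of the block (hypothetical field names).
    left: int
    top: int
    right: int
    bottom: int
    lines: List[TextLine] = field(default_factory=list)

    def contains(self, point: Point) -> bool:
        x, y = point
        return self.left <= x <= self.right and self.top <= y <= self.bottom


def lines_outside_block(block: TextBlock) -> List[TextLine]:
    """Return lines whose baseline endpoints fall outside the block.

    A simple sanity check of the kind one might run after loading
    layout annotations, before using them to evaluate line detection.
    """
    return [
        line
        for line in block.lines
        if not all(block.contains(p) for p in line.baseline)
    ]


block = TextBlock(
    left=100, top=200, right=900, bottom=1200,
    lines=[
        TextLine(((120, 260), (880, 260)), "por su estudio"),
        TextLine(((120, 1300), (880, 1300)), "llegaron se"),
    ],
)
print([line.transcription for line in lines_outside_block(block)])
```

Keeping baselines and transcriptions together per line lets the same structure serve layout analysis, line detection and recognition experiments.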
This project is supported by PASCAL (Pattern Analysis, Statistical Modelling and Computational Learning)