View RODRIGO (public)
- Summary
RODRIGO is the result of digitising and annotating an 853-page bound volume entitled "Historia de España del arçobispo Don Rodrigo", written entirely in old Castilian (Spanish) by a single author.
- License
- unknown
- Dependencies
- Tags
- Handwriting-Recognition Historical-Documents
- Attribute Types
- Download
- tar.gz (1.2 GB)
- Original Data Format
- tar.gz
- Name
- Version mldata
- Comment
- Names
- Data (first 10 data points)
- Description
"RODRIGO" is the result of digitizing and annotating a Spanish manuscript from 1545. It is an 853-page bound volume divided into 307 chapters, which describe a series of chronicles from Spanish history. Its first part was copied from an older (15th-century) manuscript, and later chronicles were then added to it.
RODRIGO presents several notable peculiarities:
- The writer tended to embellish the writing, especially in broad white spaces, so that some descenders (and ascenders) extend across the whole word.
- The writer frequently omitted natural spaces between words, e.g. the words "por su estudio" were written as the single word "porsuestudio", and sometimes added artificial spaces within a word, e.g. "llegaron se" was written instead of the word "llegaronse".
- Drop caps at the beginning of each chapter are missing. Experts agree that the RODRIGO manuscript was never brought to the artist (or press) in charge of adding them. In most cases, the drop-cap letter is annotated in small script in its reserved space.
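The irregular word spacing described above matters when comparing a transcription against a reference, since two strings can contain the same characters and still differ in where the words are split. The following is a minimal sketch, not part of the dataset or its tools; the function names are hypothetical and the approach (comparing strings with whitespace removed and listing word-boundary positions) is only one possible way to handle it.

```python
def strip_spaces(text: str) -> str:
    """Remove all whitespace so that word segmentation is ignored."""
    return "".join(text.split())


def boundaries(text: str) -> set:
    """Return word-boundary positions, counted in non-space characters.

    A boundary at position n means a space follows the n-th character
    of the unspaced string.
    """
    positions = set()
    count = 0
    words = text.split()
    for word in words[:-1]:  # no boundary after the last word
        count += len(word)
        positions.add(count)
    return positions


def same_up_to_spacing(written: str, reference: str) -> bool:
    """True if both strings have the same characters, ignoring spaces."""
    return strip_spaces(written) == strip_spaces(reference)


def spacing_differences(written: str, reference: str) -> set:
    """Boundary positions present in exactly one of the two strings."""
    return boundaries(written) ^ boundaries(reference)


# The two examples quoted in the description:
print(same_up_to_spacing("porsuestudio", "por su estudio"))
print(sorted(spacing_differences("porsuestudio", "por su estudio")))
print(sorted(spacing_differences("llegaron se", "llegaronse")))
```

Keeping the boundary positions separate from the character content makes it possible to score character recognition and word segmentation independently, if that distinction is useful for a given experiment.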
The manuscript was carefully scanned by experts from the "Ministerio de Cultura" at 300 dpi in true color; this version is also available for download. As with historical documents in general, the scanned pages show noise effects such as spots, tears, ink fading, show-through from the reverse side and non-uniform dewarping. They also show slight warping due to the book binding. Nevertheless, the manuscript can be read easily, so we decided not to apply any preprocessing (apart from de-saturation) to it for the purpose of annotating ground truth.
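The description does not say which de-saturation formula was applied. As an illustration only, the sketch below converts RGB pixels to gray levels with the common ITU-R BT.601 luma weights (0.299, 0.587, 0.114); both the choice of weights and the function names are assumptions, and the image is represented as a plain nested list so that the example needs no external library.

```python
def desaturate_pixel(rgb):
    """Map one (r, g, b) pixel with 0-255 channels to a gray level.

    Uses BT.601 luma weights in integer arithmetic (assumed formula,
    not confirmed by the dataset description); +500 rounds to nearest.
    """
    r, g, b = rgb
    return (299 * r + 587 * g + 114 * b + 500) // 1000


def desaturate(image):
    """Convert an image given as rows of (r, g, b) tuples to gray levels."""
    return [[desaturate_pixel(pixel) for pixel in row] for row in image]


# A tiny 2x2 "page": white paper, black ink, and two reddish stains.
page = [
    [(255, 255, 255), (0, 0, 0)],
    [(255, 0, 0), (200, 150, 100)],
]
print(desaturate(page))
```

For real page scans, an image library would normally do this conversion; the pure-Python version is only meant to make the arithmetic explicit.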
Ground-truth annotation of RODRIGO consisted of two parts. On the one hand, all text blocks were marked with enclosing rectangles and, within each text block, each text line was marked by its (straight) baseline. This was done semi-automatically by means of the GiDOC prototype. All automatically detected blocks and baselines were also manually supervised and corrected when needed. On the other hand, the whole manuscript was transcribed line by line by palaeography experts.
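The annotation scheme just described (rectangular text blocks, straight baselines per line, and a line-level transcription) can be modelled with a small data structure. The sketch below is hypothetical: the class and field names are invented for illustration, and it does not reflect the actual file format of the dataset or of GiDOC, neither of which is specified here.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Point = Tuple[int, int]  # (x, y) in page pixel coordinates


@dataclass
class TextLine:
    # A straight baseline is fully described by its two endpoints.
    baseline: Tuple[Point, Point]
    transcription: str = ""


@dataclass
class TextBlock:
    # Enclosing rectangle of the block.
    left: int
    top: int
    right: int
    bottom: int
    lines: List[TextLine] = field(default_factory=list)

    def contains(self, point: Point) -> bool:
        """True if the point lies inside or on the rectangle."""
        x, y = point
        return self.left <= x <= self.right and self.top <= y <= self.bottom

    def stray_lines(self) -> List[TextLine]:
        """Lines with a baseline endpoint outside the block rectangle.

        A check like this could support manual supervision of
        automatically detected layout.
        """
        return [
            line for line in self.lines
            if not all(self.contains(p) for p in line.baseline)
        ]


block = TextBlock(left=100, top=200, right=900, bottom=1400)
block.lines.append(TextLine(((120, 260), (880, 262)), "porsuestudio"))
block.lines.append(TextLine(((120, 320), (950, 322))))  # runs past the right edge
print(len(block.stray_lines()))
```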
- URLs
- http://prhlt.iti.upv.es/w/rodrigo.php
- Publications
- Data Source
- Measurement Details
- Usage Scenario
- revision 2
- by nserrano on 2012-02-21 09:54
- revision 3
- by nserrano on 2012-03-07 12:53
- revision 4
- by nserrano on 2012-03-07 12:55
- revision 5
- by nserrano on 2012-03-07 13:01
- revision 6
- by nserrano on 2012-03-07 13:08
This item was downloaded 1989 times and viewed 4779 times.
Disclaimer
We are acting in good faith to make datasets submitted for the use of the scientific community available to everybody, but if you are a copyright holder and would like us to remove a dataset please inform us and we will do it as soon as possible.
Acknowledgements
This project is supported by PASCAL (Pattern Analysis, Statistical Modelling and Computational Learning), http://www.pascal-network.org/.