AI-based approaches have brought a big leap forward in text recognition for historical documents, especially for non-Latin scripts. Since autumn 2018, FID4SA has been using the Transkribus platform, developed as part of the READ project, for text recognition of the Devanāgarī holdings of the Naval Kishore Press collection. Various data models for text recognition were trained with Transkribus for these holdings. They delivered very good results, with a Character Error Rate (CER) between 5.59% and 0.83%, so that subsequent correction of the recognition results can now largely be dispensed with.
Our data model Devanagari mixed M1A for the recognition of printed Devanāgarī texts is available to Transkribus users on the Transkribus website as a public model for re-use. Further models are in preparation.
The workflow for OCR with Transkribus
- To feed the Transkribus HTR+ engine with training data for text recognition, the document facsimile must first undergo page segmentation. This means that the text regions and lines must be defined in the document. This can be done manually or automatically using the layout analysis tools available through the Transkribus platform.
With P2PaLA, Transkribus also offers an open-source tool for text structure recognition, with which individual models for page segmentation can be trained. The FID4SA team has trained P2PaLA models for the Pothi format, which is based on Indian manuscripts, and for the Naval Kishore Press standard book format. With these models, the page segmentation of the historical book prints can now be automated with little manual correction effort.
- After page segmentation, Ground Truth (GT) transcriptions can be created using the text editor. These are 1:1 copies of the text on the document facsimile and are created manually. Alternatively, existing transcriptions can be imported into Transkribus. For printed texts, a data model for automatic text recognition can be trained on the basis of approx. 5,000 words of GT transcription. As a rule of thumb, the more training data a model is trained on, the better the recognition accuracy, i.e., the lower the Character Error Rate (CER).
- Various export functions and export formats are available for further processing of the documents outside of Transkribus, e.g., ALTO, PDF, TEI, TXT.
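As an illustration of further processing outside of Transkribus, the following minimal Python sketch reads the recognized text lines from an ALTO XML document. It relies only on the element and attribute names defined by the ALTO standard (TextLine, String, CONTENT). The sample document, the function names and the assumption that the export contains String elements directly inside TextLine elements are illustrative and not taken from an actual FID4SA export.

```python
import xml.etree.ElementTree as ET


def local_name(tag: str) -> str:
    """Strip an XML namespace such as '{http://...}TextLine' down to 'TextLine'."""
    return tag.rsplit("}", 1)[-1]


def alto_text_lines(xml_text: str) -> list[str]:
    """Return one string per ALTO TextLine, joining its String/@CONTENT values.

    The namespace is ignored on purpose so that the sketch works regardless
    of the ALTO schema version used by the export (an assumption).
    """
    root = ET.fromstring(xml_text)
    lines = []
    for elem in root.iter():
        if local_name(elem.tag) != "TextLine":
            continue
        words = [
            child.get("CONTENT", "")
            for child in elem
            if local_name(child.tag) == "String"
        ]
        lines.append(" ".join(words))
    return lines


# Hypothetical, hand-written sample; a real export carries coordinates,
# IDs and further metadata that are omitted here.
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v4#">
  <Layout><Page><PrintSpace><TextBlock>
    <TextLine><String CONTENT="नवल"/><SP/><String CONTENT="किशोर"/></TextLine>
    <TextLine><String CONTENT="प्रेस"/></TextLine>
  </TextBlock></PrintSpace></Page></Layout>
</alto>"""

if __name__ == "__main__":
    for line in alto_text_lines(SAMPLE_ALTO):
        print(line)
```

Matching on local element names keeps the sketch independent of the schema version; for production use, validating against the specific ALTO version of the export would be the more robust choice.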
Our Current Focus of Work
Based on the very good results with documents printed using hot metal typesetting, ground truth transcriptions are currently being created for texts from the NKP collection that were printed using lithography. These form the basis for training data models for OCR. In the lithographic printing process, the texts are applied to the lithographic stone by hand by various scribes and calligraphers. Since these are in effect handwritten materials, the particular challenge is to train data models that achieve a good CER across different hands.
The second focus of work is the training of data models for texts in Devanāgarī script based on ground truth "transliterations". The Latin transliteration of the Devanāgarī text on the document facsimile is used as GT training material. A first data model based on approx. 9,000 words delivers a promising result with a CER of 4.05% for the validation data set.
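For orientation, a CER value such as the one above is commonly defined as the edit distance (insertions, deletions and substitutions of characters) between the recognized text and the ground truth, divided by the length of the ground truth. The following Python sketch implements this common definition. Whether Transkribus computes its CER in exactly this way, including the Unicode normalization step, is an assumption, and the function names and example strings are illustrative.

```python
import unicodedata


def levenshtein(reference: str, hypothesis: str) -> int:
    """Minimum number of character insertions, deletions and substitutions."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance divided by the length of the reference (ground truth).

    Both strings are normalized to NFC first, so that a precomposed
    character and its decomposed form count as the same character
    (an assumption made for this sketch).
    """
    ref = unicodedata.normalize("NFC", reference)
    hyp = unicodedata.normalize("NFC", hypothesis)
    if not ref:
        raise ValueError("reference must not be empty")
    return levenshtein(ref, hyp) / len(ref)


if __name__ == "__main__":
    # One wrong character out of six in a transliterated word.
    cer = character_error_rate("kiśora", "kisora")
    print(f"CER: {cer:.2%}")
```

The sketch counts Unicode code points; for Devanāgarī text this means that a consonant and its vowel sign are counted as separate characters.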
Provision of the Texts
The automatically recognized texts are available to users as image facsimiles and as searchable full text in the original script and in Latin transliteration. The web presentation of the edited texts takes place via our portal Naval Kishore Press - digital. DWork, software developed in-house in Heidelberg, is used for this. The software has a modular design and supports all the individual steps of the digitization workflow - from the creation of metadata and scan processing to the creation of the web presentation with annotation and comment functions. It enables researchers to work on texts collaboratively from different locations and at different times.
FID4SA exchanges information with other national and international projects that use AI methods for structure and text recognition. For example, there is a close professional exchange between the FID4SA team and the digital curator of the Two Centuries of Indian Print project at the British Library. This project digitized more than 1,000 historical Bengali books and trained data models for Bengali text recognition using Transkribus.