FID for South Asia: Text Recognition

Optical Character Recognition of Historical Devanāgarī Printed Works

AI-based approaches to text recognition have advanced considerably for historical documents. Since autumn 2018, FID4SA has been using the Transkribus platform for text recognition of the Devanāgarī part of the Naval Kishore Press collection. Several data models for text recognition were trained with Transkribus for this early 19th-century material and delivered very good results, with a Character Error Rate (CER) between 5.59% and 0.83%. As a result, subsequent manual correction of the recognition results can now largely be dispensed with.
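To make the CER figures above concrete, here is a minimal sketch of how a character error rate is commonly computed: the edit distance (Levenshtein distance) between the reference transcription and the recognised text, divided by the length of the reference. This is an illustration under assumptions, not a description of how Transkribus computes its figures internally. The function names are hypothetical, and characters are counted as Unicode code points, which for Devanāgarī means vowel signs and the virāma count as separate characters.

```python
def levenshtein(reference: str, hypothesis: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn `reference` into `hypothesis`."""
    # previous[j] holds the distance between the reference prefix processed
    # so far and the first j characters of the hypothesis.
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance divided by the number of characters in the reference."""
    if not reference:
        raise ValueError("reference must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


# One wrong character in a ten-character reference gives a CER of 10%.
print(character_error_rate("devanagari", "devanaqari"))
```

Read this way, a CER of 0.83% corresponds to roughly 8 erroneous characters per 1,000 characters of text, and a CER of 5.59% to roughly 56.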

Our data model Devanagari mixed M1A for the recognition of printed Devanāgarī texts is available to Transkribus users on the Transkribus website as a public model for re-use. Further models are in preparation.

The workflow for OCR with Transkribus

To feed the Transkribus PyLaia engine with training data for text recognition, text regions and text lines must first be defined in the document. This can be done either manually or automatically, using the various layout analysis tools available through the Transkribus platform.

With P2PaLA, Transkribus also offers an open-source tool for text structure recognition, with which individual models for page segmentation can be trained. With the P2PaLA models trained by the FID4SA team for the Pothi format, which is based on Indian manuscripts, and for the Naval Kishore Press standard book format, page segmentation is now performed automatically with little manual post-correction.

After page segmentation, Ground Truth (GT) transcriptions can be created using the text editor. Ground Truth transcriptions are 1:1 copies of the text on the document facsimile. They are usually created manually. Alternatively, existing transcriptions can be imported into Transkribus. For printed texts, a data model for automatic text recognition can then be trained on GT transcriptions of approx. 5,000 words. As a general rule, the more training data a model is based on, the better the recognition accuracy, i.e. the lower the Character Error Rate (CER).

Various export functions and export formats are available for further processing of the documents outside of Transkribus, e.g. ALTO, PDF, TEI, TXT.
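As an illustration of such further processing, the following minimal sketch reads the recognised text lines from an ALTO file using only the Python standard library. The sample document is invented for this example and uses the ALTO v4 namespace as an assumption; the exact ALTO version and element nesting of a real Transkribus export may differ, so the code matches elements by local name rather than by a fixed namespace. The function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Invented sample for illustration; a real export contains coordinates,
# styles and further metadata in addition to the text content.
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v4#">
  <Layout>
    <Page ID="p1">
      <PrintSpace>
        <TextBlock ID="b1">
          <TextLine ID="l1">
            <String CONTENT="नवल"/>
            <SP/>
            <String CONTENT="किशोर"/>
          </TextLine>
          <TextLine ID="l2">
            <String CONTENT="प्रेस"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""


def local_name(tag: str) -> str:
    """Strip the '{namespace}' prefix that ElementTree puts before tag names."""
    return tag.rsplit("}", 1)[-1]


def alto_text_lines(xml_text: str) -> list[str]:
    """Return one string per TextLine, joining its String elements with spaces."""
    root = ET.fromstring(xml_text)
    lines = []
    for element in root.iter():
        if local_name(element.tag) != "TextLine":
            continue
        words = [
            child.get("CONTENT", "")
            for child in element
            if local_name(child.tag) == "String"
        ]
        lines.append(" ".join(words))
    return lines


for line in alto_text_lines(SAMPLE_ALTO):
    print(line)
```

Matching on local names keeps the sketch independent of the ALTO schema version, at the cost of ignoring namespaces entirely.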

Our Current Focus of Work

Building on the very good results with documents printed in lead type, GT transcriptions are currently being created for lithographically printed texts from the Naval Kishore Press (NKP) collection. In this printing process, the texts were typically written by hand and applied to the lithographic stone by scribes and calligraphers. Since these are effectively handwritten materials, the particular challenge is to train data models that achieve a good CER across different hands.

The second focus of our work is training data models for texts in Devanāgarī script based on GT "transliterations". Here, the Latin transliteration of the Devanāgarī text serves as the GT for training. A first data model based on approx. 9,000 words delivers a promising result, with a CER of 4.05% on the validation data set.
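To show what a Latin transliteration of Devanāgarī text looks like, here is a deliberately small sketch of a rule-based transliteration. It assumes the IAST scheme purely for illustration, since the scheme used for the GT transliterations is not stated here. The function name and mapping tables are hypothetical and cover only the basic consonants, vowels, anusvāra and visarga. Conjunct-specific conventions, nukta letters, digits and the schwa deletion of modern Hindi are not handled.

```python
import unicodedata

CONSONANTS = {
    "क": "k", "ख": "kh", "ग": "g", "घ": "gh", "ङ": "ṅ",
    "च": "c", "छ": "ch", "ज": "j", "झ": "jh", "ञ": "ñ",
    "ट": "ṭ", "ठ": "ṭh", "ड": "ḍ", "ढ": "ḍh", "ण": "ṇ",
    "त": "t", "थ": "th", "द": "d", "ध": "dh", "न": "n",
    "प": "p", "फ": "ph", "ब": "b", "भ": "bh", "म": "m",
    "य": "y", "र": "r", "ल": "l", "व": "v",
    "श": "ś", "ष": "ṣ", "स": "s", "ह": "h",
}
INDEPENDENT_VOWELS = {
    "अ": "a", "आ": "ā", "इ": "i", "ई": "ī", "उ": "u", "ऊ": "ū",
    "ऋ": "ṛ", "ए": "e", "ऐ": "ai", "ओ": "o", "औ": "au",
}
VOWEL_SIGNS = {
    "ा": "ā", "ि": "i", "ी": "ī", "ु": "u", "ू": "ū",
    "ृ": "ṛ", "े": "e", "ै": "ai", "ो": "o", "ौ": "au",
}
OTHER_SIGNS = {"ं": "ṃ", "ः": "ḥ"}  # anusvāra, visarga
VIRAMA = "्"  # suppresses the inherent vowel "a"


def to_iast(text: str) -> str:
    """Transliterate basic Devanāgarī into IAST; unknown characters pass through."""
    text = unicodedata.normalize("NFC", text)
    out = []
    i = 0
    while i < len(text):
        char = text[i]
        following = text[i + 1] if i + 1 < len(text) else ""
        if char in CONSONANTS:
            out.append(CONSONANTS[char])
            if following in VOWEL_SIGNS:
                out.append(VOWEL_SIGNS[following])  # dependent vowel replaces "a"
                i += 1
            elif following == VIRAMA:
                i += 1  # no vowel at all
            else:
                out.append("a")  # inherent vowel
        elif char in INDEPENDENT_VOWELS:
            out.append(INDEPENDENT_VOWELS[char])
        elif char in OTHER_SIGNS:
            out.append(OTHER_SIGNS[char])
        else:
            out.append(char)
        i += 1
    return unicodedata.normalize("NFC", "".join(out))


print(to_iast("देवनागरी"))
print(to_iast("संस्कृत"))
```

Because every consonant without a vowel sign or virāma receives the inherent "a", the output follows Sanskrit conventions rather than modern Hindi pronunciation.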

Provision of the Texts

The automatically recognised texts are available to users as image facsimiles and as searchable full texts, both in the original script and in Latin transliteration. Access to the digitised and OCRed texts is provided through Naval Kishore Press - digital. DWork, a software developed by Heidelberg University Library, is used for this purpose. The software has a modular design and supports all individual steps of the digitisation workflow, from the creation of metadata and scan processing to the creation of the web presentation with annotation and commentary functions. It enables scholars to work on texts together regardless of time and place.

Networking

FID4SA is in contact with other national and international projects that use AI-based methods for text recognition. There is a close professional exchange between the FID4SA team and the digital curator of the Two Centuries of Indian Print project at the British Library. This project digitised more than 1,000 historical Bengali books and trained data models for Bengali text recognition using Transkribus. In addition, an HTR Expert Group was founded, which meets online twice a year to share experiences.

Further Information

An insight into the work with Transkribus text recognition in the digitisation of the Naval Kishore Press collection is provided by an interview in the ANUBhasha Podcast, Season 1, Episode 6 (March 31, 2023).

Contact

Nicole Merkel-Hilf
CATS Library /
Dept. South Asia
Tel.: +49 6221 54 15047
merkel@ub.uni-heidelberg.de