The long-term archiving of research data is a central aspect of good scholarly practice. It is the prerequisite for ensuring that research results derived from these data remain traceable and verifiable. In addition, archiving such data makes it possible to reuse them in new avenues of research.
Research Data and E-Publishing with HASP
In addition to its e-publishing services for articles, books and journals, FID4SA offers Asian studies scholars worldwide the opportunity to have the associated research data permanently archived. The data can be linked directly to the online publications at Heidelberg Asian Studies Publishing (HASP). All research data, i.e. images, videos, audio files, tables and graphics, are assigned a DOI (Digital Object Identifier), which makes them permanently citable, visible and individually linkable as independent academic achievements.
Images, audio and video data, and other multimedia objects are either stored on the heidICON platform operated by Heidelberg University Library or integrated into the Heidelberg digitisation system DWork, which is likewise hosted sustainably by Heidelberg University Library. Further data publications are available in HASP@heiDATA and are dynamically integrated into the online publication. In the future, both the publications themselves and the media objects they use will be archived sustainably in heiARCHIVE, an OAIS-compatible long-term archiving system. heiARCHIVE is currently being set up and has been developed jointly by the University Computing Centre and the University Library as part of the Competence Centre for Research Data (KFD). The source code of software used in the context of publications can also be sustainably published and archived on heiDATA.
Research Data and Ground Truth Transcriptions
FID4SA uses the Transkribus platform developed within the READ project for text recognition of South Asian scripts.
Various data models for text recognition of the Devanāgarī script were trained with Transkribus and deliver very good recognition results, with a character error rate (CER) of approx. 2.3%. These models are based on so-called ground truth transcriptions, i.e. 1:1 transcriptions of the text shown in the document facsimile.
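To make the CER figure concrete, the following minimal Python sketch computes a character error rate in the way it is commonly defined: the edit (Levenshtein) distance between the ground truth and the recognised text, divided by the length of the ground truth. This is an illustration under stated assumptions, not a description of how Transkribus calculates its figures. The function names are hypothetical, and the choice to count Unicode code points after NFC normalisation is an assumption that matters for Devanāgarī, where vowel signs and the virama are separate code points.

```python
import unicodedata


def levenshtein(reference: str, hypothesis: str) -> int:
    """Minimum number of character insertions, deletions and substitutions."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def character_error_rate(ground_truth: str, recognised: str) -> float:
    """CER = edit distance / number of characters in the ground truth.

    Assumption: both strings are NFC-normalised and compared code point
    by code point; other tools may count characters differently.
    """
    ground_truth = unicodedata.normalize("NFC", ground_truth)
    recognised = unicodedata.normalize("NFC", recognised)
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    return levenshtein(ground_truth, recognised) / len(ground_truth)
```

Under this definition, a CER of 2.3% corresponds to roughly 23 character edits per 1,000 characters of ground truth.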
With FID4SA@heiDATA, FID4SA has set up a dataverse for archiving ground truth data for South Asian scripts. Interested researchers can download the data archived there and use it as training data for their own text recognition models. At the same time, researchers working on text recognition for South Asian scripts are invited to use this archive to make their own ground truth data available and to contribute to the creation of a ground truth data archive at a central site.