Text alignment in early printed books combining deep learning and dynamic programming
Digital preservation of written cultural heritage is a fundamental topic in digital humanities. The task is hard because of the documents' state of preservation and the lack of typographic standards. In this seminar Dr. Zahra Ziran (Università degli Studi di Firenze) describes a technique for transcript alignment in early printed books.
The technique combines deep models with dynamic programming algorithms. Two object detection models, based on Faster R-CNN, are trained to locate words. The first model is trained to recognize generic words and hyphens, using information about the number of words in each text line. Its predictions on pages where a line-by-line ground-truth annotation is available are then used to train a second model that detects landmark words. The alignment is then based on identifying landmark words in pages where only the text corresponding to zones on the page is known.
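The abstract does not spell out how dynamic programming is used, so the following is only a minimal sketch under one assumption: that the landmark words detected in a zone, read in order, are matched to the known transcript of that zone with a longest-common-subsequence style alignment. The function name `align_landmarks` and the sample detections are hypothetical and are not taken from the presented work.

```python
def align_landmarks(detected, transcript):
    """Align detected landmark labels to transcript words (LCS-style DP).

    Hypothetical helper, not the method from the seminar.
    Returns (detected_index, transcript_index) pairs in reading order.
    """
    n, m = len(detected), len(transcript)
    # score[i][j] = largest number of matches between
    # detected[:i] and transcript[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if detected[i - 1] == transcript[j - 1]:
                score[i][j] = score[i - 1][j - 1] + 1
            else:
                score[i][j] = max(score[i - 1][j], score[i][j - 1])

    # Walk back through the table to recover the matched pairs.
    pairs = []
    i, j = n, m
    while i > 0 and j > 0:
        if detected[i - 1] == transcript[j - 1]:
            pairs.append((i - 1, j - 1))
            i -= 1
            j -= 1
        elif score[i - 1][j] >= score[i][j - 1]:
            i -= 1  # detection i-1 stays unmatched (e.g. a false positive)
        else:
            j -= 1  # transcript word j-1 has no detection
    pairs.reverse()
    return pairs


# Assumed example: the first "et" is a spurious detection.
transcript = "in principio creavit deus caelum et terram".split()
detected = ["et", "deus", "et"]
print(align_landmarks(detected, transcript))
```

In this sketch a detection that cannot be placed consistently with the reading order is simply left unmatched, which is one plausible way for an alignment to tolerate detector errors.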
The proposed technique is evaluated on a publicly available digitization of the Gutenberg Bible, while the transcription is based on the Vulgata, a late 4th-century Latin translation of the Bible.