Identification problems for OCR characters for text retrieval in ancient books: A case study in the Ancient Collections of the Central Library at UNAM

Authors

  • Silvia Socorro Ballesteros Estrada, Dirección General de Bibliotecas UNAM
  • Guillermo Morales Romero
  • Pavel Alfredo Cedillo Pérez

DOI:

https://doi.org/10.22201/dgb.0187750xp.2012.1.39

Keywords:

Text recognition, OCR, ancient collections, digitization.

Abstract

This article describes, in general terms, the problems that hinder accurate text retrieval through optical character recognition (OCR) in ancient books. It draws on a sample of works from the fifteenth to the eighteenth centuries that are held in the Ancient Collections of the Central Library at UNAM and were digitized by the General Directorate of Libraries. It first presents a conceptual and theoretical overview of OCR and its application in text retrieval. It then illustrates the factors that determine whether the graphemes in these books are identified correctly, using tests run with Adobe Acrobat 8 Professional. Finally, it presents findings from the analysis and interpretation of the data for the variables studied.

Author Biographies

Silvia Socorro Ballesteros Estrada, Dirección General de Bibliotecas UNAM

Secretaría Técnica de Biblioteca Digital, Dirección General de Bibliotecas. Anexo de la DGB, Circuito de la Investigación Científica, UNAM-CU, c.p. 04510, México D.F., México. Email: silviabe@dgb.unam.mx.

Guillermo Morales Romero

Fondo Antiguo y Colecciones Especiales, Biblioteca Central. Tenth floor, Edificio de Biblioteca Central, Circuito Interior, UNAM-CU, c.p. 04510, México D.F., México. Email: guillermoralesromero@gmail.com; guillermom@dgb.unam.mx.

Pavel Alfredo Cedillo Pérez

Secretaría Técnica de Biblioteca Digital, Dirección General de Bibliotecas. Anexo de la DGB, Circuito de la Investigación Científica, UNAM-CU, c.p. 04510, México D.F., México. Email: alfredoc@dgb.unam.mx.

Published

2012-06-20

How to Cite

Ballesteros Estrada, S. S., Morales Romero, G. and Cedillo Pérez, P. A. (2012) “Identification problems for OCR characters for text retrieval in ancient books: A case study in the Ancient Collections of the Central Library at UNAM”, Biblioteca Universitaria, 15(1), pp. 25–34. doi: 10.22201/dgb.0187750xp.2012.1.39.

Section

Articles
