El Debate and El Sol Newspaper Archives, 1910-1936
OCR Data Transformation and Analysis
Charles III University of Madrid (UC3M)

At Osiris-AI, our capabilities reach far beyond processing early handwritten manuscripts.

Historical newspapers offer a wealth of information for scholars, but they are not always straightforward to process with standard OCR, because of the historical typefaces used and because the content is normally laid out in a series of columns. Many OCR tools do not recognise the fonts used in historical materials, nor can they detect column breaks, so the text ‘jumps’ from one column to the next in the middle of a line and the resulting output is too messy for analysis. Osiris-AI resolves these challenges by applying our specialist knowledge of OCR technology to these types of documents, producing accurate transcriptions in a readable format that are ready for analysis and interpretation.
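To illustrate the column problem, here is a minimal Python sketch of column-aware reading order. It is not our production pipeline: the `reading_order` function, the `(text, x, y)` word-box layout, and the assumption of equal-width columns are all hypothetical simplifications chosen for the example.

```python
def reading_order(words, page_width, n_columns, line_height=10):
    """Order OCR word boxes column by column instead of straight across the page.

    words: list of (text, x, y) tuples; x, y are top-left pixel coordinates
           (a hypothetical layout, real OCR output varies by tool).
    Assumes equal-width columns, which is a simplification.
    """
    column_width = page_width / n_columns

    def sort_key(word):
        _, x, y = word
        # Clamp so a word at the far right edge stays in the last column
        column = min(int(x // column_width), n_columns - 1)
        # Bucket words that sit on the same printed line
        line = int(y // line_height)
        return (column, line, x)

    return [text for text, _, _ in sorted(words, key=sort_key)]


# Two columns, two lines each, listed in the order a naive
# left-to-right scan might produce them.
words = [
    ("Madrid,", 10, 0), ("El", 210, 0),
    ("15", 60, 0), ("Gobierno", 230, 0),
    ("de", 10, 12), ("anuncia", 210, 12),
    ("abril.", 30, 12), ("reformas.", 270, 12),
]
ordered = reading_order(words, page_width=400, n_columns=2)
print(" ".join(ordered))
```

Sorting on column first, then line, then horizontal position keeps each column's text together. Real newspaper pages have uneven columns, headlines that span several columns, and skewed scans, so a working system would need to detect column boundaries rather than assume them.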

Processing large numbers of image scans is a further challenge for anyone working with historical archives, because of the computing power required.

El Debate was a Spanish newspaper offering a unique perspective on modern European history from 1911 to 1936. Charles III University of Madrid (UC3M) commissioned us to OCR this archive and conduct advanced keyword searches.

During this project we used accurate OCR to process over a quarter of a century of printed material, amounting to over 50,000 pages. We transformed the entirety of the newspaper text into an analysable dataset.

Once the database was processed, we were able to run multiple word and phrase search queries and to produce metrics listing results at page level.
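As an illustration of page-level search metrics, the following Python sketch counts word and phrase matches per page. It is a simplified example under stated assumptions, not the tooling used on the project: the `page_level_counts` function, the page identifiers, and the sample text are hypothetical.

```python
import re


def page_level_counts(pages, queries):
    """Count case-insensitive whole-word or phrase matches per page.

    pages:   dict mapping a page identifier to its OCR text (hypothetical layout).
    queries: list of words or phrases to search for.
    Returns (page_id, query, count) rows for every non-zero count.
    """
    rows = []
    for page_id, text in pages.items():
        # Collapse whitespace so phrases split across line breaks still match
        normalised = " ".join(text.split())
        for query in queries:
            # Lookarounds stop a query matching inside a longer word
            pattern = r"(?<!\w)" + re.escape(query) + r"(?!\w)"
            count = len(re.findall(pattern, normalised, flags=re.IGNORECASE))
            if count:
                rows.append((page_id, query, count))
    return rows


# Invented sample pages, for illustration only
pages = {
    "1931-04-15_p01": "La República fue proclamada ayer.\nLa república\nespañola nace.",
    "1931-04-15_p02": "Noticias de Madrid y de provincias.",
}
rows = page_level_counts(pages, ["república", "república española", "Madrid"])
for row in rows:
    print(row)
```

Each row ties a query to a specific page, so the results can be aggregated by issue, month, or year. OCR output from historical print usually contains recognition errors, so exact matching like this will miss some occurrences; fuzzy matching would be a natural extension.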

http://hdl.handle.net/10637/12072

The client has since commissioned us to work on a second archive for a similar historical Spanish newspaper, El Sol.