Monday, 26 May 2025

Reading and Transcribing Old Handwritten Documents: Transkribus

I have previously written about reading old handwriting and whether children in school today will even be able to write in cursive. In one blog post I described poring through hundreds of pages of old parish registers, learning how to read Old English and how to interpret the poor handwriting of some scribes of yesteryear. And I lamented that many students today are not learning cursive writing as many of us older folk did, and so may not be able to read old documents.

Well, help may be on the way!

We have seen large strides made in converting printed material into digital documents. Most family researchers, and even casual readers, now take advantage of the thousands of books and articles, both old and current, that are available online as a result of extensive scanning efforts.

I also commented back in 2017 that “technology is being developed to read handwriting through optical character recognition. That will be a great boon to anyone dealing with old documents, but it is likely still a long way off before we have such a program on our own computers.” It is now a lot closer to reality than I imagined back then.

One of the major applications for reading and transcribing old documents is Transkribus. I mentioned it in a recent presentation to the Family Tree (UK) Brickwalls, Skills & Solutions Club.

Transkribus has become one of the go-to programs for transcribing historical documents. The development of software to transcribe old records began back in the late 1990s. Libraries were already using Optical Character Recognition (OCR) to digitize books, but primarily for those written in English. Another program was needed for printed material in other languages commonly used in Europe.

Researchers came up with the Analysed Layout and Text Object (ALTO) format, which stored text and images of letters and words. The images were transcribed and stored for comparison with other documents over time, building up a library of words and phrases. Such documentation became what is called Ground Truth, a growing repository of images that could serve machine learning, or artificial intelligence, processing.
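For readers curious about what such a file holds, here is a minimal sketch in Python that reads a tiny ALTO-style fragment and pulls out the words, line by line, along with where each word sits on the page. The sample words and coordinates are invented for illustration, and the XML namespace that real ALTO files declare is left out to keep things short. The element and attribute names (TextLine, String, CONTENT, HPOS, VPOS) are the ones commonly documented for ALTO, but treat the whole thing as a simplified illustration rather than a complete file.

```python
import xml.etree.ElementTree as ET

# Invented, simplified ALTO-style fragment (no namespace, two text lines).
ALTO_SAMPLE = """
<alto>
  <Layout>
    <Page>
      <PrintSpace>
        <TextBlock>
          <TextLine>
            <String CONTENT="In" HPOS="100" VPOS="200" WIDTH="40" HEIGHT="30"/>
            <String CONTENT="the" HPOS="150" VPOS="200" WIDTH="55" HEIGHT="30"/>
            <String CONTENT="name" HPOS="215" VPOS="200" WIDTH="80" HEIGHT="30"/>
          </TextLine>
          <TextLine>
            <String CONTENT="of" HPOS="100" VPOS="240" WIDTH="35" HEIGHT="30"/>
            <String CONTENT="God" HPOS="145" VPOS="240" WIDTH="60" HEIGHT="30"/>
            <String CONTENT="Amen" HPOS="215" VPOS="240" WIDTH="85" HEIGHT="30"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>
""".strip()

root = ET.fromstring(ALTO_SAMPLE)

# Rebuild the transcription one line at a time.
lines = []
for text_line in root.iter("TextLine"):
    words = [s.get("CONTENT") for s in text_line.iter("String")]
    lines.append(" ".join(words))

# Record the position of every word: (text, horizontal, vertical).
word_positions = [
    (s.get("CONTENT"), int(s.get("HPOS")), int(s.get("VPOS")))
    for s in root.iter("String")
]

for line in lines:
    print(line)
print(word_positions[0])
```

The point is simply that each transcribed word is tied to a location on the page image, which is what lets a transcription be matched back to the handwriting it came from.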

The Transkribus project was established and backed by several institutions, coming together as the READ-Coop, formed to test and further develop the programming. The group became the official guardians of the Transkribus platform. There are now more than 100 European members of the coop. The first version went online in February 2015.

Read more about the history and development of Transkribus in a 2023 post on the Transkribus blog.

Costs to use the platform vary depending on the purpose and the type or size of the organization. There is a free version for genealogists and students, although the waiting period to get transcriptions done can be longer, as the priority for these accounts is lower.

The process is simple.

      Set up an account.

      Open the Transkribus program.

      Drag an image into the left-hand side of the window.

      And the program will begin to transcribe it.

There may be a wait while the image is in a queue.

I wanted to test the technique, so I uploaded one of my family documents from the 17th century. This was a purported will for Sampson Shepheard of Cornwood, Devon, but it was actually a forgery conceived by his brother, William. But that is a story for another time. Anyway, it seemed like a good document to try out on various transcription programs.

Once a document is uploaded it enters a transcribing queue. Within a few moments a transcription will appear.


You can get a better view of the results on the Editor screen where you can see a line-by-line transcription. In this case the Transkribus version, aside from a few words and phrases, was not all that bad.

Then you can compare it with your own transcription or one from a published source.

The Transkribus results had 56 errors in 286 words, an error rate of about 20%. Counted per word like this, it is strictly a word-level error rate rather than a true Character Error Rate (CER), which counts individual characters. Most of the errors were minor misreadings, like adding an ‘s’ to the end of a word that had a squiggle, interpreting a similarly shaped ampersand as a ‘p’, or using a slightly different spelling. So, the figure of 20% was a bit misleading.
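The difference between counting errors by the word and by the character can be sketched in a few lines of Python. This is only an illustration: the two short phrases are made up, the function names are my own, and the edit-distance method shown is the standard Levenshtein calculation, not necessarily the exact one Transkribus uses internally.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (strings or word lists)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(reference, hypothesis):
    """Edits needed to turn the hypothesis into the reference, per reference unit."""
    return edit_distance(reference, hypothesis) / len(reference)


# Made-up example: the machine adds a stray 's' to one word.
reference = "in the name of god amen"    # the human transcription
hypothesis = "in the names of god amen"  # the machine transcription

word_error_rate = error_rate(reference.split(), hypothesis.split())
char_error_rate = error_rate(reference, hypothesis)

print(f"Word error rate:      {word_error_rate:.1%}")
print(f"Character error rate: {char_error_rate:.1%}")

# The figure from the test described above: 56 wrong words out of 286.
blog_rate = 56 / 286
print(f"56 errors in 286 words: {blog_rate:.1%}")
```

One stray letter makes a whole word count as wrong, so the word-level rate comes out much higher than the character-level rate for the same transcription. That is why a document full of small slips can look worse on paper than it reads in practice.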

By the way, I did tests on Transkribus with other documents. Some were good; some were poor. What I found was that there are positives and negatives with their process, although admittedly I tried a very small sample.

Note that Transkribus is transcribing software, as the name suggests. It does not translate documents from one language to another.

I recommend that genealogists try Transkribus for themselves. Compare it with other AI platforms such as OpenAI’s ChatGPT or Microsoft’s Copilot as well. You might be amazed at the results you get.
