I have previously written about reading old handwriting and about whether children in school today will even be able to write in cursive. In one blog post I described poring through hundreds of pages of old parish registers, learning how to read Old English and to interpret the poor handwriting of some scribes of yesteryear. I also lamented that many students today are not learning the cursive writing that many of us older folk did, and so may not be able to read old documents.
Well, help may be on the way!
We have seen large strides made in converting printed
material into digital documents. Most family researchers, and even casual
readers, now take advantage of the thousands of books and articles, both old and
current, that are available online as a result of extensive scanning efforts.
I also commented back in 2017 that “technology is being
developed to read handwriting through optical character recognition. That will
be a great boon to anyone dealing with old documents, but it is likely still a
long way off before we have such a program on our own computers.” It is now
a lot closer to reality than I imagined back then.
One of the major applications for reading and transcribing
old documents is Transkribus. I mentioned it in a recent presentation
to the Family Tree (UK) Brickwalls, Skills & Solutions Club.
Transkribus has become one of the go-to programs for transcribing
historical documents. The development of software to transcribe old records
began back in the late 1990s. Libraries were already using Optical Character
Recognition (OCR) to digitize books, but primarily for those written in
English. Another program was needed for printed material in
other languages commonly used in Europe.
Researchers came up with the Analysed Layout and Text Object (ALTO) format, which stored text
and images of letters and words. The images were transcribed and stored for
comparison with other documents over time, building up a library of words and
phrases. Such documentation became what is called Ground Truth, a growing repository of
images that could serve machine learning, or artificial intelligence,
processing.
The Transkribus project was established and backed by several institutions, coming together as
the READ-Coop, formed to test and further develop the programming.
The group became the official guardians of the Transkribus platform.
There are now more than 100 European members of the coop.
The first version went online in February 2015.
Read more about the history and development of Transkribus in a 2023 post on the Transkribus blog.
Costs to use the platform vary depending on the purpose and
the type or size of the organization. There is a free version for genealogists and
students although the waiting period to get transcriptions done can be longer
as the priority for these accounts is lower.
The process is simple.
• Set up an account.
• Open the Transkribus program.
• Drag an image into the left-hand side of the window.
• The program will then begin to transcribe it.
There may be a wait while the image is in a queue.
I wanted to test the technique, so I uploaded one of my family documents from the 17th century. This was a purported will for Sampson Shepheard of Cornwood, Devon, but it was actually a forgery conceived by his brother, William. But that is a story for another time. Anyway, it seemed like a good document to try out on various transcription programs.
Once a document is uploaded it enters a transcribing queue. Within a few moments a transcription will appear.
You can get a better view of the results on the Editor screen where you can see a line-by-line transcription. In this case the Transkribus version, aside from a few words and phrases, was not all that bad.
Then you can compare it with your own transcription or one from a published source.
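For anyone who would rather not compare two transcriptions by eye, here is a minimal Python sketch that lists the words where they disagree. It uses only the standard `difflib` module. The function name `word_differences` and the two sample lines are made up for illustration; they are not from the will discussed here, and this is not part of Transkribus itself.

```python
import difflib


def word_differences(reference: str, machine: str) -> list[tuple[str, str]]:
    """Return (reference words, machine words) pairs where the two texts disagree."""
    ref_words = reference.split()
    out_words = machine.split()
    # autojunk=False stops difflib from ignoring very common words in long texts.
    matcher = difflib.SequenceMatcher(a=ref_words, b=out_words, autojunk=False)
    differences = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            differences.append(
                (" ".join(ref_words[i1:i2]), " ".join(out_words[j1:j2]))
            )
    return differences


# Hypothetical sample lines, invented for this example.
my_version = "In the name of God Amen I give & bequeath unto my brother"
machine_version = "In the names of God Amen I give p bequeath unto my brother"

for mine, theirs in word_differences(my_version, machine_version):
    print(f"mine: {mine!r:12} machine: {theirs!r}")
```

An empty string on either side of a pair means a word was dropped or added rather than misread.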
The Transkribus results had 56 errors in 286 words, an error rate of about 20%. Strictly speaking, that is a word-level rate rather than a true Character Error Rate (CER), which is counted per character. Most of the errors involved misread letters, like adding an ‘s’ to the end of a word that had a squiggle,
reading a similarly shaped ampersand as a ‘p’, or using a slightly
different spelling. So the 20% figure was a bit misleading: counted character by character, the error rate would likely be considerably lower.
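To see how much the unit of counting matters, here is a minimal Python sketch that works out an error rate both per word and per character. It assumes a common definition (the edit distance between the two texts divided by the length of the reference text); the exact calculation Transkribus uses may differ. The function names and the sample line are hypothetical, chosen only for illustration.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_item in enumerate(reference, start=1):
        current = [i]
        for j, hyp_item in enumerate(hypothesis, start=1):
            cost = 0 if ref_item == hyp_item else 1
            current.append(min(
                previous[j] + 1,         # item missing from the hypothesis
                current[j - 1] + 1,      # extra item in the hypothesis
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Character edits divided by the number of reference characters."""
    return edit_distance(reference, hypothesis) / len(reference)


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word edits divided by the number of reference words."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


# Hypothetical sample line with two single-letter slips.
reference = "I give & bequeath unto my brother"
hypothesis = "I give p bequeath unto my brothers"

print(f"Per word:      {word_error_rate(reference, hypothesis):.1%}")
print(f"Per character: {character_error_rate(reference, hypothesis):.1%}")
print(f"56 errors in 286 words: {56 / 286:.1%}")
```

In this made-up line, two single-letter slips spoil two of the seven words but only two of the 33 characters, which is why a per-word count can make a transcription look worse than it reads.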
By the way, I did tests on Transkribus with other documents.
Some were good; some were poor. What I found was that there are positives
and negatives with their process, although admittedly I tried a very small
sample.
Note that Transkribus is transcription software, as the
name suggests. It does not translate documents from one language to another.
I recommend that genealogists try Transkribus for
themselves. Compare it with other AI platforms such as OpenAI’s ChatGPT or
Microsoft’s Copilot as well. You might be amazed at the results you get.