Generative Artificial Intelligence (AI) is increasingly used to transcribe old documents, particularly handwritten ones. A growing number of programs can handle very old records and work in many languages.
The process builds on Optical Character Recognition (OCR) for documents that were written or printed in old handwriting or printing styles. The idea was to create a digital library, or memory, of letter and word shapes that could then be compared with newly scanned images to produce an interpretation of what a document contained. The results apply to everything from reading single pages or entries (of interest to genealogists) to the mass conversion of multi-page documents stored in archives. The objective is to do so quickly, easily and accurately.
FamilySearch Full-Text Search
One feature that has caught the eye of genealogists lately is the Full-Text Search function developed by FamilySearch, which has expanded the options for searching handwritten documents in the thousands of collections and millions of images in its digital library.
Full-Text Search, introduced in 2024, is one of a group of experimental programs in FamilySearch Labs “where you can explore emerging FamilySearch features that are not yet ready for public release”. Researchers are invited to help refine the programs through testing and feedback.
Full-Text Search uses AI processes developed by FamilySearch to scan and locate information in digitized documents that have not been fully indexed. The technique makes it possible to find specific words or phrases, including the names of people who may not have been the main parties. All one needs to participate is an account on the site, which is free to obtain.
I wanted to try out the Full-Text Search to see how it would work. Beginning on the FamilySearch Labs page, I selected the experiment titled Expand your search with Full Text and clicked on “Go To Experiment.”
A search form came up asking for keywords, names, places, dates and collections to locate data about people and events.
On the form I entered just Nicholas Shepheard, which is the name of several of my ancestors, including four in my direct line, who lived in Devon, England for several centuries. As recommended, I put his name in quotation marks so that the search would look only for instances where both names were present, with the exact spelling.
I did not fill in any other data as I wanted the widest search possible. A list of 79 results came up, with some documents repeating the name several times. The documents were spread across the United Kingdom and Ireland (72) and the United States of America (7). In the UK and Ireland group, 66 were from English sources, four were from Ireland, and two were from Wales. The English sources covered 11 counties or regions.
Then I narrowed down the list by choosing only those in Devon and got 37 results. An image of each original document could be opened so I could see who the individual was and whether he was part of my family. A full transcription accompanied each image, and both could be downloaded.
Some of the records I had seen before in searches of other collections and websites. Some were new to me. For convenience, Full-Text Search highlighted Nicholas’s name on the images and in the transcription.
One of the documents was a 1786 settlement examination from Ermington parish (reference FamilySearch: England, Devon, Plymouth, Parish Chest Records, 1556-1950). It shows that a man named Hercules Ferris worked for Mr. Nicholas Shepheard around 1761-62 at his farm called “Quay” in Cornwood parish.
AI transcriptions are not perfect. The settlement transcription had 14 errors plus six missing words out of a total of 189 words in the actual transcription, a 10.6% character error rate (CER). I determined that Quay was a misreading of Gnats. This location later became Notts and was the Shepheard family seat for over 170 years, between 1630 and (probably) 1806.
This was the first time I had seen this document. I had not encountered it in any search of Cornwood or Ermington parish indexes or lists that included Nicholas Shepheard’s name. Importantly for me, the document appeared to confirm that the family was living at Gnats/Notts in the mid-18th century.
So, my experiment was a success!
Future projects will be to investigate the other examples of Nicholas Shepheard in UK and USA documents and to look into other family members and locations.
Or I may look at natural events, one of my favorite subjects. For example, I did a quick search for “hailstorm” and got 6,536 results: 595 in the United Kingdom and Ireland and 5,711 in the United States of America. A search for “floods” got 149,004 hits: 10,107 in the UK and Ireland and 132,618 in the USA; “earthquakes” got 3,531 and 231,628, respectively; and “famine” got 18,861 and 194,020, respectively. The results included mentions in newspapers that FamilySearch has in its library. Each search can be narrowed down by locale, year and name, which will be handy for looking at specific families and past homes.
I highly recommend family historians take advantage of this new program and do some searches for their ancestors. I think you will be very pleasantly surprised.
Other AI Transcription Options
In reviewing AI transcription options, I also wanted to test and compare other techniques, so I uploaded the 1786 Ermington settlement example to other platforms. The results were eye-opening.
Transkribus
Transkribus has become one of the internationally recognized go-to programs for transcribing historical documents.
The development of software to transcribe old records began back in the late 1990s. Libraries were already using Optical Character Recognition to digitize printed books, but primarily for those written in English. Another program was needed for material published in other languages.
Researchers came up with the Analysed Layout and Text Object (ALTO) format, which stored text and images of handwritten letters and words. The images were transcribed and stored for comparison with other documents, over time building up a library of words and phrases. Such documentation became what is called Ground Truth, a growing repository of images that could serve machine learning, or artificial intelligence, processing.
The Transkribus project was established and backed by several institutions, which came together as the READ-Coop to test and further develop the programming. The group became the official guardian of the Transkribus platform. There are now more than 100 European members of the co-op.
The process is simple to use. Just set up a free account, open the program and drag an image into the left-hand side of the window, and the program will begin immediately. After a few minutes waiting in the queue, a transcription will be available. A line-by-line comparison with the original image can be produced.
I followed this formula with the 1786 Ermington settlement document.
Ancestry
Ancestry is developing a new process, still in beta testing at present, called the Document Transcription Tool. The function can read and transcribe a variety of old handwritten documents. This feature can be used globally across all Ancestry platforms, including the app, mobile and desktop websites, and in multiple languages.
To use the program, a user must have an Ancestry account and a family tree posted on the site. A target document is first loaded into the Gallery section of an ancestor’s tree profile. Once the document is opened, a button marked “Transcribe” is selected and the process will begin. The transcription takes only a few minutes.
For my test, I added the 1786 Ermington settlement document to the profile of my 5th great-grandfather, Nicholas Shepheard. I then let Ancestry do its thing and come up with a transcription.
ChatGPT and Copilot
As part of my assessment, I also had two other AI sites attempt a transcription: ChatGPT, developed by OpenAI, and Microsoft Copilot. These are two mainstream platforms, developed by well-known groups, now commonly used in AI processing.
After uploading the 1786 Ermington settlement document to each of them, I asked, “Can you transcribe this image?”
Again, almost immediately I had transcriptions of the document.
Results
I compared the Transkribus, Copilot, ChatGPT and Ancestry results with my own (actual) transcription. On the illustration here, my transcription, which I believe is accurate, is on the right. All the words in the four processes which matched the actual transcription are highlighted in yellow.
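For readers who would like to automate this kind of word-by-word comparison, here is a minimal Python sketch. It is not the method used to prepare the illustration; it simply aligns two transcriptions with the standard library's difflib module and counts the words that match, differ, or are missing. The function name compare_words and the sample sentences are made up for illustration (the sample is not the text of the settlement examination, only a nod to the Gnats/Quay misreading).

```python
from difflib import SequenceMatcher


def compare_words(reference: str, candidate: str) -> dict:
    """Align two transcriptions word by word and count the differences."""
    ref_words = reference.split()
    cand_words = candidate.split()
    matcher = SequenceMatcher(a=ref_words, b=cand_words, autojunk=False)

    counts = {"matched": 0, "changed": 0, "missing": 0, "extra": 0}
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            counts["matched"] += i2 - i1
        elif tag == "replace":
            # Words present in both versions but read differently.
            changed = min(i2 - i1, j2 - j1)
            counts["changed"] += changed
            counts["missing"] += (i2 - i1) - changed
            counts["extra"] += (j2 - j1) - changed
        elif tag == "delete":
            # Words in the reference that the candidate left out.
            counts["missing"] += i2 - i1
        elif tag == "insert":
            # Words the candidate added that are not in the reference.
            counts["extra"] += j2 - j1
    return counts


# Made-up sample line, NOT the text of the actual settlement examination.
reference = "he worked for Mr Nicholas Shepheard at his farm called Gnats"
candidate = "he worked for Mr Nicholas Shepheard at farm called Quay"

result = compare_words(reference, candidate)
print(result)
```

The comparison is case- and punctuation-sensitive as written, so a stricter or looser match would need the words to be normalized first.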
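All AI techniques worked well. Character Error Rates (CER) were calculated for each by adding the number of words transcribed wrongly to any difference in word count, then dividing by the 189 words in the actual transcription. (The rates are therefore counted by words rather than by individual characters.)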
• The best CER was for the ChatGPT transcription at just 7.9%, which includes seven missing words.
• Copilot was right behind with a CER of 8.5%. This transcription was very close to the actual with only seven words mis-transcribed. It did miss nine words, though.
• The Ancestry transcription ended up with fewer words than the actual transcription. It missed a whole phrase, along with the last word. Its 12.7% CER is acceptable, but rates that large call for close, line-by-line checking.
• Most of the 16 errors in the Transkribus transcription were words where individual letters were misread, such as often mistaking ‘e’ for ‘o’. Curiously, it looked at a blemish on the document and transcribed it as a number. The CER of 11.1% is a bit misleading, as the result was very close to the original in word count, format and spelling.
• The FamilySearch Full-Text Search transcription had, as noted above, a CER of 10.6%, comparable to the other platform results (14 errors plus six fewer words).
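For anyone who wants to check the arithmetic, the rates quoted above can be reproduced with a few lines of Python. This is only a sketch of the calculation as described in this post (wrongly transcribed words plus the word-count difference, divided by the 189 words in the actual transcription); the function name word_error_rate is my own label, and only the two platforms for which both counts are stated are included.

```python
def word_error_rate(wrong_words: int, word_count_difference: int, total_words: int) -> float:
    """Return the error rate as a percentage, using the definition in this post."""
    if total_words <= 0:
        raise ValueError("total_words must be positive")
    return 100 * (wrong_words + word_count_difference) / total_words


TOTAL_WORDS = 189  # words in the actual (reference) transcription

# (wrongly transcribed words, missing words) as reported in this post
reported = {
    "FamilySearch Full-Text Search": (14, 6),
    "Copilot": (7, 9),
}

rates = {
    name: round(word_error_rate(wrong, missing, TOTAL_WORDS), 1)
    for name, (wrong, missing) in reported.items()
}

for name, rate in rates.items():
    print(f"{name}: {rate}%")
```

Keeping the wrong-word count and the word-count difference as separate inputs makes it easy to see which kind of error is driving a platform's rate.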
Many errors can be a result of penmanship as much as historical writing styles.
The transcriptions from any of these processes can be improved by correcting the results offered and resubmitting them. Over time, as the archive of “ground truth” (more examples processed and corrections submitted) is built up for similar documents, the transcriptions will get better.
Overall, the results of all the techniques were very encouraging. I certainly will be using each, or all, of them going forward.
Online References
AI know how for family history: Have you tried the FamilySearch AI Full-Text Search. https://www.family-tree.co.uk/how-to-guides/ai-know-how-for-family-history-have-you-tried-the-familysearch-ai-full/
Ancestry News: Ancestry launches Document Transcription Feature https://www.ancestry.com/c/ancestry-blog/ancestry-news/document-transcription-feature
ChatGPT https://chatgpt.com/
Copilot https://copilot.microsoft.com/
FamilySearch Full-Text Search https://www.familysearch.org/en/search/full-text
FamilySearch Labs https://www.familysearch.org/en/blog/familysearch-labs
Mühlberger, Günter. (2023). A Short History of Transkribus. https://blog.transkribus.org/en/a-short-history-of-transkribus-with-gunter-Muhlberger
Transkribus https://blog.transkribus.org/en
BYU Library helpful recent videos
Using the FamilySearch Full Text Search Feature-A Genealogical Goldmine – James Tanner (2 June 2024) https://youtu.be/YRYn7wyo7OA?si=J7P10grh7p_pxhPA
AI, Handwriting Recognition, and Full Text Searches – James Tanner (2 February 2025) https://youtu.be/5PVUHrJLT4w?si=I5LLqSafJ3iJ6BPm
FamilySearch Full-Text Search: A New Key to Tearing Apart Brick Walls – Amy Peacock (4 February 2025) https://youtu.be/udU2xT0ssXA?si=Mnf06b4K793Nem-m
Getting to Know FamilySearch's New Full-Text Search – Kathryn Grant (19 February 2025) https://youtu.be/LhNE8znSPgM?si=_Fz1lT-UTfbPXQlj
The Needle in the Haystack: Researching Women and Minorities using FamilySearch’s Full-Text Search – Julia A. Anderson (26 February 2025) https://youtu.be/jPg0qTcsBVM?si=6Qxrxzypi2-UdBS4
The FamilySearch Full Text Function – Jerroleen Sorensen (10 May 2025) https://youtu.be/4M3h-bSiQGM?si=5p5z20qZ2u9dz87b
Legacy Family Tree Webinars
Full-Text Search: Genealogy Game Changer – Geoff Rasmussen (11 March 2024). https://familytreewebinars.com/webinar/full-text-search-genealogy-game-changer/
Secrets for Success: How to Harness the Power of FamilySearch’s Full-Text Search – Julia A. Anderson (21 May 2025) https://familytreewebinars.com/webinar/secrets-for-success-how-to-harness-the-power-of-familysearchs-full-text-search/