Tech
October 22, 2023

Understanding the Limitations of GPT-Powered Transcription Service

GPT- and AI-powered transcription solutions have limitations. Users should be aware of them to avoid mistakes that can lead to operational issues.


When ChatGPT emerged on the scene in late 2022, the solution represented such a huge leap forward that many users overlooked its limitations. For example, some users submitted ChatGPT-generated information in reports, only to find that it was false or inaccurate: ChatGPT had hallucinated information to fill in gaps.

While people are now better at double-checking and verifying ChatGPT-provided content, there is still another area that people tend to overlook: transcription. There are now many transcription solutions powered by ChatGPT, or more specifically by Whisper, OpenAI’s open-source speech recognition system, as well as offerings from competing providers.

Transcription solutions powered by AI also have their limitations. If users are not aware of them and take AI-generated transcripts at face value, they can suffer serious consequences, ranging from operational chaos (e.g. colleagues cannot tell what a misspelled word actually refers to, stalling a workflow) to reputational damage (e.g. an incorrect transcription creates a memorable gaffe that undermines the organization’s reputation).

Because of these risks, users must be familiar with the limitations.

Brand names


Many new organizations choose brand names that are unconventionally spelled, such as by omitting a vowel or repeating the same letter. The purpose of this is strategic: Perhaps the domain name of the more conventional spelling is already taken, or they want to appear at the top of search results for their unmistakable brand.

AI-powered transcription does not do well with these. For example, Motorola Mobility’s line of smartphones, Razr, would likely be transcribed as the more common razor. If Motorola Mobility were to publicly release a transcript for one of its videos, this mistake could lead to serious reputational damage. Critics would lampoon Motorola Mobility for failing to spell its own brand name correctly, even if a tool was ultimately responsible.
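For teams running Whisper directly, one partial workaround is to pass the correct spellings as a prompt so the model is more likely to reproduce them. The sketch below assumes the open-source openai-whisper Python package; the audio file name and prompt text are purely illustrative, and the hint only biases the output rather than guaranteeing the brand name comes through correctly.

```python
# Minimal sketch: nudging Whisper toward a brand's correct spelling.
# Assumes the open-source `openai-whisper` package (pip install openai-whisper);
# the file name and prompt text are illustrative, not from the article.
import whisper

model = whisper.load_model("base")

# initial_prompt seeds the decoder with text, making unusual spellings
# such as "Razr" more likely to appear verbatim in the transcript.
result = model.transcribe(
    "product_launch.mp3",
    initial_prompt="Motorola Razr, Moto G, Lenovo",
)

print(result["text"])
```

Even with a spelling hint like this, transcripts that will be published under a brand’s name still warrant a human review pass.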

Interjections

Conversations do not usually unfold neatly, with one person talking and then stopping to give the other person the floor. During the course of a natural conversation, people interrupt one another, interject, and make side comments that often overlap.

According to the Computational Social Science group of the Institut Polytechnique de Paris, GPT-powered solutions struggle with these asides. They tend to commit one of two errors:

One, parts of the secondary dialogue are omitted.

Two, the dialogue is presented sequentially, even though it did not really occur that way, potentially changing the meaning of a given conversation.

For businesses that rely on these tools for comprehensiveness and accuracy, the tools may be falling short in ways that are not always easy to detect.

Code-switching


For those unfamiliar with the term, code-switching refers to when bilingual or multilingual people freely switch between their shared languages during the course of a conversation. Coworkers in Singapore might start a conversation in English and then switch to Mandarin when the topic at hand turns to concepts that are more easily expressed in that language.

According to several reports, GPT-powered transcription struggles with audio that includes code-switching. For example, instead of accurately transcribing the second language, it may simply write down similar-sounding words from the first language. This shortcoming can make it harder for workplaces where code-switching is common to get accurate transcriptions.

Select languages


Whisper’s training dataset spanned over 680,000 hours of audio, which gives it the ability to recognize dozens of different languages, from Arabic and Armenian to Vietnamese and Welsh.

But not all of that language support is of the same quality, because not all languages were equally represented in the massive dataset. Tagalog, the national language of the Philippines, suffered from numerous common errors, as Tommy Lim, Philippines Interviews Digitization Project Intern at GBH and the American Archive of Public Broadcasting, pointed out.

“[Errors included] unnecessary inclusions of certain speech sounds, typically consonants and glides, misspelling of words, separation of a single word into multiple, as well as the concatenation of separate words into one,” he wrote.
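For teams transcribing a less-represented language such as Tagalog, one small mitigation is to tell Whisper the language explicitly rather than relying on auto-detection, so the audio is at least not misidentified as a better-represented language. The sketch below again assumes the open-source openai-whisper package with an illustrative file name; it will not fix errors that stem from the thinner training data itself.

```python
# Minimal sketch: forcing Whisper to treat the audio as Tagalog ("tl")
# instead of auto-detecting the language. Assumes the open-source
# `openai-whisper` package; the file name is illustrative.
import whisper

# Larger checkpoints generally handle lower-resource languages better.
model = whisper.load_model("medium")

result = model.transcribe("oral_history_interview.mp3", language="tl")

# Print timestamped segments so errors can be checked against the audio.
for segment in result["segments"]:
    print(f'[{segment["start"]:.1f}s - {segment["end"]:.1f}s] {segment["text"]}')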

Brands that come to these services expecting higher-quality transcription, precisely because few other options support their language, may be left disappointed. They may have to waste significant time correcting these errors, spending more time than they would have had they simply chosen a human transcriber who natively speaks their language.

Peculiar failure case

GPT is also susceptible to what is known as a peculiar failure case. This is similar to how ChatGPT hallucinates untrue facts. With a peculiar failure case, the audio input is a clearly spoken phrase, yet the corresponding output is totally unrelated.

Some of the examples from Analytics India Magazine are particularly egregious: “Seven” in the audio becomes “Damn it!” in the transcription, while “Her jewelry shimmered” becomes “Hey, did you lose your mind?” These mistakes can radically change the meaning of a given transcription, and easily lead to misunderstandings between employees, or even reputational damage with the public at large.

Punctuation


Of all knowledge workers, journalists may be the ones who do the most transcription. They have to interview sources and get those interviews transcribed in order to write their stories. Naturally, transcription is a huge time-sink for them, as journalist James Somers described in The New Yorker.

He documented his long history with transcription solutions and services, which eventually culminated in his first AI-based one, Otter.ai. He noted that it was not great at punctuation. For example, in places where most human transcriptionists would put a period, the text would simply go on and on, a run-on sentence that would enrage even the most patient of English teachers. Having to correctly punctuate sentences for readability may be a time-consuming nuisance for people who came to this solution looking for more efficiency.

Thinking past transcription

Because even cutting-edge tools like GPT are still prone to various errors, businesses should think beyond just transcription software. The goal should never be to just jot down what was said at a meeting, as this is a passive exercise.

Instead, businesses should conceptualize transcription as a mere jumping-off point for further collaboration. This is possible with voice collaboration platforms like Vocol.ai, which starts with transcription but crucially offers value-added features that make a meaningful impact, addressing some of the issues with transcription alone.

On Vocol.ai, users can translate transcripts, share auto-generated summaries, make highlights where appropriate, and even produce action items. These features transform transcription from a one-off exercise into a hub for follow-through: With Vocol.ai, plans don’t only get made, they get done.

Like this article? Share it with your friends and associates! Remember to follow us on Facebook, LinkedIn and Twitter for news and updates about Vocol.ai. Better still, leave us a rating on the following review sites and we will love you forever!

AlternativeTo | Crozdesk | Product Hunt | SaaSHub | There's An AI For That | SourceForge | Slashdot | BetaList