How Does Automated Closed Captioning Work?

How does automated closed captioning work? What elements affect the accuracy of artificial intelligence (AI) driven captioning?

This article examines why automating caption generation is important before diving into how speech recognition and other elements combine to provide an accurate experience. This includes many of the behind-the-scenes aspects of how AI approaches the task of transcribing audio. The article concludes with a few tips to keep in mind when looking for a solution that automates closed captioning.

For more information on this topic, including how to succeed with automated closed captioning and what to expect, be sure to download our in-depth How Can AI Elevate Your Closed Captioning Solutions? white paper.

Why automated closed captioning is important

Without some form of automation, closed captioning is a very time-consuming process. Even for experienced captioners, the work can take roughly 5-10 times the length of the content. An hour-long video could therefore take anywhere from 5-10 hours to transcribe, essentially a full day of work. For organizations that produce a lot of content and also want to keep up with increasing regulations that mandate captions, such as the Americans with Disabilities Act and rules from the FCC, this presents a challenge.

This is where applying artificial intelligence to closed captioning is key. By using and training AI-based solutions, organizations can greatly reduce the time devoted to captioning, get assets ready to share with captions sooner, and manage larger volumes of content.


How speech recognition and auto closed captioning work

In overly simplified terms, AI creates closed captions through speech to text. A variety of elements go into this process, including automatic speech recognition (ASR). Many of them are aimed not just at producing captions, but at improving the accuracy of the final product as well. These concepts and technologies include:

  • Speech Recognition:
    The first step in ASR is receiving audio. From there, the AI can begin to work through the audio and match speech to a machine-readable format, i.e. text. Rudimentary offerings require that words be spoken very clearly to be recognized. More advanced AI can handle natural speech, accents and dialects, although accuracy will not be as high as it is for simple speech spoken very clearly.
  • AI Vocabulary:
    As part of the speech recognition process, artificial intelligence will try to match what it recognizes as speech against a vocabulary list of terms. AI can only transcribe words that it knows. If it’s not familiar with a term, it will try its best to link it to something in its vocabulary. For example, if the term “webinar” isn’t known, it might give a result like “weapons are” as the closest match.
  • Audio Recognition:
    Another aspect involves recognizing sounds and separating them from actual speech. This can be something like a crowd cheering, but can also be noises like a ball being hit or a player grunting as they trip. It’s important for the AI to know that not every sound is necessarily a word. It has to distinguish between actual language and noise.
  • Language Identification:
    While content will generally be in a single language, some content can be mixed. For example, a news program might shift from an announcer in English to an interview with someone speaking in Spanish. In those scenarios it’s very beneficial for the technology to detect and identify the language at any given time, realizing when the language has changed and switching to a list of words associated with that language. That said, the use cases for this can be minor. It’s rare that a content owner will want multiple languages represented in the same closed captions.
  • Diarization:
    Diarization is the ability to separate different speakers. For example, an interview will have multiple people speaking, sometimes one person asking questions and one or more people answering. Separating speakers can be important for handling different accents and dialects, where appropriate, to maintain accuracy. It can also help to break up captions by recognizing when a person starts and stops speaking, either to separate captions between different speakers or to add more appropriate punctuation as needed. As a much more advanced example, this could even be used to note the speaker and associate them by name.
  • Context:
    Are you looking for the “bare” necessities or the “bear” necessities? Did someone say “ate” or “eight”? Homophones (homo: “same” and phone: “sound”) are words that sound the same but carry different meanings. Homophones are not exclusive to one language, although English in particular has a lot of them, and they make transcription hard. To get them right, the AI has to use context to work out the subject. This can go beyond the context of an individual sentence too. For example, both “the kid was a minor” and “the kid was a miner” could be correct. However, because a kid is involved, the sentence is probably talking about age and not profession. In cases where the sentence alone doesn’t settle it, context from the content as a whole becomes valuable for the AI to lean on.
  • Audio Description:
    AI can look beyond verbal cues and take in visual cues as well, although this is a more complex exercise for an AI to employ for caption generation. It includes being able to understand concepts like someone walking up on stage or that it’s raining. This can be used both for greater context and to caption visual elements themselves.
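
To make the Context item more concrete, here is a minimal Python sketch of choosing between homophones based on surrounding words. It is only an illustration: the cue-word table and the pick_homophone function are hypothetical, and a real captioning system would learn these associations from a large library of examples rather than from a hand-written list.

```python
# Hypothetical cue words for each spelling. A real system would learn these
# associations from many examples instead of using a hand-written table.
HOMOPHONE_CUES = {
    "minor": {"kid", "child", "age", "young", "school"},
    "miner": {"coal", "gold", "mine", "tunnel", "dig"},
}


def pick_homophone(candidates, context_words):
    """Return the candidate whose cue words overlap most with the context.

    Ties, including the case of no overlap at all, fall back to the
    first candidate in the list.
    """
    context = {word.lower() for word in context_words}
    best, best_score = candidates[0], -1
    for candidate in candidates:
        score = len(HOMOPHONE_CUES.get(candidate, set()) & context)
        if score > best_score:
            best, best_score = candidate, score
    return best


# Context from the sentence itself: "the kid was a ..."
print(pick_homophone(["miner", "minor"], "the kid was a".split()))

# Context from the wider content can point the other way.
print(pick_homophone(["minor", "miner"], "he worked in the coal mine".split()))
```

The first call prints "minor" because "kid" is one of its cue words, while the second prints "miner" because "coal" and "mine" appear in the wider context.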


Tips for automating closed captioning

Automatic closed captioning is a powerful solution as industries generate more and more content. However, how effectively AI reduces the manual labor involved in captioning depends on accuracy. Below are some tips and considerations that can improve the overall accuracy of the final captions.

  • AI Training:
    AI can be trained to caption better. This can be manual training or self-training, such as AI vocabulary training and educating on context. On the topic of vocabulary, this can mean training on acronyms, new company names, or even the unique spelling of an individual’s name. Context can also be taught, and it takes a rich library of examples to improve. This includes managing homophones better, like the sun’s “rays” versus someone getting a “raise”, and other similar sounding words and terms.
  • Manual Editing:
    Automated closed captioning shouldn’t fully replace the human element of captioning. It’s still recommended to have someone review the generated transcripts for accuracy and preference. Examples include correcting a homophone or deciding that you want a statement to read “we grew our business by 88%” instead of “we grew our business by eighty eight percent”. Editing doesn’t have to be a one-time benefit either, as correcting a transcription can have long-standing benefits for training as well.
  • Language Selection:
    It’s important to be able to note the language of the asset for transcription. Limiting speech recognition so that it tries to match spoken terms against only words in English, or only words in Spanish, is vastly preferable. The reason is that it limits the pool of words, such as removing the possibility that someone said “chou” (French for cabbage) rather than “shoe”. This also leans on the concept that content owners will generally want a single language represented in each closed caption file rather than multiple languages.
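
The vocabulary training and language selection tips above can be sketched in a few lines of Python. This is a rough illustration under stated assumptions: the word lists are made up, the closest_term helper is hypothetical, and it compares spellings with the standard library's difflib, whereas a real speech recognizer matches on sound.

```python
import difflib


def closest_term(heard, vocabulary):
    """Return the vocabulary entry most similar in spelling to `heard`.

    cutoff=0.0 means some entry is always returned, mirroring how a
    recognizer is forced onto the nearest term it knows.
    """
    matches = difflib.get_close_matches(heard, vocabulary, n=1, cutoff=0.0)
    return matches[0]


# Hypothetical, tiny per-language vocabularies.
english = ["shoe", "show", "seminar", "weapons"]
french = ["chou", "chaud", "chat"]

# Selecting a language restricts the pool: with English selected,
# "chou" can never be the result, and vice versa.
print(closest_term("shoo", english))
print(closest_term("shoo", french))

# Before training, an unknown term is mapped to the nearest known word.
print(closest_term("webinar", english))

# After vocabulary training, the term itself is available as an exact match.
english.append("webinar")
print(closest_term("webinar", english))
```

Keeping one list per language is the design choice that matters here: the matcher never has to weigh an English word against a French one, which is the narrowing effect the Language Selection tip describes.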

For live content, there is also an opportunity to use live scripts as a reference source, ideally in a manner where the script isn’t used verbatim but as a guide to help with vocabulary choices during the speech recognition process. Because manual editing isn’t available for live content, anything that can boost accuracy can be incredibly beneficial. For more details on this, check out our Captioning at the Speed of Live for Accessible TV article.
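
One way to picture a script acting as a guide rather than a verbatim source is as a small scoring bonus. The Python sketch below is an assumption-laden illustration, not a description of any particular product: the rescore_with_script function, the confidence values, and the boost size are all hypothetical.

```python
def rescore_with_script(candidates, script_text, boost=0.15):
    """Pick the best candidate after nudging words that appear in the script.

    `candidates` maps each candidate word to a hypothetical recognizer
    confidence between 0 and 1. The script only adds a small bonus, so a
    clearly better recognition result still wins when the speaker goes
    off script.
    """
    script_words = {word.strip(".,!?").lower() for word in script_text.split()}
    rescored = {
        word: score + (boost if word.lower() in script_words else 0.0)
        for word, score in candidates.items()
    }
    return max(rescored, key=rescored.get)


script = "Welcome to our webinar on captioning."

# A close call: the script tips the choice toward the scripted term.
print(rescore_with_script({"weapons": 0.48, "webinar": 0.44}, script))

# Off script: the recognizer is confident, so the script does not override it.
print(rescore_with_script({"seminar": 0.90, "webinar": 0.30}, script))
```

The first call prints "webinar" and the second prints "seminar", which reflects the idea that the script helps with close vocabulary choices without being followed word for word.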



Growing content pipelines and increasing regulations create both a need to caption all content and a hurdle in doing so. Automating the process is a way to address this, reducing the manual labor and costs that were once associated with captioning an individual asset. Through training, this process can continue to evolve and improve over time as well, offering increasingly accurate captions after their initial generation.

Curious about how Watson Captioning addresses this? Be sure to register for our Auto Closed Captions & AI Training Webinar. This includes a live demo of the technology, looking at the workflow for managing the automated caption