Video Transcription: Automated Audio to Text

Transcribing audio can be a slow process. For those looking to scale or speed up video transcription, one solution is automated audio to text. This uses AI (artificial intelligence) to transcribe speech by combining what is heard with information about grammar and language structure. Using this technology, content owners can start generating transcripts simply by uploading a file.

Note: some of these topics were covered in our Simplify Your Corporate Video Strategy webinar, with the archived version available for immediate viewing.

Automatically transcribing audio

Automatic speech to text relies heavily on machine learning. This is a complex process that needs to separate dialogue from other noises in order to focus on transcribing the spoken language. It then needs to decipher not only what the speaker said but the meaning behind it as well. For example, there is a big difference between someone saying “we have a mail problem” versus “we have a male problem” even though both sound the same.

Consequently, the success of a transcription service depends on the strength of the AI behind it. IBM’s video streaming and enterprise video streaming offerings use IBM Watson technology for this, and the feature is included in the service at no additional cost. Content owners simply upload their video content and select a language to start the transcription process. The language can be set manually, or the service can be set up to automatically transcribe content uploaded in the future. Once the language is set, creating the transcription takes roughly as long as the asset itself, so a 20-minute video takes roughly 20 minutes to transcribe. The transcription is then associated with the video file. It can be used on IBM’s video streaming and enterprise video streaming platforms or downloaded as a WebVTT text file, which can be viewed in any text editor or uploaded to services that recognize the WebVTT format.

Here is a guide for using Watson to generate captions over IBM Watson Media. Alternatively, you can sign up and start uploading files to test it out.

Benefits of automated video transcription

Manually transcribing audio is a time-consuming and, arguably, painful process. It often involves slowly listening to the audio again and again to write down the dialogue. How long it takes to transcribe content is up for debate, largely because it depends on the individual doing it. One rough estimate puts an hour of audio at a minimum of 4 hours to transcribe, with the caveat that it could take 6-8 hours for most people and even 8-10 hours if many individuals are talking in the audio.

Expecting an employee to devote an entire work day to transcribing an hour of audio can be an unreasonable, if not prohibitively costly, ask. As a result, automated video transcription that converts audio to text can be a huge benefit. It can vastly cut down on the time needed to manage video libraries.
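As a rough illustration of that difference, the sketch below applies the ratios quoted above (4, 6-8 and 8-10 hours of manual work per hour of audio) alongside the approximate one-to-one processing time of automated transcription. The function name and labels are made up for this example, and the automated figure is processing time only: it does not include the time needed to review the result.

```python
# Hours of manual effort per hour of audio, taken from the rough estimate above.
MANUAL_RATIOS = {
    "minimum": 4,
    "typical_low": 6,
    "typical_high": 8,
    "many_speakers_high": 10,
}

# Automated processing takes roughly the length of the asset (assumed ratio of 1).
AUTOMATED_RATIO = 1


def estimate_transcription_hours(audio_hours: float) -> dict:
    """Return rough hour estimates for transcribing `audio_hours` of audio."""
    estimate = {
        label: audio_hours * ratio for label, ratio in MANUAL_RATIOS.items()
    }
    estimate["automated_processing"] = audio_hours * AUTOMATED_RATIO
    return estimate


if __name__ == "__main__":
    # Example: a library holding 25 hours of recorded video.
    for label, hours in estimate_transcription_hours(25).items():
        print(f"{label}: {hours:g} hours")
```

For a 25-hour library, this puts manual transcription at 100 hours at the very least, against roughly 25 hours of unattended processing.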

Uses for video transcriptions

There are two large benefits inherent to video transcripts, one more visible to end users and one less visible.

Video captions

The first and more visible of these benefits is providing closed captions for your video content. Closed captions are a very important aspect of a video strategy. Overtly, they aid in reaching those who are deaf or hard of hearing, an audience that is estimated to be 15% of the American adult population. There are also a variety of regulatory and legal reasons for providing closed captions. These vary greatly by country, state and industry and include regulations such as the Workforce Rehabilitation Act and the Americans with Disabilities Act. For a more complete list of regulations, reference this What is Closed Captioning and How Does it Work article. Beyond accessibility and legal reasons, however, there is also a growing preference for watching content muted. In fact, Facebook discovered that a staggering 85% of video content on their platform was watched muted. As a result, closed captions are crucial for providing context for this growing number of users watching content with no sound.

To learn more about using automation for captions, reference this Convert Video Speech to Text with Watson article.

Searchable transcripts

A less obvious benefit of transcripts is increasing the discoverability of video assets. Making content easy to find becomes a more crucial problem to solve as video archives grow. Many executives are already feeling this pain point. In fact, 79% state that a “frustration of using on-demand video is not being able to quickly find the piece of information I am looking for when I need it” as noted in the Unlocking the Hidden Value of Business Video report. While uniform metadata, such as a description and tags, should help in finding assets, it is far from perfect. Consequently, allowing end users to search against transcripts can be a great way to unearth relevant content. For example, let’s say an executive is presenting a forecast for the new year. The presentation covers goals and projections, and chances are the metadata will reflect this. However, let’s say the executive also recaps last year’s performance. This could be valuable information that the metadata leaves out. Being able to search against the transcript would ideally unearth it for the end user regardless.
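To make the idea concrete, here is a minimal sketch of searching a transcript for a phrase and returning the point in the video where it was said. It works on the text of a WebVTT file and is not how IBM's platform search is implemented; the function name `search_transcript` and the sample cue are assumptions for illustration only.

```python
import re
from typing import List, Tuple

# Matches a WebVTT cue timing line and captures the start time,
# e.g. "00:12:30.000 --> 00:12:34.000" (the hours part is optional).
TIMING = re.compile(r"^((?:\d{2,}:)?\d{2}:\d{2}\.\d{3})\s+-->")


def search_transcript(vtt: str, query: str) -> List[Tuple[str, str]]:
    """Return (start_time, cue_text) for every cue that contains `query`."""
    needle = query.casefold()
    hits = []
    # Cues are separated by blank lines in a WebVTT file.
    blocks = re.split(r"\n\s*\n", vtt.replace("\r\n", "\n").strip())
    for block in blocks:
        lines = block.splitlines()
        for i, line in enumerate(lines):
            match = TIMING.match(line.strip())
            if match:
                # Everything after the timing line is the cue text.
                text = " ".join(part.strip() for part in lines[i + 1:])
                if needle in text.casefold():
                    hits.append((match.group(1), text))
                break
    return hits
```

In the forecast example, a search for "last year" would return the timestamp of the recap even though the video's description and tags never mention it.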

To learn more about using transcripts for search, reference this Enterprise Video Search & Discoverability article.

Local copies

After a transcript is generated, it can be downloaded as a file in the WebVTT format. This file can be opened in Notepad, TextEdit and other text editors. While WebVTT is intended as a closed caption format, the file can also act as a locally stored transcription. It can then be used with other programs or services, or simply serve as an easy way to copy lines of text for other uses.
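Because a WebVTT file mixes the spoken text with timing lines, a small script can help when only the text is wanted. The following is a minimal sketch, assuming a simple file made of a header, optional cue identifiers, timing lines and cue text; the function name `vtt_to_text` is made up for this example, and it does not aim to cover every feature of the format.

```python
import re

# A cue timing line looks like "00:00:01.000 --> 00:00:04.000".
TIMING = re.compile(
    r"^(\d{2,}:)?\d{2}:\d{2}\.\d{3}\s+-->\s+(\d{2,}:)?\d{2}:\d{2}\.\d{3}"
)
# Inline markup such as <b>...</b> or <v Speaker> is dropped from the text.
TAG = re.compile(r"<[^>]+>")


def vtt_to_text(vtt: str) -> str:
    """Return only the spoken text of a WebVTT file as one string."""
    spoken = []
    in_cue = False
    for line in vtt.splitlines():
        stripped = line.strip()
        if TIMING.match(stripped):
            in_cue = True   # the lines that follow are cue text
        elif not stripped:
            in_cue = False  # a blank line ends the current cue
        elif in_cue:
            spoken.append(TAG.sub("", stripped))
    return " ".join(spoken)
```

Reading a downloaded file and passing its contents to `vtt_to_text` yields a plain paragraph that can be pasted into a document or email.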

Considerations when using automated audio to text

While manual transcription can have inaccuracies, automated processes are much more prone to them. So while automating transcription can save a tremendous amount of time, the resulting transcripts should be checked and edited for accuracy. That said, if an organization finds itself strapped for time, reference the list below. It notes variables that negatively impact speech to text accuracy. As a result, it can be used to prioritize review of the assets that have the most potential for errors.
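One simple way to act on this is to count how many of the variables covered in this article apply to each asset and review the highest counts first. The sketch below is only an illustration: the `Asset` class and its field names are hypothetical, someone still has to tag each asset by hand, and treating all four variables as equally important is an assumption rather than a measured weighting.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Asset:
    """A video asset tagged with the variables that lower accuracy."""
    title: str
    multiple_speakers: bool = False
    poor_audio: bool = False
    strong_accent: bool = False
    specialized_terms: bool = False

    def risk_count(self) -> int:
        # Each variable counts equally (an assumption for this sketch).
        return sum([
            self.multiple_speakers,
            self.poor_audio,
            self.strong_accent,
            self.specialized_terms,
        ])


def review_order(assets: List[Asset]) -> List[Asset]:
    """Sort assets so those with the most risk variables come first."""
    return sorted(assets, key=lambda asset: asset.risk_count(), reverse=True)
```

A panel discussion recorded outdoors and full of product names would land at the top of the list, while a narrated demo recorded in a studio would land at the bottom.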

Variables that lower automated transcription accuracy

Some factors can lead to a notable decrease in accuracy for automated transcription processes. Many of these factors are ones that would also inhibit the manual creation of an audio transcript. So those familiar with the process should be well versed in these pain points.

Multiple speakers

It can be hard for someone manually transcribing to keep track of multiple speakers. Part of the problem is that dialogue generally moves faster when more people are involved. The difficulty for automated processes, though, comes from moments when people interrupt each other. Overlapping dialogue will be a major roadblock. In these instances it takes manual judgement to decide how to transcribe the exchange. For example, an editor might include all of the dialogue or choose to transcribe just the dominant voice instead.

Audio quality and ambient noise

The quality of the audio will also impact transcription accuracy. Overly compressed or muffled audio is a common culprit. Content owners should be avoiding this anyway, unless they have inherited old or poor transfers and have to use them. Another aspect is ambient noise. Outdoor recordings, or those with a noticeable echo, can also be problematic for automated transcription. Soundtracks with vocals that aren’t intended to be transcribed can pose a problem as well.


Accents

A speaker with a thick accent, which makes their speech hard to understand, is a perfect use case for closed captions, because captions can clarify what the speaker was saying. Unfortunately, machine learning is not a silver bullet for handling a thick accent. If an audience struggles to understand the individual, so will artificial intelligence.

Subject complexity

While technology like IBM Watson can navigate technical terms and even acronyms, it can struggle with industry terms and names. This includes the names of individuals, but is especially true for products and brands with unusual spellings. Well-known examples include Reddit, Flickr and even Krispy Kreme Doughnuts. These should be spot checked as part of the review process.
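Part of that spot check can be scripted. The sketch below scans transcript lines for spellings that a speech to text engine might plausibly produce in place of a brand name and reports where they occur. The mapping is purely illustrative: these are guesses at likely mishearings, not observed Watson output, and a match is only a prompt for a human to look at the line, since "read it" is often exactly what the speaker said.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical mishearings mapped to the brand that may have been intended.
LIKELY_MISHEARINGS: Dict[str, str] = {
    "flicker": "Flickr",
    "crispy cream": "Krispy Kreme",
    "read it": "Reddit",
}


def flag_for_review(
    lines: List[str],
    variants: Dict[str, str] = LIKELY_MISHEARINGS,
) -> List[Tuple[int, str, str]]:
    """Return (line_number, found_text, possible_brand) for each match."""
    flagged = []
    for number, line in enumerate(lines, start=1):
        for variant, intended in variants.items():
            # Whole-phrase, case-insensitive match.
            pattern = r"\b" + re.escape(variant) + r"\b"
            if re.search(pattern, line, flags=re.IGNORECASE):
                flagged.append((number, variant, intended))
    return flagged
```

An organization would replace the mapping with its own product, brand and personal names, building it up from the corrections made during earlier reviews.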

Thankfully, in the case of the latter issue, Watson can be trained on both industry and brand terms, and this training can then be applied to live captioning. This improves accuracy while also supporting live streaming use cases. Contact IBM sales to learn more.


Audio transcription is a valuable but potentially time-consuming process. Through artificial intelligence, organizations can better scale their transcription efforts. This means less time spent transcribing and the ability to tackle larger portfolios. Ultimately, this benefits end users as well, who gain access to captioned content and can search for and find assets more easily.

Interested in trying out the automated transcription feature? Sign up for a free trial and start uploading files to take advantage of the automated audio to text capabilities.