Transcribing voice recordings means converting them into text so that they are easier to use. Seminars, meetings, and focus groups often contain meaningful discussions, and recordings of these events are frequently kept for future use. Converting such recordings into text provides several benefits. Text is a more flexible format that can be made available to anyone who needs it. It requires far less storage than audio, it is easy to share with others in various formats, and it is usually more accessible than audio.

How to Transcribe Speech Recordings to Text with Python

Methods for Transcribing Voice Recordings to Texts

There are several ways to transcribe an audio file into text: building a custom project, using an online platform, using a free transcription tool, or using a dedicated application. Dedicated applications are popular because they are quick. When time is limited, a reliable and efficient application saves both the time and the effort spent on the task. Commonly used transcription applications include Rev Voice Recorder, Temi Record and Transcribe App, and Rev Call Recorder. These applications offer features such as recording audio, uploading an audio file, and producing a text file.

How to Transcribe with Python

In Python, audio recordings can be transcribed with the SpeechRecognition library. Speech recognition is the ability of software to identify words and phrases in spoken language and convert them into human-readable text. There is no need to build a model from scratch, because the library provides wrappers for several public speech recognition APIs. Some of the well-known recognition engines it supports are the following:

  • Wit.ai
  • Google Cloud Speech API
  • IBM Speech To Text
  • Google Speech Recognition
  • CMU Sphinx
  • Microsoft Bing Voice Recognition
  • Snowboy Hotword Detection
  • Houndify API

Google Speech Recognition is used here as the example engine for transcribing the audio file.

Transcribing Speech to Text in Python

Voice assistants such as Siri, Alexa, and Google Assistant rely on speech recognition. In Python, the SpeechRecognition library provides the ability to convert audio into text. It can be applied to large files too, provided that they are first divided into chunks to maintain accuracy.


The following is a simple example of converting speech into text with the SpeechRecognition library in Python. It requires the SpeechRecognition and pydub packages, which can be installed with the following command:

pip3 install SpeechRecognition pydub

Working with Audio Files

The AudioFile class in SpeechRecognition makes it simple to work with audio files. The class takes the path to an audio file as an argument and provides a context manager interface for reading and working with the file’s contents.

Supported File Types

It accepts the following file types:

  • WAV: PCM/LPCM format is required.
  • AIFF.
  • AIFF-C.
  • FLAC: The format must be native FLAC; OGG-FLAC is not supported.

FLAC files should work without issue on x86-based Linux, macOS, or Windows. On other platforms, the user may need to install a FLAC encoder and make sure the flac command-line tool is accessible.
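
Whether a WAV file meets the PCM requirement can be checked before it is passed to AudioFile. The following is a minimal sketch that relies on Python's standard wave module, which reads only uncompressed PCM WAV data. The helper names is_pcm_wav and write_silent_wav are hypothetical and are not part of SpeechRecognition.

```python
import wave


def is_pcm_wav(path):
    """Return True if `path` opens as an uncompressed PCM WAV file."""
    try:
        with wave.open(path, "rb") as wav:
            # getcomptype() is "NONE" for uncompressed PCM data.
            return wav.getcomptype() == "NONE"
    except (wave.Error, EOFError):
        # Not a RIFF/WAVE file, a compressed WAV, or a truncated file.
        return False


def write_silent_wav(path, rate=16000, seconds=1):
    """Write 16-bit mono silence as a PCM WAV file (sample input only)."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)    # mono
        wav.setsampwidth(2)    # 2 bytes per sample = 16-bit PCM
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * rate * seconds)
```

For example, after write_silent_wav("sample.wav"), the call is_pcm_wav("sample.wav") returns True, while a file that is not WAV data at all returns False. The file name sample.wav is only a placeholder.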

Record() Function

The record() function of the Recognizer class is used to read audio data from a file.
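
A minimal sketch of this step is shown below. It assumes the SpeechRecognition package is installed; the helper name load_audio is hypothetical, while the names source and audio follow the description in this section.

```python
def load_audio(path):
    """Read a whole audio file into an AudioData object."""
    # Imported here so the sketch loads even where the third-party
    # SpeechRecognition package is not installed.
    import speech_recognition as sr

    r = sr.Recognizer()
    with sr.AudioFile(path) as source:  # the context manager opens the file
        audio = r.record(source)        # record() reads up to the end of the file
    return audio
```

Calling load_audio("recording.wav"), where recording.wav is a placeholder for a PCM WAV file, returns the object that is later passed to the recognizer.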

The context manager opens the file and reads its contents, storing the data in an AudioFile instance called source. The record() function then reads the whole file into an AudioData object. The user may confirm this by checking the type of the resulting object, here called audio.

>>> type(audio)
<class 'speech_recognition.AudioData'>

The recognize_google() method can then be used to recognize any speech in the audio. Depending on the internet connection speed, the user may have to wait a few seconds before the result appears.

The next step is to convert an MP3 file to WAV format so that it can be fed to the speech recognition system. Once the file is loaded, the system converts the speech into text and outputs the transcription of the original audio file. Here is the code to achieve this goal.

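import speech_recognition as sr
from pydub import AudioSegment
# Convert the MP3 file to WAV ("transcript.mp3" is a placeholder file name).
AUDIO_FILE = "transcript.wav"
AudioSegment.from_mp3("transcript.mp3").export(AUDIO_FILE, format="wav")
r = sr.Recognizer()
with sr.AudioFile(AUDIO_FILE) as source:
    audio = r.record(source)  # read the entire file into an AudioData object
    print("Transcription: " + r.recognize_google(audio))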

The user needs to save the above code in a file and run it to get the output as text. The code first reads the MP3 file and converts it to WAV format. It then reads the whole audio file into the variable audio and prints the entire transcription. The process works by uploading the audio to Google and retrieving the result. It works accurately for small or medium-sized files, but it is not suitable for transcribing large files.

Listen() Function

Microphone, like the AudioFile class, is a context manager. The listen() function of the Recognizer class inside the with block can be used to take input from the microphone. As its first parameter, this method takes an audio source and records input from the source until silence is detected.

>>> r = sr.Recognizer()
>>> mic = sr.Microphone()
>>> with mic as source:
...     audio = r.listen(source)

After executing the with block, try saying “hi” into the microphone, then wait for the interpreter’s prompt to appear again. When the “>>>” prompt returns, the recorded speech is ready to be recognized.

>>> r.recognize_google(audio)

If the prompt does not return, the microphone may be picking up too much ambient noise. The user can restore the prompt by interrupting the process with Ctrl+C.

To deal with ambient noise, the user can call the adjust_for_ambient_noise() method of the Recognizer class, just as when working with a noisy audio file. Because microphone input is far less predictable than audio file input, it is a good idea to do this every time the program listens for microphone input.

>>> with mic as source:
...     r.adjust_for_ambient_noise(source)
...     audio = r.listen(source)

After running the above code, wait a second for adjust_for_ambient_noise() to finish, then try saying “hello” into the microphone. Once again, the user must wait for the interpreter’s prompt to return before attempting to recognize the speech.

Remember that adjust_for_ambient_noise() analyzes the audio source for one second by default. If this is too long, it can be shortened with the duration keyword parameter.
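
As a sketch of how the duration parameter might be used, the function below calibrates for half a second before listening. It assumes a working microphone and the SpeechRecognition and PyAudio packages; the function name listen_once, its parameter name, and the check for a positive value are illustrative additions, not part of the library.

```python
def listen_once(calibration_seconds=0.5):
    """Calibrate for ambient noise, then record one phrase from the microphone."""
    if calibration_seconds <= 0:
        raise ValueError("calibration_seconds must be positive")
    # Imported here so the sketch loads without the third-party packages;
    # sr.Microphone additionally requires PyAudio.
    import speech_recognition as sr

    r = sr.Recognizer()
    with sr.Microphone() as source:
        # Analyze background noise for less than the default one second.
        r.adjust_for_ambient_noise(source, duration=calibration_seconds)
        return r.listen(source)
```

The returned AudioData object can then be passed to recognize_google() in the same way as before.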

Transcribing Large Audio Files

The accuracy of speech recognition drops when working with large audio files, and the Google speech recognition API does not transcribe large audio files with reasonable accuracy. One solution is to split a large audio file into smaller chunks and feed the chunks to the API one at a time, which improves the accuracy of the result.

Problem with Splitting Large Audio Files

Since splitting a large audio file into chunks is essential for accurate output, the first idea is to split the file into chunks of equal length. For example, a 20-minute audio file can be split into chunks that are each 10 seconds long. These chunks are then fed to the Google speech recognition API, which transcribes them into text.

Finally, the transcribed chunks are concatenated to form a complete transcription of the audio file. The problem with this method is that fixed-length boundaries ignore where the pauses fall in the original audio. A boundary can land in the middle of a word, so chunks may contain incomplete words, or essential words may be lost altogether. The API cannot detect this, and the overall transcription may be incorrect.
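
The effect of equal-length splitting can be illustrated without any audio library. The sketch below computes chunk boundaries in milliseconds for the 20-minute example and checks whether a word falls across a boundary. The function names and the word timings are hypothetical values chosen for illustration.

```python
def fixed_chunks(total_ms, chunk_ms):
    """Return (start, end) boundaries in ms for equal-length chunks."""
    return [(start, min(start + chunk_ms, total_ms))
            for start in range(0, total_ms, chunk_ms)]


def is_cut(word_start_ms, word_end_ms, chunks):
    """Return True if a chunk boundary falls strictly inside the word."""
    return any(word_start_ms < end < word_end_ms for _, end in chunks)


chunks = fixed_chunks(total_ms=20 * 60 * 1000, chunk_ms=10 * 1000)
print(len(chunks))                  # 120 chunks of 10 seconds each
print(is_cut(9800, 10300, chunks))  # True: the word spans the 10-second mark
print(is_cut(4000, 4500, chunks))   # False: the word lies inside one chunk
```

A word spoken from 9.8 s to 10.3 s is divided between the first and second chunks, so neither chunk contains the whole word.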

Splitting Based on Silence

Another method is to split the audio file based on silence: the pauses between sentences are detected, and the file is split at these pauses. The transcribed chunks are then concatenated to form the full transcription. With this method there is no need to split the file into equal-sized chunks, and sentences are far less likely to be cut in the middle, so each chunk tends to contain whole sentences without interruption.
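
The idea can be sketched on a simplified signal. The example below treats a recording as a list of loudness values, one per frame, and splits it wherever enough consecutive frames fall below a threshold. It is an illustrative simplification and not the algorithm that pydub uses; the function name, the threshold, and the sample values are assumptions.

```python
def split_on_pauses(levels_db, silence_thresh_db, min_silent_frames):
    """Split a recording into chunks separated by long enough pauses.

    levels_db: loudness of each frame in dB (more negative = quieter).
    Returns a list of (start, end) frame index pairs, end exclusive.
    """
    chunks = []
    start = None    # first frame of the chunk being built
    silent_run = 0  # consecutive quiet frames seen so far
    for i, level in enumerate(levels_db):
        if level < silence_thresh_db:
            silent_run += 1
            # Close the current chunk once the pause is long enough.
            if start is not None and silent_run == min_silent_frames:
                chunks.append((start, i - silent_run + 1))
                start = None
        else:
            if start is None:
                start = i
            silent_run = 0
    if start is not None:
        # Close the final chunk, trimming any short trailing pause.
        chunks.append((start, len(levels_db) - silent_run))
    return chunks


levels = [-10, -12, -50, -11, -50, -50, -50, -9, -10]
print(split_on_pauses(levels, -30, 3))  # [(0, 4), (7, 9)]
print(split_on_pauses(levels, -30, 1))  # [(0, 2), (3, 4), (7, 9)]
```

With a minimum pause of three frames, the single quiet frame at index 2 stays inside the first chunk. With a minimum of one frame, the same recording is split into three chunks, which shows how strongly the result depends on the chosen pause length.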

Problem with Splitting Based on Silence

One limitation of this method is that it is hard to determine how long the silence between sentences will be, because different people pause for different lengths of time. Some pause for a second, while others pause for more or less than that. The duration depends on the speaking habits of each speaker.


Following is an example of transcribing a large audio file into the textual format by splitting it based on the silence:

import speech_recognition as sr
from pydub import AudioSegment
from pydub.silence import split_on_silence
def silencebasedconversion(path="alice-medium.wav"):
    # Load the recording and split it wherever at least 400 ms of audio
    # stays below the -16 dBFS loudness threshold.
    song = AudioSegment.from_wav(path)
    chunks = split_on_silence(song, min_silence_len=400, silence_thresh=-16)
    r = sr.Recognizer()
    # Pad each chunk with 10 ms of silence so that words are not clipped.
    chunksilent = AudioSegment.silent(duration=10)
    with open("recognized.txt", "w+") as fh:
        for i, ck in enumerate(chunks):
            audio_chunk = chunksilent + ck + chunksilent
            filename = "chunk{0}.wav".format(i)
            print("saving " + filename)
            audio_chunk.export(filename, bitrate="192k", format="wav")
            print("Processing chunk " + str(i))
            try:
                with sr.AudioFile(filename) as source:
                    audio_listened = r.record(source)  # read the whole chunk
                rec = r.recognize_google(audio_listened)
                fh.write(rec + ". ")
            except sr.UnknownValueError:
                print("Audio is not understandable")
            except sr.RequestError:
                print("Could not request results. Check the internet connection")
if __name__ == '__main__':
    print('Please provide the audio file path')
    file_path = input()
    silencebasedconversion(file_path)