We hear from a lot of potential customers looking for software that will automatically transcribe their audio or video files. Unfortunately, we have to disappoint them. As much as we would love to provide such software, speech recognition technology simply hasn't reached the point where automatic transcription of audio or video recordings can match, let alone surpass, the accuracy of a (good) human transcription. One reason is that speech is incredibly complex: accents, enunciation, pitch, and tone of voice all vary, making it hard to match spoken words to written ones.
Human transcribers also have the luxury of deciding how closely they're going to transcribe a given audio or video file (or voicemail). For example, they can:
- Transcribe verbatim, including "ums" and repeated words such as "like, like," and even note non-verbal cues such as laughter and sighs
- Skip over the "ums" and pauses while transcribing (which is what I decided to do)
- Transcribe only the relevant parts of the message.
Commercially available speech-to-text software, such as Dragon, generally works best if you "train" it to a specific voice. Even then, users usually have to listen to the audio they wish to transcribe and re-speak what they hear so the software can transcribe it. A further step is then needed: proofreading the transcription and correcting the errors, of which there are usually many, regardless of your mic quality or your ability to speak like a news anchor with absolutely no accent.
Progress is being made on cracking the "speech-to-text" nut. Some voicemail providers offer automatic speech-to-text transcription of incoming voicemail. Apple's Siri is another step towards instant voice-to-text, but its accuracy is still well below what is acceptable for transcription.
In the end, unlike computers, humans can compensate, at least to a degree, for another person's mumbling, for poor audio quality, and for other problems that affect the clarity of the speech being transcribed.