Speech to text software

Now that corpora of written English are becoming more familiar, the next logical step is to consider a corpus-informed approach to teaching and learning spoken English.

Corpora of spoken English

There are some spoken corpora available online (MICASE, at http://quod.lib.umich.edu/m/micase/, is a good example), but the key problem is how to get recorded speech into a text form that can be processed.

Individual speech to text transcription

If we are thinking about the notion of the ‘i-corpus’, then individuals can easily transcribe their own voice. You can buy Dragon NaturallySpeaking (see http://www.nuance.com/naturallyspeaking/products/editions/default.asp). If you train it to your own voice, the makers, Nuance, claim 99% accuracy. A friend of mine who uses it confirms that it does what it claims.
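If you want to check an accuracy figure like this against your own recordings, one common measure is word error rate (WER): the number of word substitutions, insertions and deletions needed to turn the software's output into a hand-checked transcript, divided by the length of that transcript. The Python sketch below is only an illustration of that idea. The function name and the sample sentences are made up for this post, and I am assuming (not asserting) that vendor accuracy claims are measured in roughly this way.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return the word error rate of `hypothesis` against `reference`.

    `reference` is the hand-checked transcript; `hypothesis` is what the
    speech to text software produced. Both names are illustrative.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must not be empty")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing from output
                curr[j - 1] + 1,     # extra word in the output
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Made-up example: one wrong word in a seven-word sentence.
    checked = "there are some spoken corpora available online"
    machine = "there are some spoken corpus available online"
    wer = word_error_rate(checked, machine)
    print(f"Word error rate: {wer:.1%}")
    print(f"Word accuracy:   {1 - wer:.1%}")
```

On this measure, a 99% accuracy claim would correspond to roughly one wrong word in every hundred. The sketch ignores punctuation and treats words case-insensitively, so a real comparison would need a little more care in preparing the two transcripts.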

General speech to text software

However, if you want to transcribe a collection of recordings of different people, made at varying levels of quality, you might get something vaguely usable, but it would have to be checked and edited. Research in this area is being done in the open source community. One example is CMU Sphinx (http://cmusphinx.sourceforge.net/), developed at Carnegie Mellon University. In fact, this has spawned a Reading Tutor, which will listen to a child reading a text and point out any errors in pronunciation and stress: http://www.cs.cmu.edu/~listen/.

Archives of spoken English

Aligned to this work are researchers who are attempting to build archives of recorded speech, which can then be used as a basis for testing speech to text software. One of these is VoxForge (http://www.voxforge.org/). Another interesting area of research is based on accents: read the Guardian article about this at http://www.guardian.co.uk/education/2010/jun/01/english-accents-research?&CMP=%20EMCEDUEML1088. If you want to contribute your own voice recording to the database of accents, you can record yourself at http://accent.gmu.edu/.
