Category Archives: lexical tools

Speech to text software

As we move more into the world of corpora of written English, the next logical step is to consider a corpus-informed approach to teaching and learning spoken English.

Corpora of spoken English

Some spoken corpora are available online (http://quod.lib.umich.edu/m/micase/ is a good example), but the key problem is how to get recorded speech into a text form that can be processed.

Individual speech to text transcription

If we are thinking about the notion of the ‘i-corpus’, then individuals can easily transcribe their own voice. You can buy DRAGON (see http://www.nuance.com/naturallyspeaking/products/editions/default.asp); if you train it to your own voice, the vendor claims 99% accuracy. A friend of mine who uses it confirms that it does what it claims.

General speech to text software

However, if you want to transcribe a collection of recordings of different people, made at varying quality, you might get something vaguely usable, but it would have to be checked and edited. Research in this area is also under way in the open source community. One example is http://cmusphinx.sourceforge.net/, developed at Carnegie Mellon University. In fact, this has spawned a READING TUTOR, which listens to a child reading a text and points out any errors in pronunciation and stress: http://www.cs.cmu.edu/~listen/.

Archives of spoken English

Alongside this work, researchers are attempting to build an archive of spoken corpora, which can then be used as a basis for testing speech to text software. One of these is http://www.voxforge.org/. Another interesting area of research is based on accents; read the Guardian article about this at http://www.guardian.co.uk/education/2010/jun/01/english-accents-research?&CMP=%20EMCEDUEML1088. If you want to contribute your own voice recording to the database of accents, just go here and record yourself: http://accent.gmu.edu/

Just the Word & WORDLE… a match made in [lexical] heaven

Just the Word (http://193.133.140.102/JustTheWord/) is gaining popularity with practitioners as well as researchers.  WORDLE is a wonderful graphic interface to illustrate corpus frequency statistics.  Few people are aware of the ADVANCED feature on WORDLE and how to ‘mash up’ input from a site like Just the Word.

Here is an example WORDLE based on high frequency collocates of RESEARCH using the pattern analysis of the BNC from Just the Word.  I replaced the root RESEARCH with a bullet to make it less cluttered.

Wordle: Research collocates

And here is how I did this:

WORDLE has an ‘advanced’ button (top right) that takes you to http://www.wordle.net/advanced – from here, you can specify not only the ‘size’ of the words, but also the colour.

For example, from Just the Word I generated the collocates of ‘RESEARCH’.  I then did a little Excel ‘magic’ and sorted all the collocates by pattern, and filtered within the frequency range of 100 to 1000 (to produce a reasonable wordle not dominated by one or two really high frequency items).  I then selected a different colour for each PATTERN.  Because RESEARCH was the common root, I replaced it with a ‘bullet’ to make the graphic less dominated by the repeated word.  I then put the data into the ADVANCED feature.  See http://www.wordle.net/show/wrdl/2168943/Research_collocates

Here is the original filtered data from JTW.  (I copied the JTW output, put it into EXCEL and then executed a few formulae to repeat the PATTERN and cluster data.)

COLLOCATION FREQUENCY CLUSTER PATTERN
carry out research 155 cluster 1 V obj *research*
conduct research 132 cluster 1 V obj *research*
undertake research 122 cluster 2 V obj *research*
do research 358 cluster 3 V obj *research*
research show 380 cluster 1 *research* subj V
research suggest 131 cluster 1 *research* subj V
research have 745 cluster 4 *research* subj V
recent research 171 cluster 1 ADJ *research*
further research 190 cluster 9 ADJ *research*
more research 115 cluster 9 ADJ *research*
medical research 242 cluster 9 ADJ *research*
much research 102 cluster 9 ADJ *research*
own research 153 cluster 9 ADJ *research*
scientific research 240 cluster 9 ADJ *research*
social research 182 cluster 9 ADJ *research*
such research 111 cluster 9 ADJ *research*
market research 425 cluster 1 N *research*
Cancer research 114 cluster 2 N *research*
research into 708 cluster 2 *research* PREP
research on 644 cluster 2 *research* PREP
research in 840 cluster 2 *research* PREP
research by 164 cluster 2 *research* PREP
research at 151 cluster 2 *research* PREP
research department 103 cluster 1 *research* N
research group 205 cluster 1 *research* N
research institute 214 cluster 1 *research* N
research team 151 cluster 1 *research* N
research unit 178 cluster 1 *research* N
research study 135 cluster 2 *research* N
research work 132 cluster 2 *research* N
research method 141 cluster 3 *research* N
research programme 316 cluster 3 *research* N
research project 482 cluster 3 *research* N
research grant 185 cluster 5 *research* N
research council 446 cluster 7 *research* N
research center 344 cluster 7 *research* N
research finding 128 cluster 7 *research* N
research laboratory 189 cluster 7 *research* N
research student 137 cluster 7 *research* N
result of research 117 cluster 4 N PREP *research*
center for research 109 cluster 5 N PREP *research*
research and development 359 cluster 1 *research* and N
our research 148 cluster 1 article *research*
some research 140 cluster 1 article *research*
this research 262 cluster 1 article *research*
their research 171 cluster 1 article *research*
my research 111 cluster 1 article *research*
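
For anyone who would rather script this step than do it in Excel, here is a minimal Python sketch of the same idea. It assumes each row looks like the ones above (phrase, frequency, the word "cluster" plus a number, then the pattern). The names `ROWS`, `ROW_RE`, `parse_rows` and `in_range` are my own for illustration and are not part of Just the Word; only a few sample rows from the table are included.

```python
import re

# A few rows copied from the filtered table above (not the full data set).
ROWS = """\
carry out research 155 cluster 1 V obj *research*
research have 745 cluster 4 *research* subj V
market research 425 cluster 1 N *research*
research in 840 cluster 2 *research* PREP
"""

# Assumed row layout: phrase, frequency, "cluster <n>", pattern label.
ROW_RE = re.compile(
    r"^(?P<phrase>.+?) (?P<freq>\d+) cluster (?P<cluster>\d+) (?P<pattern>.+)$"
)


def parse_rows(text):
    """Turn pasted rows into dictionaries; lines that do not fit are skipped."""
    records = []
    for line in text.splitlines():
        match = ROW_RE.match(line.strip())
        if match is None:
            continue  # e.g. the header line or a blank line
        records.append({
            "phrase": match["phrase"],
            "freq": int(match["freq"]),
            "cluster": int(match["cluster"]),
            "pattern": match["pattern"],
        })
    return records


def in_range(records, low=100, high=1000):
    """Keep only collocates whose frequency lies between low and high."""
    return [r for r in records if low <= r["freq"] <= high]


records = in_range(parse_rows(ROWS))
for r in records:
    print(r["pattern"], "|", r["phrase"], "|", r["freq"])
```

The 100 to 1000 range is the same one used for the filtering described earlier; change `low` and `high` to suit your own data.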

Here is the data coded for WORDLE, which I pasted into the ADVANCED feature of WORDLE. (The number is the FREQUENCY, and the HEX value is the HTML colour code.) Note that I’ve replaced the word RESEARCH with a bullet.

carry out•:155:4411AA
conduct•:132:4411AA
undertake•:122:4411AA
do•:358:4411AA
•show:380:00FF48
•suggest:131:00FF48
•have:745:00FF48
recent•:171:6280AA
further•:190:6280AA
more•:115:6280AA
medical•:242:6280AA
much•:102:6280AA
own•:153:6280AA
scientific•:240:6280AA
social•:182:6280AA
such•:111:6280AA
market•:425:62FF48
Cancer•:114:62FF48
•into:708:6280FF
•on:644:6280FF
•in:840:6280FF
•by:164:6280FF
•at:151:6280FF
•department:103:0080FF
•group:205:0080FF
•institute:214:0080FF
•team:151:0080FF
•unit:178:0080FF
•study:135:0080FF
•work:132:0080FF
•method:141:0080FF
•programme:316:0080FF
•project:482:0080FF
•grant:185:0080FF
•council:446:0080FF
•center:344:0080FF
•finding:128:0080FF
•laboratory:189:0080FF
•student:137:0080FF

Neat, eh?
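
If Excel formulae are not your thing, the coding step could be sketched in Python along these lines. This is only a rough illustration: the colour codes per pattern are the ones used in the coded data above, while the function name `to_wordle_line` and the fallback colour `000000` for patterns without a colour are my own assumptions.

```python
import re

BULLET = "\u2022"  # the bullet character that stands in for the root word

# One colour per pattern, taken from the coded data above.
PATTERN_COLOURS = {
    "V obj *research*": "4411AA",
    "*research* subj V": "00FF48",
    "ADJ *research*": "6280AA",
    "N *research*": "62FF48",
    "*research* PREP": "6280FF",
    "*research* N": "0080FF",
}


def to_wordle_line(phrase, freq, pattern, root="research", default_colour="000000"):
    """Build one 'text:weight:colour' line for the ADVANCED box.

    The root word and the space next to it are replaced by a bullet,
    so 'carry out research' becomes 'carry out' plus a bullet.
    default_colour is a hypothetical fallback for uncoloured patterns.
    """
    root_re = rf"\s*\b{re.escape(root)}\b\s*"
    label = re.sub(root_re, BULLET, phrase)
    colour = PATTERN_COLOURS.get(pattern, default_colour)
    return f"{label}:{freq}:{colour}"


sample = [
    ("carry out research", 155, "V obj *research*"),
    ("research show", 380, "*research* subj V"),
    ("market research", 425, "N *research*"),
]
for phrase, freq, pattern in sample:
    print(to_wordle_line(phrase, freq, pattern))
```

Each printed line can then be pasted into the ADVANCED box in the same way as the hand-made data.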

“keep a blog” or “have a blog”

Ever wonder about which phrase or word is more commonly used? Let GOOGLEFIGHT sort it out for you.

See which is more common: “keep a blog” or “have a blog”.

Neat, eh? Has some interesting applications with students.

NOTE: If you compare phrases that have different numbers of words, then you will need to put the phrases in quotes.  Thanks to Gavin Dudeney for this little gem.