i-corpus

Objectives

These workshop notes introduce some basic tools and principles that are key to understanding and applying an ‘i-corpus’ approach to learning and language development. By participating in this workshop, you will be able to:

  1. understand the nature of word frequency in corpus approaches
  2. generate graphic and statistical analyses of word frequency from your own corpus
  3. create and analyze word patterns using the British National Corpus
  4. install free concordancing software called ANTCONC and
    • analyze a sample student i-corpus with ANTCONC by creating wordlists and KWIC (key word in context) concordance extracts
    • use an academic reference corpus to conduct a keyword analysis of the sample student i-corpus
  5. consider the ramifications of such a corpus-informed approach within the context of an ELT programme of study.

1. Test your knowledge of English word frequency

As English language teachers, we have a built-in awareness of words that will present difficulties to our students.

  • Our ability to guide our students in their language development is also based on determining which words have a bigger return on investment in learning.
  • Research shows that students need to know the 3,000 most common words of English before they can really make progress in learning English. But how well can we actually pick out the words that are ‘rare’ or ‘common’?

Take this ‘frequency trainer’ test. If you can guess the correct frequency bands of ten randomly chosen words in three tries, you’re doing quite well!

2. Create a FREQUENCY graphic from WORDLE

When we read a text, we may not appreciate the fact that over 80% of the words in virtually any text come from a small band of about 2,000 of the most commonly used words.

  • In fact, about 45% of any text is made up of about 100 ‘function’ words that have little to no meaning in themselves, but provide for the cohesion, coherence, referencing and rhetoric in communicating the message of the text.
  • Knowing the most common words that make up 95% of the text gives students the ability to guess the remaining 5% of unknown words. Research shows that it is difficult, if not impossible, to guess the meaning of an unknown word if you don’t know at least 19 of the 20 words surrounding it.
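The coverage idea in the points above can be explored on any single text with a few lines of Python. This is only a minimal sketch: the function names `tokenize` and `coverage_of_top_words` and the sample sentence are made up for illustration, and it ranks the words of the text itself rather than using a general frequency list of English, so the percentages will differ from the figures quoted above.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z]+(?:['’][a-z]+)?", text.lower())

def coverage_of_top_words(text, top_n):
    """Share of running words accounted for by the top_n most frequent word forms."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    covered = sum(count for _, count in counts.most_common(top_n))
    return covered / len(tokens)

# Hypothetical sample text; paste in a news story to try it for real.
sample = "the cat sat on the mat and the dog sat on the rug"
print(Counter(tokenize(sample)).most_common(3))
print(f"{coverage_of_top_words(sample, 3):.0%} of the text comes from its 3 most frequent words")
```

Even in this tiny sample, three word forms account for well over half of the running words, which is the same effect that WORDLE makes visible.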

Displaying a text as a frequency count is a powerful way to highlight this property of word frequency, which most students are not aware of. This can be done with a tool called WORDLE: simply copy and paste any text into an input box on a web page and press a button.

  • Try it out: visit http://news.bbc.co.uk/ and pick out a news story you would like to use as a reading text with your students. Highlight and copy the text.
  • Then go to http://wordle.net and click CREATE. Paste the text into the input box and click GO.
  • Play with the LAYOUT, MAXIMUM WORDS, and COLOUR features.
  • You can save the graphic by using PRINTSCREEN (or the SNIPPING tool in Windows 7) or you can save and publish it with your own USERNAME, so you can link to it later.

Here is an example based on this news story: http://news.bbc.co.uk/2/hi/world/us_and_canada/10398955.stm. From the resulting WORDLE below, can you speculate on the content of the article?
Wordle: Afgan policy shift

3. Create and analyze a VOCABULARY PROFILE from LEXTUTOR

While WORDLE gives a wonderful graphical representation of the word frequency profile of a text, it does not tell us which of the content words belong to the band of most commonly used words.

  • This information is critical if we want to use the text for teaching purposes.  You can get a statistical analysis of the vocabulary profile from Tom Cobb’s Lextutor site.  For this workshop, let’s use the BNL2709 Vocabulary Profiler.
  • This site is a bit ‘scary’ at first sight, but don’t let that put you off.  Again, copy and paste the news article in the text input box and press SUBMIT.

You will see a statistical analysis of the text according to the words from the most commonly used 2,709 words in English, plus a colour coded view of the text, side-by-side with the original.

  • The RED words are the ‘off list’ words, i.e., words that are not frequently used in English (they appear fewer than 10 times in one million words). Often these words are proper nouns.

Of course, just because a word is not common does not mean our students won’t know it.  We can get the software to deal with words that we know our students know.

  • Using the Afghan news article, the initial statistical analysis shows 77% of the text coming from the most commonly used 2,709 words in English. However, if we assume that students know the names of people and places, we can RE-VP and ‘RE-CATEGORIZE AS KNOWN’ the following proper nouns:

nato gen barack obama taliban david petraeus stanley mcchrystal afgan afganistan nick parker hamid karzai kandahar mardell iraq kabul anders fogh rasmussen bbc james jones pakistan richard holbrooke

  • If we do, then the text coverage of the most commonly used words rises to 90%. Selecting a few words to preteach, based on frequency, can then raise the proportion of words in the text that a student would know to the 95% threshold.
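The effect of re-categorizing proper nouns as known can be sketched in Python. This is not how Lextutor itself works (it profiles word families against the full BNL2709 list); the tiny `frequent_words` set, the `proper_nouns` set and the sample sentence below are hypothetical stand-ins chosen only to show the arithmetic.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def known_word_coverage(text, known_words):
    """Proportion of running words in the text that appear in known_words."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    known = sum(1 for token in tokens if token in known_words)
    return known / len(tokens)

# Hypothetical stand-in for a high-frequency word list.
frequent_words = {"the", "said", "policy", "in", "would", "not", "change"}
# Names we assume students already know.
proper_nouns = {"obama", "petraeus", "afghanistan"}

sentence = "Obama said the policy in Afghanistan would not change"
before = known_word_coverage(sentence, frequent_words)
after = known_word_coverage(sentence, frequent_words | proper_nouns)
print(f"coverage before: {before:.0%}, after re-categorizing names: {after:.0%}")
```

The same comparison, run with a real frequency list and a full article, shows how much of the apparent difficulty of a news text comes from names rather than from genuinely rare vocabulary.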

4. Probe the PATTERNS and COLLOCATES of a word in JUST THE WORD

In the Lextutor BNL2709 profile output, you can scroll down to see the sets of words in each band. These are often the words that students would benefit most from working on further.

bnl-3 [ fams 14 : types 15 : tokens 18 ] approach challenge challenges challenges committed considerable disappointed four informed internal nine policy policy policy regret stressed tough vehicle

bnl-4 [ fams 12 : types 14 : tokens 15 ] agencies agency announcement comments critical criticised criticisms debate ensure insult june june perspective resigned shifts

bnl-5 [ fams 10 : types 10 : tokens 24 ] administration august confirmed editor insisted insisted mr mr mr mr mr mr mr quoted security strategy strategy strategy strategy strategy strategy strategy wednesday wednesday

bnl-6 [ fams 5 : types 6 : tokens 12 ] magazine magazine mission personnel personnel senior senior troop troop troop troops troops

Words that appear frequently in the text are repeated in these lists, and these are the ones that could be exploited most easily. In this article, a few words stand out: STRATEGY, CHALLENGE, CRITIC*, TROOP.
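The ‘types’ and ‘tokens’ figures in each band header, and the repeated words worth exploiting, can be checked with a short Python sketch. The helper name `band_summary` is made up for illustration; the word list is the bnl-6 band copied from the output above. Word families (‘fams’) are not computed here, since that needs a list of which forms belong together.

```python
from collections import Counter

def band_summary(band_words):
    """Return (number of types, number of tokens, repeated words with counts)."""
    tokens = band_words.split()
    counts = Counter(tokens)
    repeated = {word: n for word, n in counts.items() if n > 1}
    return len(counts), len(tokens), repeated

# The bnl-6 band as listed in the profile output.
bnl6 = ("magazine magazine mission personnel personnel senior senior "
        "troop troop troop troops troops")

types, tokens, repeated = band_summary(bnl6)
print(f"types {types} : tokens {tokens}")  # types 6 : tokens 12
print(repeated)
```

The counts agree with the header for that band, and the repeated items (such as ‘troop’ and ‘troops’) are the candidates for follow-up work.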

Given this tool, and perhaps a few other more graphic tools, such as LEXIPEDIA and VISUAL THESAURUS, students would have a bank of data that they can access for self-directed language development.

5. Download and install ANTCONC and sample corpora

This will be hands-on in the lab.

  1. Install ANTCONC
  2. Open a corpus
  3. Create a WORDLIST
  4. Generate a KWIC
  5. Declare a REFERENCE CORPUS and conduct a KEYWORD analysis.
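For readers who want to see what the last three steps involve before the lab session, here is a rough Python sketch of a KWIC display and a keyword comparison. It is not ANTCONC's own code: the function names, the two miniature ‘corpora’ and the choice of log-likelihood as the keyness statistic are assumptions made for illustration (log-likelihood is one commonly used keyness measure).

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def kwic(tokens, keyword, width=3):
    """Return (left context, keyword, right context) for every hit of keyword."""
    lines = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    return lines

def log_likelihood(freq_study, size_study, freq_ref, size_ref):
    """Log-likelihood keyness score for one word in two corpora."""
    total = freq_study + freq_ref
    expected_study = size_study * total / (size_study + size_ref)
    expected_ref = size_ref * total / (size_study + size_ref)
    score = 0.0
    if freq_study > 0:
        score += freq_study * math.log(freq_study / expected_study)
    if freq_ref > 0:
        score += freq_ref * math.log(freq_ref / expected_ref)
    return 2 * score

def keywords(study_tokens, ref_tokens):
    """Words relatively more frequent in the study corpus, highest keyness first."""
    study, ref = Counter(study_tokens), Counter(ref_tokens)
    scored = []
    for word, freq in study.items():
        if freq / len(study_tokens) > ref[word] / len(ref_tokens):
            score = log_likelihood(freq, len(study_tokens), ref[word], len(ref_tokens))
            scored.append((word, score))
    return sorted(scored, key=lambda item: item[1], reverse=True)

# Hypothetical miniature corpora standing in for the student and reference corpora.
student = tokenize("the strategy is a new strategy and the troops support the strategy")
reference = tokenize("the cat sat on the mat and the dog sat on the rug by the door")

for left, word, right in kwic(student, "strategy", width=2):
    print(f"{left:>12} [{word}] {right}")
print(keywords(student, reference)[0][0])
```

In this toy example ‘strategy’ comes out as the top keyword because it is frequent in the student text and absent from the reference text, whereas ‘the’ is not a keyword at all even though it is just as frequent.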
