Since the advent of widely available and powerful computers, much of the work done in sociolinguistics and historical linguistics has used the methodology of corpus linguistics. A ‘corpus’ (from the Latin for ‘body’) technically speaking is simply a big collection of texts, and once upon a time would have referred to printed texts. In other contexts we would still talk about ‘the corpus of Old English poetry’, for example, to refer to all surviving Old English poetry, or ‘the corpus of Plutarch’ to refer to everything written by Plutarch. In linguistics, however, ‘corpus’ has come to have a specialised meaning: a digital collection of language data which can be easily searched and quantified. ‘Corpus linguistics’, then, refers to the methods for undertaking research on such corpora.

Here’s a little example of the sort of research I mean. Some verbs in English have two possible past-tense forms: an irregular one ending in -t and a regular one ending in -ed. Two of these are spell (spelled vs. spelt) and spill (spilled vs. spilt). There’s a long-standing tendency in the history of English for irregular verbs to become regular, so we might guess that these two possible forms reflect ongoing change towards the regular form. There’s also a general tendency, during ongoing change, for writing to be more conservative (that is, old-fashioned) than speech. So we have a hypothesis we can test: in a corpus of modern-day written and spoken English, the irregular forms of these two verbs will be more common in writing than in speech. I’ve done a quick search of the British National Corpus (BNC) to test this hypothesis. Here are the results:

           written BNC    spoken BNC
spilt          176            37
spelt          311           107

[Chart of spelled vs. spelt and spilled vs. spilt omitted]


As you can see, our hypothesis turned out to be completely wrong. For both verbs, the irregular form is proportionally much more common in speech than in writing. A chi-squared test tells us that this difference is significant for spill (χ²=56.578, df=1, p<0.001) and spell (χ²=67.143, df=1, p<0.001): there is a real difference between the ways speakers choose which form to use when writing and when speaking for both of these verbs, but the difference is the opposite of what we predicted.
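For the curious, the spill figure can be reproduced in a few lines of Python. This is just a sketch, not the tool I actually used, and note one assumption: the spilled counts (413 written, 6 spoken) are back-calculated from the per-million frequencies quoted later in this post, so treat them as approximate.

```python
# Chi-squared test for spill: regular vs. irregular past tense,
# written vs. spoken BNC. The spilt counts come from the table above;
# the spilled counts (413 and 6) are back-calculated from the
# per-million frequencies given below.

def chi_squared_2x2(a, b, c, d):
    """Pearson's chi-squared for a 2x2 contingency table
    [[a, b], [c, d]], df=1, without Yates' correction."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

#            spilled  spilt
# written      413     176
# spoken         6      37
chi2 = chi_squared_2x2(413, 176, 6, 37)
print(round(chi2, 3))  # 56.578, matching the figure in the text
```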

One apparent advantage of corpus linguistics is that it offers quick ways to approach very open-ended questions without first having to formulate specific hypotheses in the way we did above. Questions like ‘what are the differences between spoken and written English?’ or ‘how has English changed between today and twenty years ago?’ would normally be very hard to answer directly. With corpus linguistics, however, we can quickly process very large amounts of data to trawl for such differences by using ‘keyword analysis’.

Keyword analysis simply looks for words that are more frequent in one corpus than another. Because two corpora are unlikely to be exactly the same size, with keyword analysis we don’t look at raw frequencies of words—it wouldn’t be surprising or interesting that a word was more frequent in a million word corpus than a thousand word corpus. Instead, we look at relative frequencies: effectively, the percentage of all words represented by the word of interest. Another way of thinking of these relative frequencies is as the frequency of the word of interest per thousand words (or per million words, or whatever). Words that have a significantly different relative frequency in one corpus than in another are then called keywords.
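The normalisation itself is simple arithmetic. Here is a minimal sketch using the spilt counts from the first table and the BNC corpus sizes given below (the function name is my own, invented for illustration):

```python
# Frequency per million words: the standard way to normalise raw
# counts so that corpora of different sizes can be compared.

def per_million(count, corpus_size):
    return count * 1_000_000 / corpus_size

# corpus sizes and spilt counts from the BNC figures in this post
WRITTEN_SIZE = 87_953_932
SPOKEN_SIZE = 10_409_851

written_pm = per_million(176, WRITTEN_SIZE)  # roughly 2.0 per million
spoken_pm = per_million(37, SPOKEN_SIZE)     # roughly 3.6 per million
print(round(written_pm, 3), round(spoken_pm, 3))
```

Even though the raw count of spilt is higher in the written corpus, its relative frequency is higher in the spoken one.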
To take one of our examples above, we might predict that spilled would be a keyword for the written BNC compared with the spoken BNC:

                             written BNC     spoken BNC
all words                    87,953,932      10,409,851
proportion spilled           4.696 × 10⁻⁶    5.764 × 10⁻⁷
spilled per million words    4.696           0.576

Here we can see, as expected, that spilled is much more frequent in written texts, occurring more than eight times as often per million words as in spoken data. This difference is statistically significant (χ²=31.079, df=1, p<0.001), making spilled a keyword of the written BNC compared with the spoken BNC. If you do the maths, you’ll find that the other three words we’ve looked at are keywords too, each of one corpus or the other.

As we’re using computers, we can undertake this kind of analysis en masse, and compare the relative frequencies of every distinct word in two corpora. At first glance, this seems like a wonderfully easy way to answer the sorts of general questions we posed above: compare the relative frequencies of all words in two corpora and identify all of those which are significantly more frequent in one than the other; the resulting list of keywords is a list of the differences in language use between the two corpora.
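To give a concrete sense of what such an en-masse comparison involves, here is a toy sketch in Python. The word counts and corpus sizes are invented purely for illustration, and 3.841 is the 5% critical value for chi-squared with one degree of freedom.

```python
# A toy sketch of mass keyword analysis: compare every word's relative
# frequency in two corpora and flag significant differences.

def chi_squared_2x2(a, b, c, d):
    """Pearson's chi-squared for [[a, b], [c, d]], df=1, no correction."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

def keywords(freq_a, size_a, freq_b, size_b, critical=3.841):
    """Return words significantly more frequent, relatively, in corpus A.

    freq_a / freq_b map words to raw counts; size_a / size_b are the
    total word counts of each corpus.
    """
    result = []
    for word in set(freq_a) | set(freq_b):
        a = freq_a.get(word, 0)  # occurrences in corpus A
        c = freq_b.get(word, 0)  # occurrences in corpus B
        chi2 = chi_squared_2x2(a, size_a - a, c, size_b - c)
        # keep only words relatively MORE frequent in A
        if chi2 > critical and a / size_a > c / size_b:
            result.append(word)
    return result

# invented counts: 'spilled' overrepresented in A, 'and' evenly spread
corpus_a = {"spilled": 50, "and": 300}
corpus_b = {"spilled": 2, "and": 310}
print(keywords(corpus_a, 10_000, corpus_b, 10_000))  # ['spilled']
```

In practice you would run this over every distinct word in the two corpora rather than a hand-picked pair.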

As it turns out, however, there are some real problems with this methodology. The first concerns what we take the results of keyword analysis to mean. So far we’ve treated differences in the frequencies of words in different corpora as evidence of language being used differently in those corpora. But this isn’t the only possible explanation for keywords. To take the example of spelled discussed above, we might explain its higher relative frequency in the written BNC as evidence that people tend to choose spelled rather than another option (in this case spelt) more frequently in written language; but we might alternatively explain it by suggesting that people write about spelling more often than they talk about spelling. This would then reflect a difference not in how language was being used, but in what it was being used for.

It turns out that keywords very frequently reflect just these sorts of differences (differences in the topics being written or talked about, the contexts in which the language use takes place, and the social roles occupied by the speakers) rather than differences in the way language is being used. And from a raw list of words generated by en-masse keyword analysis, it’s very hard to know which sort of difference each keyword reflects.

Incidentally, we can get an indication of whether this explanation is correct for spelled by adding together our numbers for spelled and spelt to give the frequency with which any past tense of spell is used. It turns out that this too is a keyword for the written BNC, strongly indicating that people contributing to the BNC did indeed write about spelling more than they spoke about it. But we already had an indication that our original explanation is also correct: spelled is used much more frequently relative to spelt in the written BNC than in the spoken BNC. So it seems that this keyword actually reflects both kinds of explanation.

The other main problem with mass keyword analysis concerns the use of the chi-squared test, but it seemed a little too technical for this post. If you’re interested in reading a more detailed discussion of both of these problems, possible solutions, and a case study, check out my article here.