There have already been a few posts on this blog relating to corpus linguistics. To recap, a corpus is a (usually very sizeable) collection of texts which can be useful for certain types of linguistic analysis.

Admittedly, the school of linguistic thought which follows in the footsteps of Noam Chomsky has tended to place more importance on linguistic competence (what speakers know about what is and isn’t possible in a language) than on performance (what they actually do with language). That is, more importance is given to questions like “Would this sentence be grammatical?” than to questions like “Has anyone ever actually used this sentence?” The fact that nobody has ever previously produced the sentence “There is a cacophonous allosaurus in the echoing corridor under the model of King’s College Chapel which is made entirely of string” is considered of secondary interest to the fact that someone legitimately could produce it if they so desired.

But of course all a corpus can tell us is what sentences people do produce; it reveals nothing directly about what is and isn’t grammatical. (Indeed, a corpus may even include a good number of sentences that everybody would agree are definitely not grammatical!) But that doesn’t mean corpora tell us nothing whatsoever about competence and grammaticality. For example, the fact that the phrase “working away” occurs 51 times in the 100-million-word British National Corpus (BNC) but “arriving away” doesn’t occur at all may suggest that sentences like “Lucy was working away” are grammatical whereas ones like “Lucy was arriving away” aren’t – something which can be confirmed in other ways. In the context of historical linguistics, where we don’t have access to native speakers to ask directly what is and isn’t allowed in their language variety, this sort of frequency analysis becomes a major source of evidence.
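To give a flavour of what this sort of frequency count involves, here is a minimal Python sketch. It does not query the BNC: the three-sentence `toy_corpus` is invented purely for illustration, and the function name `count_phrase` is my own. It simply splits a text into lowercase words and counts how often a given sequence of words turns up.

```python
import re

def count_phrase(text, phrase):
    """Count how often a sequence of words occurs in a text, ignoring case."""
    # Very rough tokenisation: runs of letters (and straight apostrophes).
    tokens = re.findall(r"[a-z']+", text.lower())
    target = phrase.lower().split()
    n = len(target)
    # Slide a window of n words along the text and compare it with the phrase.
    return sum(1 for i in range(len(tokens) - n + 1) if tokens[i:i + n] == target)

# A made-up miniature "corpus" (an assumption for the example, not real data).
toy_corpus = (
    "Lucy was working away in the library. "
    "Chris kept working away at the problem. "
    "The train was arriving late."
)

print(count_phrase(toy_corpus, "working away"))   # 2
print(count_phrase(toy_corpus, "arriving away"))  # 0
```

A zero count in three sentences proves nothing, of course; the point of a large corpus is that an absence across 100 million words starts to look meaningful.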

100 million words may sound like a lot, but actually it can be quite limiting: a lot of stuff that we might be interested in just doesn’t turn up that often. For instance, most people agree that “outswim” (in a sentence like “Lucy outswam Chris in the race back to the beach”) is a real word, but it only occurs once in the entire BNC. Nowadays we can get around this problem to a certain extent by using Google to search the World Wide Web, giving us access to around 50 billion webpages, and many more words than that (“outswim” comes up 223,000 times). But there are problems with using Google, too: it might not give us a very balanced mix of different discourse types (as traditional corpora might aim to do), the Web contains a lot of material produced by non-native speakers of English, and some of the things a search throws up might not even be produced by humans at all!

Another resource provided by Google is the Ngram Viewer. (An “Ngram” is a sequence of N items, such as words: e.g. “this is a short sentence” is a 5-gram.) The Google Ngram Viewer is based on a huge corpus of books going back hundreds of years; I personally have spent far longer than I really should have done playing around on it. It can be used to demonstrate things about how language changes over time, for instance the relative frequency of the word “has” and its older equivalent “hath”:
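For readers who like to see the idea spelled out, the sketch below shows in Python what an Ngram is and how a has/hath comparison could be computed in principle. Everything in it is an assumption made for illustration: the function names `ngrams` and `share_of_has` are my own, the two “samples” and their years are invented, and the share calculated here is a simplification rather than the Ngram Viewer’s own method.

```python
from collections import Counter

def ngrams(words, n):
    """Return every run of n consecutive words as a tuple."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

words = "this is a short sentence".split()
print(ngrams(words, 5))  # the whole sentence is the only 5-gram
print(ngrams(words, 2))  # the four 2-grams it contains

# Invented mini-samples standing in for books from two periods
# (hypothetical data, not taken from the Ngram Viewer).
samples = {
    1650: "he hath a book and she hath a pen".split(),
    1950: "he has a book and she has a pen".split(),
}

def share_of_has(sample):
    """Proportion of 'has' among all occurrences of 'has' and 'hath'."""
    counts = Counter(sample)
    total = counts["has"] + counts["hath"]
    return counts["has"] / total if total else None

for year, sample in samples.items():
    print(year, share_of_has(sample))
```

With real data there would be one such figure per year, and plotting those figures against time gives the kind of curve the Ngram Viewer draws.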

We can also use the Ngram Viewer as a source of information on other cultural trends, through their influence on language. The following graph of the frequency of the word “railway”, for example, seems to correlate with the changing role of railways throughout history: really starting to take off in the 1840s, and then going into decline in the twentieth century with the arrival of the private motor car as a rival form of transport:

Not all of this is necessarily of much direct interest to someone who wants to focus only on linguistic competence, of course. But language isn’t just an abstract thing in our brains; it’s something used by real people in daily life, and the ways in which it is used are as valid an object of study as the make-up of our mental grammars. And corpora can be very useful indeed in telling us more about the ways in which languages are used.