Traditionally, we’ve found out about variation in how people speak (whether between people in different places, of different classes, genders, or whatever) by doing surveys. Dialectologists have travelled around the country interviewing a few people in each town to record how each would say a set of words. Sociolinguists have interviewed wide ranges of people from different educational and social backgrounds and looked for differences in how they speak. These sorts of methods have been very successful, but they’re also very costly. Sending out researchers to do dialectological surveys is an expensive business: many researchers are needed to carry out the long process of getting to know local people and finding some in every locality who are willing to be interviewed, and all those researchers have to be paid for their time and travel. The reality is that there just hasn’t been the funding in humanities and social science research to do this sort of work on a large scale for some years, and so much of our data is rather out of date.

But in the era of the internet and ‘Big Data’ there’s a new way of finding out about language variation: using social media. And so a new generation of research into language variation using language data from social media is just starting to appear.

Using social media data for research is a very different proposition to traditional survey data. Obviously, it’s mostly written rather than spoken data, which immediately puts some limits on the sorts of things it can tell us. More problematically, you can rarely find out as much information about each person in your study as in a traditional survey, and even what information you can find out is unreliable. As an interviewer in person the researcher can ask for more information when needed: ‘You say you’re from York—were you born and brought up there, or did you move around as a child? Were your parents also from York?’ But dealing with online data, the vast majority of the time what you see is all you get. You know what the user chose to write in the ‘Hometown’ box but not necessarily what they meant by it. You know where their phone was when they tweeted—but you don’t know if that’s the place that they live and were brought up, or indeed whether those are the same places.

Nevertheless, there is one big advantage to this sort of data: there’s lots of it. And a big enough quantity of data can often make up for low quality data, if we’re asking the right questions. Because of the uncertainties about who’s really behind the keyboard, we can rarely use social media to make definitive statements about how much a given group of people speaks or writes in a certain way (that would be statements like ‘people under 25 from London use the word order “give it me” 50% of the time and “give me it” 50% of the time’)—but we can make comparative statements (like ‘people from London use the word order “give it me” twice as often as people from Lincolnshire’).
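To make the difference between the two kinds of statement concrete, here is a minimal sketch in Python. The counts are invented purely for illustration (they are not real Twitter figures), and the function names are my own. The absolute share for each region is the shaky number; the ratio between regions is the comparative statement.

```python
from collections import Counter


def variant_share(counts, variant):
    """Share of all observed tokens in a region that use `variant`."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no observations for this region")
    return counts[variant] / total


def rate_ratio(counts_a, counts_b, variant):
    """How many times more often region A uses `variant` than region B."""
    share_b = variant_share(counts_b, variant)
    if share_b == 0:
        raise ValueError("variant never observed in the comparison region")
    return variant_share(counts_a, variant) / share_b


# Invented counts, for illustration only.
london = Counter({"give it me": 40, "give me it": 160})
lincolnshire = Counter({"give it me": 10, "give me it": 90})

# Comparative statement: London's rate relative to Lincolnshire's.
ratio = rate_ratio(london, lincolnshire, "give it me")
print(f"London uses 'give it me' {ratio:.1f} times as often as Lincolnshire")
```

The individual shares depend heavily on who happens to be tweeting from each place, which is exactly what we can’t check. The ratio is only as trustworthy as the assumption that those uncertainties affect both regions in roughly the same way.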

To exemplify what sort of work is being done with social media at the moment, I’ll take you through a couple of interesting recent papers (references for both are found at the bottom of the post). Gonçalves & Sánchez (2014) gathered around 50,000,000 tweets written in Spanish and associated with a GPS location over two years. They then tracked lexical variation (variation in the words people choose to use to describe a given concept) to see if they could find differences in people’s language use associated with different places. The map below is reproduced from their paper, showing the different words used for ‘car’. As you can see, five distinct areas emerge, split between three words: people in North America and northern South America largely use ‘carro’; people in Central America and in Spain usually use ‘coche’; and people in the southern half of South America generally use ‘auto’.


[Figure: map of the words used for ‘car’ in Spanish-language tweets, reproduced from Gonçalves & Sánchez (2014)]
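A map of this kind can be approximated by binning geotagged tweets into grid cells and finding the most common variant in each cell. The sketch below is my own illustration of that idea, not the procedure from the paper: the one-degree grid, the whitespace tokenisation and the sample tweets are all assumptions.

```python
import math
from collections import Counter, defaultdict

CAR_VARIANTS = ("carro", "coche", "auto")


def grid_cell(lat, lon, size=1.0):
    """Map a coordinate to a (row, column) grid cell of `size` degrees."""
    return (math.floor(lat / size), math.floor(lon / size))


def dominant_variant_by_cell(tweets, variants=CAR_VARIANTS, size=1.0):
    """Return the most frequent variant for every cell that has any hits."""
    counts = defaultdict(Counter)
    for lat, lon, text in tweets:
        tokens = text.lower().split()
        for variant in variants:
            hits = tokens.count(variant)
            if hits:
                counts[grid_cell(lat, lon, size)][variant] += hits
    return {cell: c.most_common(1)[0][0] for cell, c in counts.items()}


# Invented tweets (latitude, longitude, text), for illustration only.
tweets = [
    (40.4, -3.7, "mi coche nuevo"),
    (40.6, -3.2, "el coche está roto"),
    (40.1, -3.9, "vendo auto usado"),
    (-34.6, -58.4, "lavando el auto"),
    (-34.2, -58.9, "mi auto no arranca"),
    (4.7, -74.1, "compré un carro"),
]

dominant = dominant_variant_by_cell(tweets)
for cell, variant in sorted(dominant.items()):
    print(cell, variant)
```

A real analysis would need proper tokenisation and some way of handling words with other meanings (‘auto’ is also a prefix, for instance), but the basic counting step is no more complicated than this.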

They then took results like this for many words and used machine learning algorithms (specifically K-means clustering) to investigate whether there were identifiable groups of dialects. The result was very surprising. Instead of showing big, regional dialects associated with contiguous areas on the map, the algorithm identified just two dialects: one associated with the big urban areas and one with everywhere else. Gonçalves & Sánchez write: “Superdialect α is utilized by speakers in main American and Spanish cities and corresponds to an international variety with a strongly urban component while superdialect β is comprised mostly of rural areas and small towns” (6). They see this as evidence for the homogenising effect of globalisation on language.
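For readers who haven’t met K-means before, here is a minimal sketch of the algorithm in plain Python. It illustrates the general technique only, not the authors’ actual pipeline: the source doesn’t say how they built their feature vectors, so the three-number vectors below (the share of ‘carro’, ‘coche’ and ‘auto’ at each location) and the location names are invented assumptions.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(points)
    return tuple(sum(column) / n for column in zip(*points))


def kmeans(points, k, iterations=100, seed=0):
    """Lloyd's algorithm: returns (labels, centroids)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k of the data points
    labels = None
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:  # nothing moved, so we have converged
            break
        labels = new_labels
        # Update step: each centroid moves to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = mean(members)
    return labels, centroids


# Invented usage shares (carro, coche, auto) per location.
names = ["city_A", "city_B", "city_C", "town_D", "town_E", "town_F"]
points = [
    (0.35, 0.35, 0.30),
    (0.30, 0.40, 0.30),
    (0.33, 0.32, 0.35),
    (0.95, 0.03, 0.02),
    (0.90, 0.05, 0.05),
    (0.92, 0.04, 0.04),
]

labels, centroids = kmeans(points, k=2)
for name, label in zip(names, labels):
    print(name, "-> cluster", label)
```

Note that the algorithm is never told which locations are cities: it only sees the usage vectors. In this toy data the ‘cities’ mix all three words and the ‘towns’ strongly prefer one, so the two groups fall out of the numbers alone, which is the same logic by which a clustering can pick out an urban group without any geographic input.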

Eisenstein et al. (2014) focused not on the static facts of whole dialects but on fast-paced processes of change associated with new words entering the language. They collected a corpus of 107,000,000 tweets in English from 2009 to 2012 and looked only at words whose frequencies changed significantly over time. Below is an example, reproduced from their paper. It shows the expansion of the term ‘ion’ (short for ‘I don’t’, as in ‘ion even care’) over a 150-week period.

[Figure: the spread of the term ‘ion’ over a 150-week period, reproduced from Eisenstein et al. (2014)]

One interesting finding which is immediately clear from such figures is that even for these sorts of words, which are fundamentally written and exist almost exclusively online, geography is relevant. On the face of it, we might expect words on the internet to spread randomly across space, as most of what is posted is publicly visible regardless of where you are. But the reality is that words spread mainly through social networks, and these exist in real space, even if we’re watching them in action online.

Eisenstein et al. go on to examine the most common routes of linguistic diffusion, mapping the paths most often taken by new words between cities, and then investigate what factors favour such linguistic pathways. They found that racial demographics were crucially important: linguistic innovations were more likely to be transmitted between cities with similar proportions of African American citizens and Hispanic citizens. Small geographic distance, similar proportions of residents living in urbanised areas, and similar median incomes also facilitated linguistic influence. Population also had an effect: larger settlements were more likely to exert influence than be subject to it.
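To give a feel for what ‘factors’ means here, the sketch below turns a pair of cities into a handful of numbers of the kind listed above: distance, differences in demographics, and relative population. This is my own hedged illustration, not the authors’ model. The `City` fields, the `pair_features` function and the two example cities are hypothetical, and the figures are invented; only the haversine formula for great-circle distance is standard.

```python
import math
from dataclasses import dataclass


@dataclass
class City:
    name: str
    lat: float
    lon: float
    pct_african_american: float
    pct_hispanic: float
    pct_urban: float
    median_income: float
    population: int


def haversine_km(a, b):
    """Great-circle distance in km, assuming a spherical Earth."""
    radius_km = 6371.0
    phi1, phi2 = math.radians(a.lat), math.radians(b.lat)
    dphi = phi2 - phi1
    dlam = math.radians(b.lon - a.lon)
    h = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(h))


def pair_features(sender, receiver):
    """Features for a directed pair; small differences mean similar cities."""
    return {
        "distance_km": haversine_km(sender, receiver),
        "diff_pct_african_american": abs(
            sender.pct_african_american - receiver.pct_african_american),
        "diff_pct_hispanic": abs(sender.pct_hispanic - receiver.pct_hispanic),
        "diff_pct_urban": abs(sender.pct_urban - receiver.pct_urban),
        "diff_median_income": abs(
            sender.median_income - receiver.median_income),
        # Positive when the sender is the larger settlement.
        "log_population_ratio": math.log(
            sender.population / receiver.population),
    }


# Two hypothetical cities with invented figures.
big = City("Big City", 0.0, 0.0, 30.0, 10.0, 95.0, 52000.0, 2_000_000)
small = City("Small City", 0.0, 1.0, 28.0, 12.0, 80.0, 47000.0, 200_000)

for key, value in pair_features(big, small).items():
    print(f"{key}: {value:.2f}")
```

Features like these could then be fed into a statistical model to ask which of them predict that a word travels from one city to the other. Only the population feature is directional, which matches the finding that influence tends to flow from larger places to smaller ones.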

These two studies are just a small intimation of the potential for linguistic research with social media, but hopefully you can start to see what an exciting area this promises to be!

Eisenstein, Jacob, Brendan O’Connor, Noah A. Smith & Eric P. Xing. 2014. Diffusion of Lexical Change in Social Media. PLoS ONE 9(11). e113114. doi:10.1371/journal.pone.
Gonçalves, Bruno & David Sánchez. 2014. Crowdsourcing Dialect Characterization through Twitter. PLoS ONE 9(11). e112074. doi:10.1371/journal.pone.0112074.