Twitter has increasingly restricted access to the largest organized database of modern language in the world, despite its immense research value. It's a tragedy.
There is a lot to be learned from tweets, if Twitter would let us.
Celebrities, politicians, world leaders, news organizations, millions of ordinary people, and even the occasional cat use Twitter every day to talk about their breakfast, natural disasters, political events, and moments of global interest like the Oscars and the Super Bowl. With hundreds of millions of tweets posted every day, Twitter has become an important historical and cultural record and an immensely useful resource for researchers of politics, history, literature, language, and anything else you can imagine. Or it could be. In recent years Twitter has made changes to its service that severely limit its usefulness to researchers.
The difficulty with most research is getting a good enough dataset, and in general, the bigger the better. Corpus linguistics, for example, uses giant databases of hundreds of millions of words, painstakingly organized and annotated. The biggest corpora exceed 450 million words; with a reported average of half a billion tweets per day, at around 15 words per tweet, that much language passes through Twitter in under two hours.
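To check that figure against the article's own numbers (both of which are reported averages, not measurements), here is a quick sketch in Python:

    # Rough arithmetic: how long does Twitter take to produce as many words
    # as the largest corpora contain? Figures are the averages quoted above.
    TWEETS_PER_DAY = 500_000_000
    WORDS_PER_TWEET = 15
    CORPUS_WORDS = 450_000_000

    words_per_day = TWEETS_PER_DAY * WORDS_PER_TWEET   # 7.5 billion words/day
    hours_to_match = CORPUS_WORDS / words_per_day * 24

    print(f"A {CORPUS_WORDS:,}-word corpus every {hours_to_match:.1f} hours")
    # -> A 450,000,000-word corpus every 1.4 hours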
Twitter has already shown its value as a dataset, for hobbyists and academics alike. Edwin Chen, a data scientist at Twitter, mapped the use of "coke," "soda," and "pop" to refer to soft drinks in tweets and got results that largely concur with non-Twitter research, going some way toward confirming Twitter's value as a research tool. On the academic side, as the New York Times reported in October last year, researchers are using Twitter to study the sentiment of tweets relating to major events like the Arab Spring, and the way emotions relate to the rhythms of daily life. More recently, researchers announced findings on using machine learning and natural language processing to predict whether the information contained in a tweet is true or false. But Twitter's evolution into a media company is making it much harder to collect good data, reducing its utility as a research tool.
The Text REtrieval Conference (TREC), co-sponsored by the National Institute of Standards and Technology and the U.S. Department of Defense, has been supporting research in information retrieval since 1992. Past tracks have covered contextual search suggestions, helping lawyers find relevant information in legal databases, and making it easier for doctors and nurses to search medical records. Since 2011, TREC has also run a microblog track, studying search behaviors on Twitter.
At its 2011 conference, TREC provided 58 organizations with a dataset of around 16 million tweets from a two-week period that included the Egyptian Revolution and the Super Bowl. Or rather, it would have liked to. What attendees actually got was a database of 16 million tweet identifiers and a set of tools for fetching the corresponding tweets from Twitter themselves. Ian Soboroff, one of the researchers leading the microblog track, told me that before TREC offered the dataset, people were downloading their own collections of tweets but couldn't share them. "Twitter actually contacted a number of researchers trying to share their tweet collections and told them to stop," he said. The problem was a change made to Twitter's API Terms of Service early in 2011: a new clause forbidding redistribution of tweets.
The change had an immediate effect on a variety of startups and organizations. TwapperKeeper, a service that let users export archives of tweets matching certain keywords or hashtags, found itself in violation of the new clause and had to shut down. The same problem hit 140kit, one of the first groups researching Twitter's political and cultural influence, which was no longer allowed to share its datasets with interested users. For academics, it meant they could no longer collect and share datasets of tweets for analysis.
The change was a major blow to the prospect of getting good research out of Twitter data. A shared, common resource lets multiple observers assess the integrity of the collected data, taking the question of bad data off the table, and it makes research reproducible, since other parties can confirm a study's results. But Twitter's new terms put a stop to that.
TREC worked around this by offering its participants only tweet identifiers, an approach Soboroff is disappointed others haven't taken. "We need more datasets in the wild before we can know what makes a good dataset and what makes a bad one, for Twitter researchers." But it's an imperfect solution, for a number of reasons, not least of which is data integrity. Twitter users often make their accounts private or delete old tweets, so the identifiers for those tweets no longer return anything. And, as Soboroff explained, some participants had trouble downloading the tweets and using the tools provided, which entails "cloning a git repo, getting it to build, running the crawler, storing the data, and analyzing the data at that volume." In practice it hasn't been a huge issue for TREC's microblog track, which Soboroff says had fairly high levels of participation this year, but the organizers had to spend more time than they would have in better circumstances helping participants with technical issues. The track's mailing list is replete with messages from participants having technical difficulties.
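The hydration step at the heart of that workflow can be sketched in a few lines of Python. This is a minimal illustration rather than TREC's actual tooling: it assumes Twitter's v1.1 statuses/show endpoint, the requests and requests-oauthlib libraries, and placeholder credentials, and it naively sleeps out each rate window rather than tracking elapsed time.

    import time
    import requests
    from requests_oauthlib import OAuth1  # assumed dependency; any OAuth 1.0a client works

    # Placeholder credentials; you would register your own application with Twitter.
    auth = OAuth1("CONSUMER_KEY", "CONSUMER_SECRET", "ACCESS_TOKEN", "ACCESS_SECRET")

    SHOW_URL = "https://api.twitter.com/1.1/statuses/show.json"
    CALLS_PER_WINDOW = 180   # the documented per-user limit for this endpoint
    WINDOW_SECONDS = 15 * 60

    def hydrate(tweet_ids):
        """Turn a list of tweet IDs back into full tweets, one API call per ID."""
        tweets, missing = [], []
        for i, tweet_id in enumerate(tweet_ids):
            if i and i % CALLS_PER_WINDOW == 0:
                time.sleep(WINDOW_SECONDS)  # crudely wait out the rate-limit window
            resp = requests.get(SHOW_URL, params={"id": tweet_id}, auth=auth)
            if resp.status_code == 200:
                tweets.append(resp.json())
            else:
                # Deleted or protected tweets come back as errors; this is the
                # data-integrity gap described above.
                missing.append(tweet_id)
        return tweets, missing

Even this toy version surfaces both problems at once: the missing list grows as users delete or protect their tweets, and the rate limit dictates the pace of the entire job.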
TREC's tools for downloading tweets also have to allow for Twitter's strict rate-limiting. TREC attendees trying to download their dataset of 16 million tweets are subject to a maximum of 180 API calls every 15 minutes. Under those limits, collecting all 16 million tweets takes anywhere from weeks to years of continuous downloading, depending on how many tweets each call returns. Researchers trying to gather their own datasets would instead use Twitter's streaming APIs, which return a real-time feed of tweets as they are posted, but the publicly available streaming API is limited to a small fraction of all tweets to the service, around 1 percent. There is an API that carries every tweet posted to the service, the "firehose," but Twitter strongly limits access to it and charges a fee well outside the budget of most academic research. Gnip, Twitter's reseller for firehose access, charges $0.10 per thousand tweets; it would cost TREC, whose participants have no funding at all, $1,600 to get its 16 million tweets that way.
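The back-of-the-envelope arithmetic, using the limits and prices quoted above; the one-tweet-per-call case matches the sketch earlier, while the 100-tweets-per-call case assumes a hypothetical batching endpoint and is shown only for comparison:

    TWEETS = 16_000_000
    CALLS_PER_WINDOW = 180        # the limit quoted above
    WINDOWS_PER_DAY = 4 * 24      # 15-minute windows in a day

    def days_to_download(tweets_per_call):
        calls_needed = TWEETS / tweets_per_call
        return calls_needed / (CALLS_PER_WINDOW * WINDOWS_PER_DAY)

    print(f"{days_to_download(1):,.0f} days at 1 tweet per call")      # ~926 days
    print(f"{days_to_download(100):,.1f} days at 100 tweets per call") # ~9.3 days

    # The firehose route, at Gnip's quoted $0.10 per thousand tweets:
    print(f"${TWEETS / 1_000 * 0.10:,.0f}")                            # $1,600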
It's not clear that Twitter is interested in doing anything to mitigate its hostility to research, but it is absolutely in its best interest to do so: Twitter stands to greatly improve its service, and even profit, from the findings of researchers. Take the aforementioned study on determining the veracity of tweets. It's easy to imagine ways that could be put to use by Twitter, particularly in its recent efforts to surface good content in its Discover tab. But by preventing researchers from creating a shared corpus of tweets, Twitter is cutting off further, potentially more exciting research.
So what could Twitter do, without hurting its all-important bottom line? It could bring research efforts in-house, as it almost certainly has to some degree, but that is expensive and time-consuming, and it forfeits the benefit of crowdsourcing: there are researchers waiting in the wings to study things Twitter's team might never dream of. It could build its own datasets and allow limited access to them, but researchers would have no control over the data they were getting, and Twitter would face the even more expensive task of maintaining those datasets indefinitely, a responsibility it is likely not interested in taking on. It would be much better for everyone involved, then, to let researchers pull their own data and manage it themselves.
The only realistic option is a change that gives researchers more freedom with the data. Is there any good reason Twitter can't give academics better access to its data, perhaps with relaxed rate limits for collecting it? It could be as simple as granting a single educational institution firehose access and permission to host a shared corpus of tweets. The Corpus of Contemporary American English, the largest widely available corpus of English, is managed solely by Brigham Young University; anybody who wants to use it can access an interface to the data, but not the raw data itself. With Twitter's growing profile and its increasing wealth of valuable data, there is almost certainly a linguistics department somewhere willing to take on the task of maintaining a corpus of tweets.