Library of Congress almost done archiving 170 billion tweets
In a new statement released by the largest library in the world, Washington’s premier research center says it is almost done with the first steps in a project involving a massive trove of micromessages sent over Twitter going all the way back to when the site first got off the ground in 2006 [.pdf].
In April 2010, Twitter announced that every public tweet published since its inception would be added to the Library of Congress so that the United States’ top researchers could have access to a then-untapped form of correspondence that was thought to be on the way to becoming as commonplace as snail mail. Today, the Library says it has almost reached that goal.
“The Library’s first objectives were to acquire and preserve the 2006-10 archive; to establish a secure, sustainable process for receiving and preserving a daily, ongoing stream of tweets through the present day; and to create a structure for organizing the entire archive by date. This month, all those objectives will be completed,” the Library announced last week.
Of course, newer users of Twitter won’t be forgotten either. Once the Library secured a method of collecting all archived tweets, it couldn’t just end there. In February 2011 it began receiving “current” tweets sent after the 2010 cutoff, and as of last month it estimates that there are now roughly 170 billion public tweets in its archives.
“The volume of tweets the Library receives each day has grown from 140 million beginning in February, 2011 to nearly half a billion tweets each day as of October, 2012,” the library claims.
As one can imagine, such a spectacular amount of information isn’t exactly easy to make sense of. The Library says they are sitting on around 133.2 terabytes of tweets at the moment — so many messages that running a search for a single keyword can take as long as 24 hours right now.
“This is an inadequate situation in which to begin offering access to researchers, as it so severely limits the number of possible searches,” the Library explains. “The Library’s focus now is on confronting and working around the technology challenges to making the archive accessible to researchers and policymakers in a comprehensive, useful way.”
“It is clear that technology to allow for scholarship access to large data sets is lagging behind technology for creating and distributing such data. Even the private sector has not yet implemented cost-effective commercial solutions because of the complexity and resource requirements of such a task.”
The Library says it is now pursuing partnerships with the private sector “to allow some limited access capability in our reading rooms,” and Gawker reports that it has already received requests from over 400 researchers who want to feast their eyes on the billions upon billions of tweets. Don’t think for a minute that this means anyone is invited over to comb through the collection, though. Under its contract with Twitter, the Library can only allow access to public tweets sent more than six months ago, and only to “bona fide researchers” who are prohibited from conducting commercial research at the Library.
Gnip, a Colorado-based social media enterprise company picked by Twitter to handle moving the tweets from Silicon Valley to the nation’s capital, tells Talking Points Memo that it thinks the final product will be amazing in terms of what it can do for modern researchers.
“Gnip believes Twitter represents the largest archive of human behavior to have ever existed. We’re thrilled that we’re able to partner with the Library of Congress to help make this data available to researchers. At Gnip, we believe that the value from social data is limitless and often get inquiries from academic researchers looking to analyze social data from Twitter. We’re excited by the progress the Library of Congress has made so far,” the company states.
Twitter says that they will be completely caught up on collecting all older tweets sometime during January.