Frequently we turn to the web to see a picture of ourselves. Our news media present stories and opinions from Twitter and Reddit. Companies look at trending topics and what's popular on the Internet, and try to base marketing strategies around them. Social and political movements are born and organized on the web. The social web has become integrated into our everyday lives, and our lives are increasingly reflected on it.
Perhaps unsurprisingly, academic researchers have seized on this unique opportunity to study people on a massive scale. The sheer volume of content posted every day to the web provides an opportunity to study social structure, people's attitudes and habits, what people are doing and buying, and a multitude of other interesting topics. A well-known example of this type of research is Google Flu Trends. Every year, Google monitors what people search for, mining for queries -- such as flu symptoms or searches for cold remedies -- that might suggest that the searcher has influenza or a similar illness. They combine this information with their best guess of where the searcher lives, and are able to produce a geographic breakdown of where and how quickly the flu is spreading. This information is a helpful supplement to traditionally gathered data about flu incidence reported by physicians -- a process that takes a couple of weeks, whereas Google Flu Trends can detect outbreaks in near real time, allowing health resources to be allocated more proactively.
Less well known is work on economic and political prediction based on web data. In the last few years, there has been a spate of academic papers about predicting the stock market, movie box office take or political elections based on weblogs, Twitter and other social media. This work is broadly similar to the Google Flu Trends work. Researchers pick a phenomenon they would like to study using web analysis (e.g. a stock price or an election). They scour social media data, looking for mentions of that topic (e.g. a company name or political candidates' names). Occasionally, this work will also analyze the context of these mentions, such as whether a web user is describing the company or candidate as great or terrible. They will tally up the mentions and make an estimate of how things will occur in real life (e.g. will a stock go up or down; will a candidate win or lose).
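The tallying step described above can be sketched in a few lines of Python. This is only an illustration, not the method of any particular study: the function name mention_shares, the sample posts, and the candidate names Smith and Jones are all invented here, and the sketch uses plain substring matching rather than any analysis of context.

```python
from collections import Counter


def mention_shares(posts, names):
    """Return each name's share of all posts that mention any of the names."""
    counts = Counter()
    for post in posts:
        text = post.lower()
        for name in names:
            # Naive substring match; a real system would need to handle
            # nicknames, misspellings, and ambiguous names.
            if name.lower() in text:
                counts[name] += 1
    total = sum(counts.values())
    return {name: (counts[name] / total if total else 0.0) for name in names}


# Hypothetical posts, made up for this example.
example_posts = [
    "I like Smith",
    "Smith is terrible",
    "Jones for mayor",
]
shares = mention_shares(example_posts, ["Smith", "Jones"])
print(shares)
```

Note that the second post counts toward Smith's share even though it is negative, which is exactly why some of this work also analyzes the context of each mention.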
This is where my work fits in. There is a frequently acknowledged shortcoming in this domain of research: the issue of bias. In this context, bias means that social data is systematically unreflective of real-life society. This is the case in a couple of ways.
The first is that people have varying motivations for what they write about. For example, in the case of election prediction, people who write about candidates are likely the most partisan, enthusiastic members of society, and are writing because of their excitement or in order to persuade and influence other voters. While having very engaged supporters may be an indicator of a campaign's success, at best it is an indirect indicator, and at worst can be gamed to make a candidate appear more popular. More useful than users who are highly engaged on one topic are users who are highly prolific, but across many topics.
In my lab, we have been collecting personal stories people write on their weblogs. These are narratives describing a specific series of events in the past, spanning minutes, hours, or days, where the storyteller or a close associate is a participant. We have developed a system that automatically identifies personal stories posted to weblogs, and since 2010 have collected over 33 million of these stories. In the course of analyzing this data, I noticed a few thousand of the most prolific web storytellers, who post about their lives almost every day, some of them going back as far as nine years. Intrigued by these users' dedication, I conducted interviews with 10 of them, seeking answers to three questions: What motivates these people to post so frequently and publicly about their personal lives? To what degree do these people embellish their stories to make them more interesting than reality? What expectations do these authors have about their readers, and what are the ethical implications for researchers like us who analyze their posts?
The bloggers I spoke to were nearly unanimous in their perspectives on these questions. Most commonly, they started their blogs as a writing project or as a way to keep a record of everyday life. Some were undertaking interesting projects, or recently had children, and the blog provided a way to broadcast events in their lives without having to have the same conversation with many different friends and family members. While many began writing for a close audience of a few friends and family, posting publicly gave them an opportunity to share with a larger audience than they would have been able to reach otherwise. These authors typically did not start with the expectation that they would post so frequently over such a long time period, but have maintained a high throughput of posts out of habit or a sense of obligation to their readers.
While a few reported receiving theatre tickets or similar small gifts from organizations hoping they would promote their products, most never received any compensation for their activities, and those who did eventually stopped accepting it, as it was neither their primary motivation nor worth their while. These bloggers indicated that they are generally truthful when they write. The most common untruths were by omission, either to protect their privacy or to avoid hurting or offending readers they know personally. Finally, while most of these authors never expected to have an audience beyond a handful of people they knew personally, most understand that their public posting means that anyone can read or analyze the content they post, including researchers. These answers, and the strong consensus conveyed, indicate that this group of bloggers -- prolific storytellers -- is particularly useful for social analysis. Their general motivation for blogging means they are less likely to bias subsequent analysis, they are comfortable with researchers analyzing their content, and that content is mostly reflective of the reality of their lives.
In conducting this study I undertook an additional, unusual step. I met face-to-face with many of the people I interviewed, and a professional film crew came along and filmed these interviews and edited them into a documentary short. This film -- titled Friends You Haven't Met Yet -- was written with the intention of presenting our findings, and more generally the field of social media research, to an audience beyond a handful of academics. It will premiere at the Dances With Film festival in Hollywood, CA.
While taking care in choosing which web users to include in a social analysis can mitigate one bias, there remains the concern of demographic bias. Typical web users can be very different from non-web users. For instance, studies have shown that Twitter users in the U.S. are younger, more male, and disproportionately live in more populated areas than the general population. This can be a serious problem when trying to characterize the beliefs and actions of a population, especially for characteristics that are highly dependent on demographics -- such as how people vote. It is vital to take these discrepancies into account.
Fortunately, techniques to correct for these kinds of biases are well studied. Researchers and scientists who depend on survey data have been correcting for unrepresentative samples for decades. They use techniques to re-weight responses, so under-sampled groups can be weighted up and oversampled groups can be weighted down. For instance, if you know that 13.7 percent of Americans are 65 years or older, but only 11 percent of Americans who responded to your survey are 65 years or older, then you can adjust the survey by giving the 65+ respondents more sway in your final estimate for the population.
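To make the arithmetic concrete, here is a minimal Python sketch of this kind of re-weighting, using the 13.7 percent and 11 percent figures from the example. The function names, the two-group split, and the made-up survey responses are assumptions for illustration only; real survey weighting usually adjusts for several characteristics at once.

```python
def group_weights(population_share, sample_share):
    """Weight for each group = its population share / its sample share."""
    return {g: population_share[g] / sample_share[g] for g in population_share}


def weighted_mean(responses, weights):
    """Average the responses, counting each one by its group's weight."""
    total_weight = sum(weights[group] for group, _ in responses)
    weighted_sum = sum(weights[group] * value for group, value in responses)
    return weighted_sum / total_weight


# Shares taken from the example in the text.
population_share = {"65+": 0.137, "under 65": 0.863}
sample_share = {"65+": 0.11, "under 65": 0.89}
weights = group_weights(population_share, sample_share)

# Hypothetical survey of 100 people: every 65+ respondent answers 1 ("yes"),
# everyone else answers 0 ("no"). These answers are invented for the example.
responses = [("65+", 1.0)] * 11 + [("under 65", 0.0)] * 89

unweighted = sum(value for _, value in responses) / len(responses)
weighted = weighted_mean(responses, weights)
print(weights["65+"], unweighted, weighted)
```

In this toy survey the unweighted "yes" rate is 11 percent, simply because 65+ respondents are under-sampled. Each 65+ response gets a weight of roughly 1.25 and each under-65 response a weight slightly below 1, which moves the estimate to 13.7 percent, the value it would take if the sample matched the population.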
Of course, traditional surveys ask the respondents about the respondents' background, as well as questions about the topic they are trying to measure. Web data does not provide that luxury. Most platforms, especially blogging platforms, do not provide a space for users to fill in a basic demographic background. And for those platforms that do, users typically leave them blank, or provide incomplete or false information. What are needed are techniques to reliably determine a user's background from the text they write.
I am working on developing techniques to accomplish this. A beneficial aspect of the data I am working with is the sheer amount of content these web users have provided. With hundreds of posts to consult for each person, high-confidence inferences about the bloggers' age, race, gender, and other characteristics can be made. Already, our system can determine a blogger's hometown with a high degree of accuracy. It works by finding locations mentioned by these bloggers, and then assessing whether each location is where the blogger lives, or whether it is a place that the blogger has visited, wants to visit, or is simply discussing. This assessment is done by looking at the context in which the location is mentioned. Locations mentioned in the context of attending city council meetings or visiting parks are likely to be near where the blogger lives, while locations mentioned when talking about vacations are unlikely to be. Similar techniques are under development to determine bloggers' occupations, marital status, age, gender and other categories. The same approach can also be used to extract characteristics of the population that traditional survey scientists might be interested in, such as outlook on the economy or how people might vote in an election.
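The idea of judging a location by its context can be illustrated with a toy Python sketch. This is not our actual system: the cue-word lists, the function names, and the sample sentences are all invented for this example, and it assumes the location names have already been found in the text. It simply adds a point when a location appears alongside a "home" cue and subtracts one for each "travel" cue.

```python
import re
from collections import defaultdict

# Hypothetical cue words, chosen by hand for this illustration.
HOME_CUES = {"council", "park", "neighborhood", "commute"}
TRAVEL_CUES = {"vacation", "trip", "flight", "hotel", "airport"}


def score_mention(sentence):
    """Positive if the context suggests home, negative if it suggests travel."""
    words = set(re.findall(r"[a-z]+", sentence.lower()))
    return len(words & HOME_CUES) - len(words & TRAVEL_CUES)


def guess_hometown(mentions):
    """mentions: list of (location, sentence) pairs for one blogger.

    Returns the best-scoring location, or None when no location has a
    positive score.
    """
    scores = defaultdict(int)
    for location, sentence in mentions:
        scores[location] += score_mention(sentence)
    if not scores:
        return None
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


# Made-up mentions from a hypothetical blogger.
example_mentions = [
    ("Pasadena", "Went to the city council meeting in Pasadena last night."),
    ("Pasadena", "Took the kids to the park in Pasadena."),
    ("Maui", "Our vacation in Maui was wonderful, and the hotel was right on the beach."),
]
hometown = guess_hometown(example_mentions)
print(hometown)
```

With hundreds of posts per blogger, even weak cues like these can accumulate into a confident guess, which is why prolific authors are so well suited to this kind of inference.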
The overall vision is a panel composed of the most prolific storytellers of the web, who are used to make inferences about the population at large. The resulting estimates are unlikely to match the precision and accuracy of traditionally gathered survey data. However, they will provide an opportunity to quickly test hypotheses without putting expensive and time-consuming surveys in the field. By combining carefully selected data, algorithms that can determine important characteristics of people from text they write about their everyday lives, and techniques to correct for the inevitable demographic differences between web users and non-web users, we can see a clear reflection of ourselves on the web.
The HuffPost College Thesis Project gives students a chance to share with a wide audience the fruit of their hard academic work. The project is launching with about a dozen partner schools, which comprise students from public and private, two- and four-year colleges.