Huffpost TED Weekends
Big Data Can Get Very Personal
Benjamin K. Bergen

Little People Meet Big Data

One of the great mysteries of the human mind is how a child learns her first language. It's easy to overlook exactly how monumental a feat this is. After all, every typically developing child quickly masters at least one language. You did it, and you probably never noticed making any particular effort. But behind a veil of apparent simplicity, learning language is actually among the most complex things we do. Consider that children learn language at breakneck speed -- between age 1, when their first words appear, and the day they graduate high school, they learn an average of eight words per day. They also learn subtle statistical details of use, for instance that you can say "I gave him the book" but not "I contributed him the book." And they learn almost all of this without explicit instruction. The long and the short of it is that learning language is so difficult that no other animal or piece of software even comes close to the average human second-grader. And the mystery is: How do we do it?

The answer has been a matter of serious debate for centuries. Some people -- those who lean toward "nature" to explain the human mind -- reason that there must be something innate and specific that shepherds the process along. Naturally, a child can't be born knowing his native language, because it could end up being Chinese or English or Swahili or any other language. But if there are features that all languages share, then it might help children to come to the task with at least some expectations about what any human language can be like. For instance, they might know that their language will place adjectives either before nouns, as English does, or after nouns, as French does (Moulin Rouge literally means "Windmill Red"). That might give the child a leg up on learning whichever language he ends up being exposed to.

Others -- those who lean toward "nurture" -- have argued that children are born knowing nothing specifically about language, and that they learn English or Swahili or Chinese or whatever they're exposed to from the bottom up, using sophisticated learning abilities that we, and not our tongue-tied animal relatives, possess. These abilities, the argument goes, aren't specific to language, but are the same sorts of abilities to reason and generalize that allow us, unlike other animals, to learn math, music, and logic. We're special, but language isn't.

The stakes in the debate are quite high. Philosophically, better understanding our biological endowment would paint part of the picture of what is unique about humans -- what makes us humans and not chimpanzees. Medically, knowing what brain resources children bring to bear on learning language helps us diagnose and treat children with developmental language disorders.

But it's hard to find data that can tease nature apart from nurture. Because we don't have a record of everything a child has ever heard, we don't know when he is merely imitating or modifying something previously heard (nurture) and when he is saying something he could only know if he were born with a knack for language (nature). In order to tell, what you'd really need is the full record of everything a child has ever said and heard. Exactly how different is this thing he just said from things he previously heard? In what ways is it different? Could it possibly have been learned?

Enter the Roys. When they wired their house, from top to bottom, to chronicle their son's acquisition of English, they put in place an infrastructure to collect exactly the kind of data that would open a window onto how children learn language. This is a monumental leap forward, and the upside potential is massive. But because of technical challenges, we've barely scratched the surface of what the data can tell us.

One challenge is the sheer volume of the data: 250,000 hours of it. To use it -- to be able to automatically search through it -- you'd have to transcribe every second. Top stenographers who do courtroom or closed-caption transcriptions work almost as fast as people talk, so at best it would take a team of ten full-time stenographers (each working 40 hours per week) about 12 years to transcribe the whole record. Big data is a big problem if you want people to process it by hand.
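
The 12-year figure can be checked with a quick back-of-the-envelope calculation. The sketch below is only an illustration, and it rests on two assumptions not stated in the text: that stenographers transcribe in real time (one hour of work per hour of recording) and that they work 52 weeks a year. The function name is hypothetical.

```python
def years_to_transcribe(total_hours, stenographers, hours_per_week,
                        weeks_per_year=52, speed=1.0):
    """Estimate the years a team needs to transcribe a recording by hand.

    speed is hours of recording transcribed per hour of work
    (1.0 means real time, which is an assumption, not a measured value).
    """
    team_hours_per_year = stenographers * hours_per_week * weeks_per_year * speed
    return total_hours / team_hours_per_year


# Figures from the article: 250,000 hours, ten stenographers, 40 hours per week.
years = years_to_transcribe(250_000, 10, 40)
print(round(years, 1))  # prints 12.0
```

Any slowdown below real time, or any time off, pushes the estimate higher, which is why the article frames 12 years as the best case.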

The solution that Dr. Roy and his team have adopted is to automate as much of the transcription as possible. This is also not easy. Imagine trying to build a tool that's more accurate than Siri while operating on both adult and child speech in a noisy and unpredictable home setting. Dr. Roy's team has made some important technological advances in the past several years -- they can extract certain specific words like "water" with good reliability -- but they still can't automatically transcribe the speech in general.

As a result, we still don't know how much of what children know about language is innate. The major technical advance of being able to record and store a lifetime of language puts us one step closer to the answer. But this first advance will only have its full impact when we make a second one -- when we can fully explore the rich data source housing answers that are currently just out of reach.
