11/17/2015

Information Might Want to Be Free, But That Doesn't Make It Easy

Recently, the diligence of two scholar-detectives was rewarded by the discovery of two cultural landmarks. This past summer, there was a sweet find in the Cadbury Research Library at the University of Birmingham of the earliest known pages of the Quran, interleaved in a newer copy of the book. One hundred kilometers to the east, pages of the earliest known draft of the King James Bible were found among a pile of old papers in the archives of Sidney Sussex College of Cambridge. These discoveries resulted from old-fashioned, hands-on research, accomplished by physically digging through papers and books in archives. In a world where more and more objects are digitized and research in silico replaces in situ, these kinds of stories seem archaic. Projects like Google's (aborted) attempt to scan all the books in the world or the Internet Archive hint at (and some even promise) a future in which all information is a keystroke or voice command away. But we're not nearly there.

Happily, when the archive-divers came across these early religious texts, they were immediately able to evaluate their contents. In short, with this ancient media, human hardware and software remain "backwards compatible." In the digital realm this is not always the case, and it may prove to be the Achilles' heel of a new age of cultural preservation. We are instead more like the astronauts in 2001: A Space Odyssey finding the monolith. Early digital materials pose a new set of problems for the modern historian: often we no longer possess, or have easy access to, the technologies able to read the media on which the artist's or writer's work resides. These kinds of concerns motivate efforts to archive operating systems (and the operating systems that run the operating systems) and other kinds of software and hardware that ensure access to a growing accumulation of "digital ephemera" and "digital marginalia." Recently, some Dartmouth colleagues were fortunate enough to possess the hard drives of long-out-of-date and out-of-commission computers (one old Sun and one KayPro) of the cinema scholars Paul Spehr and his late wife, Susan Dalton. After several days of hard work, my colleagues finally cracked the code of these en capsae digital materials, and important data on early modern international cinema history were rescued. The archives of writers now regularly contain old floppy disks, DVDs and computers. Their contents run the risk of forever being hidden in plain sight.

The phenomenon of being "so near yet so far" can also be true of the treasures of the Web. Generally we engage through the mechanism of "search," a context and process very different from exploration and one not designed for "rummaging around." The companies that bring you search have many competing interests to balance as they aim to provide you with a good "search experience." This more or less translates into your clicking on a return high up in the list and then spending a good amount of time there, possibly even continuing to click through and through (yes, they know all of that...). The list of things you get and the order in which you get them are determined by some complicated combination of information about your behavior as well as information about other similar searches. Business considerations can also come into play. For example, companies can pay to have certain pages preferred over others. They can also design pages to take advantage of their understanding of how the search engines work. This "search engine optimization" is designed to optimize the company's web experience, not the users'. The digital equivalents of the ancient Quran and King James Bible copy are the kind of "unknown unknowns" that may escape the embedded biases of a search engine.

Search considerations aside, digital research can face other surprising obstacles. Suppose, for example, that you start a research adventure with a visit to Wikipedia (often the first return from your search engine). If you then "let your fingers do the walking," how much of the cited information is correct and accessible? My colleagues and I looked at the 5,000 most viewed pages of 2014 and the nearly 100,000 book and journal citations they contain. In this sample, roughly 85 percent of these citations were "verifiable" in the sense that they had either valid ISBNs (in the case of books), valid Google Books IDs, or valid "Digital Object Identifiers" (the standard digital document ID). This is not perfect but not bad. However, verifiability is just the first step. If the reference interests you, then you need to actually find the book or document. It might be online but only available directly -- i.e., without leaving your seat -- if you pay for it. Of the various Google Books that are in theory digitally available, 71 percent were partially available (i.e., samples were available), while only 12 percent or so were fully available and the rest not available at all. When we turn our attention to articles, only 13 percent were freely available in the standard open access digital repositories: the "arXiv" (maintained at Cornell University) and NIH's PubMed Central, which enables researchers funded by the NIH to fulfill their obligation to make NIH-funded work freely available to all.
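
For readers curious what a "valid ISBN" can mean in practice, here is a minimal sketch of the standard check-digit arithmetic for ISBN-10 and ISBN-13. It is an illustration only: the function names are hypothetical, and it should not be read as the procedure used in the study described above. A passing checksum also shows only that the number is well formed, not that the book exists or can be obtained.

```python
import re


def _strip(isbn: str) -> str:
    """Remove hyphens and whitespace and upper-case a trailing 'x'."""
    return re.sub(r"[\s-]", "", isbn).upper()


def is_valid_isbn10(isbn: str) -> bool:
    """ISBN-10: weights 10 down to 1; the weighted sum must be divisible by 11.
    The final character may be 'X', which stands for the value 10."""
    s = _strip(isbn)
    if not re.fullmatch(r"\d{9}[\dX]", s):
        return False
    total = sum((10 - i) * (10 if ch == "X" else int(ch))
                for i, ch in enumerate(s))
    return total % 11 == 0


def is_valid_isbn13(isbn: str) -> bool:
    """ISBN-13: alternating weights 1 and 3; the sum must be divisible by 10."""
    s = _strip(isbn)
    if not re.fullmatch(r"\d{13}", s):
        return False
    total = sum(int(ch) * (1 if i % 2 == 0 else 3)
                for i, ch in enumerate(s))
    return total % 10 == 0


def is_valid_isbn(isbn: str) -> bool:
    """Accept either form (hypothetical convenience wrapper)."""
    return is_valid_isbn10(isbn) or is_valid_isbn13(isbn)
```

A citation whose ISBN fails this test almost certainly contains a typo; one that passes still has to be tracked down, which is where the real trouble starts.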

This wouldn't be so problematic if a lack of immediate access didn't stop the entire enterprise -- but a 2012 study of UK academics suggests that it does. Therein we find that almost 30 percent of the over 3,000 respondents say that if they can't find the source locally (either online or in their local physical library) then they "often" give up and move on to another one. (The proportion goes up to almost 70 percent if we soften the modifier to "occasionally.") These kinds of actions -- or inactions -- might be traced to a recent Kaspersky Lab finding: that there is a growing desire for instantaneous information and an accompanying belief that all the information we need and want is available with a touch, tap, or swipe.

These brief examples are just cautionary call-outs to consider as a large part of our knowledge base moves online. Of course there is much, very much, that is good about this shift, as it brings a wealth of information to many who previously had little or no access. However, even though information now spills effortlessly onto the screens in front of many of us -- but still only for those on the green side of the digital divide -- the data deluge can give the illusion that everything we find immediately is all there is to find and that it must also be correct. Eternal vigilance is the price of free data. Whether it's a dusty archive or a rusty hard drive, information may "want to be free" (in the famous words of Stewart Brand), but that doesn't mean that obtaining it will be easy.