Devereux Chatillon

It's About Search, Stupid

Posted: 4/7/10

If websites, databases and other content are the landscape of the virtual world, then search engines are the maps. Without search engines, the landscape is confusing and getting lost a certainty. With them, finding one's way through the dense forest of information is possible, if occasionally made difficult by unexpected detours and dead ends.

Disappearing from the results of dominant search engines leads to invisibility. And if one has a website, a blog, an ecommerce site, or a database that no one knows exists, it is useless. Given how critical maps are to successful navigation, having accurate, affordable maps that fulfill the varied needs of a diverse population is key. So, how would we all feel about giving one for-profit company the exclusive right to map, say, New Jersey or Mexico? If no one else could produce a map of New Jersey, there would be no market incentive to produce the best maps that met all the various needs of the population (shortest route to Delaware from New York, coffee shops with baby-changing stations). If the mapper wanted to direct traffic to its stores in Toms River, there would be no incentive to produce maps that showed the most direct route to Delaware instead of detouring through Toms River.

Yet giving just such exclusive rights to some important internet territory is one of the key issues involved in a proposed settlement between Google and all the book publishers and authors in most of the English-speaking world.

Briefly -- Google undertook a project to digitize millions of books in the libraries of several major universities such as the University of Michigan and Stanford. Google copied books in their entirety that are in the public domain, as well as those still in copyright. A handful of US publishers and the Authors' Guild, a not-for-profit organization representing US book authors, sued Google for copyright infringement. Just a few weeks ago in federal court in downtown Manhattan, the judge listened to a day's worth of objections and support for a proposed settlement agreement that runs over 300 pages.

This complex agreement accomplishes several things that would be beneficial to the public, authors, and the scholarly community. Under it, digitized books that are part of Google's database would be made available in snippets as search results, and, unless the publisher or author objected, the entire book could be part of paid-for library subscriptions or various kinds of ebooks. Previously buried and obscure works would suddenly see the light of day. And, because Google would facilitate text-to-speech functions for this database, all of these roughly 17 million books (Google has given varying estimates of the number digitized) would become available to those who have visual disabilities.

Why would Google spend all that money -- millions to digitize, more millions to litigate the case it had to know would come, and more millions to settle that case -- for what will amount to a library lending and ebook business? Keep in mind that Google's revenue alone last year was $23.6 billion. This is more than half the $40.3 billion in total revenue generated in the United States by more than 100,000 publishers. And not one dollar of Google's revenue came from publishing books. It came from the enormous ad revenues generated by Google's search and AdSense businesses. With a profit margin of approximately 25%, search in 2010 is far more profitable than publishing.

If the settlement is approved by the Court, Google will be the only search engine that will serve up search results that include the contents of some 5-10 million books -- the books whose authors, publishers, or copyright holders can't be found or don't want to be found. Because of the intersection of copyright and class action law woven together by the proposed settlement, no one else will be able to do that. What does that mean for Google? It means that the results and experience from a Google search, as opposed to the results from any other search engine, will be richer. It means that Google's ability to refine its algorithms for search results and its analysis of consumer behavior, interests, and needs will have a depth and a range that no one else can match.

A recent article in Ars Technica described Google's current practice of keeping consumer data for 9 months, much longer than any other major search engine, because it uses the data for a variety of important (and profitable) business needs: "Search data is mined . . . by watching how users correct their own spelling mistakes, how they write in their native language, and what sites they visit after searches. That information has been crucial to Google's famously algorithm-driven approach to problems like spell check, machine language translation, and improving its main search engine."

Google's exclusive ability to map these books, and to observe how consumers interact with that map and the content that these books represent, would give Google a significant competitive advantage in the most profitable internet-related market, one in which it is already dominant. Not surprisingly, the Department of Justice has announced that it is investigating.

Google has publicly proclaimed that without this settlement these out-of-print books will remain buried in libraries with no ability for most people to find them. But is that necessarily true? If it is indeed a public good for these books to be accessible, then shouldn't it be public institutions, perhaps with private cooperation and funding where appropriate, that accomplish that result?

Couldn't the Library of Congress start to assemble a digital database that would be used (perhaps for a fee) by all search engines? After all, US copyright law currently requires that two copies of every work registered be deposited with the Library of Congress, unless exempted by regulation. Why not have one of them be digital with appropriate safeguards? Couldn't (and shouldn't) Congress finally enact some kind of safe harbor or compulsory license scheme so that digital copies of past work are made available for limited uses such as search with compensation to rightsholders where appropriate?

After all, if the goal is to create a library for the benefit of the public, then a private database won't cut it. If this settlement is approved and actually starts to operate, Google's insuperable advantage may well prevent all the other possible players, both public and private, from helping to create something truly public and accessible to all.

 
 
Comments (26)
10:52 PM on 04/10/2010
Sorry, I should have put this in my first post: The Google Settlement allows Google to use all the scanned books to improve their search engine, without paying any of the copyright holders for this kind of use. However, the Settlement is NOT all about "snippets." The Settlement enables Google to SELL entire books to both consumers and libraries, as individual e-books, as print-on-demand books, and as parts of large packages sold to libraries by subscription.

So yes, search is important to Google, and to Google's rivals Microsoft and Yahoo, who want the same copyrighted data as Google to improve THEIR search engines. But search is not even remotely the only thing the Settlement is about. I do think copyright holders would be harmed by Google pouring all that work in so readers can just get data free instead of buying books, but it's not clear that that is what Google will do. They do seem to want to use the books to improve their search engine and their machine translation tool, but even their rivals can only speculate as to how.
10:44 PM on 04/10/2010
Re the other common assertion that the Library of Congress should have had a book-scanning project but they don't so Google was somehow forced to do it: The Library of Congress has in fact been copying and making available rare public-domain works for years, since well before Google started their scanning project. The scanned Library of Congress materials include books, photos, sheet music, film clips--all kinds of rare materials. These have long been available on the Library of Congress website, at least for given time periods. Many are also now available on www.internetarchive.org. There is currently a large Library of Congress scanning project financed by the Sloan Foundation, and I see new material scanned under this project uploaded to the Internet Archive every few days when I check.

There are also other libraries and universities with smaller scanning projects, such as (one of many examples) Cornell's "HEARTH" project, which has scanned and posted numerous public-domain works on home economics. There are also large, for-profit databases of public-domain scanned works available to libraries.
04:41 PM on 04/10/2010
Part 1

The Google scanning project has made no attempt to identify which are "orphan works" or to trace the rights holders. The Settlement covers every book published in the US, the UK, Australia, and Canada before January 5, 2009. Millions of books were scanned that are not only copyrighted, they are currently in print and listed in widely available book industry databases.

The Settlement contains use and payment provisions applying to the entire rest of the book's copyright term, however many decades that may be. The Google database of scanned books is not publicly available, but one party said in a letter to the court that it is a total mess, with bestsellers such as The Da Vinci Code and Harry Potter books listed as both scanned and out of print. Google--who is still scanning books, by the way, even those published after January 5, 2009--re-enters a book in the database every time it is scanned, forcing copyright holders to "opt out" the same book repeatedly.

No attempt has been made to locate authors of "orphan works." Google merely asserted the rights to fully use every book not opted out before a tight opt-out deadline, and made a scant effort to locate authors even en masse.
04:39 PM on 04/10/2010
Part 2

The Google Settlement contains clauses saying that if a copyright holder opts into the Settlement but Google uses the books without that person's permission, all they can do is appeal to an arbitration board set up by Google. On the other hand, the Google Settlement contains a clause saying that it will not guarantee _not_ to sell the works of those who have opted out of the Settlement entirely.

Any truly orphaned US work is by definition published after 1923. As a book collector, I know that most truly rare books are old enough to already be in the public domain. Furthermore, most post-1923 books are available in libraries, by interlibrary loan if not locally. They are also widespread on the used book market, including Internet bookstores. The claim that an out-of-print book is not "available" unless it is in e-form is false.

In fact, Google's whole "orphans" PR is a shuck, designed to disguise as altruism a massive grab to control the publishing industries of several countries. The language of the Settlement indicates plans to _sell_ those books, not give them away to the public. Google has announced that they will launch their e-bookstore "Google Editions" in the first half of 2010.

Other search engines such as Microsoft and Yahoo are merely disguising as altruism a desire to use copyrighted works for their own gain just like Google.
09:34 AM on 04/09/2010
So Google invested the resources to do something of real value to everyone that no one else did and you want to blame them for that? These orphaned books are basically completely unavailable to 99.9% of the world right now and with this agreement they will be once again in the public with the authors (if they can be found) actually being compensated for a book that hasn't been bought in 50 years.

This reminds me of something that happened in San Francisco a few years back. Google offered to build a city-wide wifi system which would provide broadband wireless Internet to everyone in the city for free. The City was initially interested, but San Francisco politics being as they are, the conversation quickly turned to a discussion of the evil of corporations. People said, "Why don't we do this ourselves?" So the City rejected Google's free offer in favor of looking to do the same thing with public resources (something that would undeniably cost the City money while Google's was FREE). Well, three or four years later, surprise, surprise - the City hasn't done anything, and building out public wifi isn't even on the agenda because San Francisco is broke (and incompetent).

Talk about biting off your nose to spite your face...
lastpost
07:13 AM on 04/08/2010
“It's About”
Imagine an interweb based on true democracy. It would be possible to directly ask the population to select their preferred choice. By posing a question, and responding to the majority vote. Thus any route forward we might take, could be rapidly reassessed and revised.
If it became clear that a particular organisation presented an unacceptable threat, it could be stopped or its approach altered. In this case, by virtue of that very device within which it dwelt.

“If websites, databases and other content are the landscape of the virtual world, then search engines are the maps”
But the prime motive force remains the human brain.
03:42 AM on 04/08/2010
Yes, the Library of Congress should have done this. But they didn't, so, let it be Google that serves humanity when our government failed to. The priority of our government is to put money into Iraq and other far-flung places, not into the United States -- and certainly not into "artsy-fartsy" things like books.
01:39 AM on 04/08/2010
Why, so Microsoft can compete?
HUFFPOST BLOGGER
Devereux Chatillon
06:59 AM on 04/08/2010
To my mind, yes--Microsoft and Yahoo and the folks currently working on something fantastic in the proverbial garage somewhere.
07:07 PM on 04/08/2010
Doesn't Microsoft own enough? What an insult to garage innovators.
c-tom
Badges we don't need no stinking badges
11:59 PM on 04/07/2010
Personally I love Google Books so far. I never would have gone to a research library to look up my Loyalist ancestor. I did go into Google and found the Pennsylvania Archives with the Bill of Attainder against him and signed by the President of Pennsylvania. It was worth it to find out Pennsylvania used to have a President.
08:19 PM on 04/07/2010
Nothing in the proposed settlement would prevent Congress from authorizing the Library of Congress to create the type of database you discuss. You would, I think, be more effective in advocating change if you made a clear proposal for positive action without conflating the issue with Google.
HUFFPOST BLOGGER
Devereux Chatillon
08:25 PM on 04/07/2010
You're of course correct that nothing would prevent Congress from authorizing these kinds of changes. And that's a great idea for the next post. The point of this, however, was something I felt was missing from most of the commentary I read, which is that this is much more about search and the value of search than it is about ebooks.
08:38 PM on 04/07/2010
Is that because your firm has over 100 lawyers dedicated to Intellectual Property Rights and Technology?
HUFFPOST SUPER USER
realitytrumpsbull
two 'alves of coconut!
08:16 PM on 04/07/2010
Is the Internet indispensable, in this day and age? Well, no. People used to have to think for themselves, and read books, and it wasn't THAT long ago, either. So, what happens when people by the millions start deciding that they're against all the international/industrial espionage, copyright theft, questionable content, and just disconnect their internet service? Does all the fiber optic cable start shriveling up and dying like some sort of fence-crawling plant, once the water's cut off? Some people can live without the www, some people can't; which are you?
01:36 AM on 04/08/2010
If you want to stay competitive, the internet is absolutely necessary.

Some people don't like change: are you one of those?
01:37 AM on 04/08/2010
and BTW, why are you on HP if you hate the internet?
MrWebster
Moderate this.
08:12 PM on 04/07/2010
I have read now and then about this, and am still trying to find what Google is doing to writers and publishers. I went to Google Books and did a search on an obscure Danish crime novelist (Steen Steensen Blicher) from the last century and found some books--hey, great. If they digitized it, why can't they keep and use it anyway if they honor the legal restrictions? University and local libraries do the same with hardcover books. They put restrictions on their own copies. Try to get into a rare book collection as just somebody off the street interested in rare books.

I suspect there are several agendas going on. First, Google (and the Internet tubes in general) have taken away royalty-free, no-copyright-hassle reprints. Who is going to buy Shakespeare's collected comedies when you can get them with Google Books??

Second, it seems that some writers and publishers are arguing for permanent copyright for themselves. Even beyond the legal copyright expiration date, they want to claim copyright and its financial benefits. And really, the point about Google's profits looks more like a statement about deep pockets for the pickin'.

As for the map analogy--can't other search engines do what Chatillon claims? Is Google going to hide their massive book collection from other search engines??
07:53 PM on 04/07/2010
Full text search indexing is simply not that hard. Putting full-text search functionality on digitized books (or internet articles, or usenet articles) is not rocket science. Just last year I personally wrote a search engine that runs over multiple terabytes of data that Google simply does not process (and their search functionality over text groups is hopelessly broken compared to the job DejaNews did more than a decade ago, before they were purchased by Google).

It is incorrect to credit Google with the "exclusive ability to map these books." Anyone digitizing books -- or producing them originally in digital form (most books are sent from the publisher to the printer as pdfs) -- can index the text using, say, the open-source full-text indexing and search engines Lucene or Sphinx.
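
To make the point about indexing concrete, here is a minimal sketch of an inverted index, the basic data structure behind full-text search, written in Python. It is an illustration only, not how Lucene, Sphinx, or Google work internally; the function names (`tokenize`, `build_index`, `search`) and the sample texts are hypothetical.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every word in the query."""
    words = tokenize(query)
    if not words:
        return set()
    # Start from the first word's matches, then intersect with the rest.
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical sample "books", for illustration only.
books = {
    "book-1": "A history of the Pennsylvania archives",
    "book-2": "Maps of New Jersey and Delaware",
    "book-3": "The archives of New Jersey",
}
index = build_index(books)
print(sorted(search(index, "new jersey")))  # ['book-2', 'book-3']
print(sorted(search(index, "archives")))    # ['book-1', 'book-3']
```

A toy like this leaves out everything that is hard in practice: scale, ranking, OCR quality, and the legal right to hold the text in the first place.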

What Google has done is to deploy a small army of scanners at two university libraries -- but libraries all over the world are doing this for themselves -- and many publishers are doing the same in their production of eBooks. Scholastic, in particular, in their K-12 educational mission, has numerous book search facilities, more innovative in their design than anything Google is doing; Scholastic's search products are targeted to produce teacher, student and parent resources that produce knowledge, learning and education, not just "snippets."

So, while I agree with the sentiment of your post, the "insuperable advantage" you claim Google currently enjoys is rapidly being superseded by the more widespread use of easily-accessible technology.
HUFFPOST BLOGGER
Devereux Chatillon
08:30 PM on 04/07/2010
I agree as far as books that are in the public domain and available to everyone, or able to be "permissioned," if you'll forgive the publishingese. The issue I was trying to highlight was that there is a category of several million books that are still in copyright but whose owners aren't able to be found to grant or deny permission. If the settlement is approved, Google would be the only search engine able to access these books without risking massive copyright infringement.
11:13 PM on 04/09/2010
Well that's just not right. If permission to scan and OCR and index and provide "snippets" of material in search results is granted to Google, it has to be granted equally to libraries and publishers and most urgently, as you mention, the Library of Congress.

If the settlement is with a class represented by "A handful of US publishers and the Authors' Guild," then this settlement could only possibly apply to copyright owners who have explicitly opted in to that class. This would necessarily exclude the "category of several million books that are still in copyright but whose owners aren't able to be found to grant or deny permission." This undermines Google's claim that their activity is for the public good of electronically publishing these specific materials.

If the Authors' Guild claims to represent these authors who "cannot" be found -- well, perhaps they can claim to be the widow of the unknown soldier as well.

Google might find a more winning strategy in attacking the basis of the class, rather than touting their rights based on some public good which should more rightfully be extended to publishers and libraries.

And what happens to the advertising proceeds on the "snippets" once the author has (perhaps through a Google search) suddenly been "found"? Are they to be told that this third party, this "Authors' Guild" these authors have never heard of, took the settlement and skedaddled?

The widow of the unknown soldier indeed.
07:14 PM on 04/07/2010
Are you representing a group in this discussion, or is this citizen concern?
HUFFPOST BLOGGER
Devereux Chatillon
08:26 PM on 04/07/2010
These are my views, not those of any client or group. Just me, as an interested observer.
04:48 AM on 04/08/2010
You didn't really answer the question. Does your firm represent an interested party in this debate? Your piece certainly has the ring of professional advocacy.