Seth Maislin's Indexing Blog: Demands for quantity are misplaced

14 May 2006

Demands for quantity are misplaced

There is a continuing trend in search engines: more, more, more.

Press releases from Google, like “Google Checks Out Library Books” [December 14, 2004] and “Google Tunes Into TV” [January 25, 2005], hit the media waves in a grand style. For the first time, entire libraries of books, from Harvard and Stanford Universities to the Universities of Michigan and Oxford, and soon the New York Public Library, will be available from Google’s website. Within limits of copyright, the words of entire books can be searched. Information philosophers are all over this story, essentially declaring that Google will become the public library of the next generation, excited about how the very nature of libraries might change, and scratching their heads over how the book publishing industry is going to survive yet another hit in the market.

In the second release, Google (as well as Yahoo!) applauds itself for once again providing access to a greater diversity of the world’s information, because television’s closed captioning content has been indexed into a Google Video database. Viewers of public broadcasting and basketball are early adopters, and why not? Finally, all those oh-so-deprived sports consumers can satisfy themselves on more than just the videos, statistics databases, press releases, articles, blogs, commentaries, and (don’t forget) live games themselves. Because now they can search among the announcers’ words.

Now when I search for 76ers, instead of getting 1.61 million hits, I’ll get 1.62 million. Phooey.

Our instinctive reaction is to be impressed. I’m thinking of all those Ph.D. theses gathering dust in the Physics-Optics-Astronomy Library at my alma mater, 150-page books without indexes. I’m thinking about Red Sox fans who, for the first time in a very long time, are interested in the World Series.

But I’m also thinking about the catalog search system at my public library, which won’t improve with Google’s additions. For a book already in the catalog, adding its content doesn’t help at all. Instead, we’d be cluttering up the database with a few trillion new words.

So our instincts are wrong.

Why Quantity Hurts

Every few years, search engine companies find new ways to promote themselves by bragging about how much they can find. In the late 1990s, Northern Light was independently rated top among competitors because it searched the largest percentage of the World Wide Web: sixteen percent. Time and Newsweek contributors warned, “We’re not finding everything.”

In the late 1990s, articles about the “invisible web” appeared in popular magazines and newspapers, explaining how search engines cataloged only text, image, and sound files, thus skipping over the good content stored as spreadsheets, databases, and fonts. Again came the cry, “We’re not finding everything!”

And now, in just two months, Google and Yahoo have added more to the huge pile of information: library books and closed captioning data. No longer are our searches limited to the billions of files already on the Web. Yay!

It’s all about quantity. Nobody seems to care about quality any more.

Libraries have been struggling to redefine themselves ever since the Internet (and more, the World Wide Web) reached people’s homes. Book publishers also have suffered. The failure isn’t that of these institutions and industries, however, but of the public. The public seems unaware of the natural filtering process inherent in human behavior. Publishers choose which titles to publish, and libraries choose which titles to add to their catalogs. You might not agree with their reasoning or results, but you do have to admit that there are human beings at the helm.

(By the way, Google's library program was put on hold because of criticism. This isn't because it's a bad idea or anything, but rather because the traditional publishing industries got scared they'd lose money. It's an I-was-here-first money-by-copyright battle.)

For many, this filtering-by-design is a major disappointment, which explains the astounding popularity of the World Wide Web. The Web allows everyone to speak up: to post pictures of their pets and babies, their ideas about government, their Harry Potter fan fiction. But in a room where everyone is shouting, nothing gets heard. Northern Light and Yahoo earned their money offering a way through the noise. Other companies, calling themselves search optimization experts, profit by offering their clients the means to be noticed by these search engines, the hypertext equivalent of megaphones.

The filtering process is missing.

Okay, yes, adding content to search databases is a good thing. I might make fun of sport fanaticism, but the desire to retrieve information of choice is a valuable privilege of the individual. I might not care about the Red Sox, but I respect that there are others who do. I also feel extremely happy for the researchers who now have access to volumes of scientific research. In many ways, adding content to a database is like translating content into new languages. Really, these are not bad things.

Even so, we are only adding to the number of people shouting in a room. Google and Yahoo are definitely improving the scope of what we can find, but they are not improving our ability to find. I might proudly accumulate more and more in my attic, while simultaneously making it harder and harder to retrieve anything.

Here’s a real example. My wife and I had been expecting our first child (15 months ago). We were struggling in our decision of her middle name. When we searched for “baby names” on the Web, with quotation marks around the phrase, we found 2.5 million sites at Google, 0.9 million at Yahoo, 0.7 million at MSN. Now, imagine that all the contents of library books and scientific articles and sports broadcasters are added to the Web. Although there may exist a few anthropology articles that would have helped us choose a name, I sincerely believe that over 99% of this new content would have proven unhelpful. I also believe that some of this unhelpful content includes the unusual phrase “baby names.” For example, consider this sentence, which appeared on the Web in November 2004: “Julia Roberts now joins the list of celebrities who have jumped on the Hollywood bandwagon, which gives license to choosing odd baby names.”

After my wife and I decide on a candidate name, we search for that name online, looking for its meaning. The query “Ryan meaning” (without quotes) for the boy’s name Ryan gets 1.2 million hits at Google. (By the way, Google suppresses near-duplicated content and never displays beyond the first 1000 results, so the “true” result set is quite inaccessible.) Because Ryan is a common name, it likely appears numerous times within bibliographies. The word meaning is also extremely common among scholarly articles. If Google indeed adds university libraries to its already large database, 1.2 million will become a very small number. As the scientists benefit from a library search, my wife and I will find it that much harder to learn about a particular name using Google.

It is common knowledge among library scientists and search engine experts that you cannot improve the accuracy of a search at the same time you improve its comprehensiveness. Either you get perfect relevance but miss something useful, or you get everything you want along with content that you don’t. As search engines trend toward larger and larger databases, results pages grow more cluttered.
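
To put rough numbers on that tradeoff, here is a minimal sketch in Python. It assumes the usual information-retrieval measures: precision (the share of returned results that are useful) stands in for accuracy, and recall (the share of useful pages that were returned) stands in for comprehensiveness. The document ids and counts are invented for illustration and do not come from any real search engine.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query's result set."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical ids: four pages that would actually help us choose a name.
relevant = {"d1", "d2", "d3", "d4"}

# A strict query: everything returned is useful, but half the good pages are missed.
strict = ["d1", "d2"]

# A loose query: every good page is found, along with six that only mention the words.
loose = ["d1", "d2", "d3", "d4", "n1", "n2", "n3", "n4", "n5", "n6"]

# The same loose query after ten more unhelpful pages join the database.
bigger_database = loose + [f"n{i}" for i in range(7, 17)]

strict_pr = precision_recall(strict, relevant)           # (1.0, 0.5)
loose_pr = precision_recall(loose, relevant)             # (0.4, 1.0)
bigger_pr = precision_recall(bigger_database, relevant)  # (0.2, 1.0)
```

In this toy case the helpful pages never go away as the database grows, so recall stays at 1.0, but the share of results worth reading falls from 0.4 to 0.2. That is the clutter I am complaining about, although a real engine's ranking would change which pages land near the top.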

Please, Sir, May I Have Some Less?

Google’s popularity as a search engine has nothing to do with the numbers of results. When I ask people why they like Google (or whatever search engine they prefer), they answer, “Because what I want is usually within the first few results.” I’ve also gotten the answer, “Because it thinks the way that I do.” Most people don’t want millions of results. They want three. Three quality results.

I can’t remember the last time I heard about a search engine improving its algorithms. Perhaps they do this all the time in secret, inventing features behind the scenes. I do know that if a search engine started regularly serving up nonsense, it would go out of business.

So why are these efforts at improving quality so unpublicized? Did you know that while Google pays attention to quotation marks, Lycos doesn’t? That you can type a zip code into Google to get a map? Many of Google’s best features are published in books like Google Hacks and even Google Maps Hacks (O’Reilly & Associates, 2004 and 2006 respectively), where few people are going to look for them. Either nobody cares, or nobody knows the difference.

But when Google adds sports commentary to its search engine, watch out! The story appears in all the major newspapers.

At times like these, I get rather discouraged. I feel as though I am trying to hold back the ocean. It takes me more than a week to index a single, average book. The information world is growing at such an insane pace, my job seems absurd.

At times like these, I have to remind myself of two perspectives. First, context. When I write the index for a book with 350 pages, it doesn’t matter that the “book of Google” has over 8 billion. Someone decided that these 350 pages needed to be written, and it’s my job to make them accessible. My work improves this book. No, I haven’t changed the world, but I have made a difference within the context of this one book, in a segment of this one industry, to a small set of readers. For me, indexing is like the civic duty of voting: few win by one vote, and yet every vote counts. It’s also contagious, because voting begets voting. And indexing does beget indexing, because 5% of the people I talk to about my job want to know more, offer me work, or express a desire to become an indexer themselves.

The second perspective is one of application. I don’t have to index books. If I wanted to make a difference at the source, there are many other ways to apply my skills.

The key to these perspectives is a willingness to become activists. We are environmentalists in an information world. Just as scientists show concern over a one-degree rise in ocean temperature, so should we show concern over a one-percent increase in information dissemination. Bulk up our search engines? This is not an environmentally friendly choice.

I want a search engine—it doesn’t even have to be Google—to announce that they’ve found a way to help me filter out the pages I couldn’t possibly want. The important word here is announce. The modus operandi of these companies is to bring more shouting people into the room, and then publicize this with pride. No wonder the libraries and publishers are in trouble: they’re not being praised for what they do. Neither are the indexers. In the public media, quality gets a whole lot less attention than quantity.

If the search engine is being improved, they’re not telling anyone. Apparently it’s a secret. I don’t want to hear that someone has added billions of pages to the database, unless I hear also about a system that filters billions of pages away.

I think it’s wonderful that more esoteric content is being added to the database. I applaud search engine companies who continue to improve their algorithms. What drives me crazy is that everyone is talking about the first, but not the second. The publicity is lopsided. Why won’t anyone talk about quality any more?

When we asked search engine companies for more, that’s what we got. And we lost precision. Maybe it’s time for us to ask for less.

Labels: books, Google, power of information, search engines

# posted by taxonomist @ 6:17 PM
