25 March 2007

 

Indexers indexing infinitely ... like monkeys

Three ideas have merged.

First, there's the idea I published last December as "A needle in a haystack with 100,000,000 blades," where I argued how the Web, or an approximation thereof, could be indexed by humans for a reasonable amount of money.

Second, there's The New York Times article "Artificial Intelligence, With Help From the Humans," in which we learn that the Amazon Mechanical Turk service subcontracts human workers to perform tasks that are especially challenging for computers to accomplish, such as matching images to textual descriptions. For some jobs, Turkworkers might make one penny per transaction.

And finally, there's the infinite monkey theorem, which states that a monkey hitting keys at a typewriter for an infinite amount of time will "almost surely" type the complete works of Shakespeare, or something similar. I first heard this idea as "a million monkeys and a million years," but I bet the math's a bit different. After all, "infinite" is much bigger than a million million.

Putting these ideas together seems to provide a rather obvious solution: third-world indexers. After all, if it costs only a nickel to get someone to write a few keywords for something, we can get a lot of indexing done very cheaply; I say "third world" because no indexer I've ever known is willing to work for a penny per word.

The indexing industry is facing the very real possibility that our workload will be taken from us and delivered to those in economies that allow lower prices. But what if we went a step further and, instead of looking for less expensive indexers with good qualifications, we decided to look for dirt cheap indexers with no qualification other than time to waste? What if, I ask, we asked monkeys to pound away at their keyboards?

I find the idea amusing but too close to the truth. After all, the intelligence behind Google is the social intelligence, the uneven and culturally biased workings of millions of Internet users plugging away at their disparate tasks. What Mechanical Turk has going for it, then, is the human decision making at the back end. Whereas most search engines look for better and greater stores of metadata with which to judge content, one man in a back room can make smarter decisions upon command. No, the real problem is that today's human intelligence is worth only pennies per word. Computers do their best, and humans sweep up afterwards. Our natural intelligence isn't worth a whole lot, I guess.

That's how we know computers are smart. Computers own us monkeys.



28 December 2006

 

Eighteen million people can't be wrong

No matter how much you and I might like Google, the fact is that Google has some very serious problems when it comes to finding content. More specifically, if you're looking for the "right" answer, or if you're attempting to do any serious research, Google is likely to fail you miserably.

The flaw lies in Google's strength: social algorithms. Social algorithms are processes in which decisions are made by watching and following the majority of people in a community. If blogger.com tends to be the place people go to create blogs, then a social algorithm will see blogger.com as "better." When a search engine is managed by a social algorithm, a website might appear first in search results not because of the quality of the site, but rather because a larger number of people treated the site as if it were of higher quality. In other words, social algorithms equate "majority" with "best," something that often looks right but actually is patently untrue.

When you perform a search at Google.com, your results are sorted based on majority behavior and little else. For simple questions about anything -- as well as complex questions about cultural issues, for which "lots of people" is critical -- the majority opinion is frequently rather close to what you want, which is why Google is so successful. But the gap between "close to what you want" and "accurate" is an invisible one, and that makes it insidious and dangerous.

For example, search for "Seth Maislin." The first hit is my website. The second hit is this blog. The third hit is an interview I did for O'Reilly & Associates in July 1999. An investigation of why these are the top three sites is rather interesting. First of all, these are the only results in which my name actually appears in the title; the fourth link and beyond have my name in the document, but not the title. Second, my website appears at the top not because it's the definitive website about "Seth Maislin," but because Google knows of 24 people linking to it. In comparison, the only person who ever created a link to this blog is me -- a number far less than 24! The same goes for the O'Reilly interview, except that the single linker isn't even a valid site any more: it's broken. The popularity of my home page (in comparison to this blog, for example) is why it's a better hit for my name. But if you folks out there started to actually link to this blog, that would change.
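
To see how little intelligence that takes, here's a toy sketch in Python of the crudest possible link-popularity ranking. The site names and link counts are made up, and real engines like Google use far more elaborate link analysis, but the majority-wins flavor is the same.

    # Rank "pages" by how many other pages link to them.
    inbound_links = {
        "my-home-page": 24,        # two dozen known inbound links
        "this-blog": 1,            # only the author links here
        "oreilly-interview": 1,    # a single linker, and that site is now broken
    }

    def rank(pages):
        """Return page names sorted by inbound-link count, highest first."""
        return sorted(pages, key=pages.get, reverse=True)

    print(rank(inbound_links))
    # ['my-home-page', 'this-blog', 'oreilly-interview']

    # A "Google bomb" in miniature: thirty new links to the blog, and
    # popularity -- not quality -- moves it to the top.
    inbound_links["this-blog"] += 30
    print(rank(inbound_links))
    # ['this-blog', 'my-home-page', 'oreilly-interview']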

You should look into the search results for the word "Jew." A website known as JewWatch.com, an offensive and inflammatory collection of antisemitic content, had appeared as the number-one result at Google.com for this one-word query. This happened because a large number of supporters of this site tended to build links to it; then, those who were outraged or amused also linked to it within their protestations. In the end, the social algorithms at Google recognized how popular (i.e., "linked to") this site was, and in response rated it very highly -- in fact, rated it first -- compared to all other websites with the word "Jew" in the title. Eventually, those who were enraged by this content fought back by asking as many people as possible to link somewhere else -- specifically, the Wikipedia definition of Jew -- just as I have here. Over time, more people linked to Wikipedia than to JewWatch, and so the latter dropped into second place at Google. This process of building networks of links in order to influence Google's social algorithm is called "Google bombing." In other words, when the people who hated the site acted together in a large group, Google's social algorithms responded.

(By the way, you'll notice that I do not create a link to the offensive site. I see no reason to contribute to its success.)

Do you see the problem? The success of Google bombing is analogous to the squeaky wheel metaphor, that the loudest complainer gets the best service. Social algorithms reward the most popular, regardless of whether they deserve it. JewWatch made it to the top because it was popular first; Wikipedia's definition moved to the top because those offended banded together to demonstrate even more loudly. And in the end, there's no reason for me to think either of these links is best.

Whether popularity is a good thing or a bad thing is often subjective. In language, some people lament the existence of the word ain't, while others applaud its existence as an inevitable sign of change; either way, the word is showing up in our dictionaries because more and more people are using it. But I'm not talking about language; I'm talking about truth.

Do you think vitamin C is good at preventing colds? Well, it isn't; there have been no studies demonstrating its effectiveness, but there have been studies that show it makes no real difference. (It's believed that vitamin C will shorten the length of a cold, but studies are still inconclusive.) But after a doctor popularized the idea of vitamin megadosing, our entire culture now believes that taking the vitamin will keep you extra healthy. Untrue.

Do you know why "ham and eggs" is considered a typical American breakfast? Because an advertising executive in the pork industry used Freudian psychology to convince people to eat ham for breakfast. He did it by asking American doctors if they thought hearty breakfasts were a good thing (which they did); the ad-man then asked if ham were a hearty food. Voila: ham, sausage, and bacon are American breakfast staples, and the continental breakfast vanished from our culture.

In both of these examples, majority belief trumps the truth. And look at the arguments about global warming! I won't repeat the arguments laid out by Al Gore in An Inconvenient Truth, but his argument is that as long as enough people insist that global warming isn't true, its dangers will remain unheeded. In fact, I'm not even going to argue here whether global warming is a real thing or not; it doesn't matter what I believe. What matters is that the debate over global warming isn't a fight over the facts. Instead, it's a shouting match, in which the majority wins. Right now, so many influential people have argued that it doesn't exist (or isn't such a big deal) that very little has been done in this country in response to its possible existence. But as more and more people start to believe it's at least possible, it's becoming a reality. Doesn't that just drive you nuts? Why are the facts behind global warming driven by democracy? Can't something be true even if no one believes in it?

One last look at this "majority rules" concept, only this time let's avoid politics and focus on simple word spelling. If you search for the word millennium, correctly spelled with two Ls and two Ns, you'll get about 54 million hits at Google (English-language pages only). If you search for the word millenium, misspelled with two Ls and only one N, you'll get 18 million hits. Twenty-five percent of the pages that use the word at all have this misspelling in them! For published content, which by its very nature should be biased toward correct spellings, that error rate is monstrous! But does Google let you know that millenium is misspelled? Does it ask you if you "meant to type millennium"? No! After all, Google considers the misspelled word correct.

I mean, eighteen million people can't be wrong, right?



24 December 2006

 

I met a famous indexer the other day

In my March 20 post ("Frustrated by a lack of meaning"), I made reference to a Microsoft clip art mess that was quite public. The story is that the keyword "monkey bars" caused certain images to appear when someone searched for the word "monkey," and that these results were misinterpreted in a strongly negative way.

Well, I met the indexer who actually wrote those keywords -- someone I've known for a long time -- and I have to say, there's something really cool about realizing that one of your good colleagues was behind that story. I also find it reassuring that the indexer is someone who really knows what she's doing, because it emphasizes just how far apart good indexing is from good search: smart people, dumb tools.

For more on this subject, I recommend reading The Inmates Are Running the Asylum. The book is about computer programming in general, but the sentiment is dead on.



 

A needle in a haystack with 100,000,000 blades

The Internet has more than 100 million websites, according to the November Netcraft survey. If you were standing on top of the growth curve, by now your stomach would have nothing left to vomit up.

I did some math, and I've figured out a way to make sure that all of these websites are indexed. Here's what I discovered.
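
Here's a rough sketch of the arithmetic, assuming a team of 50 indexers typing literally around the clock and a handful of keywords per site:

    # Back-of-the-envelope: how fast would each of 50 indexers have to work
    # to cover 100 million websites in a year?
    sites = 100_000_000
    indexers = 50
    seconds_per_year = 365 * 24 * 3600         # no nights, weekends, or breaks

    sites_per_indexer = sites / indexers        # 2,000,000 sites each
    seconds_per_site = seconds_per_year / sites_per_indexer

    print(f"{sites_per_indexer:,.0f} sites per indexer per year")
    print(f"about {seconds_per_site:.0f} seconds to keyword each site")
    # roughly 16 seconds per site -- quick typing indeed
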
So there you go: a team of 50 people can index the Internet. That doesn't sound nearly as bad as I thought. Of course, everyone will have to type rather quickly, and we'll need a system in place to prevent us from accidentally indexing any one website more than once, but that shouldn't be too bad. And yes, I'm assuming that all of these websites are in English, but most of them are; I'll bring in a few translators to work on the few that remain.


At U.S.$50,000 per year per indexer, which is quite modest for a highly intense round-the-clock job like this, plus $100,000 for me as manager, I could probably put together a bid of about $350,000/year to get the job done. Given how many billions of dollars are spent or exchanged over the Internet today, that seems quite reasonable, too. Heck, I should triple the whole thing, since we'd have to re-index the old sites every once in a while. Maybe I should double it again, too, so we'd be allowed to use eight keywords instead of four.

So let's see, that brings the total bill to $2.1 million. Gosh, that isn't bad at all, is it? I mean, we all agree that indexing the Internet is at least a two-million-dollar-per-year business, right?

Except it's not. Indexing the Internet is a zero-dollar-per-year business. No one is doing it. Just about no one seems to care about quality keywords. In fact, there are only two industries that exist around keyword creation. One of them is misnamed "search optimization," which is about spamming the heck out of the Web. Optimize, I think not: this is the opposite of the intelligent product my team would build. The other business is the search business itself, companies springing up around those fancy algorithms that Google, Yahoo, Lycos, Ask Jeeves, and the rest use. The thing is, those algorithms are just word-matching machines. These engines are looking for keywords, but none of them is actually writing any. So you see, no one with indexing training is writing any keywords. The inexpensive market for human indexers is being completely overlooked.

Guess it's not worth the two million.



27 August 2006

 

Information is owned by the few

Consider the history of manufactured items. At one time in United States history, the manufacturer in your neighborhood was the primary source of whatever it made. If Maytag had a plant in your city, you bought Maytag. There was almost no question of buying a competitor's product from a distant city, with reasons ranging from the practical (delivery requirements) to the social (your family was employed there) to the unconscious (you heard about this company every day in the news). Manufacturers were king: they made, you bought (assuming you could afford it), and no questions were asked.

Then came the middlemen, resellers like Sears, who discovered that if you brought a number of competing products into the same showroom, customers came to that showroom to make an educated decision. No longer convinced to buy from one manufacturer, you could shop among several models. This was how resellers made their money: providing you a service you'd pay extra money for. Products and manufacturers that failed to compete well in side-by-side arrangements were abolished in the face of consumer choice.

And finally came the Internet. The World Wide Web provided you with not only all the same information the resellers had, but much more: professional and amateur reviews, community-level and industry-specific emails filled with recommendations and warnings, and manufacturers' contact information in case you had questions. Now you could shop intelligently around the world. Much of the resale industry was demolished, now that its services paled in comparison to what consumers could do themselves. Look at the fate of independent bookstores, which all but vanished in a wired world where consumers read reviews and compare prices among Amazon.com, BN.com, and Borders.com, only to buy the book from an Internet-based reseller with massively discounted prices. Travel agents, too, disappeared in the face of Expedia.com and Travelocity.com.

It is believed, therefore, that the Internet has empowered the individual.

Not true. I'm sorry to say that it's all an illusion.

First of all, the online resellers are no better than the brick-and-mortar resellers. After browsing the options available at an online travel agency, it's often cheaper to then go to the airline site itself to buy your tickets. For example, if I want to fly from Boston to San Francisco, I'll plug my dates into a search engine at Expedia (and later, Travelocity), find the cheapest option at the best times, and then buy my ticket at Delta.com, Southwest.com, or another airline, or perhaps call a human travel agent at AAA after all and start again. As long as I have access to the source of that service or product -- the manufacturer, the service provider, etc. -- the reseller is a source of information without a sale.

Second, the online resellers are limited in scope. Thanks to partnerships and other marketing choices, not all of my options are provided. For example, both Expedia and Travelocity tend to overlook small, unaffiliated airlines. Additionally, at one time (and perhaps still today) Expedia charged extra money, without telling me, if I wanted to buy a ticket for USAir flights. The bottom line is that going through an online reseller is not necessarily more comprehensive or cheaper than my other options.

Biggest of all, however, is that for me to perform ANY search these days, I'm going to have to use a search engine, like Google.

Without even getting into problems with spam, search engines are responsible for providing me with the information I'll need to do anything on the Web, if I don't already know precisely how and with whom to do it myself. Google is the next Sears. If I want to find some good choices for a boy's name, Google will provide me with so many choices that I'll inevitably stop after the first twenty (and more likely, stop after three). Google is filtering my search, valuing some choices above others just as my supermarket creates end-of-aisle displays to sell me things. The only difference is that I know the supermarket makes money from the sale. With search engines, you have no way of guaranteeing you're not clicking on a link the search engine company prefers.

Consider the unscrupulous used car salesman. Let's step through the process.


  1. I approach the salesman asking a simple question: "I want a reliable automobile for a good price."


  2. The salesman immediately points out a few models. The first one he shows me is way too expensive. The second one is terrible. In comparison, the third one he shows me seems wonderful at first glance, but then I ask more questions.


  3. The salesman doesn't give me precisely the information I want. Some of his answers sound ridiculous. He's reluctant to show me any more cars. But when I keep pushing, he finally gives in and shows me a fourth car, without much enthusiasm.


  4. Finally, I ask for specific kinds of cars, things I've heard rumors about. "What about a Toyota Sienna? Is there a good Ford minivan?" The salesman is completely unhelpful. Clearly this was a terrible place to come shopping. Maybe I'll visit some dealers, or talk to my neighbor.


Let's compare this to a Google search for boys' names. I choose Google here because it's currently a very popular search engine that, people seem to believe, does an honest job in helping people search both online and offline content.


  1. I start with a simple request: "I want to find a good boy's name." My query is "boys names."


  2. Google gives me some immediate results. Some of them are immediately terrible and can be skipped over, but it doesn't take long to find something promising. I visit the website and, although it looks like what I want might be there, I have a hard time using it. I decide to give up and return to Google and its search result list.


  3. I try a second website, but I've lost confidence. Maybe it's not Google's fault in any obvious way, but none of these websites is helping me in the way I want to be helped.


  4. I decide to try some new queries. Maybe "boy names"? Do I need an apostrophe? Or perhaps, because I'm interested in a boy's name that isn't too ethnically different from the names I know in the United States, I should try a search like "American boy names." Unfortunately, my search choices are even worse. I give up. The Web is a terrible place to search for boy names. I'll try the bookstore.


You see? No practical difference.

You might think this exercise was a bit silly, but I'm not wrong. The people, companies, or machines that control what you want are the same entities that control the process. The car salesman controls which cars you buy; even if you trust him, the process is his, not yours. He's just nice about it. The same is true with Google. Sure, we all tend to trust Google -- and what's not to trust or like -- but we do not own the information-seeking process. Google owns it. Here's why:



But we don't have a choice. There is too much information in the world. We must go through an information repackager if we're not going to do the work ourselves. (Librarians do the work themselves; the results are of excellent quality, of limited quantity, and of almost negligible relevance for our day-to-day needs of airline tickets and boys' names. Libraries have some excellent information with which we can arm ourselves -- like using Consumer Reports to choose a quality used car -- but in general we still have to take the final steps on our own.)

Regardless of their motives, search engines OWN the information access. Maybe that's good enough. Maybe you're comfortable performing your searches in ignorance of the engine's inner workings, generally satisfied with the results most of the time. But please, that doesn't make it a good thing. What if Google started charging you for some of your searches? What if Google integrated its sponsored links into the search engine (as other engines did or do)?

Here's a real-life, immediate example. Search for Pluto. There has been a ton of recent press regarding Pluto's demotion as a planet in our solar system. Where is all that news on the search results page? There's just a tiny news area that most people won't see because it looks different, and then there's a bunch of sponsored links. This is a branding decision; Google thinks "news" and "sites" are very different things and doesn't even combine their results.

Don't kid yourself. The power of the Internet has moved, but not to you.



10 July 2006

 

Meta-indexing

Many indexers in the business are frustrated that search engines are getting so much publicity and credit, when in fact indexes are, well, more effective!

Usability testing has shown that people find indexes more accurate and more comprehensive than search, and yet the same tests show that people prefer search as a technology. In other words, people prefer search engines even though they unanimously agree that indexes are more accurate. This is not that different from the person who insists on lifting the heavy box himself, even though someone stronger and better able has volunteered. "No, no, I'll do it myself," says the searcher.

And then he injures himself. Silly, silly person.

What is at stake, apparently, is something more psychological or emotional. Search engines may offer users a sense of power and control, or a sense of speed, that indexes don't. Further, indexes seem so much more complicated when you glance at them -- words, words everywhere -- and in comparison search is so simple: an empty box. Just type a word and bingo! If you were to stop and look at this behavior you'd realize that there's something subconscious going on; rationality is losing to some deeper sense of emotionality and self. Search simply feels right in a way that using an index does not, at least not instinctively.

Some indexers take this news with a strong sense of pessimism, seeing this "shift toward the emotional" as paralleling our current lifestyle of sensationalist news and entertainment. They believe that indexes will become extinct in most practical circumstances, because search engines are psychologically preferred -- not to mention faster, cheaper, online, and scalable.

These pessimists aren't wrong.

However, I contend that the pessimists are also looking at the situation completely upside-down. Ask yourself what makes a search engine effective or likeable at all -- that is, what Google has that draws a majority of Web users to the Google.com website and convinces so many others to license Google technology for their own sites -- and you'll realize that there's indexing on the back end. People don't call it "indexing," necessarily, but the intellectual, rational processes that comprise indexing are still taking place.

The difference, however, is that a search company like Google doesn't really look at the individual words and their instances. Instead, the designers of Google search (and other tools) are looking at how people respond to these words. They are looking at behavioral patterns, and using those patterns to do the indexing for them.
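
A minimal sketch of what "letting behavior do the indexing" might look like -- the queries, site names, and scores below are invented, and real engines blend many more signals -- is simply to boost whatever people keep clicking on:

    from collections import defaultdict

    clicks = defaultdict(int)              # (query, url) -> clicks observed so far

    def record_click(query, url):
        clicks[(query, url)] += 1

    def rerank(query, results, text_scores):
        """Order results by word-matching score plus a bonus for observed clicks."""
        return sorted(results,
                      key=lambda url: text_scores[url] + clicks[(query, url)],
                      reverse=True)

    text_scores = {"stupid-fad-site": 1.0, "careful-reference-site": 1.2}
    for _ in range(50):                    # a fad: everyone clicks the silly site
        record_click("funny video", "stupid-fad-site")

    print(rerank("funny video", list(text_scores), text_scores))
    # ['stupid-fad-site', 'careful-reference-site']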

My brother and I used to play a game at ballparks. One of us, when it was his turn, would attempt to turn as many heads as possible without speaking. My brother would turn his head and look over his shoulder casually, then allow his eyes to lock on something imaginary but far behind all the people sitting behind us. He'd tap me on the shoulder and get me to look; I'd play along. He'd point. I'd point, and he'd correct me. Then he'd stand up. And so on. After a while, some of the people who could see us directly in front of them would grow curious to know what we were looking at, and they'd turn their heads to see. This would inspire other people to turn their heads, and so on. If we'd done our job well -- it was a game of timing as well as body language -- we could get hundreds of people to look behind them, at nothing.

This kind of behavior explains the popularity of some really stupid websites. Get enough people to visit your website, and Google will acknowledge that there's something about this website worth looking at. Then more people will look at it. Internet-based fads occur weekly, from paparazzi photos to cool advertisements.

If a human were indexing this, the indexer might think, "This isn't so important that it needs to be found a million times." That human is right. But the meta-human looks at what all the humans are already doing and thinks, "There is a cultural need for this content."

For a back-of-the-book indexer to break into the world of mass search, he'll have to give up the words and instead figure out the rules -- linguistic as well as social -- behind how these words are being used. Those rules, which govern how we find things (and not what we find), don't describe the indexing we know at all.

If indexing as we indexers know it is going to survive, we'll have to find that nifty middle ground between the words and the people. It should be easy, given that we already do this, but so far we haven't managed to break into this field at all. Hopefully we'll evolve.

In the next generation, we'll index the indexes.

(Brian Pinkerton developed the first full-text retrieval search engine back in 1994. "Picture this," he explained. "A customer walks into a huge travel outfitters store, with every type of item, for vacations anywhere in the world, looks at the guy who works there, and blurts out, 'Travel.' Now where's that sales clerk supposed to begin?")



14 May 2006

 

Demands for quantity are misplaced

There is a continuing trend in search engines: more, more, more.

Press releases from Google, like “Google Checks Out Library Books” [December 14, 2004] and “Google Tunes Into TV” [January 25, 2005], hit the media waves in a grand style. For the first time, entire libraries of books, from Harvard and Stanford Universities to the Universities of Michigan and Oxford, and soon the New York Public Library, will be available from Google’s website. Within limits of copyright, the words of entire books can be searched. Information philosophers are all over this story, essentially declaring that Google will become the public library of the next generation, excited about how the very nature of libraries might change, and scratching their heads over how the book publishing industry is going to survive yet another hit in the market.

In the second announcement, Google (as well as Yahoo!) applauds itself for once again providing access to a greater diversity of the world’s information, because television’s closed captioning content has been indexed into a Google Video database. Viewers of public broadcasting and basketball are early adopters, and why not? Finally, all those oh-so-deprived sports consumers can satisfy themselves on more than just the videos, statistics databases, press releases, articles, blogs, commentaries, and (don’t forget) live games themselves. Because now they can search among the announcers’ words.

Now when I search for 76ers, instead of getting 1.61 million hits, I’ll get 1.62 million. Phooey.

Our instinctive reaction is to be impressed. I’m thinking of all those Ph.D. theses gathering dust in the Physics-Optics-Astronomy Library at my alma mater, 150-page books without indexes. I’m thinking about Red Sox fans who, for the first time in a very long time, are interested in the World Series.

But I’m also thinking about the catalog search system at my public library, which won’t improve with Google’s additions. For a book already in the catalog, adding its content doesn’t help at all. Instead, we’d be cluttering up the database with a few trillion new words.

So our instincts are wrong.


Why Quantity Hurts

Every few years, search engine companies find new ways to promote themselves by bragging about how much they can find. In the late 1990s, Northern Light was independently rated top among competitors because it searched the largest percentage of the World Wide Web: sixteen percent. Time and Newsweek contributors warned, “We’re not finding everything.”

In the late 1990s, articles about the “invisible web” appeared in popular magazines and newspapers, explaining how search engines cataloged only text, image, and sound files, thus skipping over the good content stored as spreadsheets, databases, and fonts. Again came the cry, “We’re not finding everything!”

And now, in just two months, Google and Yahoo have added more to the huge pile of information: library books and closed captioning data. No longer are our searches limited to the billions of files already on the Web. Yay!

It’s all about quantity. Nobody seems to care about quality any more.

Libraries have been struggling to redefine themselves ever since the Internet (and more, the World Wide Web) reached people’s homes. Book publishers also have suffered. The failure isn’t that of these institutions and industries, however, but of the public. The public seems unaware of the natural filtering process inherent in human behavior. Publishers choose which titles to publish, and libraries choose which titles to add to their catalogs. You might not agree with their reasoning or results, but you do have to admit that there are human beings at the helm.

(By the way, Google's library program was put on hold because of criticism. This isn't because it's a bad idea or anything, but rather because the traditional publishing industries got scared they'd lose money. It's an I-was-here-first money-by-copyright battle.)

For many, this filtering-by-design is a major disappointment, which explains the astounding popularity of the World Wide Web. The Web allows everyone to speak up: to post pictures of their pets and babies, their ideas about government, their Harry Potter fan fiction. But in a room where everyone is shouting, nothing gets heard. Northern Light and Yahoo earned their money offering a way through the noise. Other companies, calling themselves search optimization experts, profit by offering their clients the means to be noticed by these search engines, the hypertext equivalent of megaphones.

The filtering process is missing.

Okay, yes, adding content to search databases is a good thing. I might make fun of sports fanaticism, but the desire to retrieve information of choice is a valuable privilege of the individual. I might not care about the Red Sox, but I respect that there are others who do. I also feel extremely happy for the researchers who now have access to volumes of scientific research. In many ways, adding content to a database is like translating content into new languages. Really, these are not bad things.

Even so, we are only adding to the number of people shouting in a room. Google and Yahoo are definitely improving the scope of what we can find, but they are not improving our ability to find. I might proudly accumulate more and more in my attic, while simultaneously making it harder and harder to retrieve anything.

Here’s a real example. My wife and I had been expecting our first child (15 months ago). We were struggling to decide on her middle name. When we searched for “baby names” on the Web, with quotation marks around the phrase, we found 2.5 million sites at Google, 0.9 million at Yahoo, 0.7 million at MSN. Now, imagine that all the contents of library books and scientific articles and sports broadcasters are added to the Web. Although there may exist a few anthropology articles that would have helped us choose a name, I sincerely believe that over 99% of this new content would have proven unhelpful. I also believe that some of this unhelpful content includes the unusual phrase “baby names.” For example, consider this sentence, which appeared on the Web in November 2004: “Julia Roberts now joins the list of celebrities who have jumped on the Hollywood bandwagon, which gives license to choosing odd baby names.”
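
A quoted-phrase search, after all, checks only that the two words sit side by side; it cannot tell a naming guide from celebrity gossip. Here is a tiny sketch of that kind of matching, with invented snippets (except for the sentence just quoted):

    pages = [
        "Top 100 baby names for boys and girls, with meanings and origins.",
        "Julia Roberts now joins the list of celebrities who have jumped on "
        "the Hollywood bandwagon, which gives license to choosing odd baby names.",
        "Quarterly earnings report from a toy manufacturer.",
    ]

    def phrase_hits(phrase, documents):
        """Return every document containing the exact phrase, relevant or not."""
        phrase = phrase.lower()
        return [doc for doc in documents if phrase in doc.lower()]

    print(len(phrase_hits("baby names", pages)))   # 2 -- the gossip counts too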

After my wife and I decide on a candidate name, we search for that name online, looking for its meaning. The query “Ryan meaning” (without quotes) for the boy’s name Ryan gets 1.2 million hits at Google. (By the way, Google suppresses near-duplicated content and never displays beyond the first 1000 results, so the “true” result set is quite inaccessible.) Because Ryan is a common name, it likely appears numerous times within bibliographies. The word meaning is also extremely common among scholarly articles. If Google indeed adds university libraries to its already large database, 1.2 million will become a very small number. As the scientists benefit from a library search, my wife and I will find it that much harder to learn about a particular name using Google.

It is common knowledge among library scientists and search engine experts that you cannot improve the accuracy of a search at the same time you improve its comprehensiveness. Either you get perfect relevance but miss something useful, or you get everything you want along with content that you don’t. As search engines trend toward larger and larger databases, results pages grow more cluttered.
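
Library scientists call the two sides of that trade-off precision and recall, and the arithmetic is simple enough to show in a few lines. The document sets below are invented; only the definitions matter.

    retrieved = {"doc1", "doc2", "doc3", "doc4", "doc5"}   # what the engine returned
    relevant = {"doc2", "doc5", "doc9"}                    # what would actually help

    hits = retrieved & relevant
    precision = len(hits) / len(retrieved)   # how much of what you got is useful
    recall = len(hits) / len(relevant)       # how much of the useful stuff you got

    print(f"precision = {precision:.0%}, recall = {recall:.0%}")
    # precision = 40%, recall = 67%
    # Growing the database swells "retrieved" much faster than "hits",
    # so precision drops even as comprehensiveness inches upward.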


Please, Sir, May I Have Some Less?

Google’s popularity as a search engine has nothing to do with the numbers of results. When I ask people why they like Google (or whatever search engine they prefer), they answer, “Because what I want is usually within the first few results.” I’ve also gotten the answer, “Because it thinks the way that I do.” Most people don’t want millions of results. They want three. Three quality results.

I can’t remember the last time I heard about a search engine improving its algorithms. Perhaps they do this all the time in secret, inventing features behind the scenes. I do know that if a search engine started regularly serving up nonsense, it would go out of business.

So why are these efforts at improving quality so seldom announced? Did you know that while Google pays attention to quotation marks, Lycos doesn’t? That you can type a zip code into Google to get a map? Many of Google’s best features are published in books like Google Hacks and even Google Maps Hacks (O’Reilly & Associates, 2004 and 2006 respectively), where few people are going to look for them. Either nobody cares, or nobody knows the difference.

But when Google adds sports commentary to its search engine, watch out! The story appears in all the major newspapers.

At times like these, I get rather discouraged. I feel as though I am trying to hold back the ocean. It takes me more than a week to index a single, average book. The information world is growing at such an insane pace, my job seems absurd.

At times like these, I have to remind myself of two perspectives. First, context. When I write the index for a book with 350 pages, it doesn’t matter that the “book of Google” has over 8 billion. Someone decided that these 350 pages needed to be written, and it’s my job to make them accessible. My work improves this book. No, I haven’t changed the world, but I have made a difference within the context of this one book, in a segment of this one industry, to a small set of readers. For me, indexing is like the civic duty of voting: few elections are won by a single vote, and yet every vote counts. It’s also contagious, because voting begets voting. And indexing does beget indexing, because 5% of the people I talk to about my job want to know more, offer me work, or express a desire to become an indexer themselves.

The second perspective is one of application. I don’t have to index books. If I wanted to make a difference at the source, there are many other applications for these skills.

The key to these perspectives is a willingness to become activists. We are environmentalists in an information world. Just as scientists show concern over a one-degree rise in ocean temperature, so should we show concern over a one-percent increase in information dissemination. Bulk up our search engines? This is not an environmentally friendly choice.

I want a search engine—it doesn’t even have to be Google—to announce that it has found a way to help me filter out the pages I couldn’t possibly want. The important word here is announce. The modus operandi of these companies is to bring more shouting people into the room, and then publicize this with pride. No wonder the libraries and publishers are in trouble: they’re not being praised for what they do. Neither are the indexers. In the public media, quality gets a whole lot less attention than quantity.

If the search engine is being improved, they’re not telling anyone. Apparently it’s a secret. I don’t want to hear that someone has added billions of pages to the database, unless I hear also about a system that filters billions of pages away.

I think it’s wonderful that more esoteric content is being added to the database. I applaud search engine companies who continue to improve their algorithms. What drives me crazy is that everyone is talking about the first, but not the second. The publicity is lopsided. Why won’t anyone talk about quality any more?

When we asked search engine companies for more, that’s what we got. And we lost precision. Maybe it’s time for us to ask for less.



20 March 2006

 

Frustrated by a lack of meaning

Not for the first time, search engines have been wrongly criticized for the politics of their results.

As reported by The New York Times today ("Amazon Says Technology, Not Ideology, Skewed Results," March 20, 2006), an abortion-rights organization discovered and reported the appearance of biased results in Amazon's search engine. Apparently books with anti-abortion leanings appeared as more relevant on Amazon's search results pages. I am not taking sides on this highly charged issue; I am taking offense at the ignorance demonstrated by people who don't seem to understand how search works. (And I'm not singling out this issue either, as you'll see from my later examples.)

See, there isn't a search engine on the planet that can cull actual meaning from its databases. They can only look at the words themselves. Even search engines that analyze the behavior of their users still look at words and numbers, without interpretation.

Let me explain what really happened with Amazon, and why Amazon is not automatically in the wrong. Someone went to the search engine and typed in the word abortion. Now imagine that you're the search engine, and you have two results to give back. Result one is a book whose title is simply Abortion. The second is a book whose title is Understanding Abortion. Tell me: which result is more relevant? Answer: you have no clue.

When faced with this impossible question, the search engines at Amazon and elsewhere attempt to apply certain generalizations that might work in other situations, but simply don't work here. For example, there might exist a rule that puts Abortion ahead of Understanding Abortion because the title of the first book matches the query exactly, whereas the second title is only "half right." Or perhaps one of the books is 500 pages long, but the other is 200 pages long, and Amazon favors longer books. Maybe Amazon is interested in selling you the more expensive book, the book more recently published, or the book that gets a higher rating from all the people visiting the website. In the end, however, all of this analysis fails -- completely and utterly fails -- to answer a very simple question: which of these books is against abortion? Heck, even I don't know, and I invented them!
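
To see just how content-blind those tie-breakers are, here is a toy scoring function -- the page counts, ratings, and weights are all invented -- that ranks my two imaginary titles without knowing a thing about what either one says:

    books = [
        {"title": "Abortion", "pages": 200, "rating": 3.9},
        {"title": "Understanding Abortion", "pages": 500, "rating": 4.2},
    ]

    def score(book, query):
        """Score a book using rules that never look at its actual contents."""
        s = 0.0
        if book["title"].lower() == query.lower():
            s += 2.0                     # exact title match beats a partial match
        elif query.lower() in book["title"].lower():
            s += 1.0
        s += book["pages"] / 1000        # a mild preference for longer books
        s += book["rating"] / 10         # and for better-rated ones
        return s

    for book in sorted(books, key=lambda b: score(b, "abortion"), reverse=True):
        print(f'{score(book, "abortion"):.2f}  {book["title"]}')
    # 2.59  Abortion
    # 1.92  Understanding Abortion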

With search, meaning is irrelevant. Search engines can look only quantitatively at the letters of the words, and at innumerable statistics (e.g., number of Web views) that have at best a tangential relationship with meaning.

Before we look at another example, let me also mention something that Amazon did at one time. If you searched for the word abortion, in addition to your results you received what should be interpreted as a helpful search hint: "Did you mean adoption?" This might sound political, but the logic of this lies in the similar spelling of the words adoption and abortion. Given that there are many more books about adoption than abortion at Amazon, the search engine guessed that someone typing the word abortion might have misspelled something; the computer offered what it considered a reasonable alternative. Had that suggested word been something different -- "Did you mean apportion?" -- no one would have cared.
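
Here is a sketch of how such a hint can fall out of nothing more than spelling similarity and catalog size. The counts are invented, and real systems are cleverer about it, but the shape of the logic is the same:

    import difflib

    catalog_counts = {"adoption": 9000, "abortion": 600, "apportion": 40}

    def did_you_mean(query, counts, min_similarity=0.7):
        """Suggest a similarly spelled term that is far more common in the catalog."""
        best = None
        for term, count in counts.items():
            if term == query:
                continue
            similarity = difflib.SequenceMatcher(None, query, term).ratio()
            if similarity >= min_similarity and count > 5 * counts.get(query, 0):
                if best is None or count > counts[best]:
                    best = term
        return best

    print(did_you_mean("abortion", catalog_counts))   # adoption
    # "apportion" is spelled even more similarly, but hardly anything in the
    # catalog uses it, so it never gets suggested.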

By the way, I will admit that it is always possible that a company, like Amazon, could consciously manipulate its search results to accomplish some kind of selfish ends. Lycos puts sponsored links at the top; Yahoo promotes its internal products over those of others; Amazon presents the products of its more lucrative partners over all others. It is not far-fetched to imagine a company exercising editorial control for political or religious purposes, especially in today's age. The problem is that some issues are perceived as so volatile that no one is willing to consider coincidence of language as just that, a coincidence. Language is powerful stuff; spelling women as womyn to avoid the "men" letter subset is a powerful choice, whether you agree or not.

Here's another story, from the late 1980s. A search for the word monkey within a database of clip art provided by Microsoft produced a seemingly offensive result: a picture of African-American children. There was an uproar, and although Microsoft denied that it had done anything intentionally racist, it quickly removed the image from the database. The real problem, however, is that the children in the image were playing on monkey bars. Interestingly, if you stop to think about it, the only racism in this example is caused by the person who performed the search! That's the person who actually connected the word monkey with the children (and not the playground equipment); no one at Microsoft did. In this example, the giant void where meaning should have been was automatically filled in by the searcher, by association and as a reflex.
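
For the curious, the mechanics really are that dumb. A naive keyword search -- this two-image "database" is invented -- matches the letters m-o-n-k-e-y inside the keyword "monkey bars" and returns the picture, meaning-free:

    clip_art = [
        {"file": "playground.gif", "keywords": ["children", "monkey bars", "playground"]},
        {"file": "banana.gif", "keywords": ["monkey", "banana", "zoo"]},
    ]

    def keyword_search(term, images):
        """Return every image whose keyword list contains the term as a substring."""
        term = term.lower()
        return [image["file"] for image in images
                if any(term in keyword.lower() for keyword in image["keywords"])]

    print(keyword_search("monkey", clip_art))
    # ['playground.gif', 'banana.gif'] -- the playground picture matches, too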

Here's another story, from last year. An article (I can't remember where) expressed how a bad critical review of a specific performer appeared more relevant in a Google search than the good reviews -- or the performer's own website. This would be equivalent to searching for me ("Seth Maislin") and getting a top result of "Seth Maislin Has Bad Teeth" instead of this blog, my website, or one of my interviews at O'Reilly & Associates. In this case, Google isn't passing judgment, but it certainly feels like it! Instead, it's looking at how popular that Bad Teeth article might be, or its host (for example, it might be a Wall Street Journal or People Magazine article, periodicals that have readerships thousands of times larger than anything I've ever done), and using that popularity to push the article to the top. It's assuming -- wrongly, in this case -- that people looking for me are less interested in my website than in what Teen People or the WSJ has to say.

Search just doesn't care. If you're looking for meaning, don't ask a search engine.


