05 March 2007
Interpretation, not computation
> I'm amazed at the poor development of the science of indexing for printed matter such as books.
I wrote back, "You misunderstand!"
The science of indexing is quite broad, given that it has a history in long-ago library science. What seems undeveloped in this case are the tools, but that's a misunderstanding of what indexing is. Indexing is an editorial field, not an automatic one. You might say it's a lot like writing, in that the writer must decide what their readers want to read, and then the writer must communicate those ideas in an organized and approachable way. Indexing is the same: analysis of text to discover what readers might find interesting, and then multiply labeling and organizing those ideas so people can find them.
Computers will never be able to write indexes because they can't (a) interpret the importance of a concept, (b) understand concepts rather than mere words, and (c) connect ideas in contextually relevant ways. As much as I admire the Google.com search engine for what it can do, once again I will demonstrate what it can't do. Google finds 10,000,000 things when we really only want 3 (or 10 or 20). It finds what we type, but it doesn't find synonyms. And there's no guarantee that Google is searching everything that's out there, though it appears to come close; in book indexing, however, there's a human to make sure every page was considered.
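To make point (b) concrete, here is a toy, concordance-style "auto-indexer" in Python. Everything in it (the page texts, the page numbers, the function itself) is invented for illustration; it shows how a word-matching tool finds strings but not concepts:

```python
from collections import defaultdict

def concordance_index(pages):
    """Map each word to the set of pages it literally appears on."""
    index = defaultdict(set)
    for page_num, text in pages.items():
        for word in text.lower().split():
            index[word.strip(".,!?")].add(page_num)
    return index

# Hypothetical page contents
pages = {
    12: "The canine was loyal.",
    47: "Dogs make wonderful companions.",
}
idx = concordance_index(pages)

assert idx["dogs"] == {47}    # the literal word is found...
assert "dog" not in idx       # ...but the tool can't even connect "dog",
assert idx["canine"] == {12}  # let alone see that "canine" on page 12
                              # is the same concept as "Dogs" on page 47
```

A human indexer would merge these under a single heading with cross-references; no amount of string matching gets you there.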
How often has Microsoft Word attempted to auto-correct you in a completely inaccurate way? Spell-check? Auto-format? Auto-complete? Half-intelligent humans don't make the kinds of mistakes that these tools do.
Here's what I wish he had written:
> I'm amazed that people who know full well that computers could never write newspaper articles still believe computers can write indexes.
Another problem, of course, is that indexes aren't respected in the industry. The reason Microsoft Word even has an automatic indexing feature is that the people who wrote that software have no idea of the damage such a tool causes. That Word's {XE} functionality is so miserable is even further proof. There's a nasty cycle: people use inferior tools, quality indexing grows less likely, and inferior tools become the standard.
Indexing is an editorial process, just like writing and editing. Indexing requires interpretation, not computation.
Computers will not and should not be used as indexers. If my job ever dies because computer programmers have found a way to make me obsolete, at least I know I'll be in the enlightening company of human writers and artists.
Labels: Google, human factors, indexing process, Microsoft Word indexing
28 December 2006
Eighteen million people can't be wrong
No matter how much you and I might like Google, the fact is that Google has some very serious problems when it comes to finding content. More specifically, if you're looking for the "right" answer, or if you're attempting to do any serious research, Google is likely to fail you miserably.
The flaw lies in Google's strength: social algorithms. Social algorithms are processes in which decisions are made by watching and following the majority of people in a community. If blogger.com tends to be the place people go to create blogs, then a social algorithm will see blogger.com as "better." When a search engine is managed by a social algorithm, a website might appear first in search results not because of the quality of the site, but rather because a larger number of people treated the site as if it were of higher quality. In other words, social algorithms equate "majority" with "best," something that often looks right but actually is patently untrue.
When you perform a search at Google.com, your results are sorted based on majority behavior and little else. For simple questions about anything -- as well as complex questions about cultural issues, for which "lots of people" is critical -- frequently the majority opinion is rather close to what you want -- which is why Google is so successful. But the gap between "close to what you want" and "accurate" is an invisible one, and that makes it insidious and dangerous.
For example, search for "Seth Maislin." The first hit is my website. The second hit is this blog. The third hit is an interview I did for O'Reilly & Associates in July 1999. An investigation of why these are the top three sites is rather interesting. First of all, these are the only results in which my name actually appears in the title; the fourth link and beyond have my name in the document, but not the title. Second, my website appears at the top not because it's the definitive website about "Seth Maislin," but because Google knows of 24 people linking to it. In comparison, the only person who ever created a link to this blog is me -- far fewer than 24! The same goes for the O'Reilly interview, except that the single linker isn't even a valid site any more: it's broken. The popularity of my home page (in comparison to this blog, for example) is why it's a better hit for my name. But if you folks out there started to actually link to this blog, that would change.
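The ranking behavior described above can be sketched in a few lines of Python. This is a deliberate oversimplification (Google's actual algorithm weighs links rather than merely counting them), and the page names and link sets below are invented to mirror the example:

```python
# Each page maps to the set of sites known to link to it.
inbound = {
    "homepage":  {"siteA", "siteB", "siteC"},  # well linked-to
    "blog":      {"homepage"},                 # only one linker
    "interview": set(),                        # its one linker went dead
}

def rank_by_popularity(inbound):
    """Sort pages by inbound-link count, most-linked first."""
    return sorted(inbound, key=lambda page: len(inbound[page]), reverse=True)

assert rank_by_popularity(inbound) == ["homepage", "blog", "interview"]
```

The sort says nothing about which page is actually the best source; it only reports which one the crowd pointed at.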
You should look into the search results for the word "Jew." A website known as JewWatch.com, an offensive and inflammatory collection of antisemitic content, had appeared as the number-one result at Google.com for this one-word query. This happened because a large number of supporters of this site tended to build links to it; then, those who were outraged or amused also linked to it within their protestations. In the end, the social algorithms at Google recognized how popular (i.e., "linked to") this site was, and in response rated it very highly -- in fact, rated it first -- compared to all other websites with the word "Jew" in the title. Eventually, those who were enraged by this content fought back by asking as many people as possible to link somewhere else -- specifically, the Wikipedia definition of Jew -- just as I have here. Over time, more people linked to Wikipedia than to JewWatch, and so the latter dropped into second place at Google. This process of building networks of links in order to influence Google's social algorithm is called "Google bombing." In other words, when the people who hated the site acted together in a large group, Google's social algorithms responded.
(By the way, you'll notice that I do not create a link to the offensive site. I see no reason to contribute to its success.)
Do you see the problem? The success of Google bombing is analogous to the squeaky wheel metaphor, that the loudest complainer gets the best service. Social algorithms reward the most popular, regardless of whether they deserve it. JewWatch made it to the top because it was popular first; Wikipedia's definition moved to the top because those offended banded together to demonstrate even more loudly. And in the end, there's no reason for me to think either of these links is best.
Whether popularity is a good thing or a bad thing is often subjective. In language, some people lament the existence of the word ain't, while others applaud its existence as an inevitable sign of change; either way, the word is showing up in our dictionaries because more and more people are using it. But I'm not talking about language; I'm talking about truth.
Do you think vitamin C is good at preventing colds? Well, it isn't; no studies demonstrate its effectiveness, and there have been studies showing it makes no real difference. (It's believed that vitamin C will shorten the length of a cold, but studies are still inconclusive.) But after a doctor popularized the idea of vitamin megadosing, our entire culture came to believe that taking the vitamin will keep you extra healthy. Untrue.
Do you know why "ham and eggs" is considered a typical American breakfast? Because an advertising executive in the pork industry used Freudian psychology to convince people to eat ham for breakfast. He did it by asking American doctors if they thought hearty breakfasts were a good thing (which they did); the ad-man then asked if ham were a hearty food. Voila: ham, sausage, and bacon are American breakfast staples, and the continental breakfast vanished from our culture.
In both of these examples, majority belief trumps the truth. And look at the arguments about global warming! I won't repeat the arguments laid out by Al Gore in An Inconvenient Truth, but his argument is that as long as enough people insist that global warming isn't true, its dangers will remain unheeded. In fact, I'm not even going to argue here whether global warming is a real thing or not; it doesn't matter what I believe. What matters is that the debate over global warming isn't a fight over the facts. Instead, it's a shouting match, in which the majority wins. Right now, so many influential people have argued that it doesn't exist (or isn't such a big deal) that very little has been done in this country in response to its possible existence. But as more and more people start to believe it's at least possible, it's becoming a reality. Doesn't that just drive you nuts? Why are the facts behind global warming driven by democracy? Can't something be true even if no one believes in it?
One last look at this "majority rules" concept, only this time let's avoid politics and focus on simple word spelling. If you search for the word millennium, correctly spelled with two Ls and two Ns, you'll get about 54 million hits at Google (English-language pages only). If you search for the word millenium, misspelled with two Ls and only one N, you'll get 18 million hits. Twenty-five percent of all websites have this misspelling in them! For content that's published, and that by its very nature is biased toward having only correct spellings, this error rate is monstrous! But does Google let you know that millenium is misspelled? Does it ask whether you "meant to type millennium"? No! After all, Google considers the misspelled word correct.
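For what it's worth, a "did you mean?" feature is not hard to build. Here's a minimal sketch using Python's standard-library difflib to match a query against a word list; the word list and cutoff are invented for illustration, and a real suggester would work very differently (ironically, often by watching what the majority types):

```python
import difflib

DICTIONARY = ["millennium", "millennia", "millenarian"]

def did_you_mean(word, dictionary=DICTIONARY, cutoff=0.8):
    """Return the closest dictionary word above a similarity cutoff, if any."""
    matches = difflib.get_close_matches(word, dictionary, n=1, cutoff=cutoff)
    return matches[0] if matches else None

assert did_you_mean("millenium") == "millennium"  # catches the missing N
assert did_you_mean("xyz") is None                # no plausible match
```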
I mean, eighteen million people can't be wrong, right?
Labels: Google, misspellings and other errors, search engines, social algorithms, spamming and similar behaviors, web indexing
27 August 2006
Information is owned by the few
Then came the middlemen, resellers like Sears, who discovered that if you brought a number of competing products into the same show room, customers came to that show room to make an educated decision. No longer convinced to buy from one manufacturer, you could shop among several models. This was how resellers made their money: providing you a service you'd pay extra money for. Products and manufacturers that failed to compete well in side-by-side arrangements were abolished in the face of consumer choice.
And finally came the Internet. The World Wide Web provided you with not only all the same information the resellers had, but much more: professional and amateur reviews, community-level and industry-specific emails filled with recommendations and warnings, and manufacturers' contact information in case you had questions. Now you could shop intelligently around the world. Much of the resale industry was demolished, now that its services paled in comparison to what consumers could do themselves. Look at the fate of independent bookstores, which all but vanished in a wired world where consumers read reviews and compare prices among Amazon.com, BN.com, and Borders.com, only to buy the book from an Internet-based reseller with massively discounted prices. Travel agents, too, disappeared in the face of Expedia.com and Travelocity.com.
It is believed, therefore, that the Internet has empowered the individual.
Not true. I'm sorry to say that it's all an illusion.
First of all, the online resellers are no better than the brick-and-mortar resellers. After browsing the options available at an online travel agency, it's often cheaper to then go to the airline site itself to buy your tickets. For example, if I want to fly from Boston to San Francisco, I'll plug my dates into a search engine at Expedia (and later, Travelocity), find the cheapest option at the best times, and then buy my ticket at Delta.com, Southwest.com, or another airline, or perhaps call a human travel agent after all at AAA and start again. As long as I have access to the source of that service or product -- the manufacturer, the service provider, etc. -- the reseller is a source of information without sale.
Second, the online resellers are limited in scope. Thanks to partnerships and other marketing choices, not all of my options are provided. For example, both Expedia and Travelocity tend to overlook small, unaffiliated airlines. Additionally, at one time (and perhaps still today) Expedia charged extra money if I wanted to buy a ticket for USAir flights, without telling me. The bottom line is that going through an online reseller is not necessarily more comprehensive or cheaper than my other options.
Biggest of all, however, is that for me to perform ANY search these days, I'm going to have to use a search engine, like Google.
Without even getting into problems with spam, search engines are responsible for providing me with the information I'll need to do anything on the Web, if I don't already know precisely how and with whom to do it myself. Google is the next Sears. If I wanted to find some good choices for a boy's name, Google will provide me with so many choices that I'll inevitably stop after the first twenty (and more likely, stop after three). Google is filtering my search, valuing some choices above others just as my supermarket creates end-of-aisle displays to sell me things. The only difference is that I know the supermarket makes money from the sale. With search engines, you have no way of guaranteeing you're not clicking on a link the search engine company prefers.
Consider the unscrupulous used car salesman. Let's step through the process.
- I approach the salesman asking a simple question: "I want a reliable automobile for a good price."
- The salesman immediately points out a few models. The first one he shows me is way too expensive. The second one is terrible. In comparison, the third one he shows me seems wonderful at first glance, but then I ask more questions.
- The salesman doesn't give me precisely the information I want. Some of his answers sound ridiculous. He's reluctant to show me any more cars. But when I keep pushing, he finally gives in and shows me a fourth car, without much enthusiasm.
- Finally, I ask for specific kinds of cars, things I've heard rumors about. "What about a Toyota Sienna? Is there a good Ford minivan?" The salesman is completely unhelpful. Clearly this was a terrible place to come shopping. Maybe I'll visit some dealers, or talk to my neighbor.
Let's compare this to a Google search for boys' names. I choose Google here because it's currently a very popular search engine that, people seem to believe, does an honest job in helping people search both online and offline content.
- I start with a simple request: "I want to find a good boy's name." My query is "boys names."
- Google gives me some immediate results. Some of them are immediately terrible and can be skipped over, but it doesn't take long to find something promising. I visit the website and, although it looks like what I want might be there, I have a hard time using it. I decide to give up and return to Google and its search result list.
- I try a second website, but I've lost confidence. Maybe it's not Google's fault in any obvious way, but none of these websites is helping me in the way I want to be helped.
- I decide to try some new queries. Maybe "boy names"? Do I need an apostrophe? Or perhaps, because I'm interested in a boy's name that isn't too ethnically different from the names I know in the United States, I should try a search like "American boy names." Unfortunately, my search choices are even worse. I give up. The Web is a terrible place to search for boy names. I'll try the bookstore.
You see? No practical difference.
You might think this exercise was a bit silly, but I'm not wrong. The people, companies, or machines that control what you want are the same entities that control the process. The car salesman controls which cars you buy; even if you trust him, the process is his, not yours. He's just nice about it. The same is true with Google. Sure, we all tend to trust Google -- and what's not to trust or like? -- but we do not own the information-seeking process. Google owns it. Here's why:
- Not only doesn't Google find everything, it doesn't tell you there are things missing. The Google database isn't as up-to-date as the Web. Your search words don't match every relevant result in every relevant language. Sure, it looks as if there are 2,600,000 hits for your search, but that doesn't mean it found everything. What's more, you can't even see all 2,600,000 hits if you wanted to! Google shuts you out after only a few hundred.
- Google doesn't explain what it's doing, or why. The search algorithm is never explained; it's a trade secret. We know what kinds of ingredients go into the mix, but we don't know the precise details. And although sponsored links appear separate from search results -- something not all search engines do -- we have no certainty that other sponsorships aren't happening in there.
- If Google is biased, we have no way of knowing. I guarantee Google is biased, because its algorithm is based on how people use the Web. Google News collects stories more often from the AP Wire than the Boston Globe, and more often from the Globe than the Arlington Tab. That's well-intentioned bias. There are less favorable biases, too, like social biases. Because there are fewer computer users who are poor or homeless, the websites of interest to these people never show up at the tops of lists. Because the Google default language in the United States is English, U.S.-based news articles are far favored over newspapers in other countries, even when the news takes place in those countries. And because most people have heard of large companies like Amazon.com, smaller companies like independent booksellers are pushed into obscurity. There are also language-based biases. It's easier to find websites related to money because this word is both singular and plural, whereas finance has a separate plural form. It's easier to search for words like mistress and misogyny, which exist, than for their nonexistent gender-opposite versions. And it's nearly impossible to find a company that sells windows, because your search results will be overwhelmed by companies that sell [Microsoft] Windows.
But we don't have a choice. There is too much information in the world. We must go through an information repackager if we're not going to do the work ourselves. (Librarians do the work themselves; the results are of excellent quality, of limited quantity, and of almost negligible relevance for our day-to-day needs of airline tickets and boys' names. Libraries have some excellent information with which we can arm ourselves -- like using Consumer Reports to choose a quality used car -- but in general we still have to take the final steps on our own.)
Regardless of their motives, search engines OWN the information access. Maybe that's good enough. Maybe you're comfortable performing your searches in ignorance of the engine's inner workings, generally satisfied with the results most of the time. But please, that doesn't make it a good thing. What if Google started charging you for some of your searches? What if Google integrated its sponsored links into the search engine (as other engines did or do)?
Here's a real-life, immediate example. Search for Pluto. There has been a ton of recent press regarding Pluto's demotion as a planet in our solar system. Where is all that news in the search results page? There's just a tiny news area that most people won't see because it looks different, and then there's a bunch of sponsored links. This is a branding decision; Google thinks "news" and "sites" are very different things and doesn't even combine their results.
Don't kid yourself. The power of the Internet has moved, but not to you.
Labels: content management, Google, power of information, search engines
14 May 2006
Demands for quantity are misplaced
Press releases from Google, like “Google Checks Out Library Books” [December 14, 2004] and “Google Tunes Into TV” [January 25, 2005], hit the media waves in a grand style. For the first time, entire libraries of books, from Harvard and Stanford Universities to the Universities of Michigan and Oxford, and soon the New York City Public Library, will be available from Google’s website. Within limits of copyright, the words of entire books can be searched. Information philosophers are all over this story, essentially declaring that Google will become the public library of the next generation, excited about how the very nature of libraries might change, and scratching their heads over how the book publishing industry is going to survive yet another hit in the market.
In the second, Google (as well as Yahoo!) applauds itself for once again providing access to a greater diversity of the world's information, because television's closed captioning content has been indexed into a Google Video database. Viewers of public broadcasting and basketball are early adopters, and why not? Finally, all those oh-so-deprived sports consumers can satisfy themselves on more than just the videos, statistics databases, press releases, articles, blogs, commentaries, and (don't forget) live games themselves. Because now they can search among the announcers' words.
Now when I search for 76ers, instead of getting 1.61 million hits, I’ll get 1.62 million. Phooey.
Our instinctive reaction is to be impressed. I’m thinking of all those Ph.D. theses gathering dust in the Physics-Optics-Astronomy Library at my alma mater, 150-page books without indexes. I’m thinking about Red Sox fans who, for the first time in a very long time, are interested in the World Series.
But I’m also thinking about the catalog search system at my public library, which won’t improve with Google’s additions. For a book already in the catalog, adding its content doesn’t help at all. Instead, we’d be cluttering up the database with a few trillion new words.
So our instincts are wrong.
Why Quantity Hurts
Every few years, search engine companies find new ways to promote themselves by bragging about how much they can find. In the early 1990s, Northern Light was independently rated top among competitors because it searched the largest percentage of the World Wide Web: sixteen percent. Time and Newsweek contributors warned, "We're not finding everything."
In the late 1990s, articles about the “invisible web” appeared in popular magazines and newspapers, explaining how search engines cataloged only text, image, and sound files, thus skipping over the good content stored as spreadsheets, databases, and fonts. Again came the cry, “We’re not finding everything!”
And now, in just two months, Google and Yahoo have added more to the huge pile of information: library books and closed captioning data. No longer are our searches limited to the billions of files already on the Web. Yay!
It’s all about quantity. Nobody seems to care about quality any more.
Libraries have been struggling to redefine themselves ever since the Internet (and more, the World Wide Web) reached people’s homes. Book publishers also have suffered. The failure isn’t that of these institutions and industries, however, but of the public. The public seems unaware of the natural filtering process inherent in human behavior. Publishers choose which titles to publish, and libraries choose which titles to add to their catalogs. You might not agree with their reasoning or results, but you do have to admit that there are human beings at the helm.
(By the way, Google's library program was put on hold because of criticism. This isn't because it's a bad idea or anything, but rather because the traditional publishing industries got scared they'd lose money. It's an I-was-here-first money-by-copyright battle.)
For many, this filtering-by-design is a major disappointment, which explains the astounding popularity of the World Wide Web. The Web allows everyone to speak up: to post pictures of their pets and babies, their ideas about government, their Harry Potter fan fiction. But in a room where everyone is shouting, nothing gets heard. Northern Light and Yahoo earned their money offering a way through the noise. Other companies, calling themselves search optimization experts, profit by offering their clients the means to be noticed by these search engines, the hypertext equivalent of megaphones.
The filtering process is missing.
Okay, yes, adding content to search databases is a good thing. I might make fun of sport fanaticism, but the desire to retrieve information of choice is a valuable privilege of the individual. I might not care about the Red Sox, but I respect that there are others who do. I also feel extremely happy for the researchers who now have access to volumes of scientific research. In many ways, adding content to a database is like translating content into new languages. Really, these are not bad things.
Even so, we are only adding to the number of people shouting in a room. Google and Yahoo are definitely improving the scope of what we can find, but they are not improving our ability to find. I might proudly accumulate more and more in my attic, while simultaneously making it harder and harder to retrieve anything.
Here’s a real example. My wife and I were expecting our first child (15 months ago now). We were struggling in our decision of her middle name. When we searched for “baby names” on the Web, with quotation marks around the phrase, we found 2.5 million sites at Google, 0.9 million at Yahoo, 0.7 million at MSN. Now, imagine that all the contents of library books and scientific articles and sports broadcasts are added to the Web. Although there may exist a few anthropology articles that would have helped us choose a name, I sincerely believe that over 99% of this new content would have proven unhelpful. I also believe that some of this unhelpful content includes the unusual phrase “baby names.” For example, consider this sentence, which appeared on the Web in November 2004: “Julia Roberts now joins the list of celebrities who have jumped on the Hollywood bandwagon, which gives license to choosing odd baby names.”
After my wife and I decided on a candidate name, we searched for that name online, looking for its meaning. The query “Ryan meaning” (without quotes) for the boy’s name Ryan gets 1.2 million hits at Google. (By the way, Google suppresses near-duplicated content and never displays beyond the first 1000 results, so the “true” result set is quite inaccessible.) Because Ryan is a common name, it likely appears numerous times within bibliographies. The word meaning is also extremely common among scholarly articles. If Google indeed adds university libraries to its already large database, 1.2 million will become a very small number. As the scientists benefit from a library search, my wife and I will find it that much harder to learn about a particular name using Google.
It is common knowledge among library scientists and search engine experts that you cannot improve the accuracy of a search at the same time you improve its comprehensiveness. Either you get perfect relevance but miss something useful, or you get everything you want along with content that you don’t. As search engines trend toward larger and larger databases, results pages grow more cluttered.
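That tradeoff has standard names: precision (how much of what you got is relevant) and recall (how much of what's relevant you got). Here's a toy sketch, with invented document sets, of why you can't maximize both at once:

```python
relevant = {"doc1", "doc2", "doc3"}  # what the searcher actually wants

narrow = {"doc1", "doc2"}                                  # precise, incomplete
broad  = {"doc1", "doc2", "doc3", "doc4", "doc5", "doc6"}  # complete, cluttered

def precision(results, relevant):
    """Fraction of returned results that are relevant."""
    return len(results & relevant) / len(results)

def recall(results, relevant):
    """Fraction of relevant documents that were returned."""
    return len(results & relevant) / len(relevant)

assert precision(narrow, relevant) == 1.0  # nothing irrelevant...
assert recall(narrow, relevant) < 1.0      # ...but something useful is missed
assert recall(broad, relevant) == 1.0      # everything found...
assert precision(broad, relevant) == 0.5   # ...buried in clutter
```

Growing the database pushes every query toward the "broad" case.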
Please, Sir, May I Have Some Less?
Google’s popularity as a search engine has nothing to do with the numbers of results. When I ask people why they like Google (or whatever search engine they prefer), they answer, “Because what I want is usually within the first few results.” I’ve also gotten the answer, “Because it thinks the way that I do.” Most people don’t want millions of results. They want three. Three quality results.
I can’t remember the last time I heard about a search engine improving its algorithms. Perhaps they do this all the time in secret, inventing features behind the scenes. I do know that if a search engine started regularly serving up nonsense, it would go out of business.
So why are these efforts at improving quality so unheralded? Did you know that while Google pays attention to quotation marks, Lycos doesn’t? That you can type a zip code into Google to get a map? Many of Google’s best features are published in books like Google Hacks and even Google Maps Hacks (O’Reilly & Associates, 2004 and 2006 respectively), where few people are going to look for them. Either nobody cares, or nobody knows the difference.
But when Google adds sports commentary to its search engine, watch out! The story appears in all the major newspapers.
At times like these, I get rather discouraged. I feel as though I am trying to hold back the ocean. It takes me more than a week to index a single, average book. The information world is growing at such an insane pace, my job seems absurd.
At times like these, I have to remind myself of two perspectives. First, context. When I write the index for a book with 350 pages, it doesn’t matter that the “book of Google” has over 8 billion. Someone decided that these 350 pages needed to be written, and it’s my job to make them accessible. My work improves this book. No, I haven’t changed the world, but I have made a difference within the context of this one book, in a segment of this one industry, to a small set of readers. For me, indexing is like the civic duty of voting: few elections are won by a single vote, and yet every vote counts. It’s also contagious, because voting begets voting. And indexing does beget indexing, because 5% of the people I talk to about my job want to know more, offer me work, or express a desire to become an indexer themselves.
The second perspective is one of application. I don’t have to index books. If I wanted to make a difference at the source, there are many other applications for these skills.
The key to these perspectives is a willingness to become activists. We are environmentalists in an information world. Just as scientists show concern over a one-degree rise in ocean temperature, so should we show concern over a one-percent increase in information dissemination. Bulk up our search engines? This is not an environmentally friendly choice.
I want a search engine—it doesn’t even have to be Google—to announce that they’ve found a way to help me filter out the pages I couldn’t possibly want. The important word here is announce. The modus operandi of these companies is to bring more shouting people into the room, and then publicize this with pride. No wonder the libraries and publishers are in trouble: they’re not being praised for what they do. Neither are the indexers. In the public media, quality gets a whole lot less attention than quantity.
If the search engine is being improved, they’re not telling anyone. Apparently it’s a secret. I don’t want to hear that someone has added billions of pages to the database, unless I hear also about a system that filters billions of pages away.
I think it’s wonderful that more esoteric content is being added to the database. I applaud search engine companies who continue to improve their algorithms. What drives me crazy is that everyone is talking about the first, but not the second. The publicity is lopsided. Why won’t anyone talk about quality any more?
When we asked search engine companies for more, that’s what we got. And we lost precision. Maybe it’s time for us to ask for less.
Labels: books, Google, power of information, search engines