18 May 2006
Bias in indexing
I was reading Struck by Lightning: The Curious World of Probabilities earlier this week, in which the author writes about biases in scientific studies. I realized that these same biases occur with indexes and indexers as well, and I wondered whether I could list them all.
(The biggest bias in indexing isn't one of the index at all, but rather the limitations on what the authors write. For example, if a book on art history didn't include information about Vincent van Gogh, I would expect van Gogh to be missing from the index; this absence might be caused by an author bias. However, I am going to focus on biases that affect indexing decisions themselves.)
Inclusion bias. Indexers may show bias by including more entries for subjects that seem more interesting or important to them. For example, I live in Boston, and so I might consider Boston-related topics less trivial (more important) than the average indexer would; consequently, Boston-related content is more likely to make it into my index. I imagine inclusion bias is common in documentation that touches on contentious social issues -- immigration, tobacco legislation, energy policy -- because the drive to communicate one's ideas on these issues is stronger. I also believe that inclusion bias is not entirely subconscious, and that indexers may purposely declare their ideas through asymmetric inclusion. It should be noted, however, that biased inclusion does not necessarily reveal the indexer's opinion on the subject; creating an entry like "death penalty morality" does not show whether the indexer actually supports or opposes capital punishment.
Noninclusion bias. Similar to inclusion bias, indexers might feel that certain mentions in the text are not worth including in the index because of their personal interests or beliefs. Unlike inclusion bias, however, I suspect noninclusion bias does not appear with regard to contentious issues; conflict will be indexed as long as the indexer recognizes that the conflict has value. Instead, an indexer is likely to exclude things that "seem obvious," and rarely are these tidbits of information controversial. For example, an indexer who is very familiar with computers is likely to exclude "obvious computer things," subjectively speaking; you probably won't find "keyboard, definition of" in such a book.
Familiarity (unfamiliarity) bias. When an indexer is particularly interested in or knowledgeable about a subject, the indexer is likely to create more entry points for the same content than another indexer might. For example, an indexer who is familiar with "Rollerblading" might know that Rollerblade is a brand name and that the actual items are called inline skates; this indexer is more likely to include "inline skates" as an entry. Unfamiliarity bias is the opposite: multiple entry points are not provided because the indexer doesn't think of them, or perhaps doesn't know they exist.
Positive value bias. An indexer who has reason to make certain content more accessible for readers to find (and read) is likely to create more entry points for that idea. At the extreme, the indexer will overload access by using multiple categorical and overlapping subtopics, where those subcategories are at a higher granularity than the information itself. For the generic topic of "immigration," for example, an indexer might include categorical entries like "Hispanic immigrants," "European immigrants," and "Asian immigrants," as well as overlapping topics like "Asian immigrants," "Chinese immigrants," and "Taiwanese immigrants," with all of them pointing to "immigration" in general.
There are three types of positive value bias. Personal positive value bias appears when the indexer himself believes that the information is of greater-than-average value. Environment-based positive value bias appears when the indexer is swayed by environmental forces, such as social pressures, political pressures, pre-existing media bias, and so on. Finally, other-based positive value bias appears when the indexer bows to pressures imposed by the author, client, manager, or sales market (i.e., the person paying the indexer for the job). Although it can be argued that this last type of bias is not the indexer's own, strictly speaking the indexer can choose to resist any bias forced upon him. For example, an indexer instructed by a client to "index all the names in this book" might regard that instruction as a kind of market bias and refuse to follow it. In reality, however, most indexers do accept the pressures placed upon them by the work environment, and thus in my opinion take on the responsibility and ethical consequences of this choice.
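To make "overloaded access" concrete, here is a minimal sketch (with hypothetical entries and page numbers, not drawn from any real index) of how positive value bias might look if an index were represented as a simple mapping from entry headings to page locators:

```python
# A toy index as a dict mapping entry headings to page locators.
# Positive value bias shows up as many categorical and overlapping
# headings that all resolve to the same underlying content.
immigration_pages = [112, 113, 118]

index = {
    "immigration": immigration_pages,
    # categorical entries, each duplicating access to the same pages
    "Hispanic immigrants": immigration_pages,
    "European immigrants": immigration_pages,
    "Asian immigrants": immigration_pages,
    # overlapping entries at finer and finer granularity
    "Chinese immigrants": immigration_pages,
    "Taiwanese immigrants": immigration_pages,
}

# Six entry points for one passage -- a rough signal of overloaded access.
entry_points = [h for h, pages in index.items() if pages == immigration_pages]
print(len(entry_points))  # 6
```

An unbiased treatment of the same passage might collapse these into a single "immigration" entry with cross-references.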
Negative value bias. It's possible for an indexer to provide fewer entry points for content that he feels is not of great importance to readers -- the direct opposite of positive value bias -- but the reasons for limiting access to that content are probably not related to the indexer's perceived value of that content for readers. Instead, indexers are likely to limit access to content when there is a significant amount of similar content in the book, such that providing access to all of it would either bulk up the index unnecessarily or waste a great deal of the indexer's time. For example, if an indexer were faced with a 40-page table of computer terms, it's unlikely that each term would be heavily indexed, even if such indexing were possible and even helpful to readers.
For this reason, I believe that there are three kinds of negative value bias: time-based negative value bias, in which the indexer skimps on providing access in an effort to save time; financially motivated negative value bias, in which the indexer skimps on providing access in an effort to earn or save money; and logistical negative value bias, in which the indexer skimps on providing access in response to logistical issues like software limitations, file size requirements, page count requirements, controlled vocabulary limitations, and the like.
Topic combination (lumper's) bias. This bias is exhibited by indexers who tend to combine otherwise dissimilar ideas because they find this "lumping together" of ideas aesthetically pleasing or especially useful. This kind of bias is visible in the ratio between locators (page numbers) and subentries: on average, a lumper's entries are more likely to have multiple locators than multiple subentries. For example, an entry like "death penalty, 35, 65, 95" shows that the indexer believes the content on these three pages is similar enough that subentries are not required or useful. Topics that start with the same words might also be combined under a more general topic (such as merging "school lunches" and "school cafeterias" into "school meals"). It is worth noting that some kinds of audiences or documentation subjects may tend toward topic combination bias; for this reason, it may be difficult to recognize lumper's bias.
Topic separation (splitter's) bias. This bias is exhibited by indexers who are likely to separate otherwise similar ideas because they find this "splitting apart" of ideas to be aesthetically pleasing or especially useful. As with lumper's bias, splitter's bias is represented by the ratio of locators to subentries throughout an index, in that splitters are likely to create more subentries than would other indexers, on average. It is worth noting that some kinds of audiences or documentation subjects may tend toward topic separation bias; for this reason, it may be difficult to recognize splitter's bias.
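The locator-to-subentry ratio that distinguishes lumpers from splitters can be sketched as a rough metric. This is an illustrative toy, not a standard measure; the entry structure and numbers are hypothetical:

```python
# Each entry: (heading, locators, subentries), where each subentry is
# a (subheading, locators) pair. A lumper piles locators onto the main
# heading; a splitter breaks them out into subentries.
lumper_entry = ("death penalty", [35, 65, 95], [])
splitter_entry = ("death penalty", [], [
    ("history of", [35]),
    ("moral arguments", [65]),
    ("state statutes", [95]),
])

def locator_subentry_ratio(entries):
    """Rough lumper/splitter signal: total locators divided by total
    subentries. Higher values suggest lumping; lower values, splitting."""
    locators = sum(len(e[1]) + sum(len(s[1]) for s in e[2]) for e in entries)
    subentries = sum(len(e[2]) for e in entries)
    return locators / subentries if subentries else float("inf")

print(locator_subentry_ratio([lumper_entry]))    # inf (no subentries at all)
print(locator_subentry_ratio([splitter_entry]))  # 1.0 (one locator each)
```

Computed over a whole index, such a ratio could only suggest a tendency; as noted above, the subject matter itself may push the number one way or the other.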
These are all the biases I've found or experienced. If you think there's another kind of bias that indexers exhibit, let me know.
The remaining question is this: Is it wrong for an indexer to have bias? That is, should indexers study their own tendencies and work to avoid them? I don't think it's that simple. The artistry that an indexer can demonstrate is fueled by these biases -- experiences, opinions, backgrounds, interpretations -- and perhaps should even be encouraged. An indexer's strengths come from his understanding of not just the material, but also his perceptions of the audience, the publication environment, and the audience's environments. Further, indexers who know and love certain subjects are going to be drawn to them, just as many readers are; these biases aren't handicaps so much as commonalities shared between indexers and readers. Biases will hurt indexers working on unfamiliar materials in unfamiliar media, but under those conditions the biases are the least of our worries; when the indexer is working without proper knowledge, the higher possibility of bad judgment or error is a much greater concern.
If anything, indexers should be aware of their biases because they can serve as strengths -- especially in comparison to what computers attempt to do.
14 May 2006
Demands for quantity are misplaced
Press releases from Google, like “Google Checks Out Library Books” [December 14, 2004] and “Google Tunes Into TV” [January 25, 2005], hit the media waves in grand style. For the first time, entire libraries of books, from Harvard and Stanford Universities to the Universities of Michigan and Oxford, and soon the New York Public Library, will be available from Google’s website. Within the limits of copyright, the words of entire books can be searched. Information philosophers are all over this story, essentially declaring that Google will become the public library of the next generation, excited about how the very nature of libraries might change, and scratching their heads over how the book publishing industry is going to survive yet another hit in the market.
In the second, Google (as well as Yahoo!) applauds itself for once again providing access to a greater diversity of the world’s information, because television’s closed captioning content has been indexed into a Google Video database. Viewers of public broadcasting and basketball are early adopters, and why not? Finally, all those oh-so-deprived sports consumers can satisfy themselves on more than just the videos, statistics databases, press releases, articles, blogs, commentaries, and (don’t forget) live games themselves. Because now they can search among the announcers’ words.
Now when I search for 76ers, instead of getting 1.61 million hits, I’ll get 1.62 million. Phooey.
Our instinctive reaction is to be impressed. I’m thinking of all those Ph.D. theses gathering dust in the Physics-Optics-Astronomy Library at my alma mater, 150-page books without indexes. I’m thinking about Red Sox fans who, for the first time in a very long time, are interested in the World Series.
But I’m also thinking about the catalog search system at my public library, which won’t improve with Google’s additions. For a book already in the catalog, adding its content doesn’t help at all. Instead, we’d be cluttering up the database with a few trillion new words.
So our instincts are wrong.
Why Quantity Hurts
Every few years, search engine companies find new ways to promote themselves by bragging about how much they can find. In the late 1990s, Northern Light was independently rated top among its competitors because it searched the largest percentage of the World Wide Web: sixteen percent. Time and Newsweek contributors warned, “We’re not finding everything.”
Soon afterward, articles about the “invisible web” appeared in popular magazines and newspapers, explaining how search engines cataloged only text, image, and sound files, thus skipping over the good content stored in spreadsheets, databases, and fonts. Again came the cry, “We’re not finding everything!”
And now, in just two months, Google and Yahoo have added more to the huge pile of information: library books and closed captioning data. No longer are our searches limited to the billions of files already on the Web. Yay!
It’s all about quantity. Nobody seems to care about quality any more.
Libraries have been struggling to redefine themselves ever since the Internet (and more, the World Wide Web) reached people’s homes. Book publishers also have suffered. The failure isn’t that of these institutions and industries, however, but of the public. The public seems unaware of the natural filtering process inherent in human behavior. Publishers choose which titles to publish, and libraries choose which titles to add to their catalogs. You might not agree with their reasoning or results, but you do have to admit that there are human beings at the helm.
(By the way, Google's library program was put on hold because of criticism. This isn't because it's a bad idea or anything, but rather because the traditional publishing industries got scared they'd lose money. It's an I-was-here-first money-by-copyright battle.)
For many, this filtering-by-design is a major disappointment, which explains the astounding popularity of the World Wide Web. The Web allows everyone to speak up: to post pictures of their pets and babies, their ideas about government, their Harry Potter fan fiction. But in a room where everyone is shouting, nothing gets heard. Northern Light and Yahoo earned their money offering a way through the noise. Other companies, calling themselves search optimization experts, profit by offering their clients the means to be noticed by these search engines, the hypertext equivalent of megaphones.
The filtering process is missing.
Okay, yes, adding content to search databases is a good thing. I might make fun of sports fanaticism, but the desire to retrieve information of one’s choice is a valuable privilege of the individual. I might not care about the Red Sox, but I respect that there are others who do. I also feel extremely happy for the researchers who now have access to volumes of scientific research. In many ways, adding content to a database is like translating content into new languages. Really, these are not bad things.
Even so, we are only adding to the number of people shouting in a room. Google and Yahoo are definitely improving the scope of what we can find, but they are not improving our ability to find. I might proudly accumulate more and more in my attic, while simultaneously making it harder and harder to retrieve anything.
Here’s a real example. Fifteen months ago, my wife and I were expecting our first child, and we were struggling to decide on her middle name. When we searched for “baby names” on the Web, with quotation marks around the phrase, we found 2.5 million sites at Google, 0.9 million at Yahoo, and 0.7 million at MSN. Now, imagine that all the contents of library books and scientific articles and sports broadcasts are added to the Web. Although there may exist a few anthropology articles that would have helped us choose a name, I sincerely believe that over 99% of this new content would have proven unhelpful. I also believe that some of this unhelpful content includes the exact phrase “baby names.” For example, consider this sentence, which appeared on the Web in November 2004: “Julia Roberts now joins the list of celebrities who have jumped on the Hollywood bandwagon, which gives license to choosing odd baby names.”
After my wife and I decided on a candidate name, we searched for that name online, looking for its meaning. The query “Ryan meaning” (without quotes) for the boy’s name Ryan gets 1.2 million hits at Google. (By the way, Google suppresses near-duplicate content and never displays results beyond the first 1000, so the “true” result set is quite inaccessible.) Because Ryan is a common name, it likely appears numerous times within bibliographies. The word meaning is also extremely common in scholarly articles. If Google indeed adds university libraries to its already large database, 1.2 million will come to look like a small number. As the scientists benefit from a library search, my wife and I will find it that much harder to learn about a particular name using Google.
It is common knowledge among library scientists and search engine experts that you cannot improve the precision of a search at the same time you improve its comprehensiveness. Either you get perfect relevance but miss something useful, or you get everything you want along with content that you don’t. As search engines trend toward larger and larger databases, results pages grow more cluttered.
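This trade-off is what information-retrieval people measure as precision (how much of what was retrieved is relevant) versus recall (how much of what is relevant was retrieved). A minimal sketch with made-up documents and known relevance judgments:

```python
# Precision: fraction of retrieved results that are relevant.
# Recall: fraction of relevant documents that were retrieved.
relevant = {"doc1", "doc2", "doc3"}

def precision_recall(retrieved, relevant):
    hits = len(retrieved & relevant)
    return hits / len(retrieved), hits / len(relevant)

# A narrow search: everything returned is relevant, but one doc is missed.
print(precision_recall({"doc1", "doc2"}, relevant))  # (1.0, ~0.67)

# A broad search over a bigger database: nothing relevant is missed,
# but the results are diluted with noise.
print(precision_recall({"doc1", "doc2", "doc3", "spam1", "spam2"}, relevant))
# (0.6, 1.0)
```

Growing the database without improving ranking is, in these terms, a bet on recall at the expense of precision.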
Please, Sir, May I Have Some Less?
Google’s popularity as a search engine has nothing to do with the numbers of results. When I ask people why they like Google (or whatever search engine they prefer), they answer, “Because what I want is usually within the first few results.” I’ve also gotten the answer, “Because it thinks the way that I do.” Most people don’t want millions of results. They want three. Three quality results.
I can’t remember the last time I heard about a search engine improving its algorithms. Perhaps they do this all the time in secret, inventing features behind the scenes. I do know that if a search engine started regularly serving up nonsense, it would go out of business.
So why are these quality improvements so little publicized? Did you know that while Google pays attention to quotation marks, Lycos doesn’t? Or that you can type a zip code into Google to get a map? Many of Google’s best features are documented in books like Google Hacks and even Google Maps Hacks (O’Reilly & Associates, 2004 and 2006 respectively), where few people are going to look for them. Either nobody cares, or nobody knows the difference.
But when Google adds sports commentary to its search engine, watch out! The story appears in all the major newspapers.
At times like these, I get rather discouraged. I feel as though I am trying to hold back the ocean. It takes me more than a week to index a single, average book. The information world is growing at such an insane pace, my job seems absurd.
At times like these, I have to remind myself of two perspectives. The first is context. When I write the index for a 350-page book, it doesn’t matter that the “book of Google” has over 8 billion pages. Someone decided that these 350 pages needed to be written, and it’s my job to make them accessible. My work improves this book. No, I haven’t changed the world, but I have made a difference within the context of this one book, in a segment of this one industry, to a small set of readers. For me, indexing is like the civic duty of voting: few elections are decided by a single vote, and yet every vote counts. It’s also contagious, because voting begets voting. And indexing does beget indexing, because 5% of the people I talk to about my job want to know more, offer me work, or express a desire to become indexers themselves.
The second perspective is one of application. I don’t have to index books. If I wanted to make a difference at the source, there are many other applications for my skills.
The key to these perspectives is a willingness to become activists. We are environmentalists in an information world. Just as scientists show concern over a one-degree rise in ocean temperature, so should we show concern over a one-percent increase in information dissemination. Bulk up our search engines? That is not an environmentally friendly choice.
I want a search engine—it doesn’t even have to be Google—to announce that they’ve found a way to help me filter out the pages I couldn’t possibly want. The important word here is announce. The modus operandi of these companies is to bring more shouting people into the room, and then publicize this with pride. No wonder the libraries and publishers are in trouble: they’re not being praised for what they do. Neither are the indexers. In the public media, quality gets a whole lot less attention than quantity.
If the search engines are being improved, they’re not telling anyone. Apparently it’s a secret. I don’t want to hear that someone has added billions of pages to the database unless I also hear about a system that filters billions of pages away.
I think it’s wonderful that more esoteric content is being added to the database. I applaud search engine companies who continue to improve their algorithms. What drives me crazy is that everyone is talking about the first, but not the second. The publicity is lopsided. Why won’t anyone talk about quality any more?
When we asked search engine companies for more, that’s what we got. And we lost precision. Maybe it’s time for us to ask for less.