25 March 2007

 

Indexers indexing infinitely ... like monkeys

Three ideas have merged.

First, there's the idea I published last December as "A needle in a haystack with 100,000,000 blades," where I argued that the Web, or an approximation thereof, could be indexed by humans for a reasonable amount of money.

Second, there's The New York Times article "Artificial Intelligence, With Help From the Humans," in which we learn that the Amazon Mechanical Turk service subcontracts human workers to perform tasks that are especially challenging for computers to accomplish, such as matching images to textual descriptions. For some jobs, Turkworkers might make one penny per transaction.

And finally, there's the infinite monkey theorem, which states that a monkey hitting keys at a typewriter for an infinite amount of time will "almost surely" type the complete works of Shakespeare, or something similar. I first heard this idea as "a million monkeys and a million years," but I bet the math's a bit different. After all, "infinite" is much bigger than a million million.

Putting these ideas together seems to provide a rather obvious solution: third-world indexers. After all, if it costs only a nickel to get someone to write a few keywords for something, we can get a lot of indexing done very cheaply; I say "third world" because no indexer I've ever known is willing to work for a penny per word.

The indexing industry is facing the very real possibility that our workload will be taken from us and delivered to those in economies that allow lower prices. But what if we went a step further and, instead of looking for less expensive indexers with good qualifications, we decided to look for dirt-cheap indexers with no qualifications other than time to waste? What if, I ask, we asked monkeys to pound away at their keyboards?

I find the idea amusing but too close to the truth. After all, the intelligence behind Google is the social intelligence, the uneven and culturally biased workings of millions of Internet users plugging away at their disparate tasks. What Mechanical Turk has going for it, then, is the human decision making at the back end. Whereas most search engines look for better and greater stores of metadata with which to judge content, one man in a back room can make smarter decisions upon command. No, the real problem is that today's human intelligence is worth only pennies per word. Computers do their best, and humans sweep up afterwards. Our natural intelligence isn't worth a whole lot, I guess.

That's how we know computers are smart. Computers own us monkeys.



17 March 2007

 

Notes on automatic indexing

"Automated indexing software" is, according to the common definition, software that analyzes text and produces an index without human involvement. I'm a firm believer that the technology doesn't exist, and that a human being is required to write an index. Thus I don't use the software, and I also don't recommend it.

There are those who advocate it, arguing that it's "not as bad as an indexer would have you think." These advocates usually come from the standpoint that automatic software is faster and cheaper, and they're right. The real issue, then, is quality.

I believe that good automatic indexes will exist once there's good artificial intelligence, something that presently doesn't exist. In very limited circumstances, however, automation does work; a machine can easily cull capitalized words from a textbook to create an approximation of an index of names -- although, again, the machine isn't going to differentiate between names like "David Kelley" and places like "San Francisco," since both have the same format and are used the same way. It also won't know that "Bill Clinton" is also "William Jefferson Clinton." And it certainly can't tell when a name is mentioned in a trivial, unhelpful way, as are the names in this paragraph! Now imagine the problems in getting a machine to parse full sentences of ideas and recognize the core ideas, the important terms, and the relationships among related concepts throughout the entire text.
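For the curious, here's roughly what that culling trick looks like in Python -- my own sketch, not any vendor's algorithm. Note how happily it treats people and places identically:

    import re
    from collections import defaultdict

    def crude_name_index(pages):
        """Cull capitalized multi-word phrases from numbered pages.
        A crude approximation of a name index: it can't tell people
        from places, merge name variants, or skip trivial mentions."""
        pattern = re.compile(r'\b(?:[A-Z][a-z]+ )+[A-Z][a-z]+\b')
        index = defaultdict(set)
        for number, text in pages.items():
            for phrase in pattern.findall(text):
                index[phrase].add(number)
        return {term: sorted(nums) for term, nums in sorted(index.items())}

    pages = {114: "David Kelley founded the firm in San Francisco."}
    print(crude_name_index(pages))
    # {'David Kelley': [114], 'San Francisco': [114]} -- both dutifully
    # "indexed," even though one is a person and the other a place.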

Those who advocate automatic software would argue that the machine gets "close enough" that a human being can then edit the resulting product. However, expert evaluators unanimously agree that the software fails; those who disagree are likely too unfamiliar with indexing in the first place to recognize the quality differences.

Oh, I should mention that there are software programs that human indexers use to simplify and speed up the mechanics of the indexing process. For example, it would be silly not to let a computer alphabetize the entries, reformat the index, and manipulate page numbers. A few dedicated software packages do exactly this and are considered top of the line; other applications with indexing features, such as Microsoft Word and Adobe FrameMaker, offer some of these capabilities, with notable limitations.

For information on the various software available, see http://www.asindexing.org/site/software.shtml. If you have feedback, especially differing opinions, I'd love to hear them. Write me at seth@maislin.com.

(This article was originally published in 2002 and 2004 -- and it's still 100% accurate.)



05 March 2007

 

Interpretation, not computation

After explaining the limitations of Microsoft Word's auto-indexing feature to one of the many people who write me asking for indexing advice, I got an interesting response. Clearly frustrated by the nonexistence of computer tools to do something as simple as generate a name index, he wrote:

> I'm amazed at the poor development of the science of indexing for printed matter such as books.

I wrote back, "You misunderstand!"

The science of indexing is quite broad, given its long history in library science. What seems undeveloped in this case are the tools, but that reflects a misunderstanding of what indexing is. Indexing is an editorial field, not an automatic one. You might say it's a lot like writing, in that the writer must decide what readers want to read, and then communicate those ideas in an organized and approachable way. Indexing is the same: analyzing text to discover what readers might find interesting, and then labeling and organizing those ideas in multiple ways so people can find them.

Computers will never be able to write indexes because they can't (a) interpret the importance of a concept, (b) understand concepts beyond simple words, and (c) connect ideas in contextually relevant ways. As much as I admire the Google.com search engine for what it can do, once again I will demonstrate what it can't. Google finds 10,000,000 things when we really want only 3 (or 10 or 20). It finds what we type, but it doesn't find synonyms. And there's no guarantee that Google is searching everything that's out there, though it appears to come close; in book indexing, however, there's a human to make sure every page was considered.

How often has Microsoft Word attempted to auto-correct you in a completely inaccurate way? Spell-check? Auto-format? Auto-complete? Half-intelligent humans don't make the kinds of mistakes that these tools do.

Here's what I wish he had written:

> I'm amazed that people who know full well that computers could never write newspaper articles still believe computers can write indexes.

Another problem, of course, is that indexes aren't respected in the industry. The reason Microsoft Word even has an automatic indexing feature is that the people who wrote that software have no idea of the damage such a tool does. That Word's {XE} functionality is so miserable is even further proof. There's a nasty cycle: people use inferior tools, quality indexing grows less likely, and inferior tools become the standard.

Indexing is an editorial process, just like writing and editing. Indexing requires interpretation, not computation.

Computers will not and should not be used as indexers. If my job ever dies because computer programmers have found a way to make me obsolete, at least I know I'll be in the enlightening company of human writers and artists.



07 February 2007

 

Indexes are the speed limit on the information highway

The growing demand for indexes that point to online, changing, and custom content is forcing a huge gap into the indexing industry, and that gap is physical.

With traditional publishing, if I wanted a reader to find content on page 114, I would simply have the number 114 show up in my index: "credit card fraud, 114." To make this work, however, that page number must be immutable across the lifetime of my index. Should the content be republished in a different format, layout, or language, or with significant edits within the first 100-plus pages, my index could be rendered inaccurate. In other words, if I type "114" in my index, that content had better be on page 114.

The appeal of fluid content, however, is slowly making traditional information delivery obsolete. Not only are books republished for lots of "traditional reasons" (e.g., updated editions, new languages, different book and print sizes), but technology is enabling books to be published without a single physical page. With the possible exception of the Adobe PDF format (which purposefully preserves the overall book-like format in an electronic file), page breaks are optional and subjective. A Web page or HTML document can have a scroll bar, such that there are no pages; an e-book intended for a handheld reader is paged according to the size of the reader; a news or magazine article of any length can be broken in two or three simply to increase ad sales; and some electronic documents can be edited by the readers such that anything goes.

Ah, how I miss the days when 114 meant 114.

Indexing content that changes is going to be hard, but the fundamental challenge isn't about keeping up with what was newly published today, or even in the last twenty minutes. It's about content ownership. When content is moving around all the time, indexers don't have a good way to track where that content is going.

As an analogy, consider a classroom filled with thirty students, with one student at each desk. If you have a photograph of where every student is sitting, you could leave the room and generate a spreadsheet that lists each person's name and seat location. But what happens when the students are playing musical chairs? Every photograph you take is outdated almost immediately; even staying in the room wouldn't be good enough, because your typing speed will never match the speed of thirty kids jumping around. In fact, the only way you could manage a spreadsheet that shows where each student is sitting at all moments is if that spreadsheet operated in real time, by reference. In other words, if all thirty students carried GPS locator chips in their pockets, you could track the chips -- and thus the students -- by satellite. Your map could be as dynamic as whatever it is you're mapping.

Embedded indexing, or indexing by reference, is a rudimentary and imperfect example of this process. With embedded indexing, I can have some kind of information inserted into the content -- like the GPS chip in the student's pocket -- and then I can generate an index based on where that information is at any one time. This blog entry, for example, has keywords attached to it; the website where my blog is published can, at any time, generate a list of all entries with a given keyword. This kind of dynamic indexing is not uncommon these days; website content is served according to a number of immediate rules, and the result can be as simple as a website that publishes "Hello Seth Maislin" on my page but no one else's, or as complicated as an online stock trading program that keeps track of millions of private transactions.
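A minimal sketch of indexing by reference, in Python -- the entries and the function are mine, and I'm assuming nothing about how Blogger actually implements it:

    from collections import defaultdict

    # Each entry carries its keywords with it, like the GPS chip
    # in a student's pocket.
    entries = [
        {"title": "Indexing moving content",
         "keywords": ["embedded indexing", "moving targets"]},
        {"title": "Unfindable (a virtue)",
         "keywords": ["findability"]},
    ]

    def generate_index(entries):
        """Regenerate the index on demand: it is always derived from
        wherever the content (and its keywords) happens to be right now."""
        index = defaultdict(list)
        for entry in entries:
            for keyword in entry["keywords"]:
                index[keyword].append(entry["title"])
        return dict(index)

    print(generate_index(entries))
    # Retag or move an entry, call generate_index() again, and the
    # "spreadsheet" is current -- no manual rewriting of locators.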

I say this is rudimentary, however, because it's still a snapshot. Perhaps it's convenient to have that snapshot taken at the moment I arrive at a website, but if I leave my browser at a website and walk away for 20 minutes, the picture doesn't have to change. The "Hello Seth Maislin" greeting made sense when I was sitting at the computer, but if I walk away and my wife sits down, it's now wrong. The snapshot is old. Google search results can change from one minute to the next. Even stock trading programs sport copious warnings that despite the best efforts of the website, the price you think you're getting may not be the *actual* price when you complete a transaction; the delay between your clicking the mouse and the machines at the other end doing something is a legitimate and unavoidable delay. Some websites attempt to minimize this by taking a snapshot every fraction of a second, as if you were watching what was happening "live." In reality, there's still a delay, and there's still no way to truly synchronize everyone's machines.

My point is that indexes to changing documentation must live apart from the documentation. If they really lived completely together, the content and the index would be essentially the same thing, just as the GPS chip and the student are really one merged object. But because indexes are interpretations of content, there is always going to be a gap. The generation of the index must be removed from the content being indexed in order for that interpretation to take place.

The only way for indexing to survive, I think, is for content to slow down. And because I believe indexing -- interpretation -- is critical for learning, the only logical conclusion is that content will slow down.

The need for an index is the logical limit of just how fast data can travel.



24 December 2006

 

A needle in a haystack with 100,000,000 blades

The Internet has more than 100 million websites, according to the November Netcraft survey. If you were standing on top of the growth curve, by now your stomach would have nothing left to vomit up.

I did some math, and I've figured out a way to make sure that all of these websites are indexed. Here's what I discovered.

So there you go: a team of 50 people can index the Internet. That doesn't sound nearly as bad as I thought. Of course, everyone will have to type rather quickly, and we'll need a system in place to prevent us from accidentally indexing any one website more than once, but that shouldn't be too bad. And yes, I'm assuming that all of these websites are in English, but most of them are; I'll bring in a few translators to work on the few that remain.
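Here's a back-of-envelope version of that arithmetic in Python -- my reconstruction, where the round-the-clock pace per indexer is the derived quantity:

    websites = 100_000_000     # Netcraft's count, give or take
    indexers = 50              # the team size above
    sites_each_per_year = websites / indexers           # 2,000,000 per person
    sites_each_per_day = sites_each_per_year / 365      # ~5,479 per person
    sites_per_minute = sites_each_per_day / (24 * 60)   # round-the-clock shifts
    print(f"{sites_per_minute:.1f} websites per minute, nonstop")   # ~3.8
    # Four keywords per site at that pace is why everyone
    # "will have to type rather quickly."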


At U.S.$50,000 per year per indexer, which is quite modest for a highly intense round-the-clock job like this, plus $100,000 for me as manager, I could probably put together a bid of about $350,000/year to get the job done. Given how many billions of dollars are spent or exchanged over the Internet today, that seems quite reasonable, too. Heck, I should triple the whole thing, since we'd have to re-index the old sites every once in a while. Maybe I should double it again, too, so we'd be allowed to use eight keywords instead of four.

So let's see, that brings the total bill to $2.1 million. Gosh, that isn't bad at all, is it? I mean, we all agree that indexing the Internet is at least a two-million-dollar-per-year business, right?

Except it's not. Indexing the Internet is a zero-dollar-per-year business. No one is doing it. Just about no one seems to care about quality keywords. In fact, there are only two industries that exist around keyword creation. One of them is misnamed "search optimization," which is about spamming the heck out of the Web. Optimize? I think not: this is the opposite of the intelligent product my team would build. The other business is the search business itself, with companies springing up around those fancy algorithms that Google, Yahoo, Lycos, Ask Jeeves, and the rest use. The thing is, those algorithms are just word-matching machines. These engines are looking for keywords, but none of them is actually writing any. So you see, no one with indexing training is writing any keywords. The inexpensive market for human indexers is being completely overlooked.

Guess it's not worth the two million.



12 December 2006

 

Indexing moving content

Has it been three months? Almost!

Fact is, the world has a way of throwing curve balls on a regular basis. For me, those curve balls include a family-wide influenza epidemic, teething babies, travel plans, and the like. Trying to keep a grip on life is like trying to catch fish with your hands.

Tonight I give a presentation about trying to index moving targets. I was surprised to discover that of all the presentations I've ever given, this was absolutely the hardest to write. In fact, I just finished a few minutes ago. I've taught three-day classes, with eight hours of material on each day, but this 45-minute presentation really stymied me. There are two reasons for this.

First, trying to index moving content is, no matter what, a mess. The simplest example of a problem is creating an index entry like "software development, 111-121," and then finding out that pages 111 and 121 have moved respectively to pages 113 and 123. With standalone indexing (where you type in the page numbers), the only real way to fix this is manually: go back and rewrite all your page numbers. It's a MESS. So here I am, hoping to provide some tips to indexers and technical writers, something to help them avoid these kinds of corrections -- only to realize that there's no good answer. (A bad answer is to not index at all. :-)
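A deliberately simplified sketch, in Python, of why embedded markers fare better than typed page numbers -- the marker names and the pagination model are mine:

    # Standalone indexing hard-codes the locators into the index:
    standalone_entry = ("software development", "111-121")  # wrong the moment pages shift

    # Embedded indexing stores markers inside the content and derives
    # the locators at build time, so a reflow just means regenerating.
    def build_entry(marker_pages):
        start, end = marker_pages["sd-start"], marker_pages["sd-end"]
        return ("software development", f"{start}-{end}")

    print(build_entry({"sd-start": 111, "sd-end": 121}))  # before the reflow
    print(build_entry({"sd-start": 113, "sd-end": 123}))  # after: markers moved with the text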

The second problem is that even if I did have a list of useful tools, they wouldn't make for interesting presentation materials. The first draft of my presentation would have resembled a public reading of the weather report for every American city, in alphabetical order: if you're lucky, you're interested in Albuquerque and Atlanta and can walk out early.

The fact is, our growing reliance on live and custom information is wreaking havoc on the indexing world. It's becoming harder and harder to collate information into relevant chunks. Search will never do it; even if there were human beings out there developing controlled vocabularies, full-text search still retrieves a tremendous amount of flotsam. But creating keywords for something that won't live an hour seems kind of pointless, too. We're all just pounding sand.

I'm looking forward to what the participants have to say. Must we accept the false imprisonment of uncatalogued real-time information flow, or will writers finally catch on that indexers have an important role on the creation side as well?



08 September 2006

 

Unfindable (a virtue)

Indexers work to make things findable, but there's another side to this coin. Indexers also work to make things unfindable.

An important and often overlooked consequence of the culling process that indexers hone when deciding what should be indexed or labeled, and how, is that every decision an indexer does not make renders something that much more unfindable. In fact, just as there are an infinite number of misspellings for any one word, there are an infinite number of indexing choices an indexer can make. But unlike misspelled words, which by definition are "mistakes," every indexing choice and every keyword is a good keyword in the right context. The information space is too big for mistakes; for each conscious and unconscious inaction, the best we can hope for is "highly unlikely." If we don't do it, perhaps no one will need it.

There's a sign in a general store in Lake George, New York: "If you don't see it, you don't need it." This is the inadvertent motto of all indexers. We can only pray that every concept we leave unindexed, every word we don't choose, and every relationship we don't articulate is unneeded. Then again, there are an infinity of choices we never even see, aren't there?

Unfindability is a pandemic, a glorious desert that stretches beyond our senses and imaginings. In today's world of RFID technology, in which every object and person (and object-person combination) can potentially be mapped and tracked over an arbitrary length of time, the vast wasteland of unfindability starts to rank up there with a good vacation.

Let's turn the indexing process around. As indexers, what do we want to make lost?



As indexers, we shape the worlds that no one sees. Now let's do it on purpose.



04 July 2006

 

The detailed analysis of indexing mistakes

In linguistics, the analysis of error is one means of learning how we cognitively process language. For example, when someone accidentally misspeaks "unplugged the phone" as "unphugged the plone," we discover both that the speaker is a visual learner (because he switched the p-blends in the phrase, despite their different sounds) and that the speaker processes language in its component sounds. By contrast, a speaker who says "unphoned my plug" processes language in morphemes (e.g., root words), and a speaker who says "unplugged my feet" is an aural learner (because phone and feet start with the same f sound). There seems to be an infinity of spoken-language errors possible, including absences, duplications, inclusions, misalignments, substitutions, and transpositions of letters, sounds, morphemes, words, and phrases.

When I evaluate an index, my job is to look for mistakes. As a now-experienced indexer who himself has made mistakes, I know that I can learn much about how an indexer thinks (or doesn't think) by analyzing her errors and accidents. And as with speech, there are innumerable kinds of mistakes available for the unwary indexer: absences, duplications, inclusions, misalignments, misrepresentations, and missortings of page numbers, letters, words, structures, and ideas.

Consider the incorrect page number, such as when content on page 42 is indexed as if it were on page 44. This kind of error tells us that the indexer did not attend properly to detail, perhaps because the working environment (deadlines, tools, etc.) was less than ideal. When a page range appears simplified to a single number, such as when 42-45 appears simply as 42, I am more likely to consider the indexer lazy instead of scatterbrained, though again it is also possible to blame the working environment (including client demands).

Entries that appear in an index but have no value to readers (e.g., the inclusion of passing mentions and other trivia) demonstrate the indexer's ignorance of the audience, or of the indexing process itself. Entries that fail to appear in an index but should (e.g., the under-indexing of a concept) demonstrate either the indexer's ignorance of the audience, the indexer's ignorance of the subject content, or a sloppy or otherwise rushed working process.

Awkward categorizations, such as entries that are mistakenly combined or that don't relate well to their subentries, are a clear sign that the indexer misunderstands the content or is too new to indexing to understand how structure is supposed to work. For example, an indexer who creates

American
....Idol (television program), 56
....Red Cross (organization), 341

doesn't think of indexing as a practice of making ideas accessible, but rather as a concordance of words without meaning. Under no circumstances should American Idol or American Red Cross have been broken into halves, let alone combined. Since categorization can be subtle, however, evaluators can learn something interesting about indexers by looking closely at their choices:

writing
....as artistic skill, 84
....fiction vs. nonfiction, 62

In this example, the first subentry defines writing as a trade; it's clear the indexer is comfortable with the idea of a writer. The second subentry defines writing as a process, with a start and finish, such that the process (or journey) of writing could be different when you're writing fiction instead of nonfiction. Analysis of this entry tells us that the indexer doesn't recognize or appreciate the difference between writing (trade) and writing (process). Is the indexer revealing her inner disdain for writers, does she believe that all writers are the same no matter what they produce, or does she simply know nothing about the writing life?

One of the big challenges for indexers is to provide the language that readers will need to find the content they're looking for. When an indexer either offers language that no one will look up or omits the terms that readers prefer, she is demonstrating an ignorance of the audience or of the content, or hinting that the overall indexing process or environment is inadequate. Further, when the indexer fails to provide access from an already existing category entry (for example, if the index has an entry for "writing, fiction vs. nonfiction" but fails to provide the cross reference "See also author" when there are author entries), she tells us clearly that she is unfamiliar with the material. No other combination of errors speaks of subject ignorance as clearly; by failing to connect existing concepts, the indexer shows us gaps in her knowledge of the information map.

There are several kinds of text errors. Misspellings and other typographical errors are a sign of carelessness or insufficient tools. Accidental missortings are a sign of ignorance, poor tools, accelerated schedules, or a failure of communication among publication staff. Ambiguous terms that aren't clarified are caused by indexers who are too limited in their thinking or their assumptions about the audience, indexers who don't know the material, and authors who failed to communicate the ideas clearly enough for the indexer to understand. Finally, odd grammatical choices usually signal either a poor production process (such as when two indexes are combined automatically with insufficient editing effort) or a brand-new indexer with no formal training.

Before concluding, I would be remiss to ignore errors of formatting. A failure to use consistent styles signals a deficit in tools or attention, whereas awkward or unreadable decisions regarding indentations, margins, and column widths are a big sign that the index designer (who is not necessarily the indexer) has no clear idea whatsoever how indexes work. Missing continued lines communicate the same thing. (On the other hand, exceptional use of formatting, such as the isolated use of italics within a textual label, is a clear sign that the indexer really does understand both the audience and how they approach the index.)

Ignorance, sloppiness, indifference, and confusion: these are shortcomings even a professionally trained, experienced indexer might have, but thankfully they often manifest as isolated exceptions in her practice of creating quality work. But when a single kind of mistake appears multiple times throughout an index -- numerous misspellings, huge inconsistencies of language, globally insufficient access, awkward structures -- we need to be concerned. When we see these, we have an obligation to analyze the indexer. By properly arming ourselves with this knowledge, we can determine for ourselves if the indexer was the wrong choice for a particular project, struggled with the challenges of inferior tools, or simply had a bad day.

Meanwhile, if indexes written by different indexers are plagued by the same exact problem, it's unmistakably clear that the problem is a systemically faulty publication process: ridiculous deadlines, uncooperative authors, uncaring editors, poor style guides, and so on. In other words, you shouldn't evaluate indexes in isolation. Instead, look at the work of other indexers for the same publisher, as well as the same indexer's work for other publishers.

Okay, but what if the index is essentially perfect, with no errors at all? Can we still learn something? Yes, we can. The absence of all error tells us something very important about the indexer: She's being underpaid.



18 May 2006

 

Bias in indexing

The greatest advantage that indexing processes have over automated (computer-only) processes is the human component. Of course, as someone who has worked with humans before, you probably recognize there can be imperfections.

I was reading Struck by Lightning: The Curious World of Probabilities earlier this week, in which the author writes of biases in scientific studies. I realized that these same biases occur with indexes and indexers as well, and I wondered if I could list them all.

(The biggest bias in indexing isn't one of the index at all, but rather the limitations on what the authors write. For example, if a book on art history didn't include information about Vincent van Gogh, I would expect van Gogh to be missing from the index; this absence might be caused by an author bias. However, I am going to focus on biases that affect indexing decisions themselves.)

Inclusion bias. Indexers may demonstrate a bias by including more entries related to subjects that appear more interesting or important to that indexer. For example, I live in Boston, and so I might consider Boston-related topics to be less trivial (more important) than the average indexer; consequently, documentation that includes information about Boston is more likely to appear in my index. I imagine inclusion bias is a common phenomenon in documentation that includes information about contentious social issues -- immigration, tobacco legislation, energy policy -- because the drive to communicate one's ideas on these issues is stronger. I also believe that inclusion bias is not entirely subconscious, and that indexers may purposefully choose to declare their ideas with asymmetric inclusion. It should be noted, however, that biased inclusion would not necessarily provide insight into the indexer's opinion on the subject; creating an entry like "death penalty morality" does not clearly demonstrate whether the indexer actually disagrees with capital punishment.

Noninclusion bias. Similar to inclusion bias, indexers might feel that certain mentions in the text are not worth including in the index because of their personal interests or beliefs. Unlike inclusion bias, however, I suspect noninclusion bias does not appear with regard to contentious issues; conflict is going to be indexed as long as the indexer recognizes the conflict has value. Instead, an indexer is likely to exclude things that "seem obvious"; rarely are these tidbits of information controversial. For example, an indexer who is very familiar with computers is likely to exclude "obvious computer things," subjectively speaking; you probably won't find "keyboard, definition of" in such a book.

Familiarity (unfamiliarity) bias. When an indexer is particularly interested in or knowledgeable about a subject, the indexer is likely to create more entry points for the same content than another indexer might. For example, an indexer who is familiar with "Rollerblading" might realize that Rollerblade is a brand name, and that the actual items are called inline skates. This indexer is more likely to include "inline skates" as an entry. Unfamiliarity bias would be the opposite, in that multiple entry points are not provided because the indexer doesn't think of them, or perhaps doesn't know they exist.

Positive value bias. An indexer who has reason to make certain content more accessible for readers to find (and read) is likely to create more entry points for that idea. At the extreme, the indexer will overload access by using multiple categorical and overlapping subtopics, where those subcategories are at a higher granularity than the information itself. For the generic topic of "immigration," for example, an indexer might include categorical entries like "Hispanic immigrants," "European immigrants," and "Asian immigrants," as well as overlapping topics like "Asian immigrants," "Chinese immigrants," and "Taiwanese immigrants," with all of them pointing to "immigration" in general.

There are three types of positive value bias. Personal positive value bias is demonstrated when the indexer himself believes that the information is of greater-than-average value. Environment-based positive value bias is demonstrated when the indexer is swayed by environmental forces, such as social pressures, political pressures, pre-existing media bias, and so on. Finally, other-based positive value bias is demonstrated when the indexer bows to pressures imposed by the author, client, manager, or sales market (i.e., the person paying the indexer for the job). Although it can be argued that this last type of bias is not the indexer's own, strictly speaking the indexer can choose to fight any bias forced upon him. For example, an indexer instructed by a client to "index all the names in this book" might interpret that instruction as a kind of market bias, and thus refuse to follow the guideline. In reality, however, most indexers do accept the pressures placed upon them by the work environment, and thus in my opinion take on the responsibility and ethical consequences of this choice.

Negative value bias. It's possible for an indexer to provide fewer entry points for content that he feels is not of great importance to readers -- the direct opposite of positive value bias -- but the reasons for limiting access to that content are probably not related to the indexer's perceived value of that content for readers. Instead, indexers are likely to limit access to content when there is a significant amount of similar content in the book, such that including access to all of those ideas would either bulk up the index unnecessarily or waste a lot of the indexer's time. For example, if an indexer were faced with a 40-page table of computer terms, it's unlikely that each term would be heavily indexed, even if such indexing were possible and even helpful to readers.

For this reason, I believe that there are three kinds of negative value bias: time-based negative value bias, in which the indexer skimps on providing access in an effort to save time; financially motivated negative value bias, in which the indexer skimps on providing access in an effort to earn or save money; and logistical negative value bias, in which the indexer skimps on providing access in response to logistical issues like software limitations, file size requirements, page count requirements, controlled vocabulary limitations, and the like.

Topic combination (lumper's) bias. This bias is exhibited by indexers who are likely to combine otherwise dissimilar ideas because they find this "lumping together" of ideas to be aesthetically pleasing or especially useful. This kind of bias is visible in the ratio between locators (page numbers) and subentries, in that entries are more likely to have multiple locators than multiple subentries, on average. For example, an entry like "death penalty, 35, 65, 95" shows that the indexer believes the content on these three pages is similar enough that subentries are not required or useful. Topics that start with the same words might also be combined into a more general topic (such as combining "school lunches" and "school cafeterias" into "school meals"). It is worth noting that some kinds of audiences or documentation subjects may tend toward topic combination bias; for this reason, it may be difficult to recognize lumper's bias.

Topic separation (splitter's) bias. This bias is exhibited by indexers who are likely to separate otherwise similar ideas because they find this "splitting apart" of ideas to be aesthetically pleasing or especially useful. As with lumper's bias, splitter's bias is represented by the ratio of locators to subentries throughout an index, in that splitters are likely to create more subentries than would other indexers, on average. It is worth noting that some kinds of audiences or documentation subjects may tend toward topic separation bias; for this reason, it may be difficult to recognize splitter's bias.

These are all the biases I've found or experienced. If you think there's another kind of bias that indexers exhibit, let me know.

The remaining question is this: Is it wrong for an indexer to have bias? That is, should indexers study their own tendencies and work to avoid them? I don't think it's that simple. The artistry that an indexer can demonstrate is fueled by these biases -- experiences, opinions, backgrounds, interpretations -- and perhaps should even be encouraged. An indexer's strengths come from his understanding of not just the material, but also his perceptions of the audience, the publication environment, and the audience's environments. Further, indexers who know and love certain subjects are going to be drawn to them, just as many readers are; these biases aren't handicaps so much as commonalities shared between indexers and readers. Biases will hurt indexers working on unfamiliar materials in unfamiliar media, but under those conditions the biases are the least of our worries; when the indexer is working without proper knowledge, the higher possibility of bad judgment or error is a much greater concern.

If anything, indexers should be aware of their biases because they can serve as strengths -- especially in comparison to what computers attempt to do.



02 April 2006

 

Standards in the indexing community

The American Society of Indexers has been deluged with opinions regarding its venture into credentialing individual indexers. Leaving aside all legitimate concerns about the implementation of credentials, there are still many strongly voiced opinions about whether credentialing is a good idea in the first place.

One thing about indexing certification and index standards building that I keep getting stuck on is the question of who they're really for. There are indexers who believe credentialing will hurt all indexers, or perhaps just ASI members. There are many who suspect new indexers will benefit, and that everyone else will get hurt.

But I don't believe the truest benefits of standards building are about the individual indexer. There will be effects, and certainly those effects will be different for different people, but the whole reason standards are created is to improve the industry as a whole. At least, that's how I understand it.

There is a standard for indexing already out there. It's the ISO 999 standard. Does it benefit you? If you were indexing when it was updated in 1996, did you feel any repercussions in your business? Gosh, it sounds as if I'm joking.

Recently I learned that Massachusetts now requires carbon monoxide detectors on every floor. That means I need to install two more in the house. If I don't install them, I won't have a problem until I attempt to sell the house. And truthfully, I'm kind of annoyed that safety regulations are being pushed onto me -- the vehicle seat belt law, the bicycle helmet law, and now this. I'm not saying that seat belts and helmets are stupid things, but I don't like feeling forced into wearing them.

ASI is attempting to create a standard -- certainly one that is harder to define than "wearing a seat belt or not," I'll freely admit -- and so yes, it is a bit disconcerting to be on the receiving end. I'm doing just fine without my carbon monoxide detectors, and I'll do just fine without professional credentialing.

But ask yourself if you believe in the ideal here. Do you believe that standards *should* exist? You don't have to.

I do.

There are many industries that survive very well without some kind of license, certificate, or degree, and some might think that indexing is one of them. But no matter how it affects me personally, I really believe indexers should be a part of a larger entity, something more important than a simple networking community. Do we -- and I mean ALL of us who write indexes -- have anything in common? Do we have anything to fight for as a group (other than higher rates)? Nah. As an industry, we really don't stand for anything without a standard. We're just a bunch of word inventors -- tinkerers -- working alone in our attics.

Most days, I'm fine standing for nothing. I like my income, I like most of my clients and projects, and I absolutely love being able to work at home and raise a daughter who stops to smell the flowers. Some people like Sudoku and crossword puzzles; I like indexing and teaching.

Other days, I feel like someone bailing out a rowboat with a hole in it, with indexes as buckets. I am frustrated by a complete lack of growth in the industry, by repeating myself to every new production editor I meet, by fighting for the right to use page ranges or decent indexing software, and again and again by having to justify my very reasonable rates.

The deeper meaning that can come from credentialing also comes from other things: professional development, education, and research -- including all those great index usability research ideas. Credentialing isn't the only pursuit of this association, nor would it truly succeed in isolation. But I need to be a part of an association that advocates not just for its individual members, but for the meaning of the industry itself.

Credentialing can be part of the solution.



24 March 2006

 

The granularity of an online "page number"

When writing a hyperlinked index (where hyperlinks are used instead of page numbers), to what should those links point?

Some people think they should point to the section title under which the information appears; other people like to point right to the specific word used in the index. The "answer," obviously, is that hyperlinked index entries should take readers to where the information is, right? The problem -- the reason the question "where do I point my entries?" exists in the first place -- is that readers of hypertext might find themselves bounced somewhere they don't understand. How many times have you followed a link, only to find yourself fiddling with the scroll bar to figure out where you ended up? Following a hyperlink is like being blindfolded and transported to an unknown destination.

What you may not know is that book indexes aren't much different. :-)

Think about how book indexes actually work, and you realize that they direct readers to the page on which the information starts. An entry like "buoyancy, 164" tells the reader to look somewhere on page 164; an entry like "global harmony, 164-167" tells the reader to start looking somewhere on page 164. The granularity of an index is the smallest unit of area that can be pointed to. For printed indexes, that unit is the page. Rarely will you find locators that use fractional or qualified page numbers like 164-1/2 or 164top. (There are such things as qualified locators, like 164f, which might point to the footnote on page 164, but even in the books in which they're used they comprise only a small fraction of all locators.)

If you follow the standards of the industry, then, the granularity of a printed index is one physical page. For this reason, books that have lots of words on a page -- big pages, narrow margins, tiny print -- are less friendly to book indexers. It's like telling someone that there's a needle in that 164th haystack over there. Maybe we should count our blessings that someone bothered to number the haystacks, but ideally this is where the book designer starts earning her salary. Book pages don't have to look like haystacks -- more accurately, wordstacks -- if the book has legible headings and subheadings. Books can be written with quickly visible landmarks within the pages, like italics and boldface, larger and smaller font sizes, headings and callouts, footnotes, and so on. Going back to the blindfolded analogy, there's no reason we have to drop our readers into deserts of information, when we can drop them in a place surrounded by location clues and navigational signs, like at a train station.

On the Web, however, there is no such thing as a printed page. Web pages can be any length, from tiny pop-up windows with only a sentence fragment of information within, to long scrolls of endless paragraphs and images. Additionally, you don't have to direct the reader to just the page any more, but rather you can deposit him anywhere within the page. The granularity of a Web page is a word! You can send someone into the middle of a paragraph.

When you have tiny little windows of information, using that window as a destination is a no-brainer: the reader arrives at a single sentence of information, which is what he needs. It doesn't matter if you point him to the beginning, middle, or end of that sentence, because that's all he gets to read. Pointing someone to an isolated window of information -- what Web authors call a "chunk" -- is as easy as looking into a food pantry that contains only a single can. But when you have longer pages, and you have the ability to point someone to any spot within those longer pages, you have a decision to make. And it's a decision that didn't exist in the printed world, with its larger granularity.

The solution is to connect the text of the index entry with the text of the documentation. Not the meaning, but the actual words. If the index entries are written to match the documentation's own words almost identically, then the reader won't mind as much, because the destination won't look like a desert. He'll have exactly the landmark he needs right in front of him. The entry "cancer, prevention of," for example, could point directly to this line without a problem:

... cessation of smoking. In fact, many physicians are well aware that one way to prevent cancer is to quit ...

That's because the words of your index entry, cancer and prevention, appear almost verbatim in that line of text. And if this information were part of a section titled "Using Peer Pressure to Help Patients Quit Smoking," then you really wouldn't want to point to the heading for context. From the heading alone, it's unclear to the reader that you're actually directing him to information about cancer or prevention. You're making him work at it.

And then there's the other situation. Using the same sentence and heading as above, where should the indexer point readers who look up the entry "smoking, how to quit"? Clearly they should go right to the heading. If they went to the line that talks about physicians, they wouldn't know where they were.
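In hyperlink terms, those two decisions might look like this -- a minimal sketch in Python, with made-up file and anchor names; the only real point is that each entry lands where its own words are visible:

    # Each entry points to the spot whose visible text echoes the entry's words.
    index_entries = [
        ("cancer, prevention of", "chapter5.html#prevent-cancer"),  # lands on the sentence itself
        ("smoking, how to quit", "chapter5.html#quit-smoking"),     # lands on the section heading
    ]

    def render_index(entries):
        """Render index entries as HTML links instead of page numbers."""
        return "\n".join(f'<a href="{href}">{text}</a>' for text, href in entries)

    print(render_index(index_entries))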

Our original question here was this: When writing a hyperlinked index (where hyperlinks are used instead of page numbers), to what should those links point? Clearly the only way to answer this question comprehensively is to suggest that the language of hyperlinked indexes has two contexts: the index entry itself and the destination location. These two contexts need to work together. And as we saw, the same is true with the printed book: having arrived at page 164, how quickly can you find the idea you were looking for?

Looking this closely at hyperlinked indexes only emphasizes something we need for all indexing: use index entries that match the documentation text. If you have to write a slightly longer entry, that's okay. Instead of "cigarettes," use "cigarette smoking, quitting." Instead of "social networks," use "social networks and peer pressure." The people who work with search engines and Internet marketing are familiar with the term trigger words, which refers to visible language that matches the mental language of the searcher. If you're thinking of the words "white elephant," then a result of "pale pachyderm" doesn't work because it doesn't trigger your sense of recognition.

So the next time there's a white elephant in haystack 164, be sure to tell someone as explicitly as possible.


