29 May 2007

 

Throughput in indexing

I gave a presentation last year (at the Toronto conference of the American Society of Indexers) about money. I synthesized some statistics to come up with something I hadn't seen expressed before.

According to the 2004 survey of ASI members (all numbers in U.S. dollars):

Note that MEDIAN rates do not necessarily match an indexer's lifestyle, workload, or typical projects. For example, some indexers work exclusively on the kinds of projects that earn more (or less) than the median. In other words, these are NOT target numbers; rather, they reflect the variety of work that indexers do.

Synthesizing these numbers:

From the survey:

Synthesis:

If you want to make more money, focus on throughput: projects that are easier for you, more effective indexing tools, improved marketing, stronger client relationships, etc. In fact, the reason that advanced indexers tend to make more money is that they have had the opportunity to build these skills: speed, marketing, relationships. For example, indexing a single book for a single author might be lucrative in the short term, but building relationships with the author's institution is more lucrative in the long term. Experience clearly counts toward speed, too, while short-cutting quality can seriously damage relationships.

If you think of your career in terms of throughput, you might think about your day-to-day tasks differently. For example, there have been debates among indexers about whether to pass along mistakes caught in the book (like misspellings); in terms of throughput, sending such mistakes (a) slows you down, but (b) improves repeat business. On the other hand, when a client provides you with only one book a year, it's all loss, and no trade-off, in terms of income.

Finally, when I gave this presentation I made it clear that income isn't the only reason we're doing what we do. After all, there are more lucrative professions out there in the world. If you're earning a ton of money but destroying your health, sacrificing your happiness, hurting your family, or failing yourself in some other important way, then please reconsider your priorities.



28 December 2006

 

Eighteen million people can't be wrong

No matter how much you and I might like Google, the fact is that Google has some very serious problems when it comes to finding content. More specifically, if you're looking for the "right" answer, or if you're attempting to do any serious research, Google is likely to fail you miserably.

The flaw lies in Google's strength: social algorithms. Social algorithms are processes in which decisions are made by watching and following the majority of people in a community. If blogger.com tends to be the place people go to create blogs, then a social algorithm will see blogger.com as "better." When a search engine is managed by a social algorithm, a website might appear first in search results not because of the quality of the site, but rather because a larger number of people treated the site as if it were of higher quality. In other words, social algorithms equate "majority" with "best," something that often looks right but actually is patently untrue.
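To make the mechanics concrete, here is a minimal sketch in Python of a purely popularity-driven ranking. It is a toy illustration of the "majority equals best" logic, not Google's actual algorithm (which layers many more signals on top), and the sites and link graph are made up:

from collections import Counter

# Hypothetical link graph: each page links to the listed targets.
outbound_links = {
    "alice.example.com": ["blogger.com", "wordpress.com"],
    "bob.example.com":   ["blogger.com"],
    "carol.example.com": ["blogger.com", "typepad.com"],
}

# Count inbound links: the only "quality" signal this toy ranker knows.
inbound_counts = Counter(
    target for targets in outbound_links.values() for target in targets
)

# Rank candidates by popularity alone; nothing here measures accuracy.
for site, count in inbound_counts.most_common():
    print(site, count)
# blogger.com 3
# wordpress.com 1
# typepad.com 1

The toy ranker never looks at what any page actually says; it only counts who points where, which is exactly the gap between "popular" and "accurate" described below.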

When you perform a search at Google.com, your results are sorted based on majority behavior and little else. For simple questions about anything -- as well as complex questions about cultural issues, for which "lots of people" is critical -- frequently the majority opinion is rather close to what you want -- which is why Google is so successful. But the gap between "close to what you want" and "accurate" is an invisible one, and that makes it insidious and dangerous.

For example, search for "Seth Maislin." The first hit is my website. The second hit is this blog. The third hit is an interview I did for O'Reilly & Associates in July 1999. An investigation of why these are the top three sites is rather interesting. First of all, these are the only results in which my name actually appears in the title; the fourth link and beyond have my name in the document, but not the title. Second, my website appears at the top not because it's the definitive website about "Seth Maislin," but because Google knows of 24 people linking to it. In comparison, the only person who ever created a link to this blog is me -- a number far less than 24! The same goes for the O'Reilly interview, except that the single linker isn't even a valid site any more: it's broken. The popularity of my home page (in comparison to this blog, for example) is why it's a better hit for my name. But if you folks out there started to actually link to this blog, that would change.

You should look into the search results for the word "Jew." A website known as JewWatch.com, an offensive and inflammatory collection of antisemitic content, had appeared as the number-one result at Google.com for this one-word query. This happened because a large number of supporters of this site tended to build links to it; then, those who were outraged or amused also linked to it within their protestations. In the end, the social algorithms at Google recognized how popular (i.e., "linked to") this site was, and in response rated it very highly -- in fact, rated it first -- compared to all other websites with the word "Jew" in the title. Eventually, those who were enraged by this content fought back by asking as many people as possible to link somewhere else -- specifically, the Wikipedia definition of Jew -- just as I have here. Over time, more people linked to Wikipedia than to JewWatch, and so the latter dropped into second place at Google. This process of building networks of links in order to influence Google's social algorithm is called "Google bombing." In other words, when the people who hated the site acted together in a large group, Google's social algorithms responded.

(By the way, you'll notice that I do not create a link to the offensive site. I see no reason to contribute to its success.)

Do you see the problem? The success of Google bombing is analogous to the squeaky wheel metaphor, that the loudest complainer gets the best service. Social algorithms reward the most popular, regardless of whether they deserve it. JewWatch made it to the top because it was popular first; Wikipedia's definition moved to the top because those offended banded together to demonstrate even more loudly. And in the end, there's no reason for me to think either of these links is best.

Whether popularity is a good thing or a bad thing is often subjective. In language, some people lament the existence of the word ain't, while others applaud its existence as an inevitable sign of change; either way, the word is showing up in our dictionaries because more and more people are using it. But I'm not talking about language; I'm talking about truth.

Do you think vitamin C is good at preventing colds? Well, it isn't; there have been no studies demonstrating its effectiveness, but there have been studies showing that it makes no real difference. (It's believed that vitamin C will shorten the length of a cold, but studies are still inconclusive.) But after a doctor popularized the idea of vitamin megadosing, our entire culture came to believe that taking the vitamin will keep you extra healthy. Untrue.

Do you know why "ham and eggs" is considered a typical American breakfast? Because an advertising executive in the pork industry used Freudian psychology to convince people to eat ham for breakfast. He did it by asking American doctors if they thought hearty breakfasts were a good thing (which they did); the ad-man then asked if ham were a hearty food. Voila: ham, sausage, and bacon are American breakfast staples, and the continental breakfast vanished from our culture.

In both of these examples, majority belief trumps the truth. And look at the arguments about global warming! I won't repeat the case laid out by Al Gore in An Inconvenient Truth, but his point is that as long as enough people insist that global warming isn't true, its dangers will remain unheeded. In fact, I'm not even going to argue here whether global warming is a real thing or not; it doesn't matter what I believe. What matters is that the debate over global warming isn't a fight over the facts. Instead, it's a shouting match, in which the majority wins. Right now, so many influential people have argued that it doesn't exist (or isn't such a big deal) that very little has been done in this country in response to its possible existence. But as more and more people start to believe it's at least possible, it's becoming a reality. Doesn't that just drive you nuts? Why are the facts behind global warming driven by democracy? Can't something be true even if no one believes in it?

One last look at this "majority rules" concept, only this time let's avoid politics and focus on simple word spelling. If you search for the word millennium, correctly spelled with two Ls and two Ns, you'll get about 54 million hits at Google (English-language pages only). If you search for the word millenium, misspelled with two Ls and only one N, you'll get 18 million hits. Twenty-five percent of the pages using either spelling have the misspelling in them! For published content, which by its very nature is biased toward having only correct spellings, this error rate is monstrous! But does Google let you know that millenium is misspelled? Does it ask whether you "meant to type millennium"? No! After all, Google considers the misspelled word correct.
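For what it's worth, that twenty-five percent falls out of simple arithmetic over the two hit counts, treating them as the whole population of pages that use either spelling (and remembering that the counts themselves are only Google's rough estimates):

correct_hits = 54_000_000     # "millennium", spelled correctly
misspelled_hits = 18_000_000  # "millenium", missing an N

share = misspelled_hits / (correct_hits + misspelled_hits)
print(f"{share:.0%}")  # prints 25%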

I mean, eighteen million people can't be wrong, right?



08 September 2006

 

Unfindable (a virtue)

Indexers work to make things findable, but there's another side to this coin. Indexers also work to make things unfindable.

An important and often overlooked consequence of the culling process that indexers hone when deciding what should be indexed or labeled, and how, is that every decision an indexer does not make renders something that much more unfindable. In fact, just as there are an infinite number of misspellings for any one word, there are an infinite number of indexing choices that an indexer can make. But unlike misspelled words, which by definition are "mistakes," every indexing choice and every keyword is a good keyword in the right context. The information space is too big for mistakes; for each conscious and unconscious inaction, the best we can hope for is "highly unlikely." If we don't do it, perhaps no one will need it.

There's a sign in a general store in Lake George, New York: "If you don't see it, you don't need it." This is the inadvertent motto of all indexers. We can only pray that every concept we leave unindexed, every word we don't choose, and every relationship we don't articulate is unneeded. Then again, there are an infinity of choices we never even see, aren't there?

Unfindability is a pandemic, a glorious desert that stretches beyond our senses and imaginings. In today's world of RFID technology, in which every object and person (and object-person combination) can potentially be mapped and tracked over an arbitrary length of time, the vast wasteland of unfindability starts to rank up there with a good vacation.

Let's turn the indexing process around. As indexers, what do we want to make lost?



As indexers, we shape the worlds that no one sees. Now let's do it on purpose.



04 July 2006

 

The detailed analysis of indexing mistakes

In linguistics, the analysis of error is one means of learning how we cognitively process language. For example, when someone accidentally misspeaks "unplugged the phone" as "unphugged the plone," we discover both that the speaker is a visual learner (because he switched the P blends in the phrase, despite their different sounds) and that the speaker processes language in its component sounds. By contrast, a speaker who says "unphoned my plug" processes language in morphemes (e.g., root words), and a speaker who says "unplugged my feet" is an aural learner (because phone and feet start with the same f sound). There seems to be an infinity of spoken-language errors possible, including absences, duplications, inclusions, misalignments, substitutions, and transpositions of letters, sounds, morphemes, words, and phrases.

When I evaluate an index, my job is to look for mistakes. As a now-experienced indexer who himself has made mistakes, I know that I can learn much about how an indexer thinks (or doesn't think) by analyzing her errors and accidents. And as with speech, there are innumerable kinds of mistakes available for the unwary indexer: absences, duplications, inclusions, misalignments, misrepresentations, and missortings of page numbers, letters, words, structures, and ideas.

Consider the incorrect page number, such as when content on page 42 is indexed as if it were on page 44. This kind of error tells us that the indexer did not attend properly to detail, perhaps because the working environment (deadlines, tools, etc.) was less than ideal. When a page range appears simplified to a single number, such as when 42-45 appears simply as 42, I am more likely to consider the indexer lazy instead of scatterbrained, though again it is also possible to blame the working environment (including client demands).

Entries that appear in an index but have no value to readers (e.g., the inclusion of passing mentions and other trivia) demonstrate the indexer's ignorance of the audience, or of the indexing process itself. Entries that fail to appear in an index but should (e.g., the under-indexing of a concept) demonstrate the indexer's ignorance of the audience, the indexer's ignorance of the subject content, or a sloppy or otherwise rushed working process.

Awkward categorizations, such as entries that are mistakenly combined or that don't relate well to their subentries, are a clear sign that the indexer misunderstands the content or is too new to indexing to understand how structure is supposed to work. For example, an indexer who creates

American
....Idol (television program), 56
....Red Cross (organization), 341

doesn't think of indexing as a practice of making ideas accessible, but rather as a concordance of words without meaning. Under no circumstances should American Idol or American Red Cross have been broken into halves, let alone combined. Since categorization can be subtle, however, evaluators can learn something interesting about indexers by looking closely at their choices:

writing
....as artistic skill, 84
....fiction vs. nonfiction, 62

In this example, the first subentry defines writing as a trade; it's clear the indexer is comfortable with the idea of a writer. The second subentry defines writing as a process, with a start and finish, such that the process (or journey) of writing could be different when you're writing fiction instead of nonfiction. Analysis of this entry tells us that the indexer doesn't recognize or appreciate the difference between writing (trade) and writing (process). Is the indexer revealing her inner disdain for writers, does she believe that all writers are the same no matter what they produce, or does she simply know nothing about the writing life?

One of the big challenges for indexers is to provide the language that readers will need to find the content they're looking for. When an indexer either offers language that no one will look up or omits the terms that readers prefer, she is demonstrating an ignorance of the audience or of the content, or hinting that the overall indexing process or environment is inadequate. Further, when the indexer fails to provide access from an already existing category entry (for example, if the index has an entry for "writing, fiction vs. nonfiction" but fails to provide the cross reference "See also author" when there are author entries), she tells us clearly that she is unfamiliar with the material. No other combination of errors speaks of subject ignorance as clearly; by failing to connect existing concepts, the indexer shows us gaps in her knowledge of the information map.

There are several kinds of text errors. Misspellings and other typographical errors are a sign of carelessness or insufficient tools. Accidental missortings are a sign of ignorance, poor tools, accelerated schedules, or a failure of communication among publication staff. Ambiguous terms that aren't clarified are caused by indexers who are too limited in their thinking or their assumptions about the audience, indexers who don't know the material, and authors who failed to communicate the ideas clearly enough for the indexer to understand. Finally, odd grammatical choices usually signal either a poor production process (such as when two indexes are combined automatically with insufficient editing effort) or a brand-new indexer with no formal training.

Before concluding, I would be remiss to ignore errors of formatting. A failure to use consistent styles signals a deficit in tools or attention, whereas awkward or unreadable decisions regarding indentations, margins, and column widths are a big sign that the index designer (who is not necessarily the indexer) has no clear idea whatsoever how indexes work. Missing continued lines communicate the same thing. (On the other hand, exceptional use of formatting, such as the isolated use of italics within a textual label, is a clear sign that the indexer really does understand both the audience and how they approach the index.)

Ignorance, sloppiness, indifference, and confusion: these are shortcomings even a professionally trained, experienced indexer might have, but thankfully they often manifest as isolated exceptions in her practice of creating quality work. But when a single kind of mistake appears multiple times throughout an index -- numerous misspellings, huge inconsistencies of language, globally insufficient access, awkward structures -- we need to be concerned. When we see these, we have an obligation to analyze the indexer. By properly arming ourselves with this knowledge, we can determine for ourselves if the indexer was the wrong choice for a particular project, struggled with the challenges of inferior tools, or simply had a bad day.

Meanwhile, if indexes written by different indexers are plagued by the same exact problem, it's unmistakably clear that the problem lies in a systemically faulty publication process: ridiculous deadlines, uncooperative authors, uncaring editors, poor style guides, and so on. In other words, you shouldn't evaluate indexes in isolation. Instead, look at the work of other indexers for the same publisher, as well as the same indexer's work for other publishers.

Okay, but what if the index is essentially perfect, with no errors at all? Can we still learn something? Yes, we can. The absence of all error tells us something very important about the indexer: She's being underpaid.



18 May 2006

 

Bias in indexing

The greatest advantage that indexing processes have over automated (computer-only) processes is the human component. Of course, as someone who has worked with humans before, you probably recognize there can be imperfections.

I was reading Struck by Lightning: The Curious World of Probabilities earlier this week, in which the author writes of biases in scientific studies. I realized that these same biases occur with indexes and indexers as well, and I wondered if I could list them all.

(The biggest bias in indexing isn't one of the index at all, but rather the limitations on what the authors write. For example, if a book on art history didn't include information about Vincent van Gogh, I would expect van Gogh to be missing from the index; this absence might be caused by an author bias. However, I am going to focus on biases that affect indexing decisions themselves.)

Inclusion bias. Indexers may demonstrate a bias by including more entries related to subjects that appear more interesting or important to that indexer. For example, I live in Boston, and so I might consider Boston-related topics to be less trivial (more important) than the average indexer; consequently, documentation that includes information about Boston is more likely to appear in my index. I imagine inclusion bias is a common phenomenon in documentation that includes information about contentious social issues -- immigration, tobacco legislation, energy policy -- because the drive to communicate one's ideas on these issues is stronger. I also believe that inclusion bias is not entirely subconscious, and that indexers may purposefully choose to declare their ideas with asymmetric inclusion. It should be noted, however, that biased inclusion would not necessarily provide insight into the indexer's opinion on the subject; creating an entry like "death penalty morality" does not clearly demonstrate whether the indexer actually disagrees with capital punishment.

Noninclusion bias. Similar to inclusion bias, indexers might feel that certain mentions in the text are not worth including in the index because of their personal interests or beliefs. Unlike inclusion bias, however, I suspect noninclusion bias does not appear with regard to contentious issues; conflict is going to be indexed as long as the indexer recognizes that the conflict has value. Instead, an indexer is likely to exclude things that "seem obvious"; rarely are these tidbits of information controversial. For example, an indexer who is very familiar with computers is likely to exclude "obvious computer things," subjectively speaking; you probably won't find "keyboard, definition of" in such a book.

Familiarity (unfamiliarity) bias. When an indexer is particularly interested in or knowledgeable about a subject, the indexer is likely to create more entry points for the same content than another indexer might. For example, an indexer who is familiar with "Rollerblading" might realize that Rollerblade is a brand name, and that the actual items are called inline skates. This indexer is more likely to include "inline skates" as an entry. Unfamiliarity bias would be the opposite, in that multiple entry points are not provided because the indexer doesn't think of them, or perhaps doesn't know they exist.

Positive value bias. An indexer who has reason to make certain content more accessible for readers to find (and read) is likely to create more entry points for that idea. At the extreme, the indexer will overload access by using multiple categorical and overlapping subtopics, where those subcategories are at a higher granularity than the information itself. For the generic topic of "immigration," for example, an indexer might include categorical entries like "Hispanic immigrants," "European immigrants," and "Asian immigrants," as well as overlapping topics like "Asian immigrants," "Chinese immigrants," and "Taiwanese immigrants," with all of them pointing to "immigration" in general.

There are three types of positive value bias. Personal positive value bias is demonstrated when the indexer himself believes that the information is of greater-than-average value. Environment-based positive value bias is demonstrated when the indexer is swayed by environmental forces, such as social pressures, political pressures, pre-existing media bias, and so on. Finally, other-based positive value bias is demonstrated when the indexer bows to pressures imposed by the author, client, manager, or sales market (i.e., the person paying the indexer for the job). Although it can be argued that this last type of bias is not the indexer's bias, strictly speaking the indexer can choose to fight any bias forced upon him. For example, an indexer instructed by a client to "index all the names in this book" might interpret that instruction as some kind of market bias, and thus refuse to follow the guideline. In reality, however, most indexers do accept the pressures placed upon them by the work environment, and thus in my opinion take on the responsibility and ethical consequences of this choice.

Negative value bias. It's possible for an indexer to provide fewer entry points for content that he feels is not of great importance to readers -- the direct opposite of positive value bias -- but the reasons for limiting access to that content are probably not related to the indexer's perceived value of that content for readers. Instead, indexers are likely to limit access to content when there is a significant amount of similar content in the book, such that including access to those ideas would either bulk up the index unnecessarily or waste a lot of the indexer's time. For example, if an indexer were faced with a 40-page table of computer terms, it's unlikely that each term would be heavily indexed, even if such indexing were possible and even helpful to readers.

For this reason, I believe that there are three kinds of negative value bias: time-based negative value bias, in which the indexer skimps on providing access in an effort to save time; financially motivated negative value bias, in which the indexer skimps on providing access in an effort to earn or save money; and logistical negative value bias, in which the indexer skimps on providing access in response to logistical issues like software limitations, file size requirements, page count requirements, controlled vocabulary limitations, and the like.

Topic combination (lumper's) bias. This bias is exhibited by indexers who are likely to combine otherwise dissimilar ideas because they find this "lumping together" of ideas to be aesthetically pleasing or especially useful. This kind of bias is visible in the ratio between locators (page numbers) and subentries, in that entries are more likely to have multiple locators than multiple subentries, on average. For example, an entry like "death penalty, 35, 65, 95" shows that the indexer believes the content on these three pages is similar enough that subentries are not required or useful. Topics that start with the same words might also be folded into a more general topic (such as combining "school lunches" and "school cafeterias" into "school meals"). It is worth noting that some kinds of audiences or documentation subjects may tend toward topic combination bias; for this reason, it may be difficult to recognize lumper's bias.

Topic separation (splitter's) bias. This bias is exhibited by indexers who are likely to separate otherwise similar ideas because they find this "splitting apart" of ideas to be aesthetically pleasing or especially useful. As with lumper's bias, splitter's bias is represented by the ratio of locators to subentries throughout an index, in that splitters are likely to create more subentries than would other indexers, on average. It is worth noting that some kinds of audiences or documentation subjects may tend toward topic separation bias; for this reason, it may be difficult to recognize splitter's bias.
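One rough way to surface either tendency, sketched here in Python against a made-up two-entry index (an illustrative heuristic, not a standard evaluation metric), is to compare the total number of locators to the total number of subentries:

# Hypothetical index structure: each heading carries its own locators
# plus any subentries, each with locators of its own.
index = {
    "death penalty": {"locators": [35, 65, 95], "subentries": {}},
    "writing": {
        "locators": [],
        "subentries": {
            "as artistic skill": [84],
            "fiction vs. nonfiction": [62],
        },
    },
}

total_locators = 0
total_subentries = 0
for entry in index.values():
    total_locators += len(entry["locators"])
    total_subentries += len(entry["subentries"])
    for sub_locators in entry["subentries"].values():
        total_locators += len(sub_locators)

# A high ratio suggests lumping; a low ratio suggests splitting.
print(f"locators per subentry: {total_locators / max(total_subentries, 1):.1f}")

Even so, whether a given ratio signals lumping or splitting still depends on the audience and the subject matter, as noted above.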

These are all the biases I've found or experienced. If you think there's another kind of bias that indexers exhibit, let me know.

The remaining question is this: Is it wrong for an indexer to have bias? That is, should indexers study their own tendencies and work to avoid them? I don't think it's that simple. The artistry that an indexer can demonstrate is fueled by these biases -- experiences, opinions, backgrounds, interpretations -- and perhaps should even be encouraged. An indexer's strengths come from his understanding of not just the material, but also his perceptions of the audience, the publication environment, and the audience's environments. Further, indexers who know and love certain subjects are going to be drawn to them, just as many readers are; these biases aren't handicaps so much as commonalities shared between indexers and readers. Biases will hurt indexers working on unfamiliar materials in unfamiliar media, but under those conditions the biases are the least of our worries; when the indexer is working without proper knowledge, the higher possibility of bad judgment or error is a much greater concern.

If anything, indexers should be aware of their biases because they can serve as strengths -- especially in comparison to what computers attempt to do.



04 April 2006

 

Whatever happened to "indices"?

My uncle (among others) asked me, "Is the word indices no good anymore? It seems that indexes has won."

I didn't know those words were competing, but yes. If a U.S. winner were to be declared today, I'd have to go with indexes.

Although strictly speaking the correct term is indices, I think in common speech a distinction has been made for an item that is rarely plural. For example, when referring to an appendix in the back of a book, often there is more than one: appendices. However, when referring to the (vermiform) appendix in the human body, rarely do you talk about more than one at a time, and thus "appendixes." (And then there's the acronym, APPENDIX, which simply doesn't count.)

With book indexes, there is rarely more than one -- although you can certainly talk about the indexes across the books, as I do. But in database programming and similar constructions, often each line in the database (each record) has its own index. And so you can have thousands of indices. When you're working with indices (as opposed to indexes), you're working with large quantities of small bits of information.

To me, this logic is what's also behind such oddities as the words "persons" and "peoples." These terms, though related, are attempts at showing quantity in environments where quantity is much less likely. Said another way, these words are attempting to emphasize the value of the singular, even while referring to more than one. Thus "persons" is used in legal contexts where the individual is important, "people" is referring to a group of beings in which individuality is not important, and "peoples" is referring to a collection of groups of beings in which the nature of each group remains important.

How's THAT for an answer?

Of course, the only real test is if there are other words that seem to follow the same pattern. So far, I can think of only index and appendix. Others?


