Seth Maislin's Indexing Blog: March 2006

24 March 2006

The granularity of an online "page number"

When writing a hyperlinked index (where hyperlinks are used instead of page numbers), to what should those links point?

Some people think they should point to the section title in which the information is provided; other people like to point right to the specific word used in the index. The "answer," obviously, is that hyperlinked index entries should take the readers to where the information is, right? The problem -- the reason there's this question of "where do I point my entries" in the first place -- is that readers of hypertext might find themselves bounced somewhere they don't understand. How many times have you followed a link, only to find yourself fiddling around with the scroll bar to figure out where you ended up? Following a hyperlink is like being blindfolded and transported to an unknown destination.

What you may not know is that book indexes aren't much different. :-)

Think about how book indexes actually work, and you realize that they direct readers to the page on which the information starts. An entry like "buoyancy, 164" tells the reader to look somewhere on page 164; an entry like "global harmony, 164-167" tells the reader to start looking somewhere on page 164. The granularity of an index is defined as the smallest unit of area that can be pointed to. For printed indexes, this area is the page. Rarely will you find locators that use fractional or qualified page numbers like 164-1/2 or 164top. (There are such things as qualified locators, like 164f, which might point to the footnote on page 164, but even in the books in which they're used they comprise only a small number of all locators used.)

If you follow the standards of the industry, then, the granularity of a printed index is one physical page. For this reason, books that have lots of words on a page -- big pages, narrow margins, tiny print -- are less friendly to book indexers. It's like telling someone that there's a needle in that 164th haystack over there. Maybe we should count our blessings that someone bothered to number the haystacks, but ideally this is where the book designer starts earning her salary. Book pages don't have to look like haystacks -- more accurately, wordstacks -- if the book has legible headings and subheadings. Books can be written with quickly visible landmarks within the pages, like italics and boldface, larger and smaller font sizes, headings and callouts, footnotes, and so on. Going back to the blindfolded analogy, there's no reason we have to drop our readers into deserts of information, when we can drop them in a place surrounded by location clues and navigational signs, like at a train station.

On the Web, however, there is no such thing as a printed page. Web pages can be any length, from tiny pop-up windows with only a sentence fragment of information within, to long scrolls of endless paragraphs and images. Additionally, you don't have to direct the reader to just the page any more, but rather you can deposit him anywhere within the page. The granularity of a Web page is a word! You can send someone into the middle of a paragraph.

When you have tiny little windows of information, using that window as a destination is a no-brainer: the reader arrives at a single sentence of information, which is what he needs. It doesn't matter if you point him to the beginning, middle, or end of that sentence, because it's all they get to read. Pointing someone to an isolated window of information -- what Web authors call "chunks" -- is as easy as looking into a food pantry that contains only a single can. But when you have longer pages, and you have the ability to point someone to any spot within those longer pages, you have a decision to make. And it's a decision that didn't exist in the printed world, with its larger granularity.

The solution is to connect the text of the index entry with the text of the documentation. Not the meaning, but the actual words. If the index entries are written to almost identically match those of the documentation, then the reader won't mind as much because it won't look like a desert. They'll have exactly the landmark they need right in front of them. The entry "cancer, prevention of," for example, could point directly to this line without a problem:

... cessation of smoking. In fact, many physicians are well aware that one way to prevent cancer is to quit ...

That's because the words of your index entry, which are cancer and prevention, appear almost verbatim in that line of text. And if this information were part of a section titled "Using Peer Pressure to Help Patients Quit Smoking," then you really wouldn't want to point to the heading for context. That's because it's unclear to the reader that you're actually directing him to information about cancer or prevention. You're making them work at it.

And then there's the other situation. Using the same sentence and heading as above, where should the indexer point readers who look up the entry "smoking, how to quit"? Clearly they should go right to the heading. If they went to the line that talked about physicians, they wouldn't know where they are.
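For the programmers in the audience, here is a minimal sketch of the decision described above, assuming we score each possible destination by how many of the entry's words it visibly shares. Everything in it is my own invention for illustration: the names choose_target and key_terms, the tiny stop-word list, the five-letter "stemmer," and the rule that ties go to the heading. It is not a real indexing tool.

```python
import re

# Hypothetical stop-word list for this one example; a real tool would need a better one.
STOP_WORDS = {"a", "an", "and", "how", "in", "is", "of", "one", "that", "the", "to", "way"}

HEADING = "Using Peer Pressure to Help Patients Quit Smoking"
LINE = ("... cessation of smoking. In fact, many physicians are well aware "
        "that one way to prevent cancer is to quit ...")


def key_terms(text):
    """Lowercase the text, drop stop words, and crudely stem to five letters
    so that 'prevention' and 'prevent' count as the same word."""
    words = re.findall(r"[a-z]+", text.lower())
    return {word[:5] for word in words if word not in STOP_WORDS}


def choose_target(entry, heading, lines):
    """Return ("heading", text) or ("line", text): whichever destination
    shares the most visible words with the index entry."""
    wanted = key_terms(entry)
    best_kind, best_text = "heading", heading
    best_score = len(wanted & key_terms(heading))
    for line in lines:
        score = len(wanted & key_terms(line))
        # Strictly greater: on a tie the heading wins (an assumption of this sketch),
        # since the heading also gives the reader context.
        if score > best_score:
            best_kind, best_text, best_score = "line", line, score
    return best_kind, best_text


print(choose_target("cancer, prevention of", HEADING, [LINE])[0])
print(choose_target("smoking, how to quit", HEADING, [LINE])[0])
```

The sketch compares only words, never meaning, which is exactly the point: the reader arriving at the destination can only recognize words, too.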

Our original question here was this: When writing a hyperlinked index (where hyperlinks are used instead of page numbers), to what should those links point? Clearly the only way to answer this question comprehensively is to suggest that the language of hyperlink indexes has two contexts: the index entry itself and the destination location. These two contexts need to work together. And as we saw, the same is true with the printed book: having arrived at page 164, how quickly can you find the idea you were looking for?

Looking this closely at hyperlinked indexes only emphasizes something we need for all indexing: use index entries that match the documentation text. If you have to write a slightly longer entry, that's okay. Instead of "cigarettes," use "cigarette smoking, quitting." Instead of "social networks," use "social networks and peer pressure." The people who work with search engines and Internet marketing are familiar with the term trigger words, which refers to visible language that matches the mental language of the searcher. If you're thinking of the words "white elephant," then a result of "pale pachyderm" doesn't work because it doesn't trigger your sense of recognition.

So the next time there's a white elephant in haystack 164, be sure to tell someone as explicitly as possible.

Labels: books, indexing process, keywording, pages and page ranges, web indexing

# posted by taxonomist @ 9:23 AM 0 comments

20 March 2006

Frustrated by a lack of meaning

Not for the first time, search engines have been wrongly criticized for the politics of their results.
As reported by The New York Times today ("Amazon Says Technology, Not Ideology, Skewed Results," March 20, 2006), an abortion-rights organization discovered and reported the appearance of biased results in Amazon's search engine. Apparently books with anti-abortion leanings appeared as more relevant on Amazon's search results pages. I am not taking sides on this highly charged issue; I am taking offense at the ignorance demonstrated by people who don't seem to understand how search works. (And I'm not singling out this issue either, as you'll see from my later examples.)

See, there isn't a search engine on the planet that can cull actual meaning from its databases. They can only look at the words themselves. Even search engines that analyze the behavior of their users still look at words and numbers, without interpretation.

Let me explain what really happened with Amazon, and why Amazon is not automatically in the wrong. Someone went to the search engine and typed in the word abortion. Now imagine that you're the search engine, and you have two results to give back. Result one is a book whose title is simply Abortion. The second is a book whose title is Understanding Abortion. Tell me: which result is more relevant? Answer: you have no clue.

When faced with this impossible question, the search engines at Amazon and elsewhere attempt to apply certain generalizations that might work in other situations, but simply don't work here. For example, there might exist a rule that puts Abortion ahead of Understanding Abortion because the title of the first book matches the query exactly, whereas the second title is only "half right." Or perhaps one of the books is 500 pages long, but the other is 200 pages long, and Amazon favors longer books. Maybe Amazon is interested in selling you the more expensive book, the book more recently published, or the book that gets a higher rating from all the people visiting the website. In the end, however, all of this analysis fails -- completely and utterly fails -- to answer a very simple question: which of these books is against abortion? Heck, even I don't know, and I invented them!
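To make that concrete, here is a small sketch of the kind of rules I'm describing. I have no idea what Amazon's actual rules are; the Book fields, the weights, and the page counts and ratings below are all made up, just like the two books themselves.

```python
from dataclasses import dataclass


@dataclass
class Book:
    title: str
    pages: int      # invented numbers for two invented books
    rating: float   # average reader rating, also invented


def relevance(book, query):
    """Score a book using only surface features: letters and numbers."""
    q = query.lower()
    title = book.title.lower()
    score = 0.0
    if title == q:
        score += 10.0           # the title matches the query exactly
    elif q in title:
        score += 5.0            # the title is only "half right"
    score += book.pages / 100   # hypothetical preference for longer books
    score += book.rating        # hypothetical preference for higher ratings
    return score


def rank(books, query):
    """Return the books from highest score to lowest."""
    return sorted(books, key=lambda book: relevance(book, query), reverse=True)


books = [
    Book("Understanding Abortion", pages=500, rating=4.5),
    Book("Abortion", pages=200, rating=4.0),
]
for book in rank(books, "abortion"):
    print(book.title, relevance(book, "abortion"))
```

Look at what goes into the score: a string comparison, a page count, a rating. You could add price or publication date in the same way. Nowhere is there a field for what either book actually argues, so no amount of tuning the weights will ever rank by point of view.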

With search, meaning is irrelevant. Search engines can look only quantitatively at the letters of the words, and at innumerable statistics (e.g., number of Web views) that have at best a tangential relationship with meaning.

Before we look at another example, let me also talk about another thing that Amazon did at one time. If you searched for the word abortion, in addition to your results you received what should be interpreted as a helpful search hint: "Did you mean adoption?" This might sound political, but the logic of this lies in the similar spelling of the words adoption and abortion. Given that there are many more books about adoption than abortion at Amazon, the search engine guessed that someone typing the word abortion might have misspelled something; the computer offered what it considered a reasonable alternative. Had that suggested word been something different -- "Did you mean apportion?" -- no one would have cared.
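Here is one way a hint like that could be produced, as a hedged sketch only. I am assuming a plain edit-distance comparison plus a popularity check; the catalog counts, the thresholds, and the function names are hypothetical, and I don't know how Amazon actually built its feature.

```python
def edit_distance(a, b):
    """Levenshtein distance: the number of single-letter insertions,
    deletions, or substitutions needed to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute (or match)
        prev = curr
    return prev[-1]


def suggest(query, counts, max_distance=2, min_ratio=2):
    """Offer the most common similarly spelled term, but only if it is at
    least min_ratio times as common as the query itself."""
    own = max(counts.get(query, 0), 1)
    candidates = [
        term for term, n in counts.items()
        if term != query
        and edit_distance(query, term) <= max_distance
        and n >= min_ratio * own
    ]
    return max(candidates, key=counts.get) if candidates else None


# Invented catalog counts: many more adoption books than abortion books.
COUNTS = {"adoption": 5000, "abortion": 1200, "apportion": 3}

print(suggest("abortion", COUNTS))
print(suggest("adoption", COUNTS))
```

Both "adoption" and "apportion" are two letter-changes away from "abortion." The only thing that separates them in this sketch is how many books each word matches. Spelling and counting, nothing more.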

By the way, I will admit that it is always possible that a company, like Amazon, could consciously manipulate its search results to accomplish some kind of selfish ends. Lycos puts sponsored links at the top; Yahoo promotes its internal products over those of others; Amazon presents the products of its more lucrative partners over all others. It is not far-fetched to imagine a company exercising editorial control for political or religious purposes, especially in today's age. The problem is that some issues are perceived as so volatile that no one is willing to consider coincidence of language as just that, a coincidence. Language is powerful stuff; spelling women as womyn to avoid the "men" letter subset is a powerful choice, whether you agree or not.

Here's another story, from the late 1980s. A search for the word monkey within a database of clip art provided by Microsoft produced a seemingly offensive result: a picture of African-American children. There was an uproar, and although Microsoft denied that it had done anything intentionally racist, it quickly removed the image from the database. The real problem, however, is that the children in the image were playing on monkey bars. Interestingly, if you stop to think about it, the only racism in this example is caused by the person who performed the search! That's the person who actually connected the word monkey with the children (and not the playground equipment); no one at Microsoft did. In this example, the giant void where meaning should have been was automatically filled in by the searcher, by association and as a reflex.

Here's another story, from last year. An article (I can't remember where) described how a bad critical review of a specific performer appeared more relevant in a Google search than the good reviews -- or the performer's own website. This would be equivalent to searching for me ("Seth Maislin") and getting a top result of "Seth Maislin Has Bad Teeth" instead of this blog, my website, or one of my interviews at O'Reilly & Associates. In this case, Google isn't passing judgment, but it certainly feels like it! Instead, it's looking at how popular that Bad Teeth article might be, or its host (for example, it might be a Wall Street Journal or People Magazine article, periodicals that have readerships thousands of times larger than anything I've ever done), and using that popularity to push the article to the top. It's assuming -- wrongly, in this case -- that people looking for me are less interested in my website than in what People Magazine or the WSJ has to say.

Search just doesn't care. If you're looking for meaning, don't ask a search engine.

Labels: keywording, search engines

# posted by taxonomist @ 9:49 PM 1 comments

15 March 2006

Bikini emergencies

Index entries can be divided into two categories: the things people look up, and the things they don't. The first category is straightforward, but the second category is inspiring.

I think it's fair to say that index entries that no one looks up are, well, better left out of the index in the first place. If you're reading a book about the cardiovascular system, there's little point in including the index entry "jack-o-lanterns."

It's at this point that I'd love to say, "Enough said," but I'd be wrong. That's because having entries that no one looks up isn't really a problem, as long as you don't have too many of them. Ignoring the costs of the physical space they require, or the indexers' and editors' resources in making them actually appear in that physical space, unused index entries can exist without anyone really caring, like pennies in a penny jar. As long as the jar isn't full, no one cares.

Why was the entry there in the first place, though? There are some legitimate reasons, of course, with the most legitimate being a reflection of an author's non sequitur. It's not too far-fetched to imagine a cardiovascular surgeon authoring a textbook about the cardiovascular system, taking a moment to wax poetic in a footnote about how surgically cutting into the chambers of the human heart always reminds him of pumpkin carving on Halloween. If he takes even a half-sentence to explain why heart surgery has something in common with ritualistic gourd mutilation, the indexer will notice this and create that silly entry for "jack-o-lanterns." No one will look it up, but that's not the indexer's fault, is it? She's just doing her thorough-as-usual job.

Then there are the indexers who include things without realizing who the audience is. They include ideas that are too esoteric, off-topic, general, or inappropriate for people to look up. For example, in a book about pet care, they might include an entry for "schnauzers. See dogs." Alternatively, they index the ideas people will look up, but under labels that no one would look up, like using the term "octothorpe" as a name for the # symbol.

And of course there are always the honest mistakes: spelling or typographical errors, document file anomalies, outdated indexes for new materials, and so on.

What most indexers seem to forget, however, is that just as these odd index entries might be created by accident -- author silliness, indexer ignorance, production oversight -- those very same index entries might be discovered by accident, too: reader serendipity. One of the great advantages of printed indexes over search results (or search-accessed indexes) is the serendipity that results from browsing. Just as we're likely to discover interesting words in the dictionary while trying to look up something else -- look up "Jefferson, Thomas" and find yourself reading about jeffing and jeffus -- so we might discover interesting things in those indexes.

If you found "jack-o-lanterns" in a cardiology textbook, wouldn't you follow the entry? I know I would. And that's why there's one last reason for including these unlikely-to-be-used entries in indexes: sheer joy.

Did someone say, "bikini emergency"?

Labels: human factors, keywording

# posted by taxonomist @ 9:15 PM 0 comments

14 March 2006

Fixing the books that leak

The more that I think about the presentation I am giving in April to the ASI New England Chapter, the more I find myself contemplating the environment in which indexers work. Why do so many indexers have trouble bidding for work?

Because clients don't grasp what this industry is about.

I'll make the analogy between a book without an index and a leaky faucet. The client recognizes that his sink faucet is malfunctioning. He is unable or unwilling to fix the faucet himself. He investigates the services of a professional faucet fixer.

Taking the three most common bidding environments for indexers, we get these three models for an owner of a leaky faucet.

Model 1: The Drowning Homeowners at Midnight. Having waited too long, or perhaps because another faucet-fixer failed to complete the job, these clients need help NOW. They're willing to pay any amount of money, but their standards are very low. "Please," they say, "just stop the leak." If they happen to know how to reach a specific faucet-fixer at midnight, that faucet-fixer is going to make a lot of money. But if there's no one they can call at midnight, they're going to call everyone, leave a lot of messages, and wade knee-deep in hundreds of early-morning responses. It's a bidding war, and the least expensive person wins.

For indexers to make this model work, they need to have their names right there on the clients' desks, and they need to be prepared to sacrifice quality for the sake of an important deadline. However, this model is terrible for the industry, because it leads to bidding wars where indexers underbid each other for the privilege of doing a lousy job.

Model 2: The Paranoid Homeowners. Here are clients who have worked with so many bad contractors over the years that they simply don't trust anyone. They're looking for someone who will fix the faucet according to ridiculously robust specifications, and who will allow them to stand over his shoulder, watch his every move, and offer suggestions and instructions all along the way. It's obvious these homeowners would do the work themselves if they could -- which means all of those specifications, suggestions, and instructions are coming from a place of very hostile ignorance. With this model, indexers will find themselves burned almost every time.

This model doesn't help the individual indexers or the indexing industry, yet the model exists only because indexers as a whole failed to uphold any kind of standards, or to educate their clients about indexing. It's not uncommon for a more experienced indexer to clean up the mess left behind by someone without appropriate experience or sufficient resources to get the job done right the first time. When faced with this model, the indexer must be prepared to uphold his own principles, explain his choices, demonstrate precedent, and remain friendly throughout. I suppose it's like trying to talk about love to someone who just got divorced.

Model 3: The Proud Fixer-Uppers. Then there are the folks who would be happy to fix the faucet themselves but don't have the patience, the know-how, or the resources. They want help, but on some level they resent needing it. These are the people who, when presented with a faucet-fixing cost of $80, ask, "Will you do it for $40?" If you say no, they'll say, "What if you just put the pieces in a line and I'll wrench them together myself?" Say hello to the silly goons who would rather let their faucets leak than admit they need help worth paying for.

This model exists because the industry is underrespected. If someone had an overflowing toilet, do you really think they'd try to haggle with the plumber? But with indexing, this happens all the time. "Just index the headings" and "Do the best you can in only three days" are all too common in an industry where people aren't aware of the advantages of a good index. The individual indexers perpetuate this problem by accepting these underpriced, undervalued jobs and creating mediocre products.

Yes, there is a fourth model, and that's the ideal situation: someone who understands that indexing is a specialized trade that requires a professional with the appropriate education, background, and resources to get the job done. If this model were the most common, though, would indexers still be afraid of bidding? Nope.

What would it take to convert the clients of the three models above into the clients of a more reasonable, respectful model? I've heard a lot of suggestions -- education, indexing standards, indexer credentialing -- but the problem runs a lot deeper than most people think. There are two fundamental issues here: knowing what a good index is, and knowing that an indexing profession exists.

With a faucet, all we need is water on the floor to know it's broken, but you can't "replace a washer" to make a bad index into a good index. And while most home dwellers have heard of plumbers, few of the world's literate population have heard of indexers! Trying to educate the world about indexing is like trying to explain the ideal gas law to toddlers: they may breathe the air, but that doesn't mean they know how it works. (And don't ask their parents, because they don't know either!)

The law that mandated the wearing of seat belts in motor vehicles raised awareness of their value. Would a law like that work for indexes?

# posted by taxonomist @ 9:45 PM 0 comments

01 March 2006

College content management

If you thought online education was a growing industry, you underestimated. It's about to explode. That's because Congress is now allowing federal financial aid to students of colleges that teach more than half of their courses off-campus, including over the World Wide Web. (See "Online Colleges Receive a Boost From Congress," The New York Times, March 1, 2006.)

Let me rephrase that. Colleges are going to explode. Like scrambling an egg.

There are two types of courses that are likely to appear online at exponential speeds. First, you have the low-end requirement courses, like Calculus I and English 101, which are courses everyone has to take to earn a bachelor's degree. These courses are regularly populated with large numbers of students, and yet they are introductory-level, tried-and-true courses that are relatively simple to teach and grade. (In fact, graduate students usually teach these courses.) Instead of wasting valuable classroom space and valuable instructor time, these courses will appear online. For large universities, there's an even greater advantage: hundreds of students can be taught at once without that "giant lecture hall" atmosphere, and without having to create numerous sections (class segments) for advising lessons or grading purposes.

The other type of course that is going to end up online is the course that professors are just climbing all over themselves to teach, either because it's a cutting-edge idea, the product of a personal info-lust or -peeve, or simply because it's so esoteric that their departments have never granted them the opportunity to teach something that almost no one will attend. These kinds of courses will prosper online because students from all over the world can attend. That sociology elective called "What License Plates Tell Us About Our Culture" won't get just one student any more, but tens of students, and that makes professors and universities happy. I'm an online instructor of a rather esoteric subject -- indexing -- and so being able to offer my indexing course over the Internet has enabled me to reach dozens of indexing professionals and enthusiasts from around the globe every year. Had I continued to teach only in person, the course might have been cancelled for lack of interest.

So suppose every college in the United States retools that Calculus I class into an online course. It's not easy building an online course, but talk about unnecessary redundancy! No, if department heads are smart, they'll team with the department heads from other colleges and share a course. For example, I can imagine a single English 101 course offered to every college student in Massachusetts. I'm not saying this is a perfect idea, but then again, neither is having 14538 of them.

This is a content management problem (which is why I'm blogging about this). Content management is about many things, including (a) avoiding redundancy in communication; (b) avoiding the communication of inaccurate, outdated, or contradictory things; (c) providing the correct information for each audience subset, whether it's a single person (like "Welcome, Seth!") or a large group (like English speakers); and (d) communicating everything that needs to be said. Content management is a huge issue in the distribution of information, one that got even more obvious with database use and the Internet.

The magic question here is this: How many different Calculus I courses do we need? From a production standpoint, our goal is to have only one. In reality, however, there are many reasons a student might prefer one version of this course over another: the instructor (charisma, ability to teach, ability to communicate, educational background, current interests, track record in past courses, reputation in the industry, etc.); the course materials (ability to relate to examples, quantity of independent and group exercises, immediate relevance of exercises to students [e.g., local interest], ethical choices, strictness of prerequisite management, etc.); the course delivery (tools requirements, number of lectures, number of students, grading methods, student-to-teacher and student-to-student interactions, what percentage of the course is different from that in previous semesters); the course environment (reputation of university, opportunity for real-time meetings [chat, voice, face-to-face], textbook requirements); and so on. As you can see, there are a lot of reasons one course might be "better" than another, for a particular student.

And so you have a battle: competition for students vs. need to teach the basic materials. Colleges and universities are already doing this at a macro level -- compare MIT to Bunker Hill Community College, as in the film Good Will Hunting -- but the competition at the lower level is going to be very interesting.

Further, if course development follows the path of computing, parts of courses are going to be delivered as if by subscription. Imagine some guy in his attic churning out math problems for fun. (Believe me, it's real.) Professors can subscribe to this guy in the same way you find those Sudoku puzzles. In computer parlance this is called distributed computing: where a bunch of computers are working together, but each is working on a separate problem. (It's sort of like wearing a wristwatch to tell the time, carrying an iPod to listen to music, and attaching a bottle opener to your keychain, instead of investing in a WatchPod Opener.)

Distributed education means the end of campus life as we know it! Professors moderate courses written by dozens of international specialists, students take courses moderated in other countries, and grades are applied to the diploma of your choice. Campuses become less about learning and more about community events, just as public libraries and shopping malls have been forced to evolve.

Congress has made the right choice, but is this country truly ready for college management?

Labels: content management

# posted by taxonomist @ 12:09 PM 1 comments
