Seth Maislin's Indexing Blog

20 January 2007

Foreword to Heather Hedden's upcoming book

I was asked to write the foreword to Heather Hedden's upcoming Indexing Specialties: Web Sites, to be published in 2007 by ITI. Given the importance of this book in the indexing industry, I am reprinting that foreword here. For more information on the book itself (not yet available), visit either ASI's publications page or a list of ITI's indexing publications.

- - - - - -
Foreword

Indexing is not a popular profession by any stretch of the imagination. Not only is it almost completely unknown in lay circles, but let's be honest: writing indexes sounds about as exciting as cleaning the house, but a hundred times harder. Also, if you were born in any year before 1990, the idea of Web indexing sounds like cleaning a house in outer space. I mean, there are no houses in outer space.

The Internet and the Web -- this monstrously huge and growing system of sharing data -- desperately need more information sorcerers like Heather Hedden. Not only does Heather have the talent to recognize when knowledge is missing, but she also has the ability to make that knowledge visible. She starts by learning for herself, and then she loves to share.

Heather and I first crossed paths in my classroom, where I taught a course called "Writing Indexes for Books and Websites." My course was written to explore the questions and theories of indexing, and so couldn't be limited to just books. Heather’s interest went much further, and since then she has explored writing web indexes as a singular discipline. For me, Heather has been a student, an apprentice, and a role model. She's someone I count on to get things done. She has vaulted across the lines from library science to book indexing to web indexing, each time with surprising success, and has since become a renowned and respected expert in the web indexing community.

Indexing Specialties: Web Sites is a book filled with honest, get-it-done advice. Heather is not afraid to talk about the code and the tools, because she has faith in her readers. In her hands, the complicated stuff looks straightforward. Besides, when the technical lessons are over, Heather shows readers how to think about web indexing as well: as a process and as a business. Until now, if book indexers wanted to graduate to the Internet frontier, they had no unified place of reference, no single source of everything they'd want to know. In fact, some of the tools Heather includes in this book were almost completely unknown to indexers until now.

I am excited and pleased to see Heather compiling this knowledge in a book. She has put into print an indexer's Rosetta Stone, which will lead book indexers toward other information management topics like taxonomies, information architecture, and search tools. It's not about complicated coding practices and computer programs, but about the guidelines to getting that A-to-Z index published on the Internet, and doing it right.

She begins by exploring the boundaries of web site indexing, clarifying what kinds of sites need indexing, how they should look, and how they should work. Then she immediately provides the HTML building blocks to making your indexes appear on the Web, the surprisingly simple code you'd need to create index pages, index entries, indentations, hyperlinks, and cross-reference links. If you've never programmed on the Web before and are afraid it's over your head, you’ll be kicking yourself once you see how easy Heather makes it.

Once you're armed with the grammar, you next need the tools to actually write. Heather gives you the detail about the tools (CINDEX, HTML Indexer, HTML/Prep, Macrex, SKY Index Professional, and XRefHT) to create or generate indexes that are ready for web publication. She takes more time exploring the specialized tools of XRefHT and HTML Indexer, two stand-alone web indexing applications, and shows how you can use their features with agility.

The last third of the book is dedicated to the "mindspace" of web indexing. There's more to indexing than just the tools, and so Heather writes carefully about how indexers should approach the job. She addresses the challenges of working out of order, adding anchors, indexing periodicals, and knowing which pages and at what level of detail you should index. She deals in detail with cross-references, language, subentry structure, and format. Finally, Heather dives into the nitty-gritty of the web indexing marketplace, including how to market yourself as a web site indexer.

Web Sites is going to satisfy you immediately and in the long term. On behalf of the American Society of Indexers -- and myself, personally -- I am honored to welcome Heather as an esteemed author in our community.

Seth Maislin
President of the American Society of Indexers (2006-2007)

Labels: books, web indexing

# posted by taxonomist @ 11:15 AM 1 comments

28 December 2006

Eighteen million people can't be wrong

No matter how much you and I might like Google, the fact is that Google has some very serious problems when it comes to finding content. More specifically, if you're looking for the "right" answer, or if you're attempting to do any serious research, Google is likely to fail you miserably.

The flaw lies in Google's strength: social algorithms. Social algorithms are processes in which decisions are made by watching and following the majority of people in a community. If blogger.com tends to be the place people go to create blogs, then a social algorithm will see blogger.com as "better." When a search engine is managed by a social algorithm, a website might appear first in search results not because of the quality of the site, but rather because a larger number of people treated the site as if it were of higher quality. In other words, social algorithms equate "majority" with "best," something that often looks right but actually is patently untrue.

When you perform a search at Google.com, your results are sorted based on majority behavior and little else. For simple questions about anything -- as well as complex questions about cultural issues, for which "lots of people" is critical -- the majority opinion is frequently rather close to what you want, which is why Google is so successful. But the gap between "close to what you want" and "accurate" is an invisible one, and that makes it insidious and dangerous.

For example, search for "Seth Maislin." The first hit is my website. The second hit is this blog. The third hit is an interview I did for O'Reilly & Associates in July 1999. An investigation of why these are the top three sites is rather interesting. First of all, these are the only results in which my name actually appears in the title; the fourth link and beyond have my name in the document, but not the title. Second, my website appears at the top not because it's the definitive website about "Seth Maislin," but because Google knows of 24 people linking to it. In comparison, the only person who ever created a link to this blog is me -- a number far less than 24! The same goes for the O'Reilly interview, except that the single linker isn't even a valid site any more: it's broken. The popularity of my home page (in comparison to this blog, for example) is why it's a better hit for my name. But if you folks out there started to actually link to this blog, that would change.
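To make the link-counting idea concrete, here is a minimal sketch in Python. It is not Google's actual algorithm, which is far more involved; it assumes a toy ranking that sorts pages purely by how many sites link to them. The page labels and the linker names are hypothetical placeholders, and only the counts (24, 1, and 1) come from the example above.

```python
# Hypothetical link data: each page maps to the set of sites that link to it.
# Only the counts (24, 1, 1) come from the post; the names are placeholders.
inbound_links = {
    "home page": {f"linking-site-{i}" for i in range(24)},
    "this blog": {"home page"},
    "O'Reilly interview": {"broken site"},
}

def rank_by_popularity(links):
    """Sort pages by number of inbound links, most-linked first."""
    return sorted(links, key=lambda page: len(links[page]), reverse=True)

for page in rank_by_popularity(inbound_links):
    print(page, len(inbound_links[page]))
```

Nothing in this toy ranking looks at what a page says. Add enough links to the blog's set and it would move to the top, whatever its quality.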

You should look into the search results for the word "Jew." A website known as JewWatch.com, an offensive and inflammatory collection of antisemitic content, had appeared as the number-one result at Google.com for this one-word query. This happened because a large number of supporters of this site tended to build links to it; then, those who were outraged or amused also linked to it within their protestations. In the end, the social algorithms at Google recognized how popular (i.e., "linked to") this site was, and in response rated it very highly -- in fact, rated it first -- compared to all other websites with the word "Jew" in the title. Eventually, those who were enraged by this content fought back by asking as many people as possible to link somewhere else -- specifically, the Wikipedia definition of Jew -- just as I have here. Over time, more people linked to Wikipedia than to JewWatch, and so the latter dropped into second place at Google. This process of building networks of links in order to influence Google's social algorithm is called "Google bombing." In other words, when the people who hated the site acted together in a large group, Google's social algorithms responded.

(By the way, you'll notice that I do not create a link to the offensive site. I see no reason to contribute to its success.)

Do you see the problem? The success of Google bombing is analogous to the squeaky wheel metaphor, that the loudest complainer gets the best service. Social algorithms reward the most popular, regardless of whether they deserve it. JewWatch made it to the top because it was popular first; Wikipedia's definition moved to the top because those offended banded together to demonstrate even more loudly. And in the end, there's no reason for me to think either of these links is best.

Whether popularity is a good thing or a bad thing is often subjective. In language, some people lament the existence of the word ain't, while others applaud its existence as an inevitable sign of change; either way, the word is showing up in our dictionaries because more and more people are using it. But I'm not talking about language; I'm talking about truth.

Do you think vitamin C is good at preventing colds? Well, it isn't; there have been no studies that demonstrate its effectiveness, but there have been studies that show it makes no real difference. (It's believed that vitamin C will shorten the length of a cold, but studies are still inconclusive.) But after a doctor popularized the idea of vitamin megadosing, our entire culture suddenly believes taking the vitamin will keep you extra healthy. Untrue.

Do you know why "ham and eggs" is considered a typical American breakfast? Because an advertising executive in the pork industry used Freudian psychology to convince people to eat ham for breakfast. He did it by asking American doctors if they thought hearty breakfasts were a good thing (which they did); the ad-man then asked if ham were a hearty food. Voila: ham, sausage, and bacon are American breakfast staples, and the continental breakfast vanished from our culture.

In both of these examples, majority belief trumps the truth. And look at the arguments about global warming! I won't repeat the arguments laid out by Al Gore in An Inconvenient Truth, but his argument is that as long as enough people insist that global warming isn't true, its dangers will remain unheeded. In fact, I'm not even going to argue here whether global warming is a real thing or not; it doesn't matter what I believe. What matters is that the debate over global warming isn't a fight over the facts. Instead, it's a shouting match, in which the majority wins. Right now, so many influential people have argued that it doesn't exist (or isn't such a big deal) that very little has been done in this country in response to its possible existence. But as more and more people start to believe it's at least possible, it's becoming a reality. Doesn't that just drive you nuts? Why are the facts behind global warming driven by democracy? Can't something be true even if no one believes in it?

One last look at this "majority rules" concept, only this time let's avoid politics and focus on simple word spelling. If you search for the word millennium, correctly spelled with two Ls and two Ns, you'll get about 54 million hits at Google (English-language pages only). If you search for the word millenium, misspelled with two Ls and only one N, you'll get 18 million hits. Of all the hits for either spelling, twenty-five percent are for the misspelling! For content that's published, that by its very nature is biased toward having only correct spellings, this error rate is monstrous! But does Google let you know that millenium is misspelled? Does it ask you if you "meant to type millennium?" No! After all, Google considers the misspelled word correct.
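The twenty-five percent figure can be checked with a couple of lines of Python. This is only a sketch of the arithmetic: the variable names are illustrative, and it assumes the two hit counts can simply be added together, which ignores any pages that contain both spellings.

```python
correct_hits = 54_000_000      # hits for "millennium", per the post
misspelled_hits = 18_000_000   # hits for "millenium", per the post

# Share of the misspelling among all hits for either spelling.
share = misspelled_hits / (correct_hits + misspelled_hits)
print(f"{share:.0%}")  # prints 25%
```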

I mean, eighteen million people can't be wrong, right?

Labels: Google, misspellings and other errors, search engines, social algorithms, spamming and similar behaviors, web indexing

# posted by taxonomist @ 10:49 AM 0 comments

24 December 2006

A needle in a haystack with 100,000,000 blades

The Internet has more than 100 million websites, according to the November Netcraft survey. If you were standing on top of the growth curve, by now your stomach would have nothing left to vomit up.

I did some math, and I've figured out a way to make sure that all of these websites are indexed. Here's what I discovered.

Between October and November 2006, approximately 3.5 million sites were created. Assuming that my team would be responsible for inventing a set of keywords for the whole site -- and not for individual pages or parts of pages -- we would have to build 3.5 million keyword sets.
Let's further assume that on average, every website would have four keywords or key phrases. For example, this blog would get the keywords "Seth Maislin," "indexing," "blog," and perhaps my company name, "Focus Information Services." Ideally we'd have the time to invent many more, since it's our goal to help the website perform well at the various search engines, but this team simply can't give everyone special attention. So I'm making the executive decision to limit ourselves to creating 4 terms each for 3.5 million sites.
Assuming that we can invent and type one keyword every two seconds -- a conservative estimate, given that my company name takes me a minimum of two seconds to type -- we'll need 28 million seconds to get the job done.
Now remember, we're just talking about the new sites created in October 2006. Consequently, we have only a month to get the job done before we have to start indexing the November 2006 sites. For this reason, I'm going to build a team of several people, with each one putting in eight hours per day, twenty days each month. That's 576,000 seconds per person per month.
Dividing 28,000,000 seconds per month by 576,000 seconds per person per month gives me 48.6 people, which I'll round to a nice 50 people. That means I need a team of just 50 people to get the job done.
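
The steps above can be retraced as a short Python calculation. This is a sketch of the same back-of-the-envelope estimate and nothing more: the variable names are illustrative, and every figure is taken from the steps above.

```python
import math

new_sites = 3_500_000        # sites created in one month
keywords_per_site = 4
seconds_per_keyword = 2

# Total typing time needed for one month's worth of new sites.
total_seconds = new_sites * keywords_per_site * seconds_per_keyword

# One indexer: 8 hours a day, 20 days a month.
seconds_per_person = 8 * 60 * 60 * 20

people = total_seconds / seconds_per_person
print(total_seconds)        # 28000000
print(seconds_per_person)   # 576000
print(round(people, 1))     # 48.6
print(math.ceil(people))    # 49, the smallest whole team that keeps up
```

Rounding up to 50 leaves a little slack for sick days and slow typists.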

So there you go: a team of 50 people can index the Internet. That doesn't sound nearly as bad as I thought. Of course, everyone will have to type rather quickly, and we'll need a system in place to prevent us from accidentally indexing any one website more than once, but that shouldn't be too bad. And yes, I'm assuming that all of these websites are in English, but most of them are; I'll bring a few translators to work on the few remaining.

At U.S.$50,000 per year per indexer, which is quite modest for a highly intense round-the-clock job like this, plus $100,000 for me as manager, I could probably put together a bid of about $2.6 million/year to get the job done. Given how many billions of dollars are spent or exchanged over the Internet today, that seems quite reasonable, too. Heck, I should triple the whole thing, since we'd have to re-index the old sites every once in a while. Maybe I should double it again, too, so we'd be allowed to use eight keywords instead of four.

So let's see, that brings the total bill to $15.6 million. Gosh, that isn't bad at all, is it? I mean, we all agree that indexing the Internet is at least a fifteen-million-dollar-per-year business, right?

Except it's not. Indexing the Internet is a zero-dollar-per-year business. No one is doing it. Just about no one seems to care about quality keywords. In fact, there are only two industries that exist around keyword creation. One of them is misnamed "search optimization," which is about spamming the heck out of the Web. Optimize, I think not: this is the opposite of the intelligent product my team would build. The other business is the search business itself, companies springing up around those fancy algorithms that Google, Yahoo, Lycos, Ask Jeeves, and the rest use. The thing is, those algorithms are just word-matching machines. These engines are looking for keywords, but none of them is actually writing any. So you see, no one with indexing training is writing any keywords. The inexpensive market for human indexers is being completely overlooked.

Guess it's not worth the fifteen million.

Labels: indexing process, keywording, search engines, spamming and similar behaviors, web indexing

# posted by taxonomist @ 1:58 PM 0 comments

24 March 2006

The granularity of an online "page number"

When writing a hyperlinked index (where hyperlinks are used instead of page numbers), to what should those links point?

Some people think they should point to the section title in which the information is provided; other people like to point right to the specific word used in the index. The "answer," obviously, is that hyperlinked index entries should take the readers to where the information is, right? The problem -- the reason there's this question of "where do I point my entries" in the first place -- is that readers of hypertext might find themselves bounced somewhere they don't understand. How many times have you followed a link, only to find yourself fiddling around with the scroll bar to figure out where you ended up? Following a hyperlink is like being blindfolded and transported to an unknown destination.

What you may not know is that book indexes aren't much different. :-)

Think about how book indexes actually work, and you realize that they direct readers to the page on which the information starts. An entry like "buoyancy, 164" tells the reader to look somewhere on page 164; an entry like "global harmony, 164-167" tells the reader to start looking somewhere on page 164. The granularity of an index is defined as the smallest unit of area that can be pointed to. For printed indexes, this area is the page number. Rarely will you find locators that use fractional or qualified page numbers like 164-1/2 or 164top. (There are such things as qualified locators, like 164f, which might point to the footnote on page 164, but even in the books in which they're used they comprise only a small number of all locators used.)

If you follow the standards of the industry, then, the granularity of a printed index is one physical page. For this reason, books that have lots of words on a page -- big pages, narrow margins, tiny print -- are less friendly to book indexers. It's like telling someone that there's a needle in that 164th haystack over there. Maybe we should count our blessings that someone bothered to number the haystacks, but ideally this is where the book designer starts earning her salary. Book pages don't have to look like haystacks -- more accurately, wordstacks -- if the book has legible headings and subheadings. Books can be written with quickly visible landmarks within the pages, like italics and boldface, larger and smaller font sizes, headings and callouts, footnotes, and so on. Going back to the blindfolded analogy, there's no reason we have to drop our readers into deserts of information, when we can drop them in a place surrounded by location clues and navigational signs, like at a train station.

On the Web, however, there is no such thing as a printed page. Web pages can be any length, from tiny pop-up windows with only a sentence fragment of information within, to long scrolls of endless paragraphs and images. Additionally, you don't have to direct the reader to just the page any more, but rather you can deposit him anywhere within the page. The granularity of a Web page is a word! You can send someone into the middle of a paragraph.

When you have tiny little windows of information, using that window as a destination is a no-brainer: the reader arrives at a single sentence of information, which is what he needs. It doesn't matter if you point him to the beginning, middle, or end of that sentence, because it's all he gets to read. Pointing someone to an isolated window of information -- what Web authors call "chunks" -- is as easy as looking into a food pantry that contains only a single can. But when you have longer pages, and you have the ability to point someone to any spot within those longer pages, you have a decision to make. And it's a decision that didn't exist in the printed world, with its larger granularity.

The solution is to connect the text of the index entry with the text of the documentation. Not the meaning, but the actual words. If the index entries are written to almost identically match those of the documentation, then the reader won't mind as much because it won't look like a desert. They'll have exactly the landmark they need right in front of them. The entry "cancer, prevention of," for example, could point directly to this line without a problem:

... cessation of smoking. In fact, many physicians are well aware that one way to prevent cancer is to quit ...

That's because the words of your index entry, which are cancer and prevention, appear almost verbatim in that line of text. And if this information were part of a section titled "Using Peer Pressure to Help Patients Quit Smoking," then you really wouldn't want to point to the heading for context. That's because it's unclear to the reader that you're actually directing him to information about cancer or prevention. You're making him work at it.

And then there's the other situation. Using the same sentence and heading as above, where should the indexer point readers who look up the entry "smoking, how to quit"? Clearly they should go right to the heading. If they went to the line that talked about physicians, they wouldn't know where they are.
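
The two cases above can be sketched as a small Python heuristic. Treat it as one possible reading of the advice, not as an established indexing tool. It assumes that "matching" means every content word of the entry shows up in the destination text, and it compares words by a crude five-letter stem so that "prevention" and "prevent" count as the same word. The function names, the stop-word list, and the rule of checking the heading first are all assumptions made for illustration.

```python
import re

# Small words ignored when comparing an entry with the text (an assumption).
STOP_WORDS = {"of", "how", "to", "the", "a", "an", "and", "in", "for"}

def stems(text):
    """Return crude stems (first five letters) of the content words in text."""
    words = re.findall(r"[a-z]+", text.lower())
    return {word[:5] for word in words if word not in STOP_WORDS}

def choose_target(entry, heading, sentence):
    """Pick the link destination whose visible words match the index entry."""
    wanted = stems(entry)
    if wanted <= stems(heading):
        return "heading"    # the heading already shows the reader the entry's words
    if wanted <= stems(sentence):
        return "sentence"   # only the sentence itself carries the trigger words
    return "heading"        # no match anywhere: fall back to the heading for context

heading = "Using Peer Pressure to Help Patients Quit Smoking"
sentence = ("... cessation of smoking. In fact, many physicians are well aware "
            "that one way to prevent cancer is to quit ...")

print(choose_target("cancer, prevention of", heading, sentence))  # sentence
print(choose_target("smoking, how to quit", heading, sentence))   # heading
```

The heading is checked first because, when both places match, it gives the reader the most context on arrival.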

Our original question here was this: When writing a hyperlinked index (where hyperlinks are used instead of page numbers), to what should those links point? Clearly the only way to answer this question comprehensively is to suggest that the language of hyperlink indexes has two contexts: the index entry itself and the destination location. These two contexts need to work together. And as we saw, the same is true with the printed book: having arrived at page 164, how quickly can you find the idea you were looking for?

Looking this closely at hyperlinked indexes only emphasizes something we need for all indexing: use index entries that match the documentation text. If you have to write a slightly longer entry, that's okay. Instead of "cigarettes," use "cigarette smoking, quitting." Instead of "social networks," use "social networks and peer pressure." The people who work with search engines and Internet marketing are familiar with the term trigger words, which refers to visible language that matches the mental language of the searcher. If you're thinking of the words "white elephant," then a result of "pale pachyderm" doesn't work because it doesn't trigger your sense of recognition.

So the next time there's a white elephant in haystack 164, be sure to tell someone as explicitly as possible.

Labels: books, indexing process, keywording, pages and page ranges, web indexing

# posted by taxonomist @ 9:23 AM 0 comments
