25 March 2007


Indexers indexing infinitely ... like monkeys

Three ideas have merged.

First, there's the idea I published last December as "A needle in a haystack with 100,000,000 blades," where I argued that the Web, or an approximation thereof, could be indexed by humans for a reasonable amount of money.

Second, there's The New York Times article "Artificial Intelligence, With Help From the Humans," in which we learn that the Amazon Mechanical Turk service subcontracts human workers to perform tasks that are especially challenging for computers to accomplish, such as matching images to textual descriptions. For some jobs, Turkworkers might make one penny per transaction.

And finally, there's the infinite monkey theorem, which states that a monkey hitting keys at a typewriter for an infinite amount of time will "almost surely" type the complete works of Shakespeare, or something similar. I first heard this idea as "a million monkeys and a million years," but I bet the math's a bit different. After all, "infinite" is much bigger than a million million.
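For the curious, here is a rough sketch of how different that math might be. Everything in it is my own assumption rather than part of the theorem: a 27-key typewriter (26 letters plus a space), one keystroke per second, a short test phrase instead of the complete works, and each non-overlapping run of keystrokes counted as one independent attempt. The function name `chance_of_typing` is made up for this illustration.

```python
import math

def chance_of_typing(phrase_length: int, keys: int, attempts: float) -> float:
    """Chance that at least one of `attempts` independent tries reproduces
    one specific phrase, with every keystroke uniform over `keys` keys."""
    p_single = float(keys) ** -phrase_length  # chance that a single try succeeds
    if p_single >= 1.0:
        return 1.0  # a one-key typewriter cannot miss
    # 1 - (1 - p)^attempts, written so that a tiny p is not rounded away
    return -math.expm1(attempts * math.log1p(-p_single))

# Assumed typing rate: one keystroke per second, around the clock.
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60

phrase = "to be or not to be"
monkey_seconds = 1_000_000 * 1_000_000 * SECONDS_PER_YEAR  # a million monkeys, a million years
attempts = monkey_seconds / len(phrase)  # non-overlapping tries at the phrase

print(chance_of_typing(len(phrase), keys=27, attempts=attempts))
```

Under these assumptions, a million monkeys with a million years still have far less than a one-in-a-million chance of producing even that single line. Only as the number of attempts grows without bound does the chance approach 1, which is where the "almost surely" comes from.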

Putting these ideas together seems to provide a rather obvious solution: third-world indexers. After all, if it costs only a nickel to get someone to write a few keywords for something, we can get a lot of indexing done very cheaply; I say "third world" because no indexer I've ever known is willing to work for a penny per word.

The indexing industry is facing the very real possibility that our workload will be taken from us and delivered to those in economies that allow lower prices. But what if we went a step further and, instead of looking for less expensive indexers with good qualifications, we decided to look for dirt cheap indexers with no qualification other than time to waste? What if, I ask, we asked monkeys to pound away at their keyboards?

I find the idea amusing but too close to the truth. After all, the intelligence behind Google is the social intelligence, the uneven and culturally biased workings of millions of Internet users plugging away at their disparate tasks. What Mechanical Turk has going for it, then, is the human decision making at the back end. Whereas most search engines look for better and greater stores of metadata with which to judge content, one man in a back room can make smarter decisions upon command. No, the real problem is that today's human intelligence is worth only pennies per word. Computers do their best, and humans sweep up afterwards. Our natural intelligence isn't worth a whole lot, I guess.

That's how we know computers are smart. Computers own us monkeys.

24 March 2007


The passive-aggressive bullies of the information world

An indexer, while building an index of historical documents for a small township on Cape Cod, Massachusetts, came across an old diary written during the American Civil War. She scanned the pages, filled with small, barely legible handwriting, and concluded that nothing important had been written.

The diary went unindexed.

This anecdote, shared by indexer Marilyn Rowland at the March 24, 2007, meeting of the New England Chapter of the American Society of Indexers, made me surprisingly uncomfortable. Certainly I agree that when something seems unimportant to the indexer, it should not be indexed; in fact, I've claimed many times within this blog that one of the biggest failings of computer-generated lists and search engine algorithms is that they cannot identify the true value (or correctness) of content, even when using social algorithms.

Still, not indexing someone's diary? This sounds passive-aggressive. So does this instruction: "Don't index the names of everyone in that photograph. Mention these two important people, and don't bother with the rest."

Just as scientists are often accused of sacrificing ethics and social responsibility in favor of "pure scientific exploration" (the temptation to clone human beings is a fun example), so might indexers be accused of excessive marginalization or trivialization of content. It may be human nature to filter out everything we don't need to survive or enjoy ourselves in our lives, but it is an indexer's nature to impose these filters upon future users. In other words, indexers are responsible -- on a daily basis -- for rewriting history.

Everything we create in our lives -- email messages to diaries, family snapshots to oil paintings, back-of-the-napkin notations to dissertations -- is subjected not just to the entropy of time but also to the red pen of the indexers. We may speak about the value of individuals, but in reality it's just a big game of Survivor, where the indexers are the ones to vote our creativity out of existence.

There is no good way to remove indexers from the equation, of course. If nothing were indexed, and no content were ever deemed to be more valuable (worth finding) than something else, content would be lost in the same way a paper cup with a lipstick stain inevitably disappears into a landfill. But who would have believed that indexers are the ones in control, that indexers are the Langoliers, who, like the big kids in school, get to decide who gets picked first for the schoolyard team and who doesn't get picked at all? We are, let's face it, the bullies of the information world.

Don't mess with me. I'll erase you.

17 March 2007


Notes on automatic indexing

"Automated indexing software" is, according to the common definition, software that analyzes text and produces an index without human involvement. I'm a firm believer that the technology doesn't exist, and that a human being is required to write an index. Thus I don't use the software, and I also don't recommend it.

There are those who advocate it, arguing that it's "not as bad as an indexer would have you think." These people are often coming from the standpoint that automatic software is faster and cheaper, and they're right. The real issue, then, is quality.

I believe that good automatic indexes will exist once there's good artificial intelligence, something that presently doesn't exist. In very limited circumstances, however, a machine can fake it; it can easily cull capitalized words from a textbook to create an approximation of an index of names -- although, again, the machine isn't going to differentiate between names like "David Kelley" and places like "San Francisco," since they are both of the same format and used the same way. It also won't know that "Bill Clinton" is also "William Jefferson Clinton." And certainly it can't tell when a name is being mentioned in a trivial, unhelpful way, as are the names in this paragraph! So imagine the problems in trying to get a machine to parse full sentences of ideas and recognize the core ideas, the important terms, and the relationships between related concepts throughout the entire text.
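To make the point concrete, here is a minimal sketch of that kind of culling. It is my own illustration, not how any particular product works: the pattern `CAPITALIZED_RUN`, the function `naive_name_candidates`, and the sample sentence are all made up for this example.

```python
import re

# Two or more capitalized words in a row, separated by single spaces.
# This is the machine's entire idea of what a "name" is.
CAPITALIZED_RUN = re.compile(r"\b[A-Z][a-z]+(?: [A-Z][a-z]+)+\b")

def naive_name_candidates(text: str) -> list[str]:
    """Return the unique runs of capitalized words, alphabetized."""
    return sorted(set(CAPITALIZED_RUN.findall(text)))

sample = (
    "David Kelley moved to San Francisco. "
    "Bill Clinton, born William Jefferson Clinton, visited later."
)

for candidate in naive_name_candidates(sample):
    print(candidate)
# Prints:
#   Bill Clinton
#   David Kelley
#   San Francisco
#   William Jefferson Clinton
```

The place sits in the list right alongside the people, the two Clintons appear as separate entries, and nothing in the output says whether any of these mentions was worth indexing in the first place.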

Those who advocate automatic software would argue that the machine gets "close enough" so that a human being can edit the resulting product. However, expert evaluators unanimously agree that the software fails; those who disagree are likely so unfamiliar with indexing in the first place that they are unable to see the differences in quality.

Oh, I should mention that there are software programs that human indexers use to simplify and speed up the mechanics of the indexing process. For example, it would be silly not to let a computer alphabetize the entries, reformat the index, and manipulate page numbers. A few software packages do this exclusively and are considered top of the line; other applications, such as Microsoft Word and Adobe FrameMaker, have some of these capabilities, with notable limitations.
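These mechanical chores are exactly what a computer is good at. As a rough sketch (my own, and not modeled on any of the dedicated packages), the following alphabetizes entries and collapses runs of consecutive page numbers. The function names and sample entries are hypothetical, and real indexing software handles far more, such as subentries, cross-references, and word-by-word versus letter-by-letter sorting.

```python
def collapse_pages(pages: list[int]) -> str:
    """Turn page numbers such as [5, 3, 4, 9] into the string "3-5, 9"."""
    ranges = []
    start = prev = None
    for page in sorted(set(pages)):  # drop duplicates, put pages in order
        if prev is not None and page == prev + 1:
            prev = page  # still inside a run of consecutive pages
            continue
        if start is not None:
            ranges.append((start, prev))  # close the previous run
        start = prev = page
    if start is not None:
        ranges.append((start, prev))  # close the final run
    return ", ".join(str(a) if a == b else f"{a}-{b}" for a, b in ranges)

def format_index(entries: dict[str, list[int]]) -> str:
    """Alphabetize headings (ignoring case) and attach their page ranges."""
    lines = []
    for heading in sorted(entries, key=str.casefold):
        lines.append(f"{heading}, {collapse_pages(entries[heading])}")
    return "\n".join(lines)

# The headings and page numbers still have to come from a human indexer.
entries = {
    "search engines": [12, 13, 14, 40],
    "Automatic indexing": [7, 3, 8],
    "metadata": [22],
}
print(format_index(entries))
```

Notice what the code does not do: it never decides which headings belong in the index. That part is still the indexer's job.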

For information on the various software available, see http://www.asindexing.org/site/software.shtml. If you have feedback, especially differing opinions, I'd love to hear them. Write me at seth@maislin.com.

(This article was originally published in 2002 and 2004 -- and it's still 100% accurate.)

05 March 2007


Interpretation, not computation

After explaining the limitations of Microsoft Word's auto-indexing feature to one of the many people who write me asking for indexing advice, I got an interesting response. Clearly frustrated by the nonexistence of computer tools to do something as simple as generate a name index, he wrote:

> I'm amazed at the poor development of the science of indexing for printed matter such as books.

I wrote back, "You misunderstand!"

The science of indexing is quite broad, given that it has a history in long-ago library science. What seems undeveloped in this case are the tools, but that's a misunderstanding of what indexing is. Indexing is an editorial field, not an automatic one. You might say it's a lot like writing, in that the writer must decide what their readers want to read, and then the writer must communicate those ideas in an organized and approachable way. Indexing is the same: analyzing text to discover what readers might find interesting, then labeling those ideas in multiple ways and organizing them so people can find them.

Computers will never be able to write indexes because they can't (a) interpret the importance of a concept, (b) understand concepts rather than simple words, and (c) connect ideas in contextually relevant ways. As much as I admire the Google.com search engine for what it can do, once again I will demonstrate what it can't do. Google finds 10,000,000 things when we really only want 3 (or 10 or 20). It finds what we type, but it doesn't find synonyms. And there's no guarantee that Google is searching everything that's out there, though it appears to come close; in book indexing, however, there's a human to make sure every page was considered.

How often has Microsoft Word attempted to auto-correct you in a completely inaccurate way? Spell-check? Auto-format? Auto-complete? Half-intelligent humans don't make the kinds of mistakes that these tools do.

Here's what I wish he had written:

> I'm amazed that people who know full well that computers could never write newspaper articles still believe computers can write indexes.

Another problem, of course, is that indexes aren't respected in the industry. The reason Microsoft Word even has an automatic indexing feature is that the people who wrote that software have no idea of the damage such a tool causes. That Word's {XE} functionality is so miserable is even further proof. There's a nasty cycle: people use inferior tools, quality indexing grows less likely, and inferior tools become the standard.

Indexing is an editorial process, just like writing and editing. Indexing requires interpretation, not computation.

Computers will not and should not be used as indexers. If my job ever dies because computer programmers have found a way to make me obsolete, at least I know I'll be in the enlightening company of human writers and artists.

