28 December 2006
Eighteen million people can't be wrong
No matter how much you and I might like Google, the fact is that Google has some very serious problems when it comes to finding content. More specifically, if you're looking for the "right" answer, or if you're attempting to do any serious research, Google is likely to fail you miserably.
The flaw lies in Google's strength: social algorithms. Social algorithms are processes in which decisions are made by watching and following the majority of people in a community. If blogger.com tends to be the place people go to create blogs, then a social algorithm will see blogger.com as "better." When a search engine is managed by a social algorithm, a website might appear first in search results not because of the quality of the site, but rather because a larger number of people treated the site as if it were of higher quality. In other words, social algorithms equate "majority" with "best," something that often looks right but actually is patently untrue.
When you perform a search at Google.com, your results are sorted based on majority behavior and little else. For simple questions about anything -- as well as complex questions about cultural issues, for which "lots of people" is critical -- frequently the majority opinion is rather close to what you want -- which is why Google is so successful. But the gap between "close to what you want" and "accurate" is an invisible one, and that makes it insidious and dangerous.
For example, search for "Seth Maislin." The first hit is my website. The second hit is this blog. The third hit is an interview I did for O'Reilly & Associates in July 1999. An investigation of why these are the top three sites is rather interesting. First of all, these are the only results in which my name actually appears in the title; the fourth link and beyond have my name in the document, but not the title. Second, my website appears at the top not because it's the definitive website about "Seth Maislin," but because Google knows of 24 people linking to it. In comparison, the only person who ever created a link to this blog is me -- a number far less than 24! The same goes for the O'Reilly interview, except that the single linker isn't even a valid site any more: it's broken. The popularity of my home page (in comparison to this blog, for example) is why it's a better hit for my name. But if you folks out there started to actually link to this blog, that would change.
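To make the link-counting idea concrete, here is a toy sketch in Python. It is not Google's actual algorithm, only an illustration of ranking by inbound links and nothing else. The function name and page labels are made up for the example; the link counts are the ones quoted above.

```python
def rank_by_inbound_links(link_counts):
    """Return page names sorted by inbound-link count, most-linked first.

    Pages with equal counts keep their original order, because
    Python's sort is stable.
    """
    return sorted(link_counts, key=lambda page: link_counts[page], reverse=True)


# Inbound-link counts from the example above (page labels are illustrative).
link_counts = {
    "home page": 24,
    "this blog": 1,
    "O'Reilly interview": 1,
}

ranking = rank_by_inbound_links(link_counts)
print(ranking)
```

Raise the count for "this blog" above 24 and it moves to the top, even though nothing about its quality has changed.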
You should look into the search results for the word "Jew." A website known as JewWatch.com, an offensive and inflammatory collection of antisemitic content, had appeared as the number-one result at Google.com for this one-word query. This happened because a large number of supporters of this site tended to build links to it; then, those who were outraged or amused also linked to it within their protestations. In the end, the social algorithms at Google recognized how popular (i.e., "linked to") this site was, and in response rated it very highly -- in fact, rated it first -- compared to all other websites with the word "Jew" in the title. Eventually, those who were enraged by this content fought back by asking as many people as possible to link somewhere else -- specifically, the Wikipedia definition of Jew -- just as I have here. Over time, more people linked to Wikipedia than to JewWatch, and so the latter dropped into second place at Google. This process of building networks of links in order to influence Google's social algorithm is called "Google bombing." In other words, when the people who hated the site acted together in a large group, Google's social algorithms responded.
(By the way, you'll notice that I do not create a link to the offensive site. I see no reason to contribute to its success.)
Do you see the problem? The success of Google bombing is analogous to the squeaky wheel metaphor, that the loudest complainer gets the best service. Social algorithms reward the most popular, regardless of whether they deserve it. JewWatch made it to the top because it was popular first; Wikipedia's definition moved to the top because those offended banded together to demonstrate even more loudly. And in the end, there's no reason for me to think either of these links is best.
Whether popularity is a good thing or a bad thing is often subjective. In language, some people lament the existence of the word ain't, while others applaud its existence as an inevitable sign of change; either way, the word is showing up in our dictionaries because more and more people are using it. But I'm not talking about language; I'm talking about truth.
Do you think vitamin C is good at preventing colds? Well, it isn't; there have been no studies that demonstrate its effectiveness, but there have been studies that show it makes no real difference. (It's believed that vitamin C will shorten the length of a cold, but studies are still inconclusive.) But after a doctor popularized the idea of vitamin megadosing, our entire culture suddenly believes taking the vitamin will keep you extra healthy. Untrue.
Do you know why "ham and eggs" is considered a typical American breakfast? Because an advertising executive in the pork industry used Freudian psychology to convince people to eat ham for breakfast. He did it by asking American doctors if they thought hearty breakfasts were a good thing (which they did); the ad-man then asked if ham were a hearty food. Voila: ham, sausage, and bacon are American breakfast staples, and the continental breakfast vanished from our culture.
In both of these examples, majority belief trumps the truth. And look at the arguments about global warming! I won't repeat the arguments laid out by Al Gore in An Inconvenient Truth, but his argument is that as long as enough people insist that global warming isn't true, its dangers will remain unheeded. In fact, I'm not even going to argue here whether global warming is a real thing or not; it doesn't matter what I believe. What matters is that the debate over global warming isn't a fight over the facts. Instead, it's a shouting match, in which the majority wins. Right now, so many influential people have argued that it doesn't exist (or isn't such a big deal) that very little has been done in this country in response to its possible existence. But as more and more people start to believe it's at least possible, it's becoming a reality. Doesn't that just drive you nuts? Why are the facts behind global warming driven by democracy? Can't something be true even if no one believes in it?
One last look at this "majority rules" concept, only this time let's avoid politics and focus on simple word spelling. If you search for the word millennium, correctly spelled with two Ls and two Ns, you'll get about 54 million hits at Google (English-language pages only). If you search for the word millenium, misspelled with two Ls and only one N, you'll get 18 million hits. That's twenty-five percent of the 72 million hits for either spelling! For content that's published, that by its very nature is biased toward having only correct spellings, this error rate is monstrous! But does Google let you know that millenium is misspelled? Does it ask you if you "meant to type millennium?" No! After all, Google considers the misspelled word correct.
I mean, eighteen million people can't be wrong, right?
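The 25 percent figure is just the misspelled hits divided by the hits for both spellings. A minimal sketch of that arithmetic, using the rounded hit counts quoted above (the function name is only for illustration):

```python
def misspelling_share(correct_hits, misspelled_hits):
    """Fraction of all hits (both spellings) that use the misspelling."""
    return misspelled_hits / (correct_hits + misspelled_hits)


share = misspelling_share(correct_hits=54_000_000, misspelled_hits=18_000_000)
print(f"{share:.0%}")  # prints: 25%
```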
Labels: Google, misspellings and other errors, search engines, social algorithms, spamming and similar behaviors, web indexing
24 December 2006
I met a famous indexer the other day
Well, I met the indexer who actually wrote those keywords -- someone I've known for a long time -- and I have to say, there's something really cool about realizing that one of your good colleagues was behind that story. I also find it reassuring that the indexer is someone who really knows what she's doing, because it emphasizes just how far apart good indexing is from good search: smart people, dumb tools.
For more on this subject, I recommend reading The Inmates Are Running the Asylum. The book is about computer programming in general, but the sentiment is dead on.
Labels: keywording, search engines
A needle in a haystack with 100,000,000 blades
I did some math, and I've figured out a way to make sure that all of these websites are indexed. Here's what I discovered.
- Between October and November 2006, approximately 3.5 million sites were created. Assuming that my team would be responsible for inventing a set of keywords for the whole site -- and not for individual pages or parts of pages -- we would have to build 3.5 million keyword sets.
- Let's further assume that on average, every website would have four keywords or key phrases. For example, this blog would get the keywords "Seth Maislin," "indexing," "blog," and perhaps my company name, "Focus Information Services." Ideally we'd have the time to invent many more, since it's our goal to help the website perform well at the various search engines, but this team simply can't give everyone special attention. So I'm making the executive decision to limit ourselves to creating 4 terms each for 3.5 million sites.
- Assuming that we can invent and type one keyword every two seconds -- a conservative estimate, given that my company name takes me a minimum of two seconds to type -- we'll need 28 million seconds to get the job done.
- Now remember, we're just talking about the new sites created in October 2006. Consequently, we have only a month to get the job done before we have to start indexing the November 2006 sites. For this reason, I'm going to build a team of several people, with each one putting in eight hours per day, twenty days each month. That's 576,000 seconds per person per month.
- Dividing 28,000,000 seconds per month by 576,000 seconds per person per month gives me 48.6 people, which I'll round to a nice 50 people. That means I need a team of just 50 people to get the job done.
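The staffing estimate in the list above can be sketched in a few lines of Python. The constants are the assumptions stated in the bullets; the function and variable names are invented for the sketch.

```python
import math

# Assumptions taken from the bullets above.
SITES_PER_MONTH = 3_500_000   # new sites created in one month
KEYWORDS_PER_SITE = 4         # keywords or key phrases per site
SECONDS_PER_KEYWORD = 2       # time to invent and type one keyword
HOURS_PER_DAY = 8             # working hours per indexer per day
WORKDAYS_PER_MONTH = 20       # working days per indexer per month


def indexers_needed(sites, keywords_per_site, seconds_per_keyword,
                    hours_per_day, workdays_per_month):
    """Return the (fractional) number of people needed to keep up for one month."""
    total_seconds = sites * keywords_per_site * seconds_per_keyword
    seconds_per_person = hours_per_day * 3600 * workdays_per_month
    return total_seconds / seconds_per_person


exact = indexers_needed(SITES_PER_MONTH, KEYWORDS_PER_SITE, SECONDS_PER_KEYWORD,
                        HOURS_PER_DAY, WORKDAYS_PER_MONTH)
print(f"{exact:.1f} indexers, rounded up to {math.ceil(exact)}")
# prints: 48.6 indexers, rounded up to 49
```

The result lands just under the team of 50, so rounding up leaves a little slack. Changing any one constant, such as eight keywords per site instead of four, scales the headcount in direct proportion.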
At U.S.$50,000 per year per indexer, which is quite modest for a highly intense round-the-clock job like this, plus $100,000 for me as manager, I could probably put together a bid of about $2.6 million/year to get the job done. Given how many billions of dollars are spent or exchanged over the Internet today, that seems quite reasonable, too. Heck, I should triple the whole thing, since we'd have to re-index the old sites every once in a while. Maybe I should double it again, too, so we'd be allowed to use eight keywords instead of four.
So let's see, that brings the total bill to $15.6 million. Gosh, that isn't bad at all, is it? I mean, we all agree that indexing the Internet is at least a fifteen-million-dollar-per-year business, right?
Except it's not. Indexing the Internet is a zero-dollar-per-year business. No one is doing it. Just about no one seems to care about quality keywords. In fact, there are only two industries that exist around keyword creation. One of them is misnamed "search optimization," which is about spamming the heck out of the Web. Optimize, I think not: this is the opposite of the intelligent product my team would build. The other business is the search business itself, companies springing up around those fancy algorithms that Google, Yahoo, Lycos, Ask Jeeves, and the rest use. The thing is, those algorithms are just word-matching machines. These engines are looking for keywords, but none of them is actually writing any. So you see, no one with indexing training is writing any keywords. The inexpensive market for human indexers is being completely overlooked.
Guess it's not worth the fifteen million.
Labels: indexing process, keywording, search engines, spamming and similar behaviors, web indexing
18 December 2006
"Can I Delete All My ___ Entries in MS Word?"
Every now and then, there's nothing you want to do more than globally delete a bunch of entries. The problem is how this is supposed to happen. For example, suppose you have a common main entry for "publicity," when you decide that you're better off with a cross reference like "publicity. See marketing." In addition to creating this cross reference, you need to remove all of your original publicity entries. Although you can search for marker text, you can't search for whole markers. In other words, you can search for the word "publicity" when it's used within index markers (look for hidden text), but you can't search for a whole marker like {XE "publicity"} or {XE "publicity:methods for"}. For this reason you can't simply search globally and delete.
The easiest approach to deleting all publicity entries is the manual approach: generate your index, then delete everything that starts with the word publicity. Unfortunately, manual edits will be undone as soon as you generate the index again; you'll have to remember that you want to make these manual changes every time you create a new version of the index. To help you remember to make these manual changes, I recommend changing the format and/or language for the word publicity to make sure it jumps out at you. Search for XE "publicity, the unique text for all publicity entries, and reformat it with boldface, all caps, and a shocking color like red. I also recommend that you replace the word publicity with something that will sort at the very beginning of your index, such as aaa DELETE ME. Now, when you generate your index, you'll see a red, boldface, all-caps reminder at the top of your index file. Hopefully this will be enough for you to remember to delete your entries.
Another approach, and by far the one I prefer, is to replace the marker syntax with something that Word can't interpret. Instead of using the letters XE in your marker, use something like DELETE_ME. In other words, globally change XE "publicity to DELETE_ME "publicity. Since markers are hidden text, your DELETE_ME markers will remain hidden from publications; further, they'll fail to become index entries since Word won't interpret them as XE markers. The biggest advantage to this method is that it works globally, and you only have to make these changes once. Another advantage is that you aren't actually deleting the entry, just rewriting it; if for any reason you need to reconstruct entries, you can always change DELETE_ME back to XE. (This is a kludgy way of creating conditional text, but it might be just what you need.) The disadvantage is that you're not actually deleting anything, potentially cluttering your documentation.
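The rename-and-restore trick can be illustrated with a small Python sketch. It works on a plain string standing in for the marker text and does not open or modify a Word file; in Word itself the same change is made with Find and Replace as described above. The function names and sample markers are hypothetical.

```python
def disable_entries(field_text, term, tag="DELETE_ME"):
    """Rename XE markers whose entry starts with `term` so they stop being index entries."""
    return field_text.replace(f'XE "{term}', f'{tag} "{term}')


def restore_entries(field_text, term, tag="DELETE_ME"):
    """Undo disable_entries by turning the tagged markers back into XE markers."""
    return field_text.replace(f'{tag} "{term}', f'XE "{term}')


# Sample marker text (illustrative only).
sample = '{XE "publicity"} {XE "publicity:methods for"} {XE "marketing"}'

disabled = disable_entries(sample, "publicity")
print(disabled)
# prints: {DELETE_ME "publicity"} {DELETE_ME "publicity:methods for"} {XE "marketing"}

restored = restore_entries(disabled, "publicity")
print(restored == sample)  # prints: True
```

Because the match is on the opening text XE "publicity, any entry that merely begins with those letters is renamed too, which is the same behavior you get from the global search in Word.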
As a side note, whenever you remove an entry from your index, remember that you have to delete any cross references that target those now-removed entries. For example, if you replace your publicity entries with "publicity. See marketing," you'll need to rewrite or delete entries like "public relations. See also publicity."
Labels: indexing tools
16 December 2006
ASI President's Letter (December 2006)
ASI: Prospective and Retrospective, a Presidential Perspective
At the end of this year, I’ll write a letter to “Seth of 2008.”
For a number of years I’ve been sending snapshots of my life to future “selves,” capturing a year’s events, achievements, and desires onto a couple pages. Even though I’m writing to myself, however, I’m trying to communicate with versions of me that don’t yet exist. Who will I be in 2008? Why will I want to know about today’s “me”? What about the Seth of 2014? So, after warming up my pen with details about family, house, job, art, and health, I inevitably get to the tough stuff: ambitions, anxieties, hopes, and disappointments. There’s an irony to the whole thing, knowing I’ll be reading the letter with perfect hindsight. It’s an incentive to improve every year.
ASI’s strategic plan is just such a letter. With its many strategies and priorities, we’re informing our future society about some critical information. Our members have shared with us a vision in which indexing will be recognized and respected more; to reach this vision we’ll have to look critically at who we are, now and soon. With the hindsight we’ll have in 2008 (and 2010 and 2014), I don’t want us to feel nostalgic when we look back. I want us to feel successful. I want us to be glad that we live in better times.
The conflict between the needs of the immediate and our goals for the future is real. To function as a society, we need people in charge of what’s happening right now, as well as people in charge of what’s happening in the future. Week to week, ASI manages a long stream of important details: chapter name changes, SIG formations, PR construction, training course materials, administrative shifts, the Philadelphia conference, membership drives, and so on. The board gets a few dozen reports from committees, fourteen chapters, SIGs, and task forces. This is the “ASI of 2006,” focused on bylaws and meetings and content development.
Labels: American Society of Indexing, future of indexing
Seth, a *different* enabler
"Just tell me the answer" (blog entry, 7 Nov 2006)
There are more enabling Seths out there than you might have noticed at first.
Labels: information architecture process
12 December 2006
Seth, the enabler
Here's an analogy. Suppose a friend who needs a resume approaches you for help. "Would you write me a resume?" she asks. One approach is to say "yes," ask a couple of questions, and then crank out a complete resume. Handing it to her you say, "Go ahead and make some edits, if you want." There are some wonderful advantages to this process: you get to work on your own, on your own terms, and for a very short period of time. On the other hand, what you're really supposed to do is sit down with your friend and say, "Well, what have you got so far?" Then you ask all sorts of clever questions like, "What do you think you do best?" and "What kind of job do you think you want?" She answers these questions, and as you nod wisely, you tell her to write all that stuff down.
The greatest part about being an enabler is that you never have to make a decision at all. You're a Freudian psychologist asking all sorts of provocative questions, getting paid by the hour to watch someone else do all the work. The better they do, the better you look.
I've discovered that being an enabler is the smartest, most lucrative, and most effective way to be a consultant -- but the fact that I never have to make a decision is very interesting. "What do you think? How would you do this? Do you think this would work? Before tomorrow, see if Joe agrees." I'm amazed at the power these kinds of questions have.
Ask yourself how much enabling you do in your job. I'm starting to realize that helping people do things on their own is more rewarding than doing it myself. Frankly I'm unnerved by this. This wasn't at all what I learned in engineering school.
But everything I've read says that this is now the right way to do this. Decisions made by people who don't actually use the system are less likely to succeed. Evidence-based practice is about moving forward not on what you think, but on what you know, such as from testing. So yes, it's about asking the right questions, and not about what you know. In fact, psychologists who do know the answer have to play dumb if they're to succeed.
If you had told me years ago that the subject matter experts are far less valuable than the subject matter dunces, I'd have said you were full of... what's the word? (I trust your opinion.)
Labels: information architecture process
Indexing moving content
Fact is, the world has a way of throwing curve balls on a regular basis. For me, those curve balls include a family-wide influenza epidemic, teething babies, travel plans, and the like. Trying to keep a grip on life is like trying to catch fish with your hands.
Tonight I give a presentation about trying to index moving targets. I was surprised to discover that of all the presentations I've ever given, this was absolutely the hardest to write. In fact, I just finished a few minutes ago. I've taught three-day classes, with eight hours of material on each day, but this 45-minute presentation really stymied me. There are two reasons for this.
First, trying to index moving content is, no matter what, a mess. The simplest example of a problem is creating an index entry like "software development, 111-121," and then finding out that pages 111 and 121 have moved respectively to pages 113 and 123. With standalone indexing (where you type in the page numbers), the only real way to fix this is manually: go back and rewrite all your page numbers. It's a MESS. So here I am, hoping to provide some tips to indexers and technical writers, something to help them avoid these kinds of corrections -- only to realize that there's no good answer. (A bad answer is to not index at all. :-)
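For the simplest case only, where everything from some page onward moves by the same amount, the correction can be sketched in Python. This is an illustration of the chore, not a solution to it: real reflow is rarely this tidy. The function name is made up, and the sketch assumes the entry heading itself contains no digits.

```python
import re


def shift_locators(entry, offset, first_moved_page=1):
    """Add `offset` to every page number in `entry` at or after `first_moved_page`.

    Assumes every run of digits in the entry is a page number.
    """
    def bump(match):
        page = int(match.group())
        if page >= first_moved_page:
            return str(page + offset)
        return match.group()

    return re.sub(r"\d+", bump, entry)


print(shift_locators("software development, 111-121", 2))
# prints: software development, 113-123
```

As soon as different pages move by different amounts, one offset no longer works and every locator has to be checked against the new pages by hand.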
The second problem is that even if I did have a list of useful tools, they don't make for interesting presentation materials. The first draft of my presentation would have resembled a public reading of the weather report for every American city, in alphabetical order: if you're lucky, you're interested in Albuquerque and Atlanta and can walk out early.
The fact is, our growing reliance on live and custom information is wreaking havoc on the indexing world. It's becoming harder and harder to collate information in relevant chunks. Search will never do it; even if there were human beings out there developing controlled vocabularies, full-text search still retrieves a tremendous amount of flotsam. But creating keywords for something that won't live an hour seems kind of pointless, too. We're all just pounding sand.
I'm looking forward to what the participants have to say. Must we accept the false imprisonment of uncatalogued real-time information flow, or will writers finally catch on that indexers have an important role on the creation side as well?
Labels: cataloguing, embedded indexing, indexing process, keywording