07 February 2007
Indexes are the speed limit on the information highway
The growing demand for indexes that point to online, changing, and custom content is opening a huge gap in the indexing industry, and that gap is physical.
With traditional publishing, if I wanted a reader to find content on page 114, I would simply have the number 114 show up in my index: "credit card fraud, 114." To make this work, however, that page number must be immutable across the lifetime of my index. Should the content be republished in a different format, layout, or language, or with significant edits within the first 100-plus pages, my index could be rendered inaccurate. In other words, if I type "114" in my index, that content had better be on page 114.
The appeal of fluid content, however, is slowly making traditional information delivery obsolete. Not only are books republished for lots of "traditional reasons" (e.g., updated editions, new languages, different book and print sizes), but technology is enabling books to be published without a single physical page. With the possible exception of the Adobe PDF format (which purposefully preserves the overall book-like format in an electronic file), page breaks are optional and subjective. A Web page or HTML document can have a scroll bar, such that there are no pages; an e-book intended for a handheld reader is paged according to the size of the reader; a news or magazine article of any length can be split into two or three parts simply to increase ad sales; and some electronic documents can be edited by the readers such that anything goes.
Ah, how I miss the days when 114 meant 114.
Indexing content that changes is going to be hard, but the fundamental challenge isn't about keeping up with what was newly published today, or even in the last twenty minutes. It's about content ownership. When content is moving around all the time, indexers don't have a good way to track where that content is going.
As an analogy, consider a classroom filled with thirty students, with one student at each desk. If you have a photograph of where every student is sitting, you could leave the room and generate a spreadsheet that lists each person's name and seat location. But what happens when the students are playing musical chairs? Every photograph you take is outdated almost immediately; even staying in the room wouldn't be good enough, because your typing speed will never match the speed of thirty kids jumping around. In fact, the only way you could manage a spreadsheet that shows where each student is sitting at all moments is if that spreadsheet operated in real time, by reference. In other words, if all thirty students carried GPS locator chips in their pockets, you could track the chips -- and thus the students -- by satellite. Your map could be as dynamic as the thing you're mapping.
Embedded indexing, or indexing by reference, is a rudimentary and imperfect example of this process. With embedded indexing, I can have some kind of information inserted into the content -- like the GPS chip in the student's pocket -- and then I can generate an index based on where that information is at any one time. This blog entry, for example, has keywords attached to it; the website where my blog is published can, at any time, generate a list of all entries with that keyword. This kind of dynamic indexing is not uncommon these days; website content is served according to a number of immediate rules, and the result can be as simple as a website that publishes "Hello Seth Maislin" on my page but no one else's, or as complicated as an online stock trading program that keeps track of millions of private transactions.
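To make that concrete, here is a minimal sketch of embedded indexing in Python. Everything in it -- the entry IDs, titles, and keywords -- is hypothetical, invented for illustration; the point is only that the index is generated on demand from tags carried by the content itself:

```python
from collections import defaultdict

# Content blocks carrying their own keyword tags (the "GPS chips").
# Their order and location can change freely without breaking the index.
entries = [
    {"id": "post-114", "title": "Spotting credit card fraud",
     "keywords": ["credit card fraud", "security"]},
    {"id": "post-115", "title": "Reading your statement",
     "keywords": ["credit card fraud", "banking"]},
]

def build_index(entries):
    """Generate the index fresh from the embedded tags, so the locators
    point at wherever the content currently lives."""
    index = defaultdict(list)
    for entry in entries:
        for keyword in entry["keywords"]:
            index[keyword].append(entry["id"])
    return dict(index)

print(build_index(entries))
# {'credit card fraud': ['post-114', 'post-115'],
#  'security': ['post-114'], 'banking': ['post-115']}
```

Because the index is rebuilt from the embedded tags each time it's requested, the locators stay correct no matter how the entries are reordered or republished: the map follows the chips, not the seating chart.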
I say this is rudimentary, however, because it's still a snapshot. Perhaps it's convenient to have that snapshot taken at the moment I arrive at a website, but if I leave my browser at a website and walk away for 20 minutes, the picture doesn't change. The "Hello Seth Maislin" greeting made sense when I was sitting at the computer, but if I walk away and my wife sits down, it's now wrong. The snapshot is old. Google search results can change from one minute to the next. Even stock trading programs sport copious warnings that despite the best efforts of the website, the price you think you're getting may not be the *actual* price when you complete a transaction; the delay between your clicking the mouse and the machines at the other end doing something is real and unavoidable. Some websites attempt to minimize this by taking a snapshot every fraction of a second, as if you were watching what was happening "live." In reality, there's still a delay, and there's still no way to truly synchronize everyone's machines.
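Continuing the hypothetical sketch above, the staleness problem looks like this: a snapshot of the generated index is frozen at the moment it's taken, and the very next change to the content makes it wrong:

```python
snapshot = build_index(entries)           # the index as photographed right now

# The content keeps moving: one entry is retagged, another is published.
entries[0]["keywords"].remove("security")
entries.append({"id": "post-116", "title": "Phishing basics",
                "keywords": ["security"]})

print(snapshot["security"])               # ['post-114'] -- stale
print(build_index(entries)["security"])   # ['post-116'] -- current, but only
                                          # until the next change
```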
My point is that indexes to changing documentation must live apart from the documentation. If they really lived completely together, the content and the index would be essentially the same thing, just as the GPS chip and the student are really one merged object. But because indexes are interpretations of content, there is always going to be a gap. The generation of the index must be removed from the content that is being indexed in order for that interpretation to take place.
The only way for indexing to survive, I think, is for content to slow down. And because I believe indexing -- interpretation -- is critical for learning, the only logical conclusion is that content will slow down.
The need for an index is the logical limit of just how fast data can travel.
Labels: future of indexing, indexing process, power of information