XML.com: XML From the Inside Out


Quick and Dirty Topic Mapping

February 04, 2002

There's a lot of advanced research and literature on the subject of taxonomies and topic maps. To be honest, much of it goes way over my head. I'm keenly interested in the problem of categorizing content, but the more I wrestle with the problem, the more it looks like a job for a bit of scripting and common sense. A few times recently, I've gone through the exercise of mapping a set of dozens or hundreds of URLs to a set of topics. What seemed like a daunting task has turned out, rather surprisingly, to be not so hard.

The starting point, for me, is a mapping file with a list of URLs to categorize. I walk down the list in a text editor, and attach one or more topics to each URL, like so:

url1,web services,proxying,peer-to-peer
url2,collaboration,messaging
url3,books,security

The mapping file can be plain text or HTML with clickable links. The choice depends on the content management system (or systems) behind those URLs. When a CMS supports and encourages descriptive URLs, and when you're familiar with the documents they link to, you can often categorize those URLs just by eyeballing them. But when the URLs are opaque, you'll need to turn them into HTML links so you can visit them. The former case is preferable because the mapping file you're editing stays clean and simple. The more I think about it, the less I am able to justify opaque URLs that say nothing about the documents they refer to.
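
When the URLs are opaque, a few lines of Perl can wrap each one in a link so the mapping file can be opened in a browser. This is only a minimal sketch: the sub name linkify and the example.com URLs are made up, and it assumes one URL per line with no commas inside the URL itself.

```perl
#! /usr/bin/perl -w
use strict;

# Hypothetical helper: wrap the URL field of a mapping line in an HTML link,
# leaving any topics that follow it untouched. The <br> leads the line so it
# stays inside the first field and the trailing topics remain clean.
sub linkify
    {
    my ($line) = @_;
    chomp $line;
    my ($url, @topics) = split(/,/, $line);
    return join(',', qq(<br><a href="$url">$url</a>), @topics);
    }

# Demo on two invented lines; a real run would loop over <> instead.
print linkify($_), "\n" for ("http://example.com/?id=1017,books,security",
                             "http://example.com/?id=2203");
```

Because the URL ends up as the first quoted string on each line, a script that pulls the URL out with a pattern like m#"([^"]+)# can still read the file after you've added topics to it.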

The essence of this strategy is to work bottom-up, rather than top-down. I don't start with a predefined set of topics. Rather, I allow them to emerge from the material as I work my way through it. I don't try to create a topic hierarchy. Having wrestled with questions such as whether XML should be a subcategory of Web Development, or vice versa, I've concluded that this way lies madness. My goal now is simply to assign resources to a flat list of topics--from 15 to, at most, 40 of them, depending on the data set. The resulting topic map isn't fancy, but it chunks the data set usefully, it's easy to create, and it's easy to maintain.

I don't worry too much about the topics I assign on the first pass. Because I know I'll be iterating over the results and tuning them, it's only necessary to create a rough draft. At this stage, it's OK to create more topics than you'll need, as well as assign URLs to more categories than they may finally warrant. Once you can visualize the mapping, you'll see much more clearly how to coalesce and streamline.

Building the Topic Database

Here's a script that turns an HTML-style mapping file into the database that drives the topic visualizer. I use the term database loosely because it's really just a persistent Perl hashtable.

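#! /usr/bin/perl -w
use strict;
use LWP::Simple;
use Data::Dumper;

# Load the persistent database; the file assigns to $main::topicDb.
# (Perl 5.26 and later no longer search the current directory, hence './'.)
do './topicDb';
# The topic map is rebuilt from scratch on every run.
$main::topicDb->{topicHash} = {};

while (<>)
    {
    chomp;
    my @fields = split(/,/);
    # HTML-style mapping file: the URL is the first quoted string on the line.
    next unless @fields and $fields[0] =~ m#"([^"]+)#;
    my $url = $1;
    my $title;
    if ( ! exists $main::topicDb->{urlHash}->{$url} )
        {
        # Unremembered URL: fetch the page and remember its title.
        my $doc = get $url;
        if ( defined $doc and $doc =~ m#<title>([^<]+)#i )
            { $title = $1 }
        else
            { $title = 'untitled' }
        chomp $title;
        print "$url,$title\n";
        $main::topicDb->{urlHash}->{$url} = $title;
        }
    else
        { $title = $main::topicDb->{urlHash}->{$url} }
    my @topics = @fields[1..$#fields];
    foreach my $topic (@topics)
        {
        push ( @{$main::topicDb->{topicHash}->{$topic}}, $url );
        }
    }

save ($main::topicDb, 'topicDb');

# Write the structure back to disk as Perl source that 'do' can reload.
sub save
    {
    my ($var,$name) = @_;
    my $dump = Data::Dumper->new([$var],[$name]);
    open (F,  ">$name") or die "cannot create $name $!";
    print F $dump->Dump;
    close F;
    }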

The database, in a file called topicDb, should be initialized like so:

$topicDb =
    {
    'topicHash' => {},
    'urlHash'   => {},
    };
The topicHash, keyed on topic names, will store lists of resources. The urlHash, keyed on URLs, will store HTML document titles. The topicHash is rebuilt from the mapping file on every run, so that it reflects any changes you've made in the assignment of topics to URLs. The urlHash, though, is restored from its persistent on-disk representation. That's because fetching a page to learn its title is an expensive operation, so the script does it only for URLs it hasn't already remembered.
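
To make the two hashes concrete, here's a sketch of what the database might hold after a run over a short mapping file. The URLs and titles are invented; only the shape (topic to list of URLs, URL to title) comes from the script above. The helper topics_for is a hypothetical addition that shows how the same structure answers the reverse question: which topics was a given URL filed under?

```perl
#! /usr/bin/perl -w
use strict;
use Data::Dumper;
$Data::Dumper::Sortkeys = 1;   # stable key order in the dump

# Invented sample data in the shape the builder produces.
my $topicDb = {
    topicHash => {
        'books'    => [ 'http://example.com/a' ],
        'security' => [ 'http://example.com/a', 'http://example.com/b' ],
    },
    urlHash => {
        'http://example.com/a' => 'First sample page',
        'http://example.com/b' => 'Second sample page',
    },
};

# Hypothetical helper: list the topics a given URL was filed under.
sub topics_for
    {
    my ($db, $url) = @_;
    my @hits;
    foreach my $topic (sort keys %{ $db->{topicHash} })
        {
        push @hits, $topic if grep { $_ eq $url } @{ $db->{topicHash}{$topic} };
        }
    return @hits;
    }

# Dump in the same form that gets written to disk, then try a reverse lookup.
print Data::Dumper->new([$topicDb], ['topicDb'])->Dump;
print join(', ', topics_for($topicDb, 'http://example.com/a')), "\n";   # books, security
```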

Visualizing the Topic Database

Feed the mapping file to the database builder to produce a topic database in the file topicDb. Now, to visualize the results, use something like this:

#! /usr/bin/perl -w
use strict;

# Load the database written by the builder into $main::topicDb.
do './topicDb';

my $topicHash = $main::topicDb->{topicHash};
my $urlHash = $main::topicDb->{urlHash};

# One bold heading per topic, with its linked titles indented beneath it.
foreach my $topic (sort keys %$topicHash)
    {
    print "<p><b>$topic</b>";
    print "<blockquote>\n";
    foreach my $item (@{$topicHash->{$topic}})
        {
        my $url = $item;
        my $title = $urlHash->{$item};
        print qq(<div><a href="$url">$title</a></div>\n);
        }
    print "</blockquote></p>\n";
    }

The result will be something like this:
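
In a browser, each topic shows up as a bold heading with its linked titles indented beneath it. To see the raw markup, here's a sketch that runs the same loop over a tiny in-memory database. The URLs and titles are invented for illustration, and render is a made-up name for the loop wrapped in a sub.

```perl
#! /usr/bin/perl -w
use strict;

# Invented sample data in the same shape as topicDb.
my $topicHash = {
    'books'    => [ 'http://example.com/a' ],
    'security' => [ 'http://example.com/a', 'http://example.com/b' ],
};
my $urlHash = {
    'http://example.com/a' => 'First sample page',
    'http://example.com/b' => 'Second sample page',
};

# Hypothetical wrapper: build the visualizer's HTML as a string.
sub render
    {
    my ($topics, $titles) = @_;
    my $html = '';
    foreach my $topic (sort keys %$topics)
        {
        $html .= "<p><b>$topic</b>";
        $html .= "<blockquote>\n";
        foreach my $url (@{$topics->{$topic}})
            {
            $html .= qq(<div><a href="$url">$titles->{$url}</a></div>\n);
            }
        $html .= "</blockquote></p>\n";
        }
    return $html;
    }

print render($topicHash, $urlHash);
```

Here the first sample page appears under both books and security, which is exactly the kind of double assignment worth a second look when you review the output.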




Now you can iterate. Are there too many topics? Are there topics with only one entry that should perhaps coalesce with broader topics? Do items assigned to multiple topics make sense for all those topics? It's straightforward to edit the mapping file, rerun the scripts, and recheck the results. Renaming a topic, or subsuming one into another, is just a quick search and replace in your text editor. Going in the other direction--that is, breaking out items from a general topic and assigning them to more specific topics--is, of course, counter-entropic and thus much harder. But that's true no matter what system you use. That's why it's best to start with more specificity than you'll need, and generalize as you iterate.
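
A small report can make those questions easier to answer than eyeballing the HTML. The sketch below is not one of the original scripts: the subs report and singletons are hypothetical, and the sample data is invented. It lists topics by size and picks out the ones with a single entry, which are the usual candidates for merging.

```perl
#! /usr/bin/perl -w
use strict;

# Hypothetical helper: topics that hold exactly one URL, sorted by name.
sub singletons
    {
    my ($topicHash) = @_;
    my @one = grep { @{ $topicHash->{$_} } == 1 } keys %$topicHash;
    my @sorted = sort @one;
    return @sorted;
    }

# Hypothetical helper: print topics, largest first, ties broken by name.
sub report
    {
    my ($topicHash) = @_;
    my @order = sort { @{ $topicHash->{$b} } <=> @{ $topicHash->{$a} }
                       or $a cmp $b } keys %$topicHash;
    foreach my $topic (@order)
        { printf "%3d  %s\n", scalar @{ $topicHash->{$topic} }, $topic }
    }

# Invented sample data in the shape of topicDb's topicHash.
my $sample = {
    'web services' => [ 'url1', 'url2', 'url3' ],
    'security'     => [ 'url2', 'url3' ],
    'proxying'     => [ 'url1' ],
    'peer-to-peer' => [ 'url1' ],
};

report($sample);
print "merge candidates: ", join(', ', singletons($sample)), "\n";
```

To run it against real data, you would load the database with do './topicDb' and pass $main::topicDb->{topicHash} in place of the sample.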

If your content management system helps you build this kind of topic map, great. I've just been checking out the new version of Radio UserLand, for example, and it does a beautiful job with topics. You can easily create topics, tag one or more of them onto your postings, and even--this is wildly powerful--make each topic into its own RSS feed. You can't, however, abstract the mapping of resources to topics into some central and easily accessible place. Topic maps often need to evolve globally as the territory they survey changes. When you want to effect such change, it's ideal to have global access.

Topic maps can, and often do, span resources from multiple content management systems. In these cases, the URLs and document titles from each CMS will observe different conventions. That's OK. The topic map doesn't care about the format of the metadata; it only relates resources to topics. That said, as you move from one system to another, the stylistic clash of URLs and document titles is vexing. Isn't it about time the CMS community agreed to apply a minimal standard set of Dublin Core metadata in managed HTML documents? Creator, subject, date, and publisher fields would be incredibly useful. A topic map builder needing to reclassify a resource could do so, using the technique shown in this article, but hints from the author would be a great place to start. And availability of the other bits of metadata would enable far smoother presentation of heterogeneous topic maps.
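
If documents did carry such metadata, a builder could seed its first draft from it. Here's a rough sketch under some loud assumptions: that pages use the common convention of meta tags named DC.creator, DC.subject, and so on; that the name attribute comes before content; and that both are double-quoted. The sub dublin_core and the sample page are invented, and a real tool would want an HTML parser rather than a regex.

```perl
#! /usr/bin/perl -w
use strict;

# Hypothetical helper: collect Dublin Core fields from <meta> tags.
# Returns a hash ref mapping lowercased field names to lists of values.
sub dublin_core
    {
    my ($html) = @_;
    my %dc;
    while ( $html =~ m#<meta\s+name="DC\.(\w+)"\s+content="([^"]*)"#gi )
        { push @{ $dc{lc $1} }, $2 }
    return \%dc;
    }

# Invented sample page.
my $page = <<'END';
<html><head><title>untitled</title>
<meta name="DC.creator" content="A. Author">
<meta name="DC.subject" content="web services">
<meta name="DC.subject" content="security">
<meta name="DC.date" content="2002-02-04">
</head></html>
END

my $dc = dublin_core($page);

# Emit a draft mapping line: the URL followed by the author's subject hints.
print join(',', 'url1', @{ $dc->{subject} || [] }), "\n";   # url1,web services,security
```

The author's subjects are only a starting point. They would still go through the same iterate-and-coalesce passes as topics assigned by hand.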