Building a Worldwide Lexicon
by Brian Jepson
Building a Worldwide Dictionary
So how do you go about building a worldwide dictionary/encyclopedia without doing a colossal amount of busy work? A lot of this information is already on the Web. The problem is that each dictionary has its own front-end. So if you can convince many of these dictionary servers to support WWL, you can build a worldwide network of dictionaries without reinventing the wheel.
Upgrading a dictionary server to support the Worldwide Lexicon is nearly as simple as the previous examples. Most dictionary and encyclopedia servers allow users to submit queries via a Web form that in turn invokes a script.
Most programming and scripting languages now support SOAP via easy-to-use toolkits (e.g. SOAP.py for Python, SOAP::Lite for Perl, and .NET). So upgrading your dictionary or encyclopedia server to participate in WWL is easy. All you need to do is write a new script that responds to SOAP messages instead of Web CGI requests. You've already done the dirty work of building a database and the code to query it; now you just add a fancy wrapper.
Simple Read-Only WWL Servers
If you only want to provide read-only access to your server, the job is truly easy. You'll need to write two simple scripts.
- One script calls out to one or several WWL supernodes, and uses the WWLRegister() and WWLServerStatus() methods to announce its availability. The WWLRegister() method is used to declare which language domains you are hosting. You call this once when you start your server. The WWLServerStatus() method is invoked to notify other servers when you start up, shut down, experience congestion, etc.
- The second script responds to client requests. This script implements a handful of WWL methods to respond to queries, and replies via SOAP instead of generating an HTML document.
Since you already know how to write CGI scripts, this should be an easy afternoon project for most sites. A basic WWL server only requires five of the methods defined in the Worldwide Lexicon protocol specification.
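As a rough illustration of the first script, the sketch below builds the SOAP 1.1 envelope for a WWLRegister() call using only the Python standard library. The method namespace (urn:wwl-example), the parameter names (server_url, languages), and the server URL are assumptions made for illustration; the real names come from the WWL protocol specification, and a SOAP toolkit would normally build this envelope for you.

```python
import xml.etree.ElementTree as ET

SOAP_ENV = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
WWL_NS = "urn:wwl-example"  # hypothetical namespace, not taken from the WWL spec


def build_soap_call(method, params):
    """Return a SOAP 1.1 request envelope (as a string) for one method call."""
    envelope = ET.Element(f"{{{SOAP_ENV}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_ENV}}}Body")
    call = ET.SubElement(body, f"{{{WWL_NS}}}{method}")
    for name, value in params.items():
        # Parameters are sent as simple child elements of the method element.
        ET.SubElement(call, name).text = str(value)
    return ET.tostring(envelope, encoding="unicode")


# Parameter names and URL below are assumptions; check the spec for the real ones.
register_request = build_soap_call(
    "WWLRegister",
    {"server_url": "http://www.yourserver.com/cgi-bin/wwl.pl", "languages": "en,es"},
)
```

The resulting string would be POSTed to each supernode you want to announce yourself to; a WWLServerStatus() call could be built the same way with a different method name and parameters.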
See Also:
For more information about the WWL protocol specification, visit our Web site at www.worldwidelexicon.org. In addition to the protocol documentation, we'll also post code from dictionaries and encyclopedia servers that have implemented WWL in various languages.
And for the Truly Shiftless...
Some dictionaries and translation servers will be able to join the Worldwide Lexicon by doing absolutely nothing, simply by attracting the attention of people building WWL gateway servers. These gateways present a WWL interface to client applications and supernodes, and translate their requests into other protocols (such as the DICT TCP protocol, proprietary HTTP CGI interfaces, etc.). These gateways will even be able to relay requests to live volunteers via instant messaging (more on this in a moment).
Gateways can also be used to convert static wordlists into systems that can be queried. For example, a gateway caches an English-Pashto wordlist, parses the HTML table into columns, and then answers WWL search requests from the parsed data. Essentially it treats the table or tab-delimited text as a multi-column database. This trick will be useful for adding rare languages to the WWL system with minimal effort.
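The wordlist trick can be sketched in a few lines of Python. This is a minimal illustration, assuming the cached list has already been reduced to tab-delimited text with a header row; the column names and the function names (load_wordlist, lookup) are hypothetical, and the target-language entries are placeholders rather than real translations.

```python
import csv
import io

# Placeholder data standing in for a cached tab-delimited wordlist.
# The target-language entries are dummies, not real translations.
WORDLIST_TSV = "english\ttarget\nwater\ttarget-word-1\nbread\ttarget-word-2\n"


def load_wordlist(text):
    """Parse tab-delimited text into a dict keyed by the lowercased source word."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return {row["english"].lower(): row["target"] for row in reader}


def lookup(index, word):
    """Return the cached translation, or None when the word is not listed."""
    return index.get(word.strip().lower())


index = load_wordlist(WORDLIST_TSV)
```

A gateway would call lookup() from its WWL search handler and wrap the result in a SOAP reply.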
See Also:
Gateway Servers (WWL and Other Dictionary Protocols)
Distributed Human Computing and the Worldwide Lexicon
Just creating a standard procedure for locating and communicating with existing dictionaries is a big improvement because it enables applications to access dictionaries and semantic networks via a standard interface. The process is nearly identical regardless of the language you use.
Now here is where things become really interesting. The second component of the Worldwide Lexicon is a distributed computing experiment (actually distributed human computing is a better term). Instead of tapping idle PCs to crunch numbers, this system will tap idle Internet users to contribute to the dictionaries participating in this network.
It sounds complicated, and you're probably thinking that this project will redefine the term bloatware. But one of the basic design guidelines used in the Worldwide Lexicon is to keep things as simple as possible. Developers like to build useful applications, not endlessly debug code. So this is also a straightforward feature to implement.
Lexicon@Home Client App
First, let's look at what you need to do to build a Lexicon@Home client application. This program is actually very simple. It does three things:
- It passively monitors your activity to sense when you appear to be casually surfing the Web (e.g. moving your mouse occasionally but not typing a great deal).
- When it senses you are idle, and subject to your preferences (e.g. prompt me up to four times per hour), it polls one or more WWL servers to ask if there are jobs enqueued. You decide which WWL servers you want to contribute to and set these preferences in the config screen for the app.
- If the WWL server has a job enqueued, it replies with a CGI URL. All the client app needs to do is point your browser (or an embedded mini-browser) at this URL. You fill in a short Web form, and you're done.
Sounds easy enough; let's look at a quick example.
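Public Function OnClientIdle()
    ' WWLserver_uri is assumed to be set elsewhere (e.g. from the config screen)
    Set pf = CreateObject("pocketsoap.Factory")
    Set wwl = pf.CreateProxy(WWLserver_uri)
    ' Ask the server whether it has a job waiting
    Set job = wwl.WWLRequest()
    If job.id > 0 Then
        ' Let the user accept or decline before fetching the job
        action = JobPendingDialog(job.message)
        If action = "ok" Then
            Set job_handle = wwl.WWLFetch(job.id)
            LaunchMiniBrowser(job_handle.url)
        Else
            wwl.WWLReject(job.id)
        End If
    End If
End Function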
As you can see, this is pretty simple. Of course, you'd probably want to add some other bells and whistles, like the ability to poll more than one WWL server for pending jobs (maybe you're trilingual and glad to translate English-German or English-Arabic terms).
Even with embellishments, this is still a pretty straightforward application to build since its behavior is simple, and it does not require a complicated user interface (data entry to the WWL server is done through a Web form served by the target WWL site).
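Two of those embellishments, the "up to four times per hour" preference and polling several servers, might look like the following Python sketch. The class and function names are hypothetical, and request_job is a caller-supplied stand-in for the WWLRequest() SOAP call, so nothing here touches the network.

```python
import time
from collections import deque


class PromptBudget:
    """Allow at most `limit` prompts within any rolling `window` seconds."""

    def __init__(self, limit=4, window=3600.0):
        self.limit = limit
        self.window = window
        self._stamps = deque()  # times of recent prompts, oldest first

    def has_room(self, now):
        # Forget prompts that have aged out of the window.
        while self._stamps and now - self._stamps[0] >= self.window:
            self._stamps.popleft()
        return len(self._stamps) < self.limit

    def record(self, now):
        self._stamps.append(now)


def poll_servers(servers, request_job, budget, now=None):
    """Ask each server in turn for a job; return (server, job) or None.

    `request_job` stands in for the WWLRequest() call and should return
    None when the server has nothing enqueued.
    """
    now = time.time() if now is None else now
    if not budget.has_room(now):
        return None  # the user has been prompted enough for now
    for server in servers:
        job = request_job(server)
        if job is not None:
            budget.record(now)  # only a real prompt counts against the budget
            return server, job
    return None
```

Checking the budget before polling means the client leaves the servers alone once the user has hit their hourly limit.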
This can also be embedded in or bundled with another widely deployed piece of software. Instant messaging clients are a good example. Millions of people use them on a daily basis. Users typically run IM software whenever they are online. These programs also include presence awareness features (Yahoo's IM client, for example, senses when you are probably away from your PC and notifies other users of this).
If hooks to WWL were embedded in widely deployed client programs such as IM clients, smart cursors, etc., the system could reach a large user population relatively easily; we're talking millions of part-time users. (Hint to any readers who work for companies that produce such software.)
So now we've demonstrated that building a Lexicon@Home client app is not a big deal. The server side of this equation must be a nightmare, or there has to be a catch somewhere, right?
Nope. If you can write CGI scripts that read and write to a database (you already did that when you built your dictionary server, right?), you know how to update your WWL server so that it can accept user contributions. There are several ways to do this. Next, we'll consider a couple of examples.
Public WWL Dictionary With Editorial Review
Let's suppose you like the idea of allowing the general public to add to your dictionary, yet you want to retain control over new submissions. Implementing this capability is also fairly easy to do.
To do this you'll need to write one SOAP script, and two conventional CGI/ASP scripts.
The SOAP script responds to the WWLRequest() and WWLFetch() methods. When a client invokes the WWLRequest() method, you reply with a message that includes:
- A transaction number unique to that job
- A short text message (e.g. "translate: english->spanish : bounce")
The client invokes the WWLFetch() method if the user clicks the OK button in the popup dialog box prompting him to process this request. The SOAP script responds to WWLFetch() with a URL that points to the CGI script (e.g. http://www.yourserver.com/cgi-bin/wwl-post.pl?jobid=90212&randomkey=4450192353).
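The server side of these two calls could be sketched as follows in Python. The function names and the in-memory pending table are hypothetical, and the use of randomkey to check that a posted form matches the job that was handed out is an assumption; the text only shows that the URL carries such a key.

```python
import secrets
from urllib.parse import urlencode

# Base URL taken from the example in the text; substitute your own script.
POST_SCRIPT = "http://www.yourserver.com/cgi-bin/wwl-post.pl"

pending = {}  # job id -> {"message": ..., "key": ...}


def offer_job(job_id, message):
    """Build the WWLRequest() reply: a unique job id plus a short message."""
    pending[job_id] = {"message": message, "key": None}
    return {"id": job_id, "message": message}


def fetch_job_url(job_id):
    """Build the WWLFetch() reply: a form URL carrying the job id and a one-off key."""
    key = secrets.randbelow(10**10)  # unpredictable, so URLs cannot be guessed
    pending[job_id]["key"] = key
    return POST_SCRIPT + "?" + urlencode({"jobid": job_id, "randomkey": key})


def key_is_valid(job_id, key):
    """Check a posted form against the key issued for that job (assumed use)."""
    job = pending.get(job_id)
    return job is not None and job["key"] is not None and job["key"] == key
```

The data entry script would call key_is_valid() before storing anything, which gives a cheap defense against forms posted for jobs that were never issued.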
The first CGI script generates a data entry form that prompts the user to provide the requested information. Remember the Lexicon@Home client app is a dumb program. All it does is fetch jobs from your server, and point a browser to a target URL.
You decide what data entry fields to present. So this form may ask the user to fully conjugate a verb, or it may simply ask for a single text entry. You decide what works best for your user community, and what works within the constraints of your existing internal database. This CGI script stores the posted data in your database, but flags newly added records so that they do not appear in your live system.
The second CGI script is a private script that allows your editors or trusted users to pull up a list of recent contributions, and to accept or reject contributions (e.g. display 20 entries per page, with accept/reject checkboxes next to each).
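The flag-then-review flow behind the two CGI scripts can be illustrated with a small Python sketch using an in-memory SQLite database. The table layout, column names, and function names are all assumptions for illustration; your existing database schema will dictate the real ones.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE entries ("
    " id INTEGER PRIMARY KEY,"
    " term TEXT NOT NULL,"
    " translation TEXT NOT NULL,"
    " status TEXT NOT NULL DEFAULT 'pending')"  # pending / accepted / rejected
)


def submit(term, translation):
    """Store a contribution; it stays hidden until an editor accepts it."""
    cur = db.execute(
        "INSERT INTO entries (term, translation) VALUES (?, ?)", (term, translation)
    )
    return cur.lastrowid


def pending_page(page=0, per_page=20):
    """List one page of unreviewed contributions for the editors' script."""
    return db.execute(
        "SELECT id, term, translation FROM entries WHERE status = 'pending'"
        " ORDER BY id LIMIT ? OFFSET ?",
        (per_page, page * per_page),
    ).fetchall()


def review(entry_id, accept):
    """Record an editor's accept/reject decision."""
    db.execute(
        "UPDATE entries SET status = ? WHERE id = ?",
        ("accepted" if accept else "rejected", entry_id),
    )


def live_lookup(term):
    """What the public dictionary sees: accepted entries only."""
    rows = db.execute(
        "SELECT translation FROM entries WHERE term = ? AND status = 'accepted'",
        (term,),
    )
    return [r[0] for r in rows]
```

Keeping the status in a single column means the live query needs only one extra condition, and rejected entries remain available if you later want to audit contributors.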
So once again, this is not a colossal task. Of course, you'll need to consider some additional issues, including:
- Figuring out the best way to incorporate user submissions into your database.
- Assigning a time limit to jobs so that if a client doesn't respond to a job, the system adds it back to the queue.
- Blocking or ignoring repetitive WWLRequest() calls, either due to bad client software or malicious users.
- Blocking or ignoring submissions from IP addresses known for low quality or bogus contributions.
- Providing private CGI scripts that allow editors to access internal data, consolidate entries (i.e. link entries for different forms of the same word), etc.
- Allowing users to flag entries for case-by-case editorial review (optional).
- Replacing a Boolean accept/reject option with a more flexible scoring system (optional, if you have many editors).
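The time limit on jobs is one of the easier items on this list to sketch. The Python below is a minimal illustration, assuming jobs are identified by simple ids and that ten minutes is a reasonable default; the class name, method names, and timeout are hypothetical.

```python
import time
from collections import deque


class JobQueue:
    """Hand out jobs with a deadline; overdue jobs return to the queue."""

    def __init__(self, job_ids, lease_seconds=600.0):
        self.waiting = deque(job_ids)
        self.leased = {}  # job id -> deadline (seconds since the epoch)
        self.lease_seconds = lease_seconds

    def _requeue_expired(self, now):
        # Copy the items first so the dict can be modified while looping.
        for job_id, deadline in list(self.leased.items()):
            if now >= deadline:
                del self.leased[job_id]
                self.waiting.append(job_id)

    def checkout(self, now=None):
        """Return the next job id, or None when nothing is waiting."""
        now = time.time() if now is None else now
        self._requeue_expired(now)
        if not self.waiting:
            return None
        job_id = self.waiting.popleft()
        self.leased[job_id] = now + self.lease_seconds
        return job_id

    def complete(self, job_id):
        """Mark a job as done once the user's form has been posted."""
        return self.leased.pop(job_id, None) is not None
```

Expired jobs are swept lazily on each checkout, so no background timer is needed; a CGI-style server that only runs per request can use the same approach with the lease table kept in its database.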
So that's easy enough to accomplish, and for most dictionaries this is probably all you need to do. Now for the really fun stuff.
