Building a Worldwide Lexicon
by Brian Jepson

Automated WWL Dictionary With User Peer Review

Now let's consider the task of building a Worldwide Lexicon server that not only allows users to contribute new listings, but also automates the process of screening and ranking new submissions. Let's imagine for the sake of this example that you are hosting a multilingual dictionary for slang and sexual terminology (perfect for use in chat applications). Human editors would be overwhelmed by the onslaught of naughty words, so you need to create an automated peer review process to reduce their workload.

The process of building this type of server is very similar to the editor-controlled example we just described. The primary difference is that this system will also prompt end users to score contributions from other users. The behavior of the Lexicon@Home client app is identical. It uses WWLRequest() and WWLFetch() to poll your server for job information and target CGI URLs that point to data entry forms.

The server application that processes the WWLRequest() and WWLFetch() methods behaves slightly differently. It will ask some users to enter data (e.g., "translate english-spanish 'woody'"). It will ask others to score recent submissions.

Thus some users will see a Web data entry form in their client application, and others will see a form that prompts them to score or comment on a recent entry. Ideally, you will collect numerous votes for each entry so that the resulting average score is a reliable indicator of definition or translation quality.

(NOTE: It is important to dispatch requests to score entries to randomly chosen users -- this will make it harder for hostile contributors to play games with the scoring system).
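As a minimal sketch of that random dispatch, the server could pick scorers like this. The function name, the user names, and the rule that a contributor never scores his own entry are assumptions made for illustration, not part of the WWL design.

```python
import random

def pick_reviewers(active_users, contributor, k, rng=random):
    """Pick up to k distinct reviewers at random.

    Assumption: the user who submitted the entry is never asked
    to score it.
    """
    candidates = [user for user in active_users if user != contributor]
    # sample() draws without replacement, so no reviewer is chosen twice.
    return rng.sample(candidates, min(k, len(candidates)))

# Hypothetical users currently polling the server.
reviewers = pick_reviewers(["ana", "ben", "cho", "dev"], "ben", 2)
```

Because the choice is made on the server at request time, a contributor cannot predict or arrange who will be asked to score a given entry.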

This process does not need to be entirely automated. You can still allow human editors to intervene. Human editors can focus their time on entries that have ambiguous scores. For example, an entry like "'soccer' ~ 'i will not accept this tobacconist, it is broken'" will probably receive a very low score, and can be automatically filtered.

Likewise, an obviously accurate translation will receive high scores, and can be accepted without being held for editorial review. You can design your private editor's CGI script to display entries that are neither good nor bad. This will make optimal use of your limited time, without delaying most user contributions.
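The filtering described above might be sketched as follows. The 1-5 scoring scale, the thresholds, and the minimum vote count are all assumed values chosen for illustration; a real server would tune them.

```python
from statistics import mean

# Assumed values: scores run from 1 (bad) to 5 (good).
REJECT_BELOW = 2.0
ACCEPT_ABOVE = 4.0
MIN_VOTES = 5

def triage(votes):
    """Decide what to do with an entry based on its peer review votes."""
    if len(votes) < MIN_VOTES:
        return "pending"          # too few votes for a reliable average
    average = mean(votes)
    if average < REJECT_BELOW:
        return "reject"           # filtered automatically
    if average > ACCEPT_ABOVE:
        return "accept"           # published without editorial review
    return "editor_review"        # ambiguous, shown to a human editor

print(triage([1, 1, 2, 1, 1]))    # reject
print(triage([5, 5, 4, 5, 5]))    # accept
print(triage([3, 4, 2, 3, 3]))    # editor_review
```

Only the entries that come back as "editor_review" would need to appear in the private editor's CGI script.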

WWL Dictionary With Real-Time Human Assistance (Jabber and SMS)

Let's suppose a user requests a definition for a word that is not in your WWL dictionary. Instead of replying with a "record not found" error, you can relay this query to human volunteers who are logged into a Jabber server (which in turn is linked to a Jabber-WWL gateway server).

These volunteers receive an instant message from the gateway server. If you were monitoring the chat conversation or IRC channel, you might see something like this:


wwlsrv9102: trans de-->en schadenfreude
wilson: ?
robert: ?
brian: dnd
vera: secret malicious pleasure

The gateway server listens for replies to its outgoing message. The volunteers simply reply to it as they would to any other instant message. The gateway server uses the replies to generate a response, which is sent back to the WWL client via the SOAP interface.
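How the gateway picks a usable reply is not specified here. One simple approach, sketched below, is to take the first reply that is not a shrug or a "do not disturb". The list of non-answers and the function name are assumptions for illustration.

```python
# Assumed markers that mean "I don't know" or "do not disturb".
NON_ANSWERS = {"?", "dnd"}

def first_answer(replies):
    """Return (sender, text) for the first substantive reply, or None."""
    for sender, text in replies:
        cleaned = text.strip()
        if cleaned and cleaned.lower() not in NON_ANSWERS:
            return sender, cleaned
    return None

# The replies from the conversation shown above.
replies = [
    ("wilson", "?"),
    ("robert", "?"),
    ("brian", "dnd"),
    ("vera", "secret malicious pleasure"),
]
print(first_answer(replies))  # ('vera', 'secret malicious pleasure')
```

A more careful gateway might wait for several replies and compare them, but that would add to the delay seen by the client.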

What's especially nifty is that the WWL client application does not need to do anything special. It simply invokes a WWLTranslate() method. In many cases, it may not know the response came from a human volunteer who was conferenced in via Jabber to provide a quick translation or definition (the only difference between this and a purely automated query is that there may be a time delay in the response).

This capability enables WWL server owners to do some really interesting things. First, it allows a dictionary to adapt to the needs of real users, since it learns definitions and translations for the queries that WWL clients actually submit. Second, it creates the illusion that the dictionary is larger than it actually is. As long as there are volunteers logged in to accept real-time queries, the system will be able to broker on-the-fly translations and definitions for queries that have not yet been catalogued (or at least try to find an answer).
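The "learning" behavior can be sketched in a few lines. This is only an illustration of the idea, not the WWLTranslate() implementation: the function name, the in-memory dictionary, and the ask_volunteers callback (standing in for the Jabber gateway) are all hypothetical.

```python
def translate_with_fallback(word, dictionary, ask_volunteers):
    """Answer from the dictionary if possible, otherwise ask a human.

    ask_volunteers is a hypothetical callback that relays the query to
    logged-in volunteers and returns their answer, or None if nobody replies.
    """
    if word in dictionary:
        return dictionary[word]
    answer = ask_volunteers(word)
    if answer is not None:
        # Remember the answer so the next query is served automatically.
        dictionary[word] = answer
    return answer

dictionary = {}
asked = []

def fake_volunteers(word):
    # Stand-in for the gateway: records the query and returns a canned reply.
    asked.append(word)
    return "secret malicious pleasure"

translate_with_fallback("schadenfreude", dictionary, fake_volunteers)
translate_with_fallback("schadenfreude", dictionary, fake_volunteers)
print(len(asked))  # 1, the second query never reached the volunteers
```

A production server would presumably route new answers through the scoring process before trusting them.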

This will become especially interesting if developers create Jabber clients that are aware of the WWL system, and that can automatically locate WWL-Jabber gateways. For example, a multilingual chat client, upon startup, might ask you if you are willing to provide real-time translations for other users. If you answer yes, the client uses WWLFindServers() to locate a real-time WWL server for your language(s), and signs into it. If such a feature became common, this would create a large pool of human volunteers who could provide on-demand word, phrase, and full-text translations.

NOTE: This will also be easy to implement via SMS (Short Message Service), although the round-trip time may not be fast enough for real-time queries (it depends on message delivery times). SMS is interesting because it will enable WWL servers to tap not just PC users to contribute definitions, but also millions of cellular phone users worldwide. This will be especially useful in Asia and Europe, where wireless communication is more prevalent than landline communication.

Worldwide Lexicon Applications

One of the most obvious applications for the Worldwide Lexicon is a Web browser or text editor plug-in that allows users to fetch definitions for words and phrases on the fly. This isn't really a new application; these types of plug-ins have been available for a while. The problem is that most dictionaries do not use a standard format, so each plug-in is usually tied to a specific dictionary.

The browser or text editor plug-in is useful, but it's also pretty simple to build (see earlier example), and not particularly challenging. What is interesting about WWL, especially the distributed human computing aspect of it, is that it can be used to create and maintain dictionaries with continually evolving vocabularies, and even to process documents. It also creates a dictionary and translation API that can be used in a wide range of applications (with minimal cost and effort).

Some of the most interesting applications for the Worldwide Lexicon blur the line between human and machine. The WWL system uses computers where they excel (memorizing and searching large amounts of information), and people where they excel (inferring meaning, understanding metaphors, etc.).

The WWL distributed computing project is, in a sense, a cyborg project. The computers enable WWL clients to locate and query a WWL server in a fraction of a second, yet they also create an efficient human-machine interface that taps human capabilities when they are needed.

News and Document Translation

Imagine being able to read good quality translations for popular news sources, magazines, and short stories. These translations would not be produced by an automated program, but rather by bilingual contributors throughout the world.

The Lexicon@Home client described earlier can just as easily be used to prompt users to translate full sentences, paragraphs, or short documents. Users could volunteer to translate not only words, but also chunks of text from news sites, online magazines, and other sources.

Each volunteer translates a small block of text, perhaps just a few sentences. This process, called segmentation, is often used in document translation. The new twist here is the use of a very large and open user community. With enough participants to translate and cross-check each other's work, a network of volunteer translators could process a prodigious stream of articles.

As with the earlier example, the key to making such a system work is to take a large task (in this case translating a document, perhaps an essay from a magazine), and divide it into manageable pieces that are then parceled out to many contributors in a convenient and non-obtrusive way. Each contributor is asked to translate a small block of text, something a proficient bilingual or multilingual user will be able to do with minimal effort.

These block translations are stitched together to form complete documents, and are then available to any Internet user (most of whom will not be aware the Worldwide Lexicon is involved in their translation). This information can be used to produce a directory of human translated news stories, essays, magazine articles, and other recent publications. Any Web document perceived to have value could be flagged for processing.
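A rough sketch of the segment-and-stitch cycle follows. The sentence splitter is deliberately naive (it breaks on ., ! and ? followed by whitespace), and the function names and block size are assumptions for illustration.

```python
import re

def segment(text, sentences_per_block=2):
    """Split a document into small blocks of a few sentences each."""
    # Naive splitter: break after ., ! or ? when followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return [
        " ".join(sentences[i:i + sentences_per_block])
        for i in range(0, len(sentences), sentences_per_block)
    ]

def stitch(translations, block_count):
    """Join translated blocks in order, or return None if any is missing.

    translations maps a block's index to its translated text.
    """
    if any(i not in translations for i in range(block_count)):
        return None
    return " ".join(translations[i] for i in range(block_count))

blocks = segment("One. Two! Three? Four.")
print(blocks)  # ['One. Two!', 'Three? Four.']

# Volunteers may finish out of order; stitching restores the sequence.
print(stitch({1: "Tres? Cuatro.", 0: "Uno. Dos!"}, len(blocks)))
# Uno. Dos! Tres? Cuatro.
```

Keying each translation by its block index lets contributors work in parallel and return results in any order.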

Such a system would not translate every document on the Web, nor would it guarantee perfect translations (though they would be superior to automatic translations). It would focus on Web sites or documents that have high current value, such as news stories, important essays, and other sources flagged by users. A user-driven system like Slashdot might play a role in ranking today's most interesting URLs.

Of particular importance is the fact that the output from this application is simple HTML, and will be accessible to any Web browser. Readers will not need to know anything about the Worldwide Lexicon to benefit from this service. They'll simply bookmark a site that serves as a portal to these translated articles and sites.

In addition, this application could catalog a large amount of data, specifically word and phrase translations, that can be indexed and shared with other WWL dictionary servers. So not only will a text server process full texts, it can also catalog translations for words, phrases, and sentence fragments that can then be replicated to other WWL servers.

You're probably thinking that this is a pretty complicated project. In fact it's fairly easy to build, primarily because the most difficult computation (translation) is handed off to a general purpose saltwater computer (a person). A system like this can be built with a collection of text processing and data entry scripts that each perform specific tasks (and communicate with each other via a shared internal database).

Twext (http://twext.cc) offers a glimpse of what such a system might look like. Twext is a prototype for a system that would translate lyrics for songs via a similar segmentation process.
