Google's Gaffe
by Paul Prescod
Handling the Response
Here is the Google SOAP API's response message (formatted for readability; the original is also available).
HTTP/1.1 200 OK
Date: Thu, 18 Apr 2002 01:41:08 GMT
Content-Length: 1325
Content-Type: text/xml; charset=utf-8
<?xml version='1.0' encoding='UTF-8'?>
<SOAP-ENV:Envelope
xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/"
xmlns:xsi="http://www.w3.org/1999/XMLSchema-instance"
xmlns:xsd="http://www.w3.org/1999/XMLSchema">
<SOAP-ENV:Body>
<ns1:doGoogleSearchResponse
xmlns:ns1="urn:GoogleSearch"
SOAP-ENV:encodingStyle="http://schemas.xmlsoap.org/soap/encoding/">
<return xsi:type="ns1:GoogleSearchResult">
<documentFiltering
xsi:type="xsd:boolean">false</documentFiltering>
<estimatedTotalResultsCount
xsi:type="xsd:int">0</estimatedTotalResultsCount>
<directoryCategories
xmlns:ns2="http://schemas.xmlsoap.org/soap/encoding/"
xsi:type="ns2:Array"
ns2:arrayType="ns1:DirectoryCategory[0]">
</directoryCategories>
<searchTime xsi:type="xsd:double">0.071573</searchTime>
<resultElements
xmlns:ns3="http://schemas.xmlsoap.org/soap/encoding/"
xsi:type="ns3:Array"
ns3:arrayType="ns1:ResultElement[0]">
</resultElements>
<endIndex xsi:type="xsd:int">0</endIndex>
<searchTips xsi:type="xsd:string"></searchTips>
<searchComments xsi:type="xsd:string"></searchComments>
<startIndex xsi:type="xsd:int">0</startIndex>
<estimateIsExact
xsi:type="xsd:boolean">false</estimateIsExact>
<searchQuery
xsi:type="xsd:string">constantrevolution
rules xml</searchQuery>
</return>
</ns1:doGoogleSearchResponse>
</SOAP-ENV:Body>
</SOAP-ENV:Envelope>
Note that the resultElements element is empty
because we asked for zero hits. I warned you that we would have
quite a bit of XML to work with even without looking at any
hits.
HTTP allows any media type in the response to a message, so the
service could return any XML vocabulary whatsoever. The WSDL definition of
the Google SOAP API already embeds a simple schema. I will use that
as the basis of a schema for responses to search requests. I merely
have to remove a few SOAP-isms (the
SOAP-Enc:arrayType attribute, the
SOAP-ENV:Envelope element, etc.) and choose a new root
element type (I chose searchResult).
I've called the new language GoogleML. I've written a tiny XSLT stylesheet, pureGoogle.xsl, which translates GoogleSOAP into GoogleML. The resulting documents are smaller and simpler.
And here's a GoogleML equivalent to the SOAP response:
HTTP/1.1 200 OK
Date: Thu, 18 Apr 2002 02:29:56 GMT
Content-Type: text/xml
<searchResult xmlns="http://www.prescod.net/google_search_result">
<documentFiltering>false</documentFiltering>
<estimatedTotalResultsCount>0</estimatedTotalResultsCount>
<directoryCategories/>
<searchTime>0.03261</searchTime>
<resultElements/>
<endIndex>0</endIndex>
<searchTips/>
<searchComments/>
<startIndex>0</startIndex>
<estimateIsExact>false</estimateIsExact>
<searchQuery>constantrevolution rules xml</searchQuery>
</searchResult>
HTTP already provides the envelope, so the SOAP-ENV:Envelope would be redundant. The types are known in advance, so they are stripped (although I could just as easily have left them in). GoogleML documents are not constrained to the subset of XML supported by SOAP. They may use any feature in standard XML, including a DOCTYPE, a DTD and processing instructions.
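The translation itself is small. The following is a minimal Python sketch of the same idea, not the pureGoogle.xsl stylesheet: it unwraps the envelope, drops every attribute (which removes the xsi:type and arrayType annotations), and re-roots the fields under searchResult. The function names and the shortened SAMPLE message are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

SOAP_ENV = "http://schemas.xmlsoap.org/soap/envelope/"
GOOGLEML_NS = "http://www.prescod.net/google_search_result"

# Shortened stand-in for the SOAP response shown earlier (illustrative only).
SAMPLE = """<SOAP-ENV:Envelope
    xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/"
    xmlns:xsi="http://www.w3.org/1999/XMLSchema-instance"
    xmlns:xsd="http://www.w3.org/1999/XMLSchema">
  <SOAP-ENV:Body>
    <ns1:doGoogleSearchResponse xmlns:ns1="urn:GoogleSearch">
      <return xsi:type="ns1:GoogleSearchResult">
        <estimatedTotalResultsCount xsi:type="xsd:int">0</estimatedTotalResultsCount>
        <searchTime xsi:type="xsd:double">0.071573</searchTime>
        <searchQuery xsi:type="xsd:string">constantrevolution rules xml</searchQuery>
      </return>
    </ns1:doGoogleSearchResponse>
  </SOAP-ENV:Body>
</SOAP-ENV:Envelope>"""


def local_name(tag):
    """Drop the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def copy_plain(src, parent):
    """Copy src under parent in the GoogleML namespace, without attributes."""
    dst = ET.SubElement(parent, f"{{{GOOGLEML_NS}}}{local_name(src.tag)}")
    dst.text = (src.text or "").strip() or None
    for child in src:
        copy_plain(child, dst)


def soap_to_googleml(soap_xml):
    """Unwrap Envelope/Body/response/return and re-root under searchResult."""
    envelope = ET.fromstring(soap_xml)
    body = envelope.find(f"{{{SOAP_ENV}}}Body")
    result = body[0].find("return")  # the unqualified <return> element
    root = ET.Element(f"{{{GOOGLEML_NS}}}searchResult")
    for child in result:
        copy_plain(child, root)
    return root


googleml = soap_to_googleml(SAMPLE)
ET.register_namespace("", GOOGLEML_NS)  # serialize with a default namespace
print(ET.tostring(googleml, encoding="unicode"))
```

Dropping all attributes is the bluntest possible rule; a real converter might keep the type annotations, as noted above.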
Handling the Other Methods
Next let's look at the getCachedPage method. Here is the original SOAP request (again formatted, original here).
POST /search/beta2 HTTP/1.1
Host: api.google.com
Accept-Encoding: identity
Content-Length: 577
SOAPAction: urn:GoogleSearchAction
Content-Type: text/xml; charset=utf-8
<?xml version='1.0' encoding='UTF-8'?>
<SOAP-ENV:Envelope
xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/"
xmlns:xsi="http://www.w3.org/1999/XMLSchema-instance"
xmlns:xsd="http://www.w3.org/1999/XMLSchema">
<SOAP-ENV:Body>
<ns1:doGetCachedPage xmlns:ns1="urn:GoogleSearch"
SOAP-ENV:encodingStyle="http://schemas.xmlsoap.org/soap/encoding/">
<key xsi:type="xsd:string">0000</key>
<url xsi:type="xsd:string">
http://www.constantrevolution.com</url>
</ns1:doGetCachedPage>
</SOAP-ENV:Body>
</SOAP-ENV:Envelope>
Now compare that to an HTTP-style cachedPage request:
GET /cgi/cached_page.py?key=0000&url=http%3A%2F%2Fwww.constantrevolution.com HTTP/1.0
Host: mymachine
User-agent: Python-urllib/1.15
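For comparison, here is roughly what a client needs in order to build that request line. This is a sketch using Python's standard urllib.parse module; the helper name cached_page_path is hypothetical, and the /cgi/cached_page.py path is simply the one from the request above.

```python
from urllib.parse import urlencode


def cached_page_path(key, url, base="/cgi/cached_page.py"):
    """Build the request path, percent-encoding the query parameters."""
    return base + "?" + urlencode({"key": key, "url": url})


path = cached_page_path("0000", "http://www.constantrevolution.com")
print(path)
```

The whole request is two named parameters in a URI, which is why it can be bookmarked, linked to, or typed into a browser.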
Believe it or not, there is an even more dramatic improvement in the response. Here is the SOAP response (I have elided the majority of the embedded base64-encoded data and reformatted; unformatted version).
HTTP/1.1 200 OK
Date: Thu, 18 Apr 2002 03:01:40 GMT
Content-Length: 10744
Content-Type: text/xml; charset=utf-8
<?xml version='1.0' encoding='UTF-8'?>
<SOAP-ENV:Envelope
xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/"
xmlns:xsi="http://www.w3.org/1999/XMLSchema-instance"
xmlns:xsd="http://www.w3.org/1999/XMLSchema">
<SOAP-ENV:Body>
<ns1:doGetCachedPageResponse
xmlns:ns1="urn:GoogleSearch"
SOAP-ENV:encodingStyle="http://schemas.xmlsoap.org/soap/encoding/">
<return
xmlns:ns2="http://schemas.xmlsoap.org/soap/encoding/"
xsi:type="ns2:base64">PG1lEFESF132FE...</return>
</ns1:doGetCachedPageResponse>
</SOAP-ENV:Body>
</SOAP-ENV:Envelope>
The SOAP API must base64-encode the data because the result is HTML (not well-formed XML) and perhaps could eventually include binary formats like Word documents.
Here is the HTTP-style response:
HTTP/1.1 200 OK
Date: Thu, 18 Apr 2002 03:44:34 GMT
Content-Type: text/html
<html><head>...</head></html>
Because HTTP has no problem directly embedding HTML (or any other textual or binary data type), there is no reason to base64-encode the data. Base64 output is always larger than the unencoded data (roughly a third larger, since every three bytes become four characters), so the HTTP/URI version of the service will always save bandwidth and CPU power. By the way, Google already provides a service to get a cached page using exactly this technique.
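The size penalty is easy to measure. The sketch below uses Python's standard base64 module on a made-up page body; the page contents and the helper name are assumptions, and the SOAP envelope around the encoded data would add further bytes on top.

```python
import base64


def base64_overhead(payload: bytes) -> float:
    """Return encoded size divided by raw size for one payload."""
    return len(base64.b64encode(payload)) / len(payload)


# A stand-in for a cached HTML page (the contents are illustrative only).
page = b"<html><head></head><body>" + b"x" * 3000 + b"</body></html>"

encoded = base64.b64encode(page)
print(len(page), len(encoded), round(base64_overhead(page), 3))
```

Every group of three input bytes becomes four output characters, so the encoded form is about 4/3 the size of the original, before counting the work of encoding and decoding it.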
For brevity, I'll skip comparing the
doSpellingSuggestion requests and go directly to the
responses. Consider the old-style SOAP doSpellingSuggestion
response (unformatted version):
HTTP/1.1 200 OK
Date: Thu, 18 Apr 2002 03:01:40 GMT
Content-Length: 10744
Content-Type: text/xml; charset=utf-8
<?xml version='1.0' encoding='UTF-8'?>
<SOAP-ENV:Envelope
xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/"
xmlns:xsi="http://www.w3.org/1999/XMLSchema-instance"
xmlns:xsd="http://www.w3.org/1999/XMLSchema">
<SOAP-ENV:Body>
<ns1:doSpellingSuggestionResponse
xmlns:ns1="urn:GoogleSearch"
SOAP-ENV:encodingStyle="http://schemas.xmlsoap.org/soap/encoding/">
<return xsi:type="xsd:string">britney spears</return>
</ns1:doSpellingSuggestionResponse>
</SOAP-ENV:Body>
</SOAP-ENV:Envelope>
Compare that with the HTTP version:
HTTP/1.1 200 OK
Date: Thu, 18 Apr 2002 04:08:27 GMT
Content-Type: text/plain
britney spears
I could have used XML for the response, but it's overkill for this job.
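The difference shows up on the client side as well. As a rough Python sketch, here is what it takes to pull the suggestion out of each response body. The function names are hypothetical, and decoding the plain-text body as UTF-8 is an assumption, since the response above declares no charset.

```python
import xml.etree.ElementTree as ET

SOAP_BODY = b"""<?xml version='1.0' encoding='UTF-8'?>
<SOAP-ENV:Envelope
    xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/"
    xmlns:xsi="http://www.w3.org/1999/XMLSchema-instance"
    xmlns:xsd="http://www.w3.org/1999/XMLSchema">
  <SOAP-ENV:Body>
    <ns1:doSpellingSuggestionResponse xmlns:ns1="urn:GoogleSearch"
        SOAP-ENV:encodingStyle="http://schemas.xmlsoap.org/soap/encoding/">
      <return xsi:type="xsd:string">britney spears</return>
    </ns1:doSpellingSuggestionResponse>
  </SOAP-ENV:Body>
</SOAP-ENV:Envelope>"""

PLAIN_BODY = b"britney spears"


def suggestion_from_soap(body: bytes) -> str:
    """Parse the envelope and dig out the text of the <return> element."""
    root = ET.fromstring(body)
    return root.find(".//return").text


def suggestion_from_plain(body: bytes, charset: str = "utf-8") -> str:
    """The text/plain body is the suggestion itself (charset is assumed)."""
    return body.decode(charset).strip()


print(suggestion_from_soap(SOAP_BODY))
print(suggestion_from_plain(PLAIN_BODY))
```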
Statically Declaring Types
Many people assume that the difference between HTTP-based services and RPC-based services is that the former are loosely or dynamically typed and the latter are strongly or statically typed. That's not especially true. The choice between HTTP and SOAP is a choice between protocols. The decision to statically type-check information passing across the wire has more to do with service description. The two issues are totally separate.
I've already demonstrated how one can strongly type-declare the responses to HTTP-based services using W3C XML Schema. If you want to type-declare the URI query parameters, then you can use a language designed for type-declaring HTTP-based services like Web Resource Description Language (WRDL). WRDL is still under development, but you can already solve this problem today using the more popular WSDL.
Although WSDL is most often used with SOAP, it can in fact
type-declare the parameters for simple HTTP services. Here is the
relevant bit of a WSDL for my HTTP version of the
doGetCachedPage method:
<operation name="doGetCachedPage">
<http:operation location="/cached_page"/>
<input>
<http:urlEncoded/>
</input>
<output>
<mime:content type="text/html"/>
</output>
</operation>
It turns out that WSDL's handling of <http:urlEncoded/>
works almost perfectly. It gets the parameter names from the
operation's part names. If you know WSDL, then that will probably be
clear to you. If not, don't worry about it; it won't affect your
understanding of what follows. The input description for the other two
methods is identical, so we will concentrate on the
output elements.
The output for the doSpellingSuggestion has a
media-type of "text/plain":
<output><mime:content type="text/plain"/></output>
Finally, the output for doGoogleSearch is in XML so
we declare that directly:
<output><mime:mimeXml part="searchResult"/></output>
This refers to a part named searchResult which is
based upon an element type of the same name.
I do not want to oversell WSDL's HTTP features. You cannot define sophisticated HTTP-based web services with WSDL. It falls down as soon as a web resource generates links to another web resource. WSDL cannot express the data type of the target resource. In other words it can describe only one resource and not the links between resources. SOAP lacks a first-class concept of resource and especially lacks a syntax for linking them. It is thus not surprising that WSDL inherits this flaw. Nevertheless, I hope that this weakness will be corrected in future versions of WSDL. In the meantime, this is the primary reason for the existence of the WRDL language.
But you do not have to wait for WRDL. WSDL is sufficient for Google's current API because the API does not make use of hyperlinks. This is a common failing of SOAP-based APIs which follows from the component-centric thinking that SOAP encourages.
What this means practically is that it is possible to generate statically typed APIs for languages like C# and Java for the Google/HTTP interface. For instance, I can generate a C# interface from an HTTP-based WSDL description. The generated code is functionally identical to the SOAP version and can expose exactly the same strongly typed interface as the C# generated from the SOAP/WSDL description.
Here is the strongly typed C# code to do a Google search. The only thing that has changed from the version shipped with the Google API is the class name:
PureXMLGoogleHTTPBinding s = new
PureXMLGoogleHTTPBinding();
// Invoke the search method
GoogleSearchResult r = s.doGoogleSearch(keyBox.Text,
searchBox.Text,
0, 1, false, "",
false, "", "", "");
Strong type checking (in C# and VB.NET) and easy API use (in a language like Python or Perl) are completely unrelated to the choice of protocol. It is .NET's WSDL and XML Schema features which handle the strong type checking, not its SOAP features. According to Don Box (one of the key inventors of SOAP), "At this point, I believe most SOAP plumbers have conceded that XML Schema will be the dominant type system and metadata format for interop."
HTTP works equally well with XML Schema, RELAX NG, Schematron and DT4DTDs because it strongly separates protocol data (which is in MIME format) from payload data (which is in XML).
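To illustrate that the typing lives in the payload description rather than in the protocol, here is a hand-written Python sketch that maps a GoogleML document onto a typed record. It is not generated from the WSDL or the schema; the SearchResult class, its field subset, and the helper names are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass

NS = {"g": "http://www.prescod.net/google_search_result"}

GOOGLEML = """<searchResult xmlns="http://www.prescod.net/google_search_result">
  <estimatedTotalResultsCount>0</estimatedTotalResultsCount>
  <searchTime>0.03261</searchTime>
  <estimateIsExact>false</estimateIsExact>
  <searchQuery>constantrevolution rules xml</searchQuery>
</searchResult>"""


@dataclass
class SearchResult:
    """A subset of the GoogleML fields, each with a declared type."""
    estimated_total_results_count: int
    search_time: float
    estimate_is_exact: bool
    search_query: str


def parse_bool(text: str) -> bool:
    """Accept the XML Schema boolean lexical forms."""
    if text in ("true", "1"):
        return True
    if text in ("false", "0"):
        return False
    raise ValueError(f"not a boolean: {text!r}")


def parse_search_result(xml_text: str) -> SearchResult:
    root = ET.fromstring(xml_text)

    def field(name: str) -> str:
        element = root.find(f"g:{name}", NS)
        if element is None:
            raise ValueError(f"missing element: {name}")
        return (element.text or "").strip()

    # int() and float() raise ValueError when a value has the wrong type.
    return SearchResult(
        estimated_total_results_count=int(field("estimatedTotalResultsCount")),
        search_time=float(field("searchTime")),
        estimate_is_exact=parse_bool(field("estimateIsExact")),
        search_query=field("searchQuery"),
    )


result = parse_search_result(GOOGLEML)
print(result)
```

Nothing in this code knows or cares whether the document arrived over plain HTTP or inside a SOAP envelope.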
In the spirit of truth in advertising, to get it actually working I did have to work around several bugs in Microsoft's WSDL toolkit. I hope that WSDL implementors will take its HTTP binding seriously and implement it properly. Once the bugs are fixed, Microsoft's WSDL/HTTP interface to the service for C# and VB programmers will be the same in every detail as the WSDL/SOAP interface. This is true of any complete implementation of the WSDL specification because it is WSDL and XML Schema that define the service's interface, not SOAP.
In the meantime, the bugs really do not matter because Google ships a .NET interface to the service in its toolkit. Google could easily ship an HTTP-based interface. To client code there would be no difference. The same argument applies to the supplied Java binding. It would be relatively simple to ship an HTTP-based binding instead of the existing SOAP one.
