Norbert Fuhr
Dortmund University, Germany
-
Kai Großjohann
Dortmund University, Germany
Today, XML is used in three different ways. First, XML is used as a markup language, where documents are considered to be trees (with the occasional hyper link added) which represent the document structure. Secondly, XML is used as an interchange format for structured data. Here, a document is considered as a set of fields, each of which has a specific data type. The third aspect is to use XML to represent text, where a document consists of words which can be stemmed and phrases and the like.
Obviously, a single XML document may represent more than one of these aspects, so for querying XML we need a query language which takes into account all these aspects.
XQL [Robie et al. 98] is a query language for XML documents which is a natural extension of the W3C standard XPath [Clark & DeRose 99]. Therefore, XQL is a promising start for designing an XML query language for Information Retrieval systems. However, the following features are desirable for IR systems, yet not available with XQL:
XQL provides for expressive searching in the structure of XML documents. For example, //author searches for all author elements in the complete document, whereas /author is restricted to the root element. Both the // and / operators have binary versions, as well, indicating ancestor/descendant and parent/child relationship, respectively. For example, //author/last searches for all last (name) elements which are children of author elements, whereas //chapter//heading searches for chapter headings as well as section headings.
Matching on attribute, rather than element, names is possible with the @ prefix: //@author searches for all author attributes.
It is possible to combine this tree search with conditions on the element and attribute contents. //@date = "2000-05-21" returns all date attributes which describe the given date.
It is possible to restrict the results of a query using the filter operator [...]. For instance, //chapter[heading] searches for all chapter elements which have headings. //chapter[heading = "XML"] returns all chapter elements titled XML. Finally, //chapter[.//heading = "XML"] returns all chapter elements where the heading or any subheading is `XML'.
As a last example, //book[@author = "John Smith" $and$ chapter/heading = "XML"] searches for books authored by John Smith which have a chapter about XML.
XQL provides the data types number and date in addition to string, and there are $lt$ (less than) and $gt$ (greater than) operators in addition to the = operator.
Whereas the result of an XQL query is a set of nodes, the result of a XIRQL query is a weighted set of nodes; we use a probabilistic interpretation of the weights. The XQL operators can be interpreted as XIRQL operators by using only zero and one as weight. A special operator $c$ (contains) searches for terms, with stemming. For instance, //title $c$ "document" searches for all titles which contain the word document (or documentation). Searching for several words is done using a scalar product function ($wsum$ operator); //title[. $c$ "information" $wsum$ . $c$ "retrieval"] searches for titles about information retrieval. (An alternate syntax //title[wsum(0.3, (. $c$ "information"), 0.7, (. $c$ "retrieval"))] can be used to explicitly specify the query term weights.) Since the query terms are considered to be disjoint events, the weights should add up to one.
For the $and$ and $or$ operators, the events are considered to be stochastically independent. The system uses intensional semantics, but may also provide extensional semantics for efficiency.
The semantics of XIRQL can be conveniently specified by transforming to pDatalog [Fuhr 00], a variant of Datalog where facts as well as rules can have attached probabilities. pDatalog supports intensional semantics and independent as well as disjoint events. Each document is transformed into a set of EDB predicates, and a XIRQL query can be transformed into IDB predicates and a pDatalog query.
Additional data types and vague predicates can easily be supported in XIRQL by adding more operator symbols to XQL. But whereas the XQL draft suggests to infer the type from the arguments of a predicate (so that x="2" is a string comparison, whereas x=2 is a numeric comparison), and to require the user to possibly cast one of the arguments into a specific type (using date("2000-05-28") to convert a string into a date, for example), we will adopt a different approach. For each data type, there is a special operator symbol. This enables users to use string equality even on date values, if desired. It also obviates the need for type inference, which would open a can of worms.
One operator, $c$, has already been presented which implements a `contains' operation for natural language text. As another example, for person names, a useful operator would be $sounds_like$.
For indexing, the database administrator needs to specify which parts of the documents have what data type. Also, for some data types, the format that's used in the document has to be specified. For example, for the XML fragment <date year="2000" month="05" day="28"/> the administrator would say that the date element is of type date, and that the year, month and day are each stored in one XML attribute. The specification should be done by augmenting the DTD; it is clearly not practical to specify this separately for each document.
As mentioned in the introduction, the system should be able to automatically determine the right result context for some queries. For example, users might wish to search for several terms, without knowing in advance whether there is a section which contains them all or not. For some documents, returning a chapter might be appropriate, whereas for other documents, returning sections might be better. The system should return the smallest context which answers the query. Automatically determining the right result context consists of two steps: first, defining possible result contexts, and second, finding the right result contexts for a given query.
It does not work to allow any XML element as a possible result context, since individual elements are too fine-grained. For example, it does not make sense to return a bold element which just contains a few words. Users should not need to know which XML elements are suitable result contexts, therefore, in our system the database administrator defines possible result contexts by specifying root nodes for each context.1 A context then consists of the document subtree rooted at the specified node so that some contexts subsume other contexts. It is convenient to use XPath or XQL to specify which nodes are root nodes of a context. See figure 1 for an example XML tree with contexts indicated.
For finding the right result context(s) for a query, there are two possibilities. One alternative is to return the most specific context which answers a query. However, this does not work for complex queries; consider a query searching for two terms, and one term occurs in one section and another term occurs in another section of the same chapter. Obviously, the chapter will not be found if the two component queries just return the respective section context.
Therefore, the second alternative needs to be chosen where a query returns all contexts satisfying the query. In this case, the more specific a context, the higher the weight should be. Else the system would tend to favor large contexts, which is not what we intended.
Specifying that a query always returns the enclosing contexts in addition to the most specific context, however, does not blend well with the semantics of normal XQL. Here, the normal case is that a query returns exactly the XML elements specified and not also their ancestors.
We introduce a special $c$ (contains) operator which includes the necessary down-weighting. If a user searches for information about Information Retrieval (IR) and hypertext (HT), this could be formulated as follows:
//#inode[(. $c$ "IR") $and$ (. $c$ "HT")]
Here, #inode matches context root nodes.
For implementing this predicate, each word could be indexed multiple times, once for the most specific context containing the word, and additional times for the ancestor contexts. This implementation would make the index very large, however. Therefore, index nodes are used for the implementation. For each context, there is an index node which is equal to that context minus all contained contexts. See figure 2 for an illustration of a document with its index nodes.
This way, retrieving the most specific context for each query is easy (by just retrieving the index node), and the information is propagated from the most specific context to the more general contexts using a down-weighting heuristic in the style of `multiply weight by factor 0.7 when crossing a context boundary'. It remains to be seen whether it is necessary to choose different down-weighting factors for different contexts, or whether one factor results in sufficient retrieval quality.
The same information can be represented in a document in multiple
ways. For example, the same information could be encoded in an XML
element in one set of documents but in an XML attribute in another set
of documents. As another example, a date value can be encoded as one
string in an element or attribute (where various formats are
possible), or the year, month and day can be in three separate
elements or attributes. In XIRQL, author searches an element,
@author searches an attribute, and ~author is used for
abstracting from this distinction. To even further abstract from the
concrete XML syntax, it is possible to search for certain data types,
regardless of their position in the XML document tree. For example
#pname searches for all personal names.
For complex data types, the XML representation is transformed into an internal representation for the system, so the user can just use the internal representation for querying such values. For example, the query would contain 2000-05-28, regardless of the date representation used in the documents.
For retrieval in XML documents, a query language should support the following features: retrieval with respect to the tree structure, weighting, data types and vague predicates, relevance-oriented search, and semantic relativism. XQL is a query language which supports retrieval with respect to the tree structure but is lacking with respect to the other criteria. This paper describes XIRQL, an extension of XQL which provides the missing features. While not suitable for end users, it provides an intermediate language that can be used by application developers, similar to SQL for relational databases.
This document was generated using the LaTeX2HTML translator Version 99.2beta6 (1.42)
Copyright © 1993, 1994, 1995, 1996,
Nikos Drakos,
Computer Based Learning Unit, University of Leeds.
Copyright © 1997, 1998, 1999,
Ross Moore,
Mathematics Department, Macquarie University, Sydney.
The command line arguments were:
latex2html -split 0 -show_section_numbers -no_navigation -html_version 4.0 sigir00.tex
The translation was initiated by Kai Grossjohann on 2000-07-07