We suggest several ways information retrieval systems can exploit semantically structured XML documents to improve precision and recall. Most of these techniques require the addition of structure to the user's query. To keep the system usable by end users who are not XML schema experts or able to construct database queries, we propose a user interface that allows the user to begin with an unstructured keyword query and, with guidance by the system, apply structure in a dialog based on iterative refinement. Finally, we introduce an extended form of the inverted index used to implement this interface.
The adoption of XML as a standard for the publication and interchange of structured data creates a great opportunity for better information retrieval. XML has the ability to represent the semantics of data in a structured, documented, machine-readable form. Organizations are emerging to define standard schemas to capture the semantics of many domains, and content providers are beginning to publish information using XML and standard schemas.
We have identified several ways information retrieval systems can use semantically structured XML (XML whose structure represents in some fashion the semantics of the data) to improve precision and recall:
Documents can be categorized by schema. The schema of an XML document identifies its structure and its purpose. Results can be made more precise by limiting the search scope to only those documents matching particular schema or schemas of interest.
For example, a user looking to buy a used car might restrict her query to schemas used to describe cars for sale, while avoiding schemas used for car reviews, maintenance questions, or other irrelevant documents that would normally clutter her search results.
Ambiguous words can be distinguished by the context they appear in. Words in human languages often have multiple senses or referents; precision suffers because information retrieval systems are unable to distinguish which sense was used in queries or in corpus documents. The XML context (surrounding tags) can provide clues to help disambiguate meanings and potentially improve search precision.
For example, a user browsing academic publications might use the search term "brown". She could be looking for papers written by Donald Brown, papers published at Brown University, or papers about the brown bear. It is quite unlikely that she's looking for all three. If she can identify whether she wants to find <author>Brown</author>, <university>Brown</university>, or <subject>brown</subject>, her search could be considerably more precise.
Queries can use rich data types, including numeric attributes, geographic location, temporal values and other quantities whose semantics are difficult to capture with keyword search. Because XML separates these values from other content, information retrieval systems can identify and process them appropriately; the data types used can be stated explicitly in XML schemas (declared data types) or inferred heuristically by the information retrieval system. Either way, successfully processing queries that refer to non-textual data types can improve the system's recall.
For example, a user looking for inexpensive weekend entertainment might search for "concerts this weekend under $50 and within 50 miles of Seattle". Because event listings in XML contain location, price, and time in a consistently labeled format, a search engine could identify these very common data types and offer listings meeting the user's requirements.
Structural proximity can be used instead of physical proximity to rank results. Information recall systems frequently use the proximity of the user's search terms in a document as a ranking method, either implicitly or explicitly (via quoted phrases and the NEAR operator). XML's structure offers more sophisticated proximity measures. The distance between the last word in one element and the first word in the next element is greater than the distance between adjacent words in the same element, even though their physical proximity in the document is similar. Adjusting the ranking function (implicitly or explicitly) in this manner can improve precision and recall.
For example, consider a user searching for video drivers for a Matrox video card on the Linux operating system; she might search for "matrox linux driver". However, many documents will (sensibly enough) refer to several different drivers. Without structure, the user's query might hit a page that describes a Microsoft Windows driver for a Matrox card coincidentally next to the description of a Linux driver for some other card. If these drivers are instead recorded in structured XML, the Windows driver's description would be in one <driver> element and the Linux driver in a different <driver> element. A well-tuned information retrieval system would rank that document below a document where all of the user's search terms occur in the same outer <driver> element.
Portions of documents can be returned. Documents can often be quite long, and in many cases only a small part of the document may be relevant to the user's query. By making document structure explicit, XML allows information retrieval systems to extract portions of documents. (It is important to allow the user to navigate to the entire containing document, if desired.) This can improve perceived precision by eliminating the user's need to scan through each result document to look for relevant material.
For example, consider again the user searching for video drivers. One document might contain hundreds of driver listings, but by extracting just the <driver> elements matching the user's query, the information retrieval system can answer the user's question far more directly.
To take advantage of semantically structured XML documents, the query must have semantic structure as well. Schema restrictions, term context limitations, data type information, and other elements of structured queries can be applied in three ways:
Proactively structured queries are fully structured when first submitted by the user. Database systems almost always require proactive structure. Information retrieval systems sometimes allow proactive structure (via field specifiers on search terms), but rarely if ever require proactive structure.
Reactively structured queries start out as unstructured search queries, but are annotated with structure by the user in a refinement process. For example, a user who filters a result set by document schema is reactively applying a schema constraint.
Automatically structured queries also start out as unstructured search queries, but have structure applied to them by an automatic process. Structure can be applied by parsing the user's natural language in the context of a knowledge representation system, or merely by spotting certain keywords.
Users who are not domain experts find proactively structured systems difficult. As the diversity of a corpus increases, it becomes less likely that search users will know the structure of the data they wish to locate. In order to support users who are not domain experts, users who are not proficient with formal query languages, and users who wish to search large, diverse corpora, our research centers on systems that do not require proactively structured queries.
To help users create structured queries, we propose a user interface for semistructured information retrieval that uses reactive and automatic techniques to add structure to initially unstructured queries. These techniques distinguish this work from database query systems that presuppose proactively structured queries expressed in a formal query language (e.g. SQL or XML-QL).
To simplify the process of adding structure to queries, we make several assumptions about the form of the final query. We sacrifice the expressiveness of our "query language" in exchange for simplicity.
A query is restricted to exactly one XML schema. (The corpus can contain documents conforming to many different schemas, and different queries can search different schemas, but the results from any one query will conform to a single schema.) "Schemas" are identified by the root element of the XML document. Two documents belong to the same schema if (and only if) the name and namespace of their root elements are the same.
Each term in the query is restricted to searching content contained in the context of a specified XML path. (The "path" of an XML element or attribute is the ordered sequence of ancestor elements containing the element or attribute.) For example, the search term "Brown" might be restricted to the path "Dealer/Car/Color" (meaning it must occur inside a <Model> element contained in a <Car> element which is itself contained within a <Dealer> element), even though the word "brown" might also appear within "Dealer/Name" (Bob Brown's Big Bumpers).
Queries are always fully conjunctive; that is, all terms in the query must be present (in their allotted paths) in the same document. (Put another way, everything is AND; there is no OR or NOT.) Furthermore, to the degree allowed by their paths, search terms must occur in the same element. For example, if the query contains the search term "Nissan" in "Dealer/Car/Model" as well as the search term "red" in "Dealer/Car/Color", both terms must occur in the same <Dealer> and <Car> elements for a document to match.
While none of these assumptions are valid for all queries and all documents, we believe that despite the restrictions the system still represents a significant improvement over unstructured information retrieval. The performance of our system and experience with user testing will help us evaluate the validity of these assumptions and inform future work.
Keeping these assumptions in mind, we implemented a four-step search interface:
The user submits an unstructured full-text query. (This step is shared by
most information retrieval systems.)
The corpus is searched for documents that contain all of the terms in the user's query. The set of unique schemas represented by these matching documents is generated. The user is asked to choose from this set the schema most relevant to their query. (If there is only one possible schema, this step is skipped.)
If metadata is available, schemas may be labeled with "friendly" names
(such as "Used Cars for Sale" or "News Headlines"). Otherwise, schemas are
identified by the name of their root element.
The user is given a "search form" specific to the selected schema. This form contains one entry blank for each unique XML path that occurs in documents in the corpus that belong to the selected schema. Visual cues inform the user which paths are known to contain the user's search terms.
Depending on the data types (text, numbers, etc.) present in the corpus for
each path, the input blanks may be specialized. Specific metadata may also be
authored to control the layout and presentation of the form.
The query, now fully structured, is executed, and results are returned to the user.
Metadata, if available, controls the formatting of results. Otherwise,
generic XML presentation is used.
As you can see, this user interface is fully reactive (and occasionally automatic); it presupposes no domain knowledge or schema expertise on the part of the user. At the same time, within the parameters of the limited query capability, it is able to take advantage of semantically structured XML to improve precision and recall.
Index structures for proactive database query of structured or semistructured data are well understood. Similarly, index structures for unstructured full-text information retrieval are also well understood. However, the need to support reactive and automatic structured query imposes some unique requirements on the index used for our implementation. Our index design is therefore relatively unique.
Specifically, we identified these requirements for our index:
Given unstructured search terms (including keywords as well as numeric ranges and other data types), it must be possible to enumerate the unique XML path specifications that describe the locations where each search term occurs in the corpus.
Given search terms which have been placed in context (associated with specific XML path specifications) and related to each other by proximity requirements, it must be possible to retrieve XML documents, or sections of XML documents, where all terms occur in their appropriate contexts.
The search engine requires access to schema information and content statistics. Specifically, given an XML schema (root element), the index must be able to furnish a list of all the unique XML path specifications that have occurred in documents that conform to the schema. Furthermore, for each such specification, it is useful to know statistics about content which occurs in the context of that path, including the average content length, whether content is primarily text or numeric, the content and frequency of any particularly common values, etc.
Our implementation effort focused on developing an index that satisfies these requirements efficiently.
The index structure we use is a modification of the classic "inverted index" used by information retrieval systems. The simplest form of the standard inverted index is a relation between keywords and documents: found_in(keyword,document). A common extension includes the location of the keyword within the document, to aid phrase and proximity matching: found_in(keyword,document,offset).
Our implementation modifies the inverted index to first associate keywords and XML path specifications where those keywords appear; (keyword,path) tuples are then associated with documents and locations: found_in(keyword,path,pathwordid) and occurs_in(pathwordid,document,address). Additionally, rather than merely storing the offset from the beginning of the document, this index records the XML address (a sequence of numbers representing which outer elements contain this content) of each word. For example, consider the following pair of XML files describing used car dealerships:
billiebrown.xml
<Dealer>
<Name>Billie Brown Chevy</Name>
<Car>
<Year>1999</Year>
<Model>Chevrolet Caprice</Model>
<Color>white</Color>
<Price>$7500</Price
</Car>
</Dealer>
joebob.xml
<Dealer>
<Name>Joe Bob Ford</Name>
<Car>
<Year>1992</Year>
<Model>Ford Mustang</Model>
<Color>white</Color>
<Price>$7500</Price
</Car>
<Car>
<Year>1972</Year>
<Model>Dodge Dart</Model>
<Color>brown</Color>
<Price>$1999</Price
</Car>
</Dealer>
These files result in the following index entries:
found_in(1972,'Dealer/Car/Year',0)
found_in(1992,'Dealer/Car/Year',1)
found_in(1999,'Dealer/Car/Price',2)
found_in(1999,'Dealer/Car/Year',3)
found_in(7500,'Dealer/Car/Price',4)
found_in('billie','Dealer/Name',5)
found_in('bob','Dealer/Name',6)
found_in('brown','Dealer/Car/Color',7)
found_in('brown','Dealer/Name',8)
found_in('caprice','Dealer/Car/Model',9)
found_in('chevrolet','Dealer/Car/Model',10)
found_in('chevy','Dealer/Name',11)
found_in('dart','Dealer/Car/Model',12)
found_in('dodge','Dealer/Car/Model',13)
found_in('ford','Dealer/Car/Model',14)
found_in('ford','Dealer/Name',15)
found_in('joe','Dealer/Name',16)
found_in('white','Dealer/Car/Color',17)
occurs_in(0,'joebob.xml',[3,1])
occurs_in(1,'joebob.xml',[2,1])
occurs_in(2,'joebob.xml',[3,4])
occurs_in(3,'billiebrown.xml',[2,1])
occurs_in(4,'billiebrown.xml',[2,4])
occurs_in(4,'joebob.xml',[2,4])
occurs_in(5,'billiebrown.xml',[1])
occurs_in(6,'joebob.xml',[1])
occurs_in(7,'joebob.xml',[3,3])
occurs_in(8,'billebrown.xml',[1])
occurs_in(9,'billebrown.xml',[2,2])
occurs_in(10,'billebrown.xml',[2,2])
occurs_in(11,'billebrown.xml',[1])
occurs_in(12,'joebob.xml',[3,2])
occurs_in(13,'joebob.xml',[3,2])
occurs_in(14,'joebob.xml',[2,2])
occurs_in(15,'joebob.xml',[1])
occurs_in(16,'joebob.xml',[1])
occurs_in(17,'billiebrown.xml',[2,3])
occurs_in(17,'joebob.xml',[2,3])
If a user enters an unstructured query for "brown dart", the query processor can use the found_in relation to determine that "brown" could refer to either of "Dealer/Car/Color" or "Dealer/Name", but "dart" must refer to "Dealer/Car/Model". Once the correct path is associated (reactively or automatically) with "brown", the occurs_in relation supplies location information.
For example, let us suppose (as is likely) that the user meant to search for brown-colored cars, so "brown" is associated with "Dealer/Car/Color". This pair has pathwordid #7. The only pathwordid available for "dart" is #12. Fortunately, occurs_in associates both with the document 'joebob.xml' and the third child of the root element. This document -- or just the relevant element, i.e. the matching car listing -- is returned to the user.
These relations are not sufficient to complete the search experience. Additional relations associate all known XML paths with content statistics, and associate XML addresses with the content contained at those addresses (to allow the search engine to regenerate the original XML content for formatted output).
An additional set of extensions to the index structure allows queries to specify numeric ranges. Information retrieval systems normally treat numbers as keywords; a user searching for "1999" can find the exact value "1999", but it is not possible to search for "between 1000 and 2200".
To enable numeric queries, when the implementation encounters a number it stores it as several pseudo-keywords representing different numeric ranges which contain the target number. For example, "1999" might be indexed under the pseudo-keywords "1000-2000", "1900-2000", and "1990-2000", as well as the actual value "1999".
When a user searches for a range, the pseudo-keywords necessary to exactly cover the range are identified. Query results from each of these pseudo-keywords are combined disjunctively. For example, a user specifying a numeric range of 1000-2200 might cause a query for the pseudo-keywords "1000-2000", "2000-2100", and "2100-2200".
An open question is the correct set of pseudo-keywords to use. If the ranges used are too sparse, queries will have to combine results from many pseudo-keywords, resulting in slow execution times. If the ranges are too dense, the index will contain lots of repeated information, resulting in a large index size. The optimal boundaries for pseudo-keywords are not well defined; for example, using powers of 10 may align well with common user queries and thereby improve system performance. We are currently conducting research in this area to improve the efficiency of our implementation.
The current implementation stores relations with the widely used Berkeley DB index package [5]. Element names, path specifications, and content text are normalized to save space. Paths and addresses are stored in sorted order to facilitate efficient merging of search results.
Semistructured data retrieval in general, and XML search in particular, is an active field of research and development; indeed, the potential for structured search is often cited as a motivation for the development of XML.
The semistructured database community has understandably embraced XML as an interchange format for semistructured data [6]; many semistructured database systems have been recast as "XML databases". Examples of semistructured databases include research systems like XSet [7] and Niagara [4] as well as commercial products like XDex and Tamino. These systems generally accept some formally specified query language (such as XQL, XPath, or XML-QL) and return precise results. The structure of the query must be proactively specified; the person or software submitting the query can only take advantage of XML schemas they already know. Our work focuses instead on end users who are not schema experts.
Lore is such a semistructured database, and normally accepts the proactively structured query language Lorel [1]. However, researchers have used Lore's "data guides" and search capability to experiment with reactively structured information retrieval. [3]
GoXML [8] is a commercially deployed "context-based XML search engine" and one of the few search systems using a reactive approach to structured query. However, GoXML does not allow different search terms in the same query to refer to content in different elements, and distinguishes content only by its containing element; furthermore, GoXML has no capacity for searching non-text data, and returns only links to XML files. Our work allows the association of an independent XML path with each search term, can process numeric range queries, and returns excerpted and formatted content.
Xyleme is a promising new research action (associated with the Verso project at INRIA). Our work is similar in intent to Xyleme's semantic data integration research [2], but we take a different approach. Rather than focusing on formal knowledge representation and extensive metadata to annotate the user's query with structure a priori (fully automatic structural reformulation), we use an inverted index to determine candidate structures for the user's query, and employ a combination of heuristics and user interaction to add structural annotations (reactive structural reformulation).
We have presented three main ideas:
Techniques for exploiting semantically structured XML to increase information retrieval precision and recall.
A user interface model for information retrieval over structured data that does not require proactive identification of document structure, and which allows end users to engage the system in a dialogue to help create structured queries.
Extensions to the classic "inverted index" to support structured information retrieval.
This is very much a work in progress, both as research and as commercial development. In the future, we hope to explore automatic schema and context selection as an adjunct or replacement for reactive choice by the user, support for additional data types, and performance improvements. Above all, we plan to measure and objectively evaluate our progress against existing information retrieval technology.
Meanwhile, interested parties may experiment with a demo of this technology and download beta versions of a product using these techniques from our corporate Web site at http://www.xyzfind.com/. Please check this site for updated information about product features and ongoing research.
S. Abiteboul, D. Quass, J. McHugh, J. Widorn, and J. Wiener. The Lorel Query Language for Semistructured Data, Journal of Digital Libraries, 1(1), November 1996.
M. Rousset, Semantic Data Integration in Xyleme, Presentation at INRIA, September 1999.
R. Goldman and J. Widom, Interactive Query and Search in Semistructured Databases, Technical Report, Stanford University, 1998.
J. Naughton, D. DeWitt, D. Maier et al., The Niagara Internet Query System.
M. Olson, K. Bostic, and M. Seltzer, Berkeley DB, Proceedings of the 1999 Summer Usenix Technical Conference, Monterey, California, June 1999.
D. Suciu, Semistructured Data and XML, Proceedings of International Conference on Foundations of Data Organization, Kobe, Japan, November 1998.
B. Y. Zhao and A. Joseph, XSet: A Lightweight XML Search Engine for Internet Applications.
GoXML.com - Search and Index XML Documents. http://www.goxml.com/