ACM SIGIR 2000 Workshop on XML and Information Retrieval.  July 28, 2000. Athens, Greece.
 

Searching Text-rich XML Documents with Relevance Ranking

Yoshihiko Hayashi, Junji Tomita, and Gen'ichiro Kikui

NTT Cyberspace Laboratories

1-1 Hikari-no-Oka, Yokosuka, 239-0847 Japan


1. Introduction

XML is being considered as an emerging standard markup language with emphasis on its functionality to structure raw data, including natural language texts, with element tags. These tags are introduced to accommodate possible applications to process the encoded information properly. This functionality is tremendously crucial for opening a door to so-called "Semantic Web" [Berners-Lee 1998]. In accordance with this trend, several tools for supporting deployment of XML documents on the Internet and/or corporate intranet have been offered. These tools range from basic document parsers to integrated searching environments as well as XML authoring tools.

XML documents are inherently usable for storing and sharing textual information with structural/semantic annotations provided by element tags. However, most of the currently available search tools do not allow us to retrieve relevant XML documents from a corpus of XML documents with relevance ranking. This may be a serious burden for the deployment of XML documents whose primal contents are textual; such an XML document is called a text-rich XML documents in this paper.

This paper proposes a novel development of an XML document retrieval system which can search relevant XML documents with relevance ranking from a large set of text-rich XML documents. The fundamental approach taken in the development is adopting the framework of conventional text retrieval engines as far as possible. A key to success of this approach is how to cast a type of sub-structures in the target XML documents into a so-called search field. Thus, with this scenario, fielded search plays an essential role.
 

2. Requirements for Text-rich XML Document Retrieval System

Although there are no clear definitions for the text-rich XML documents, they are apparently different from table-like documents (referred as 'data-oriented documents' in [Frankhauser 2000]) often seen in on-line shopping cites. These documents could be generated from a database which is operated behind the information services. Searching this type of XML documents may be relatively easy, because of the existing back end database engine. Contrary to this, what we want with the text-rich XML documents is to pose a query like "find relevant documents discussing trade friction issues especially involving Japan and U.S., preferably written by Dr.Foo." Assuming that the author of an article and the textual content are properly structured, "Dr.Foo" must be appeared in the element marked by a tag like <AUTHOR>. Similarly, terms "trade", "friction", and preferably "Japan", and "U.S" should be appeared in the elements tagged by <SUMMARY> and/or <DESCRIPTION>. We also want the results to be organized as a list of relevant documents sorted by their relevancy to the query. The query engine of an XML document retrieval system thus should be able to handle not only structural constraints coming from the document structure, but also rank the retrieved documents accurately. The ranking must be properly computed based on the similarity between a user's query and the textual contents in an XML document.
 

3. Overall Design

It is technically almost intractable to retrieve arbitrary type of XML documents in response to any form of users' queries. We thus need to restrict types of document sub-structures which can be specified in a query in some way. Given a conventional text retrieval mechanism supporting so-called the fielded search in hand, the most straightforward way to handle structural constraints is mapping a type of sub-structures into a search field; a search field is a text region within a document marked by some indicators. Note that the fielded search is widely accepted and implemented functionality in text retrieval systems.

Due to our design decision, only sub-structures, each of which is specified as a tag-path can be specified as structural constraints in a query. Figure-1 clarifies the notion of tag-path. In the figure, an XML document structure is illustrated as a DOM tree structure. A description of tag-path "//CONTENTS/SECTION/" designates the two sub-trees as shown. Here 'x/y' denotes that y is a child of x, while 'x//y' denotes that x is the only ancestor of y. Thus, the text string within the two sub-trees can be indexed, in spite of the existence of another tags like  <KEYWORD>; they are also inside the specified sub-structures. This means, however, that we cannot distinguish, at the search time, the two text regions whose tag-paths are identical. Although a tag-path can be specified more strictly by using XPath [XPath 1999] compliance syntax, it has not yet been implemented by our design decision. Our intention to this decision is; when we provide an actual search service, it is almost unthinkable that such two text regions must be distinguished.

Figure-1: DOM tree and tag-path.

Figure-2 illustrates possible architecture of a search service with the XML document retrieval system in its central part. Note that the system consists of the Indexer and the Query engine. The search corpus should be a set of XML documents, each of which is compatible with the definitions in the Format file. We can utilize some XML converter implementing XSLT [XSLT 1999] to create such a uniform set of XML document corpus from a variety of XML documents (or other types of documents like HTML pages).
 

Figure-2: A Possible Search Service Architecture.

Based on the discussions above, our design principles for the text-rich XML document retrieval system are summarized as follows.

4. Implementation

4.1 Indexer

Shown in Figure-3 is an example of XML document to be indexed. The associated Format file for this type of XML documents is shown in Figure-4. As seen in Fig-4, the Format file itself is also an XML document.

Figure-3: An Example of XML Document.

The indexer first reads and interprets the Format file. By interpreting the file, it initiates seven sets of indexes, "default", "title", "author", "subject", "jsubject", "keyword", and "topic_area". The set of files created for a search fields, specified as inverted type, include a dictionary file, an inverted file, a posting file and others. The terms contained in each field of a document are weighted by using a variant of well known TF*IDF formula. The data structure adopted for the dictionary file is Patricia tree [Baeza-Yates 1992].

The indexer then reads each XML document in the corpus. To parse an XML documents, the indexer currently utilizes SAX API provided by the popular XML parser XML4C. It always checks the parsed region if it should be indexed. If so, the index terms are extracted by invoking language-dependent term extraction process, and registers them into all the designated indexes.

Figure-4: Format File for the Example XML Document

4.2 Format File

A Format file consists of three parts. The first part is tagged by <INDEX_FILE_DEFINITION>, containing <INDEX_FILE> elements. In this part, each index to be created is defined by giving index name (id), specifying index file style, and index type. The index file style can be either of 'INVERTED' and 'SEQUENTIAL'. We can construct 'SEQUENTIAL' indexes, rather than usual 'INVERTED' indexes, for sub-structures where only a limited number of terms can be appeared. This type of index is useful for filtering/reranking after a primary search, which will be briefly described later. The index type can be one of 'TEXT', 'NUMBER', and 'TAG'. Number range search can be possible for 'NUMBER' type index, whereas tag name search, also described later, is applicable to 'TAG' type index.

The second part, which is most important, is tagged by <FIELD_DEFINITION>, containing <FIELD> elements. Each <FIELD> element defines field name (fid), associations from a tag-path (target_tag), to the indexes (index_ids), to where the index terms should go to. The text regions, for example specified by the tag-path "//CONTENTS/SECTION/KEYWORD", are accumulated from each document to create an index set named "keyword". As shown in the field identified as "subject", the terms within this filed can go to both 'subject' index set and 'default' index set. Attribute values in a tag elements can also be searchable by specifying a tag-path like notation ('/@' is used for attributes instead of '/'). For example with the description "//SUBJECT/@topic_area", values for the "topic_area" attribute in the SUBJECT tag will be associated with the field named "topic_area". Thus, the terms in the region can be indexed in the 'topic_area' index set.

The third part is tagged by <ANALYZER_DEFINITION>, containing <ANALYZER> elements. Each <ANALYZER> elements defines a language-dependend term extraction processor, by specifying the analyzer name (aid), the host name (host), and the port number (port). For example, analyzer identified as 'TokenJ' can be utilized by connecting to port '10021' of the 'localhost'. Analyzers defined in this part can be specified in the analyzer attribute of a <FIELD> elements, in order to designate a suitable language processor for the very field.
 

4.3 Query Engine

The query engine has been implemented by adding slight extensions to the text retrieval engine previously developed by one of the authors [Tomita 1998]. The new query engine inherits a variety of text search facilities from the based system. The major facilities are outlined as follows. These facilities can be combined almost freely as far as obeying the underlying query syntax. As explained in this paper, a type of XML sub-structure is basically mapped into a search field. In addition to the structural search realized by the fielded search, another type of structural search named tag name search was implemented. This enables us to retrieve particular tag rather than textual content in a tagged element.

A relatively simple example of the acceptable query is something like this.

(UNIX~0.8 and title=(network~0.3 or TCP/IP~0.8) and topic_area=(IT or 'computer systems'))

This query roughly means: Find the relevant documents which contain "UNIX" with its term weight 0.8 in any part of the document, and contain "network" or "TCP/IP" in the sub-structure specified by the tag-path, "//TITLE.". Additionally, topic of the documents has to be either of "IT" or
"computer systems".

The query engine parses the input query into a tree structure whose non-leaf nodes are Boolean operators, and evaluate the query by recursively traversing the structure. During the process, the engine calculates a relevance score; if it encounters a 'or' node, the relevance score will be sum of the scores from the children nodes; while if the node is 'and' node, the relevance score will be the minimum of the children nodes' scores.
 

4.4 Other Remarks on the Implementation

Several other querying facilities were implemented for advanced query formulation. These include filtering/reranking after a primal search, query-document similarity calculation based on graph-based text representation. The latter is detailed in [Tomita 2000]. The filtering/reranking queries are briefly described with the examples as follows. In addition to these querying facilities, we provide a simple search/retrieval protocol in order to facilitate external programs to use the query engine more easily. Another thing deserved for mention is that the system employs UTF-8 encoded Unicode as its internal character set. This,  combined with the analyzer designation in the Format file, is quite important for managing multilingual XML documents which thought to be more popular pretty soon.
 

5. Concluding Remarks

This paper proposed an approach to handle text-rich XML documents with slight extensions to a conventional text retrieval engine supporting fielded search . At the time of writing, the implementation has just finished, and the basic evaluation process is ongoing. With a relatively small sized corpus converted from robot-gathered 36,000 HTML web pages (283MB), the time consumed by the XML parser was as short as almost 10 minutes (0.01sec/document). The resulted index size was 49.8MB, which is far small compared to that of the original corpus. These results insist that our approach is promising in building a practical search service for a substantially large set of XML document corpus. We will evaluate the system with more larger corpus, and figure out the points to improve the system performance.

The system enjoys several fruitful text search facilities from the based text retrieval engine as described. This way of development, however, forces us to explicitly associate a tag-path to a search field by giving explicit definition in the Format file. This also means that any external application program accessing to the query engine has to know relevant and available field as well as the associated sub-structures in the XML documents. Another problem with the current implementation is inability to directly constructing new XML data as the search results (cf. [Deutsh 1998]). Serious consideration must be given to ease these limitations, because these problems are essentially associated with the issue of functional interface between the external/upper-level program which utilizes the query engine as a building block. We will be seeking a better interface through a series of implementation experiences of application systems.


References


Short Vita of the Authors

Yoshihiko Hayashi is a senior research engineer, supervisor at NTT Cyberspace Laboratories. His current interests include applied natural language processing, intelligent information access, and cross-language text retrieval. He received his Dr.Eng. degree from Waseda University in 1996. He was a visiting researcher at CSLI, Stanford University from 1994 to 1995.

Junji Tomita is a research engineer at NTT Cyberspace Laboratories. His current interests include intelligent information access, XML, and information retrieval. He received his master degree from Keio University in 1997.

Gen'ichiro Kikui is a senior research engineer at NTT Cyberspace Laboratories. He is interested in multi-lingual lexicography and applied natural language processing including machine translation, and cross-language information retrieval. He joined ATR Interpreting Telecommunication Research Labs. from 1990 to 1994, and was a visiting researcher at CSLI, Stanford University from 1997 to 1998.



authored by Yoshihiko Hayashi