ACM SIGIR 2000
Workshop on XML and Information Retrieval. July 28, 2000. Athens,
Greece.
Searching Text-rich XML Documents with Relevance Ranking
NTT Cyberspace Laboratories
1-1 Hikari-no-Oka, Yokosuka,
239-0847 Japan
1. Introduction
XML is being considered as an emerging standard markup language with emphasis
on its functionality to structure raw data, including natural language
texts, with element tags. These tags are introduced to accommodate possible
applications to process the encoded information properly. This functionality
is tremendously crucial for opening a door to so-called "Semantic Web"
[Berners-Lee 1998]. In accordance with this trend, several tools for supporting
deployment of XML documents on the Internet and/or corporate intranet have
been offered. These tools range from basic document parsers to integrated
searching environments as well as XML authoring tools.
XML documents are inherently usable for storing and sharing textual
information with structural/semantic annotations provided by element tags.
However, most of the currently available search tools do not allow us to
retrieve relevant XML documents from a corpus of XML documents with relevance
ranking. This may be a serious burden for the deployment of XML documents
whose primal contents are textual; such an XML document is called a text-rich
XML documents in this paper.
This paper proposes a novel development of an XML document retrieval
system which can search relevant XML documents with relevance ranking from
a large set of text-rich XML documents. The fundamental approach taken
in the development is adopting the framework of conventional text retrieval
engines as far as possible. A key to success of this approach is how to
cast a type of sub-structures in the target XML documents into a so-called
search field. Thus, with this scenario, fielded search plays
an essential role.
2. Requirements for Text-rich XML Document Retrieval System
Although there are no clear definitions for the text-rich XML documents,
they are apparently different from table-like documents (referred as 'data-oriented
documents' in [Frankhauser 2000]) often seen in on-line shopping cites.
These documents could be generated from a database which is operated behind
the information services. Searching this type of XML documents may be relatively
easy, because of the existing back end database engine. Contrary to this,
what we want with the text-rich XML documents is to pose a query like "find
relevant documents discussing trade friction issues especially involving
Japan and U.S., preferably written by Dr.Foo." Assuming that the author
of an article and the textual content are properly structured, "Dr.Foo"
must be appeared in the element marked by a tag like <AUTHOR>. Similarly,
terms "trade", "friction", and preferably "Japan", and "U.S" should be
appeared in the elements tagged by <SUMMARY> and/or <DESCRIPTION>.
We also want the results to be organized as a list of relevant documents
sorted by their relevancy to the query. The query engine of an XML document
retrieval system thus should be able to handle not only structural constraints
coming from the document structure, but also rank the retrieved documents
accurately. The ranking must be properly computed based on the similarity
between a user's query and the textual contents in an XML document.
3. Overall Design
It is technically almost intractable to retrieve arbitrary type of XML
documents in response to any form of users' queries. We thus need to restrict
types of document sub-structures which can be specified in a query in some
way. Given a conventional text retrieval mechanism supporting so-called
the fielded search in hand, the most straightforward way to handle structural
constraints is mapping a type of sub-structures into a search field; a
search field is a text region within a document marked by some indicators.
Note that the fielded search is widely accepted and implemented functionality
in text retrieval systems.
Due to our design decision, only sub-structures, each of which is specified
as a tag-path can be specified as structural constraints in a query.
Figure-1
clarifies the notion of tag-path. In the figure, an XML document structure
is illustrated as a DOM tree structure. A description of tag-path "//CONTENTS/SECTION/"
designates the two sub-trees as shown. Here 'x/y' denotes that y is a child
of x, while 'x//y' denotes that x is the only ancestor of y. Thus, the
text string within the two sub-trees can be indexed, in spite of the existence
of another tags like <KEYWORD>; they are also inside the specified
sub-structures. This means, however, that we cannot distinguish, at the
search time, the two text regions whose tag-paths are identical. Although
a tag-path can be specified more strictly by using XPath [XPath 1999] compliance
syntax, it has not yet been implemented by our design decision. Our intention
to this decision is; when we provide an actual search service, it is almost
unthinkable that such two text regions must be distinguished.
Figure-1: DOM tree and tag-path.
Figure-2 illustrates possible architecture of a search service
with the XML document retrieval system in its central part. Note that the
system consists of the Indexer and the Query engine. The
search corpus should be a set of XML documents, each of which is compatible
with the definitions in the Format file. We can utilize some XML
converter implementing XSLT [XSLT 1999] to create such a uniform set of
XML document corpus from a variety of XML documents (or other types of
documents like HTML pages).
Figure-2: A Possible Search Service Architecture.
Based on the discussions above, our design principles for the text-rich
XML document retrieval system are summarized as follows.
-
The system indexes and searches well-formed XML documents, not necessarily
be verified ones. This means that the system ignores DTD, even if it is
provided.
-
Finite number of sub-structure types, which we like to make searchable,
must be identified well in advance to the indexing time. In addition and
more importantly, the associations from tag-paths to search fields must
be defined in a setting file called Format file.
-
The indexer first interprets the Format file, then creates a set of necessary
files, such as an inverted file, for each defined field.
-
The application service must formulate a query, acceptable by the query
engine, from the user's information request. To do so, it should know relevant
and available search field names and their semantic associations to the
sub-structures coded in the XML documents.
4. Implementation
4.1 Indexer
Shown in Figure-3 is an example of XML document to be indexed. The
associated Format file for this type of XML documents is shown in Figure-4.
As seen in Fig-4, the Format file itself is also an XML document.
Figure-3: An Example of XML Document.
The indexer first reads and interprets the Format file. By interpreting
the file, it initiates seven sets of indexes, "default", "title", "author",
"subject", "jsubject", "keyword", and "topic_area". The set of files created
for a search fields, specified as inverted type, include a dictionary file,
an inverted file, a posting file and others. The terms contained in each
field of a document are weighted by using a variant of well known TF*IDF
formula. The data structure adopted for the dictionary file is Patricia
tree [Baeza-Yates 1992].
The indexer then reads each XML document in the corpus. To parse an
XML documents, the indexer currently utilizes SAX API provided by the popular
XML parser
XML4C.
It always checks the parsed region if it should be indexed. If so, the
index terms are extracted by invoking language-dependent term extraction
process, and registers them into all the designated indexes.
Figure-4: Format File for the Example XML Document
4.2 Format File
A Format file consists of three parts. The first part is tagged by <INDEX_FILE_DEFINITION>,
containing <INDEX_FILE> elements. In this part, each index to be created
is defined by giving index name (id), specifying index file style,
and index type. The index file style can be either of 'INVERTED'
and 'SEQUENTIAL'. We can construct 'SEQUENTIAL' indexes, rather than usual
'INVERTED' indexes, for sub-structures where only a limited number of terms
can be appeared. This type of index is useful for filtering/reranking after
a primary search, which will be briefly described later. The index type
can be one of 'TEXT', 'NUMBER', and 'TAG'. Number range search can be possible
for 'NUMBER' type index, whereas tag name search, also described later,
is applicable to 'TAG' type index.
The second part, which is most important, is tagged by <FIELD_DEFINITION>,
containing <FIELD> elements. Each <FIELD> element defines field name
(fid), associations from a tag-path (target_tag), to the
indexes (index_ids), to where the index terms should go to. The
text regions, for example specified by the tag-path "//CONTENTS/SECTION/KEYWORD",
are accumulated from each document to create an index set named "keyword".
As shown in the field identified as "subject", the terms within this filed
can go to both 'subject' index set and 'default' index set. Attribute values
in a tag elements can also be searchable by specifying a tag-path like
notation ('/@' is used for attributes instead of '/'). For example with
the description "//SUBJECT/@topic_area", values for the "topic_area" attribute
in the SUBJECT tag will be associated with the field named "topic_area".
Thus, the terms in the region can be indexed in the 'topic_area' index
set.
The third part is tagged by <ANALYZER_DEFINITION>, containing <ANALYZER>
elements. Each <ANALYZER> elements defines a language-dependend term
extraction processor, by specifying the analyzer name (aid), the
host name (host), and the port number (port). For example,
analyzer identified as 'TokenJ' can be utilized by connecting to port '10021'
of the 'localhost'. Analyzers defined in this part can be specified in
the analyzer attribute of a <FIELD> elements, in order to designate
a suitable language processor for the very field.
4.3 Query Engine
The query engine has been implemented by adding slight extensions to the
text retrieval engine previously developed by one of the authors [Tomita
1998]. The new query engine inherits a variety of text search facilities
from the based system. The major facilities are outlined as follows.
-
Boolean operators (or/and/not) are usable.
-
Weights can be assigned to the query terms: This enables us to formulate
weighted Boolean queries.
-
Partial matching with wild character is possible.
-
Proximity search with/without order is supported.
-
Literal (phrase) matching can be specified.
-
Number range search is possible.
-
Fielded search is supported: As explained, a type of XML sub-structure
is basically mapped into a search field.
These facilities can be combined almost freely as far as obeying the underlying
query syntax. As explained in this paper, a type of XML sub-structure is
basically mapped into a search field. In addition to the structural search
realized by the fielded search, another type of structural search named
tag name search was implemented. This enables us to retrieve particular
tag rather than textual content in a tagged element.
A relatively simple example of the acceptable query is something like
this.
(UNIX~0.8 and title=(network~0.3 or TCP/IP~0.8) and topic_area=(IT
or 'computer systems'))
This query roughly means: Find the relevant documents which contain
"UNIX" with its term weight 0.8 in any part of the document, and contain
"network" or "TCP/IP" in the sub-structure specified by the tag-path, "//TITLE.".
Additionally, topic of the documents has to be either of "IT" or
"computer systems".
The query engine parses the input query into a tree structure whose
non-leaf nodes are Boolean operators, and evaluate the query by recursively
traversing the structure. During the process, the engine calculates a relevance
score; if it encounters a 'or' node, the relevance score will be sum of
the scores from the children nodes; while if the node is 'and' node, the
relevance score will be the minimum of the children nodes' scores.
4.4 Other Remarks on the Implementation
Several other querying facilities were implemented for advanced query formulation.
These include filtering/reranking after a primal search, query-document
similarity calculation based on graph-based text representation. The latter
is detailed in [Tomita 2000]. The filtering/reranking queries are briefly
described with the examples as follows.
-
Filtering query: (UNIX and TCP/IP filter=(topic_area=IT))
This query is logically equivalent to (UNIX and TCP/IP and (topic_area=IT)).
The filtering query however is far efficient than the equivalent one, when
the style of 'topic_area' index is 'SEQUENTIAL'. That is, after the
primary search involving 'UNIX' and 'TCP/IP', the additional condition
is checked against the limited number of documents.
-
Reranking query: (UNIX and TCP/IP rerank=(topic_area=IT))
The additional condition gives higher similarity score to the documents
whose topic area is specified as 'IT' than the documents whose topic area
is not specified so.
In addition to these querying facilities, we provide a simple search/retrieval
protocol in order to facilitate external programs to use the query
engine more easily. Another thing deserved for mention is that the system
employs UTF-8 encoded Unicode as its internal character set. This,
combined with the analyzer designation in the Format file, is quite important
for managing multilingual XML documents which thought to be more popular
pretty soon.
5. Concluding Remarks
This paper proposed an approach to handle text-rich XML documents with
slight extensions to a conventional text retrieval engine supporting fielded
search . At the time of writing, the implementation has just finished,
and the basic evaluation process is ongoing. With a relatively small sized
corpus converted from robot-gathered 36,000 HTML web pages (283MB), the
time consumed by the XML parser was as short as almost 10 minutes (0.01sec/document).
The resulted index size was 49.8MB, which is far small compared to that
of the original corpus. These results insist that our approach is promising
in building a practical search service for a substantially large set of
XML document corpus. We will evaluate the system with more larger corpus,
and figure out the points to improve the system performance.
The system enjoys several fruitful text search facilities from the based
text retrieval engine as described. This way of development, however, forces
us to explicitly associate a tag-path to a search field by giving explicit
definition in the Format file. This also means that any external application
program accessing to the query engine has to know relevant and available
field as well as the associated sub-structures in the XML documents. Another
problem with the current implementation is inability to directly constructing
new XML data as the search results (cf. [Deutsh 1998]). Serious consideration
must be given to ease these limitations, because these problems are essentially
associated with the issue of functional interface between the external/upper-level
program which utilizes the query engine as a building block. We will be
seeking a better interface through a series of implementation experiences
of application systems.
References
-
[Baeza-Yates 1992] Baeza-Yates,R.A. "Introduction to Data Structures and
Algorithms Related to Information Retrieval", in Information Retrieval
edited by Frakes,W.B. and Baeza-Yates,R., Prentice Hall, 1992.
-
[Berners-Lee 1998] Berners-Lee,T. "Web Architecture from 50,000 feet",
http://www.w3.org/DesignIssues/Architecture.html,
1998.
-
[Deutsh 1998] Deutsh,A., Fernandez,M., Florescu,D., Levy,A., and Suciu,D.
XML-QL: "A Query Language for XML", submission to W3C, http://www.w3.org/TR/NOTE-xml-ql/,
1998.
-
[Frankhauser 2000] Frankhauser,P., Marchiori,M. and Robue,J. "XML Query
Requirements", W3C Working Draft, http://www.w3c.org/TR/xmlquery-req.
2000.
-
[Tomita 1998] Tomita,J. and Takeno,H. "Proposal for a New IR System using
Subject Graph and Word's Weighting by Relation" (in Japanese), IPSJ SIG-Notes,
FI-52-3, 1998.
-
[Tomita 2000] Tomita,J. and Hayashi,Y. "Improving Effectiveness and Efficiency
of Web Search by Graph-based Text Representation", in Poster Proceedings
of WWW9, 2000.
-
[XPath 1999] XML Path Language (XPath)
Version 1.0, W3C Recommendation, 1999.
-
[XSLT 1999] XSL Transformations (XSLT)
Version 1.0, W3C Recommendation, 1999.
Short Vita of the Authors
Yoshihiko Hayashi is a senior research engineer, supervisor at NTT
Cyberspace Laboratories. His current interests include applied natural
language processing, intelligent information access, and cross-language
text retrieval. He received his Dr.Eng. degree from Waseda University in
1996. He was a visiting researcher at CSLI, Stanford University from 1994
to 1995.
Junji Tomita is a research engineer at NTT Cyberspace Laboratories.
His current interests include intelligent information access, XML, and
information retrieval. He received his master degree from Keio University
in 1997.
Gen'ichiro Kikui is a senior research engineer at NTT Cyberspace
Laboratories. He is interested in multi-lingual lexicography and applied
natural language processing including machine translation, and cross-language
information retrieval. He joined ATR Interpreting Telecommunication Research
Labs. from 1990 to 1994, and was a visiting researcher at CSLI, Stanford
University from 1997 to 1998.
authored by Yoshihiko
Hayashi