A Statement of the Problem
Nancy Ide
Department of Computer Science
Vassar College
Poughkeepsie, New York USA
ide@cs.vassar.edu
1 Overview
The Language Engineering (LE) community comprises researchers and developers of natural language processing (NLP) applications that depend heavily upon annotated language resources for training statistical tools (e.g., syntactic parsers, morpho-syntactic and semantic taggers, speech recognition systems), automatically constructing mono- and multi-lingual lexicons and term banks to support NLP applications, and determining linguistic patterns to support tasks such as word sense disambiguation, discourse and co-reference determination, etc. As the reliance on annotated resources has skyrocketed in the past few years, work on the development of data architectures, encoding formats, and annotation formalisms for language resources has intensified, in order to facilitate efficient and effective representation and search.
Because XML has emerged as the de facto standard for data representation on the World Wide Web, it and its supporting applications (XSLT, XML schemas, etc.) have become the language engineering community's encoding framework of choice. As a result, the Corpus Encoding Standard (Ide, 1998a,b) is being converted into an XML instantiation, called XCES (Ide, Bonhomme, and Romary, 2000), to support development of annotated language resources. Our interest here is to connect with the Information Retrieval community in order to (1) learn more about leading-edge research in IR that may have repercussions for the development of standardized data architectures and XML encoding practices for language resources; and (2) outline the architecture and search requirements for annotated language resources, which, because of their multi-dimensional, heterogeneous, and temporal nature, raise interesting challenges for IR research.
2 Representation of language resources
Language resources span a wide variety of data types, including not only textual data, but also speech signal, audio, image, video (e.g., sign language), etc., in a variety of languages (and, therefore, a variety of character formats). Base data is associated with potentially several layers of annotation, spanning the gamut of traditional linguistic annotation types (phonetic, morphological, syntactic, semantic, discourse structure, co-reference, etc.), of which there may be several variants or alternatives at any level. Annotations may be associated with a continuous segment or with any number of discontiguous segments of the base data. In addition, alignment among different kinds of language resources and their annotations is common, for example, parallel translations in multi-lingual corpora, and parallel aligned speech data representing signal and the associated orthographic transcription.
Abstractly, an annotation is a one- or two-way link between an annotation object and a point (or a list/set of points) or span (or a list/set of spans) within a base data set. Links may or may not have a semantics--i.e., a type--associated with them. Points and spans in the base data may themselves be objects, or sets or lists of objects. This model assumes a fundamental linearity of objects in the base, e.g., as a time line (speech); a sequence of characters, words, sentences, etc.; or pixel data representing images. Note that this applies to the fundamental structure of stored data, and not its logical structure; because the targets of a relation may be either individual objects, or sets or lists of objects, the model accommodates information at the logical level with more than one dimension.
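As an illustration only, the abstract model described above might be sketched as follows. The names Span, Annotation, and target_text, and the sample sentence, are assumptions made for this sketch; they are not part of CES or XCES.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Span:
    """Half-open interval [start, end) over the linear base; a point has start == end."""
    start: int
    end: int


@dataclass(frozen=True)
class Annotation:
    """An annotation object linked to one or more points/spans in the base."""
    label: str                       # the annotation object (here just a string)
    targets: Tuple[Span, ...]        # a single span, or a list of spans
    link_type: Optional[str] = None  # optional semantics (type) of the link


def target_text(base: str, ann: Annotation) -> List[str]:
    """Return the stretches of base data that the annotation points at."""
    return [base[s.start:s.end] for s in ann.targets]


# Hypothetical base data: a linear sequence of characters.
base = "She picked the book up"

# One annotation whose target is a list of two discontiguous spans.
verb = Annotation("phrasal-verb", (Span(4, 10), Span(20, 22)), link_type="lexical-unit")

print(target_text(base, verb))
```

Because the target is a tuple of spans rather than a single span, the same structure covers points, continuous segments, and discontiguous segments without any change to the base data.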
It is fairly well established within the LE community that a "stand-off" data architecture is best suited to represent language resources (Ide & Brew, 2000). Using this scheme, the data to be annotated are contained in a base XML document, and all annotations are in separate XML documents linked to the base. This avoids, among other problems, the conflict of overlapping hierarchies of data elements--that is, independent annotations that "chunk" data differently (as a simple example, consider sentence vs. line segmentation). Links within and between documents in the stand-off scheme can be one-way or two-way (e.g., for parallel texts), and annotation documents can themselves be linked. This strategy yields, in essence, a finely linked hypertext format where the links specify a semantic role rather than navigational options. That is, links signify the location(s) where markup contained in a given annotation document would appear in the document to which it is linked. As such, annotation information comprises remote or "stand-off" markup that is virtually added to the base. In principle, the base data could contain no markup at all (or, in texts, markup for gross logical structure only); all markup may be retained in separate documents with links into the original based on offsets.
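The sentence vs. line example can be made concrete with a small sketch. The element and attribute names used here (annotation, seg, from, to) are invented for illustration and do not reproduce the XCES markup; the base is taken to be plain text addressed by character offsets.

```python
import xml.etree.ElementTree as ET

# Base data with no markup at all; the line break falls inside the second sentence.
BASE = "Roses are red. Violets\nare blue."

# Two independent stand-off annotation documents over the same base.
# Their segments overlap, so they could not both be nested inside one XML tree.
SENTENCES = """
<annotation type="sentence">
  <seg from="0" to="14"/>
  <seg from="15" to="32"/>
</annotation>
"""

LINES = """
<annotation type="line">
  <seg from="0" to="22"/>
  <seg from="23" to="32"/>
</annotation>
"""


def segments(base, annotation_xml):
    """Resolve every offset-based link in an annotation document against the base."""
    root = ET.fromstring(annotation_xml)
    return [base[int(seg.get("from")):int(seg.get("to"))] for seg in root.iter("seg")]


print(segments(BASE, SENTENCES))
print(segments(BASE, LINES))
```

Each annotation document is well-formed on its own, and neither needs to know about the other; the overlap between the second sentence and the two lines never has to be expressed as nesting.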
The stand-off scheme requires potentially complex linkage among documents, parts of documents, and different modalities. In addition, it must be possible to uniquely point to the smallest possible component (e.g., character, phonetic component, pitch signal, morpheme, word, etc.). Components may or may not be the content of an XML element; in many cases, it is impractical to tag every potential target (e.g., each morpheme in a large text) or impossible to predetermine targets at the time the data is tagged. Therefore, it must be possible to address not only XML elements, but also characters and chains of characters within those elements, as well as elements and characters both within the same document and in other XML documents. The gamut of XML linkage mechanisms, including XLink (DeRose et al., 2000), the XML Path Language (XPath) (Clark & DeRose, 1999), and XPointer (DeRose, Daniel, & Maler, 1999), meets these needs. However, it is not yet clear how to most efficiently utilize XML linking mechanisms to facilitate search and retrieval from large bodies of annotated resources.
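The requirement to address a chain of characters inside an element, without tagging it, can be sketched as below. This is not an XPointer implementation: the function resolve, the id attribute convention, and the zero-based offsets are assumptions of the sketch, which only mimics the effect of pointing at an element and then at a character range within it.

```python
import xml.etree.ElementTree as ET

# Hypothetical base document: sentences are tagged, morphemes are not.
BASE_XML = '<text><s id="s1">The unhappiest child</s><s id="s2">slept</s></text>'


def resolve(root, element_id, start, length):
    """Return `length` characters starting at zero-based offset `start`
    within the text content of the element whose id is `element_id`."""
    element = root.find(f".//*[@id='{element_id}']")
    if element is None:
        raise KeyError(element_id)
    text = "".join(element.itertext())
    return text[start:start + length]


root = ET.fromstring(BASE_XML)

# Point at two morphemes that are not themselves XML elements.
prefix = resolve(root, "s1", 4, 2)
stem = resolve(root, "s1", 6, 5)
print(prefix, stem)
```

A morphological annotation document could then store only the triple (element id, offset, length) for each morpheme, leaving the base document untouched.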
3 Query and search requirements for language resources
Annotated language resources pose several problems for query and search. First, queries over annotated language resources demand search and retrieval over not only a base document, but also over any number of annotation documents that may be associated with it. For example, a query might request all instances of a given word that appear as the head of a noun phrase in a given syntactic position; or a researcher might want to determine if there is any relation between prosodic variance and part of speech by searching and analyzing their parallel patterns of occurrence. Similarly, in an annotated speech database, one might search for all words whose phonetic transcription contains a 'd' and ends with 'k', etc. In each of these cases, it is necessary to access (at least) the base document and an annotation document, locate the relevant information, and follow links between documents to retrieve associated information, which, if performed over the XML documents themselves, can be costly in terms of efficiency.
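The phonetic query mentioned above gives a sense of the access pattern involved. In the sketch below the base tokens, their identifiers, and the ASCII transcriptions are invented for illustration, and the two layers are held in memory rather than read from XML documents.

```python
# Hypothetical base document: word tokens keyed by identifier.
BASE_TOKENS = {"w1": "duck", "w2": "dog", "w3": "desk", "w4": "kid"}

# Hypothetical phonetic annotation layer; each entry links to a base token.
# The transcriptions are rough ASCII renderings chosen for the example.
PHONETIC = [
    {"target": "w1", "phones": "dVk"},
    {"target": "w2", "phones": "dQg"},
    {"target": "w3", "phones": "dEsk"},
    {"target": "w4", "phones": "kId"},
]


def words_matching(base, layer, predicate):
    """Search the annotation layer, then follow each link back to the base."""
    return [base[entry["target"]] for entry in layer if predicate(entry["phones"])]


# "All words whose phonetic transcription contains a 'd' and ends with 'k'."
hits = words_matching(BASE_TOKENS, PHONETIC, lambda p: "d" in p and p.endswith("k"))
print(hits)
```

Even in this toy form, the query touches two data sets and one link per candidate; over XML files, each of those steps involves parsing and link resolution, which is where the cost arises.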
A more critical problem arises from the fact that linguistic annotations often do not form simple hierarchies, as outlined in the previous section. For this reason, existing query languages for structured text do not apply. Query languages for semi-structured data (e.g., XML-QL, Deutsch et al., 1998) answer some problems, but fail to capture the temporal (quasi-linear) relations in speech, dialogue, etc. Query languages for annotated language resources have been developed, including for example the MATE query language (Carletta and Isard, 1999) and, more recently, a query language for annotation graphs (Bird, Buneman, and Tan, 2000). However, it is not clear that a query language that is adequate for the needs of language engineering yet exists; the annotation graph formalism, for example, does not accommodate multi-dimensional annotations (i.e., annotations applied to more than two points or spans), which are relatively common in many kinds of data.
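To illustrate what a temporal (quasi-linear) relation looks like as a query, the following sketch pairs part-of-speech labels with the prosodic labels that overlap them in time. It is not the MATE query language or the annotation graph query language; the Arc type, the tier names, and the timings are assumptions made for the example.

```python
from typing import List, NamedTuple, Tuple


class Arc(NamedTuple):
    """A labeled stretch of the time line on a named annotation tier."""
    start: float
    end: float
    tier: str
    label: str


# Hypothetical annotations over the same speech signal (times in seconds).
ARCS = [
    Arc(0.00, 0.30, "pos", "DET"),
    Arc(0.30, 0.80, "pos", "NOUN"),
    Arc(0.80, 1.20, "pos", "VERB"),
    Arc(0.25, 0.90, "prosody", "rising"),
    Arc(0.90, 1.20, "prosody", "falling"),
]


def overlapping(arcs: List[Arc], tier_a: str, tier_b: str) -> List[Tuple[str, str]]:
    """Pair every arc on tier_a with every arc on tier_b that overlaps it in time."""
    pairs = []
    for a in arcs:
        if a.tier != tier_a:
            continue
        for b in arcs:
            # Two half-open intervals overlap when each starts before the other ends.
            if b.tier == tier_b and a.start < b.end and b.start < a.end:
                pairs.append((a.label, b.label))
    return pairs


pairs = overlapping(ARCS, "pos", "prosody")
print(pairs)
```

The prosodic arcs cut across word boundaries, so the relation being queried is overlap on a shared time line rather than containment in a tree, which is the kind of relation a purely hierarchical query language has no direct way to state.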
In addition to problems related to the structure of the data itself, perhaps one of the most difficult obstacles to efficient search of language resources results from their sheer size. To provide reliable statistics for language modeling, text databases must contain (at least) millions of words; speech data is even more extensive. Coupled with potentially several annotation documents, the amount of information that must be searched can be massive. Therefore, efficient search and retrieval methods are a must for language engineering research.
The current practice in most corpus-handling software is to use the XML representation of the data for interchange and communication between tools only, due to the high processing costs of accessing and manipulating XML documents directly, as well as the current lack of software support for XML linkage mechanisms. For the purposes of processing, search, and retrieval, the XML documents are transduced to some internal tool format (often, a relational database) and all operations are performed using the internal representation of the data. While this has advantages for tool efficiency, it is not always optimal or even possible. For example, Ide, Le Maître, and Véronis (1995) show that the relational model does not easily handle the representation of complex lexical information, which requires recursive nesting of tables. We therefore need to explore other possibilities for representing the information in suites of inter-linked XML documents.
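The transduction to a relational format described above might look roughly like the following. The two-table schema, the layer names, and the sample data are assumptions of this sketch and do not describe the internal format of any particular tool; Python's built-in sqlite3 module stands in for the database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical internal format: one table for base tokens, one for all annotation layers.
conn.executescript("""
CREATE TABLE token (id TEXT PRIMARY KEY, start INTEGER, stop INTEGER, form TEXT);
CREATE TABLE annotation (layer TEXT, target TEXT REFERENCES token(id), value TEXT);
""")

# Base data: "The old man sleeps", with character offsets.
conn.executemany("INSERT INTO token VALUES (?, ?, ?, ?)", [
    ("w1", 0, 3, "The"), ("w2", 4, 7, "old"), ("w3", 8, 11, "man"), ("w4", 12, 18, "sleeps"),
])

# Two annotation layers that were separate stand-off documents before loading.
conn.executemany("INSERT INTO annotation VALUES (?, ?, ?)", [
    ("pos", "w1", "DET"), ("pos", "w2", "ADJ"), ("pos", "w3", "NOUN"), ("pos", "w4", "VERB"),
    ("np-head", "w3", "yes"),
])

# "All nouns that appear as the head of a noun phrase": one join per layer.
rows = conn.execute("""
SELECT t.form
FROM token AS t
JOIN annotation AS pos  ON pos.target  = t.id AND pos.layer  = 'pos'
JOIN annotation AS head ON head.target = t.id AND head.layer = 'np-head'
WHERE pos.value = 'NOUN' AND head.value = 'yes'
ORDER BY t.start
""").fetchall()

print(rows)
```

Flat, token-level layers map onto such tables easily. Recursively nested information has no equally direct counterpart in a fixed set of flat tables, which is consistent with the limitation noted above for complex lexical information.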
4 Conclusion
As the language engineering community attempts to manage the rapidly increasing size and complexity of language resources, it becomes more and more attractive to adopt ideas from the database community. We are already beginning to see work in which the concerns of language resource designers are converging with those arising from work on semi-structured data (e.g., Bird, Buneman, & Tan, 2000). This presentation is intended to pursue this convergence, in particular to contribute to the development of the XCES. We therefore ask rather than answer questions in this presentation, to ultimately determine the best means to represent language resources within the XML framework.
References
Bird, S., Buneman, P., & Tan, W-C., 2000. Towards a Query Language for Annotation Graphs. In Proceedings of the Second International Language Resources and Evaluation Conference. Paris: European Language Resources Association, 807-14.
Carletta, J. & Isard, A., 1999. The MATE Annotation Workbench: User Requirements. In Towards Standards and Tools for Discourse Tagging: Proceedings of the Workshop. ACL.
Clark, J. (ed.), 1999. XSL Transformations (XSLT). Version 1.0. W3C Recommendation. http://www.w3.org/TR/xslt.
Clark, J. & DeRose, S., 1999. XML Path Language (XPath). Version 1.0. W3C Recommendation. http://www.w3.org/TR/xpath.
DeRose, S., Maler, E., Orchard, D., & Trafford, B. (eds.), 2000. XML Linking Language (XLink). W3C Working Draft, 21 February 2000. http://www.w3.org/TR/xlink.
DeRose, S., Daniel, R., & Maler, E., 1999. XML Pointer Language (XPointer). W3C Working Draft, 6 December 1999. http://www.w3.org/TR/xptr.
Deutsch, A., Fernandez, M., Florescu, D., Levy, A., & Suciu, D., 1998. XML-QL: A Query Language for XML. http://www.w3.org/TR/NOTE-xml-ql.
Ide, N. & Brew, C., 2000. Requirements, Tools, and Architectures for Annotated Corpora. In Proceedings of the EAGLES/ISLE Workshop on Meta-Descriptions and Annotation Schemas for Multimodal/Multimedia Language Resources and Data Architectures and Software Support for Large Corpora. Paris: European Language Resources Association, 1-6.
Ide, N., 1998a. Encoding Linguistic Corpora. In Proceedings of the Sixth Workshop on Very Large Corpora, 9-17.
Ide, N., 1998b. Corpus Encoding Standard: SGML Guidelines for Encoding Linguistic Corpora. In Proceedings of the First International Language Resources and Evaluation Conference, Paris: European Language Resources Association, 463-70.
Ide, N., Bonhomme, P., & Romary, L., 2000. XCES: An XML-based Encoding Standard for Linguistic Corpora. In Proceedings of the Second International Language Resources and Evaluation Conference. Paris: European Language Resources Association, 825-30.
Ide, N., Le Maître, J., & Véronis, J., 1995. Outline of a Model for Lexical Databases. Current Issues in Computational Linguistics: In Honour of Don Walker. Linguistica Computazionale IX, X (Pisa, 1995), 283-320. [reprinted from Information Processing and Management, 29, 2, 159-186]