Interlingual Annotation for Machine Translation

Start

Introduction

Project

Corpora

Objectives

the creation of a semantic representation system (Reeder 2004, p. 1)

to produce a practical, commonly-shared system for representing the information conveyed by a text, or interlingua (IAMTC 2004, "Home")

developing and testing a well-defined, well-motivated, and practical level of representation that captures semantic information from natural language text (IAMTC 2004, "Goals")

The development of six semantically-annotated bilingual corpora that pair English texts with corresponding text in Japanese, Spanish, Arabic, Hindi, French, and Korean. (Reeder 2004, p. 2)

The semantically annotated corpora will be useful for MT development and other natural language processing applications. (Reeder 2004, p. 2)

Contributions

The scientific interest of this research lies in the definition and annotation feasibility testing of a level of semantic representation for natural language text--the interlingua representation--that captures important aspects of the meaning of different natural languages. To date, no such level of representation has been defined complete with an associated annotated corpus of any size. As a result, corpora have been annotated at a relatively shallow (semantics-free) level, forcing NLP researchers to choose between shallow approaches and hand-crafted approaches, each having its own set of problems. (Dorr/Farwell 2004, p. 6)

This research will help provide the basis for a paradigmatic shift in natural language processing (NLP), enabling corpus-based research as well as linguistic research into language-independent meaning representations in applications such as machine translation, question answering, text summarization, and information retrieval. (IAMTC 2004, "Goals")