Panlingual Dogfood

Version 0

Jonathan Robert Pool

University of Washington
Linguistics 580E
Spring, 2006

Abstract

Abstract

The idea of a world whose people interact efficiently without sacrificing their linguistic diversity is becoming popular. Despite many related projects (such as Unicode, Plone, Global WordNet, controlled languages, glyphs, and machine-translation systems), a plausible model for the realization of this idea has yet to be described. I envision and model here a natural-language-oriented version of the Semantic Web vision, in which "Panlingual Aspectual Translation" is supported by a collaboratively panlingual standardization process that yields a common lexicocentric semantic representation and a family of semantically equivalent language varieties. The implementation of this idea is a sequence of prototyping, testing, elaboration, and institutionalization steps that is multilingual from the start.

Keywords

standardization, conceptual standardization, semantic standardization, lexical standardization, terminology, multilingual systems, multilingual search, multilingual information retrieval, multilingual publication, multilingual discussion, universal access to information, machine translation, translingual collaboration, universal interactivity, linguistic diversity, linguistic universals, semantic universals, knowledge representation, ontologies, Semantic Web, WordNet, Unicode, Plone, Grammar Matrix

The Problem

The revolutions in the speed, power, and efficiency of automatic information-processing services since the late 1900s have led to an unprecedented situation of global language contact. With the improvements in information technology, people throughout the world have acquired the potential to interact at negligible cost with records in, and with other people who use, hundreds of the world's languages. People can envision using the Internet to exploit documents in any language and to interact with other people regardless of the languages they know.

If history is a guide, this global language contact can be expected to lead to massive language death. There are reportedly about 7,000 natural languages in the world (Gordon 2005). According to some forecasts, most of them will be extinct within a century (Woodbury 2006). There is speculation as to whether this trend will extend to global unilingualism. Already, however, within numerous world-wide professional and commercial communities there is only one standard language considered a normal medium of non-local discourse, and that pattern tends to promote lexical impoverishment in all other languages, in turn decreasing the motivation to transmit them as native languages to subsequent generations.

Massive language death might be considered a natural and beneficial consequence of the emergence of global interconnectivity, except that linguistic diversity is widely valued as a public good. Some consider each living language an asset to its traditional bearers; others consider each living language an asset to the entire world. The purported mechanisms underlying these beliefs vary, but the shared conclusion to which these beliefs lead is that massive language death is a high price to pay for efficient global interactivity.

The value that is placed on linguistic diversity leads to an obvious question: Can the world's thousands of languages remain alive, transmitted intergenerationally as native languages, while people and documents become able to interact freely worldwide? If so, how?

The most believable answer, according to Mufwene (2002), is yes, but only if it can become rational for the native speakers of weak languages to use their own languages, instead of learning and using dominant languages. Although languages are sometimes exterminated forcibly, as with prohibitions and genocides, the proximate cause of most language death is the self-interested decisions made by the dying languages' own speakers to cease using them, and the only promising interventions to stop or reverse language death are ones that confer rewards for the use of weak languages.

But is this possible? Can the use of weak languages be made rational? The "literature on language endangerment ... does not articulate the steps" (Mufwene 2002, 388). Or, if it does, it is unrealistic, in two ways. First, it prescribes difficult social revolutions, such as stabilizing the world population and eliminating poverty world-wide. Second, it ignores the fact that distributed prosperity often in fact solidifies a few regional languages, whose dominance leads many local minority languages to die. Thus, not much is known about the conditions under which thousands of weak languages could survive the development of a global community. There is, however, some evidence that their speakers value them and would maintain them if it were "possible for speakers to earn their living competitively in these languages" (Mufwene 2002, 390).

Making global interactivity compatible with linguistic diversity seems difficult in light of this analysis. One apparently needs to discover a way to make it both possible and profitable for people to use their native languages as languages of productivity, instead of learning and using a dominant language. This outcome is plausible in environments like manufacturing plants, with many linguistically similar workers collocated and isolated. But this outcome doesn't seem likely where there is substantial migration, population intermixture, or mass participation in world-wide information exchange. Under these conditions, it seems realistic that expressions uttered in other than face-to-face contexts will often derive much of their value from the number of people able to find and understand them efficiently. If expressions' meanings can be reliably translated by automatic means from any language into any other language, they can be encoded in any language without losing value. Otherwise an expression's value varies inversely with the number of translations needed for it to reach the target audience, and fewer translations tends to be necessary when it is in a widely known language.

One solution, according to this argument, would be to make expressions automatically translatable among arbitrary natural languages. As long as an expression were uttered in some language, it would be accessible via any other language. Native speakers of weak languages would not suffer damages from using their native languages; on the contrary, they would be freed from the cost of mastering a dominant language.

Automatic panlingual translatability, however, is possible only with compromises, because translingual equivalence cannot be verifiably accomplished (e.g., Trujillo 1999, pp. 73-74, 258). One form of nonequivalence is discordant lexical ambiguity. Most lexemes are ambiguous, but in general they are discordantly ambiguous. By this I mean that, for any language X and any lexeme L_x in language X, there is at least one language Y in which no lexeme has a set of senses equivalent to the set of senses of L_x (assuming it is possible to verify the equivalence of senses). In order to translate L_x into language Y, one must choose between compactness and precision. A compact translation consists of a lexeme whose senses are only partly equivalent to those of L_x. A precise (i.e. equivalently ambiguous) translation consists of an enumeration of the possible senses of L_x (assuming there is a way to express the individual senses of L_x in language Y). Analogous nonequivalences apply to morphosyntactic ambiguities (attachment, scope, etc.) and pragmatic ambiguities (illocutionary force, presupposition, etc.).

Solution or Evasion

We can simplify the many ideas for dealing with this translatability problem by classifying them into two categories: solutions and evasions. Solution ideas specify how to accomplish efficient panlingual translation. Evasion ideas specify how to make it unnecessary.

Solution Ideas

Solution ideas make panlingual translation efficient. They can be further classified into three subcategories: support, replacement, and constraint.

Support solutions rely on traditional translation by bilingual human beings but provide mechanisms to support the translation effort and thereby make it less costly. Some systems supporting human translation use databases and Web interfaces to limit the management overhead associated with the recruitment and deployment of translators and the validation, revision, maintenance, and deployment of translated versions. Simple systems of translation memory (Trujillo 1999, pp. 59-61) manage libraries of previous translations to limit the cost of their retrieval by translators. The DotSUB (http://www.dotsub.com) system manages the human translation of subtitles for motion pictures into any language and dynamically integrates the resulting subtitles into the motion pictures in on-demand per-user publication. Plone (http://plone.org) is "a content management system with strong multilingual support". A Plone Web site can display its content in any language, if the content's language-specific items have been translated. Translated and untranslated fields can be embedded, so word-order differences can be respected without superfluous translation. The user-preferred language is automatically selected, on the basis of the browser's HTTP_ACCEPT_LANGUAGE header. Web sites typically contain recurring elements, such as controls and navigation links; these need to be translated only once. Plone is dogfooded in the Plone development interface, which is currently available in about 50 languages. Such solutions incorporating support ideas decrease the cost of management overhead and part of the cost of translator effort (such as the cost of looking up previous similar translations), but the costs incurred in translating an utterance remain roughly proportional to the number of target languages. Support solutions also fail to overcome the limits on translatability.

Replacement solutions substitute automatic implementations for human translation. Replacement systems include knowledge-based and corpus-based machine translation (Trujillo 1999, pp. 85-220). Typically, such systems are more efficient on the margin, i.e. for each additional unit of translation, than human-translation systems, but the comparison of the initial investments is not straightforward and the quality of the best automated translation is usually inferior to the quality of the best human translation. Corpus-based systems tend to produce particularly inferior results if corpora are not parallel and aligned, if corpora are small, if algorithms for the segmentation and morphological classification of texts in the source language do not exist, and if source texts are in domains different from those in which corpora are available. Some of these limitations apply notably to the goal of making content in any language available in any other language. Needless to say, replacement solutions do not overcome the limits to translatability. While support-oriented systems like those mentioned above have been designed to be potentially panlingual, no replacement-oriented systems deployed up to now seem designed to become panlingual. Most are language-specific or language-pair-specific. Some are plurilingual, such as the Majstro Translation Dictionary (http://www.majstro.com/Web/Majstro/sdict.php) and Babel Fish Translation (http://babelfish.altavista.com/tr).

Constraint solutions attack the limits on translatability by constraining the grammar and lexicon of the language in which the original expressions are encoded. The required source language is a controlled variety of a natural language (e.g., Clark 2005, Kaljurand 2006, Sowa 2004), and its additional constraints specify the meanings of the utterances precisely. Ambiguities are prevented. Expressions in such a controlled language are equivalent to expressions in a machine-oriented formalism. This makes it possible for automatic translation to work well. Some experiments in collaborative problem-solving have shown constrained communication to be even more efficient than free communication (Ford 1979). But constraint solutions based on controlled languages remain poorly evaluated. Those with strictly controlled languages seem to offer inadequate expressivity, while those with mildly controlled languages have left much ambiguity intact and have not demonstrated automatic translatability (Pool 2005).

Evasion Ideas

Evasion ideas make it unnecessary to translate content into other languages. These ideas can be further classified into three subcategories: paralingual, unilingual, and metalingual.

Paralingual evasions rely on the assumption that all human beings, although they speak thousands of mutually unintelligible languages, also know how to exchange meaning by using universally shared paralinguistic codes. Among these are codes realized as music, dance, gestures, facial expressions, and drawings. Work based on this concept has produced sets of pictorial symbols ("glyphs"), such as the USP Pictograms used for the instruction of patients about the use of medications (http://www.usp.org/audiences/consumers/pictograms/); systems for graphical communication between persons without a common language (Tanimoto 1998); and text-free computer interfaces (Huenerfauth 2002). The success of paralingual systems depends on the universality of paralingual codes, and also (Tanimoto 1998) on the adequate expressiveness of whatever codes are universal.

Unilingual evasions assume that a single language becomes shared by all participants in a community of interaction. The shared language might be one that achieves dominance in global competition, or one that is adopted by a collective decision. It could be a natural language, a natural language modified for a global role, or a language designed for its global role and based on multiple natural languages or abstract principles. In any of these cases, panlingual translation can be evaded. The assumption of a globally shared language is compatible with linguistic diversity if mass bilingualism can be maintained. The maintenance of mass bilingualism, however, would presumably require many people to overcome the tendency to confine public utterances to the shared language. When people composed articles, books, Web pages, or discussion-group messages, for example, they would need to express themselves twice: once in their native language and once in the shared language. If this practice were normal, all languages would presumably continue to live and develop lexically along with the shared language. The incentives that would motivate this practice, however, are not evident.

Metalingual evasions call on people to annotate their linguistic expressions until automatic reasoning with these annotations as input can be performed, and therefore automatic translation can be either reduced to natural-language generation or avoided altogether. When people ask software agents to perform tasks, such as answering questions, the agents can use metalingual expressions as raw materials, so the original expressions in natural languages can remain untranslated.

The most prominent metalingual idea is the vision of the "Semantic Web" set forth in Berners-Lee (2001). The Semantic Web's content is re-encoded with an unambiguous formal (RDF/XML) syntax, and things referred to are expressly defined in terms of their class memberships and property values, with classes and properties further defined and constrained in formal (OWL) ontologies. Authors are free to define their own ontologies, but typically re-use concepts from consensual ontologies, which typically enrich their concepts with constraints, such as transitivity, mutual exclusivity, and numeric value types. Shared rich ontologies permit software agents to aggregate knowledge from multiple sources and perform reasoning on it (such as using information about dog biscuits to answer questions about pet food).

The willingness and ability of billions of ordinary Web authors to adhere to shared ontologies remains to be demonstrated, and one possible obstacle to this is the emergence of ontological fragmentation along natural-language lines. Moreover, the Semantic Web expects all people to formulate and express meanings and presuppositions with sufficient completeness and precision to permit automated reasoning, but in reality this capability may require as much training as that of a skilled knowledge engineer (Marshall 2003). If so, metalingual evasions may be more expensive than the mass learning of a common natural language posited by unilingual evasions.

State of the Art

No attempt at solution or evasion has shown that efficient panlingual interactivity is feasible. Each idea has inherent limitations that make it a questionable basis for intervention. Support solutions have continuing costs per unit of content proportional to the number of languages of the target audience. Replacement solutions deliver inferior results, whose inferiority grows as the source and target languages become more numerous and weak. Paralingual evasions have not been shown to work with reliability and adequate expressiveness. Unilingual evasions are based on the dubious feasibility of stable asymmetric mass bilingual maintenance and redundantly bilingual communication. And metalingual evasions require translingually consistent precision and explicitness that have not been shown to be within the grasp of most human beings.

A Vision

Having outlined some solution ideas and evasion ideas and some doubts about their realism, I now offer a new idea that borrows from them to fashion an evasive solution, which I shall call "Panlingual Aspectual Translation" (PAT).

PAT modifies the unilingual ideas' assumption of a shared global language, reducing what is shared to just one aspect of language, namely its semantic system. Every person who participates in PAT has acquired competence not only in a colloquial variety and a standard literary variety of his or her native language, but also in a written global variety. The global variety's semantic system is equivalent to those of the global varieties of the world's other languages. Any meaning or semantic distinction that is expressable in any of the other global varieties can be replicated in it, and vice-versa. They are all formally equivalent to a common semantic representation. The global variety is a morphological, syntactic, lexical, and orthographic variety of the person's native language, while being a semantic variety of the global varieties of all other languages. We can think of it as an "aspectual dialect": Different aspects of it are dialects of different languages. The global variety resembles the literary language as it might be (mis-)used by a typical world citizen who imposes a universal semantic system on it.

In terms of machine-translation paradigms, the common semantic representation fits interlingua and knowledge-based translation (Trujillo 1999, pp. 68-182), rather than semantic-transfer translation (Trujillo 1999, pp. 135-147). In the former paradigms, there are two stages that require conversion rules: input from the source into the intermediate representation, and output from the intermediate representation into the target. In the latter paradigm, there are three stages: input from the source into the source semantic representation, transfer from the source semantic representation to the target semantic representation, and output from the target semantic representation to the target. The latter paradigm, requiring pairwise transfer rules, has costs that vary with the square of the number of languages, while the former paradigm's costs are linear in the number of languages, making the former paradigm more applicable when panlingual translation is envisioned.

The equivalence among all global varieties makes it possible to construct a system for the automatic bidirectional conversion of expressions between each language's global variety and the common semantic representation. Thus, translation is not evaded, but human translation is replaced with automatic translation, and (in contrast with today's replacement solutions) there is no loss of meaning in the process. Round trips produce the original expressions. There is human effort, however, in the exchange of meaning, beyond that expended in unilingual communication. The extra human effort is translation-like, in the sense that formulating or understanding a text in the global variety of one's language may require a quasi-translation between the semantics of that variety and the semantics of the literary or colloquial variety. Lexemes and constructions have similar, but not always identical, meanings between a language's varieties. Discordant ambiguities between literary varieties are, wherever not eliminated, made concordant between global varieties.

PAT modifies the paralingual ideas' assumption that all people have shared meanings, assuming not that meanings are naturally shared, but instead that shared meanings are achievable with collaborative standardization and people are motivated to achieve and utilize standard meanings. Natural languages map expressions onto semantic space in different ways on many dimensions, such as color, distance, number, gender, aspect, evidentiality, and formality. These different mappings may or may not cause native speakers to perceive universes of discourse in correspondingly different ways. PAT assumes merely that people who engage in global exchanges of meaning are willing to adopt incrementally, and are able to learn to apply, a shared semantic system for that activity.

PAT's semantic representation is rich in explicit constraints and entailments, as are ontologies in the metalingual Semantic Web vision, but PAT goes beyond the Semantic Web in what it demands of an ontology. The Semantic Web permits anybody to publish an ontology and does not require multiple ontologies in the same domain to be coordinated. PAT relies on collaboration to produce global semantic standards, general and domain-specific. PAT's standardization process is accessible to universal participation, notwithstanding that the participants have no shared language in which to deliberate. Adopting a semantic standard when the world's languages exhibit massive semantic dissimilarities requires discussion, negotiation, compromise, testing, and decision-making. How can people with thousands of different languages engage in these activities together? Isn't this standardization process a prerequisite to the shared semantic system that would enable PAT itself, and isn't PAT's realization a prerequisite to the kind of collaboration across languages that the standardization process requires? Doesn't PAT, then, involve an impossible bootstrapping dilemma?

PAT uses dogfooding in order to defeat this dilemma or, insofar as this is impossible, relies on support solutions, invoking human translators to enable communication when the semantic system of PAT has not yet achieved an expressiveness that it requires for deliberations. The participants in PAT, paying the costs of human translation, have incentives to structure the deliberation process as one with a highly expressive panlingual form interface (such as Plone can manage), so human translation is minimized once forms have been translated. There are also incentives to sequence the elements of the incrementally developing semantic system in such a way as to minimize the need for human translation of discussion content. We might expect, then, that the earliest semantic elements to be standardized would be those extensively used in the standardization deliberations, such as elements related to proposals, acceptances, goodness ratings, voting, and expression-denotation relations. Finally, PAT's semantic standards begin their lives as proposals, and proposals based on merely plurilingual input can be implemented on a trial basis without violating the assumption of universal access to the adoption process.

PAT has some challenges to overcome, including:

the difficulty of achieving consensus on the representation of interests in, and the rules of, the semantic standardization process
the possibility that the semantic standard would force the global varieties to make semantic distinctions so granular as to exceed the lexical and morphosyntactic inventories of their base languages, requiring awkward supplementary annotations
the possibility that many global varieties would differ semantically from their base languages to such an extent as to make them more difficult to learn and use than an entirely different language
the possibility that global varieties would extinguish literary varieties and render linguistic diversity trivial by semantically homogenizing all languages

Potential problems like these have been discussed, but not resolved. For example, industrial controlled languages (e.g., Clark 2005, Kaljurand 2006, Sowa 2004) have been found to require as little as a few days of study, but also to present substantial difficulties to some learners and users, particularly native speakers of their base languages (Hebling 2002, pp. 54-57). I assume that testing is necessary in order for these challenges to be assessed, and therefore the most valuable next step in the development of the PAT idea is to model its implementation.

A Model

My notion of a model of PAT is an abstract description not only of how PAT operates, but also of how a world like the current world becomes a world in which PAT is pervasive. Describing how PAT operates does not include a description of its semantic representation, because PAT is a system in which the world's people, operating through a standardization process, choose or design a semantic representation that they will share. My model here is an adoption model, which describes a transition, beginning with current conditions.

Step 0: Initial Plurilingual Corpus

In step 0, researchers on PAT select or produce a parallel corpus in several typologically different natural languages containing discourses that they intend to be representable with the initial version of a prototype of a PAT semantic representation.

The corpus is defined as a compromise among the competing values of expressivity, topical diversity, linguistic diversity, user appeal, and feasibility. This corpus is used in steps 1, 2, and 3, in which a semantic representation is defined and global varieties for the corpus languages are designed and tested on native speakers.

Step 1: Initial Prototype Semantic Representation

In step 1, researchers design or adopt an initial semantic representation that they speculate could be adopted in a standardization process and that they intend to be capable of representing the discourses found in the step-0 corpus. They test the representation by using it to encode the corpus discourses and judging the practical adequacy of its representation of the meanings expressed in those discourses.

Step 2: Initial Prototype Global Varieties

In step 2, the researchers specify morphological, syntactic, lexical, and semantic constraints on the literary varieties of the languages of the corpus, so as to define prototype global varieties of those languages that are equivalent to the semantic representation. They implement the global varieties computationally, so that sentences input in any of them are automatically converted to the semantic representation and semantic representations are automatically converted to sentences in the global varieties, without corruption. The evaluation of the global varieties in step 2 is based on the correctness of translation from and into them, and on the extent to which their grammars and lexicons are judged faithful to their base languages.

Systems that may be adapted for step 2 include interlingual and knowledge-based machine-translation systems using intermediate representations, such as Unitran using Lexical Conceptual Structure (Trujillo 1999, pp. 168-182), Mikrokosmos using Text Meaning Representation (Trujillo 1999, pp. 182-199), the Interlingual Annotation of Multilingual Text Corpora project (http://aitc.aitcnet.org/nsf/iamtc/) with a new representation under development, and the Grammar Matrix (http://www.delph-in.net/matrix/) using Minimal Recursion Semantics (Trujillo 1999, pp. 212-214). Of these, it appears that the Grammar Matrix has been configured for the most efficient approach to panlinguality; it has been applied to 40 languages, including languages with very limited resources (http://depts.washington.edu/uwcl/twiki/bin/view.cgi/Main/LanguagesList). The Grammar Matrix makes use of a morphosyntactic and semantic knowledge base of linguistic universals and cross-linguistic typology. Using a Web form (http://www.delph-in.net/matrix/modules.html), the Grammar Matrix elicits a lexicon and a set of grammatical parameter values for any human language from a person who knows the language. It then automatically produces an initial computational parsing and generating grammar for the language. The grammar parses any compliant sentence into a semantic representation, and it generates a sentence from any compliant semantic representation. If a shared semantic representation were adopted in step 1 and a Grammar Matrix elicitation form were designed for that representation, then the grammars of the global varieties would be automatically intertranslatable.

Step 3: User Testing

Step 3 extends the testing of the prototype semantic representation and global varieties beyond the corpus of step 0. Native speakers of the corpus languages are taught to read and write their corresponding prototype global varieties, and they use these varieties to interact with one another and with the corpus. Their utterances also expand the corpus.

Initially, the user testing occurs in a laboratory environment. The evaluation includes measurement of the cost of user acquisition of competence in global varieties and the productivity (encoding and decoding speed) of global varieties relative to that of the corresponding literary varieties. User opinions about the global varieties and the experience of using them are also elicited for the evaluation. Defects in the semantic representation and in the global varieties discovered during the evaluation are remedied, and evaluation is repeated.

If and when the semantic representation and the global varieties are found satisfactory in a laboratory environment, testing with public users is begun. This testing depends on the tutorial documents and/or programs that teach native speakers of the prototype languages to read and write their corresponding global varieties. Ideally, the corpus has an expressive range sufficient to enable some activity that appeals to a linguistically diverse public, so the user testing can escape from the laboratory and continue in the real world, generating data for analysis and insight. This may not be a fanciful goal, given that other projects offering interactivity with extremely limited expressive ranges, such as the ESP Game (http://www.espgame.org), have succeeded in attracting mass voluntary public participation.

Step 3 continues in parallel with the following steps, elaborating its functions as PAT's own elaboration permits.

Step 4: Semantic Elaboration

Step 4 elaborates the prototype semantic representation and global varieties. They remain prototypes, therefore under the researchers' control. The elaborations arise from the researchers' judgments, informed by expressions of demand arising from users participating in the testing of step 3. Elaborations can include refinements of expressiveness within existing domains, as well as extensions into new domains.

Some of the semantic elaboration is structural. Structural elaborations may make the initial semantic representation more expressive with respect to such properties as modes, times, evidentialities, and illocutionary forces. It has been claimed (Berners-Lee 2001) that the vast majority of statements people wish to make on the Web are elementary definitions and elementary predications, but, even if this is true, it seems to imply nothing about the demand for expressiveness, which is indicated by the value users get from greater expressive power. The value of a particular expressive feature may be high, even if the fraction of all statements making use of that feature is small. User testing can reveal evidence of the values delivered by expressive features. The PAT project also makes its own demands: It needs particular expressive features in order to dogfood PAT effectively in the standardization process that begins in step 5.

The rest of semantic elaboration is lexical. One resource available for the facilitation of lexical elaboration is the collection of plurilingual lexical databases arising from the WordNet project (http://wordnet.princeton.edu/). These include the databases produced in the Global WordNet (http://www.globalwordnet.org/), EuroWordNet (http://www.illc.uva.nl/EuroWordNet/), MultiWordNet (http://multiwordnet.itc.it/english/home.php), and Mimida (http://www.gittens.nl/SemanticNetworks.html) projects. There are now unilingual WordNet databases for 39 languages, typically with 5,000 or more word senses (limited to nouns, verbs, adjectives, and adverbs), and project participants have worked on mapping their senses to one another and/or to universal sense taxonomies, including a set of "Global Base Concepts" defined as those "that act as Base Concepts in all languages of the world". This work may facilitate the lexical elaboration of the prototype semantic representation and global varieties. The resulting concept taxonomy may be treated as a proposal, to be discussed and improved in step 5.

Although the Berners-Lee (2001) claim about the simplicity of the necessary semantic structure may seem implausible, this claim is compatible with the typical manner in which formal expressive systems are semantically elaborated, and it may be practical to apply that same pattern to global varieties of natural languages. For example, natural languages may have multiple tenses, including simple, immediate, remote, and relative ones, but such time distinctions can be expressed lexically, such as with various adverbs, instead of structurally, and elaborating the semantic representation in this manner may make the design of equivalent global varieties less costly than if the elaboration were structural.

Step 5: Prototype Lexical Standardization System

Step 5 partly releases PAT's lexical repertoire from the researchers' total control by opening it to public deliberation. For this step, the researchers adopt or design a deliberation system that permits both formal actions by participants, such as proposals, rankings, and votes, and discussion, with the use of the prototype global varieties as media of expression. Contributions to discussions are accessible in all of the existing prototype global varieties. Participants' decisions are initially advisory. The main questions addressed in the prototype lexical standardization process are:

What lexical concepts should exist in the semantic representation?
What properties and relations apply to each lexical concept in the semantic representation?
For any particular lexical concept and any particular prototype language, what should the entry form of the lexical expression for that concept in that language's global variety be, and what grammatical properties should that lexical expression have?

For example, the participants may discuss whether food that nourishes humans, dogs, plants, and thought should be considered a single "food" concept, and, if so, whether it should be distinguished from a "fuel" concept; and whether a "dogfood" verb concept, meaning "to use in the conduct of one's own business the product that one produces", should be a primitive concept.

The evaluation of step 5 focuses on the dogfooding challenge: Is the semantic representation structurally and lexically expressive enough to permit effective and satisfying deliberations about the lexical semantics of that representation itself?

If step 5 elicits substantial activity from the public, it can be expected that some participants engage in off-line discussions of the above questions in natural languages' literary varieties. To the extent that such discussions deal with the first two questions and effectively swamp the deliberations in the official forum, the opportunity for translingual deliberation fails to motivate the participants to incur the costs of learning and using the global varieties, and these costs are therefore too high to make PAT successful in this dogfooding context. But discussions in literary variety X about the third question as it applies to global variety X can be considered a natural consequence of the PAT design. These discussions arguably do not substantially affect the users of other languages.

An existing panlingual standardization project that may offer a model of such semicentralized discussion is Unicode (http://www.unicode.org/). It defines a text-encoding standard intended to represent texts in all languages, natural and artificial, up to now including 58 scripts (Latin, Cyrillic, Arabic, Devanagari, etc.) and 13 other symbol collections (Braille, mathematical, musical, etc.). Because Unicode makes provision for about 1 million characters, it largely eliminates scarcity and inter-script competition. Decisions on characters in one script do not interact substantially with decisions on characters in another script, so experts on particular scripts can autonomously formulate proposals for their scripts. If a script is used in only one language, the relevant discussions may take place in that language. Other scripts are used by multiple languages, and Unicode also contains multi-script policies, such as the Unicode Standard Stability Policy (http://www.unicode.org/standard/stability_policy.html) and the dynamic composition principle (Unicode 2004, p. 20), so deliberation on these policies involves speakers of diverse languages.

Unicode's model applies to PAT only in part, however, because Unicode's product is panlingual, but its process is not. That is only natural. Unicode cannot dogfood its product to enable panlingual deliberation, since its product permits people to use digital equipment to write and read, but not understand, any language. PAT does the latter, and therefore PAT, unlike Unicode, can conceivably dogfood its product to conduct its business panlingually. Unicode has adopted a unilingual solution by producing almost all of its documentation and conducting its deliberations in English.

Unicode, by making it possible for PAT's global varieties to use their base languages' authentic scripts, is a crucial enabling technology for PAT. If PAT were to succeed, it could be an enabling technology for the panlingualization of the process whereby Unicode documents, maintains, and extends its standard.

Step 6: Prototype Semantic-Structure Standardization System

Step 6 conducts a standardization process and an evaluation thereof similar to that of step 5, but now with the semantic structure as the object. It follows step 5 on the assumption that the dogfooding of a semantic standard is more difficult to apply to the discussion of its structure than to the discussion of its lexical concept inventory. The main questions to be deliberated are:

What semantic operators should exist in the semantic representation?
What constraints should apply to each semantic operator?
For any particular semantic operator and any particular prototype language, what parsing and generation rules should apply?

For example, the participants may discuss which of the thirteen specific values of the aspect feature (inceptive, iterative, continuous, etc.) enumerated in the General Ontology of Linguistic Description (http://www.linguistics-ontology.org/html-view/index.html) deserve to be formalized in PAT's semantic representation, and how. Other features susceptible to such deliberation include evaluation, evidentiality, force, formality, natural gender, modality, mood, number, person, polarity, presuppositionality, size, tense, thematic role, topicalization, and voice.

Step 7: Prototype Multilingualization

In step 7, the original set of prototype languages is expanded with public collaboration. Most of the infrastructure for this expansion has been produced in the prior steps. Step 7 depends, however, on bilinguals to specify the lexicons and grammatical parameter values of the additional languages through step 4. Even if the WordNet projects were used as sources of lexical elaboration for the prototype languages (implying there were WordNet databases for all of the prototype languages), most of the additional languages do not have corresponding WordNet databases, so participants for those languages begin with empty lexicons. Their work may seed WordNets for those languages, instead of the reverse.

It is possible that lexical seeding can occur even for languages that lack pre-existing lexical databases, by means of lexicographic induction. A small but purposively selected initial lexicon elicited from an informant, combined with lexical patterns exhibited by typologically and genetically related languages that are already in PAT, may permit a set of rules to be induced that predict non-elicited lexemes and non-existent (neologistic) lexemes.

The PAT deliberation system serves for the discussion of alternatives and the negotiation of consensuses, but consensuses are not mandatory. Step 7 is still a step in prototype development, so the researchers retain decision authority. Moreover, multiple global varieties per base language are possible. One value of multiple global varieties is that they can be appropriate for different user types. For example, some users may prefer more formalized global varieties, while other users prefer more authentic (native-faithful) global varieties.

Step 8: Institutionalization

In step 8, the PAT researchers guide the development of a standardization institution that takes control over PAT from them. The relevant announcements, invitations, responses, discussions, and decision documents are composed in PAT global varieties, making the entire process dogfooded.

If this process involved face-to-face conferences or teleconferences, PAT would be unable to support oral interaction without reliable speech recognition that works with all global varieties. As an alternative, face-to-face writing, such as is employed with chat interfaces and hand-held translators, could be employed.

Conclusion

Under the above implementation model, PAT could begin to affect real-world interactions starting in step 3. PAT's success in this project would be surprising, since if it had success potential one could expect entrepreneurs to have forecast that and to have developed systems harnessing that potential. Thus, PAT's success would presumably imply that this project revealed surprising findings, such as these:

People value the ability to reach a panlingual audience, be reached by a panlingual public, and access panlingual literatures more than was known.
People value the ability to use their own native languages in global interaction more than was known.
The meanings that people want or need to exchange require surprisingly modest semantic structures.
People throughout the world have a more uniform set of conceptual cognitions and emotions than would have been inferred from the differences among the world's natural languages.
People can make the transition from a natural language's literary standard to a semantically global standard of that language with surprising ease, rapidly learning and easily maintaining separations between the different sense mappings of the same lexemes in the two standards.
People retain a surprisingly large fraction of the benefit of linguistic intuitions when using a global variety, in contrast with their loss of intuition when using machine-oriented representations such as RDF/XML.
Global varieties of the world's weak languages do not race ahead of their literary-standard and colloquial counterparts in the sizes of their lexical inventories, because the former, while absorbing lexical innovations from the world, also act as surprisingly productive lexical sources for their counterpart varieties.
Global varieties of the world's languages are surprisingly efficient targets for foreign-language learning compared with their literary-standard counterparts, and facilitate the subsequent learning of the latter.
The use of a shared semantic representation generates a surprisingly reliable body of extractable knowledge as a by-product and thereby increases the power of automated reasoning about PAT discourses as time passes.

If PAT were pervasively implemented and performed as intended, Internet services could become more effective. Some examples of improvements are:

Documents (Web pages, blogs, mail messages, etc.) are written in one language and can be read in any language.
Search queries are formulated in one language and all and only precisely matching documents are retrieved in the same language, translated from their original languages.
Questions are asked in any language and precise answers are delivered in the same language, based on the aggregation of knowledge expressed in any language.
Principles of knowledge summarization are applied to knowledge expressed in any language.
Alert requests are expressed in any language, and requesters are notified of precisely matching events, announcements, and changes expressed in any language.
Transactions with proposals, offers, acceptances, and agreements are conducted among parties that do not know a common language.
Teachers teach knowledge and skills to students when the teachers and students do not have a common language.
Discussion fora permit the active participation of all interested persons, regardless of which languages they know.
Collaborative teams consist of people with appropriate mixtures of domain knowledge, whether or not they know a common language.

The vision I have described, if attempted, might succeed or fail, but in either case it would be a test of the realizability of the mass-participation and linguistically inclusive principles of the proposed Semantic Web. An attempt to implement the vision of Panlingual Aspectual Translation would provide empirical measures of the cost and limits of panlingual interactivity.

Were PAT to be practical, it would make foreseeable improvements in Internet services. In addition, it would influence the world's economy of language. Languages that are being treated as doomed would become useful for participation in global information exchange, and this would increase their value as media of productive and other expression. Those considering whether to design writing systems for as-yet unwritten languages, and those considering whether to becoming literate in written minority languages, would be able to anticipate higher returns on their investments. The global linguistic equilibrium could change, as it became less costly to satisfy the preference for linguistic diversity and linguistic preservation.

References

Berners-Lee 2001. Tim Berners-Lee, James Hendler, and Ora Lassila, "The Semantic Web", Scientific American, 284(5), 2001, 34-43. http://www.scientificamerican.com/article.cfm?articleID=00048144-10D2-1C70-84A9809EC588EF21&catID=2

Clark 2005. Peter Clark, Phil Harrison, Tom Jenkins, John Thompson, and Rick Wojcik, "Acquiring and Using World Knowledge Using a Restricted Subset of English", 2005. http://www.cs.utexas.edu/users/pclark/papers/flairs.pdf.

Ford 1979. W. Randolph Ford, Alphonse Chapanis, and Gerald D. Weeks, "Self-Limited and Unlimited Word Usage during Problem Solving in Two Telecommunication Modes", Journal of Psycholinguistic Research, 8, 1979, 451-475.

Gordon 2005. Raymond G. Gordon, Jr. (ed.), Ethnologue: Languages of the World, 15th edn. Dallas: SIL International, 2005. http://www.ethnologue.com/.

Hebling 2002. Uta Hebling, Controlled Language am Beispiel des Controlled English. Trier: Wissenschaftlicher Verlag Trier, 2002.

Huenerfauth 2002. Matthew Paul Huenerfauth, "Design Approaches for Developing User-Interfaces Accessible to Illiterate Users", presented to American Association of Artificial Intelligence (AAAI2002) Conference.

Kaljurand 2006. Kaarel Kaljurand and Norbert E. Fuchs, "Bidirectional Mapping between OWL DL and Attempto Controlled English", delivered at Fourth Workshop on Principles and Practice of Semantic Web Reasoning, Budva, Monte Negro, 2006. http://www.ifi.unizh.ch/attempto/publications/papers/ppswr2006_kaljurand.pdf.

Marshall 2003. Catherine C. Marshall and Frank M. Shipman, "Which Semantic Web?", Hypertext '03 Proceedings, 2003. http://www.csdl.tamu.edu/~marshall/ht03-sw-4.pdf.

Mufwene 2002. Salikoko S. Mufwene, "Colonization, Globalization and the Plight of 'Weak' Languages", Journal of Linguistics, 38, 2002, 375-395.

Pool 2005. Jonathan Robert Pool, "Can Controlled Languages Scale to the Web?", manuscript, 2005. http://utilika.org/pubs/etc/ambigcl/.

Sowa 2004. John F. Sowa, "Common Logic Controlled English", 2004. http://www.jfsowa.com/clce/specs.htm.

Tanimoto 1998. Steven L. Tanimoto and Carlo E. Bernardelli, "Extensibility in a Visual Language for Web-based Interpersonal Communication", manuscript, 1998. http://www.cs.washington.edu/research/metip/vl/vv/UW-CSE-98-03-01.PDF.

Trujillo 1999. Arturo Trujillo, Translation Engines: Techniques for Machine Translation. London: Springer, 1999.

Unicode 2004. Unicode Consortium, The Unicode Standard, Version 4.0, chapter 2. http://www.unicode.org/versions/Unicode4.0.0/ch02.pdf.

Woodbury 2006. Anthony C. Woodbury, "What is an Endangered Language?". Linguistic Society of America, 2006. http://www.lsadc.org/info/pdf_files/Endangered_Languages.pdf.