# Mathematical linguistics

The mathematical discipline whose objective is the development and study of ideas forming the basis of a formal apparatus for the description of the structure of natural languages (that is, the metalanguage of linguistics). The origin of mathematical linguistics can be roughly placed in the 1950's; it was brought to life first of all by the internal needs of theoretical linguistics, which at that time was ripe for an elaboration of its basic ideas, and also by problems in the automatic processing of linguistic information (see Automatic translation). In mathematical linguistics methods of the theories of algorithms, automata and algebra are widely used. Keeping its applied sense, mathematical linguistics is constantly evolving along the path of transforming into a pure mathematical discipline, being essentially a branch of mathematical logic. At the same time the circle of applications of mathematical linguistics is expanding; its methods have found application in the theory of programming.

The linguistic concepts underlying the formal description of the structure of a language belong to structural linguistics. The most important of these concepts, the representation of a language as a "system of pure relations", brings language near to the abstract systems studied in mathematics. This representation makes concrete the concept of the function of language as a transformation of certain abstract objects — "meanings" — into objects of another type — "texts" — and conversely. This suggests the idea of studying this transformation (after elaboration of the notions of "meaning" and "text") by mathematical means. The application of this approach is difficult if one attempts to consider the transformation "in the large", because of its extraordinary complexity and also because of the difficulty of formalizing the notion of "meaning". However, a clear understanding suggests breaking down the transformation into steps. For example, as one of the roughest articulations of a certain stage can be included the passage from "meaning" of statements to "syntactic structures without linear order" — a set of statements joined by "syntactic relations" but not yet arranged in linear sequence; in the next stage one obtains linear sequences of words, which are then transformed into a chain of sounds. For a more subtle articulation syntactic structures at several levels are introduced, each further removed from the "meaning" and closer to the "text"; the "post-syntactic" stages also were subject to further breaking down. Such a stage already is easier to describe mathematically if one makes more precise the representation of objects at the intermediate levels and simulates the transition from one level to another by effective mappings. It is true, this transformation is ambiguous, as also are all, or almost all (depending on the method of articulation), intermediate steps. This is related to one of the most important peculiarities of language, the presence of synonyms, that is, the possibility of expressing one and the same content by different words. Therefore it is suitable not to construct deterministic effective systems (algorithms) but to construct non-deterministic systems (calculi), which allow either for a given object at some level to enumerate the corresponding objects in the next level or the objects (in the same level) synonymous with it, or to enumerate the set of "correct" objects of the given level (that is, those which by some known correct method can be associated with objects of the preceding level), or to enumerate the set of associated pairs of objects of two given neighbourhood levels (for example, "a sentence + its syntactic structure"), etc. Calculi of this kind are known as formal grammars (cf. Grammar, formal). Simultaneously with formal grammars which simulate the transformations of linguistic objects there arise constructions meant for the formal description of these same objects. In addition, in a set of objects of one level there arise classifications and relations, in many ways similar to the categories of traditional grammar (such as parts of speech, gender, case, etc.) and in a number of cases coinciding with them. Without the introduction of these classifications and relations the actual construction of formal grammars for natural languages is practically impossible.

Thus it is possible to separate three aspects of the formal description of a language: the description of the structure of linguistic objects of different levels, the description of certain special relations and classifications into sets of these objects and the description of the transformations of individual objects to others, and also the structure of the sets of "correct" objects. Associated with these aspects are the three fundamental divisions of mathematical linguistics: 1) the development and study of ways of describing the structure of parts of speech; 2) the study of linguistically significant relations and classifications into sets of linguistic objects (a formal system constructed for this purpose is usually called an analytic model of a language); and 3) the theory of formal grammars.

To describe the structure of parts of speech syntactic structures are used which are represented by a graph or di-graph of a special form, usually with labelled vertices and/or edges. It is the best developed theory of description for the "surface" levels (that is, those most remote from the "meaning"). At these levels the structures are usually trees. The methods of description of the "deeper" levels have been intensively studied. For this, in particular, the apparatus of so-called lexical functions was proposed, playing a role in the description of the meaning of combinations of words similar to that played by the traditional categories of gender, case, number, etc. in the description of syntactic combinations. There is not yet any means of strictly describing the "meaning" level, but in many investigations it appears likely that to this end a "successive approximation" may produce an approach to the formal description of meaning. This does not exclude other approaches; in particular, much research is devoted to methods for expressing in natural language predicates, propositional links, quantifiers, and to the "translation" of formal-logical languages into natural languages and conversely. Here one borders on work on the design of so-called semantic languages, in which meanings are associated with texts by simple and strictly formal methods.

Analytic models of languages are important, in particular, because they allow the logical nature of many ideas and categories of traditional linguistics to be made more precise. These models do not always have the nature of effective procedures, since they may contain such ideas as an (infinite) set of grammatically correct sentences of some language being regarded as given. However, in a number of models the initial data are represented as finite sets and finitary relations; in these cases the procedures in the model are effective. Bordering on the theory of analytic models is the theory of linguistic deciphering: its objective is the construction of procedures, applied in suitable analytic models to "unordered" empirical data of the language, which are always effective and allow one to obtain not only abstract definitions but also concrete information on the structure of a concrete language (for example, algorithms realizing the automatic subdivision of the set of phonemes of a language into the classes of vowels and constants without using any information on the language apart from some sufficiently long text).

The theory of formal grammars occupies a central position in mathematical linguistic because it allows one to simulate the most essential aspects of the function of language — the processing of meaning into text and conversely — and because of this it serves as a connecting link between the remaining divisions of mathematical linguistics. The nature of the apparatus of the theory of formal grammars is in many ways close to the theory of algorithms and automata. The most developed of the other types of formal grammars are those which characterize the set of grammatically correct sentences of a language and attribute a syntactic structure to those sentences. Here, sentences are modelled by chains (words) over a finite alphabet whose elements are to be interpreted as the words of a natural language (therefore, in mathematical linguistics the term "chain" is preferred to the term "word" and also the alphabet is often called a dictionary) and the model of the set of grammatically correct sentences is some formal language. In particular, generative grammars are of this type (cf. Grammar, generative). A generative grammar is essentially a special case of a Post calculus: it consists of a finite alphabet divided into two parts, the basic and the auxiliary alphabet, a finite set of deduction rules, representing the substitution rule in the form $\phi\to\psi$ ($\phi$ and $\psi$ are chains) and one axiom (usually) consisting of one auxiliary symbol, known as the initial symbol. The (formal) language generated by such a grammar is the set of chains over the basic alphabet derived from the axioms. The most important class of generative grammars for linguistic applications are context-sensitive grammars (cf. Grammar, context-sensitive), in which each rule has the form $\xi_1A\xi_2\to\xi_1\theta\xi_2$, where $\xi_1,\xi_2,\theta$ are chains in the union of the basic and auxiliary alphabets, $A$ is an auxiliary symbol and $\theta$ is non-empty. A context-sensitive grammar allows one to concatenate in a natural way the chains of the labelled systems of constituents generated by its language. This class of grammars is also most important in purely mathematical terms, since the languages generated by context-sensitive grammars are a simple and very important subclass of the primitive recursive sets. Among the context-sensitive grammars, in turn, of special importance, both from the theoretical as from the applied point of view, are context-free grammars (cf. Grammar, context-free), in which the rules have the form $A\to\theta$, where $A$ is an auxiliary symbol. Close to context-free grammars are dominating grammars (cf. Grammar, dominating), which also generate formal languages but the constituent chains in these languages are subordination trees, and categorial grammars (cf. Grammar, categorial), which are characterized by a special method of assignment of information on the syntactical properties of words. The other principal types of formal grammars are transformational grammars (cf. Grammar, transformational); they realise the transformation of a syntactic structure not "attached", in general, to chains; these grammars present greater perspectives for the description of the structure of natural language since they allow one to discuss syntactic and linear relations between words separately, which better expresses linguistic practice.

The theory of formal grammars, alongside its "traditional" linguistic applications, has found application in the theory of programming for the description of programming languages and translators. Particularly widespread used for these aims are the context-free grammars, but grammars of a more general form are also used.