# Coding and decoding

The process of representing information in a definite standard form and the inverse process of recovering the information in terms of such a representation of it. In the mathematical literature an encoding (coding) is a mapping of an arbitrary set into the set of finite sequences (words) over some alphabet , while the inverse mapping is called a decoding. Examples of an encoding are: the representation of the natural numbers in the -ary number systems, where each number is made to correspond to the word over the alphabet for which and ; the conversion of English texts by means of a telegraphic code into sequences consisting of the transmission of pulses and pauses of various lengths; and the mapping applied in writing down the digits of a (Russian) postal index (see Fig.a and Fig.b). In this latter case, there corresponds to each decimal digit a word over the alphabet of length 9 in which numbers referring to a line that is to be used are marked by the symbol . (For example, the word corresponds to the digit .)

Figure: c022890a

Figure: c022890b

The investigation of various properties of coding and decoding and the construction of encodings that are effective in some sense and possess specified properties constitute the subject matter of coding theory. Usually, a criterion for the effectiveness of an encoding is in some way or other tied up with the minimization of the length of the code words (images of the elements of the set ), and the specified properties of the coding relate to the guarantee of a given level of noise immunity, to be understood in some sense or other. In particular, noise immunity means the possibility of unique decoding in the absence of, or at a tolerable level of, distortion of the code words. Besides noise immunity, a number of additional requirements may be imposed on the code. For example, in the choice of an encoding for the digits of a postal code, an agreement with the ordinary way of writing digits is necessary. As additional requirements, restrictions are often applied relating to the allowed complexity of the scheme effecting the coding and decoding. The problems in coding theory were in the main created under the influence of the theory of information transmission as developed by C. Shannon [1]. A source of new problems in coding theory is provided by the creation and perfection of automated systems of gathering, storage, transmission, and processing of information. The methods of solving problems in coding theory are mainly combinatorial, probability-theoretic and algebraic.

An arbitrary coding of a set (alphabet) by words over an alphabet can be extended to the set of all words over (messages) as follows:

where , . Such a mapping is called a letter-by-letter encoding of the messages. A more general class of encodings of messages is found by automated encoding realized by initial asynchronous automata (cf. Automaton), which deliver at each instant of time some (possibly empty) word over the alphabet . The importance of this generalization consists in the fact that the automaton realizes, in different states, different encodings of the letters of the alphabet of the messages. A letter-by-letter encoding is an automaton encoding realized by a single-state automaton. One of the branches of coding theory is the study of the general properties of a coding and the construction of algorithms for recognizing these properties (see Coding, alphabetical). In particular, necessary and sufficient conditions have been found for automated and letter-by-letter encodings in order that 1) the decoding be single-valued; 2) there exists a decoding automaton, that is, an automaton effecting the decoding with a certain bounded delay; or 3) there exists a self-adjusting decoding automaton (making it possible to eliminate in a limited amount of time the effect of a mistake in the input sequence or in the functioning of the automaton itself).

The majority of problems in coding theory reduces to the study of finite or countable sets of words over an alphabet . Such sets are called codes. In particular, there corresponds to each single-valued encoding (and letter-by-letter encoding ) a code . One of the basic assertions in coding theory is that the condition of injectivity of a letter-by-letter encoding imposes the following restrictions on the lengths of the code words :

 (1)

The converse statement also holds: If is a set of natural numbers satisfying (1), then there exists a one-to-one letter-by-letter encoding such that the word has length . Furthermore, if the numbers are increasingly ordered, then one can take for the first symbols after the decimal point of the expansion of in an -ary fraction (Shannon's method).

The most definitive results in coding theory relate to the construction of effective one-to-one encodings. The constructions described here are used in practice for the compression of information and access of information from the memory. The concept of effectiveness of an encoding depends on the choice of the cost criterion. In the definition of the cost of a one-to-one letter-by-letter encoding it is assumed that there corresponds to each number a positive number and that . The following versions of a definition of the cost have been investigated:

1) ,

2) , ,

3) , where it is supposed that in the first two cases the are the probabilities with which some Bernoullian source generates the corresponding letters of the alphabet (), while in the third case, the are desirable lengths of code words. In the first definition, the cost is equal to the average length of a code word; in the second definition, as the parameter increases, the longer code words have a greater influence on the cost ( as and as ); in the third definition, the cost is equal to the maximum excess of the length of the code word over the desired length . The problem of constructing a one-to-one letter-by-letter encoding minimizing the cost is equivalent to that of minimizing the function on the sets of natural numbers satisfying the condition (1). The solution of this problem is known for each of the above definitions of a cost.

Suppose that the minimum of the quantity on the sets of arbitrary (not necessarily natural) numbers satisfying condition (1) is equal to and is attained on the set . The non-negative quantity is called the redundancy, and the quantity is called the relative redundancy of the encoding . The redundancy of a one-to-one encoding constructed by the method of Shannon for lengths , , satisfies the inequality . For the first, most usual, definition of the cost as the average number of code symbols necessary for one letter of the message generated by the source, the quantity is equal to the Shannon entropy

of the source calculated in the base , while . The bound of the redundancy, , can be improved by using so-called coding blocks of length , in which messages of length (rather than separate letters) are encoded by the Shannon method. The redundancy of such an encoding does not exceed . This same method is used for the effective encoding of related sources. In connection with the fact that the determination of the lengths in coding by the Shannon method is based on a knowledge of the statistics of the source, methods have been developed, for certain classes of sources, for the construction of a universal encoding that guarantees a definite upper bound on the redundancy for any source in this class. In particular, a coding by blocks of length has been constructed with a redundancy that is, for any Bernoullian source, asymptotically at most (for fixed as ), where this asymptotic limit cannot be improved upon.

Along with the problems of the effective compression of information, problems of the estimation of the redundancy of concrete types of information are also considered. For example, the relative redundancy of certain natural languages (in particular, English and Russian) has been estimated under the hypothesis that their texts are generated by Markov sources with a large number of states.

In the investigation of problems on the construction of effective noise-immune encodings, one usually considers encodings to which correspond codes belonging to the set of words of length over the alphabet . It is understood here that the letters of the alphabet of the messages are equi-probable. The effectiveness of such an encoding is estimated by the redundancy or by the transmission rate . In the definition of the noise-immunity of an encoding, the concept of an error is formalized and a model for the generation of errors is brought into consideration. An error of substitution type (or simply an error) is a transformation of words consisting of the substitution of a symbol in a word by another symbol in the alphabet . For example, the production of a superfluous line in writing out the figure of a Russian postal index leads to the replacement in the coded word of the symbol 0 by the symbol 1, while the omission of a necessary line leads to the replacement of 1 by 0. The possibility of detecting and correcting errors is based on the fact that, for an encoding with non-zero redundancy, the decoding can be extended in arbitrary fashion onto the words in which are not code words. In particular, if the set is partitioned into disjoint sets such that and the decoding is extended so that , then in decoding, all errors that translate a code word in , , will be corrected. Similar possibilities arise also in the case of other types of error, such as the erasure of a symbol (substitution by a symbol of a different alphabet), the alteration of the numerical value of a code word by , , (an arithmetic error), the deletion or insertion of a symbol, etc.

In the theory of information transmission (see Information, transmission of), probabilistic models of error generation are considered; these are called channels. The simplest memoryless channel is defined by the probabilities of substitution of a symbol by a symbol . One defines for this channel the quantity (channel capacity)

where the maximum is taken over all sets such that and . The effectiveness of an encoding is characterized by the transmission rate , while the noise-immunity is characterized by the mean probability of an error in the decoding (under the optimal partitioning of into subsets ). The main result in the theory of information transmission (Shannon's theorem) is that the channel capacity is the least upper bound of the numbers such that for any and for all sufficiently large there exists an encoding for which and .

Another model for the generation of errors (see Error-correcting code; Code with correction of arithmetical errors; Code with correction of deletions and insertions) is characterized by the property that in each word of length there occur not more than a prescribed number of errors. Let be the set of words obtainable from as a result of or fewer errors. If for the code

the sets , , are pairwise disjoint, then in a decoding such that , all errors admitting of a model of the above type for the generation of errors will be corrected, and such a code is called a -error-correcting code. For many types of errors (for example, substitutions, arithmetic errors, insertions and deletions) the function equal to the minimal number of errors of given type taking the word into the word is a metric, and is the metric ball of radius . Therefore the problem of constructing the most effective code (that is, the code with maximum number of words ) in with correction of errors is equivalent to that of the densest packing of the metric space by balls of radius . The code for the figures of a Russian postal index is not a one-error-correcting code, because and , while all other distances between code words are at least 3.

The problem of studying the minimal redundancy of a code in for correction of errors of substitution type is divided into two basic cases. In the first case, when is fixed as , the following asymptotic relation holds:

 (2)

where the "power" bound is attained, based on a count of the number of words of length in a ball of radius . For the asymptotic behaviour of when except for , and also when for many other types of errors (for example, arithmetic errors, deletions and insertions), is not known (1987). In the second case, when , where is some fixed number, and , the "power" bound

where , is substantially improved. It is conjectured that the upper bound

 (3)

obtained by the method of random sampling of codes, is asymptotically exact for , that is, . The proof or refutation of this conjecture is one of the central problems in coding theory.

The majority of the constructions of noise-immune codes are effective when the length of the code is sufficiently large. In this connection, questions relating to the complexity of systems that realize the coding and decoding (encoders and decoders) acquire special significance. Restrictions on the admissible type of decoder or on its complexity may lead to an increase in the redundancy necessary to guarantee a prescribed noise-immunity. For example, the minimal redundancy of a code in for which there is a decoder consisting of a shift register and a single majorizing element and correcting one error has order (compare with (2)). As mathematical models of an encoder and a decoder, diagrams of functional elements are usually considered, and by the complexity is meant the number of elements in the diagram. For the known classes of error-correcting codes, investigations have been carried out of the possible encoding and decoding algorithms and upper bounds for the complexity of the encoder and decoder have been obtained. Also, certain relationships have obtained among the transmission rate of the encoding, the noise-immunity of the encoding and the complexity of the decoder (see [5]).

Yet another line of the investigation in coding theory is connected with the fact that many results (e.g. Shannon's theorem and the upper bound (3)) are not "constructive" , but are theorems on the existence of infinite sequences of codes . In this connection, efforts have been made to sharpen these results in order to prove them in a class of sequences of codes for which there is a Turing machine that recognizes the membership of an arbitrary word of length in the set in a time having slow order of growth with respect to (e.g. ).

Certain new constructions and methods for obtaining bounds, which have been developed within coding theory, have led to a substantial advance in questions that on the face of it seem very remote from the traditional problems in coding theory. One should mention here the use of a maximal code for the correction of one error in an asymptotically-optimal method for realizing functions of the algebra of logic by contact schemes (cf. Contact scheme); the fundamental improvement of the upper bound for the density of packing the -dimensional Euclidean space with identical balls; and the use of inequality (1) in estimating the complexity of realizing a class of functions of the algebra of logic by formulas. The ideas and results of coding theory find further developments in the synthesis of self-correcting schemes and reliable schemes of unreliable elements.

#### References

 [1] C. Shannon, "A mathematical theory of communication" Bell Systems Techn. J. , 27 (1948) pp. 379–423; 623–656 [2] E. Berlekamp, "Algebraic coding theory" , McGraw-Hill (1968) [3] E. Weldon jr., "Error-correcting codes" , M.I.T. (1972) [4] , Discrete mathematics and mathematical problems in cybernetics , 1 , Moscow (1974) pp. Sect. 5 (In Russian) [5] L.A. Bassalygo, V.V. Zyablov, M.S. Pinsker, "Problems of complexity in the theory of correcting codes" Problems of Information Transmission , 13 : 3 pp. 166–175 Problemy Peredachi Informatsii , 13 : 3 (1977) pp. 5–17 [6] V.M. Sidel'nikov, "New bounds for densest packings of spheres in -dimensional Euclidean space" Math. USSR-Sb. , 24 (1974) pp. 147–157 Mat. Sb. , 95 : 1 (1974) pp. 148–158