Least Squares, Method of
a method in the theory of errors for estimating unknown quantities from the results of measurements containing random errors. The method is also used to approximate a given function by other (simpler) functions and often proves useful in the analysis of observational data (calculus of observations).
The method of least squares was proposed by K. Gauss (1794–95) and A. Legendre (1805–06). It was first used in analyzing the results of astronomical and geodetic observations. A rigorous mathematical foundation and a determination of the limits of meaningful applicability of the method of least squares were provided by A. A. Markov (the elder) and A. N. Kolmogorov. The method is now one of the most important branches of mathematical statistics and is widely used for statistical inference in different branches of science and engineering.
According to Gauss, the basis of the method of least squares is essentially the assumption that the “loss” resulting from the replacement of an exact (unknown) value of a physical quantity μ by its approximate value X, calculated from the results of observations, is proportional to the square of the error (X − μ)². Under these conditions, it is natural to take as the optimal estimate the value X that is free of systematic error and minimizes the mean value of the “loss.” It is precisely this requirement that constitutes the basis of the method of least squares. In the general case, the problem of finding an optimal estimate X in the sense of the method of least squares is extremely complex, so in actual practice the problem is narrowed: X is chosen as the linear function of the observed values that is free of systematic error and has the minimum mean loss among all such linear functions. If the random observation errors obey a normal distribution and if the quantity μ to be estimated depends linearly on the mean values of the observation results (a case often encountered in applications of the method of least squares), then the solution of this narrower problem is simultaneously the solution of the general problem. The optimal estimate X then also obeys a normal distribution with mean value μ, and consequently the probability density of the random variable X,

p(x; μ) = [1/(σ√(2π))] e^(−(x − μ)²/(2σ²))   (σ² = DX)

considered for x = X as a function of μ, attains its maximum at the point μ = X. (This property expresses the precise meaning of the assertion, prevalent in the theory of errors, that “the estimate X calculated according to the method of least squares is the most likely value of the unknown parameter μ.”)
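This requirement admits a compact formal statement; the following display is a restatement in present-day notation, with coefficients ci introduced here only for that purpose.

```latex
% The least-squares requirement for estimating \mu from observations Y_1, ..., Y_n:
% among linear estimates free of systematic error, choose the one of least mean square loss.
\[
  X = \sum_{i=1}^{n} c_i Y_i ,
  \qquad \mathrm{E}X = \mu \ \ \text{(no systematic error)},
  \qquad \mathrm{E}(X-\mu)^2 \to \min_{c_1,\dots,c_n}.
\]
```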
Case of a single unknown. Suppose n independent observations giving results Y1, Y2, . . . , Yn have been carried out to estimate the value of an unknown quantity μ, that is, Y1 = μ + δ1, Y2 = μ + δ2, . . . , Yn = μ + δn, where δ1, δ2, . . . , δn are the random errors. (By the definition accepted in classical error theory, random errors are independent random variables with zero mathematical expectation, Eδi = 0; if Eδi ≠ 0, the Eδi are said to be systematic errors.) According to the method of least squares, we take as the estimate of μ the value X for which the sum of squares

S(X) = Σ pi(Yi − X)²

is minimized, where pi = k/σi² and σi² = Dδi = Eδi² (the coefficient k > 0 may be selected arbitrarily). The quantity pi is called the weight, and σi the standard deviation, of the measurement with number i. In particular, if the measurements are all equally precise, then σ1 = σ2 = . . . = σn, and in this case we may set p1 = p2 = . . . = pn = 1. But if each Yi is the arithmetic mean of ni equally precise measurements, then we set pi = ni.

Table 1
| n | 2 | 3 | 4 | 5 | 10 | 20 | 30 |
|---|---|---|---|---|----|----|----|
| t | 63.66 | 9.92 | 5.84 | 4.60 | 3.25 | 2.86 | 2.76 |
The sum S(X) will be least if we select as X the weighted mean

Ȳ = (Σ piYi)/P

where P = Σpi. The estimate Ȳ of μ is free of systematic errors, has weight P, and has variance DȲ = k/P. In particular, if the measurements are all equally precise, then Ȳ is the arithmetic mean of the measurement results:

Ȳ = (Y1 + Y2 + . . . + Yn)/n
It can be proved under certain general assumptions that if the number n of observations is sufficiently large, then the distribution of the estimate Ȳ differs little from a normal distribution with mathematical expectation μ and variance k/P. In this case, the absolute error of the approximate equality μ ≈ Ȳ is less than t√(k/P) with probability nearly equal to the value of the integral

I(t) = √(2/π) ∫₀ᵗ e^(−u²/2) du   (1)

For example, I(1.96) = 0.950, I(2.58) = 0.990, and I(3.00) = 0.997.
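As an illustration of these formulas, the following sketch (hypothetical code; the function names and the sample data are assumptions, not part of the article) computes the weighted mean Ȳ, its total weight P, and the large-sample bound t√(k/P).

```python
import math

def weighted_mean(y, p):
    """Least-squares estimate for a single unknown: Ybar = sum(p_i * Y_i) / P, with P = sum(p_i)."""
    P = sum(p)
    return sum(pi * yi for pi, yi in zip(p, y)) / P, P

def normal_bound(k, P, t=1.96):
    """Large-sample error bound t * sqrt(k / P); for t = 1.96 the probability I(t) is about 0.95."""
    return t * math.sqrt(k / P)

# Assumed data: three single measurements and one value that is itself the mean
# of four equally precise measurements (hence weight 4).
y = [10.12, 10.09, 10.11, 10.10]
p = [1, 1, 1, 4]
ybar, P = weighted_mean(y, p)
print(ybar, normal_bound(k=0.0004, P=P))   # k = 0.0004 is an assumed variance per unit weight
```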
If the weights pi of the measurements are given but the factor k remains undetermined prior to the observations, then this factor and the variance DȲ of the estimate Ȳ can be estimated by the formulas

k ≈ S(Ȳ)/(n − 1)

and

DȲ = k/P ≈ s² = S(Ȳ)/[(n − 1)P]

(neither estimate has systematic errors).
In the practically important case when the errors δi obey a normal distribution, it is possible to find the exact probability with which the absolute error of the approximate equality μ ≈ Ȳ proves to be less than ts (t is an arbitrary positive number). This probability, as a function of t, is called the Student distribution function with n − 1 degrees of freedom and is calculated using the formula

In−1(t) = Cn−1 ∫₀ᵗ [1 + v²/(n − 1)]^(−n/2) dv   (2)

where the constant Cn−1 is selected such that In−1(∞) = 1. For large n, equation (2) can be replaced by equation (1). However, the use of equation (1) for small n leads to large errors. For example, according to equation (1), the value I = 0.99 corresponds to t = 2.58; the true values of t, determined for small n as solutions of the corresponding equations In−1(t) = 0.99, are presented in Table 1.
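For small n and equally precise measurements (all pi = 1, so P = n), the computation described above can be sketched as follows; the code and the sample values are illustrative only, and the value t = 4.60 is the entry of Table 1 for n = 5 at the level 0.99.

```python
def student_interval(y, t):
    """Estimate mu by the arithmetic mean and return (Ybar, t*s), where
    s^2 = S(Ybar) / ((n - 1) * n) for equally precise measurements."""
    n = len(y)
    ybar = sum(y) / n
    S = sum((yi - ybar) ** 2 for yi in y)   # S(Ybar), the minimum of the sum of squares
    s = (S / ((n - 1) * n)) ** 0.5          # estimated standard deviation of Ybar
    return ybar, t * s

# Five assumed equally precise measurements; I_4(t) = 0.99 gives t = 4.60 (Table 1, n = 5).
ybar, bound = student_interval([2.10, 2.31, 1.92, 2.24, 2.03], t=4.60)
print(f"mu = {ybar:.3f} +/- {bound:.3f} (probability 0.99)")
```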
EXAMPLE. To determine the mass of a body, ten independent equally precise weighings were carried out, giving the results Yi (in grams) presented in Table 2. In the table, ni is the number of cases in which the weight Yi was observed; n = Σni = 10. Since all the weighings are equally precise, we set pi = ni and select as the estimate of the unknown weight μ the quantity Ȳ = ΣniYi/Σni = 18.431. Setting I9(t) = 0.95 and using tables of the Student distribution with nine degrees of freedom, we find that t = 2.262; we therefore take the quantity

ts = 2.262 × 0.0048 ≈ 0.011

as a bound on the absolute error of the approximate equality μ ≈ 18.431. Thus, 18.420 < μ < 18.442.
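The numbers in this example can be reproduced with a short script over the data of Table 2 (an illustrative sketch; the value t = 2.262 is the tabulated solution of I9(t) = 0.95 quoted above).

```python
# Data of Table 2: observed weights Y_i and the number n_i of times each was observed.
weights = [18.41, 18.42, 18.43, 18.44, 18.45, 18.46]
counts  = [1, 3, 3, 1, 1, 1]

n = sum(counts)                                                  # 10 equally precise weighings
ybar = sum(w * c for w, c in zip(weights, counts)) / n           # weighted mean, 18.431
S = sum(c * (w - ybar) ** 2 for w, c in zip(weights, counts))    # S(Ybar)
s = (S / ((n - 1) * n)) ** 0.5                                   # s, about 0.0048
t = 2.262                                                        # Student distribution, 9 degrees of freedom
print(ybar, t * s)                                               # 18.431 and a bound of about 0.011
```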
Case of several unknowns (linear relations). Suppose the n measurement results Y1, Y2, . . . , Yn are related to the m unknowns x1, x2, . . . , xm (m < n) by the independent linear equations

Yi = ai1x1 + ai2x2 + . . . + aimxm + δi,   i = 1, 2, . . . , n   (3)

where the aij are known coefficients and the δi are independent random errors of the measurements. It is desired to estimate the unknowns xj. (This problem may be considered a generalization of the preceding one, in which μ = x1, m = 1, and ai1 = 1 for i = 1, 2, . . . , n.)

Since Eδi = 0, the mean values of the measurement results yi = EYi are related to the unknowns x1, x2, . . . , xm by the linear equations (linear constraints)

yi = ai1x1 + ai2x2 + . . . + aimxm,   i = 1, 2, . . . , n   (4)

Consequently, the unknown values xj constitute a solution of the system (4), whose equations are assumed to be consistent. The precise values of the measured quantities yi and the random errors δi are usually unknown, and therefore in place of the systems (3) and (4) it is customary to write what are called the conditional equations

Yi = ai1x1 + ai2x2 + . . . + aimxm,   i = 1, 2, . . . , n
According to the method of least squares, we take as the estimates for the unknowns xj the values Xj for which the sum of the squares of the deviations

S = Σ pi(Yi − ai1X1 − ai2X2 − . . . − aimXm)²

will be a minimum (as in the preceding case, pi, the weight of the measurement Yi, is inversely proportional to the variance of the random error δi). As a rule, the conditional equations are not consistent; that is, for any values of the Xj the differences

Δi = Yi − (ai1X1 + ai2X2 + . . . + aimXm)

in general cannot all vanish, and in this case S = ΣpiΔi² also cannot vanish. The method of least squares prescribes that we take as the estimates those values of the Xj that minimize the sum S. In the exceptional cases when the conditional equations are consistent, that is, have a solution, this solution coincides with the estimates obtained by the method of least squares.

The sum of squares S is a quadratic polynomial in the variables Xj. This polynomial attains its minimum at the values X1, X2, . . . , Xm at which all of its first partial derivatives vanish:

∂S/∂Xj = 0,   j = 1, 2, . . . , m
Table 2
| Yi | 18.41 | 18.42 | 18.43 | 18.44 | 18.45 | 18.46 |
|----|-------|-------|-------|-------|-------|-------|
| ni | 1 | 3 | 3 | 1 | 1 | 1 |
It therefore follows that the estimates Xj obtained by the method of least squares satisfy a system of “normal” equations, which in the notation proposed by Gauss has the form

[pa1aj]X1 + [pa2aj]X2 + . . . + [pamaj]Xm = [pYaj],   j = 1, 2, . . . , m   (5)

where

[pajak] = Σ piaijaik,   [pYaj] = Σ piYiaij

The estimates Xj obtained by solving the system of normal equations are free of systematic errors (EXj = xj). The variances DXj of the Xj are equal to kdjj/d, where d is the determinant of the system (5) and djj is the minor corresponding to the diagonal element [pajaj]. (In other words, djj/d is the weight of the estimate Xj.) If the proportionality factor k (k is called the variance per unit weight) is unknown, then to estimate it, as well as the variances DXj, we use the formulas

k ≈ S/(n − m),   DXj ≈ sj² = Sdjj/[(n − m)d]
(S is the minimum value of the initial sum of squares.) Under certain general assumptions it can be proved that if the number n of observations is sufficiently large, then the absolute error of the approximate equality xj ≈ Xj is less than tsj with probability close to the value of the integral (1). If the random errors of the observations δi obey a normal distribution, then all the ratios (Xj − xj)/sj are distributed according to the Student distribution with n − m degrees of freedom. A precise estimation of the absolute error of the approximate equality xj ≈ Xj is carried out here using the integral (2), just as in the case of a single unknown. In addition, the minimum value of the sum S is independent, in the probabilistic sense, of X1, X2, . . . , Xm, so that the approximate values sj² of the variances DXj are independent of the estimates Xj themselves.
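In matrix form, the computation described in this subsection can be sketched as follows (an illustrative implementation assuming numpy; the function name and argument conventions are not taken from the article). Here A is the n × m matrix of the coefficients aij, p is the vector of weights, and the returned quantities are the estimates Xj, their estimated variances sj², and the variance per unit weight.

```python
import numpy as np

def least_squares_estimates(A, y, p):
    """Solve the normal equations [p a_j a_k] X = [p Y a_j] and estimate
    the variance per unit weight k ~ S / (n - m) and the variances of the X_j."""
    A = np.asarray(A, dtype=float)
    y = np.asarray(y, dtype=float)
    W = np.diag(p)                            # diagonal matrix of the weights p_i
    N = A.T @ W @ A                           # matrix of the normal equations, entries [p a_j a_k]
    b = A.T @ W @ y                           # right-hand sides [p Y a_j]
    X = np.linalg.solve(N, b)                 # least-squares estimates X_1, ..., X_m
    n, m = A.shape
    r = y - A @ X                             # deviations Delta_i
    S = float(r @ (W @ r))                    # minimum value of the sum of squares
    k = S / (n - m)                           # variance per unit weight
    var_X = k * np.diag(np.linalg.inv(N))     # s_j^2 = k * d_jj / d
    return X, var_X, k
```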
One of the most typical applications of the method of least squares is the “smoothing” of observation results Yi, in which aij = aj(ti) is taken in equations (3), where the aj(t) are known functions of some parameter t (if t is time, then t1, t2, . . . are the moments of time at which the observations are performed). The case of polynomial interpolation, when the aj(t) are polynomials [for example, a1(t) = 1, a2(t) = t, a3(t) = t², . . . ], is often encountered in applications. If t2 − t1 = t3 − t2 = . . . = tn − tn−1 and the observations are equally precise, tables of orthogonal polynomials, available in many handbooks on modern computational mathematics, can be used to calculate the estimates Xj. A second important case for applications is harmonic interpolation, where trigonometric functions are taken as the aj(t) [for example, aj(t) = cos (j − 1)t, j = 1, 2, . . . , m].
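For instance, polynomial smoothing of equally precise observations at equally spaced moments of time reduces to an ordinary least-squares fit; a minimal self-contained sketch is given below, with illustrative data and the standard routine numpy.linalg.lstsq, which minimizes the same sum of squared deviations.

```python
import numpy as np

# Illustrative observations Y_i at equally spaced moments t_i, assumed equally precise.
t = np.arange(1.0, 9.0)                                   # t_1, ..., t_8
Y = 0.5 + 0.3 * t + np.random.normal(0.0, 0.1, t.size)    # assumed linear trend plus random errors

# Polynomial interpolation with a_1(t) = 1 and a_2(t) = t (a straight line).
A = np.column_stack([np.ones_like(t), t])
X, *_ = np.linalg.lstsq(A, Y, rcond=None)                 # minimizes the sum of squared deviations
S = float(np.sum((Y - A @ X) ** 2))                       # minimum of the sum of squares
print(X, S / (len(t) - 2))                                # estimates X_1, X_2 and k ~ S/(n - m)
```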
EXAMPLE. To estimate, by means of the method of least squares, the accuracy of a method of chemical analysis, the concentration of CaO was determined in ten reference samples of previously known composition. The results of the equally precise observations are given in Table 3, where i is the number of the experiment, ti is the true CaO concentration, Ti is the CaO concentration determined by the chemical analysis, and Yi = Ti − ti is the error of the chemical analysis.
If the results of the chemical analysis are free of systematic errors, then EYi = 0. But if such errors exist, then they may be represented to a first approximation in the form EYi = α + βti (α is called the systematic error and βti the method error) or, what is the same thing,

EYi = (α + βt̄) + β(ti − t̄)

where

t̄ = (t1 + t2 + . . . + t10)/10
To find estimates of α and β, it suffices to estimate the coefficients x1 = α + βt̄ and x2 = β. The conditional equations in this case have the form

Yi = x1 + x2(ti − t̄),   i = 1, 2, . . . , 10

so that ai1 = 1 and ai2 = ti − t̄; all the pi are equal to 1, since the observations are assumed to be equally precise. Since [a1a2] = [a2a1] = Σ(ti − t̄) = 0, the system of normal equations takes the particularly simple form

[a1a1]X1 = [Ya1],   [a2a2]X2 = [Ya2]

where

[a1a1] = 10,   [a2a2] = Σ(ti − t̄)² = 1569,   [Ya1] = ΣYi = −3.5,   [Ya2] = ΣYi(ti − t̄) = −8.225
The variances of the components of the solution of this system are given by

DX1 = k/[a1a1] = k/10

and

DX2 = k/[a2a2] = k/1569

where k is the unknown variance per unit weight (in the given case, k is the variance of any of the Yi). Since the components of the solution here take the values X1 = −0.35 and X2 = −0.00524, we have

k ≈ S/(n − m) = S/8,   s1² = k/10,   s2² = k/1569
If the random observation errors obey a normal distribution, then the ratios |Xj − xj|/sj (j = 1, 2) are distributed according to the Student distribution. In particular, if the observation results are free of systematic errors, then x1 = x2 = 0, and therefore the ratios |X1|/s1 and |X2|/s2 must obey the Student distribution. Using tables for the Student distribution with n − m = 8 degrees of freedom, we can verify that if in fact x1 = x2 = 0, then each of these ratios will be at most 5.04 with probability 0.999 and at most 2.31 with probability 0.95. In this case, |X1|/s1 = 5.38 > 5.04, so that the hypothesis that systematic errors are absent should be rejected. At the same time, the hypothesis that method errors are absent (x2 = 0) is consistent with the observed results, since |X2|/s2 = 1.004 < 2.31. We may therefore conclude that the approximate formula t = T + 0.35 should be used to determine t from the observed value T.
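In sketch form, the test carried out in this example can be organized as a small function (illustrative code, not the article's original computation); it returns the two ratios |X1|/s1 and |X2|/s2, which are then compared with the tabulated Student bounds, such as 2.31 and 5.04 for 8 degrees of freedom.

```python
def systematic_and_method_error_ratios(t, Y):
    """Given true values t_i and analysis errors Y_i = T_i - t_i, estimate
    X1 ~ alpha + beta*tbar and X2 ~ beta and return |X1|/s1 and |X2|/s2."""
    n, m = len(Y), 2
    tbar = sum(t) / n
    d = [ti - tbar for ti in t]                    # t_i - tbar
    q = sum(di * di for di in d)                   # [a2 a2] = sum of (t_i - tbar)^2

    # The normal equations are diagonal, since [a1 a2] = sum of (t_i - tbar) = 0.
    X1 = sum(Y) / n                                # [Y a1] / [a1 a1]
    X2 = sum(yi * di for yi, di in zip(Y, d)) / q  # [Y a2] / [a2 a2]

    S = sum((yi - X1 - X2 * di) ** 2 for yi, di in zip(Y, d))   # minimum sum of squares
    k = S / (n - m)                                # variance per unit weight
    s1, s2 = (k / n) ** 0.5, (k / q) ** 0.5        # standard deviations of X1 and X2
    return abs(X1) / s1, abs(X2) / s2
```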
In many cases of practical importance, and particularly in estimating complicated nonlinear relations, the number of unknown parameters may be quite large and the method of least squares will turn out to be effective only when modern computing technology is used.
REFERENCES
Markov, A. A. Ischislenie veroiatnostei, 4th ed. Moscow, 1924.
Kolmogorov, A. N. “K obosnovaniiu metoda naimen’shikh kvadratov.” Uspekhi matematicheskikh nauk, 1946, vol. 1, issue 1.
Linnik, Iu. V. Metod naimen’shikh kvadratov i osnovy matematiko-statisticheskoi teorii obrabotki nabliudenii, 2nd ed. Moscow, 1962.
Helmert, F. R. Die Ausgleichungsrechnung nach der Methode der kleinsten Quadrate. . ., 2nd ed. Leipzig, 1907.
Table 3
| i | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|----|
| ti | 4 | 8 | 12.5 | 16 | 20 | 25 | 31 | 36 | 40 | 40 |
| Yi | −0.3 | −0.2 | −0.4 | −0.4 | −0.2 | −0.5 | −0.1 | −0.5 | −0.6 | −0.5 |