Benutzer:JonskiC/Kleinste-Quadrate-Schätzung

Conic fitting a set of points using least-squares approximation

The method of least squares is a standard approach in regression analysis for approximately solving overdetermined systems, i.e., sets of equations in which there are more equations than unknowns. "Least squares" means that the overall solution minimizes the sum of the squares of the residuals made in the results of every single equation.

The most important application is in data fitting. The best fit in the least-squares sense minimizes the sum of squared residuals, a residual being the difference between an observed value and the fitted value provided by a model. When the problem has substantial uncertainties in the independent variable (the x variable), simple regression and least-squares methods have problems; in such cases, the methodology required for fitting errors-in-variables models may be considered instead of that for least squares.

Least-squares problems fall into two categories: linear or ordinary least squares and non-linear least squares, depending on whether or not the residuals are linear in all unknowns. The linear least-squares problem occurs in statistical regression analysis; it has a closed-form solution. The non-linear problem is usually solved by iterative refinement; at each iteration the system is approximated by a linear one, and thus the core calculation is similar in both cases.
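The iterative refinement described above can be illustrated with a minimal Gauss–Newton sketch in Python. The one-parameter model f(x, a) = exp(a·x), the noise-free synthetic data, the starting value and the function name `gauss_newton_exp` are assumptions chosen for illustration and are not taken from the text. In each step the model is linearized around the current parameter value and the resulting linear least-squares problem is solved in closed form.

```python
import math

def gauss_newton_exp(xs, ys, a0, iterations=20):
    """Fit f(x, a) = exp(a * x) by Gauss-Newton, starting from a0."""
    a = a0
    for _ in range(iterations):
        # Residuals of the current non-linear model.
        r = [y - math.exp(a * x) for x, y in zip(xs, ys)]
        # Derivative of the model with respect to a (the Jacobian column).
        J = [x * math.exp(a * x) for x in xs]
        # Linearized problem: minimize sum (r_i - J_i * delta)^2 over delta.
        delta = sum(j * ri for j, ri in zip(J, r)) / sum(j * j for j in J)
        a += delta
    return a

xs = [0.0, 0.5, 1.0, 1.5, 2.0]
ys = [math.exp(0.5 * x) for x in xs]  # noise-free data with true a = 0.5
a_hat = gauss_newton_exp(xs, ys, a0=0.3)
print(a_hat)
```

With noisy data or a poor starting value the plain iteration may fail to converge; practical implementations usually add step-size control.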

Polynomial least squares describes the variance in a prediction of the dependent variable as a function of the independent variable and the deviations from the fitted curve.

When the observations come from an exponential family and mild conditions are satisfied, least-squares estimates and maximum-likelihood estimates are identical.^[1] The method of least squares can also be derived as a method of moments estimator.

The following discussion is mostly presented in terms of linear functions but the use of least-squares is valid and practical for more general families of functions. Also, by iteratively applying local quadratic approximation to the likelihood (through the Fisher information), the least-squares method may be used to fit a generalized linear model.

For the topic of approximating a function by a sum of others using an objective function based on squared distances, see least squares (function approximation).

The least-squares method is usually credited to Carl Friedrich Gauss (1795),^[2] but it was first published by Adrien-Marie Legendre.^[3]

History

Context

The method of least squares grew out of the fields of astronomy and geodesy, as scientists and mathematicians sought to provide solutions to the challenges of navigating the Earth's oceans during the Age of Exploration. The accurate description of the behavior of celestial bodies was the key to enabling ships to sail in open seas, where sailors could no longer rely on land sightings for navigation.

The method was the culmination of several advances that took place during the course of the eighteenth century:^[4]

The combination of different observations as being the best estimate of the true value; errors decrease with aggregation rather than increase, perhaps first expressed by Roger Cotes in 1722.
The combination of different observations taken under the same conditions contrary to simply trying one's best to observe and record a single observation accurately. The approach was known as the method of averages. This approach was notably used by Tobias Mayer while studying the librations of the moon in 1750, and by Pierre-Simon Laplace in his work in explaining the differences in motion of Jupiter and Saturn in 1788.
The combination of different observations taken under different conditions. The method came to be known as the method of least absolute deviation. It was notably performed by Roger Joseph Boscovich in his work on the shape of the earth in 1757 and by Pierre-Simon Laplace for the same problem in 1799.
The development of a criterion that can be evaluated to determine when the solution with the minimum error has been achieved. Laplace tried to specify a mathematical form of the probability density for the errors and define a method of estimation that minimizes the error of estimation. For this purpose, Laplace used a symmetric two-sided exponential distribution we now call Laplace distribution to model the error distribution, and used the sum of absolute deviation as error of estimation. He felt these to be the simplest assumptions he could make, and he had hoped to obtain the arithmetic mean as the best estimate. Instead, his estimator was the posterior median.

The method

The first clear and concise exposition of the method of least squares was published by Legendre in 1805.^[5] The technique is described as an algebraic procedure for fitting linear equations to data, and Legendre demonstrates the new method by analyzing the same data as Laplace for the shape of the earth. The value of Legendre's method of least squares was immediately recognized by leading astronomers and geodesists of the time.

In 1809 Carl Friedrich Gauss published his method of calculating the orbits of celestial bodies. In that work he claimed to have been in possession of the method of least squares since 1795. This naturally led to a priority dispute with Legendre. However, to Gauss's credit, he went beyond Legendre and succeeded in connecting the method of least squares with the principles of probability and with the normal distribution. He had managed to complete Laplace's program of specifying a mathematical form of the probability density for the observations, depending on a finite number of unknown parameters, and of defining a method of estimation that minimizes the error of estimation. Gauss showed that the arithmetic mean is indeed the best estimate of the location parameter by changing both the probability density and the method of estimation. He then turned the problem around by asking what form the density should have and what method of estimation should be used to get the arithmetic mean as the estimate of the location parameter. In this attempt, he invented the normal distribution.

An early demonstration of the strength of Gauss's method came when it was used to predict the future location of the newly discovered asteroid Ceres. On 1 January 1801, the Italian astronomer Giuseppe Piazzi discovered Ceres and was able to track its path for 40 days before it was lost in the glare of the sun. Based on these data, astronomers desired to determine the location of Ceres after it emerged from behind the sun without solving Kepler's complicated nonlinear equations of planetary motion. The only predictions that successfully allowed Hungarian astronomer Franz Xaver von Zach to relocate Ceres were those performed by the 24-year-old Gauss using least-squares analysis.

In 1810, after reading Gauss's work and proving the central limit theorem, Laplace used it to give a large-sample justification for the method of least squares and the normal distribution. In 1822, Gauss was able to state that the least-squares approach to regression analysis is optimal in the sense that in a linear model where the errors have a mean of zero, are uncorrelated, and have equal variances, the best linear unbiased estimator of the coefficients is the least-squares estimator. This result is known as the Gauss–Markov theorem.

The idea of least-squares analysis was also independently formulated by the American Robert Adrain in 1808. In the next two centuries workers in the theory of errors and in statistics found many different ways of implementing least squares.^[6]

Problem statement

The objective consists of adjusting the parameters of a model function to best fit a data set. A simple data set consists of n points (data pairs) $(x_{i},y_{i})$, i = 1, ..., n, where $x_{i}$ is an independent variable and $y_{i}$ is a dependent variable whose value is found by observation. The model function has the form $f(x,{\boldsymbol {\beta }})$, where m adjustable parameters are held in the vector ${\boldsymbol {\beta }}$. The goal is to find the parameter values for which the model "best" fits the data. The least-squares method finds its optimum when the sum, Q, of squared residuals

Q=\sum _{i=1}^{n}{r_{i}}^{2}

is a minimum. A residual is defined as the difference between the actual value of the dependent variable and the value predicted by the model. Each data point has one residual:

r_{i}=y_{i}-f(x_{i},{\boldsymbol {\beta }}).

In a linear model that contains an intercept, both the sum and the mean of the least-squares residuals are equal to zero.

An example of a model is that of the straight line in two dimensions. Denoting the y-intercept as $\beta _{0}$ and the slope as $\beta _{1}$ , the model function is given by $f(x_{i},{\boldsymbol {\beta }})=\beta _{0}+\beta _{1}x_{i}$ . See linear least squares for a fully worked out example of this model.
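For the straight-line model, the residuals and the objective Q can be computed directly. The following Python sketch is only an illustration: the data values and the function names `residuals`, `sum_of_squares` and `fit_line` are made up, and the closed-form slope and intercept formulas used in `fit_line` are the standard ones for a single regressor, which this section does not derive.

```python
def residuals(xs, ys, beta0, beta1):
    """Residuals r_i = y_i - f(x_i, beta) for the line f(x, beta) = beta0 + beta1 * x."""
    return [y - (beta0 + beta1 * x) for x, y in zip(xs, ys)]

def sum_of_squares(xs, ys, beta0, beta1):
    """Objective Q: the sum of squared residuals."""
    return sum(r * r for r in residuals(xs, ys, beta0, beta1))

def fit_line(xs, ys):
    """Intercept and slope that minimize Q (closed form for one regressor)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    beta1 = sxy / sxx
    beta0 = y_bar - beta1 * x_bar
    return beta0, beta1

# Hypothetical data set with n = 4 points.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 4.0, 7.0]

b0, b1 = fit_line(xs, ys)
q_min = sum_of_squares(xs, ys, b0, b1)
print(b0, b1, q_min)
```

Any other choice of intercept and slope gives a larger value of Q on the same data, which is what "best fit in the least-squares sense" means.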

A data point may consist of more than one independent variable. For example, when fitting a plane to a set of height measurements, the plane is a function of two independent variables, x and z, say. In the most general case there may be one or more independent variables and one or more dependent variables at each data point.

Simple linear regression

Geometric properties

Three properties can be derived from the formulas:

The estimated regression line always passes through the centroid ("center of gravity") of the data $({\overline {x}},{\overline {y}})$, because ${\hat {y}}_{i}=b_{0}+b_{1}x_{i}\Rightarrow {\frac {1}{n}}\sum _{i=1}^{n}{\hat {y}}_{i}={\frac {1}{n}}\sum _{i=1}^{n}(b_{0}+b_{1}x_{i})\Leftrightarrow {\overline {\hat {y}}}=b_{0}+b_{1}{\overline {x}}\Leftrightarrow {\overline {y}}=b_{0}+b_{1}{\overline {x}}$. The last expression follows from the property ${\overline {\hat {y}}}={\overline {y}}$.
The sum of the estimated residuals is zero if the model contains an intercept:

\sum _{i=1}^{n}{\hat {u}}_{i}\;=\;0

because

0={\overline {y}}-b_{0}-b_{1}{\overline {x}}\Leftrightarrow \sum \limits _{i=1}^{n}y_{i}-nb_{0}-b_{1}\sum \limits _{i=1}^{n}x_{i}\;=\;0\Leftrightarrow \sum \limits _{i=1}^{n}{\Bigl (}y_{i}-\underbrace {(b_{0}+b_{1}x_{i})} _{={\hat {y}}_{i}}{\Bigr )}=0

This is equivalent to the property that the mean of the estimated residuals is zero, which in terms of the true disturbances $u_{i}$ and the true parameters reads:

0={\overline {u}}-(b_{0}-\beta _{0})-(b_{1}-\beta _{1}){\overline {x}}

The residuals and the x-values are uncorrelated (regardless of whether or not an intercept was included):

\sum _{i=1}^{n}x_{i}{\hat {u}}_{i}\;=\;0
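The three properties can be checked numerically. The sketch below uses made-up data and the standard closed-form estimates for a line with intercept; the variable names are illustrative only. Up to floating-point rounding, all three quantities come out as zero.

```python
# Hypothetical data set.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(xs)

# Least-squares estimates for the line with intercept.
x_bar = sum(xs) / n
y_bar = sum(ys) / n
b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
      / sum((x - x_bar) ** 2 for x in xs))
b0 = y_bar - b1 * x_bar

# Estimated residuals u_hat_i = y_i - (b0 + b1 * x_i).
u_hat = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]

# Property 1: the fitted line passes through the centroid (x_bar, y_bar).
centroid_gap = (b0 + b1 * x_bar) - y_bar
# Property 2: the estimated residuals sum to zero.
residual_sum = sum(u_hat)
# Property 3: the residuals are orthogonal to the x-values.
cross_sum = sum(x * u for x, u in zip(xs, u_hat))

print(centroid_gap, residual_sum, cross_sum)
```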

Multiple linear regression

Estimation of the regression coefficients by the method of least squares

In the multiple linear regression model, too, the method of least squares (LS method) is used; that is, ${\boldsymbol {\beta }}$ is to be chosen such that the Euclidean norm $\|\mathbf {y} -\mathbf {X} {\boldsymbol {\beta }}\|_{2}$ becomes minimal. In the following, the approach taken is to minimize the matrix form of the residual sum of squares. For this it is assumed that the $T\times K$ matrix $\mathbf {X}$ has rank $K$. Then $\mathbf {X} ^{\top }\mathbf {X}$ is invertible and one obtains the minimization problem:^[7]

{\underset {\boldsymbol {\beta }}{\rm {arg\,min}}}\,S({\boldsymbol {\beta }})={\underset {\boldsymbol {\beta }}{\rm {arg\,min}}}\,(\mathbf {y} -\mathbf {X} {\boldsymbol {\beta }})^{\top }(\mathbf {y} -\mathbf {X} {\boldsymbol {\beta }})={\underset {\boldsymbol {\beta }}{\rm {arg\,min}}}\,(\mathbf {y} ^{\top }\mathbf {y} -2{\boldsymbol {\beta }}^{\top }\mathbf {X} ^{\top }\mathbf {y} +{\boldsymbol {\beta }}^{\top }\mathbf {X} ^{\top }\mathbf {X} {\boldsymbol {\beta }})

First-order condition (setting the gradient to zero):

{\frac {\partial S({\boldsymbol {\beta }})}{\partial {\boldsymbol {\beta }}}}={\begin{pmatrix}{\frac {\partial S({\boldsymbol {\beta }})}{\partial \beta _{1}}}\\{\frac {\partial S({\boldsymbol {\beta }})}{\partial \beta _{2}}}\\\vdots \\{\frac {\partial S({\boldsymbol {\beta }})}{\partial \beta _{K}}}\end{pmatrix}}{\overset {\mathrm {!} }{=}}\;0

The first-order partial derivatives are as follows, where $\mathbf {x} _{(k)}$ denotes the $k$-th column of $\mathbf {X}$:

{\begin{aligned}{\frac {\partial S({\boldsymbol {\beta }})}{\partial \beta _{1}}}&={\frac {\partial (\mathbf {y} ^{\top }\mathbf {y} )}{\partial \beta _{1}}}-{\frac {\partial (2{\boldsymbol {\beta }}^{\top }\mathbf {X} ^{\top }\mathbf {y} )}{\partial \beta _{1}}}+{\frac {\partial ({\boldsymbol {\beta }}^{\top }\mathbf {X} ^{\top }\mathbf {X} {\boldsymbol {\beta }})}{\partial \beta _{1}}}=-2\mathbf {x} _{(1)}^{\top }\mathbf {y} +2\mathbf {x} _{(1)}^{\top }\mathbf {X} {\boldsymbol {\beta }}\\{\frac {\partial S({\boldsymbol {\beta }})}{\partial \beta _{2}}}&={\frac {\partial (\mathbf {y} ^{\top }\mathbf {y} )}{\partial \beta _{2}}}-{\frac {\partial (2{\boldsymbol {\beta }}^{\top }\mathbf {X} ^{\top }\mathbf {y} )}{\partial \beta _{2}}}+{\frac {\partial ({\boldsymbol {\beta }}^{\top }\mathbf {X} ^{\top }\mathbf {X} {\boldsymbol {\beta }})}{\partial \beta _{2}}}=-2\mathbf {x} _{(2)}^{\top }\mathbf {y} +2\mathbf {x} _{(2)}^{\top }\mathbf {X} {\boldsymbol {\beta }}\\\vdots \\{\frac {\partial S({\boldsymbol {\beta }})}{\partial \beta _{K}}}&={\frac {\partial (\mathbf {y} ^{\top }\mathbf {y} )}{\partial \beta _{K}}}-{\frac {\partial (2{\boldsymbol {\beta }}^{\top }\mathbf {X} ^{\top }\mathbf {y} )}{\partial \beta _{K}}}+{\frac {\partial ({\boldsymbol {\beta }}^{\top }\mathbf {X} ^{\top }\mathbf {X} {\boldsymbol {\beta }})}{\partial \beta _{K}}}=-2\mathbf {x} _{(K)}^{\top }\mathbf {y} +2\mathbf {x} _{(K)}^{\top }\mathbf {X} {\boldsymbol {\beta }}\end{aligned}}

This shows that the first-order condition for the vector $\mathbf {b}$ of estimated regression coefficients can be written compactly as:

\left.{\frac {\partial S({\boldsymbol {\beta }})}{\partial {\boldsymbol {\beta }}}}\right|_{\mathbf {b} }=-2\mathbf {X} ^{\top }\mathbf {y} +2\mathbf {X} ^{\top }\mathbf {X} \mathbf {b} \;{\overset {\mathrm {!} }{=}}\;0

Rearranging gives the normal equations $\mathbf {X} ^{\top }\mathbf {X} \mathbf {b} =\mathbf {X} ^{\top }\mathbf {y}$. Left-multiplying by the inverse of the symmetric positive definite matrix $\mathbf {X} ^{\top }\mathbf {X}$ (the sum-of-products matrix) yields the solution of the minimization problem:

\mathbf {b} =(\mathbf {X} ^{\top }\mathbf {X} )^{-1}\mathbf {X} ^{\top }\mathbf {y}
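The closed-form solution translates almost literally into code. The following NumPy sketch is one possible illustration under the full-rank assumption stated above; the function name `ols` and the data are made up. It solves the normal equations with `np.linalg.solve` instead of forming the inverse explicitly, which is a common numerical choice and not something prescribed by the derivation.

```python
import numpy as np

def ols(X, y):
    """Least-squares coefficients b solving the normal equations X'X b = X'y.

    Assumes X has full column rank, so that X'X is invertible.
    """
    XtX = X.T @ X          # sum-of-products matrix
    Xty = X.T @ y
    # Solving the linear system avoids computing the inverse explicitly.
    return np.linalg.solve(XtX, Xty)

# Hypothetical data: a column of ones (intercept) and one regressor.
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([1.0, 3.0, 4.0, 7.0])

b = ols(X, y)
# The gradient of S at b; by the first-order condition it should vanish.
gradient = -2 * X.T @ y + 2 * X.T @ X @ b
print(b, gradient)
```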

For the variance-covariance matrix of the parameter estimator one obtains (in compact form, using $\mathbf {b} -\operatorname {E} (\mathbf {b} )=(\mathbf {X} ^{\top }\mathbf {X} )^{-1}\mathbf {X} ^{\top }{\boldsymbol {\varepsilon }}$ and $\operatorname {E} ({\boldsymbol {\varepsilon }}{\boldsymbol {\varepsilon }}^{\top })=\sigma ^{2}\mathbf {I} _{T}$):^[8]

{\begin{aligned}\operatorname {Cov} (\mathbf {b} )&=\operatorname {E} \left[(\mathbf {b} -\operatorname {E} (\mathbf {b} ))(\mathbf {b} -\operatorname {E} (\mathbf {b} ))^{\top }\right]\\&=\operatorname {E} \left[(\mathbf {X} ^{\top }\mathbf {X} )^{-1}\mathbf {X} ^{\top }{\boldsymbol {\varepsilon }}{\boldsymbol {\varepsilon }}^{\top }\mathbf {X} (\mathbf {X} ^{\top }\mathbf {X} )^{-1}\right]\\&=(\mathbf {X} ^{\top }\mathbf {X} )^{-1}\mathbf {X} ^{\top }\operatorname {E} ({\boldsymbol {\varepsilon }}{\boldsymbol {\varepsilon }}^{\top })\mathbf {X} (\mathbf {X} ^{\top }\mathbf {X} )^{-1}\\&=(\mathbf {X} ^{\top }\mathbf {X} )^{-1}\mathbf {X} ^{\top }(\sigma ^{2}\mathbf {I} _{T})\mathbf {X} (\mathbf {X} ^{\top }\mathbf {X} )^{-1}\\&=\sigma ^{2}(\mathbf {X} ^{\top }\mathbf {X} )^{-1}\end{aligned}}

Since the estimated variance of the error terms is

{\hat {\sigma }}^{2}={\frac {(\mathbf {y} -\mathbf {X} \mathbf {b} )^{\top }(\mathbf {y} -\mathbf {X} \mathbf {b} )}{T-K}}

the estimated variance-covariance matrix is

{\widehat {\operatorname {Cov} (\mathbf {b} )}}={\hat {\sigma }}^{2}(\mathbf {X} ^{\top }\mathbf {X} )^{-1}={\frac {{\hat {\boldsymbol {\varepsilon }}}^{\top }{\hat {\boldsymbol {\varepsilon }}}}{T-K}}(\mathbf {X} ^{\top }\mathbf {X} )^{-1}
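The estimated variance and variance-covariance matrix can be computed from the residuals in a few lines. This NumPy sketch uses made-up data and a hypothetical function name `ols_covariance`; forming the explicit inverse is done here only because the formula itself contains it.

```python
import numpy as np

def ols_covariance(X, y):
    """Return b, the variance estimate sigma2_hat and the estimated Cov(b)."""
    T, K = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ X.T @ y
    residuals = y - X @ b
    # Residual sum of squares divided by the degrees of freedom T - K.
    sigma2_hat = residuals @ residuals / (T - K)
    cov_b_hat = sigma2_hat * XtX_inv
    return b, sigma2_hat, cov_b_hat

# Hypothetical data: intercept column plus one regressor (T = 4, K = 2).
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([1.0, 3.0, 4.0, 7.0])

b, sigma2_hat, cov_b_hat = ols_covariance(X, y)
# Standard errors are the square roots of the diagonal of the covariance matrix.
std_errors = np.sqrt(np.diag(cov_b_hat))
print(sigma2_hat, cov_b_hat, std_errors)
```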

With the least-squares estimator $\mathbf {b}$ one obtains the system of equations

\mathbf {y} =\mathbf {X} \mathbf {b} +{\hat {\boldsymbol {\varepsilon }}}={\hat {\mathbf {y} }}+{\hat {\boldsymbol {\varepsilon }}},

where ${\hat {\boldsymbol {\varepsilon }}}$ is the vector of residuals and ${\hat {\mathbf {y} }}=\mathbf {X} \mathbf {b}$ is the estimate of $\mathbf {y}$. The analysis is often concerned with the estimate ${\hat {y}}_{0}$, that is, the prediction of the dependent variable for a given tuple $\mathbf {x} _{0}$. It is calculated as

{\hat {y}}_{0}=b_{1}x_{01}+b_{2}x_{02}+\dotsc +b_{K}x_{0K}=\mathbf {x} _{0}^{\top }\mathbf {b}
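The prediction is simply the dot product of the regressor tuple with the coefficient vector. A minimal sketch, with hypothetical coefficient values and a made-up function name `predict`:

```python
def predict(x0, b):
    """Prediction y0_hat = x0' b as the dot product of regressors and coefficients."""
    if len(x0) != len(b):
        raise ValueError("x0 and b must both have length K")
    return sum(x * coef for x, coef in zip(x0, b))

b = [0.9, 1.9]     # hypothetical estimates: intercept and slope
x0 = [1.0, 4.0]    # the leading 1.0 multiplies the intercept
y0_hat = predict(x0, b)
print(y0_hat)
```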

Properties of the least-squares estimator

Unbiasedness

In the multiple case it can likewise be shown that the least-squares estimator is unbiased. However, this holds only if the regressors are exogenous. If one assumes that the exogenous variables are not random variables but can be controlled as in an experiment, then $\forall k\in \{1,\dotsc ,K\}\colon \operatorname {E} (x_{tk}\varepsilon _{t})=\operatorname {E} (x_{tk})\cdot \operatorname {E} (\varepsilon _{t})=0$, or $\operatorname {E} (\mathbf {x} ^{\top }{\boldsymbol {\varepsilon }})=\mathbf {0}$, and therefore

{\begin{aligned}\operatorname {E} (\mathbf {b} )&=\operatorname {E} ((\mathbf {X} ^{\top }\mathbf {X} )^{-1}\mathbf {X} ^{\top }\mathbf {y} )\\&=\operatorname {E} ((\mathbf {X} ^{\top }\mathbf {X} )^{-1}\mathbf {X} ^{\top }(\mathbf {X} {\boldsymbol {\beta }}+{\boldsymbol {\varepsilon }}))\\&=\operatorname {E} ({\boldsymbol {\beta }}+(\mathbf {X} ^{\top }\mathbf {X} )^{-1}\mathbf {X} ^{\top }{\boldsymbol {\varepsilon }})\\&={\boldsymbol {\beta }}+(\mathbf {X} ^{\top }\mathbf {X} )^{-1}\mathbf {X} ^{\top }\operatorname {E} ({\boldsymbol {\varepsilon }})\\&={\boldsymbol {\beta }}\end{aligned}}

If the exogeneity assumption does not hold, $\operatorname {E} (\mathbf {x} ^{\top }{\boldsymbol {\varepsilon }})\neq \mathbf {0}$, the least-squares estimator is not unbiased but biased; that is, on average the parameter estimator deviates from the true parameter:

\operatorname {Bias} (\mathbf {b} )=\operatorname {E} (\mathbf {b} )-{\boldsymbol {\beta }}\neq \mathbf {0}

The expected value of the estimator $\mathbf {b}$ is then not equal to the true parameter ${\boldsymbol {\beta }}$.
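Unbiasedness under fixed (non-random) regressors can be illustrated by a small Monte Carlo experiment. The sketch below is an illustration only: the true parameters, the noise level, the number of replications, the seed and the function names are all assumptions. It repeatedly draws new error terms for the same x-values and averages the resulting estimates, which should come out close to the true parameters.

```python
import random

def fit_line(xs, ys):
    """Least-squares intercept and slope for one regressor plus intercept."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1

def average_estimates(beta0, beta1, sigma, xs, replications, seed=0):
    """Average the estimates over many samples that share the same fixed xs."""
    rng = random.Random(seed)
    sum_b0 = sum_b1 = 0.0
    for _ in range(replications):
        # Fresh normally distributed errors with mean zero in every replication.
        ys = [beta0 + beta1 * x + rng.gauss(0.0, sigma) for x in xs]
        b0, b1 = fit_line(xs, ys)
        sum_b0 += b0
        sum_b1 += b1
    return sum_b0 / replications, sum_b1 / replications

xs = [float(i) for i in range(10)]   # fixed design, as in a controlled experiment
mean_b0, mean_b1 = average_estimates(2.0, 0.5, 1.0, xs, replications=2000)
print(mean_b0, mean_b1)
```

A single estimate can still be far from the true value; unbiasedness is a statement about the average over repeated samples.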

Efficiency

The least-squares estimator is linear:

\mathbf {b} =\underbrace {(\mathbf {X} ^{\top }\mathbf {X} )^{-1}\mathbf {X} ^{\top }} _{:=\mathbf {A} }\mathbf {y} =\mathbf {A} \mathbf {y}

By the Gauss–Markov theorem, the estimator $\mathbf {b}$ is BLUE (best linear unbiased estimator); that is, among all linear unbiased estimators it is the one with the smallest variance or variance-covariance matrix. These properties of the estimator $\mathbf {b}$ do not require any information about the distribution of the disturbance.

Consistency

Under the assumptions made so far, the least-squares estimator is unbiased, $\operatorname {E} (\mathbf {b} )={\boldsymbol {\beta }}$, and the sample size $T$ has no influence on unbiasedness. An estimator is consistent if and only if it converges in probability to the true value (the mode of convergence of the weak law of large numbers). The consistency property thus concerns the behavior of the estimator as the number of observations grows.

For the sequence $(\mathbf {b} _{t})_{t\in \mathbb {N} }$ it holds that it converges in probability to the true value:

\forall \nu >0\colon \lim _{t\to \infty }\mathbb {P} (|\mathbf {b} _{t}-{\boldsymbol {\beta }}|\geq \nu )=0

or, put more simply:

\mathbf {b} \;{\stackrel {p}{\longrightarrow }}\;{\boldsymbol {\beta }}

or

\operatorname {plim} (\mathbf {b} )={\boldsymbol {\beta }}

Assuming that $\operatorname {plim} (\mathbf {X} ^{\top }\mathbf {X} /T)$ exists and is invertible, consistency can be shown as follows:^[9]

{\begin{aligned}\operatorname {plim} (\mathbf {b} )&=\operatorname {plim} ((\mathbf {X} ^{\top }\mathbf {X} )^{-1}\mathbf {X} ^{\top }\mathbf {y} )\\&=\operatorname {plim} ({\boldsymbol {\beta }}+(\mathbf {X} ^{\top }\mathbf {X} )^{-1}\mathbf {X} ^{\top }{\boldsymbol {\varepsilon }})\\&={\boldsymbol {\beta }}+\operatorname {plim} ((\mathbf {X} ^{\top }\mathbf {X} )^{-1}\mathbf {X} ^{\top }{\boldsymbol {\varepsilon }})\\&={\boldsymbol {\beta }}+\operatorname {plim} \left((\mathbf {X} ^{\top }\mathbf {X} /T)^{-1}\right)\cdot \operatorname {plim} \left(\mathbf {X} ^{\top }{\boldsymbol {\varepsilon }}/T\right)\\&={\boldsymbol {\beta }}+\left[\operatorname {plim} \left(\mathbf {X} ^{\top }\mathbf {X} /T\right)\right]^{-1}\cdot \underbrace {\operatorname {plim} \left(\mathbf {X} ^{\top }{\boldsymbol {\varepsilon }}/T\right)} _{=0}={\boldsymbol {\beta }}\end{aligned}}

Consequently, the least-squares estimator is consistent. This property states that, as the sample size increases, the probability that the estimator $\mathbf {b}$ deviates from the true parameter ${\boldsymbol {\beta }}$ by more than any given amount tends to zero.
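Consistency can be made visible with a simple simulation in which the sample size grows. The following sketch is only illustrative: the true parameters, the uniform distribution of the regressor, the seed and the function name `slope_estimate` are assumptions. For each sample size it reports the absolute error of the estimated slope.

```python
import random

def slope_estimate(T, beta0, beta1, sigma, rng):
    """Least-squares slope from one simulated sample of size T."""
    xs = [rng.uniform(0.0, 10.0) for _ in range(T)]
    ys = [beta0 + beta1 * x + rng.gauss(0.0, sigma) for x in xs]
    x_bar = sum(xs) / T
    y_bar = sum(ys) / T
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return sxy / sxx

rng = random.Random(42)
true_slope = 0.5
# Absolute estimation error of the slope for increasing sample sizes.
errors = {T: abs(slope_estimate(T, 2.0, true_slope, 1.0, rng) - true_slope)
          for T in (10, 100, 1000, 10000)}
for T, err in errors.items():
    print(T, err)
```

Because each sample size is simulated only once, the errors need not shrink monotonically; the statement concerns the probability of large deviations, which becomes small for large T.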

Equivalence of the maximum-likelihood solution and the least-squares solution

The normal linear model can be estimated by the maximum-likelihood method. This first requires the probability density of an individual error term, which follows a normal distribution. It is

f(\varepsilon _{t}|\sigma ^{2})={\frac {1}{\sqrt {2\pi \sigma ^{2}}}}\operatorname {exp} \left\{-{\frac {\varepsilon _{t}^{2}}{2\sigma ^{2}}}\right\}

where $\sigma ^{2}=\sigma _{\varepsilon }^{2}$ denotes the variance of the error terms.

Since the error term can also be written as $\varepsilon _{t}=y_{t}-\mathbf {x} _{t}^{\top }{\boldsymbol {\beta }}$, the individual density can also be written as

f(y_{t}|\mathbf {x} _{t}^{\top },{\boldsymbol {\beta }},\sigma ^{2})={\frac {1}{\sqrt {2\pi \sigma ^{2}}}}\operatorname {exp} \left\{-{\frac {\left(y_{t}-\mathbf {x} _{t}^{\top }{\boldsymbol {\beta }}\right)^{2}}{2\sigma ^{2}}}\right\}

By the independence assumption, the joint probability density $f$ can be written as the product of the individual marginal densities $f_{1},\dotsc ,f_{T}$. Under the assumed stochastic independence, the joint density is

{\begin{aligned}f(y_{1},y_{2},\dotsc ,y_{T}|\mathbf {X} ,{\boldsymbol {\beta }},\sigma ^{2})&=\prod _{t=1}^{T}f_{t}(y_{t}|\mathbf {x} _{t},{\boldsymbol {\beta }},\sigma ^{2})\\&={\frac {1}{\sqrt {2\pi \sigma ^{2}}}}\operatorname {exp} \left\{-{\frac {\left(y_{1}-\mathbf {x} _{1}^{\top }{\boldsymbol {\beta }}\right)^{2}}{2\sigma ^{2}}}\right\}\cdot \dotsb \cdot {\frac {1}{\sqrt {2\pi \sigma ^{2}}}}\operatorname {exp} \left\{-{\frac {\left(y_{T}-\mathbf {x} _{T}^{\top }{\boldsymbol {\beta }}\right)^{2}}{2\sigma ^{2}}}\right\}\\&=(2\pi \sigma ^{2})^{-{\frac {T}{2}}}\operatorname {exp} \left\{-{\frac {\left(\mathbf {y} -\mathbf {X} {\boldsymbol {\beta }}\right)^{\top }\left(\mathbf {y} -\mathbf {X} {\boldsymbol {\beta }}\right)}{2\sigma ^{2}}}\right\}\end{aligned}}

The joint density can also be written as that of a multivariate normal distribution with mean $\mathbf {X} {\boldsymbol {\beta }}$ and covariance matrix $\sigma ^{2}\mathbf {I} _{T}$:

f(\mathbf {y} |\mathbf {X} ,{\boldsymbol {\beta }},\sigma ^{2})=(2\pi \sigma ^{2})^{-{\frac {T}{2}}}|\mathbf {I} _{T}|^{-{\frac {1}{2}}}\operatorname {exp} \left\{-{\frac {\left(\mathbf {y} -\mathbf {X} {\boldsymbol {\beta }}\right)^{\top }\mathbf {I} _{T}^{-1}\left(\mathbf {y} -\mathbf {X} {\boldsymbol {\beta }}\right)}{2\sigma ^{2}}}\right\}

Since the interest is not in a particular outcome for given parameters but in those parameters that fit the data best, that is, the parameters under which the observed data are most plausible, the likelihood function can now be formulated as the joint probability density viewed as a function of the parameters:

L({\boldsymbol {\beta }},\sigma ^{2};\mathbf {y} ,\mathbf {X} )=(2\pi \sigma ^{2})^{-{\frac {T}{2}}}\operatorname {exp} \left\{-{\frac {\left(\mathbf {y} -\mathbf {X} {\boldsymbol {\beta }}\right)^{\top }\left(\mathbf {y} -\mathbf {X} {\boldsymbol {\beta }}\right)}{2\sigma ^{2}}}\right\}

Taking the logarithm of the likelihood function gives the log-likelihood function as a function of the parameters:

\ell ({\boldsymbol {\beta }},\sigma ^{2};\mathbf {y} ,\mathbf {X} )=\ln \left(L({\boldsymbol {\beta }},\sigma ^{2};\mathbf {y} ,\mathbf {X} )\right)=-{\frac {T}{2}}\cdot \ln(2\pi )-{\frac {T}{2}}\cdot \ln(\sigma ^{2})-{\frac {\left(\mathbf {y} -\mathbf {X} {\boldsymbol {\beta }}\right)^{\top }\left(\mathbf {y} -\mathbf {X} {\boldsymbol {\beta }}\right)}{2\sigma ^{2}}}

This function is now to be maximized with respect to the parameters. This gives the following maximization problem:

{\tilde {\sigma }}^{2}={\underset {\sigma ^{2}}{\operatorname {arg\,max} }}\ \ell ({\boldsymbol {\beta }},\sigma ^{2};\mathbf {y} ,\mathbf {X} )

{\tilde {\mathbf {b} }}={\underset {\boldsymbol {\beta }}{\operatorname {arg\,max} }}\ \ell ({\boldsymbol {\beta }},\sigma ^{2};\mathbf {y} ,\mathbf {X} )
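As a numerical illustration of this maximization problem, the following sketch evaluates the log-likelihood for a line with intercept on made-up data and compares its value at the least-squares coefficients with its value at a few perturbed parameter choices. Two things are assumptions of the sketch rather than statements of the text at this point: the closed-form least-squares line fit, and the variance value RSS/T, which is the value at which the derivative of the log-likelihood with respect to the variance vanishes.

```python
import math

def log_likelihood(beta0, beta1, sigma2, xs, ys):
    """Log-likelihood of the normal linear model y_t = beta0 + beta1 * x_t + eps_t."""
    T = len(xs)
    rss = sum((y - (beta0 + beta1 * x)) ** 2 for x, y in zip(xs, ys))
    return (-T / 2 * math.log(2 * math.pi)
            - T / 2 * math.log(sigma2)
            - rss / (2 * sigma2))

# Hypothetical data set with T = 4 observations.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 4.0, 7.0]
T = len(xs)

# Least-squares fit of the line (closed form for one regressor plus intercept).
x_bar = sum(xs) / T
y_bar = sum(ys) / T
b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
      / sum((x - x_bar) ** 2 for x in xs))
b0 = y_bar - b1 * x_bar

# Variance value at which the derivative with respect to sigma^2 is zero: RSS / T.
rss = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
sigma2_ml = rss / T

best = log_likelihood(b0, b1, sigma2_ml, xs, ys)
# Log-likelihood values after moving one parameter away from the candidate optimum.
perturbed = [
    log_likelihood(b0 + 0.1, b1, sigma2_ml, xs, ys),
    log_likelihood(b0, b1 - 0.1, sigma2_ml, xs, ys),
    log_likelihood(b0, b1, sigma2_ml * 1.5, xs, ys),
    log_likelihood(b0, b1, sigma2_ml * 0.5, xs, ys),
]
print(best, perturbed)
```

Note that the variance value RSS/T used here divides by T, whereas the variance estimate given earlier divides by T − K.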

The two score functions are:

\left.{\frac {\partial \ell ({\boldsymbol {\beta }},\sigma ^{2};\mathbf {y} ,\mathbf {X} )}{\partial {\boldsymbol {\beta }}}}\right|_{\begin{array}{c}{\boldsymbol {\beta }}={\tilde {\mathbf {b} }}\\\sigma ^{2}={\tilde {\sigma }}^{2}\end{array}}=-{\frac {1}{2\sigma ^{2}}}\cdot \underbrace {\frac {\partial \left((\mathbf {y} -\mathbf {X} {\boldsymbol {\beta }})^{\top }(\mathbf {y} -\mathbf {X} {\boldsymbol {\beta }})\right)}{\partial {\boldsymbol {\beta }}}} _{-2\mathbf {X} ^{\top }\mathbf {y} +2\mathbf {X} ^{\top }\mathbf {X} {\boldsymbol {\beta }}}\;{\overset {\mathrm {!} }{=}}\;0

\left.{\frac {\partial \ell ({\boldsymbol {\beta }},\sigma ^{2};\mathbf {y} ,\mathbf {X} )}{\partial \sigma ^{2}}}\right|_{\begin{array}{c}{\boldsymbol {\beta }}={\tilde {\mathbf {b} }}\\\sigma ^{2}={\tilde {\sigma }}^{2}\end{array}}=-{\frac {T}{2\sigma ^{2}}}+{\frac {1}{2\sigma ^{4}}}\cdot (\mathbf {y} -\mathbf {X} {\boldsymbol {\beta }})^{\top }(\mathbf {y} -\mathbf {X} {\boldsymbol {\beta }})\;{\overset {\mathrm {!} }{=}}\;0

Partial differentiation shows that the expression

{\frac {\partial \left((\mathbf {y} -\mathbf {X} {\boldsymbol {\beta }})^{\top }(\mathbf {y} -\mathbf {X} {\boldsymbol {\beta }})\right)}{\partial {\boldsymbol {\beta }}}}=-2\mathbf {X} ^{\top }\mathbf {y} +2\mathbf {X} ^{\top }\mathbf {X} {\boldsymbol {\beta }}

is already known from the derivation of the least-squares estimator. The maximum-likelihood optimization problem thus reduces to the least-squares optimization problem. It follows that the least-squares estimator coincides with the ML estimator:

{\tilde {\mathbf {b} }}=\mathbf {b} =(\mathbf {X} ^{\top }\mathbf {X} )^{-1}\mathbf {X} ^{\top }\mathbf {y}

[1] A. Charnes, E. L. Frome, P. L. Yu: The Equivalence of Generalized Least Squares and Maximum Likelihood Estimates in the Exponential Family. In: Journal of the American Statistical Association. Vol. 71, No. 353, 1976, pp. 169–171, doi:10.1080/01621459.1976.10481508.
[2] Otto Bretscher: Linear Algebra With Applications. 3rd edition. Prentice Hall, Upper Saddle River, NJ 1995.
[3] Stephen M. Stigler: Gauss and the Invention of Least Squares. In: Ann. Stat. Vol. 9, No. 3, 1981, pp. 465–474, doi:10.1214/aos/1176345451 (projecteuclid.org).
[4] Stephen M. Stigler: The History of Statistics: The Measurement of Uncertainty Before 1900. Belknap Press of Harvard University Press, Cambridge, MA 1986, ISBN 0-674-40340-1.
[5] Vorlage:Citation
[6] J. Aldrich: Doing Least Squares: Perspectives from Gauss and Yule. In: International Statistical Review. Vol. 66, No. 1, 1998, pp. 61–81, doi:10.1111/j.1751-5823.1998.tb00406.x.
[7] $\arg \min(\cdot )$ denotes, analogously to $\arg \max(\cdot )$ (argument of the maximum), the argument of the minimum.
[8] G. Judge and R. Carter Hill: Introduction to the Theory and Practice of Econometrics. 1998, p. 201.
[9] G. Judge and R. Carter Hill: Introduction to the Theory and Practice of Econometrics. 1998, p. 266.
