
RED SPOTS

Linear Regression

Linear regression is a technique for choosing a line to represent the relationship between two variables, based on a set of observed values of those variables. Continuing with the income and food expenditure example, we might observe the monthly incomes of several households along with their monthly food expenditures. These data points could measure the behavior of different households (a cross-sectional sample), the same household in different months (a time-series sample), or a set of households each observed over a span of months (a longitudinal sample, or panel data).

To begin with the simplest possible example of linear regression, suppose that we have exactly two observations on E and Y. These observations come from two different households that can be assumed to have made their decisions independently of one another. (This assumption of independence is important. To learn why, select Special Topic: Independence of Observations.)

If we plot these two data points on a graph with Y on the horizontal axis and E on the vertical axis, we might get a diagram similar to the one in Figure 1, where the data points are labeled x and z. As you can see, there is exactly one straight line that passes through the two data points. We shall represent the mathematical equation for this line as E = b1 + b2Y. (Note that we must distinguish carefully between the unknown parameters, which we denote by capital letters, and our estimates of them, which we denote by lower-case letters. Many textbooks use Greek letters for the former and the corresponding Roman letters for the latter.) This equation is a "perfect fit" for the data, in the sense that both data points lie exactly on the line. With only two data points, choosing the best straight line to represent the relationship defined by the data is easy! The slope of this line is b2, our empirical estimate of B2, while the value of E where the line crosses the vertical axis is b1, our estimate of the intercept B1.
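
To make this concrete, here is a minimal sketch in Python. The income and expenditure figures are invented for the illustration; with exactly two points, the slope is just the rise in E divided by the rise in Y.

# Two hypothetical observations: monthly income (Y) and food expenditure (E).
Y1, E1 = 2000.0, 600.0    # household x (invented numbers)
Y2, E2 = 4000.0, 1000.0   # household z (invented numbers)

# The unique line through both points gives our estimates directly.
b2 = (E2 - E1) / (Y2 - Y1)   # slope: change in E per unit change in Y
b1 = E1 - b2 * Y1            # intercept: value of E where the line crosses Y = 0

print(f"E = {b1:.2f} + {b2:.2f} * Y")   # E = 200.00 + 0.20 * Y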

Figure 1: Two data points, labeled x and z, and the unique straight line passing through both.

As discussed in the page on economic models, we can rarely expect the relationship between two economic variables to be "perfect." There are always other variables that affect the endogenous variable. Differences in these other variables between observations will cause some data points to lie above the regression line and others to lie below it. Figure 2 shows a sample of three data points.

Figure 2: Three data points with the least-squares regression line and the residuals (the vertical deviations of the points from the line).

No single line passes through all three points. Choosing the line passing through any two of the three points leaves one point off the line, so we say that there is one degree of freedom in choosing the line. (In the case of only two points, there were zero degrees of freedom; if we added a fourth point, there would be two degrees of freedom. In general, a sample of n points leaves n - 2 degrees of freedom for fitting a line.)

Many alternative criteria could be chosen for picking a "best-fit" line for three or more data points. The most common methods try to make the residuals, the deviations of the data points from the estimated regression line, as small as possible. In the case of only two data points, our regression line passes through both points, so the residuals are zero: the data points do not deviate from the line. With three or more data points we cannot find a line that makes all the residuals zero, except in the unusual case where all the points happen to lie on the same line.
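
A residual is easy to compute once a candidate line is chosen. The Python sketch below (with invented numbers) evaluates the residuals for a line passing through the first and third of three points; the middle point is left off the line, so its residual is nonzero.

# Three hypothetical data points: income (Y) and food expenditure (E).
Y = [2000.0, 3000.0, 4000.0]
E = [600.0, 900.0, 1000.0]

# Candidate line E = b1 + b2*Y chosen to pass through the first and third points.
b1, b2 = 200.0, 0.20
residuals = [e - (b1 + b2 * y) for y, e in zip(Y, E)]
print(residuals)   # [0.0, 100.0, 0.0]: the middle point lies 100 above the line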

By far the most common estimator is the least-squares regression line. This is the line that makes the sum of the squared residuals as small as possible, where each residual is measured as the vertical distance of the point from the estimated regression line. Figure 2 shows the least-squares regression line and the residuals for the three data points. The least-squares procedure minimizes the sum of the squares of the residuals because squares are never negative: a negative residual counts the same as a positive residual of equal size, so adding up the squared residuals assures that positive and negative residuals cannot cancel each other out. (We could, of course, minimize the sum of the absolute values of the residuals rather than the squares, but for mathematical reasons it is easier to work with the squares. Statistical practice has settled on least-squares regression as the standard procedure.)
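
Minimizing the sum of squared residuals yields the standard closed-form solution: b2 equals the sum of cross-deviations of Y and E divided by the sum of squared deviations of Y, and b1 = Ebar - b2*Ybar. A sketch, continuing with the same invented three-point sample:

# Least-squares coefficients from the standard closed-form formulas.
Y = [2000.0, 3000.0, 4000.0]
E = [600.0, 900.0, 1000.0]
n = len(Y)

Y_bar = sum(Y) / n
E_bar = sum(E) / n

b2 = (sum((y - Y_bar) * (e - E_bar) for y, e in zip(Y, E))
      / sum((y - Y_bar) ** 2 for y in Y))
b1 = E_bar - b2 * Y_bar

print(b1, b2)   # 233.33... and 0.2: the least-squares line is E = 233.33 + 0.2*Y

Note that this line passes through none of the three points exactly, but its residuals (-33.3, 66.7, -33.3) sum to zero and have the smallest possible sum of squares.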

Regression analysis breaks down the overall variation in the endogenous variable E into two components. The first component is the variation that is explained by movements in the exogenous variable Y: the changes in E that occur as one moves along the estimated regression line to different values of Y. The second component is the variation that is not explained by Y, which is measured by the deviations of the data points from the estimated regression line: the residuals. We often measure how well the estimated regression line fits the data by the share of the total variation in E that is explained by movements along the regression line rather than left in the residuals. This share is usually called R2. For more about measuring goodness of fit in regressions, go to Special Topic: Goodness of Fit.
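
As a sketch of this decomposition (same invented data as before), R2 can be computed as one minus the ratio of the residual variation to the total variation in E:

# Share of the variation in E explained by the regression (R-squared).
Y = [2000.0, 3000.0, 4000.0]
E = [600.0, 900.0, 1000.0]
n = len(Y)
Y_bar, E_bar = sum(Y) / n, sum(E) / n

b2 = (sum((y - Y_bar) * (e - E_bar) for y, e in zip(Y, E))
      / sum((y - Y_bar) ** 2 for y in Y))
b1 = E_bar - b2 * Y_bar

fitted = [b1 + b2 * y for y in Y]
ss_total = sum((e - E_bar) ** 2 for e in E)                # total variation in E
ss_resid = sum((e - f) ** 2 for e, f in zip(E, fitted))    # unexplained variation
print(1.0 - ss_resid / ss_total)   # about 0.923 for these numbers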

Computer programs do the calculations of the regression line for us, so we don't usually need to worry about the formulas used to find b1 and b2. These formulas can be examined by selecting Special Topic: Regression Formulas. Although economists usually use specialized statistical programs such as SAS, Stata, and EViews that feature dozens of variations on the basic regression procedure, simple linear regression coefficients can also be calculated in spreadsheets such as Microsoft Excel. To learn how to use Excel to calculate the coefficients of the regression line, visit Special Topic: Regression Using Excel.
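
For instance, here is the same three-point fit done by a library routine rather than by hand. This uses Python's numpy as a stand-in for the packages named above, with the same invented data:

import numpy as np

# numpy does the least-squares fit for us.
Y = np.array([2000.0, 3000.0, 4000.0])   # income (exogenous)
E = np.array([600.0, 900.0, 1000.0])     # food expenditure (endogenous)

b2, b1 = np.polyfit(Y, E, deg=1)   # a degree-1 fit returns (slope, intercept)
print(f"E = {b1:.2f} + {b2:.2f} * Y")    # matches the hand calculation above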

Disturbances vs. Residuals: To avoid possible confusion, it is important to clarify the relationship between the disturbance term U introduced in the Economic Models page and the residuals examined above, which we shall denote by u. When we formulate and apply an economic model, we assume that the true relationship underlying our data (sometimes called the data-generating process) is E = B1 + B2Y + U, with U being a random variable drawn from some specified probability distribution. As economists, we know only the sample values of E and Y, not the values of B1, B2, or U. Indeed, the main purpose of regression analysis is to estimate B1 and B2. The disturbance U = E - (B1 + B2Y) is the distance from a data point to the true (unknown) regression line of the data-generating process. The residual u = E - (b1 + b2Y) is the distance from a data point to the estimated regression line.
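
The distinction is easiest to see in a simulation, where, unlike in real data, we get to choose B1, B2, and the distribution of U, and can therefore compare the disturbances with the residuals. All of the numbers below are invented:

import random

random.seed(1)

# A made-up data-generating process: E = B1 + B2*Y + U.
B1, B2 = 250.0, 0.18
Y = [random.uniform(1500.0, 5000.0) for _ in range(50)]
U = [random.gauss(0.0, 50.0) for _ in range(50)]    # disturbances: unobservable in real data
E = [B1 + B2 * y + u for y, u in zip(Y, U)]

# Estimate b1 and b2 from the sample alone, as an econometrician would.
n = len(Y)
Y_bar, E_bar = sum(Y) / n, sum(E) / n
b2 = (sum((y - Y_bar) * (e - E_bar) for y, e in zip(Y, E))
      / sum((y - Y_bar) ** 2 for y in Y))
b1 = E_bar - b2 * Y_bar

# Residuals approximate the disturbances but do not equal them,
# because (b1, b2) only approximate (B1, B2).
u = [e - (b1 + b2 * y) for y, e in zip(Y, E)]
print(U[0], u[0])   # close, but not identical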

Recall that the disturbance term captures the effects on the endogenous variable (E) of factors other than the exogenous variable (Y) that appears on the right-hand side. It is possible to include more than one exogenous variable in a multiple regression. To learn more about extending regression analysis to more than two dimensions, look at Special Topic: Multiple Regression.
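
As an illustration of the extension, the sketch below fits a regression with two exogenous variables, income (Y) and a hypothetical household-size variable (S), using numpy's least-squares solver; all the data are invented:

import numpy as np

# Hypothetical data: expenditure (E) explained by income (Y) and household size (S).
Y = np.array([2000.0, 3000.0, 4000.0, 2500.0, 3500.0])
S = np.array([2.0, 3.0, 5.0, 4.0, 2.0])
E = np.array([650.0, 920.0, 1300.0, 880.0, 950.0])

# Columns of the design matrix: a constant (for the intercept), then Y and S.
X = np.column_stack([np.ones_like(Y), Y, S])
coeffs, _, _, _ = np.linalg.lstsq(X, E, rcond=None)
b1, b2, b3 = coeffs
print(b1, b2, b3)   # intercept, slope on Y, slope on S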

The estimated regression line in Figure 2 is upward-sloping, which means that our best estimate b2 of the slope coefficient B2 is positive. But how confident can we be that our upward-sloping estimated regression line reflects a true positive relationship between the variables, rather than a coincidence of random disturbances that happened to produce a positive slope? To answer this question, we must consider the statistical properties of the estimated regression coefficients: their behavior over many repeated random samples. That is the subject of our Next Topic: Sampling Variation.
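
As a preview, the simulation below (Python, with invented numbers) draws many samples from a process whose true slope B2 is zero; roughly half of the estimated slopes still come out positive purely by chance, which is exactly why we need the statistical machinery of the next topic:

import random

random.seed(2)

def ols_slope(Y, E):
    # Least-squares slope from the closed-form formula.
    n = len(Y)
    Y_bar, E_bar = sum(Y) / n, sum(E) / n
    return (sum((y - Y_bar) * (e - E_bar) for y, e in zip(Y, E))
            / sum((y - Y_bar) ** 2 for y in Y))

# Draw 1000 small samples from a process in which Y truly has no effect on E.
positive = 0
for _ in range(1000):
    Y = [random.uniform(1500.0, 5000.0) for _ in range(10)]
    E = [800.0 + random.gauss(0.0, 100.0) for _ in Y]   # true slope B2 = 0
    if ols_slope(Y, E) > 0:
        positive += 1
print(positive / 1000)   # about 0.5: upward slopes arise by chance half the time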

Special topics:

Special Topic: Independence of Observations

Special Topic: Goodness of Fit

Special Topic: Regression Formulas

Special Topic: Regression Using Excel

Special Topic: Multiple Regression

To continue to the next topic: Sampling Variation

To return to the RED SPOTS Econometrics Home Page: Home
