WHY ARE THERE TWO REGRESSION LINES?
Two regression lines can exist in certain circumstances. When the direction of causation between X and Y is not fixed, so that either variable may be regarded as the cause of the other, one can treat X as the independent variable and Y as the dependent variable, or Y as the independent variable and X as the dependent variable. As a result, we have (1) the regression line of Y on X and (2) the regression line of X on Y.
Both are valid regression lines, but we must judiciously select the regression equation that suits the given situation.
Note: If only X causes Y, then there is only one regression line, that of Y on X.
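The difference between the two lines can be seen on a small example. The sketch below is illustrative only: the data are made up, the helper name fit_line is hypothetical, and it assumes the usual least-squares fit, which this section has not yet introduced.

```python
def fit_line(u, v):
    """Least-squares fit of v = intercept + slope * u; returns (intercept, slope)."""
    n = len(u)
    mean_u = sum(u) / n
    mean_v = sum(v) / n
    s_uv = sum((ui - mean_u) * (vi - mean_v) for ui, vi in zip(u, v))
    s_uu = sum((ui - mean_u) ** 2 for ui in u)
    slope = s_uv / s_uu
    intercept = mean_v - slope * mean_u
    return intercept, slope

# Hypothetical data, made up for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

a_yx, b_yx = fit_line(x, y)  # regression line of Y on X: Y = a_yx + b_yx * X
a_xy, b_xy = fit_line(y, x)  # regression line of X on Y: X = a_xy + b_xy * Y

print(f"Y on X: Y = {a_yx:.3f} + {b_yx:.3f} X")
print(f"X on Y: X = {a_xy:.3f} + {b_xy:.3f} Y")
```

For these data the line of Y on X is Y = 2.2 + 0.6 X, while the line of X on Y is X = -1 + Y. Solving the second for Y gives a slope of 1, not 0.6, so the two lines are different, although both pass through the point of means (3, 4).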
In the general form of the simple linear regression equation of Y on X
Y = a + bX + e
the constants ‘a’ and ‘b’ are generally called the regression coefficients.
The coefficient ‘b’ represents the change in the mean of Y for every unit change in the value of X. When the range of X includes ‘0’, the intercept ‘a’ is E(Y|X = 0). If the range of X does not include ‘0’, then ‘a’ has no practical interpretation.
If (xi, yi), i = 1, 2, ..., n, is a set of n pairs of observations made on (X, Y), then fitting the above regression equation means finding the estimates ‘â’ and ‘b̂’ of ‘a’ and ‘b’ respectively.
These estimates are determined based on the following general assumptions:
(i) the relationship between Y and X is linear (approximately).
(ii) the error term ‘e’ is a random variable with mean zero.
(iii) the error term ‘e’ has constant variance.
There are other assumptions on ‘e’, which are not required at this level of study.
Before proceeding further, the following points are to be kept in mind.
Both the independent and dependent variables must be measured on an interval scale.
There must be a linear relationship between the independent and dependent variables.
Linear regression is very sensitive to outliers (extreme observations). Outliers can strongly affect the regression line and, in turn, the estimated values of Y.
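The effect of a single outlier can be illustrated with a short sketch. The data and the helper name fit_line are hypothetical, and the fit is assumed to be the usual least-squares one.

```python
def fit_line(u, v):
    """Least-squares fit of v = intercept + slope * u; returns (intercept, slope)."""
    n = len(u)
    mean_u = sum(u) / n
    mean_v = sum(v) / n
    s_uv = sum((ui - mean_u) * (vi - mean_v) for ui, vi in zip(u, v))
    s_uu = sum((ui - mean_u) ** 2 for ui in u)
    slope = s_uv / s_uu
    return mean_v - slope * mean_u, slope

# Hypothetical data lying exactly on the line Y = 2X.
x_clean = [1, 2, 3, 4, 5]
y_clean = [2, 4, 6, 8, 10]

# The same data with one extreme observation (6, 40) added.
x_out = x_clean + [6]
y_out = y_clean + [40]

a_clean, b_clean = fit_line(x_clean, y_clean)
a_out, b_out = fit_line(x_out, y_out)

print(f"without outlier: Y = {a_clean:.3f} + {b_clean:.3f} X")
print(f"with outlier:    Y = {a_out:.3f} + {b_out:.3f} X")
```

In this made-up case one extreme point changes the fitted slope from 2 to 6, so every estimated value of Y changes as well.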
Based on assumption (ii), the response variable Y is also a random variable with mean
E(Y|X=x) = a + bx
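A small simulation can make this conditional mean concrete. The sketch below is only an illustration: the values of a, b and sigma are hypothetical, and the errors are assumed to be normally distributed, which is stronger than the assumptions listed above (only mean zero and constant variance are required).

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical true coefficients and error spread, chosen only for illustration.
a, b, sigma = 2.0, 0.5, 1.0

def simulate_mean_y(x, n=100_000):
    """Average of n simulated values Y = a + b*x + e, with e ~ Normal(0, sigma)."""
    total = 0.0
    for _ in range(n):
        e = random.gauss(0.0, sigma)  # error term with mean zero
        total += a + b * x + e
    return total / n

mean_at_3 = simulate_mean_y(3)
mean_at_4 = simulate_mean_y(4)

print("simulated mean of Y at x = 3:", round(mean_at_3, 3), " theory:", a + b * 3)
print("simulated mean of Y at x = 4:", round(mean_at_4, 3), " theory:", a + b * 4)
print("change in mean per unit of X:", round(mean_at_4 - mean_at_3, 3), " theory:", b)
```

The simulated averages should lie close to a + bx, and their difference for a one-unit change in x should lie close to b, which is the interpretation of the slope as the change in the mean of Y per unit change in X.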
In regression analysis, the main objective is finding the line of best fit, which provides the fitted equation of Y on X.
The line of ‘best fit’ is the straight line that minimizes the error in estimating the dependent variable Y for any specified value of the independent variable X within its range.
The regression equation E(Y|X=x) = a + bx represents a family of straight lines, one for each pair of values of the coefficients ‘a’ and ‘b’. The problem is to choose, from this family, the estimates of ‘a’ and ‘b’ that minimize the error in the estimation of Y, so that the resulting line is the best fit.
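One way to make "minimizing the error" concrete is sketched below. It assumes that the error is measured by the sum of squared differences between observed and estimated Y (the least-squares criterion, which is an assumption here because the text has not yet named a criterion). The data and the names sse, a_hat and b_hat are hypothetical.

```python
def sse(a, b, x, y):
    """Sum of squared errors of the line Y = a + b*X on the data (x, y)."""
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))

# Hypothetical data, made up for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Closed-form least-squares estimates of the slope and intercept.
b_hat = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
         / sum((xi - mean_x) ** 2 for xi in x))
a_hat = mean_y - b_hat * mean_x

best_sse = sse(a_hat, b_hat, x, y)

# Other members of the family of lines: shift the intercept and/or the slope.
neighbours = [(a_hat + da, b_hat + db)
              for da in (-0.5, 0.0, 0.5)
              for db in (-0.2, 0.0, 0.2)
              if (da, db) != (0.0, 0.0)]
all_worse = all(sse(a, b, x, y) > best_sse for a, b in neighbours)

print(f"fitted line: Y = {a_hat:.3f} + {b_hat:.3f} X, SSE = {best_sse:.3f}")
print("every nearby line has a larger SSE:", all_worse)
```

For these data the fitted line is Y = 2.2 + 0.6 X with a sum of squared errors of 2.4, and each of the eight nearby lines tried gives a larger value.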