This blog aims to give a basic understanding of regression and its evaluation metrics. I will also explain a strategy for choosing between a linear and a non-linear regression equation when solving a problem.
Regression: In statistics, regression analysis is used to determine the relationship between one or more independent variables, also known as predictors, and a dependent variable (the predicted value). It is widely used for predicting or forecasting continuous values, as opposed to classifiers, which predict discrete classes. Regression is classified into two types: linear and non-linear regression. The next question is: "What are linear and non-linear equations?"
Linear regression: In simple terms, if the data can be fitted by a straight line, it is called linear regression. Most of us are introduced to it through the simple linear regression equation:
Y = Θ(0) + Θ(1)X ......... for one variable
Y = Θ(0) + Θ(1)X(1) + Θ(2)X(2) + ..... + Θ(n)X(n) ......... for multiple variables
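To make this concrete, here is a minimal sketch of fitting a one-variable linear regression with scikit-learn; the data values are made up purely for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical sample data: one predictor X, one response Y
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
Y = np.array([2.1, 4.2, 5.9, 8.1, 9.8])

# Fit Y = Theta(0) + Theta(1) * X
model = LinearRegression()
model.fit(X, Y)

print("Theta(0) (intercept):", model.intercept_)
print("Theta(1) (slope):", model.coef_[0])
```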
Furthermore, if you think only a straight line can represent linear regression, that is wrong. How about this equation:
Y = Θ(0) + Θ(1)X^2 ......... the fitted line is a curve
Although the equation above squares X, it is still a linear model, because it is linear in its parameters (Θ(0), Θ(1)). Linearity depends on the parameters of the equation, not on the predictors. Please visit this link for more details: Linear vs non-linear.
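You can see this in practice: the curved model Y = Θ(0) + Θ(1)X^2 can be fitted with ordinary linear regression by simply using X^2 as the feature. A minimal sketch, again with made-up data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data roughly following Y = 1 + 0.5 * X^2
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([1.6, 3.1, 5.4, 9.2, 13.4])

# The model is linear in its parameters, so we just feed X^2
# to an ordinary linear regression as the single feature.
X_squared = (X ** 2).reshape(-1, 1)

model = LinearRegression().fit(X_squared, Y)
print("Theta(0):", model.intercept_)
print("Theta(1):", model.coef_[0])
```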
Non-Linear: While a linear equation has one general form, a non-linear equation can take many forms (curves). Also, in a non-linear equation one predictor can be associated with multiple parameters.
Example:
Y=Θ(1) + (Θ(2) - Θ(1)) * exp(-Θ(3) * X^Θ(4))
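A model like this is not linear in its parameters, so ordinary linear regression cannot fit it; a non-linear fitting routine is needed. Here is a minimal sketch using scipy.optimize.curve_fit, with made-up data generated from the curve itself:

```python
import numpy as np
from scipy.optimize import curve_fit

# The non-linear model from above:
# Y = theta1 + (theta2 - theta1) * exp(-theta3 * X ** theta4)
def model(X, theta1, theta2, theta3, theta4):
    return theta1 + (theta2 - theta1) * np.exp(-theta3 * X ** theta4)

# Hypothetical data roughly following that curve, plus noise
X = np.linspace(0.1, 5.0, 20)
Y = model(X, 2.0, 10.0, 0.8, 1.2) + np.random.normal(0, 0.1, X.size)

# curve_fit estimates the four parameters numerically,
# starting from an initial guess p0.
params, _ = curve_fit(model, X, Y, p0=[1.0, 5.0, 1.0, 1.0])
print("Estimated parameters:", params)
```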
To determine the type of regression, we will use residual plots. Similarly, to assess the fitness of a model ("good fit"), we will use evaluation metrics. Although there are many evaluation metrics for judging goodness of fit, we will discuss only R-squared (R^2) and Mean Squared Error (MSE).
Residual:
A residual is the difference between the observed (actual) value and the predicted value.
Here is the formula for a residual:
residual = actual value - predicted value
Yes, I know you may be confusing this with the error, so "what is an error?" Are error and residual the same? The answer is "NO". In theory both measure the deviation of an observed value from an expected value, but they are different. If you want to know more, please read this page: "Errors and residuals".
How can these residuals be used to recognize the type of regression? As a first step, fit the model using linear regression and collect the residual of each data point. Once you have collected the values, draw a scatter graph, also known as a residual plot, with the predicted values on the X-axis and the residuals on the Y-axis. If you observe any pattern in the residual plot, the problem is non-linear; if there is no pattern (the points look random), the problem is linear. Also, the estimator is considered a good fit if the residuals are scattered around zero.
Please visit this website (interpreting residual plots) for interpreting residual graphs with various examples.
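For reference, here is a minimal sketch of the residual-plot procedure described above, assuming scikit-learn and matplotlib, with made-up data:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Hypothetical data with a genuinely linear relationship
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, 100).reshape(-1, 1)
Y = 3.0 * X.ravel() + 2.0 + rng.normal(0, 1.0, 100)

# Step 1: fit a linear model
model = LinearRegression().fit(X, Y)
predicted = model.predict(X)

# Step 2: residual = actual value - predicted value
residuals = Y - predicted

# Step 3: residual plot (predicted values on X-axis, residuals on Y-axis)
plt.scatter(predicted, residuals)
plt.axhline(0, color="red")  # points should scatter randomly around zero
plt.xlabel("Predicted values")
plt.ylabel("Residuals")
plt.title("Residual plot")
plt.show()
```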
Mean Squared Error: I assume you have already read about the difference between an error and a residual. If not, that is alright; here is a short version. In statistics, an error is the deviation of an observed value from the true (expected) value, such as the mean of the entire population. Because that true value is usually unobservable, the error itself cannot be computed directly. A residual, in contrast, is the deviation of an observed value from an estimated value, such as a model's prediction. In regression analysis, what we actually compute are residuals.
In regression, what is the mean squared error (MSE)?
MSE is the average of the squares of the errors, i.e., the sum of the squared residuals divided by the number of data points. It measures the dispersion (variance) of the data around the values predicted by the estimator. Oh, just a minute: "What is variance?" I know you are thinking about it. Here is a single-line definition: "It measures how far the data is dispersed." In other words, MSE measures how much variance is not covered by the estimator (the error).
Notations:
E : the estimator, which predicts values
Θ : the actual values
Θ′ : the values predicted by the estimator, E(Θ)
N : the number of data points
SE : the sum of the squares of the errors
MSE formula:
SE = ∑ (Θ - Θ′)^2
MSE = SE / N
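A small sketch of this calculation with made-up values, checked against scikit-learn's mean_squared_error:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical actual (Θ) and predicted (Θ′) values
actual = np.array([3.0, 5.0, 7.5, 9.0])
predicted = np.array([2.8, 5.4, 7.0, 9.5])

# SE = sum of squared differences, MSE = SE / N
SE = np.sum((actual - predicted) ** 2)
MSE = SE / len(actual)
print("Manual MSE:", MSE)

# Same result from scikit-learn
print("sklearn MSE:", mean_squared_error(actual, predicted))
```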
How do you interpret the MSE value? If the MSE is high, the model is not a good fit; if it is low, the model fits well. However, an MSE very close to zero on the training data can mean the estimator is overfitted. As a rough rule of thumb, a good estimator's MSE lies between 20 and 30 percent of the total squared error.
Note: Variance and MSE are different; please read this article: mean square error and variance.
R-Squared: The R-squared value can be viewed as a measure of a model's accuracy. Its value usually lies between 0 and 1. A higher value indicates a well-fitted ("good fit") model, while a low value (near zero) indicates a badly fitted model. However, a high value does not always mean the model is a good fit. R-squared measures the amount of variance covered by the model.
Formula: R^2 = 1 - (u/v)
u : the mean squared error (MSE); it measures the variance that is not covered by the model
v : the overall variance of the dependent variable
u/v : the fraction of the variance that is not covered by the model
If you want the fraction (as a percentage) of variance covered by the model, take the difference between 1 and u/v:
1 - u/v : the fraction of the total variance that is covered by the model
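A small sketch of this calculation with made-up values, checked against scikit-learn's r2_score. (Note that computing u and v here as sums rather than dividing each by N leaves the ratio u/v, and hence R^2, unchanged.)

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical actual and predicted values
actual = np.array([3.0, 5.0, 7.5, 9.0])
predicted = np.array([2.8, 5.4, 7.0, 9.5])

# u: sum of squared residuals (variance not covered by the model)
u = np.sum((actual - predicted) ** 2)
# v: total sum of squares of the dependent variable
v = np.sum((actual - actual.mean()) ** 2)

print("Manual R^2:", 1 - u / v)
print("sklearn R^2:", r2_score(actual, predicted))
```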