1.Concept: Measuring the relationship between variables mathematically. Regression specifies the relation between a single dependent variable (whose value is to be predicted) and one or more numeric independent variables (predictors). The variables represent attributes in the physical world.
2.Regression methods are also used for statistical hypothesis testing, to determine whether a premise is likely to be true or false
3.Regression analysis is an umbrella covering a large number of methods, some of which include:
a.Linear regression (simple and multiple)
b.Logistic regression (for binary categorical outcome)
c.Poisson regression (based on Poisson distribution of integer / count data)
d.Multinomial logistic regression (to model categorical outcomes with more than two categories)
1.If two sets of data (X and Y) are related, i.e. the value of Y depends on the value of X, and if each unit change in X is accompanied by a corresponding constant change in Y, the relationship is known as linear correlation
2.An example of a sample data set and the plot of a "best-fit" straight line through the data.
3.The plotted points (red dots) do not lie exactly on a line; they scatter around it. The ML algorithm generates the best-fitting line (the model), which represents the relation between the two variables X and Y
4.Given this line, generated from past data, one can estimate the value of Y for any new value of X
1.The model is the line with the least sum of squared residuals
2.The line has a slope (shown as “m”) and an intercept (shown as “a”)
3.Mathematically the model is represented by the equation y = mx + a
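The least-squares line described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a definitive implementation: the function names (`fit_line`, `predict_y`, `sum_squared_residuals`) and the sample data are hypothetical and chosen only for this example. It uses the standard closed-form result that the slope is the sum of co-deviations of x and y divided by the sum of squared deviations of x, and that the line passes through the point of means.

```python
def fit_line(xs, ys):
    """Return (m, a): slope and intercept of the least-squares line."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Slope: co-deviation of x and y divided by squared deviation of x.
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    m = sxy / sxx
    # The least-squares line always passes through (x_mean, y_mean).
    a = y_mean - m * x_mean
    return m, a


def predict_y(m, a, x):
    """Estimate y for a new x using the fitted line."""
    return m * x + a


def sum_squared_residuals(xs, ys, m, a):
    """Sum of squared vertical distances between the points and the line."""
    return sum((y - predict_y(m, a, x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical sample data: roughly linear, with some scatter.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

m, a = fit_line(xs, ys)
print("slope:", m, "intercept:", a)
print("estimate for x = 6:", predict_y(m, a, 6.0))
print("sum of squared residuals:", sum_squared_residuals(xs, ys, m, a))
```

Any other line through the same data, for example one with a slightly different slope, gives a larger sum of squared residuals, which is what makes this line the "best fit".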
5.Before we generate a model, we need to understand the degree of relationship between the attributes Y and X
6.Mathematically, the correlation between two variables indicates how closely their relationship follows a straight line. By default we use Pearson’s correlation coefficient (r), which ranges between -1 and +1.
7.The extreme values of -1 and +1 indicate a perfectly linear relationship between X and Y, whereas a correlation of 0 indicates the absence of a linear relationship
a.When the r value is small, one needs to test whether it is statistically significant before believing that there is a linear relationship
1.Coefficient of correlation - Pearson’s coefficient ρ(x,y) = Cov(x,y) / (σx × σy), where σx and σy are the standard deviations of x and y
2.Generating a linear model for cases where r is near 0 makes no sense. The model will not be reliable: for a given value of X, there can be many values of Y!
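As a rough sketch of the correlation formula above, the following plain-Python code computes Pearson's coefficient as covariance divided by the product of the standard deviations. The function names and the three small data sets are hypothetical. The `t_statistic` helper is an added assumption, not something from these notes: it uses the standard t-test for a correlation coefficient, t = r * sqrt((n - 2) / (1 - r^2)), which is compared against a t distribution with n - 2 degrees of freedom to judge significance.

```python
import math


def pearson_r(xs, ys):
    """Pearson's correlation: Cov(x, y) / (StdDev(x) * StdDev(y))."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Sample covariance and sample standard deviations (divisor n - 1).
    cov = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) / (n - 1)
    sd_x = math.sqrt(sum((x - x_mean) ** 2 for x in xs) / (n - 1))
    sd_y = math.sqrt(sum((y - y_mean) ** 2 for y in ys) / (n - 1))
    return cov / (sd_x * sd_y)


def t_statistic(r, n):
    """Test statistic for H0: no linear correlation (requires |r| < 1)."""
    return r * math.sqrt((n - 2) / (1 - r * r))


# Hypothetical data sets illustrating the extreme and the zero case.
r_up = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])            # perfectly linear, rising
r_down = pearson_r([1, 2, 3, 4], [8, 6, 4, 2])          # perfectly linear, falling
r_curve = pearson_r([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4])  # y = x^2, not linear

print(r_up, r_down, r_curve)
print(t_statistic(0.5, 11))
```

The third data set is worth noting: y is fully determined by x (y = x^2), yet r comes out as 0, because Pearson's coefficient only measures the linear part of a relationship.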
a.There are a variety of errors for all those points that don’t fall exactly on the line.
b.It is important to understand these errors to judge the goodness of fit of the model, i.e. how representative the model is likely to be in general
c.Let us look at point P1 which is one of the given data points and associated errors due to the model
2.Coefficient of determination
a.Coefficient of determination, r^2, reflects the proportion of the variance of one variable that is predictable from another variable
b.It is the ratio of explained variation to the total variation
c.Coefficient of determination is such that 0 <= r^2 <= 1
d.For example, if r = 0.922 then r^2 = 0.850, so 85% of the total variation in y is captured by the linear model. The remaining 15% of the variation in y is not explained by the model
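The ratio of explained to total variation can be checked numerically. The sketch below is illustrative only: the function name `variation_breakdown` and the sample data are hypothetical. It fits a least-squares line, then splits the total variation of y around its mean into the part explained by the line and the residual part left over.

```python
def variation_breakdown(xs, ys):
    """Return (r_squared, ss_total, ss_explained, ss_residual) for a least-squares line."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    predicted = [slope * x + intercept for x in xs]

    # Total variation: actual values around the mean of y.
    ss_total = sum((y - y_mean) ** 2 for y in ys)
    # Explained variation: predicted values around the mean of y.
    ss_explained = sum((p - y_mean) ** 2 for p in predicted)
    # Residual variation: actual values around the predicted values.
    ss_residual = sum((y - p) ** 2 for y, p in zip(ys, predicted))
    return ss_explained / ss_total, ss_total, ss_explained, ss_residual


# Hypothetical sample data.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

r2, ss_total, ss_explained, ss_residual = variation_breakdown(xs, ys)
print("r^2:", r2)
print("total:", ss_total, "explained + residual:", ss_explained + ss_residual)

# The arithmetic from the example in the text: r = 0.922 gives r^2 of about 0.850.
print(round(0.922 ** 2, 3))
```

For a least-squares line the explained and residual parts add up to the total variation, which is why r^2 always lies between 0 and 1.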
Multiple linear regression involves more than one independent variable in determining the value of the dependent variable Y
Most real-world applications involve multiple independent variables. The goal is the same as in single-variable linear regression: find the fit that minimizes the residual errors by finding the appropriate beta values
1.The multiple linear regression model is represented as y = α + β1x1 + β2x2 + ... + βnxn
2.That is, the prediction is the sum of the intercept (alpha) plus the product of each attribute's value and its estimated beta value, where each beta reflects the slope of the line against that attribute
3.The coefficients (beta values) indicate how much the predicted value will change for every unit change in the given predictor, with the other predictors remaining constant
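A minimal sketch of fitting the intercept (alpha) and the beta values for several predictors is shown below. It is one possible approach, not the only one: it solves the normal equations (X^T X) b = X^T y with a small Gaussian elimination routine so that it needs no external libraries. All function names are hypothetical, and the data is made up and generated exactly from y = 1 + 2*x1 + 3*x2, so the fit should recover approximately those values.

```python
def solve_linear_system(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    # Work on an augmented copy so the inputs are left untouched.
    M = [row[:] + [b_i] for row, b_i in zip(A, b)]
    for col in range(n):
        # Pick the row with the largest entry in this column as the pivot.
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    # Back substitution.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def fit_multiple(rows, ys):
    """Return [alpha, beta_1, ..., beta_k] minimizing the sum of squared residuals."""
    # Prepend a constant 1 to each row so the first coefficient is the intercept.
    X = [[1.0] + list(row) for row in rows]
    k = len(X[0])
    XtX = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    Xty = [sum(r[i] * y for r, y in zip(X, ys)) for i in range(k)]
    return solve_linear_system(XtX, Xty)


def predict_multi(coeffs, row):
    """alpha plus the sum of beta_i * x_i."""
    alpha, betas = coeffs[0], coeffs[1:]
    return alpha + sum(b * x for b, x in zip(betas, row))


# Hypothetical data generated from y = 1 + 2*x1 + 3*x2 (no noise).
rows = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 5)]
ys = [9, 8, 19, 18, 26]

coeffs = fit_multiple(rows, ys)
print("alpha and betas:", coeffs)  # approximately [1, 2, 3]

# Raising x1 by one unit while holding x2 fixed changes the prediction by beta_1.
print(predict_multi(coeffs, (2, 2)) - predict_multi(coeffs, (1, 2)))
```

In practice a numerical library would normally be used for this step; the hand-written solver is here only to keep the example self-contained and to make the role of each beta value visible.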