Correlation

We define correlation as

$$corr(x,y) = \frac{cov(x,y)}{\sigma_x\sigma_y}$$

$$\begin{aligned} cov(x,y) &= E[(x-\mu_x)(y-\mu_y)] \\ &= E(xy - x\mu_y - \mu_x y + \mu_x\mu_y) \\ &= E(xy) - \mu_x E(y) - \mu_y E(x) + E(\mu_x\mu_y) \\ &= E(xy) - \mu_x\mu_y - \mu_y\mu_x + \mu_x\mu_y \\ &= E(xy) - \mu_x\mu_y \end{aligned}$$
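As a quick numerical check of the final identity (my own sketch, not part of the original derivation), the two expressions can be compared directly in NumPy:

```python
import numpy as np

# Verify cov(x, y) = E[(x - mu_x)(y - mu_y)] = E(xy) - mu_x * mu_y
# using population (divide-by-n) moments on synthetic data.
rng = np.random.default_rng(0)
x = rng.normal(size=1_000)
y = 0.5 * x + rng.normal(size=1_000)

cov_definition = np.mean((x - x.mean()) * (y - y.mean()))  # E[(x - mu_x)(y - mu_y)]
cov_identity = np.mean(x * y) - x.mean() * y.mean()        # E(xy) - mu_x * mu_y

print(cov_definition, cov_identity)  # equal up to floating-point error
```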

Correlation is a unit-free measure of the relationship between two variables. While correlation coefficients lie between -1 and +1, covariance can take any value between -∞ and +∞.

Pearson Correlation

Pearson correlation measures whether two features are linearly dependent:

$$\rho_{xy} = \frac{cov(x,y)}{\sigma_x \sigma_y} = \frac{E[(x-\mu_x)(y-\mu_y)]}{\sigma_x \sigma_y}$$

If the data is centred, then the Pearson correlation equals the cosine of the angle between the two vectors:

$$\cos(\theta)=\frac{a\cdot b}{|a||b|}$$
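To see the connection concretely, here is a small sketch (synthetic data, my own example) comparing `np.corrcoef` with the cosine of the centred vectors:

```python
import numpy as np

# After centring, the Pearson correlation and the cosine between the
# two vectors are the same quantity.
rng = np.random.default_rng(1)
a = rng.normal(size=500)
b = 0.7 * a + rng.normal(size=500)

a_c = a - a.mean()  # centre both variables
b_c = b - b.mean()

pearson = np.corrcoef(a, b)[0, 1]
cosine = (a_c @ b_c) / (np.linalg.norm(a_c) * np.linalg.norm(b_c))

print(pearson, cosine)  # essentially identical values
```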

R-Squared

Wikipedia defines R^2 like this: "… is the proportion of the variance in the dependent variable that is predictable from the independent variable(s)." Another definition is:

$$R^2 = 1 - \frac{\text{variance unexplained by the model}}{\text{total variance}} = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$$

Notice that when we set all our predicted values to the mean of the training data, $R^2$ is 0, and if our predictions are worse than the mean, $R^2$ will be negative.
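A minimal sketch of this behaviour (toy numbers of my own, using scikit-learn's `r2_score`):

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([3.0, 1.0, 4.0, 1.0, 5.0])

# Predicting the mean of the targets for every point gives R^2 = 0.
pred_mean = np.full_like(y_true, y_true.mean())
print(r2_score(y_true, pred_mean))   # 0.0

# Predictions worse than just using the mean give a negative R^2.
pred_bad = np.array([10.0, -5.0, 12.0, -4.0, 15.0])
print(r2_score(y_true, pred_bad))    # negative value
```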

Link to the paper

Generally it is better to look at adjusted R-squared rather than R-squared and to look at the standard error of the regression rather than the standard deviation of the errors. These are unbiased estimators that correct for the sample size and numbers of coefficients estimated. Adjusted R-squared is always smaller than R-squared, but the difference is usually very small unless you are trying to estimate too many coefficients from too small a sample in the presence of too much noise. Specifically, adjusted R-squared is equal to 1 minus (n - 1)/(n – k - 1) times 1-minus-R-squared, where n is the sample size and k is the number of independent variables. (It is possible that adjusted R-squared is negative if the model is too complex for the sample size and/or the independent variables have too little predictive value, and some software just reports that adjusted R-squared is zero in that case.) Adjusted R-squared bears the same relation to the standard error of the regression that R-squared bears to the standard deviation of the errors: one necessarily goes up when the other goes down for models fitted to the same sample of the same dependent variable.
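The adjusted R-squared formula quoted above is easy to compute directly; the numbers below are purely illustrative:

```python
def adjusted_r2(r2: float, n: int, k: int) -> float:
    """Adjusted R-squared: 1 - (n - 1)/(n - k - 1) * (1 - R^2),
    where n is the sample size and k the number of independent variables."""
    return 1 - (n - 1) / (n - k - 1) * (1 - r2)

# A modest R^2 shrinks noticeably when many coefficients are estimated
# from a small sample.
print(adjusted_r2(r2=0.45, n=30, k=10))  # about 0.16
```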

So, what IS a good value for R-squared? It depends on the variable with respect to which you measure it, it depends on the units in which that variable is measured and whether any data transformations have been applied, and it depends on the decision-making context. If the dependent variable is a nonstationary (e.g., trending or random-walking) time series, an R-squared value very close to 1 (such as the 97% figure obtained in the first model above) may not be very impressive. In fact, if R-squared is very close to 1, and the data consists of time series, this is usually a bad sign rather than a good one: there will often be significant time patterns in the errors, as in the example above. On the other hand, if the dependent variable is a properly stationarized series (e.g., differences or percentage differences rather than levels), then an R-squared of 25% may be quite good. In fact, an R-squared of 10% or even less could have some information value when you are looking for a weak signal in the presence of a lot of noise in a setting where even a very weak one would be of general interest. Sometimes there is a lot of value in explaining only a very small fraction of the variance, and sometimes there isn’t. Data transformations such as logging or deflating also change the interpretation and standards for R-squared, inasmuch as they change the variance you start out with.

However, be very careful when evaluating a model with a low value of R-squared. In such a situation: (i) it is better if the set of variables in the model is determined a priori (as in the case of a designed experiment or a test of a well-posed hypothesis) rather than by searching among a lineup of randomly selected suspects; (ii) the data should be clean (not contaminated by outliers, inconsistent measurements, or ambiguities in what is being measured, as in the case of poorly worded surveys given to unmotivated subjects); (iii) the coefficient estimates should be individually or at least jointly significantly different from zero (as measured by their P-values and/or the P-value of the F statistic), which may require a large sample to achieve in the presence of low correlations; and (iv) it is a good idea to do cross-validation (out-of-sample testing) to see if the model performs about equally well on data that was not used to identify or estimate it, particularly when the structure of the model was not known a priori. It is easy to find spurious (accidental) correlations if you go on a fishing expedition in a large pool of candidate independent variables while using low standards for acceptance. I have often had students use this approach to try to predict stock returns using regression models–which I do not recommend–and it is not uncommon for them to find models that yield R-squared values in the range of 5% to 10%, but they virtually never survive out-of-sample testing. (You should buy index funds instead.)
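As an illustration of point (iv), the sketch below (synthetic data and model choice are my own assumptions, not from the original text) compares in-sample R-squared with cross-validated R-squared for a regression with many weak predictors:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 20))             # many candidate predictors
y = 0.1 * X[:, 0] + rng.normal(size=100)   # only one weak true signal plus noise

model = LinearRegression().fit(X, y)
print("in-sample R^2:      ", model.score(X, y))
print("cross-validated R^2:", cross_val_score(model, X, y, cv=5, scoring="r2").mean())
# The out-of-sample score is typically far lower (often negative),
# exposing the overfitting that the in-sample R^2 conceals.
```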

Author: shixuan liu
Link: http://tedlsx.github.io/2019/10/17/correlation/
Copyright Notice: All articles in this blog are licensed under CC BY-NC-SA 4.0 unless stated otherwise.