Robust Statistics for Data Scientists Part 2: Resilient Measures of Relationships Between Variables | by Alessandro Tomassini | Mar, 2024



From basic to advanced techniques for outlier-rich data analysis.



Grasping the interconnections among variables is essential for making data-driven decisions. When we accurately evaluate these links, we bolster the trustworthiness and legitimacy of our findings, crucial in both scholarly and practical contexts.

Data scientists frequently turn to Pearson’s correlation and linear regression to probe and measure variable relationships. These methods presume data normality, independence, and consistent spread (or homoscedasticity) and perform well when these conditions are met. However, real-world data scenarios are seldom ideal. They’re typically marred by noise and outliers, which can skew the results of traditional statistical techniques, leading to incorrect conclusions. This piece, the second in our series on robust statistics, seeks to navigate these obstacles by delving into robust alternatives that promote more dependable insights, even amidst data irregularities.

In case you have missed the first part:

Pearson’s correlation measures the strength of the linear association between two continuous variables on a scale from -1 (perfect inverse relationship) through 0 (no discernible relationship) to +1 (perfect direct relationship). The method assumes that both variables are normally distributed and linearly related. Crucially, Pearson’s correlation is highly sensitive to outliers: even a single extreme observation can substantially skew the estimated coefficient, producing a misleading picture of the strength, or even the existence, of the relationship.
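A minimal sketch of this sensitivity, using `scipy.stats.pearsonr` on synthetic data (the sample sizes and outlier coordinates below are illustrative choices, not from the article):

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(42)

# 50 points with a strong linear relationship
x = rng.normal(size=50)
y = 2 * x + rng.normal(scale=0.5, size=50)
r_clean, _ = pearsonr(x, y)

# Add a single extreme, discordant outlier
x_out = np.append(x, 10.0)
y_out = np.append(y, -10.0)
r_out, _ = pearsonr(x_out, y_out)

print(f"Pearson r without outlier: {r_clean:.2f}")
print(f"Pearson r with one outlier: {r_out:.2f}")
```

One contaminated observation out of fifty is enough to collapse a near-perfect correlation toward zero, which is exactly the fragility that motivates the robust alternatives discussed in this series.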
