Outliers

Introduction

Outliers are data points that deviate significantly from the rest of the dataset. If not handled properly, they can distort statistical analyses, degrade model performance, and lead to misleading conclusions. In this article, we explore what outliers are, their types, methods to detect them, real-world examples, and how to deal with them in machine learning.


What Are Outliers?

An outlier is an observation that lies far away from other data points in a dataset. Outliers can arise due to variability in data, errors in data collection, or rare occurrences that have significant impact.

Types of Outliers

  1. Univariate Outliers – Extreme values in a single variable (e.g., a person with a height of 8 feet in a dataset where most heights range between 5 and 6 feet).
  2. Multivariate Outliers – Data points that are unusual when considering multiple features together (e.g., an individual with a very high salary but low experience level).
  3. Global vs. Local Outliers
    • Global outliers deviate significantly from the entire dataset.
    • Local outliers are extreme only relative to a specific subset or neighborhood of the data, even if they fall within the overall range of the dataset.
  4. Point vs. Collective Outliers
    • Point outliers are single, extreme values.
    • Collective outliers are a group of values behaving differently from the majority.

Real-World Examples of Outliers

1. Financial Transactions (Fraud Detection)

  • A sudden, extremely high transaction amount could indicate fraud.
  • Example: If most transactions range between $10-$500, but one transaction is $50,000, it is an outlier.

2. Healthcare (Medical Diagnosis)

  • An unusually high blood pressure reading in a patient population.
  • Example: If normal blood pressure ranges between 90/60 mmHg and 120/80 mmHg, a reading of 200/140 mmHg is an outlier.

3. Manufacturing (Quality Control)

  • A defective product with extreme deviations from standard size or weight.
  • Example: If a factory produces bolts that weigh between 50-55g, and one bolt weighs 80g, it is an outlier.

4. Sports Analytics

  • A basketball player scoring 100 points in a game while most players score around 20-30.

5. Web Traffic Analysis

  • A sudden spike in website traffic could indicate a viral event or a cyber attack.

How to Detect Outliers?

1. Visualization Techniques

Box Plot

  • A box plot displays the distribution of data and highlights extreme values.
  • Outliers appear as points outside the whiskers of the box plot.

Scatter Plot

  • A scatter plot visualizes relationships between two variables.
  • Outliers appear as points far from the cluster of data.

Histogram

  • A histogram shows the frequency distribution of a dataset.
  • Outliers appear as isolated bins at the extreme ends, separated from the bulk of the distribution.

2. Statistical Methods

Z-Score (Standard Score Method)

  • Measures how many standard deviations a data point is from the mean.
  • Formula: Z = (x - μ) / σ, where μ is the mean and σ is the standard deviation of the data.
  • If |Z| > 3, the data point is commonly considered an outlier.
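The Z-score rule can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the function name `zscore_outliers` and the sample transaction amounts are made up for this example, and the threshold of 3 is the rule of thumb mentioned above.

```python
import statistics

def zscore_outliers(values, threshold=3.0):
    """Return the values whose absolute Z-score exceeds the threshold."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # sample standard deviation
    if std == 0:
        return []  # all values identical, nothing can be an outlier
    return [x for x in values if abs((x - mean) / std) > threshold]

# Hypothetical transaction amounts: 20 ordinary values and one extreme value
amounts = [10, 25, 40, 55, 70, 90, 120, 150, 180, 200,
           220, 250, 280, 310, 340, 370, 400, 430, 460, 500, 50_000]

flagged = zscore_outliers(amounts)
print(flagged)
```

Note that the outlier itself inflates both the mean and the standard deviation, so in very small samples no point may reach |Z| > 3 even when one value is clearly extreme.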

Interquartile Range (IQR) Method

  • Measures data spread using quartiles.
  • Formula: IQR = Q3 - Q1, where Q1 and Q3 are the first and third quartiles.
    • Any data point < Q1 - 1.5 * IQR or > Q3 + 1.5 * IQR is considered an outlier.
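As a rough sketch, the IQR rule can be applied with Python's standard `statistics` module. The function name `iqr_outliers` and the bolt weights are illustrative assumptions; also note that different quartile conventions can give slightly different bounds on small samples.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [x for x in values if x < lower or x > upper]

# Hypothetical bolt weights in grams, with one suspicious measurement
weights = [50.2, 51.0, 51.5, 52.0, 52.3, 52.8, 53.1, 53.6, 54.0, 54.7, 80.0]

outliers = iqr_outliers(weights)
print(outliers)
```

Unlike the Z-score, the quartiles are barely affected by the extreme value itself, which is why this rule tends to behave better on skewed or contaminated data.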

3. Machine Learning Methods

Isolation Forest

  • A tree-based model that isolates anomalies efficiently.
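One possible way to try this out is with scikit-learn's `IsolationForest`, assuming scikit-learn and NumPy are installed. The data and the `contamination` value (the expected share of anomalies) are assumptions chosen for illustration, not recommended settings.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Hypothetical transaction amounts; the last one is far from the rest
amounts = np.array([10, 25, 40, 55, 70, 90, 120, 150, 180, 200,
                    220, 250, 280, 310, 340, 370, 400, 430, 460, 500,
                    50_000], dtype=float).reshape(-1, 1)

# contamination is an assumed fraction of anomalies in the data
model = IsolationForest(n_estimators=100, contamination=0.05, random_state=0)
labels = model.fit_predict(amounts)  # -1 = anomaly, 1 = normal

anomalies = amounts[labels == -1].ravel()
print(anomalies)
```

Points that can be separated from the rest with very few random splits receive low scores, which is what makes the extreme transaction stand out here.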

DBSCAN (Density-Based Clustering)

  • Identifies outliers by clustering dense regions and marking sparse points as anomalies.
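A small sketch with scikit-learn's `DBSCAN` (assuming scikit-learn and NumPy are available) shows the idea. The bolt weights and the `eps` and `min_samples` values are illustrative assumptions; in practice they need tuning to the scale of the data.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical bolt weights in grams, with one isolated measurement
weights = np.array([50.2, 51.0, 51.5, 52.0, 52.3, 52.8,
                    53.1, 53.6, 54.0, 54.7, 80.0]).reshape(-1, 1)

# eps: neighborhood radius; min_samples: points needed to form a dense region
labels = DBSCAN(eps=1.0, min_samples=3).fit_predict(weights)

# DBSCAN assigns the label -1 to points that belong to no cluster (noise)
noise = weights[labels == -1].ravel()
print(noise)
```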

One-Class SVM (Support Vector Machine)

  • Learns a decision boundary around normal data and flags points outside it as outliers.
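The following sketch uses scikit-learn's `OneClassSVM`, assuming scikit-learn and NumPy are installed. The model is fitted on data assumed to be normal and then asked about new points; the weights and the `nu` value are made up for this example.

```python
import numpy as np
from sklearn.svm import OneClassSVM

# Hypothetical "normal" bolt weights used to learn the boundary
normal_weights = np.array([50.2, 51.0, 51.5, 52.0, 52.3, 52.8,
                           53.1, 53.6, 54.0, 54.7]).reshape(-1, 1)

# nu is an assumed upper bound on the fraction of training points treated as outliers
model = OneClassSVM(kernel="rbf", gamma="scale", nu=0.1).fit(normal_weights)

# New measurements to check: one typical weight and one extreme weight
new_weights = np.array([[52.5], [80.0]])
predictions = model.predict(new_weights)  # -1 = outlier, 1 = inlier
print(predictions)
```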

How to Handle Outliers?

1. Removing Outliers

  • Appropriate when the outlier results from a measurement or data entry error rather than a genuine observation.
  • Example: A data entry mistake listing a person’s age as 500 years.
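For an error like this, removal can be as simple as filtering on a plausible range. The sketch below is illustrative only: the function name `remove_implausible_ages` and the assumed valid range of 0 to 120 years are not from any particular library or standard.

```python
def remove_implausible_ages(ages, min_age=0, max_age=120):
    """Keep only ages inside the assumed plausible range [min_age, max_age]."""
    return [age for age in ages if min_age <= age <= max_age]

# Hypothetical survey data containing one data entry mistake
ages = [23, 35, 41, 500, 29, 62]

clean_ages = remove_implausible_ages(ages)
print(clean_ages)  # [23, 35, 41, 29, 62]
```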

2. Transforming Data

  • Applying log transformation or scaling can reduce the impact of outliers.
  • Example: Converting skewed salary data using log transformation.
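To see the effect, the sketch below compares how far the largest salary sits from the median before and after a base-10 log transformation. The salary figures are invented for illustration, and the ratio of maximum to median is just one simple way to show the compression.

```python
import math
import statistics

# Hypothetical right-skewed salaries with one very large value
salaries = [30_000, 35_000, 42_000, 50_000, 58_000,
            65_000, 80_000, 100_000, 1_000_000]

# Log transformation compresses large values much more than small ones
log_salaries = [math.log10(s) for s in salaries]

ratio_before = max(salaries) / statistics.median(salaries)
ratio_after = max(log_salaries) / statistics.median(log_salaries)
print(round(ratio_before, 2), round(ratio_after, 2))
```

The transformation requires strictly positive values; data containing zeros is often shifted first, for example with `math.log1p`.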

3. Using Robust Models

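  • Tree-based models such as Decision Trees and Random Forests are generally less sensitive to outliers in the input features than linear models, because they split on thresholds rather than fitting to the raw magnitude of each value.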

4. Capping or Flooring Values

  • Setting a threshold for extreme values.
  • Example: If most salaries range between $30k-$100k, any salary above $150k is replaced with $150k.
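Capping and flooring can be sketched as a simple clamp. The helper name `cap_values` and the salary list are hypothetical; the $150k ceiling follows the example above.

```python
def cap_values(values, lower=None, upper=None):
    """Clamp each value into [lower, upper]; a bound of None is not applied."""
    capped = []
    for v in values:
        if lower is not None and v < lower:
            v = lower  # flooring
        if upper is not None and v > upper:
            v = upper  # capping
        capped.append(v)
    return capped

# Hypothetical salaries with one extreme value
salaries = [30_000, 45_000, 60_000, 75_000, 100_000, 400_000]

capped = cap_values(salaries, upper=150_000)
print(capped)  # [30000, 45000, 60000, 75000, 100000, 150000]
```

Unlike removal, this keeps every record in the dataset and only limits how far an extreme value can pull on the analysis.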

Conclusion

Outliers are crucial in data analysis and machine learning, as they can indicate significant events, errors, or anomalies. Understanding their types, detecting them using statistical and ML-based approaches, and handling them appropriately helps improve model performance and leads to more reliable data analysis.

Information shared by: THYAGU