
📊 Data Quality Issues in Datasets

1. Introduction

Data quality is a critical factor in any data analysis or machine learning project. Imperfect data can lead to incorrect conclusions, unstable models, and poor performance.

2. Types of Data Quality Issues (with Code Examples)

2.1 Missing Values

Missing values appear as NaN, None, empty strings, or whitespace.

import pandas as pd

# Detect missing values
print(df.isnull().sum())

# Fill missing values
df['age'] = df['age'].fillna(df['age'].mean())
df['city'] = df['city'].fillna('Unknown')

Explanation: Missing values are identified using isnull(). Numerical values can be filled with mean/median, while categorical values are often filled with mode or a placeholder.
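As noted above, missing values can also hide as empty or whitespace-only strings, which isnull() does not catch. A minimal sketch on a hypothetical DataFrame: normalize those strings to NaN first, then fill numeric columns with the median and categorical columns with the mode.

```python
import numpy as np
import pandas as pd

# Hypothetical data mixing NaN, None, empty, and whitespace-only values
df = pd.DataFrame({
    'age': [25, np.nan, 40, 31],
    'city': ['Tokyo', '', '   ', None],
})

# Empty/whitespace strings are not NaN, so convert them first
df['city'] = df['city'].replace(r'^\s*$', np.nan, regex=True)

# Median for numeric, mode (most frequent value) for categorical
df['age'] = df['age'].fillna(df['age'].median())
df['city'] = df['city'].fillna(df['city'].mode().iloc[0])

print(df.isnull().sum().sum())  # 0 — no missing values remain
```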

2.2 Invalid Values

# Remove invalid ages (e.g. negative values)
df = df[df['age'] >= 0]

Explanation: Invalid values violate domain constraints. Filtering or correcting them ensures data integrity.

2.3 Outliers

# Detect outliers using IQR
Q1 = df['age'].quantile(0.25)
Q3 = df['age'].quantile(0.75)
IQR = Q3 - Q1

# Filter out outliers
df = df[(df['age'] >= Q1 - 1.5 * IQR) & (df['age'] <= Q3 + 1.5 * IQR)]

Explanation: Outliers can distort analysis. The IQR method is commonly used to detect and remove extreme values.
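When dropping rows is too aggressive, an alternative is to cap (winsorize) values at the IQR fences so no records are lost. A sketch on hypothetical data:

```python
import pandas as pd

df = pd.DataFrame({'age': [22, 25, 27, 30, 31, 29, 120]})  # 120 is an outlier

Q1 = df['age'].quantile(0.25)
Q3 = df['age'].quantile(0.75)
IQR = Q3 - Q1
lower, upper = Q1 - 1.5 * IQR, Q3 + 1.5 * IQR

# Cap instead of drop: extreme values are pulled to the fence
df['age'] = df['age'].clip(lower=lower, upper=upper)
print(df['age'].max() <= upper)  # True
```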

2.4 Duplicate Data

# Remove duplicate rows
df = df.drop_duplicates()

Explanation: Duplicate records can bias results. drop_duplicates() removes repeated rows.
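In practice, duplicates are often defined by a key column rather than the full row. A hedged sketch (the email/visits columns are hypothetical) using the subset and keep parameters:

```python
import pandas as pd

df = pd.DataFrame({
    'email': ['a@x.com', 'a@x.com', 'b@x.com'],
    'visits': [1, 2, 3],
})

# Deduplicate on a key column, keeping the last (most recent) record
df = df.drop_duplicates(subset='email', keep='last')
print(len(df))  # 2
```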

2.5 Formatting Inconsistencies

# Remove leading/trailing spaces
df['name'] = df['name'].str.strip()

# Standardize text
df['city'] = df['city'].str.lower()

Explanation: String normalization ensures consistency in categorical variables.

2.6 Inconsistent Encoding of Values

# Standardize gender values
df['gender'] = df['gender'].replace({'M': 'Male', 'F': 'Female', 'male': 'Male', 'female': 'Female'})

Explanation: Different representations of the same value are mapped to a unified format.

2.7 Inconsistent Data Types

# Convert column to numeric
df['age'] = pd.to_numeric(df['age'], errors='coerce')

Explanation: Ensures that all values in a column follow the same data type.

2.8 Encoding Errors

# Read file with correct encoding
df = pd.read_csv('data.csv', encoding='utf-8')

Explanation: Encoding issues occur when the file encoding does not match the expected format.
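When the true encoding is unknown, one practical sketch is to try a short list of candidate encodings in order until one decodes cleanly (the helper name and candidate list here are assumptions, not a pandas API):

```python
import pandas as pd

def read_csv_with_fallback(path, encodings=('utf-8', 'cp1252', 'latin-1')):
    """Try candidate encodings in order until one decodes the file."""
    for enc in encodings:
        try:
            return pd.read_csv(path, encoding=enc)
        except UnicodeDecodeError:
            continue
    raise ValueError(f'none of {encodings} could decode {path}')
```

Note that 'latin-1' accepts any byte sequence, so it should come last as a catch-all.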

2.9 Redundant or Irrelevant Data

# Drop unnecessary columns
df = df.drop(columns=['id', 'timestamp'])

Explanation: Removing irrelevant features reduces noise and improves model performance.

2.10 Logical Inconsistencies

# Example: a birth year cannot be in the future
from datetime import datetime

current_year = datetime.now().year
df = df[df['birth_year'] <= current_year]

Explanation: Logical checks ensure relationships between variables are valid and consistent.
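Logical checks often involve relationships between two columns, not just a single bound. A minimal sketch with hypothetical birth_year/signup_year columns:

```python
import pandas as pd

df = pd.DataFrame({
    'birth_year': [1990, 1985, 2010],
    'signup_year': [2015, 1980, 2020],  # 1980 predates the birth year
})

# A record is consistent only if signup happened after birth
df = df[df['signup_year'] >= df['birth_year']]
print(len(df))  # 2
```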

3. Conclusion

Handling data quality issues is essential for building reliable datasets and models. Each type of issue requires specific preprocessing techniques.
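The individual techniques above can be chained into a single cleaning step. A minimal pipeline sketch, assuming a hypothetical schema with name and age columns:

```python
import pandas as pd

def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Combine several of the fixes above into one pass (hypothetical schema)."""
    df = df.copy()
    df['name'] = df['name'].str.strip()                    # 2.5 formatting
    df['age'] = pd.to_numeric(df['age'], errors='coerce')  # 2.7 data types
    df = df[df['age'].isna() | (df['age'] >= 0)]           # 2.2 invalid values
    df['age'] = df['age'].fillna(df['age'].median())       # 2.1 missing values
    df = df.drop_duplicates()                              # 2.4 duplicates
    return df

raw = pd.DataFrame({
    'name': [' Ana ', 'Bob', 'Bob', 'Cleo'],
    'age': ['25', '30', '30', 'n/a'],
})
cleaned = clean(raw)
print(cleaned)
```

The order matters: type conversion must run before the numeric range check, and missing-value filling before deduplication would otherwise hide distinct records.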