Data quality is a critical factor in any data analysis or machine learning project. Imperfect data can lead to incorrect conclusions, unstable models, and poor performance.
Missing values appear as NaN, None, empty strings, or whitespace.
import pandas as pd
# Detect missing values
print(df.isnull().sum())
# Fill missing values
df['age'] = df['age'].fillna(df['age'].mean())
df['city'] = df['city'].fillna('Unknown')
Explanation: Missing values are identified using isnull(). Numerical values can be filled with mean/median, while categorical values are often filled with mode or a placeholder.
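For a runnable illustration of this pattern, here is a minimal sketch on a toy DataFrame (the column names and values are made up for the example):

```python
import pandas as pd
import numpy as np

# Toy DataFrame with gaps in a numeric and a categorical column
df = pd.DataFrame({'age': [25, np.nan, 40],
                   'city': ['Oslo', None, 'Bergen']})

# Count missing values per column
print(df.isnull().sum())

# Fill numeric gaps with the mean, categorical gaps with a placeholder
df['age'] = df['age'].fillna(df['age'].mean())
df['city'] = df['city'].fillna('Unknown')
```

After filling, the age column becomes [25.0, 32.5, 40.0] and the missing city is 'Unknown'.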
# Remove invalid ages (e.g. negative values)
df = df[df['age'] >= 0]
Explanation: Invalid values violate domain constraints. Filtering or correcting them ensures data integrity.
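To make the filtering step concrete, a small self-contained example with made-up values:

```python
import pandas as pd

# Toy data containing an impossible negative age
df = pd.DataFrame({'age': [25, -3, 40]})

# Enforce the domain constraint age >= 0 by filtering
df = df[df['age'] >= 0].reset_index(drop=True)
print(df['age'].tolist())  # [25, 40]
```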
# Detect outliers using IQR
Q1 = df['age'].quantile(0.25)
Q3 = df['age'].quantile(0.75)
IQR = Q3 - Q1
# Filter out outliers
df = df[(df['age'] >= Q1 - 1.5 * IQR) & (df['age'] <= Q3 + 1.5 * IQR)]
Explanation: Outliers can distort analysis. The IQR method is commonly used to detect and remove extreme values.
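A self-contained version of the IQR fence on toy data (values chosen so one point falls outside the fences):

```python
import pandas as pd

# Toy ages with one extreme value
df = pd.DataFrame({'age': [22, 24, 25, 26, 27, 28, 95]})

Q1 = df['age'].quantile(0.25)
Q3 = df['age'].quantile(0.75)
IQR = Q3 - Q1

# Keep only rows inside the 1.5 * IQR fences
mask = (df['age'] >= Q1 - 1.5 * IQR) & (df['age'] <= Q3 + 1.5 * IQR)
df = df[mask].reset_index(drop=True)
```

Here Q1 = 24.5 and Q3 = 27.5, so the fences are [20.0, 32.0] and only the value 95 is dropped. The 1.5 multiplier is a convention, not a law; widening it to 3.0 flags only extreme outliers.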
# Remove duplicate rows
df = df.drop_duplicates()
Explanation: Duplicate records can bias results. drop_duplicates() removes repeated rows.
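A minimal sketch with made-up rows, showing that the first copy of a duplicate is kept by default:

```python
import pandas as pd

# Toy data with an exact repeated row
df = pd.DataFrame({'name': ['Ann', 'Bob', 'Ann'],
                   'age': [30, 41, 30]})

# Drop fully identical rows; keep='first' (the default) retains the first copy
df = df.drop_duplicates().reset_index(drop=True)
print(len(df))  # 2
```

If rows should be unique on a key rather than on every column, pass a subset, e.g. df.drop_duplicates(subset=['name']).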
# Remove leading/trailing spaces
df['name'] = df['name'].str.strip()
# Standardize text
df['city'] = df['city'].str.lower()
Explanation: String normalization ensures consistency in categorical variables.
# Standardize gender values
df['gender'] = df['gender'].replace({'M': 'Male', 'F': 'Female', 'male': 'Male', 'female': 'Female'})
Explanation: Different representations of the same value are mapped to a unified format.
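A runnable version of this mapping on toy values (the variants listed are illustrative; a real dataset may contain others):

```python
import pandas as pd

# Toy column with mixed encodings of the same category
df = pd.DataFrame({'gender': ['M', 'female', 'Male', 'F']})

# Map every known variant to one canonical label;
# values not in the mapping (here 'Male') pass through unchanged
mapping = {'M': 'Male', 'male': 'Male', 'F': 'Female', 'female': 'Female'}
df['gender'] = df['gender'].replace(mapping)
print(df['gender'].tolist())  # ['Male', 'Female', 'Male', 'Female']
```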
# Convert column to numeric
df['age'] = pd.to_numeric(df['age'], errors='coerce')
Explanation: Ensures that all values in a column follow the same data type.
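The errors='coerce' behavior is easiest to see on a toy column that mixes numbers with an unparseable string:

```python
import pandas as pd

# Toy column stored as strings, with one non-numeric entry
df = pd.DataFrame({'age': ['25', '40', 'unknown']})

# errors='coerce' turns unparseable entries into NaN instead of raising
df['age'] = pd.to_numeric(df['age'], errors='coerce')
print(df['age'].isnull().sum())  # 1
```

The resulting NaN can then be handled with the missing-value techniques above.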
# Read file with correct encoding
df = pd.read_csv('data.csv', encoding='utf-8')
Explanation: Encoding issues occur when the file encoding does not match the expected format.
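When the true encoding is unknown, a common pragmatic pattern is to try a short list of likely encodings in order. The helper below is a sketch of that idea (the function name and encoding list are illustrative, not a pandas API):

```python
import pandas as pd

def read_csv_with_fallback(path, encodings=('utf-8', 'latin-1')):
    """Try each encoding in order and return the first successful read."""
    last_err = None
    for enc in encodings:
        try:
            return pd.read_csv(path, encoding=enc)
        except UnicodeDecodeError as err:
            last_err = err
    raise last_err
```

latin-1 accepts any byte sequence, so placing it last guarantees a result, though it may silently misread text that was actually in another encoding.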
# Drop unnecessary columns
df = df.drop(columns=['id', 'timestamp'])
Explanation: Removing irrelevant features reduces noise and improves model performance.
# Example: birth date cannot be in the future
from datetime import datetime
current_year = datetime.now().year
df = df[df['birth_year'] <= current_year]
Explanation: Logical checks ensure relationships between variables are valid and consistent.
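A self-contained version of this check on made-up rows, one of which violates the rule:

```python
import pandas as pd
from datetime import datetime

current_year = datetime.now().year

# Toy data with one birth year in the future
df = pd.DataFrame({'birth_year': [1990, current_year + 5, 2001]})

# Cross-field sanity check: a birth year cannot lie in the future
df = df[df['birth_year'] <= current_year].reset_index(drop=True)
print(len(df))  # 2
```

The same pattern extends to multi-column rules, e.g. requiring an end date to be no earlier than its start date.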
Handling data quality issues is essential for building reliable datasets and models. Each type of issue requires specific preprocessing techniques.