Data Cleaning - Pandas Cleans “Dirty” Data
Overview
- Understand the Data
- Analyze Data Issues
- Clean the Data
- Integrate the Code
1. Understand the Data
Before processing any data, our primary task is to understand the data and its purpose. We need to examine the columns/rows, records, data formats, semantic errors, missing entries, and incorrect formats. This helps us identify the “cleaning” work required before data analysis.
For this tutorial, we use a small CSV file patient_heart_rate.csv (Link: https://pan.baidu.com/s/1geX8oYf Password: odj0) that is easy to overview. The data describes the heart rate of different individuals at various times, with columns including age, weight, gender, and heart rates at different time intervals.
```python
import pandas as pd

df = pd.read_csv('../data/patient_heart_rate.csv')
df
```

Initial Data Preview:
|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | Mickey Mouse | 56.0 | 70kgs | 72 | 69 | 71 | NaN | NaN | NaN |
| 1 | 2.0 | Donald Duck | 34.0 | 154.89lbs | NaN | NaN | NaN | 85 | 84 | 76 |
| 2 | 3.0 | Mini Mouse | 16.0 | NaN | NaN | NaN | NaN | 65 | 69 | 72 |
| 3 | 4.0 | Scrooge McDuck | NaN | 78kgs | 78 | 79 | 72 | "-" | NaN | NaN |
| 4 | 5.0 | Pink Panther | 54.0 | 198.658lbs | NaN | ".." | NaN | 69 | NaN | 75 |
| 5 | 6.0 | Huey McDuck | ":52.0" | 189lbs | ".-" | NaN | "-" | 68 | 75 | 72 |
| 6 | 7.0 | Dewey McDuck | 19.0 | 56kgs | NaN | NaN | NaN | 71 | 78 | 75 |
| 7 | 8.0 | Scoopy Doo | 32.0 | 78kgs | 78 | 76 | 75 | NaN | NaN | NaN |
| 8 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 9 | 9.0 | Huey McDuck | 52.0 | 189lbs | NaN | NaN | NaN | 68 | 75 | 72 |
| 10 | 10.0 | Louie McDuck | 12.0 | 45kgs | NaN | NaN | NaN | 92 | 95 | 87 |
2. Analyze Data Issues
After examining the data, we identify the following key issues:
- No column headers
- A single column contains multiple parameters (name combines first and last names)
- Inconsistent units in the weight column (mixed “kgs” and “lbs”)
- Missing values (NaN)
- Empty rows (all values are NaN except the index)
- Duplicate records
- Non-ASCII characters in names
- Some column headers should be data (gender-time combinations like “m0006” should be split into gender and time range)
3. Clean the Data
We address each issue step by step:
3.1 Add Custom Column Headers
Pandas allows custom column headers when reading CSV files. We define meaningful headers to replace the default numeric columns:
```python
# Define custom column names
column_names = ['id', 'name', 'age', 'weight', 'm0006', 'm0612',
                'm1218', 'f0006', 'f0612', 'f1218']
df = pd.read_csv('../data/patient_heart_rate.csv', names=column_names)
```

|   | id | name | age | weight | m0006 | m0612 | m1218 | f0006 | f0612 | f1218 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | Mickéy Mousé | 56.0 | 70kgs | 72 | 69 | 71 | - | - | - |
| 1 | 2.0 | Donald Duck | 34.0 | 154.89lbs | - | - | - | 85 | 84 | 76 |
| 2 | 3.0 | Mini Mouse | 16.0 | NaN | - | - | - | 65 | 69 | 72 |
| 3 | 4.0 | Scrooge McDuck | NaN | 78kgs | 78 | 79 | 72 | - | - | - |
| 4 | 5.0 | Pink Panther | 54.0 | 198.658lbs | - | - | - | 69 | NaN | 75 |
3.2 Split Combined Column (Name)
The name column contains both first and last names. We split it into two separate columns and drop the original:
```python
# Split name into first_name and last_name
df[['first_name', 'last_name']] = df['name'].str.split(expand=True)
df.drop('name', axis=1, inplace=True)
```

|   | id | age | weight | m0006 | m0612 | m1218 | f0006 | f0612 | f1218 | first_name | last_name |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 56.0 | 70kgs | 72 | 69 | 71 | - | - | - | Mickëy | Mousë |
| 1 | 2.0 | 34.0 | 154.89lbs | - | - | - | 85 | 84 | 76 | Donald | Duck |
| 2 | 3.0 | 16.0 | NaN | - | - | - | 65 | 69 | 72 | Mini | Mouse |
| 3 | 4.0 | NaN | 78kgs | 78 | 79 | 72 | - | - | - | Scrooge | McDuck |
| 4 | 5.0 | 54.0 | 198.658lbs | - | - | - | 69 | NaN | 75 | Pink | Panther |
| 5 | 6.0 | 52.0 | 189lbs | - | - | - | 68 | 75 | 72 | Huey | McDuck |
| 6 | 7.0 | 19.0 | 56kgs | - | - | - | 71 | 78 | 75 | Dewey | McDuck |
| 7 | 8.0 | 32.0 | 78kgs | 78 | 76 | 75 | - | - | - | Scööpy | Doo |
| 8 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 9 | 9.0 | 52.0 | 189lbs | - | - | - | 68 | 75 | 72 | Huey | McDuck |
| 10 | 10.0 | 12.0 | 45kgs | - | - | - | 92 | 95 | 87 | Louie | McDuck |
3.3 Standardize Weight Units
The weight column uses mixed units ("kgs" and "lbs"). We convert all values to kgs (1 lb ≈ 0.4536 kg; equivalently 1 kg ≈ 2.2 lbs, which is the factor used below):
```python
# Identify rows with "lbs" units
rows_with_lbs = df['weight'].str.contains('lbs').fillna(False)
df[rows_with_lbs]
```

|   | id | age | weight | m0006 | m0612 | m1218 | f0006 | f0612 | f1218 | first_name | last_name |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 2.0 | 34.0 | 154.89lbs | - | - | - | 85 | 84 | 76 | Donald | Duck |
| 4 | 5.0 | 54.0 | 198.658lbs | - | - | - | 69 | NaN | 75 | Pink | Panther |
| 5 | 6.0 | 52.0 | 189lbs | - | - | - | 68 | 75 | 72 | Huey | McDuck |
| 9 | 9.0 | 52.0 | 189lbs | - | - | - | 68 | 75 | 72 | Huey | McDuck |
```python
# Convert lbs to kgs
for i, lbs_row in df[rows_with_lbs].iterrows():
    # Extract the numeric value and convert (simplified factor: 2.2 lbs per kg)
    weight_lbs = float(lbs_row['weight'][:-3])
    weight_kgs = int(weight_lbs / 2.2)
    df.at[i, 'weight'] = f"{weight_kgs}kgs"
```

|   | id | age | weight | m0006 | m0612 | m1218 | f0006 | f0612 | f1218 | first_name | last_name |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 56.0 | 70kgs | 72 | 69 | 71 | - | - | - | Mickëy | Mousë |
| 1 | 2.0 | 34.0 | 70kgs | - | - | - | 85 | 84 | 76 | Donald | Duck |
| 2 | 3.0 | 16.0 | NaN | - | - | - | 65 | 69 | 72 | Mini | Mouse |
| 3 | 4.0 | NaN | 78kgs | 78 | 79 | 72 | - | - | - | Scrooge | McDuck |
| 4 | 5.0 | 54.0 | 90kgs | - | - | - | 69 | NaN | 75 | Pink | Panther |
| 5 | 6.0 | 52.0 | 85kgs | - | - | - | 68 | 75 | 72 | Huey | McDuck |
| 6 | 7.0 | 19.0 | 56kgs | - | - | - | 71 | 78 | 75 | Dewey | McDuck |
| 7 | 8.0 | 32.0 | 78kgs | 78 | 76 | 75 | - | - | - | Scööpy | Doo |
| 8 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 9 | 9.0 | 52.0 | 85kgs | - | - | - | 68 | 75 | 72 | Huey | McDuck |
| 10 | 10.0 | 12.0 | 45kgs | - | - | - | 92 | 95 | 87 | Louie | McDuck |
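As a side note, the row-by-row loop can also be written in vectorized form. A minimal sketch (the small frame here is a hypothetical stand-in for the real weight column), using the same simplified factor of 2.2 lbs per kg:

```python
import pandas as pd

# Hypothetical stand-in for the weight column
df = pd.DataFrame({'weight': ['70kgs', '154.89lbs', None, '189lbs']})

# Pull apart the numeric value and the unit in one pass
parts = df['weight'].str.extract(r'(?P<value>[\d.]+)(?P<unit>kgs|lbs)')
value = parts['value'].astype(float)
is_lbs = parts['unit'] == 'lbs'
value[is_lbs] = (value[is_lbs] / 2.2).astype(int)  # same simplified factor as the loop

# Write back as "<n>kgs" strings, leaving missing weights untouched
df.loc[df['weight'].notna(), 'weight'] = value.dropna().astype(int).astype(str) + 'kgs'
```

Vectorized string operations avoid `iterrows()`, which gets slow on large frames.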
3.4 Handle Missing Values
Missing values are common in datasets. Typical handling methods include:
- Deleting rows with critical missing values
- Replacing with default values (e.g., 0 for numbers, empty string for text)
- Using column mean or most frequent value
- Collaborating with data collection teams to fill gaps at the source
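The first three options map directly onto pandas calls. A minimal sketch on a hypothetical frame (column names here are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'age': [56.0, 34.0, None, 19.0],
                   'name': ['Mickey', None, 'Mini', 'Dewey']})

# Option 1: drop rows missing a critical field
dropped = df.dropna(subset=['age'])

# Option 2: replace with defaults (0 for numbers, '' for text)
defaults = df.fillna({'age': 0, 'name': ''})

# Option 3: replace numeric gaps with the column mean
by_mean = df.assign(age=df['age'].fillna(df['age'].mean()))
```

Which option is right depends on the downstream analysis; imputing with the mean, for instance, preserves row count but dampens variance.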
3.5 Remove Empty Rows
Empty rows (all values are NaN) are removed using dropna():
```python
# Delete rows where all values are NaN
df.dropna(how='all', inplace=True)
```

|   | id | age | weight | m0006 | m0612 | m1218 | f0006 | f0612 | f1218 | first_name | last_name |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 56.0 | 70kgs | 72 | 69 | 71 | - | - | - | Mickëy | Mousë |
| 1 | 2.0 | 34.0 | 70kgs | - | - | - | 85 | 84 | 76 | Donald | Duck |
| 2 | 3.0 | 16.0 | NaN | - | - | - | 65 | 69 | 72 | Mini | Mouse |
| 3 | 4.0 | NaN | 78kgs | 78 | 79 | 72 | - | - | - | Scrooge | McDuck |
| 4 | 5.0 | 54.0 | 90kgs | - | - | - | 69 | NaN | 75 | Pink | Panther |
| 5 | 6.0 | 52.0 | 85kgs | - | - | - | 68 | 75 | 72 | Huey | McDuck |
| 6 | 7.0 | 19.0 | 56kgs | - | - | - | 71 | 78 | 75 | Dewey | McDuck |
| 7 | 8.0 | 32.0 | 78kgs | 78 | 76 | 75 | - | - | - | Scööpy | Doo |
| 9 | 9.0 | 52.0 | 85kgs | - | - | - | 68 | 75 | 72 | Huey | McDuck |
| 10 | 10.0 | 12.0 | 45kgs | - | - | - | 92 | 95 | 87 | Louie | McDuck |
3.6 Deduplicate Records
We remove duplicate rows based on first and last names (assuming each individual appears only once). The table below shows the data before deduplication; note that row 9 repeats the Huey McDuck record from row 5:
|   | id | age | weight | m0006 | m0612 | m1218 | f0006 | f0612 | f1218 | first_name | last_name |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 56.0 | 70kgs | 72 | 69 | 71 | - | - | - | Mickëy | Mousë |
| 1 | 2.0 | 34.0 | 70kgs | - | - | - | 85 | 84 | 76 | Donald | Duck |
| 2 | 3.0 | 16.0 | NaN | - | - | - | 65 | 69 | 72 | Mini | Mouse |
| 3 | 4.0 | NaN | 78kgs | 78 | 79 | 72 | - | - | - | Scrooge | McDuck |
| 4 | 5.0 | 54.0 | 90kgs | - | - | - | 69 | NaN | 75 | Pink | Panther |
| 5 | 6.0 | 52.0 | 85kgs | - | - | - | 68 | 75 | 72 | Huey | McDuck |
| 6 | 7.0 | 19.0 | 56kgs | - | - | - | 71 | 78 | 75 | Dewey | McDuck |
| 7 | 8.0 | 32.0 | 78kgs | 78 | 76 | 75 | - | - | - | Scööpy | Doo |
| 9 | 9.0 | 52.0 | 85kgs | - | - | - | 68 | 75 | 72 | Huey | McDuck |
| 10 | 10.0 | 12.0 | 45kgs | - | - | - | 92 | 95 | 87 | Louie | McDuck |
```python
# Delete duplicate records
df.drop_duplicates(subset=['first_name', 'last_name'], inplace=True)
```

3.7 Remove Non-ASCII Characters
Non-ASCII characters in names are removed using regex:
```python
# Remove non-ASCII characters
df['first_name'].replace({r'[^\x00-\x7F]+': ''}, regex=True, inplace=True)
df['last_name'].replace({r'[^\x00-\x7F]+': ''}, regex=True, inplace=True)
```

|   | id | age | weight | m0006 | m0612 | m1218 | f0006 | f0612 | f1218 | first_name | last_name |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 56.0 | 70kgs | 72 | 69 | 71 | - | - | - | Micky | Mous |
| 1 | 2.0 | 34.0 | 70kgs | - | - | - | 85 | 84 | 76 | Donald | Duck |
| 2 | 3.0 | 16.0 | NaN | - | - | - | 65 | 69 | 72 | Mini | Mouse |
| 3 | 4.0 | NaN | 78kgs | 78 | 79 | 72 | - | - | - | Scrooge | McDuck |
| 4 | 5.0 | 54.0 | 90kgs | - | - | - | 69 | NaN | 75 | Pink | Panther |
| 5 | 6.0 | 52.0 | 85kgs | - | - | - | 68 | 75 | 72 | Huey | McDuck |
| 6 | 7.0 | 19.0 | 56kgs | - | - | - | 71 | 78 | 75 | Dewey | McDuck |
| 7 | 8.0 | 32.0 | 78kgs | 78 | 76 | 75 | - | - | - | Scpy | Doo |
| 10 | 10.0 | 12.0 | 45kgs | - | - | - | 92 | 95 | 87 | Louie | McDuck |
3.8 Reshape Column Headers to Data
Columns like “m0006” (male, 00:00-06:00) and “f0612” (female, 06:00-12:00) should be split into gender and time range. We use melt() to reshape the data:
```python
# Define identifier columns for reshaping
sorted_columns = ['id', 'age', 'weight', 'first_name', 'last_name']

# Reshape data from wide to long format
df = pd.melt(
    df,
    id_vars=sorted_columns,
    var_name='sex_hour',
    value_name='pulse_rate'
).sort_values(sorted_columns)

# Split "sex_hour" into "sex" (m/f) and "hour" (time range)
df[['sex', 'hour']] = df['sex_hour'].apply(
    lambda x: pd.Series([x[:1], f"{x[1:3]}-{x[3:]}"])
)

# Drop the original "sex_hour" column
df.drop('sex_hour', axis=1, inplace=True)
```
```python
# Remove rows with "-" (invalid pulse rate)
row_with_dashes = df['pulse_rate'].str.contains('-').fillna(False)
df.drop(df[row_with_dashes].index, inplace=True)
```

| id | age | weight | first_name | last_name | pulse_rate | sex | hour |
|---|---|---|---|---|---|---|---|
| 1.0 | 56.0 | 70kgs | Micky | Mous | 72 | m | 00-06 |
| 1.0 | 56.0 | 70kgs | Micky | Mous | 69 | m | 06-12 |
| 1.0 | 56.0 | 70kgs | Micky | Mous | 71 | m | 12-18 |
| 2.0 | 34.0 | 70kgs | Donald | Duck | 85 | f | 00-06 |
| 2.0 | 34.0 | 70kgs | Donald | Duck | 84 | f | 06-12 |
| 2.0 | 34.0 | 70kgs | Donald | Duck | 76 | f | 12-18 |
| 3.0 | 16.0 | NaN | Mini | Mouse | 65 | f | 00-06 |
| 3.0 | 16.0 | NaN | Mini | Mouse | 69 | f | 06-12 |
| 3.0 | 16.0 | NaN | Mini | Mouse | 72 | f | 12-18 |
| 4.0 | NaN | 78kgs | Scrooge | McDuck | 78 | m | 00-06 |
| 4.0 | NaN | 78kgs | Scrooge | McDuck | 79 | m | 06-12 |
| 4.0 | NaN | 78kgs | Scrooge | McDuck | 72 | m | 12-18 |
| 5.0 | 54.0 | 90kgs | Pink | Panther | 69 | f | 00-06 |
| 5.0 | 54.0 | 90kgs | Pink | Panther | NaN | f | 06-12 |
| 5.0 | 54.0 | 90kgs | Pink | Panther | 75 | f | 12-18 |
| 6.0 | 52.0 | 85kgs | Huey | McDuck | 68 | f | 00-06 |
| 6.0 | 52.0 | 85kgs | Huey | McDuck | 75 | f | 06-12 |
| 6.0 | 52.0 | 85kgs | Huey | McDuck | 72 | f | 12-18 |
| 7.0 | 19.0 | 56kgs | Dewey | McDuck | 71 | f | 00-06 |
| 7.0 | 19.0 | 56kgs | Dewey | McDuck | 78 | f | 06-12 |
| 7.0 | 19.0 | 56kgs | Dewey | McDuck | 75 | f | 12-18 |
| 8.0 | 32.0 | 78kgs | Scpy | Doo | 78 | m | 00-06 |
| 8.0 | 32.0 | 78kgs | Scpy | Doo | 76 | m | 06-12 |
| 8.0 | 32.0 | 78kgs | Scpy | Doo | 75 | m | 12-18 |
| 10.0 | 12.0 | 45kgs | Louie | McDuck | 92 | f | 00-06 |
| 10.0 | 12.0 | 45kgs | Louie | McDuck | 95 | f | 06-12 |
| 10.0 | 12.0 | 45kgs | Louie | McDuck | 87 | f | 12-18 |
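The payoff of the long format is that each reading is its own row, so aggregation becomes a one-liner. A hypothetical sketch shaped like the result above:

```python
import pandas as pd

# Hypothetical long-format frame shaped like the cleaned result
df = pd.DataFrame({
    'first_name': ['Mickey', 'Mickey', 'Donald', 'Donald'],
    'hour': ['00-06', '06-12', '00-06', '06-12'],
    'pulse_rate': [72, 69, 85, 84],
})

# Average pulse rate per time slot
mean_by_hour = df.groupby('hour')['pulse_rate'].mean()
```

The same grouping against the original wide layout would require touching six separate columns.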
4. Integrated Code
The complete cleaning workflow is integrated below:
```python
import pandas as pd

# Step 1: Read CSV with custom column headers
column_names = ['id', 'name', 'age', 'weight', 'm0006', 'm0612',
                'm1218', 'f0006', 'f0612', 'f1218']
df = pd.read_csv('../data/patient_heart_rate.csv', names=column_names)

# Step 2: Split name into first_name and last_name
df[['first_name', 'last_name']] = df['name'].str.split(expand=True)
df.drop('name', axis=1, inplace=True)

# Step 3: Convert weight units from lbs to kgs
rows_with_lbs = df['weight'].str.contains('lbs').fillna(False)
for i, lbs_row in df[rows_with_lbs].iterrows():
    weight_lbs = float(lbs_row['weight'][:-3])
    weight_kgs = int(weight_lbs / 2.2)
    df.at[i, 'weight'] = f"{weight_kgs}kgs"

# Step 4: Remove empty rows
df.dropna(how='all', inplace=True)

# Step 5: Remove duplicate records
df.drop_duplicates(subset=['first_name', 'last_name'], inplace=True)

# Step 6: Remove non-ASCII characters
df['first_name'].replace({r'[^\x00-\x7F]+': ''}, regex=True, inplace=True)
df['last_name'].replace({r'[^\x00-\x7F]+': ''}, regex=True, inplace=True)

# Step 7: Reshape data and split sex_hour into sex and hour
sorted_columns = ['id', 'age', 'weight', 'first_name', 'last_name']
df = pd.melt(
    df,
    id_vars=sorted_columns,
    var_name='sex_hour',
    value_name='pulse_rate'
).sort_values(sorted_columns)

df[['sex', 'hour']] = df['sex_hour'].apply(
    lambda x: pd.Series([x[:1], f"{x[1:3]}-{x[3:]}"])
)
df.drop('sex_hour', axis=1, inplace=True)

# Step 8: Remove invalid pulse rate records
row_with_dashes = df['pulse_rate'].str.contains('-').fillna(False)
df.drop(df[row_with_dashes].index, inplace=True)

# Step 9: Reset index for readability
df = df.reset_index(drop=True)

# Print cleaned data
print(df)
```

Additional Notes
Two common and important data cleaning issues not covered in this example:
- Date format handling (standardizing inconsistent date representations)
- Character encoding problems (resolving garbled text from different encodings)
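For reference, a minimal sketch of date standardization with `pd.to_datetime` (the sample strings are hypothetical; here we assume one known day/month/year format, and let unparseable entries become NaT rather than raising):

```python
import pandas as pd

# Hypothetical date strings in a known day/month/year format
dates = pd.Series(['01/03/2021', '15/07/2021', 'not a date'])

# Parse with an explicit format; bad entries become NaT instead of raising
parsed = pd.to_datetime(dates, format='%d/%m/%Y', errors='coerce')

# Re-emit every date in a single canonical representation
standardized = parsed.dt.strftime('%Y-%m-%d')
```

For encoding problems, `pd.read_csv` accepts an `encoding` parameter so the file can be decoded with its actual character set at read time.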
The techniques introduced here cover most basic data cleaning scenarios. For more advanced content, follow the Zhihu column “Data Cleaning”.