Clean Data with Pandas

Data Cleaning: Handling "Dirty" Data with Pandas#

Overview#

  1. Understand the Data
  2. Analyze Data Issues
  3. Clean the Data
  4. Integrate the Code

1. Understand the Data#

Before processing any data, our primary task is to understand the data and its purpose. We need to examine the columns/rows, records, data formats, semantic errors, missing entries, and incorrect formats. This helps us identify the “cleaning” work required before data analysis.

For this tutorial, we use a small CSV file patient_heart_rate.csv (Link: https://pan.baidu.com/s/1geX8oYf Password: odj0) that is easy to overview. The data describes the heart rate of different individuals at various times, with columns including age, weight, gender, and heart rates at different time intervals.

import pandas as pd
df = pd.read_csv('../data/patient_heart_rate.csv')
df.head()

Initial Data Preview#

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | Mickey Mouse | 56.0 | 70kgs | 72 | 69 | 71 | NaN | NaN | NaN |
| 1 | 2.0 | Donald Duck | 34.0 | 154.89lbs | NaN | NaN | NaN | 85 | 84 | 76 |
| 2 | 3.0 | Mini Mouse | 16.0 | NaN | NaN | NaN | NaN | 65 | 69 | 72 |
| 3 | 4.0 | Scrooge McDuck | NaN | 78kgs | 78 | 79 | 72 | "-" | NaN | NaN |
| 4 | 5.0 | Pink Panther | 54.0 | 198.658lbs | NaN | ".." | NaN | 69 | NaN | 75 |
| 5 | 6.0 | Huey McDuck | 52.0 | 189lbs | ".-" | NaN | "-" | 68 | 75 | 72 |
| 6 | 7.0 | Dewey McDuck | 19.0 | 56kgs | NaN | NaN | NaN | 71 | 78 | 75 |
| 7 | 8.0 | Scoopy Doo | 32.0 | 78kgs | 78 | 76 | 75 | NaN | NaN | NaN |
| 8 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 9 | 9.0 | Huey McDuck | 52.0 | 189lbs | NaN | NaN | NaN | 68 | 75 | 72 |
| 10 | 10.0 | Louie McDuck | 12.0 | 45kgs | NaN | NaN | NaN | 92 | 95 | 87 |

2. Analyze Data Issues#

After examining the data, we identify the following key issues:

  • No column headers
  • A single column contains multiple parameters (name combines first and last names)
  • Inconsistent units in the weight column (mixed “kgs” and “lbs”)
  • Missing values (NaN)
  • Empty rows (all values are NaN except the index)
  • Duplicate records
  • Non-ASCII characters in names
  • Some column headers should be data (gender-time combinations like “m0006” should be split into gender and time range)
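Most of these issues can also be confirmed programmatically before any cleaning begins. A minimal sketch, using a small hypothetical frame standing in for the real CSV:

```python
import pandas as pd

# Hypothetical stand-in rows mimicking the dirty CSV (not the real data)
df = pd.DataFrame({
    'id':     [1.0, 2.0, None, 2.0],
    'name':   ['Mickey Mouse', 'Donald Duck', None, 'Donald Duck'],
    'weight': ['70kgs', '154.89lbs', None, '154.89lbs'],
})

print(df.isnull().sum())                                 # missing values per column
print(df.isnull().all(axis=1).sum())                     # fully empty rows
print(df.duplicated().sum())                             # exact duplicate rows
print(df['weight'].str.contains('lbs', na=False).sum())  # rows using lbs units
```

These checks surface the empty rows, duplicates, and mixed units mechanically, leaving only the semantic issues (combined names, headers-as-data) for the eye.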

3. Clean the Data#

We address each issue step by step:

3.1 Add Custom Column Headers#

Pandas allows custom column headers when reading CSV files. We define meaningful headers to replace the default numeric columns:

# Define custom column names
column_names = ['id', 'name', 'age', 'weight', 'm0006', 'm0612', 'm1218', 'f0006', 'f0612', 'f1218']
df = pd.read_csv('../data/patient_heart_rate.csv', names=column_names)
| | id | name | age | weight | m0006 | m0612 | m1218 | f0006 | f0612 | f1218 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | Mickéy Mousé | 56.0 | 70kgs | 72 | 69 | 71 | - | - | - |
| 1 | 2.0 | Donald Duck | 34.0 | 154.89lbs | - | - | - | 85 | 84 | 76 |
| 2 | 3.0 | Mini Mouse | 16.0 | NaN | - | - | - | 65 | 69 | 72 |
| 3 | 4.0 | Scrooge McDuck | NaN | 78kgs | 78 | 79 | 72 | - | - | - |
| 4 | 5.0 | Pink Panther | 54.0 | 198.658lbs | - | - | - | 69 | NaN | 75 |

3.2 Split Combined Column (Name)#

The name column contains both first and last names. We split it into two separate columns and drop the original:

# Split name into first_name and last_name
df[['first_name', 'last_name']] = df['name'].str.split(expand=True)
df.drop('name', axis=1, inplace=True)
| | id | age | weight | m0006 | m0612 | m1218 | f0006 | f0612 | f1218 | first_name | last_name |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 56.0 | 70kgs | 72 | 69 | 71 | - | - | - | Mickëy | Mousë |
| 1 | 2.0 | 34.0 | 154.89lbs | - | - | - | 85 | 84 | 76 | Donald | Duck |
| 2 | 3.0 | 16.0 | NaN | - | - | - | 65 | 69 | 72 | Mini | Mouse |
| 3 | 4.0 | NaN | 78kgs | 78 | 79 | 72 | - | - | - | Scrooge | McDuck |
| 4 | 5.0 | 54.0 | 198.658lbs | - | - | - | 69 | NaN | 75 | Pink | Panther |
| 5 | 6.0 | 52.0 | 189lbs | - | - | - | 68 | 75 | 72 | Huey | McDuck |
| 6 | 7.0 | 19.0 | 56kgs | - | - | - | 71 | 78 | 75 | Dewey | McDuck |
| 7 | 8.0 | 32.0 | 78kgs | 78 | 76 | 75 | - | - | - | Scööpy | Doo |
| 8 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 9 | 9.0 | 52.0 | 189lbs | - | - | - | 68 | 75 | 72 | Huey | McDuck |
| 10 | 10.0 | 12.0 | 45kgs | - | - | - | 92 | 95 | 87 | Louie | McDuck |

3.3 Standardize Weight Units#

The weight column uses mixed units ("kgs" and "lbs"). We convert everything to kilograms (1 lb ≈ 0.4536 kg; the code below uses the simpler factor of 1 kg ≈ 2.2 lbs):

# Identify rows with "lbs" units
rows_with_lbs = df['weight'].str.contains('lbs').fillna(False)
df[rows_with_lbs]
| | id | age | weight | m0006 | m0612 | m1218 | f0006 | f0612 | f1218 | first_name | last_name |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 2.0 | 34.0 | 154.89lbs | - | - | - | 85 | 84 | 76 | Donald | Duck |
| 4 | 5.0 | 54.0 | 198.658lbs | - | - | - | 69 | NaN | 75 | Pink | Panther |
| 5 | 6.0 | 52.0 | 189lbs | - | - | - | 68 | 75 | 72 | Huey | McDuck |
| 9 | 9.0 | 52.0 | 189lbs | - | - | - | 68 | 75 | 72 | Huey | McDuck |
# Convert lbs to kgs
for i, lbs_row in df[rows_with_lbs].iterrows():
    # Extract the numeric value and convert
    weight_lbs = float(lbs_row['weight'][:-3])
    weight_kgs = int(weight_lbs / 2.2)  # simplified conversion, truncated to an integer
    df.at[i, 'weight'] = f"{weight_kgs}kgs"
| | id | age | weight | m0006 | m0612 | m1218 | f0006 | f0612 | f1218 | first_name | last_name |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 56.0 | 70kgs | 72 | 69 | 71 | - | - | - | Mickëy | Mousë |
| 1 | 2.0 | 34.0 | 70kgs | - | - | - | 85 | 84 | 76 | Donald | Duck |
| 2 | 3.0 | 16.0 | NaN | - | - | - | 65 | 69 | 72 | Mini | Mouse |
| 3 | 4.0 | NaN | 78kgs | 78 | 79 | 72 | - | - | - | Scrooge | McDuck |
| 4 | 5.0 | 54.0 | 90kgs | - | - | - | 69 | NaN | 75 | Pink | Panther |
| 5 | 6.0 | 52.0 | 85kgs | - | - | - | 68 | 75 | 72 | Huey | McDuck |
| 6 | 7.0 | 19.0 | 56kgs | - | - | - | 71 | 78 | 75 | Dewey | McDuck |
| 7 | 8.0 | 32.0 | 78kgs | 78 | 76 | 75 | - | - | - | Scööpy | Doo |
| 8 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 9 | 9.0 | 52.0 | 85kgs | - | - | - | 68 | 75 | 72 | Huey | McDuck |
| 10 | 10.0 | 12.0 | 45kgs | - | - | - | 92 | 95 | 87 | Louie | McDuck |
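The row-by-row loop above is fine for a dataset this small; as a sketch, the same conversion can also be written vectorized, under the same assumptions (values end in "kgs" or "lbs", divide by 2.2, truncate to an integer):

```python
import pandas as pd

# Hypothetical weight column mixing both units
df = pd.DataFrame({'weight': ['70kgs', '154.89lbs', None, '189lbs']})

# Vectorized equivalent of the loop: mask, slice off the unit, convert
is_lbs = df['weight'].str.endswith('lbs', na=False)
lbs_values = df.loc[is_lbs, 'weight'].str[:-3].astype(float)
df.loc[is_lbs, 'weight'] = (lbs_values / 2.2).astype(int).astype(str) + 'kgs'
print(df['weight'].tolist())
```

On large frames the vectorized form avoids the per-row Python overhead of iterrows().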

3.4 Handle Missing Values#

Missing values are common in datasets. Typical handling methods include:

  • Deleting rows with critical missing values
  • Replacing with default values (e.g., 0 for numbers, empty string for text)
  • Using column mean or most frequent value
  • Collaborating with data collection teams to fill gaps at the source
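As a quick sketch of the second and third options, on a hypothetical numeric column (not the tutorial's data):

```python
import pandas as pd

# Hypothetical heart-rate readings with gaps
s = pd.Series([72.0, None, 69.0, None])

filled_zero = s.fillna(0)          # replace missing values with a default
filled_mean = s.fillna(s.mean())   # replace with the column mean (70.5 here)
dropped = s.dropna()               # or drop the missing entries entirely
print(filled_mean.tolist())
```

Which option is right depends on the column: imputing a mean heart rate may be acceptable for aggregate analysis but misleading for per-patient reporting.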

3.5 Remove Empty Rows#

Empty rows (all values are NaN) are removed using dropna():

# Delete rows where all values are NaN
df.dropna(how='all', inplace=True)
| | id | age | weight | m0006 | m0612 | m1218 | f0006 | f0612 | f1218 | first_name | last_name |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 56.0 | 70kgs | 72 | 69 | 71 | - | - | - | Mickëy | Mousë |
| 1 | 2.0 | 34.0 | 70kgs | - | - | - | 85 | 84 | 76 | Donald | Duck |
| 2 | 3.0 | 16.0 | NaN | - | - | - | 65 | 69 | 72 | Mini | Mouse |
| 3 | 4.0 | NaN | 78kgs | 78 | 79 | 72 | - | - | - | Scrooge | McDuck |
| 4 | 5.0 | 54.0 | 90kgs | - | - | - | 69 | NaN | 75 | Pink | Panther |
| 5 | 6.0 | 52.0 | 85kgs | - | - | - | 68 | 75 | 72 | Huey | McDuck |
| 6 | 7.0 | 19.0 | 56kgs | - | - | - | 71 | 78 | 75 | Dewey | McDuck |
| 7 | 8.0 | 32.0 | 78kgs | 78 | 76 | 75 | - | - | - | Scööpy | Doo |
| 9 | 9.0 | 52.0 | 85kgs | - | - | - | 68 | 75 | 72 | Huey | McDuck |
| 10 | 10.0 | 12.0 | 45kgs | - | - | - | 92 | 95 | 87 | Louie | McDuck |

3.6 Deduplicate Records#

We remove duplicate rows based on first and last names (assuming each individual appears only once). Before deduplication, Huey McDuck still appears twice (rows 5 and 9):

| | id | age | weight | m0006 | m0612 | m1218 | f0006 | f0612 | f1218 | first_name | last_name |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 56.0 | 70kgs | 72 | 69 | 71 | - | - | - | Mickëy | Mousë |
| 1 | 2.0 | 34.0 | 70kgs | - | - | - | 85 | 84 | 76 | Donald | Duck |
| 2 | 3.0 | 16.0 | NaN | - | - | - | 65 | 69 | 72 | Mini | Mouse |
| 3 | 4.0 | NaN | 78kgs | 78 | 79 | 72 | - | - | - | Scrooge | McDuck |
| 4 | 5.0 | 54.0 | 90kgs | - | - | - | 69 | NaN | 75 | Pink | Panther |
| 5 | 6.0 | 52.0 | 85kgs | - | - | - | 68 | 75 | 72 | Huey | McDuck |
| 6 | 7.0 | 19.0 | 56kgs | - | - | - | 71 | 78 | 75 | Dewey | McDuck |
| 7 | 8.0 | 32.0 | 78kgs | 78 | 76 | 75 | - | - | - | Scööpy | Doo |
| 9 | 9.0 | 52.0 | 85kgs | - | - | - | 68 | 75 | 72 | Huey | McDuck |
| 10 | 10.0 | 12.0 | 45kgs | - | - | - | 92 | 95 | 87 | Louie | McDuck |
# Delete duplicate records
df.drop_duplicates(['first_name', 'last_name'], inplace=True)
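Before dropping anything, duplicated() on the same key columns shows exactly which rows would be removed; a minimal sketch with stand-in names:

```python
import pandas as pd

# Stand-in names with one repeated individual
df = pd.DataFrame({
    'first_name': ['Huey', 'Dewey', 'Huey'],
    'last_name':  ['McDuck', 'McDuck', 'McDuck'],
})

# duplicated() flags every occurrence after the first
print(df.duplicated(['first_name', 'last_name']))

df.drop_duplicates(['first_name', 'last_name'], inplace=True)
print(df)  # only the first Huey McDuck row survives
```

Inspecting before deleting is a cheap safeguard against dropping rows that merely share a name.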

3.7 Remove Non-ASCII Characters#

Non-ASCII characters in names are removed using regex:

# Remove non-ASCII characters
df['first_name'] = df['first_name'].replace({r'[^\x00-\x7F]+': ''}, regex=True)
df['last_name'] = df['last_name'].replace({r'[^\x00-\x7F]+': ''}, regex=True)
| | id | age | weight | m0006 | m0612 | m1218 | f0006 | f0612 | f1218 | first_name | last_name |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 56.0 | 70kgs | 72 | 69 | 71 | - | - | - | Micky | Mous |
| 1 | 2.0 | 34.0 | 70kgs | - | - | - | 85 | 84 | 76 | Donald | Duck |
| 2 | 3.0 | 16.0 | NaN | - | - | - | 65 | 69 | 72 | Mini | Mouse |
| 3 | 4.0 | NaN | 78kgs | 78 | 79 | 72 | - | - | - | Scrooge | McDuck |
| 4 | 5.0 | 54.0 | 90kgs | - | - | - | 69 | NaN | 75 | Pink | Panther |
| 5 | 6.0 | 52.0 | 85kgs | - | - | - | 68 | 75 | 72 | Huey | McDuck |
| 6 | 7.0 | 19.0 | 56kgs | - | - | - | 71 | 78 | 75 | Dewey | McDuck |
| 7 | 8.0 | 32.0 | 78kgs | 78 | 76 | 75 | - | - | - | Scpy | Doo |
| 10 | 10.0 | 12.0 | 45kgs | - | - | - | 92 | 95 | 87 | Louie | McDuck |
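Note that outright removal mangles names (Mickéy became Micky). A gentler alternative, not used in this tutorial, is to transliterate accented characters to their ASCII base letters; a sketch with the standard-library unicodedata module:

```python
import unicodedata
import pandas as pd

names = pd.Series(['Mickéy', 'Scööpy'])

# Decompose accented characters (NFKD), then drop only the combining
# marks, keeping the base ASCII letter instead of deleting the character
ascii_names = names.map(
    lambda s: unicodedata.normalize('NFKD', s).encode('ascii', 'ignore').decode('ascii')
)
print(ascii_names.tolist())  # → ['Mickey', 'Scoopy']
```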

3.8 Reshape Column Headers to Data#

Columns like “m0006” (male, 00:00-06:00) and “f0612” (female, 06:00-12:00) should be split into gender and time range. We use melt() to reshape the data:

# Define identifier columns for reshaping
sorted_columns = ['id', 'age', 'weight', 'first_name', 'last_name']
# Reshape data from wide to long format
df = pd.melt(
    df,
    id_vars=sorted_columns,
    var_name='sex_hour',
    value_name='pulse_rate'
).sort_values(sorted_columns)
# Split "sex_hour" into "sex" (m/f) and "hour" (time range)
df[['sex', 'hour']] = df['sex_hour'].apply(
    lambda x: pd.Series([x[:1], f"{x[1:3]}-{x[3:]}"])
)[[0, 1]]
# Drop the original "sex_hour" column
df.drop('sex_hour', axis=1, inplace=True)
# Remove rows with "-" (invalid pulse rate)
row_with_dashes = df['pulse_rate'].str.contains('-').fillna(False)
df.drop(df[row_with_dashes].index, inplace=True)
| id | age | weight | first_name | last_name | pulse_rate | sex | hour |
|---|---|---|---|---|---|---|---|
| 1.0 | 56.0 | 70kgs | Micky | Mous | 72 | m | 00-06 |
| 1.0 | 56.0 | 70kgs | Micky | Mous | 69 | m | 06-12 |
| 1.0 | 56.0 | 70kgs | Micky | Mous | 71 | m | 12-18 |
| 2.0 | 34.0 | 70kgs | Donald | Duck | 85 | f | 00-06 |
| 2.0 | 34.0 | 70kgs | Donald | Duck | 84 | f | 06-12 |
| 2.0 | 34.0 | 70kgs | Donald | Duck | 76 | f | 12-18 |
| 3.0 | 16.0 | NaN | Mini | Mouse | 65 | f | 00-06 |
| 3.0 | 16.0 | NaN | Mini | Mouse | 69 | f | 06-12 |
| 3.0 | 16.0 | NaN | Mini | Mouse | 72 | f | 12-18 |
| 4.0 | NaN | 78kgs | Scrooge | McDuck | 78 | m | 00-06 |
| 4.0 | NaN | 78kgs | Scrooge | McDuck | 79 | m | 06-12 |
| 4.0 | NaN | 78kgs | Scrooge | McDuck | 72 | m | 12-18 |
| 5.0 | 54.0 | 90kgs | Pink | Panther | 69 | f | 00-06 |
| 5.0 | 54.0 | 90kgs | Pink | Panther | NaN | f | 06-12 |
| 5.0 | 54.0 | 90kgs | Pink | Panther | 75 | f | 12-18 |
| 6.0 | 52.0 | 85kgs | Huey | McDuck | 68 | f | 00-06 |
| 6.0 | 52.0 | 85kgs | Huey | McDuck | 75 | f | 06-12 |
| 6.0 | 52.0 | 85kgs | Huey | McDuck | 72 | f | 12-18 |
| 7.0 | 19.0 | 56kgs | Dewey | McDuck | 71 | f | 00-06 |
| 7.0 | 19.0 | 56kgs | Dewey | McDuck | 78 | f | 06-12 |
| 7.0 | 19.0 | 56kgs | Dewey | McDuck | 75 | f | 12-18 |
| 8.0 | 32.0 | 78kgs | Scpy | Doo | 78 | m | 00-06 |
| 8.0 | 32.0 | 78kgs | Scpy | Doo | 76 | m | 06-12 |
| 8.0 | 32.0 | 78kgs | Scpy | Doo | 75 | m | 12-18 |
| 10.0 | 12.0 | 45kgs | Louie | McDuck | 92 | f | 00-06 |
| 10.0 | 12.0 | 45kgs | Louie | McDuck | 95 | f | 06-12 |
| 10.0 | 12.0 | 45kgs | Louie | McDuck | 87 | f | 12-18 |
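As an aside, the apply/lambda split used above can also be written with str.extract, which pulls both fields out in one vectorized regex pass; a sketch assuming the same "m0006"-style labels:

```python
import pandas as pd

# Hypothetical stand-in labels in the same "m0006" format
df = pd.DataFrame({'sex_hour': ['m0006', 'f0612', 'f1218']})

# One regex: capture the sex letter and the two 2-digit hour fields
extracted = df['sex_hour'].str.extract(r'^([mf])(\d{2})(\d{2})$')
df['sex'] = extracted[0]
df['hour'] = extracted[1] + '-' + extracted[2]
print(df[['sex', 'hour']])
```

A bonus of the regex form is that labels not matching the expected pattern come back as NaN instead of producing a garbled split.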

4. Integrated Code#

The complete cleaning workflow is integrated below:

import pandas as pd
# Step 1: Read CSV with custom column headers
column_names = ['id', 'name', 'age', 'weight', 'm0006', 'm0612', 'm1218', 'f0006', 'f0612', 'f1218']
df = pd.read_csv('../data/patient_heart_rate.csv', names=column_names)
# Step 2: Split name into first_name and last_name
df[['first_name', 'last_name']] = df['name'].str.split(expand=True)
df.drop('name', axis=1, inplace=True)
# Step 3: Convert weight units from lbs to kgs
rows_with_lbs = df['weight'].str.contains('lbs').fillna(False)
for i, lbs_row in df[rows_with_lbs].iterrows():
    weight_lbs = float(lbs_row['weight'][:-3])
    weight_kgs = int(weight_lbs / 2.2)
    df.at[i, 'weight'] = f"{weight_kgs}kgs"
# Step 4: Remove empty rows
df.dropna(how='all', inplace=True)
# Step 5: Remove duplicate records
df.drop_duplicates(['first_name', 'last_name'], inplace=True)
# Step 6: Remove non-ASCII characters
df['first_name'] = df['first_name'].replace({r'[^\x00-\x7F]+': ''}, regex=True)
df['last_name'] = df['last_name'].replace({r'[^\x00-\x7F]+': ''}, regex=True)
# Step 7: Reshape data and split sex_hour into sex and hour
sorted_columns = ['id', 'age', 'weight', 'first_name', 'last_name']
df = pd.melt(
    df,
    id_vars=sorted_columns,
    var_name='sex_hour',
    value_name='pulse_rate'
).sort_values(sorted_columns)
df[['sex', 'hour']] = df['sex_hour'].apply(
    lambda x: pd.Series([x[:1], f"{x[1:3]}-{x[3:]}"])
)[[0, 1]]
df.drop('sex_hour', axis=1, inplace=True)
# Step 8: Remove invalid pulse rate records
row_with_dashes = df['pulse_rate'].str.contains('-').fillna(False)
df.drop(df[row_with_dashes].index, inplace=True)
# Step 9: Reset index for readability
df = df.reset_index(drop=True)
# Print cleaned data
print(df)

Additional Notes#

Two common and important data cleaning issues not covered in this example:

  1. Date format handling (standardizing inconsistent date representations)
  2. Character encoding problems (resolving garbled text from different encodings)
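For the first of these, pandas can usually normalize mixed date strings; a minimal sketch on hypothetical values (not from this dataset):

```python
import pandas as pd

# Hypothetical mixed date representations
raw = pd.Series(['2021-03-28', '03/28/2021', '28 Mar 2021'])

# Parse each entry individually, then re-emit one canonical format
dates = raw.apply(pd.to_datetime)
standardized = dates.dt.strftime('%Y-%m-%d')
print(standardized.tolist())
```

Parsing entry by entry handles rows whose formats differ; be aware that ambiguous forms like 03/04/2021 default to month-first parsing.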

The techniques introduced here cover most basic data cleaning scenarios. For more advanced content, follow the Zhihu column “Data Cleaning”.

Clean Data with Pandas
https://fuwari.vercel.app/posts/learning/panda-clean-data/
Author: Zero02 · Published: 2026-03-28 · License: CC BY-NC-SA 4.0