Data Cleaning - Pandas Cleans “Dirty” Data
Overview
- Understand the Data
- Analyze Data Issues
- Clean the Data
- Integrate the Code
1. Understand the Data
Before processing any data, our primary task is to understand the data and its purpose. We need to examine the columns/rows, records, data formats, semantic errors, missing entries, and incorrect formats. This helps us identify the “cleaning” work required before data analysis.
For this tutorial, we use a small CSV file patient_heart_rate.csv (Link: https://pan.baidu.com/s/1geX8oYf Password: odj0) that is easy to overview. The data describes the heart rate of different individuals at various times, with columns including age, weight, gender, and heart rates at different time intervals.
```python
import pandas as pd

df = pd.read_csv('../data/patient_heart_rate.csv')
df
```

Initial Data Preview:
|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | Mickey Mouse | 56.0 | 70kgs | 72 | 69 | 71 | NaN | NaN | NaN |
| 1 | 2.0 | Donald Duck | 34.0 | 154.89lbs | NaN | NaN | NaN | 85 | 84 | 76 |
| 2 | 3.0 | Mini Mouse | 16.0 | NaN | NaN | NaN | NaN | 65 | 69 | 72 |
| 3 | 4.0 | Scrooge McDuck | NaN | 78kgs | 78 | 79 | 72 | "-" | NaN | NaN |
| 4 | 5.0 | Pink Panther | 54.0 | 198.658lbs | NaN | ".." | NaN | 69 | NaN | 75 |
| 5 | 6.0 | Huey McDuck | ":52.0" | 189lbs | ".-" | NaN | "-" | 68 | 75 | 72 |
| 6 | 7.0 | Dewey McDuck | 19.0 | 56kgs | NaN | NaN | NaN | 71 | 78 | 75 |
| 7 | 8.0 | Scoopy Doo | 32.0 | 78kgs | 78 | 76 | 75 | NaN | NaN | NaN |
| 8 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 9 | 9.0 | Huey McDuck | 52.0 | 189lbs | NaN | NaN | NaN | 68 | 75 | 72 |
| 10 | 10.0 | Louie McDuck | 12.0 | 45kgs | NaN | NaN | NaN | 92 | 95 | 87 |
2. Analyze Data Issues
After examining the data, we identify the following key issues:
- No column headers
- A single column contains multiple parameters (name combines first and last names)
- Inconsistent units in the weight column (mixed “kgs” and “lbs”)
- Missing values (NaN)
- Empty rows (all values are NaN except the index)
- Duplicate records
- Non-ASCII characters in names
- Some column headers should be data (gender-time combinations like “m0006” should be split into gender and time range)
3. Clean the Data
We address each issue step by step:
3.1 Add Custom Column Headers
Pandas allows custom column headers when reading CSV files. We define meaningful headers to replace the default numeric columns:
```python
# Define custom column names
column_names = ['id', 'name', 'age', 'weight', 'm0006', 'm0612',
                'm1218', 'f0006', 'f0612', 'f1218']
df = pd.read_csv('../data/patient_heart_rate.csv', names=column_names)
```

|   | id | name | age | weight | m0006 | m0612 | m1218 | f0006 | f0612 | f1218 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | Mickéy Mousé | 56.0 | 70kgs | 72 | 69 | 71 | - | - | - |
| 1 | 2.0 | Donald Duck | 34.0 | 154.89lbs | - | - | - | 85 | 84 | 76 |
| 2 | 3.0 | Mini Mouse | 16.0 | NaN | - | - | - | 65 | 69 | 72 |
| 3 | 4.0 | Scrooge McDuck | NaN | 78kgs | 78 | 79 | 72 | - | - | - |
| 4 | 5.0 | Pink Panther | 54.0 | 198.658lbs | - | - | - | 69 | NaN | 75 |
3.2 Split Combined Column (Name)
The name column contains both first and last names. We split it into two separate columns and drop the original:
```python
# Split name into first_name and last_name
df[['first_name', 'last_name']] = df['name'].str.split(expand=True)
df.drop('name', axis=1, inplace=True)
```

|   | id | age | weight | m0006 | m0612 | m1218 | f0006 | f0612 | f1218 | first_name | last_name |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 56.0 | 70kgs | 72 | 69 | 71 | - | - | - | Mickëy | Mousë |
| 1 | 2.0 | 34.0 | 154.89lbs | - | - | - | 85 | 84 | 76 | Donald | Duck |
| 2 | 3.0 | 16.0 | NaN | - | - | - | 65 | 69 | 72 | Mini | Mouse |
| 3 | 4.0 | NaN | 78kgs | 78 | 79 | 72 | - | - | - | Scrooge | McDuck |
| 4 | 5.0 | 54.0 | 198.658lbs | - | - | - | 69 | NaN | 75 | Pink | Panther |
| 5 | 6.0 | 52.0 | 189lbs | - | - | - | 68 | 75 | 72 | Huey | McDuck |
| 6 | 7.0 | 19.0 | 56kgs | - | - | - | 71 | 78 | 75 | Dewey | McDuck |
| 7 | 8.0 | 32.0 | 78kgs | 78 | 76 | 75 | - | - | - | Scööpy | Doo |
| 8 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 9 | 9.0 | 52.0 | 189lbs | - | - | - | 68 | 75 | 72 | Huey | McDuck |
| 10 | 10.0 | 12.0 | 45kgs | - | - | - | 92 | 95 | 87 | Louie | McDuck |
3.3 Standardize Weight Units
The weight column uses mixed units ("kgs" and "lbs"). We convert all values to kgs (1 lb ≈ 0.4536 kg; equivalently 1 kg ≈ 2.2 lbs, which is the factor used below):
```python
# Identify rows with "lbs" units
rows_with_lbs = df['weight'].str.contains('lbs').fillna(False)
df[rows_with_lbs]
```

|   | id | age | weight | m0006 | m0612 | m1218 | f0006 | f0612 | f1218 | first_name | last_name |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 2.0 | 34.0 | 154.89lbs | - | - | - | 85 | 84 | 76 | Donald | Duck |
| 4 | 5.0 | 54.0 | 198.658lbs | - | - | - | 69 | NaN | 75 | Pink | Panther |
| 5 | 6.0 | 52.0 | 189lbs | - | - | - | 68 | 75 | 72 | Huey | McDuck |
| 9 | 9.0 | 52.0 | 189lbs | - | - | - | 68 | 75 | 72 | Huey | McDuck |
```python
# Convert lbs to kgs
for i, lbs_row in df[rows_with_lbs].iterrows():
    # Extract the numeric value and convert (simplified factor: 2.2 lbs per kg)
    weight_lbs = float(lbs_row['weight'][:-3])
    weight_kgs = int(weight_lbs / 2.2)
    df.at[i, 'weight'] = f"{weight_kgs}kgs"
```

|   | id | age | weight | m0006 | m0612 | m1218 | f0006 | f0612 | f1218 | first_name | last_name |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 56.0 | 70kgs | 72 | 69 | 71 | - | - | - | Mickëy | Mousë |
| 1 | 2.0 | 34.0 | 70kgs | - | - | - | 85 | 84 | 76 | Donald | Duck |
| 2 | 3.0 | 16.0 | NaN | - | - | - | 65 | 69 | 72 | Mini | Mouse |
| 3 | 4.0 | NaN | 78kgs | 78 | 79 | 72 | - | - | - | Scrooge | McDuck |
| 4 | 5.0 | 54.0 | 90kgs | - | - | - | 69 | NaN | 75 | Pink | Panther |
| 5 | 6.0 | 52.0 | 85kgs | - | - | - | 68 | 75 | 72 | Huey | McDuck |
| 6 | 7.0 | 19.0 | 56kgs | - | - | - | 71 | 78 | 75 | Dewey | McDuck |
| 7 | 8.0 | 32.0 | 78kgs | 78 | 76 | 75 | - | - | - | Scööpy | Doo |
| 8 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 9 | 9.0 | 52.0 | 85kgs | - | - | - | 68 | 75 | 72 | Huey | McDuck |
| 10 | 10.0 | 12.0 | 45kgs | - | - | - | 92 | 95 | 87 | Louie | McDuck |
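As a side note, the row-by-row loop can also be written in vectorized form. A minimal sketch (the small frame here is a hypothetical stand-in for the real weight column), using the same simplified factor of 2.2 lbs per kg:

```python
import pandas as pd

# Hypothetical stand-in for the weight column
df = pd.DataFrame({'weight': ['70kgs', '154.89lbs', None, '189lbs']})

# Pull apart the numeric value and the unit in one pass
parts = df['weight'].str.extract(r'(?P<value>[\d.]+)(?P<unit>kgs|lbs)')
value = parts['value'].astype(float)
is_lbs = parts['unit'] == 'lbs'
value[is_lbs] = (value[is_lbs] / 2.2).astype(int)  # same simplified factor as the loop

# Write back as "<n>kgs" strings, leaving missing weights untouched
df.loc[df['weight'].notna(), 'weight'] = value.dropna().astype(int).astype(str) + 'kgs'
```

Vectorized string operations avoid `iterrows()`, which gets slow on large frames.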
3.4 Handle Missing Values
Missing values are common in datasets. Typical handling methods include:
- Deleting rows with critical missing values
- Replacing with default values (e.g., 0 for numbers, empty string for text)
- Using column mean or most frequent value
- Collaborating with data collection teams to fill gaps at the source
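The first three options map directly onto pandas calls. A minimal sketch on a hypothetical frame (column names here are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'age': [56.0, 34.0, None, 19.0],
                   'name': ['Mickey', None, 'Mini', 'Dewey']})

# Option 1: drop rows missing a critical field
dropped = df.dropna(subset=['age'])

# Option 2: replace with defaults (0 for numbers, '' for text)
defaults = df.fillna({'age': 0, 'name': ''})

# Option 3: replace numeric gaps with the column mean
by_mean = df.assign(age=df['age'].fillna(df['age'].mean()))
```

Which option is right depends on the downstream analysis; imputing with the mean, for instance, preserves row count but dampens variance.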
3.5 Remove Empty Rows
Empty rows (all values are NaN) are removed using dropna():
```python
# Delete rows where all values are NaN
df.dropna(how='all', inplace=True)
```

|   | id | age | weight | m0006 | m0612 | m1218 | f0006 | f0612 | f1218 | first_name | last_name |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 56.0 | 70kgs | 72 | 69 | 71 | - | - | - | Mickëy | Mousë |
| 1 | 2.0 | 34.0 | 70kgs | - | - | - | 85 | 84 | 76 | Donald | Duck |
| 2 | 3.0 | 16.0 | NaN | - | - | - | 65 | 69 | 72 | Mini | Mouse |
| 3 | 4.0 | NaN | 78kgs | 78 | 79 | 72 | - | - | - | Scrooge | McDuck |
| 4 | 5.0 | 54.0 | 90kgs | - | - | - | 69 | NaN | 75 | Pink | Panther |
| 5 | 6.0 | 52.0 | 85kgs | - | - | - | 68 | 75 | 72 | Huey | McDuck |
| 6 | 7.0 | 19.0 | 56kgs | - | - | - | 71 | 78 | 75 | Dewey | McDuck |
| 7 | 8.0 | 32.0 | 78kgs | 78 | 76 | 75 | - | - | - | Scööpy | Doo |
| 9 | 9.0 | 52.0 | 85kgs | - | - | - | 68 | 75 | 72 | Huey | McDuck |
| 10 | 10.0 | 12.0 | 45kgs | - | - | - | 92 | 95 | 87 | Louie | McDuck |
3.6 Deduplicate Records
We remove duplicate rows based on first and last names (assuming each individual appears only once). The table below shows the data before deduplication; note that row 9 repeats the Huey McDuck record from row 5:
|   | id | age | weight | m0006 | m0612 | m1218 | f0006 | f0612 | f1218 | first_name | last_name |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 56.0 | 70kgs | 72 | 69 | 71 | - | - | - | Mickëy | Mousë |
| 1 | 2.0 | 34.0 | 70kgs | - | - | - | 85 | 84 | 76 | Donald | Duck |
| 2 | 3.0 | 16.0 | NaN | - | - | - | 65 | 69 | 72 | Mini | Mouse |
| 3 | 4.0 | NaN | 78kgs | 78 | 79 | 72 | - | - | - | Scrooge | McDuck |
| 4 | 5.0 | 54.0 | 90kgs | - | - | - | 69 | NaN | 75 | Pink | Panther |
| 5 | 6.0 | 52.0 | 85kgs | - | - | - | 68 | 75 | 72 | Huey | McDuck |
| 6 | 7.0 | 19.0 | 56kgs | - | - | - | 71 | 78 | 75 | Dewey | McDuck |
| 7 | 8.0 | 32.0 | 78kgs | 78 | 76 | 75 | - | - | - | Scööpy | Doo |
| 9 | 9.0 | 52.0 | 85kgs | - | - | - | 68 | 75 | 72 | Huey | McDuck |
| 10 | 10.0 | 12.0 | 45kgs | - | - | - | 92 | 95 | 87 | Louie | McDuck |
```python
# Delete duplicate records
df.drop_duplicates(subset=['first_name', 'last_name'], inplace=True)
```

3.7 Remove Non-ASCII Characters
Non-ASCII characters in names are removed using regex:
```python
# Remove non-ASCII characters
df['first_name'].replace({r'[^\x00-\x7F]+': ''}, regex=True, inplace=True)
df['last_name'].replace({r'[^\x00-\x7F]+': ''}, regex=True, inplace=True)
```

|   | id | age | weight | m0006 | m0612 | m1218 | f0006 | f0612 | f1218 | first_name | last_name |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 56.0 | 70kgs | 72 | 69 | 71 | - | - | - | Micky | Mous |
| 1 | 2.0 | 34.0 | 70kgs | - | - | - | 85 | 84 | 76 | Donald | Duck |
| 2 | 3.0 | 16.0 | NaN | - | - | - | 65 | 69 | 72 | Mini | Mouse |
| 3 | 4.0 | NaN | 78kgs | 78 | 79 | 72 | - | - | - | Scrooge | McDuck |
| 4 | 5.0 | 54.0 | 90kgs | - | - | - | 69 | NaN | 75 | Pink | Panther |
| 5 | 6.0 | 52.0 | 85kgs | - | - | - | 68 | 75 | 72 | Huey | McDuck |
| 6 | 7.0 | 19.0 | 56kgs | - | - | - | 71 | 78 | 75 | Dewey | McDuck |
| 7 | 8.0 | 32.0 | 78kgs | 78 | 76 | 75 | - | - | - | Scpy | Doo |
| 10 | 10.0 | 12.0 | 45kgs | - | - | - | 92 | 95 | 87 | Louie | McDuck |
3.8 Reshape Column Headers to Data
Columns like “m0006” (male, 00:00-06:00) and “f0612” (female, 06:00-12:00) should be split into gender and time range. We use melt() to reshape the data:
```python
# Define identifier columns for reshaping
sorted_columns = ['id', 'age', 'weight', 'first_name', 'last_name']

# Reshape data from wide to long format
df = pd.melt(
    df,
    id_vars=sorted_columns,
    var_name='sex_hour',
    value_name='pulse_rate'
).sort_values(sorted_columns)

# Split "sex_hour" into "sex" (m/f) and "hour" (time range)
df[['sex', 'hour']] = df['sex_hour'].apply(
    lambda x: pd.Series([x[:1], f"{x[1:3]}-{x[3:]}"])
)

# Drop the original "sex_hour" column
df.drop('sex_hour', axis=1, inplace=True)
```
```python
# Remove rows with "-" (invalid pulse rate)
row_with_dashes = df['pulse_rate'].str.contains('-').fillna(False)
df.drop(df[row_with_dashes].index, inplace=True)
```

| id | age | weight | first_name | last_name | pulse_rate | sex | hour |
|---|---|---|---|---|---|---|---|
| 1.0 | 56.0 | 70kgs | Micky | Mous | 72 | m | 00-06 |
| 1.0 | 56.0 | 70kgs | Micky | Mous | 69 | m | 06-12 |
| 1.0 | 56.0 | 70kgs | Micky | Mous | 71 | m | 12-18 |
| 2.0 | 34.0 | 70kgs | Donald | Duck | 85 | f | 00-06 |
| 2.0 | 34.0 | 70kgs | Donald | Duck | 84 | f | 06-12 |
| 2.0 | 34.0 | 70kgs | Donald | Duck | 76 | f | 12-18 |
| 3.0 | 16.0 | NaN | Mini | Mouse | 65 | f | 00-06 |
| 3.0 | 16.0 | NaN | Mini | Mouse | 69 | f | 06-12 |
| 3.0 | 16.0 | NaN | Mini | Mouse | 72 | f | 12-18 |
| 4.0 | NaN | 78kgs | Scrooge | McDuck | 78 | m | 00-06 |
| 4.0 | NaN | 78kgs | Scrooge | McDuck | 79 | m | 06-12 |
| 4.0 | NaN | 78kgs | Scrooge | McDuck | 72 | m | 12-18 |
| 5.0 | 54.0 | 90kgs | Pink | Panther | 69 | f | 00-06 |
| 5.0 | 54.0 | 90kgs | Pink | Panther | NaN | f | 06-12 |
| 5.0 | 54.0 | 90kgs | Pink | Panther | 75 | f | 12-18 |
| 6.0 | 52.0 | 85kgs | Huey | McDuck | 68 | f | 00-06 |
| 6.0 | 52.0 | 85kgs | Huey | McDuck | 75 | f | 06-12 |
| 6.0 | 52.0 | 85kgs | Huey | McDuck | 72 | f | 12-18 |
| 7.0 | 19.0 | 56kgs | Dewey | McDuck | 71 | f | 00-06 |
| 7.0 | 19.0 | 56kgs | Dewey | McDuck | 78 | f | 06-12 |
| 7.0 | 19.0 | 56kgs | Dewey | McDuck | 75 | f | 12-18 |
| 8.0 | 32.0 | 78kgs | Scpy | Doo | 78 | m | 00-06 |
| 8.0 | 32.0 | 78kgs | Scpy | Doo | 76 | m | 06-12 |
| 8.0 | 32.0 | 78kgs | Scpy | Doo | 75 | m | 12-18 |
| 10.0 | 12.0 | 45kgs | Louie | McDuck | 92 | f | 00-06 |
| 10.0 | 12.0 | 45kgs | Louie | McDuck | 95 | f | 06-12 |
| 10.0 | 12.0 | 45kgs | Louie | McDuck | 87 | f | 12-18 |
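The payoff of the long format is that each reading is its own row, so aggregation becomes a one-liner. A hypothetical sketch shaped like the result above:

```python
import pandas as pd

# Hypothetical long-format frame shaped like the cleaned result
df = pd.DataFrame({
    'first_name': ['Mickey', 'Mickey', 'Donald', 'Donald'],
    'hour': ['00-06', '06-12', '00-06', '06-12'],
    'pulse_rate': [72, 69, 85, 84],
})

# Average pulse rate per time slot
mean_by_hour = df.groupby('hour')['pulse_rate'].mean()
```

The same grouping against the original wide layout would require touching six separate columns.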
4. Integrated Code
The complete cleaning workflow is integrated below:
```python
import pandas as pd

# Step 1: Read CSV with custom column headers
column_names = ['id', 'name', 'age', 'weight', 'm0006', 'm0612',
                'm1218', 'f0006', 'f0612', 'f1218']
df = pd.read_csv('../data/patient_heart_rate.csv', names=column_names)

# Step 2: Split name into first_name and last_name
df[['first_name', 'last_name']] = df['name'].str.split(expand=True)
df.drop('name', axis=1, inplace=True)

# Step 3: Convert weight units from lbs to kgs
rows_with_lbs = df['weight'].str.contains('lbs').fillna(False)
for i, lbs_row in df[rows_with_lbs].iterrows():
    weight_lbs = float(lbs_row['weight'][:-3])
    weight_kgs = int(weight_lbs / 2.2)
    df.at[i, 'weight'] = f"{weight_kgs}kgs"

# Step 4: Remove empty rows
df.dropna(how='all', inplace=True)

# Step 5: Remove duplicate records
df.drop_duplicates(subset=['first_name', 'last_name'], inplace=True)

# Step 6: Remove non-ASCII characters
df['first_name'].replace({r'[^\x00-\x7F]+': ''}, regex=True, inplace=True)
df['last_name'].replace({r'[^\x00-\x7F]+': ''}, regex=True, inplace=True)

# Step 7: Reshape data and split sex_hour into sex and hour
sorted_columns = ['id', 'age', 'weight', 'first_name', 'last_name']
df = pd.melt(
    df,
    id_vars=sorted_columns,
    var_name='sex_hour',
    value_name='pulse_rate'
).sort_values(sorted_columns)

df[['sex', 'hour']] = df['sex_hour'].apply(
    lambda x: pd.Series([x[:1], f"{x[1:3]}-{x[3:]}"])
)
df.drop('sex_hour', axis=1, inplace=True)

# Step 8: Remove invalid pulse rate records
row_with_dashes = df['pulse_rate'].str.contains('-').fillna(False)
df.drop(df[row_with_dashes].index, inplace=True)

# Step 9: Reset index for readability
df = df.reset_index(drop=True)

# Print cleaned data
print(df)
```

Additional Notes
Two common and important data cleaning issues not covered in this example:
- Date format handling (standardizing inconsistent date representations)
- Character encoding problems (resolving garbled text from different encodings)
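For reference, a minimal sketch of date standardization with `pd.to_datetime` (the sample strings are hypothetical; here we assume one known day/month/year format, and let unparseable entries become NaT rather than raising):

```python
import pandas as pd

# Hypothetical date strings in a known day/month/year format
dates = pd.Series(['01/03/2021', '15/07/2021', 'not a date'])

# Parse with an explicit format; bad entries become NaT instead of raising
parsed = pd.to_datetime(dates, format='%d/%m/%Y', errors='coerce')

# Re-emit every date in a single canonical representation
standardized = parsed.dt.strftime('%Y-%m-%d')
```

For encoding problems, `pd.read_csv` accepts an `encoding` parameter so the file can be decoded with its actual character set at read time.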
The techniques introduced here cover most basic data cleaning scenarios. For more advanced content, follow the Zhihu column “Data Cleaning”.