How I Built a Data Cleaning Pipeline Using One Messy DoorDash Dataset


 

Introduction

 
According to a CrowdFlower survey, data scientists spend about 60% of their time organizing and cleaning data.

In this article, we’ll walk through building a data cleaning pipeline using a real-life dataset from DoorDash. It contains nearly 200,000 food delivery records, each described by over a dozen features such as the order and delivery timestamps, total item count, and store category (e.g., Mexican, Thai, or American cuisine).

 

Predicting Food Delivery Times with DoorDash Data

 
 
DoorDash aims to accurately estimate how long a delivery will take, from the moment a customer places an order to the moment the food arrives at their door. In this data project, we are tasked with developing a model that predicts the total delivery duration based on historical delivery data.

However, we won’t do the whole project—i.e., we won’t build a predictive model. Instead, we’ll use the dataset provided in the project and create a data cleaning pipeline.

Our workflow consists of two major steps.

 
[Image: the two-step workflow, data exploration followed by building the data cleaning pipeline]
 

 

Data Exploration

 
 

Let’s start by loading and viewing the first few rows of the dataset.

 

// Load and Preview the Dataset

import pandas as pd
df = pd.read_csv("historical_data.csv")
df.head()

 

Here is the output.

 
[Screenshot: the first five rows of the dataset returned by df.head()]
 

This dataset includes datetime columns that capture the order creation time and actual delivery time, which can be used to calculate delivery duration. It also contains other features such as store category, total item count, subtotal, and minimum item price, making it suitable for various types of data analysis. We can already see that there are some NaN values, which we’ll explore more closely in the following step.

 

// Explore The Columns With info()

Let’s inspect all column names with the info() method. We will use this method throughout the article to see the changes in column value counts; it’s a good indicator of missing data and overall data health.
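The call itself is a single line:

df.info()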

 

Here is the output.

 
[Screenshot: df.info() output showing the 15 columns and their non-null counts]
 

As you can see, we have 15 columns, but the number of non-null values differs across them. This means some columns contain missing values, which could affect our analysis if not handled properly. One last thing: the created_at and actual_delivery_time data types are objects; these should be datetime.

 

Building the Data Cleaning Pipeline

 
In this step, we build a structured data cleaning pipeline to prepare the dataset for modeling. Each stage addresses a common data quality issue, such as incorrect timestamp formats or missing values.
 
[Image: the stages of the data cleaning pipeline]
 

// Fixing the Date and Time Columns’ Data Types

Before doing any analysis, we need to fix the columns that store timestamps. Otherwise, the calculation we mentioned earlier (actual_delivery_time - created_at) will go wrong.

What we’re fixing:

  • created_at: when the order was placed
  • actual_delivery_time: when the food arrived

These two columns are stored as objects, so to do calculations on them correctly, we have to convert them to the datetime format. To do that, we can use the pd.to_datetime() function in pandas. Here is the code.

import pandas as pd
df = pd.read_csv("historical_data.csv")
# Convert timestamp strings to datetime objects
df["created_at"] = pd.to_datetime(df["created_at"], errors="coerce")
df["actual_delivery_time"] = pd.to_datetime(df["actual_delivery_time"], errors="coerce")
df.info()

 

Here is the output.

 
[Screenshot: df.info() output with created_at and actual_delivery_time now shown as datetime64 columns]
 

As you can see from the screenshot above, the created_at and actual_delivery_time are datetime objects now.
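With both columns converted, we can already sanity-check the quantity we ultimately care about. The short sketch below is an optional extra step, not part of the original walkthrough; it computes the delivery duration in seconds as a standalone Series, so the DataFrame itself stays unchanged for the rest of the pipeline.

# Quick sanity check: delivery duration in seconds, kept outside the DataFrame
delivery_duration = (df["actual_delivery_time"] - df["created_at"]).dt.total_seconds()
delivery_duration.describe()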

 
 

Among the key columns, store_primary_category has the fewest non-null values (192,668), which means it has the most missing data. That’s why we’ll focus on cleaning it first.
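If you would rather confirm this at a glance than read through the info() table, a quick one-liner (again, an optional extra) ranks the columns by how many values they are missing:

# Missing values per column, most affected column first
df.isna().sum().sort_values(ascending=False)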

 

// Data Imputation With mode()

One of the messiest columns in the dataset, judging by its high number of missing values, is store_primary_category. It tells us what kind of food each store serves, such as Mexican, American, or Thai. However, many rows are missing this information, which is a problem: it limits how we can group or analyze the data by cuisine. So how do we fix it?

We will fill in these rows instead of dropping them, using an imputation strategy that is a bit smarter than a single global fill.

We build a mapping from each store_id to its most frequent category and use it to fill the missing values, falling back to the overall most frequent category for stores with no known category at all. Let’s see the dataset before doing that.

 
[Screenshot: the dataset before imputation, with NaN values in store_primary_category]
 

Here is the code.

import numpy as np

# Global most-frequent category as a fallback
global_mode = df["store_primary_category"].mode().iloc[0]

# Build store-level mapping to the most frequent category (fast and robust)
store_mode = (
    df.groupby("store_id")["store_primary_category"]
      .agg(lambda s: s.mode().iloc[0] if not s.mode().empty else np.nan)
)

# Fill missing categories using the store-level mode, then fall back to global mode
df["store_primary_category"] = (
    df["store_primary_category"]
      .fillna(df["store_id"].map(store_mode))
      .fillna(global_mode)
)

df.info()

 

Here is the output.

 
[Screenshot: df.info() output after imputation, showing a higher non-null count for store_primary_category]
 

As you can see from the screenshot above, the store_primary_category column now has a higher non-null count. But let’s double-check with this code.

df["store_primary_category"].isna().sum()

 

Here is the output showing the number of NaN values. It’s zero; we got rid of all of them.

 
[Screenshot: output showing 0 remaining NaN values in store_primary_category]
 

And let’s see the dataset after the imputation.

 
[Screenshot: the dataset after imputation, with store_primary_category filled in]

 

// Dropping Remaining NaNs

In the previous step, we corrected the store_primary_category, but did you notice something? The non-null counts across the columns still don’t match!

This is a clear sign that we’re still dealing with missing values in some part of the dataset. Now, when it comes to data cleaning, we have two options:

  • Fill these missing values
  • Drop them

Given that this dataset contains nearly 200,000 rows, we can afford to lose some. With smaller datasets, you’d need to be more cautious. In that case, it is advisable to analyze each column, establish standards (decide how missing values will be filled—using the mean, median, most frequent value, or domain-specific defaults), and then fill them.
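For reference, a column-by-column fill along those lines might look like the sketch below, assuming pandas is already imported as pd. This is purely illustrative and is not applied to our dataset; the median/mode choices are assumptions you would tailor to each column.

def fill_by_column_rules(frame):
    # Illustrative defaults: median for numeric columns, most frequent value otherwise.
    # Assumes every column has at least one observed (non-NaN) value.
    frame = frame.copy()
    for col in frame.columns[frame.isna().any()]:
        if pd.api.types.is_numeric_dtype(frame[col]):
            frame[col] = frame[col].fillna(frame[col].median())
        else:
            frame[col] = frame[col].fillna(frame[col].mode().iloc[0])
    return frame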

To remove the NaNs, we will use the dropna() method from the pandas library. We are setting inplace=True to apply the changes directly to the DataFrame without needing to assign it again. Let’s see the dataset at this point.

 
[Screenshot: the dataset before dropping the remaining rows with NaN values]
 

Here is the code.

df.dropna(inplace=True)
df.info()

 

Here is the output.

 
[Screenshot: df.info() output after dropna(), with matching non-null counts across all columns]
 

As you can see from the screenshot above, each column now has the same number of non-null values.

Let’s see the dataset after all the changes.

 
[Screenshot: the cleaned dataset after all pipeline steps]
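Putting it all together, the steps above can be wrapped into one reusable function. This is a minimal sketch that simply mirrors what we did, assuming the same file and column names; adapt it to your own data as needed.

import pandas as pd
import numpy as np

def clean_doordash_data(path):
    # Minimal cleaning pipeline: fix timestamps, impute categories, drop leftover NaNs
    df = pd.read_csv(path)

    # 1. Convert timestamp strings to datetime objects
    for col in ["created_at", "actual_delivery_time"]:
        df[col] = pd.to_datetime(df[col], errors="coerce")

    # 2. Fill missing store categories with the store-level mode, then the global mode
    global_mode = df["store_primary_category"].mode().iloc[0]
    store_mode = df.groupby("store_id")["store_primary_category"].agg(
        lambda s: s.mode().iloc[0] if not s.mode().empty else np.nan
    )
    df["store_primary_category"] = (
        df["store_primary_category"]
          .fillna(df["store_id"].map(store_mode))
          .fillna(global_mode)
    )

    # 3. Drop any rows that still contain missing values
    return df.dropna()

clean_df = clean_doordash_data("historical_data.csv")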
 

 

// What Can You Do Next?

Now that we have a clean dataset, here are a few things you can do next:

  • Perform EDA to understand delivery patterns.
  • Engineer new features like delivery hours or a busy dashers ratio to add more meaning to your analysis (see the sketch after this list).
  • Analyze correlations between variables to increase your model’s performance.
  • Build different regression models and find the best-performing model.
  • Predict the delivery duration with the best-performing model.
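
As a starting point for the feature ideas above, here is a short sketch. The duration and hour features rely only on the two timestamp columns we already fixed; the busy-dashers ratio assumes the dasher columns are named total_busy_dashers and total_onshift_dashers, so verify the column names in your copy of the dataset first.

# Delivery duration in seconds: the target for a regression model
df["delivery_duration_sec"] = (
    df["actual_delivery_time"] - df["created_at"]
).dt.total_seconds()

# Hour of day the order was placed
df["order_hour"] = df["created_at"].dt.hour

# Share of on-shift dashers that are busy (assumed column names; may produce
# inf when total_onshift_dashers is 0, so clean that up before modeling)
df["busy_dashers_ratio"] = df["total_busy_dashers"] / df["total_onshift_dashers"]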

 

Final Thoughts

 
In this article, we cleaned a real-life DoorDash dataset by addressing common data quality issues, such as incorrect data types and missing values. We built a simple data cleaning pipeline tailored to this data project and explored potential next steps.

Real-world datasets can be messier than you think, but there are also many methods and tricks to solve these issues. Thanks for reading!
 
 

Nate Rosidi is a data scientist and works in product strategy. He’s also an adjunct professor teaching analytics, and is the founder of StrataScratch, a platform helping data scientists prepare for their interviews with real interview questions from top companies. Nate writes on the latest trends in the career market, gives interview advice, shares data science projects, and covers everything SQL.