How To Use Synthetic Data To Build a Portfolio Project




Image by Author | Canva

 

Introduction

 
Finding real-world datasets can be challenging because they are often private (protected), incomplete (missing features), or expensive (behind a paywall). Synthetic datasets can solve these problems by letting you generate the data based on your project needs.

Synthetic data is artificially generated information that mimics real-life datasets. You can control the size, complexity, and realism of the synthetic dataset to tailor it based on your data needs.

In this article, we’ll explore synthetic data generation methods. We will then build a portfolio project: exploring the data, training a machine learning model, and using AI to develop a complete Streamlit app around it.

 

How to Generate Synthetic Data

 
Synthetic data is typically created in one of four ways: randomly, with rules, with simulations, or with AI.

 
 

// Method 1: Random Data Generation

To generate data randomly, we’ll use simple functions to create values without any specific rules.

It is useful for testing, but it won’t capture realistic relationships between features. We’ll generate the values with NumPy’s random module and store them in a Pandas DataFrame.

import numpy as np
import pandas as pd

np.random.seed(42)  # fix the seed so results are reproducible
df_random = pd.DataFrame({
    "feature_a": np.random.randint(1, 100, 5),         # 5 random integers in [1, 100)
    "feature_b": np.random.rand(5),                     # 5 random floats in [0, 1)
    "feature_c": np.random.choice(["X", "Y", "Z"], 5)   # 5 random categories
})
df_random.head()

 

Here is the output.

 
 

// Method 2: Rule-Based Data Generation

Rule-based data generation is a smarter and more realistic method than random data generation. It follows a precise formula or set of rules. This makes the output purposeful and consistent.

In our example, the size of a house is directly linked to its price. To show this clearly, we will create a dataset with both size and price. We will define the relationship with a formula:

Price = size × 300 + ε (random noise)

This way, you can see the correlation while keeping the data reasonably realistic.

np.random.seed(42)
n = 5
size = np.random.randint(500, 3500, n)                  # home size in sqft
price = size * 300 + np.random.randint(5000, 20000, n)  # $300 per sqft plus random noise

df_rule = pd.DataFrame({
    "size_sqft": size,
    "price_usd": price
})
df_rule.head()

 

Here is the output.

 
 

// Method 3: Simulation-Based Data Generation

The simulation-based data generation method combines random variation with rules from the real world. This mix creates datasets that behave like real ones.

What do we know about housing?

  • Bigger homes usually cost more
  • Some cities cost more than others
  • Prices start from a common baseline

How do we build the dataset?

  1. Pick a city at random
  2. Draw a home size
  3. Set bedrooms between 1 and 5
  4. Compute the price with a clear rule

Price rule: We start with a base price, scale it by the city bump, and then add size × rate.

price_usd = base_price × city_bump + sqft × rate

Here is the code.

import numpy as np
import pandas as pd
rng = np.random.default_rng(42)
CITIES = ["los_angeles", "san_francisco", "san_diego"]
# City price bump: higher means pricier city
CITY_BUMP = {"los_angeles": 1.10, "san_francisco": 1.35, "san_diego": 1.00}

def make_data(n_rows=10):
    city = rng.choice(CITIES, size=n_rows)
    # Most homes are near 1,500 sqft, some smaller or larger
    sqft = rng.normal(1500, 600, n_rows).clip(350, 4500).round()
    beds = rng.integers(1, 6, n_rows)

    base = 220_000
    rate = 350  # dollars per sqft

    bump = np.array([CITY_BUMP[c] for c in city])
    price = base * bump + sqft * rate

    return pd.DataFrame({
        "city": city,
        "sqft": sqft.astype(int),
        "beds": beds,
        "price_usd": price.round(0).astype(int),
    })

df = make_data()
df.head()

 

Here is the output.

 
 

// Method 4: AI-Powered Data Generation

To have AI create your dataset, you need a clear prompt. AI is powerful, but it works best when you set simple, smart rules.

In the following prompt, we will include:

  • Domain: What is the data about?
  • Features: Which columns do we want?
    • City, neighborhood, sqft, bedrooms, bathrooms
  • Relationships: How do the features connect?
    • Price depends on city, sqft, bedrooms, and crime index
  • Format: How should AI return it?

Here is the prompt.

 

Generate Python code that creates a synthetic California real estate dataset.
The dataset should have 10,000 rows with columns: city, neighborhood, latitude, longitude, sqft, bedrooms, bathrooms, lot_sqft, year_built, property_type, has_garage, condition, school_score, crime_index, dist_km_center, price_usd.
Cities: Los Angeles, San Francisco, San Diego, San Jose, Sacramento.
Price should depend on city premium, sqft, bedrooms, bathrooms, lot size, school score, crime index, and distance from city center.
Include some random noise, missing values, and a few outliers.
Return the result as a Pandas DataFrame and save it to 'ca_housing_synth.csv'.
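Before sending the prompt, it helps to know roughly what kind of code to expect back. The sketch below is a compressed, hypothetical version of such a generator; it covers fewer columns than the prompt requests, and every coefficient is invented for illustration (it is not ChatGPT's actual output):

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 10_000
cities = ["Los Angeles", "San Francisco", "San Diego", "San Jose", "Sacramento"]
# Illustrative city premiums; a real generator would tune these
premium = {"Los Angeles": 1.2, "San Francisco": 1.5, "San Diego": 1.1,
           "San Jose": 1.3, "Sacramento": 1.0}

city = rng.choice(cities, size=n)
sqft = rng.normal(1600, 650, n).clip(350, 6000).round()
bedrooms = rng.integers(1, 6, n)
school_score = rng.uniform(1, 10, n).round(1)
crime_index = rng.uniform(0, 100, n).round(1)

# Price combines a city premium with feature effects plus random noise
bump = np.array([premium[c] for c in city])
price = (150_000 * bump + sqft * 400 + bedrooms * 10_000
         + school_score * 8_000 - crime_index * 1_000
         + rng.normal(0, 25_000, n)).round()

df_ai = pd.DataFrame({"city": city, "sqft": sqft, "bedrooms": bedrooms,
                      "school_score": school_score, "crime_index": crime_index,
                      "price_usd": price})

# Inject missing values and a few outliers, as the prompt requests
df_ai.loc[rng.choice(n, 200, replace=False), "school_score"] = np.nan
df_ai.loc[rng.choice(n, 20, replace=False), "price_usd"] *= 3

df_ai.to_csv("ca_housing_synth.csv", index=False)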

 

Let’s use this prompt with ChatGPT.

 
 

It returned the dataset as a CSV. Here is the process that shows how ChatGPT created it.

 
 

This is by far the most complex dataset we have generated. Let’s see its first few rows.

 

 

Building a Portfolio Project from Synthetic Data

 
We used four different techniques to create a synthetic dataset. We will use the AI-generated data to build a portfolio project.

First, we will explore the data and then build a machine learning model. Next, we will visualize the results with Streamlit by leveraging AI, and in the final step, we will outline how to deploy the model to production.

 

 

// Step 1: Exploring and Understanding the Synthetic Dataset

We’ll start exploring the data by first reading it with pandas and showing the first few rows.

df = pd.read_csv("ca_housing_synth.csv")
df.head()

 

Here is the output.

 
 

The dataset includes location (city, neighborhood, latitude, longitude) and property details (size, rooms, year built, condition), as well as the target, price. Let’s check the column names, data types, and row counts using the info method:
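df.info()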

 

 

We have 15 feature columns plus the target, and some of them, like has_garage or dist_km_center, are quite specific.
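Since the prompt explicitly asked for missing values and a few outliers, it’s worth verifying they are there before modeling. A quick check:

# Missing values per column
print(df.isna().sum())

# Target summary; an extreme max relative to the mean hints at outliers
print(df["price_usd"].describe())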

 

// Step 2: Model Building

The next step is to build a machine learning model that predicts home prices.

We will follow these steps:

  1. Define the numeric and categorical columns based on the generated dataset
  2. Split the data into training and test sets
  3. Build preprocessing pipelines: impute and scale the numeric features, impute and one-hot encode the categorical ones
  4. Define a random forest regressor
  5. Combine preprocessing and model into a single pipeline and train it
  6. Evaluate the predictions with MAE, RMSE, and R²
  7. (Optionally) compute permutation importance on a subset for speed
  8. Plot actual vs. predicted prices

Here is the code.

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.inspection import permutation_importance
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# --- Step 1: Define columns based on the generated dataset
num_cols = ["sqft", "bedrooms", "bathrooms", "lot_sqft", "year_built", 
            "school_score", "crime_index", "dist_km_center", "latitude", "longitude"]
cat_cols = ["city", "neighborhood", "property_type", "condition", "has_garage"]

# --- Step 2: Split the data
X = df.drop(columns=["price_usd"])
y = df["price_usd"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# --- Step 3: Preprocessing pipelines
num_pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler())
])
cat_pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encoder", OneHotEncoder(handle_unknown="ignore"))
])

preprocessor = ColumnTransformer([
    ("num", num_pipe, num_cols),
    ("cat", cat_pipe, cat_cols)
])

# --- Step 4: Model
model = RandomForestRegressor(n_estimators=300, random_state=42, n_jobs=-1)

pipeline = Pipeline([
    ("preprocessor", preprocessor),
    ("model", model)
])

# --- Step 5: Train
pipeline.fit(X_train, y_train)

# --- Step 6: Evaluate
y_pred = pipeline.predict(X_test)
mae = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))  # squared=False is deprecated in newer scikit-learn
r2 = r2_score(y_test, y_pred)

print(f"MAE:  {mae:,.0f}")
print(f"RMSE: {rmse:,.0f}")
print(f"R²:   {r2:.3f}")

# --- Step 7: (Optional) Permutation importance on a subset for speed
pi = permutation_importance(
    pipeline, X_test.iloc[:1000], y_test.iloc[:1000],
    n_repeats=3, random_state=42, scoring="r2"
)
# pi.importances_mean holds one mean importance score per input column

# --- Step 8: Plot Actual vs Predicted
plt.figure(figsize=(6, 5))
plt.scatter(y_test, y_pred, alpha=0.25)
vmin, vmax = min(y_test.min(), y_pred.min()), max(y_test.max(), y_pred.max())
plt.plot([vmin, vmax], [vmin, vmax], linestyle="--", color="red")
plt.xlabel("Actual Price (USD)")
plt.ylabel("Predicted Price (USD)")
plt.title(f"Actual vs Predicted (MAE={mae:,.0f}, RMSE={rmse:,.0f}, R²={r2:.3f})")
plt.tight_layout()
plt.show()

 

Here is the output.

 
 

Model Performance:

  • MAE (85,877 USD): On average, predictions are off by about $86K, which is reasonable given the variability in housing prices
  • RMSE (113,512 USD): Larger errors are penalized more; RMSE confirms the model handles considerable deviations fairly well
  • R² (0.853): The model explains ~85% of the variance in home prices, showing strong predictive power for synthetic data
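One detail to handle before the next step: the dashboard prompt below assumes the trained pipeline is saved to disk as real_estate_model.pkl. A minimal way to do that, assuming joblib (which ships alongside scikit-learn):

import joblib

# Persist the full pipeline (preprocessing + model) for the Streamlit app
joblib.dump(pipeline, "real_estate_model.pkl")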

 

// Step 3: Visualize the Data

In this step, we will present our process, including the EDA and model building, in a Streamlit dashboard. Why Streamlit? Because you can build a dashboard quickly and easily deploy it for others to view and interact with.

Using Gemini CLI

To build the Streamlit application, we will use Gemini CLI.

Gemini CLI is an AI-powered open-source command-line agent. You can write code and build applications using Gemini CLI. It is straightforward and free.

To install it, use the following command in your terminal.

npm install -g @google/gemini-cli

 

After installing it, run the following command to launch the agent:

gemini

It will ask you to log in to your Google account, and then you’ll see the screen where you will build this Streamlit app.

 
 

Building a Dashboard

To build a dashboard, we need to write a prompt tailored to our data and goal. In the following prompt, we explain everything the AI needs to build a Streamlit dashboard.

Build a Streamlit app for the California Real Estate dataset by using this dataset ( path-to-dataset )
Here is the dataset information: 
• Domain: California housing — Los Angeles, San Francisco, San Diego, San Jose, Sacramento.
• Location: city, neighborhood, lat, lon, and dist_km_center (haversine to city center).
• Home features: sqft, beds, baths, lot_sqft, year_built, property_type, has_garage, condition.
• Context: school_score, crime_index.
• Target: price_usd.
• Price logic: city premium + size + rooms + lot size + school/crime + distance to center + property type + condition + noise.
• Files you have: ca_housing_synth.csv (data) and real_estate_model.pkl (trained pipeline).

The Streamlit app should have:
• A short dataset overview section (shape, column list, small preview).
• Sidebar inputs for every model feature except the target:
- Categorical dropdowns: city, neighborhood, property_type, condition, has_garage.
- Numeric inputs/sliders: lat, lon, sqft, beds, baths, lot_sqft, year_built, school_score, crime_index.
- Auto-compute dist_km_center from the chosen city using the haversine formula and that city’s center.
• A Predict button that:
- Builds a one-row DataFrame with the exact training columns (order-safe).
- Calls pipeline.predict(...) from real_estate_model.pkl.
- Displays Estimated Price (USD) with thousands separators.
• One chart only: What-if: sqft vs price line chart (all other inputs fixed to the sidebar values).
• Quality of life: cache model load, basic input validation, clear labels/tooltips, English UI.
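For reference, here is one way the app can auto-compute dist_km_center, as the prompt asks. A minimal sketch of the haversine formula; the city-center coordinates below are illustrative assumptions, not values from the dataset:

import math

# Illustrative city-center coordinates (assumed for this sketch)
CITY_CENTERS = {
    "Los Angeles": (34.0522, -118.2437),
    "San Francisco": (37.7749, -122.4194),
}

def haversine_km(lat1, lon1, lat2, lon2):
    # Great-circle distance between two (lat, lon) points, in kilometers
    r = 6371.0  # mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

# Example: distance from a point in San Francisco to the assumed city center
print(haversine_km(37.76, -122.45, *CITY_CENTERS["San Francisco"]))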

 

Next, Gemini will ask your permission to create this file.

 
 

Let’s approve and continue. Once it has finished coding, it will automatically open the Streamlit dashboard.

If not, go to the working directory of the app.py file and run streamlit run app.py to start this Streamlit app.
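Gemini’s generated app.py will vary from run to run, but its core is loading the saved pipeline and predicting on a one-row DataFrame. Here is a hypothetical sketch of that core; the widget defaults and category names such as single_family are assumptions, not Gemini’s actual output:

import joblib
import pandas as pd
import streamlit as st

@st.cache_resource
def load_model():
    # Load the trained pipeline once and cache it across reruns
    return joblib.load("real_estate_model.pkl")

pipeline = load_model()

# Two example inputs; the real app exposes every feature in the sidebar
city = st.sidebar.selectbox("City", ["Los Angeles", "San Francisco", "San Diego",
                                     "San Jose", "Sacramento"])
sqft = st.sidebar.number_input("Square footage", 350, 6000, 1500)

# Placeholder values for the remaining training columns (illustrative only)
defaults = {
    "neighborhood": "Richmond", "property_type": "single_family",
    "condition": "excellent", "has_garage": True, "bedrooms": 3, "bathrooms": 2,
    "lot_sqft": 4000, "year_built": 1990, "school_score": 7.0,
    "crime_index": 30.0, "dist_km_center": 5.0,
    "latitude": 37.77, "longitude": -122.45,
}

if st.button("Predict Price"):
    # One-row DataFrame with the exact columns the pipeline was trained on
    row = pd.DataFrame([{**defaults, "city": city, "sqft": sqft}])
    price = pipeline.predict(row)[0]
    st.metric("Estimated Price (USD)", f"${price:,.0f}")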

Here is our Streamlit dashboard.

 
 

Once you click on the data overview, you’ll see a section for data exploration.

 
 

From the property features panel on the left-hand side, we can customize the property and generate predictions. This part of the dashboard mirrors what we did during model building, but with a more interactive feel.

Let’s select Richmond, San Francisco, single-family, excellent condition, 1500 sqft, and click on the “Predict Price” button:

 
 

The predicted price is $1.24M. Also, you can see the actual vs predicted price in the second graph for the entire dataset once you scroll down.

 
 

You can adjust more features in the left panel, like the year built, crime index, or the number of bathrooms.

 
 

// Step 4: Deploy the Model

The next step is putting your model into production with Streamlit Community Cloud. To do that, you can follow these steps:

  1. Push app.py, real_estate_model.pkl, ca_housing_synth.csv, and a requirements.txt to a public GitHub repository
  2. Sign in to Streamlit Community Cloud at share.streamlit.io with your GitHub account
  3. Click “Create app”, select your repository and branch, and set app.py as the entry point
  4. Deploy; Streamlit installs the dependencies from requirements.txt and serves your app at a shareable URL
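For step 1, a minimal requirements.txt might look like the following; the packages come from the code above, and in practice you should pin the versions you trained with:

streamlit
pandas
numpy
scikit-learn
joblib
matplotlib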

 

Final Thoughts

 
In this article, we explored four methods for creating synthetic datasets: random, rule-based, simulation-based, and AI-powered. We then built a portfolio data project, starting with data exploration and moving on to a machine learning model.

We also used an open-source command-line-based AI agent (Gemini CLI) to develop a dashboard that explores the dataset and predicts house prices based on selected features, including the number of bedrooms, crime index, and square footage.

Creating your own synthetic data lets you avoid privacy hurdles, balance your examples, and move fast without costly data collection. The downside is that it can reflect your assumptions and miss real-world quirks. If you’re looking for more inspiration, check out this list of machine learning projects that you can adapt for your portfolio.

Finally, we looked at how to upload your model to production using Streamlit Community Cloud. Go ahead and follow these steps to build and showcase your portfolio project today!
 
 

Nate Rosidi is a data scientist and works in product strategy. He’s also an adjunct professor teaching analytics, and is the founder of StrataScratch, a platform helping data scientists prepare for their interviews with real interview questions from top companies. Nate writes on the latest trends in the career market, gives interview advice, shares data science projects, and covers everything SQL.