Principal component analysis (PCA) is one of the most popular techniques for reducing the dimensionality of high-dimensional data. It is an important data transformation step in many real-world scenarios and industries, such as image processing, finance, genetics, and machine learning, where data often contains many features that must be analyzed efficiently.
The reasons for the significance of dimensionality reduction techniques like PCA are manifold, with three of them standing out:
- Efficiency: reducing the number of features in your data signifies a reduction in the computational cost of data-intensive processes like training advanced machine learning models.
- Interpretability: projecting your data into a low-dimensional space while keeping its key patterns and properties makes it easier to interpret and visualize in 2D and 3D, sometimes helping you gain insight from the visualization (see the short sketch after this list).
- Noise reduction: often, high-dimensional data may contain redundant or noisy features that, when detected by methods like PCA, can be eliminated while preserving (or even improving) the effectiveness of subsequent analyses.
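To make the interpretability point above more concrete, here is a minimal sketch (using the small Iris dataset purely for illustration, not the data used later in this tutorial) that standardizes the features, projects them onto the first two principal components, and plots the result in 2D:
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Standardize the 4 Iris features, then project them onto 2 principal components
iris = load_iris()
X_2d = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(iris.data))

# Each point is a flower, colored by species, now viewable as a 2D scatter plot
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=iris.target)
plt.xlabel("First principal component")
plt.ylabel("Second principal component")
plt.show()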
Hopefully, at this point I have convinced you of the practical relevance of PCA when handling complex data. If that’s the case, keep reading, as we’ll start getting practical by learning how to use PCA in Python.
How to Apply Principal Component Analysis in Python
Thanks to supporting libraries like Scikit-learn that contain abstracted implementations of the PCA algorithm, using it on your data is relatively straightforward, as long as the data are numerical, previously preprocessed, and free of missing values, with feature values standardized to avoid issues like variance dominance. Standardization is particularly important because PCA is a deeply statistical method that relies on feature variances to determine the principal components: new features derived from the original ones and orthogonal to each other.
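As a quick illustration of that last point, the following sketch (on a small synthetic dataset made up here just for this purpose) fits PCA to three correlated features and checks that the principal components Scikit-learn returns are orthonormal, i.e. their Gram matrix is essentially the identity:
import numpy as np
from sklearn.decomposition import PCA

# 200 samples of 3 correlated features (purely synthetic, for illustration only)
rng = np.random.default_rng(0)
toy = rng.normal(size=(200, 3)) @ np.array([[2.0, 0.5, 0.1],
                                            [0.0, 1.0, 0.3],
                                            [0.0, 0.0, 0.2]])

pca_toy = PCA(n_components=3).fit(toy)

# The rows of components_ are the principal axes; since they are orthonormal,
# their Gram matrix should be (numerically close to) the 3x3 identity matrix
print(np.round(pca_toy.components_ @ pca_toy.components_.T, 6))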
We will start our step-by-step example of using PCA in Python by importing the necessary libraries, loading the MNIST dataset of low-resolution images of handwritten digits, and putting it into a Pandas DataFrame:
import pandas as pd
from torchvision import datasets

# Load the MNIST training split (downloads it on first run)
mnist_data = datasets.MNIST(root="./data", train=True, download=True)

# Flatten each 28x28 image into a row of 784 pixel values, preceded by its label
data = []
for img, label in mnist_data:
    img_array = list(img.getdata())
    data.append([label] + img_array)

columns = ["label"] + [f"pixel_{i}" for i in range(28*28)]
mnist_data = pd.DataFrame(data, columns=columns)
In the MNIST dataset, each instance is a 28×28 square image with a total of 784 pixels, each containing a numerical code for its gray level, ranging from 0 for black (no intensity) to 255 for white (maximum intensity). These data must first be rearranged into a one-dimensional array rather than the two-dimensional 28×28 grid of the original image. This process, called flattening, takes place in the code above, and the final dataset in DataFrame format contains a total of 785 variables: one for each of the 784 pixels plus the label, an integer between 0 and 9 indicating the digit originally written in the image.
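As a quick sanity check on the DataFrame we just built (assuming the code above ran on the standard MNIST training split of 60,000 images), we can confirm the expected number of rows and columns:
# Expected shape: (60000, 785), i.e. 784 pixel columns plus the label column
print(mnist_data.shape)

# Count of images per digit label, 0 through 9
print(mnist_data["label"].value_counts().sort_index())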

MNIST Dataset | Source: TensorFlow
In this example, we won’t need the label, which is useful for other use cases like image classification, but we will assume we may need to keep it handy for future analysis. We will therefore separate it from the rest of the features associated with image pixels and store it in a new variable:
X = mnist_data.drop('label', axis=1)
y = mnist_data.label
Although we will not apply a supervised learning technique after PCA, we will assume we may need to do so in future analyses, hence we will split the dataset into training (80%) and testing (20%) subsets. There’s another reason we are doing this; I’ll clarify it a bit later.
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
Preprocessing the data and making it suitable for the PCA algorithm is as important as applying the algorithm itself. In our example, preprocessing entails scaling the original pixel intensities in the MNIST dataset to a standardized range with a mean of 0 and a standard deviation of 1 so that all features have equal contribution to variance computations, avoiding dominance issues in certain features. To do this, we will use the StandardScaler class from sklearn.preprocessing, which standardizes numerical features:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
Notice the use of fit_transform for the training data, whereas for the test data we used transform instead. This is the other reason why we previously split the data into training and test sets, so we would have the opportunity to discuss this: in data transformations like the standardization of numerical attributes, the transformation applied to the training and test sets must be consistent. The fit_transform method is used on the training data because it calculates the necessary statistics from the training set (fitting) and then applies the transformation. Meanwhile, the transform method is used on the test data, applying the same transformation “learned” from the training data to the test set. This ensures that the model sees the test data on the same scale as the training data, preserving consistency and avoiding issues like data leakage or bias.
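This effect is easy to verify on our own arrays: both subsets are scaled with the mean and standard deviation the scaler learned from the training set, so only the training features end up exactly centered. A small check, assuming the arrays created above:
import numpy as np

# Training features are centered by construction; test features are only
# approximately centered because they reuse the training set's statistics
print(np.round(X_train_scaled.mean(), 4))  # ~0.0
print(np.round(X_test_scaled.mean(), 4))   # close to, but generally not exactly, 0.0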
Now we can apply the PCA algorithm. In Scikit-learn’s implementation, PCA takes an important argument: n_components. When set to a value between 0 and 1, this hyperparameter determines the proportion of the original data’s variance that the retained principal components must capture. Values closer to 1 mean retaining more components and capturing more variance, whereas values closer to 0 mean keeping fewer components and applying a more aggressive dimensionality reduction strategy. For example, setting n_components to 0.95 implies retaining enough components to capture 95% of the original data’s variance, which may be appropriate for reducing the data’s dimensionality while preserving most of its information. If the dimensionality is significantly reduced after applying this setting, it means many of the original features did not contain much statistically relevant information.
from sklearn.decomposition import PCA

# Keep enough components to capture 95% of the variance in the training data
pca = PCA(n_components=0.95)
X_train_reduced = pca.fit_transform(X_train_scaled)

X_train_reduced.shape
Using the shape attribute of the resulting dataset after applying PCA, we can see that the dimensionality of the data has been drastically reduced from 784 features to just 325, while still capturing 95% of the variance in the original data.
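We can double-check both figures directly on the fitted PCA object, using standard attributes that Scikit-learn exposes after fitting:
# Number of components actually retained and the total variance they explain
print(pca.n_components_)                              # 325 in this run
print(round(pca.explained_variance_ratio_.sum(), 4))  # should be at least 0.95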
Is this a good result? Answering this question largely depends on the later application or type of analysis you want to perform with your reduced data. For instance, if you want to build a classifier of digit images, you may want to build two classification models: one trained with the original, high-dimensional dataset, and one trained with the reduced dataset. If there is no significant loss of classification accuracy in your second classifier, good news: you have achieved a faster classifier (dimensionality reduction normally implies greater efficiency in training and inference) with classification performance similar to that of a model trained on the original data.
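As a minimal sketch of that comparison (logistic regression and its settings are illustrative choices made here, not something prescribed by the tutorial), we could train one model on the scaled 784-feature data and another on the PCA-reduced data, remembering to project the test set with the PCA fitted on the training set:
from sklearn.linear_model import LogisticRegression

# Project the (already scaled) test set with the PCA fitted on the training data
X_test_reduced = pca.transform(X_test_scaled)

# Two classifiers: one on the original 784 features, one on the reduced features
clf_full = LogisticRegression(max_iter=1000).fit(X_train_scaled, y_train)
clf_reduced = LogisticRegression(max_iter=1000).fit(X_train_reduced, y_train)

print("Accuracy with 784 features:", clf_full.score(X_test_scaled, y_test))
print("Accuracy after PCA:        ", clf_reduced.score(X_test_reduced, y_test))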
Wrapping Up
This article illustrated, through a step-by-step Python tutorial, how to apply the PCA algorithm using Scikit-learn to a high-dimensional dataset of handwritten digit images.
Iván Palomares Carrascosa is a leader, writer, speaker, and adviser in AI, machine learning, deep learning & LLMs. He trains and guides others in harnessing AI in the real world.