Summary of This Study
- Hardware choices – specifically hardware type and its quantity – along with training time, have a significant positive impact on energy, water, and carbon footprints during AI model training, whereas architecture-related factors do not.
- The interaction between hardware quantity and training time slightly slows the growth of energy, water, and carbon consumption, offsetting the combined main effects by about 0.00002%.
- Overall energy efficiency during AI model training has improved slightly over the years, around 0.13% per year.
- Longer training time can gradually “drain” the overall energy efficiency by 0.03% per hour.
Outline
- Introduction
- Research Question 1: Architectural and Hardware Choices vs Resource Consumption
- Research Question 2: Energy Efficiency over Time
- Methods
- Estimation methods
- Analysis methods
- Results
- RQ1:
- Architecture Factors Don't Hold as Much Predictive Power as Hardware Ones
- Final Model Selection
- Coefficients Interpretation
- RQ2
- Discussion
1. Introduction
Ever since the 1940s, when the first digital computers were invented, scientists have dreamed of creating machines as smart as humans, a pursuit we now call Artificial Intelligence (AI). Fast forward to November 2022, when ChatGPT — an AI model capable of listening and answering instantly — was released, and it felt like a dream come true. Since then, hundreds of new AI models have rushed into the race (take a look at the timeline here). Today, one billion messages are sent through ChatGPT every single day (OpenAI Newsroom, 2024), highlighting how rapidly users have adopted AI. Yet few people stop to ask: what are the environmental costs behind this new convenience?
Before users can ask AI questions, these models must first be trained. Training is the process where models, or algorithms, are fed datasets and try to find the best fit. Imagine a simple regression y = ax + b: training means feeding the algorithm x and y values and letting it find the best parameters a and b. Of course, AI models are typically far more complex than a linear regression. They contain enormous numbers of parameters, and thus require massive amounts of computation and data. Moreover, they need to run on large quantities of specialized hardware that can handle that sheer amount of computation and complexity. All of that combined makes AI training consume far more energy than traditional software.
In addition, AI training requires a stable and uninterrupted energy supply, which primarily comes from non-renewable sources such as natural gas or coal, because solar and wind output can fluctuate with weather conditions (Calvert, 2024). Moreover, due to the intensity of this energy use, data centers — the facilities that house and run AI models — heat up rapidly, which requires large amounts of water for cooling, while the electricity itself carries a significant carbon footprint. Therefore, AI models have broad environmental impacts that include not only energy usage but also water consumption and carbon emissions.
Unfortunately, there is little official, publicly disclosed data on the energy, water, and carbon footprints of AI models. The public remains largely unaware of these environmental impacts and thus has not created strong pressure or motivation for tech companies to make systematic changes. Furthermore, while some improvements have been made — especially in hardware energy efficiency — there remains little systematic or coordinated effort to reduce the overall environmental impact of AI. I therefore hope to raise public awareness of these hidden environmental costs and to explore whether recent improvements in energy efficiency are substantial. More specifically, I address two research questions in this study:
RQ1: Is there a significant relationship between AI models’ architectural and hardware choices and their resource consumption during training?
RQ2: Has AI training become more energy-efficient over time?
2. Methods
This study used the Notable AI Models dataset from Epoch AI (Epoch AI, 2025), a research institute that investigates trends in AI development. The models included were either historically relevant or represented cutting-edge advances in AI. Each model was recorded with key training information such as the number of parameters, dataset size, total compute, hardware type, and hardware quantity, all collected from various sources, including literature reviews, publications, and research papers. The dataset also reported a confidence level for these attributes. To produce a reliable analysis, I evaluated only models with a confidence rating of “Confident” or “Likely”.
As noted earlier, there was limited data regarding direct resource consumption. Fortunately, the dataset authors have estimated Total Power Draw (in watts, or W) based on several factors, including hardware type, hardware quantity, and some other data center efficiency rates and overhead. It is important to note that power and energy are different: power (W) refers to the amount of electricity used per unit of time, while energy (in kilowatt-hours, or kWh) measures the total cumulative electricity consumed over time.
Since this study investigated resource consumption and energy efficiency during the training phase of AI models, I constructed and estimated four environmental metrics: total energy used (kWh), total water used (liters, or L), total carbon emissions (kilograms of CO2e, or kgCO2e), and energy efficiency (FLOPS/W, to be explained later).
a. Estimation methods
First, this study estimated energy consumption by selecting models with available total power draw (W) and training times (hours). Energy was computed as follows:
\[\text{Energy (kWh)} = \frac{\text{Total Power Draw (W)}}{1000} \times \text{Training Time (h)}\]
Next, water consumption and carbon emissions were estimated by rearranging the formulas of two standard rates used in data centers: Water Usage Effectiveness (WUE, in L/kWh) and Carbon Intensity (CI, in kgCO2e/kWh):
\[\text{WUE (L/kWh)} = \frac{\text{Water (L)}}{\text{Energy (kWh)}} \Longrightarrow \text{Water (L)} = \text{WUE (L/kWh)} \times \text{Energy (kWh)}\]
This study used the 2023 average WUE of 0.36 L/kWh reported by Lawrence Berkeley National Laboratory (Shehabi et al., 2024).
\[\mathrm{CI}\left(\frac{\mathrm{kgCO_2e}}{\mathrm{kWh}}\right) = \frac{\mathrm{Carbon\ (kgCO_2e)}}{\mathrm{Energy\ (kWh)}} \Longrightarrow \mathrm{Carbon\ (kgCO_2e)} = \mathrm{CI}\left(\frac{\mathrm{kgCO_2e}}{\mathrm{kWh}}\right) \times \mathrm{Energy\ (kWh)}\]
This study used an average carbon intensity of 0.548 kg CO₂e/kWh, reported by recent environmental research (Guidi et al., 2024).
Finally, this study estimated energy efficiency using the FLOPS/W metric. A floating-point operation (FLOP) is a basic arithmetic operation (e.g., addition or multiplication) with decimal numbers. FLOP per second (FLOPS) measures how many such operations a system can perform each second, and is commonly used to evaluate computing performance. FLOPS per Watt (FLOPS/W) measures how much computing performance is achieved per unit of power consumed:
\[\text{Energy Efficiency (FLOPS/W)} = \frac{\text{Total Compute (FLOP)}}{\text{Training Time (h)} \times 3600 \times \text{Total Power Draw (W)}}\]
It is important to note that FLOPS/W is typically used to measure hardware-level energy efficiency. However, the actual efficiency achieved during AI training may differ from the theoretical efficiency reported for the hardware used. I therefore investigate whether any training-related factors, beyond hardware alone, contribute significantly to overall energy efficiency.
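To make these estimation steps concrete, below is a minimal R sketch of how the four metrics could be computed. The column names Total_power_draw_W, Water_L, Carbon_kgCO2e, and Efficiency_FLOPS_per_W are assumptions for illustration; Energy_kWh, Training_time_hour, and Training_compute_FLOP follow the variable names that appear in the models later in this post.

# Conversion rates used in this study
wue_l_per_kwh <- 0.36      # average WUE (Shehabi et al., 2024)
ci_kg_per_kwh <- 0.548     # average carbon intensity (Guidi et al., 2024)

# Energy (kWh) = power (W) / 1000 * training time (h)
df$Energy_kWh    <- df$Total_power_draw_W / 1000 * df$Training_time_hour
df$Water_L       <- wue_l_per_kwh * df$Energy_kWh
df$Carbon_kgCO2e <- ci_kg_per_kwh * df$Energy_kWh

# FLOPS/W: total compute divided by total energy expressed in watt-seconds (joules)
df$Efficiency_FLOPS_per_W <- df$Training_compute_FLOP /
  (df$Training_time_hour * 3600 * df$Total_power_draw_W)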
b. Analysis methods
RQ1: Architectural and Hardware Choices vs Resource Consumption
Among energy, water, and carbon consumption, I focused on modeling energy consumption, since water and carbon are derived directly from energy using fixed conversion rates and all three response variables therefore share the same distribution shape. As a result, I assume the best-fitting model for energy consumption can safely be applied to water and carbon. While the statistical models are identical, I still report the results for all three to quantify how many kilowatt-hours of energy, liters of water, and kilograms of carbon are used for every unit increase in each significant factor. That way, I hope to communicate the environmental impacts of AI in more holistic, concrete, and tangible terms.


Based on Figure 1, the histogram of energy showed extreme right skew and the presence of some outliers. Therefore, I applied a log transformation to the energy data, aiming to stabilize variance and move the distribution closer to normality (Fig. 2). A Shapiro-Wilk test confirmed that the log-transformed energy data is approximately normal (p = 0.5). Based on this, two types of distributions were considered: the Gaussian (normal) and the Gamma distribution. While the Gaussian distribution is appropriate for symmetric, normally distributed data, the Gamma distribution is better suited to positive, skewed data and is commonly used in engineering modeling where small values occur more frequently than large ones. For each distribution, I compared two approaches to incorporating the log transformation: directly log-transforming the response variable versus using a log link function within a generalized linear model (GLM). I identified the best combination of distribution and log approach by evaluating the Akaike Information Criterion (AIC), diagnostic plots, and prediction accuracy.
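As a rough sketch of how this comparison can be set up in R (the right-hand side is abbreviated to three predictors here; the full candidate set is introduced in the next paragraph, and the Gamma fit to the log-transformed response is analogous):

# Normality check on the log-transformed response
shapiro.test(log(df$Energy_kWh))

# Candidate specifications: log-transformed response vs. log link
f_raw <- Energy_kWh ~ Training_time_hour + Hardware_quantity + Training_hardware
f_log <- update(f_raw, log(Energy_kWh) ~ .)

m_gauss_logresp <- glm(f_log, family = gaussian, data = df)                # Gaussian, log-transformed response
m_gauss_loglink <- glm(f_raw, family = gaussian(link = "log"), data = df)  # Gaussian, log link
m_gamma_loglink <- glm(f_raw, family = Gamma(link = "log"), data = df)     # Gamma, log link

AIC(m_gauss_logresp, m_gauss_loglink, m_gamma_loglink)   # compare fits
par(mfrow = c(2, 2)); plot(m_gamma_loglink)              # diagnostic plots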
The candidate predictors included Parameters, Training Compute, Dataset Size, Training Time, Hardware Quantity, and Hardware Type. Architecture-related variables comprised Parameters, Training Compute, and Dataset Size, while hardware-related variables consisted of Hardware Quantity and Hardware Type. Training Time did not fall neatly into either category but was included due to its central role in training AI models. After fitting all candidate predictors to the selected GLM specification, I tested for multicollinearity to determine whether any variables should be excluded. Following this, I explored interaction terms, since the effect of one predictor on resource consumption may depend on the level of another. The following interactions were considered based on domain knowledge and various sources (a formula sketch in R follows the list):
- Model Size and Hardware Type: Different hardware types have different memory designs. The larger and more complex the model is, the more memory it requires (Bali, 2025). Energy consumption can be different depending on how the hardware handles memory demands.
- Dataset Size and Hardware Type: Similarly, with different memory designs, hardware may access and read data at different rates (Krashinsky et al., 2020). As dataset size increases, energy consumption can vary depending on how the hardware handles large volumes of data.
- Training Time with Hardware Quantity: Running multiple hardware units at the same time adds extra overhead, like keeping everything in sync (HuggingFace, 2025). As training goes on, these coordination costs can grow and put more strain on the system, leading to faster energy drain.
- Training Time with Hardware Type: As training time increases, energy use may vary across hardware types since some hardware types may manage heat better or maintain performance more consistently over time, while others may slow down or consume more energy.
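As a sketch, these candidates translate into R formulas roughly as follows. The model numbers mirror those reported in the Results, the family and link are the Gamma log link ultimately selected there, and the "+ 0" term drops the global intercept so that each hardware type acts as its own baseline.

m6 <- glm(Energy_kWh ~ Training_time_hour + Hardware_quantity + Training_hardware * Parameters + 0,
          family = Gamma(link = "log"), data = df)
m7 <- glm(Energy_kWh ~ Training_time_hour + Hardware_quantity + Training_hardware * Training_dataset_size + 0,
          family = Gamma(link = "log"), data = df)
m8 <- glm(Energy_kWh ~ Training_time_hour * Hardware_quantity + Training_hardware + 0,
          family = Gamma(link = "log"), data = df)
m9 <- glm(Energy_kWh ~ Training_time_hour * Training_hardware + Hardware_quantity + 0,
          family = Gamma(link = "log"), data = df)

AIC(m6, m7, m8, m9)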
RQ2: Energy Efficiency over Time


The distribution of energy efficiency was highly skewed, and even after a log transformation it remained non-normal and overdispersed. To reduce distortion, I removed one extreme outlier with exceptionally high efficiency, as it was not a frontier model and was likely less influential. A Gamma GLM was then fitted using Publication Date as the primary predictor. If models using the same hardware exhibited wide variation in efficiency, that would suggest that factors beyond the hardware itself contribute to these differences. Therefore, the architecture and hardware predictors from the first research question were also used to assess which variables significantly influence energy efficiency over time.
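A minimal sketch of this modeling step, assuming the efficiency values are stored in a column named Efficiency_FLOPS_per_W, the publication date in Publication_date, and the outlier-filtered data in df_eff (all three names are assumptions):

# Gamma GLM (log link) of energy efficiency vs. publication date
# Note: if Publication_date is stored as a Date, the slope is per day and
# should be rescaled to obtain a per-year rate.
m_eff_date <- glm(Efficiency_FLOPS_per_W ~ Publication_date,
                  family = Gamma(link = "log"), data = df_eff)
summary(m_eff_date)

# Follow-up: which RQ1 predictors explain the within-hardware variation?
m_eff_full <- glm(Efficiency_FLOPS_per_W ~ Publication_date + Parameters + Training_dataset_size +
                    Training_time_hour + Hardware_quantity + Training_hardware,
                  family = Gamma(link = "log"), data = df_eff)
summary(m_eff_full)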
3. Results
RQ1: Architectural and Hardware Choices vs Resource Consumption
I ultimately used a Gamma GLM with a log link to model resource consumption. This combination was chosen because it had a lower AIC value (1780.85) than the Gaussian log-link model (2005.83) and produced predictions that matched the raw data more closely than models using a log-transformed response variable. Those log-transformed models generated predictions that substantially underestimated the actual data on the original scale (see this article on why log-transforming didn’t work in my case).
Architecture Factors Don't Hold as Much Predictive Power as Hardware Ones
After fitting all candidate explanatory variables to a Gamma log-link GLM, I found that two architecture-related variables — Parameters and Dataset Size — did not exhibit a significant relationship with resource consumption (p > 0.5). A multicollinearity test also showed that Dataset Size and Training Compute were highly correlated with other predictors (GVIF > 6). Based on this, I hypothesized that all three architecture variables (Parameters, Dataset Size, and Training Compute) may not hold much predictive power. I then removed all three from the model, and an ANOVA test confirmed that the simplified models (Models 4 and 5) were not significantly worse than the full model (Model 1), with p > 0.05:
Model 1: Energy_kWh ~ Parameters + Training_compute_FLOP + Training_dataset_size +
Training_time_hour + Hardware_quantity + Training_hardware +
0
Model 2: Energy_kWh ~ Parameters + Training_compute_FLOP + Training_time_hour +
Hardware_quantity + Training_hardware
Model 3: Energy_kWh ~ Parameters + Training_dataset_size + Training_time_hour +
Hardware_quantity + Training_hardware
Model 4: Energy_kWh ~ Parameters + Training_time_hour + Hardware_quantity +
Training_hardware + 0
Model 5: Energy_kWh ~ Training_time_hour + Hardware_quantity + Training_hardware +
0
Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1 46 108.28
2 47 111.95 -1 -3.6700 0.07809 .
3 47 115.69 0 -3.7471
4 48 116.09 -1 -0.3952 0.56314
5 49 116.61 -1 -0.5228 0.50604
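The multicollinearity check and the nested-model comparison above can be reproduced with calls along these lines (the model object names model1 through model5 are hypothetical):

library(car)   # for vif(), which reports GVIF when factors are present

# Multicollinearity check on the full predictor set
# (fitted with an intercept here, since GVIF is not meaningful without one)
m_full <- glm(Energy_kWh ~ Parameters + Training_compute_FLOP + Training_dataset_size +
                Training_time_hour + Hardware_quantity + Training_hardware,
              family = Gamma(link = "log"), data = df)
vif(m_full)

# Nested-model comparison reported above
anova(model1, model2, model3, model4, model5, test = "Chisq")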
Proceeding with Model 5, I found that Training Time and Hardware Quantity showed significant positive relationships with energy consumption (GLM: training time, t = 9.70, p < 0.001; hardware quantity, t = 6.89, p < 0.001). All hardware types were also statistically significant (p < 0.001), indicating strong variation in energy use across types. Detailed results are presented below:
glm(formula = Energy_kWh ~ Training_time_hour + Hardware_quantity +
Training_hardware + 0, family = Gamma(link = "log"), data = df)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
Training_time_hour 1.351e-03 1.393e-04 9.697 5.54e-13 ***
Hardware_quantity 3.749e-04 5.444e-05 6.886 9.95e-09 ***
Training_hardwareGoogle TPU v2 7.213e+00 7.614e-01 9.474 1.17e-12 ***
Training_hardwareGoogle TPU v3 1.060e+01 3.183e-01 33.310 < 2e-16 ***
Training_hardwareGoogle TPU v4 1.064e+01 4.229e-01 25.155 < 2e-16 ***
Training_hardwareHuawei Ascend 910 1.021e+01 1.126e+00 9.068 4.67e-12 ***
Training_hardwareNVIDIA A100 1.083e+01 3.224e-01 33.585 < 2e-16 ***
Training_hardwareNVIDIA A100 SXM4 40 GB 1.084e+01 5.810e-01 18.655 < 2e-16 ***
Training_hardwareNVIDIA A100 SXM4 80 GB 1.149e+01 5.754e-01 19.963 < 2e-16 ***
Training_hardwareNVIDIA GeForce GTX 285 3.065e+00 1.077e+00 2.846 0.00644 **
Training_hardwareNVIDIA GeForce GTX TITAN X 6.377e+00 7.614e-01 8.375 5.13e-11 ***
Training_hardwareNVIDIA GTX Titan Black 6.371e+00 1.079e+00 5.905 3.28e-07 ***
Training_hardwareNVIDIA H100 SXM5 80GB 1.149e+01 6.825e-01 16.830 < 2e-16 ***
Training_hardwareNVIDIA P100 5.910e+00 7.066e-01 8.365 5.32e-11 ***
Training_hardwareNVIDIA Quadro P600 5.278e+00 1.081e+00 4.881 1.16e-05 ***
Training_hardwareNVIDIA Quadro RTX 4000 5.918e+00 1.085e+00 5.455 1.60e-06 ***
Training_hardwareNVIDIA Quadro RTX 5000 4.932e+00 1.081e+00 4.563 3.40e-05 ***
Training_hardwareNVIDIA Tesla K80 9.091e+00 7.760e-01 11.716 8.11e-16 ***
Training_hardwareNVIDIA Tesla V100 DGXS 32 GB 1.059e+01 6.546e-01 16.173 < 2e-16 ***
Training_hardwareNVIDIA Tesla V100S PCIe 32 GB 1.089e+01 1.078e+00 10.099 1.45e-13 ***
Training_hardwareNVIDIA V100 9.683e+00 4.106e-01 23.584 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for Gamma family taken to be 1.159293)
Null deviance: 2.7045e+08 on 70 degrees of freedom
Residual deviance: 1.1661e+02 on 49 degrees of freedom
AIC: 1781.2
Number of Fisher Scoring iterations: 25
Final Model Selection
To better capture possible non-additive effects, various interaction terms were explored and compared by AIC. Table 1 summarizes the tested models and their AIC scores:
| Model | Predictors | AIC |
|---|---|---|
| 5 | Training Time + Hardware Quantity + Hardware Type | 350.78 |
| 6 | Training Time + Hardware Quantity + Hardware Type * Parameters | 357.97 |
| 7 | Training Time + Hardware Quantity + Hardware Type * Dataset Size | 335.89 |
| 8 | Training Time * Hardware Quantity + Hardware Type | 345.39 |
| 9 | Training Time * Hardware Type + Hardware Quantity | 333.03 |
Although the AIC scores did not vary drastically, meaning the model fits were similar, Model 8 was preferred as it was the only one with significant effects in both the main terms and the interaction. Interactions involving Hardware Type were not significant despite some yielding better AIC values, likely due to the limited sample size across 18 hardware types.
In Model 8, both Training Time and Hardware Quantity showed significant positive relationships with energy consumption (GLM: training time, t = 11.09, p < 0.001; hardware quantity, t = 7.32, p < 0.001; Fig. 3a). Their interaction term was significantly negative (GLM: t = –4.32, p < 0.001), suggesting that energy consumption grows more slowly when training time increases alongside a higher number of hardware units. All hardware types remained significant (p < 0.001). Detailed results are shown below:
glm(formula = Energy_kWh ~ Training_time_hour * Hardware_quantity +
Training_hardware + 0, family = Gamma(link = "log"), data = df)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
Training_time_hour 1.818e-03 1.640e-04 11.088 7.74e-15 ***
Hardware_quantity 7.373e-04 1.008e-04 7.315 2.42e-09 ***
Training_hardwareGoogle TPU v2 7.136e+00 7.379e-01 9.670 7.51e-13 ***
Training_hardwareGoogle TPU v3 1.004e+01 3.156e-01 31.808 < 2e-16 ***
Training_hardwareGoogle TPU v4 1.014e+01 4.220e-01 24.035 < 2e-16 ***
Training_hardwareHuawei Ascend 910 9.231e+00 1.108e+00 8.331 6.98e-11 ***
Training_hardwareNVIDIA A100 1.028e+01 3.301e-01 31.144 < 2e-16 ***
Training_hardwareNVIDIA A100 SXM4 40 GB 1.057e+01 5.635e-01 18.761 < 2e-16 ***
Training_hardwareNVIDIA A100 SXM4 80 GB 1.093e+01 5.751e-01 19.005 < 2e-16 ***
Training_hardwareNVIDIA GeForce GTX 285 3.042e+00 1.043e+00 2.916 0.00538 **
Training_hardwareNVIDIA GeForce GTX TITAN X 6.322e+00 7.379e-01 8.568 3.09e-11 ***
Training_hardwareNVIDIA GTX Titan Black 6.135e+00 1.047e+00 5.862 4.07e-07 ***
Training_hardwareNVIDIA H100 SXM5 80GB 1.115e+01 6.614e-01 16.865 < 2e-16 ***
Training_hardwareNVIDIA P100 5.715e+00 6.864e-01 8.326 7.12e-11 ***
Training_hardwareNVIDIA Quadro P600 4.940e+00 1.050e+00 4.705 2.18e-05 ***
Training_hardwareNVIDIA Quadro RTX 4000 5.469e+00 1.055e+00 5.184 4.30e-06 ***
Training_hardwareNVIDIA Quadro RTX 5000 4.617e+00 1.049e+00 4.401 5.98e-05 ***
Training_hardwareNVIDIA Tesla K80 8.631e+00 7.587e-01 11.376 3.16e-15 ***
Training_hardwareNVIDIA Tesla V100 DGXS 32 GB 9.994e+00 6.920e-01 14.443 < 2e-16 ***
Training_hardwareNVIDIA Tesla V100S PCIe 32 GB 1.058e+01 1.047e+00 10.105 1.80e-13 ***
Training_hardwareNVIDIA V100 9.208e+00 3.998e-01 23.030 < 2e-16 ***
Training_time_hour:Hardware_quantity -2.651e-07 6.130e-08 -4.324 7.70e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for Gamma family taken to be 1.088522)
Null deviance: 2.7045e+08 on 70 degrees of freedom
Residual deviance: 1.0593e+02 on 48 degrees of freedom
AIC: 1775
Number of Fisher Scoring iterations: 25

Coefficients Interpretation
To interpret the coefficients further, we can exponentiate each coefficient and subtract one to estimate the percent change in the response variable for each additional unit of the predictor (Popovic, 2022). For energy consumption, each additional hour of training increases energy use by about 0.18%, each additional hardware unit adds about 0.07%, and their interaction reduces the combined main effects by roughly 0.00002%. Since water and carbon are directly proportional to energy, the percent changes for training time, hardware quantity, and their interaction are identical in those models (Fig. 3b, Fig. 3c). However, because hardware types are categorical variables that function as baseline intercepts, their values differ across the energy, water, and carbon models to reflect differences in overall scale.
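For instance, applying this rule to Model 8's coefficients (the object name model8 is hypothetical) reproduces the percentages above:

# Percent change per additional unit of each predictor: 100 * (exp(beta) - 1)
betas <- coef(model8)[c("Training_time_hour", "Hardware_quantity",
                        "Training_time_hour:Hardware_quantity")]
round(100 * (exp(betas) - 1), 5)
# ~0.18% per extra training hour, ~0.07% per extra hardware unit,
# and a tiny negative adjustment from the interaction term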


RQ2: Energy Efficiency over Time
I also used a log-linked Gamma GLM to examine the relationship between energy efficiency and publication date, as the Shapiro-Wilk test indicated that the log-transformed data was not normally distributed (p < 0.001). There was a positive relationship between Publication Date and Energy Efficiency, with an estimated improvement of 0.13% per year (GLM: t = 8.005, p < 0.001; Fig. 3d).

To investigate further, I examined the trends by individual hardware type and observed noticeable variation in efficiency among AI models using the same hardware (Fig. 3e). Among all architecture and hardware choices, Training Time was the only statistically significant factor influencing energy efficiency (GLM: t = 8.581, p < 0.001), with longer training time decreasing energy efficiency by about 0.03% per hour.

4. Discussion
This study found that hardware choices — including Hardware Type and Hardware Quantity — along with Training Time have a significant relationship with each type of resource consumption during AI model training, while architecture variables do not. I suspect that Training Time may have implicitly captured some of the underlying effects of those architecture-related factors. In addition, the interaction between Training Time and Hardware Quantity also contributes to resource usage. However, this analysis is constrained by a small dataset (70 valid models) spread across 18 hardware types, which likely limits the statistical power of the hardware-involved interaction terms. Further research could explore these interactions with larger and more diverse datasets.
To illustrate how resource-intensive AI training can be, I used Model 8 to predict the baseline resource consumption for a single hour of training on one NVIDIA A100 chip (a code sketch for reproducing these predictions follows the list). Here are the predictions for each type of resource under this simple setup:
- Energy: The predicted energy use is 29,213 kWh, nearly three times the annual energy consumption of an average U.S. household (10,500 kWh/year) (U.S. Energy Information Administration, 2023), with each extra hour adding 5,258 kWh and each extra chip adding 2,044 kWh.
- Water: Similarly, the same training session would consume 10,521 liters of water, almost ten times the average U.S. household’s daily water use (300 gallons, or 1,135 liters/day) (United States Environmental Protection Agency, 2024), with each extra hour adding 1,894 liters and each extra chip adding 736 liters.
- Carbon: The predicted carbon emission is 16,009 kg CO2e, about four times the annual emissions of a U.S. household (4,000 kg/year) (University of Michigan, 2024), with each extra hour adding 2,881 kg and each extra chip adding 1,120 kg.
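These baseline figures can be reproduced with predict() on the fitted Gamma GLMs (the object names model8_energy, model8_water, and model8_carbon are hypothetical; each refers to the Model 8 specification fitted to the corresponding response). Because water and carbon are fixed multiples of energy, multiplying the energy prediction by 0.36 and 0.548 gives the same results.

# Baseline scenario: one hour of training on a single NVIDIA A100
baseline <- data.frame(Training_time_hour = 1,
                       Hardware_quantity  = 1,
                       Training_hardware  = "NVIDIA A100")

predict(model8_energy, newdata = baseline, type = "response")  # ~29,000 kWh
predict(model8_water,  newdata = baseline, type = "response")  # ~10,500 L
predict(model8_carbon, newdata = baseline, type = "response")  # ~16,000 kg CO2e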
This study also found that AI models have become more energy-efficient over time, but only slightly, with an estimated improvement of 0.13% per year. This suggests that while newer hardware is more efficient, its adoption has not been widespread. And although the environmental impact of AI may be mitigated over time as hardware becomes more efficient, a focus on hardware alone may overlook other contributors to overall energy consumption. In this dataset, both Training Compute and Total Power Draw are often estimated values and may include system-level overhead beyond the hardware itself. Therefore, the efficiency estimates in this study may reflect not just hardware performance but also other training-related overhead. Indeed, this study observed substantial variation in energy efficiency even among models using the same hardware, and one key finding is that longer training time can “drain” energy efficiency, reducing it by approximately 0.03% per hour. Further studies should explore how training practices, beyond hardware selection, affect the environmental costs of AI development.
References
Calvert, B. 2024. AI already uses as much energy as a small country. It’s only the beginning. Vox. https://www.vox.com/climate/2024/3/28/24111721/climate-ai-tech-energy-demand-rising
OpenAI Newsroom. 2024. Fresh numbers shared by @sama earlier today: 300M weekly active ChatGPT users; 1B user messages sent on ChatGPT every day; 1.3M devs have built on OpenAI in the US. Tweet via X. https://x.com/OpenAINewsroom/status/1864373399218475440
Epoch AI. 2025. Data on Notable AI Models. Epoch AI. https://epoch.ai/data/notable-ai-models
Shehabi, A., S.J. Smith, A. Hubbard, A. Newkirk, N. Lei, M.A.B. Siddik, B. Holecek, J. Koomey, E. Masanet, and D. Sartor. 2024. 2024 United States Data Center Energy Usage Report. Lawrence Berkeley National Laboratory, Berkeley, California. LBNL-2001637.
Guidi, G., F. Dominici, J. Gilmour, K. Butler, E. Bell, S. Delaney, and F.J. Bargagli-Stoffi. 2024. Environmental Burden of United States Data Centers in the Artificial Intelligence Era. arXiv abs/2411.09786.
Bali, S. 2025. GPU Memory Essentials for AI Performance. NVIDIA Developer. https://developer.nvidia.com/blog/gpu-memory-essentials-for-ai-performance/
Krashinsky, R., O. Giroux, S. Jones, N. Stam, and S. Ramaswamy. 2020. NVIDIA Ampere Architecture In-Depth. NVIDIA Developer. https://developer.nvidia.com/blog/nvidia-ampere-architecture-in-depth/
HuggingFace. 2025. Performance Tips for Training on Multiple GPUs. HuggingFace Documentation. https://huggingface.co/docs/transformers/en/perf_train_gpu_many
Popovic, G. 2022. Interpreting GLMs. Environmental Computing. https://environmentalcomputing.net/statistics/glms/interpret-glm-coeffs/
U.S. Energy Information Administration. 2023. Use of Energy Explained: Electricity Use in Homes. https://www.eia.gov/energyexplained/use-of-energy/electricity-use-in-homes.php
United States Environmental Protection Agency. 2024. How We Use Water. https://www.epa.gov/watersense/how-we-use-water
Center for Sustainable Systems, University of Michigan. 2024. Carbon Footprint Factsheet. Pub. No. CSS09–05.