Top 10 Free API Providers for Data Science Projects

Image by Author | ChatGPT

 

Introduction

 
Getting real-world data for your data science projects is often the hardest part. Toy datasets are easy to find, but for high-quality or real-time data you usually need to use APIs or build custom scraping pipelines to extract information from the web.

In this article, I share my 10 favorite free APIs—the ones I use daily for data collection, data integration, and building AI agents. These APIs are organized into five categories, spanning trusted data repositories, web scraping, and web search, so you can quickly choose the right tool and move from data to insight faster.

 

Foundational Data Repositories

 
A foundational data repository is a community-based platform where different organizations and open-source contributors share their datasets with the wider world. With a simple command, you can access these datasets for your project.

 

// 1. Kaggle API

Kaggle datasets are extremely popular when working on data science projects. Instead of downloading them manually, you can create a data pipeline that will automatically download the dataset, unzip it, and load it into your workspace.

These datasets are shared by the community for everyone to use. To get started, generate an API token from your Kaggle account settings and either save it as kaggle.json or set the KAGGLE_USERNAME and KAGGLE_KEY environment variables. After that, you can run the following command in your terminal. Kaggle also provides a Python SDK for easy integration with your code.

kaggle datasets download -d kingabzpro/world-vaccine-progress -p data --unzip
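The same download can be scripted with the Python SDK mentioned above. A minimal sketch, assuming the kaggle package is installed and your credentials are configured; the helper name download_kaggle_dataset is my own:

```python
def download_kaggle_dataset(slug="kingabzpro/world-vaccine-progress", dest="data"):
    """Download and unzip a Kaggle dataset into `dest`.

    The import lives inside the function so this module loads even
    when the `kaggle` package is not installed.
    """
    from kaggle.api.kaggle_api_extended import KaggleApi

    api = KaggleApi()
    api.authenticate()  # reads kaggle.json or KAGGLE_USERNAME / KAGGLE_KEY
    api.dataset_download_files(slug, path=dest, unzip=True)

# download_kaggle_dataset()  # uncomment to run; needs network + credentials
```

This mirrors the CLI call exactly, which makes it easy to drop into a scheduled data pipeline.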

 

// 2. Hugging Face CLI

Similar to Kaggle, Hugging Face is a data science and machine learning community where people share datasets, models, and demos. You can easily install the Hugging Face CLI and integrate it into your workflows through either CLI commands or Python code. Both options let you download public datasets without an access token; a token is only required when the dataset is gated.

hf download kingabzpro/dermatology-qa-firecrawl-dataset --repo-type dataset
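For the Python route, the huggingface_hub library exposes snapshot_download, which fetches every file in a repo. A sketch for the same dataset; the wrapper name download_hf_dataset is my own:

```python
def download_hf_dataset(repo_id="kingabzpro/dermatology-qa-firecrawl-dataset"):
    """Download all files in a Hugging Face dataset repo; returns the local path."""
    from huggingface_hub import snapshot_download  # lazy import

    return snapshot_download(repo_id=repo_id, repo_type="dataset")

# local_path = download_hf_dataset()  # needs network; a token only if gated
```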

 

Web and Crawling APIs

 
The web contains a wide variety of data. If you can’t find the information you need on the platforms mentioned above, you may need to curate your own data by scraping the web or using a web search API.

 

// 3. Firecrawl

Firecrawl provides an API for extracting content from websites and converting it into Markdown for easier AI integrations. It also offers scraping and extraction endpoints backed by an LLM (large language model) for more advanced web scraping options.

This API is a must-have. I use it every day for data creation and for integrating it into my AI projects.

curl -s -X POST "https://api.firecrawl.dev/v2/scrape" \
  -H "Authorization: Bearer $FIRECRAWL_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://abid.work",
    "formats": ["markdown", "html"]
  }'
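The same request translates directly to Python with the requests library. A sketch with the same endpoint and fields as the curl call above; the helper names firecrawl_payload and scrape are my own:

```python
import os

def firecrawl_payload(url, formats=("markdown", "html")):
    """Build the JSON body matching the curl example."""
    return {"url": url, "formats": list(formats)}

def scrape(url):
    import requests  # lazy import

    resp = requests.post(
        "https://api.firecrawl.dev/v2/scrape",
        headers={"Authorization": f"Bearer {os.environ['FIRECRAWL_API_KEY']}"},
        json=firecrawl_payload(url),
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()

# result = scrape("https://abid.work")  # requires FIRECRAWL_API_KEY
```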

 

// 4. Tavily

Tavily is a fast web search API that provides 1,000 search requests per month for free. It is both accurate and quick. You can use it to create datasets, integrate it into your AI projects, or utilize it as a simple search API for your development needs.

curl --request POST \
  --url https://api.tavily.com/search \
  --header "Authorization: Bearer <token>" \
  --header "Content-Type: application/json" \
  --data '{
    "query": "who is Leo Messi?",
    "auto_parameters": false,
    "topic": "general",
    "search_depth": "basic",
    "chunks_per_source": 3,
    "max_results": 1,
    "days": 7,
    "include_answer": true,
    "include_raw_content": true,
    "include_images": false,
    "include_image_descriptions": false,
    "include_favicon": false,
    "include_domains": [],
    "exclude_domains": [],
    "country": null
  }'
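Tavily also ships an official Python client (the tavily-python package), which is the more convenient path when building AI agents. A sketch assuming the package is installed and TAVILY_API_KEY is set; the wrapper name tavily_search is my own:

```python
import os

def tavily_search(query, max_results=1):
    """Run a basic Tavily search and return the response dict."""
    from tavily import TavilyClient  # lazy import

    client = TavilyClient(api_key=os.environ["TAVILY_API_KEY"])
    return client.search(query, search_depth="basic",
                         max_results=max_results, include_answer=True)

# print(tavily_search("who is Leo Messi?"))  # needs network + API key
```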

 

Geospatial and Weather APIs

 
Weather and geospatial data change constantly, which is why you need real-time access to these datasets via an API rather than a static download.

 

// 5. OpenWeatherMap

OpenWeatherMap is a service that provides global weather data via APIs, including current conditions, forecasts, nowcasts, historical records, and even minute-by-minute hyperlocal precipitation forecasts.

curl "https://api.openweathermap.org/data/2.5/weather?q=London&appid=YOUR_API_KEY&units=metric"
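In a data science workflow you will usually call this endpoint from Python. A sketch with requests, using the same query parameters as the curl example; the helper names owm_params and current_weather are my own, and the OWM_API_KEY variable name is an assumption:

```python
import os

BASE_URL = "https://api.openweathermap.org/data/2.5/weather"

def owm_params(city, units="metric"):
    """Query parameters matching the curl example above."""
    return {"q": city,
            "appid": os.environ.get("OWM_API_KEY", "YOUR_API_KEY"),
            "units": units}

def current_weather(city):
    import requests  # lazy import

    resp = requests.get(BASE_URL, params=owm_params(city), timeout=30)
    resp.raise_for_status()
    return resp.json()

# print(current_weather("London")["main"]["temp"])  # requires a valid key
```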

 

// 6. OpenStreetMap

OpenStreetMap provides world map data, and the Overpass API is a read-only web database that serves custom-selected parts of OSM and can be queried with Overpass QL. The example below fetches cafe nodes within a small London bounding box.

curl -G "https://overpass-api.de/api/interpreter" \
  --data-urlencode 'data=[out:json];node["amenity"="cafe"](51.50,-0.15,51.52,-0.10);out;'
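The same Overpass query is easy to run from Python and turn into analysis-ready rows. A sketch that extracts coordinates and names from the JSON response; the function name fetch_cafes is my own:

```python
OVERPASS_URL = "https://overpass-api.de/api/interpreter"
# Same Overpass QL query as the curl example: cafe nodes in a London bounding box
QUERY = '[out:json];node["amenity"="cafe"](51.50,-0.15,51.52,-0.10);out;'

def fetch_cafes():
    """POST the query to Overpass and return (lat, lon, name) tuples."""
    import requests  # lazy import

    resp = requests.post(OVERPASS_URL, data={"data": QUERY}, timeout=60)
    resp.raise_for_status()
    return [(el["lat"], el["lon"], el.get("tags", {}).get("name", ""))
            for el in resp.json()["elements"]]

# cafes = fetch_cafes()  # requires network
```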

 

Financial Market Data APIs

 
Financial market data APIs are highly recommended if you are working on a financial project and need real-time data on stocks, crypto, and other finance-related information and news.

 

// 7. Alpha Vantage

Alpha Vantage is a financial data platform offering free APIs for real-time and historical market data across stocks, forex, cryptocurrencies, commodities, and options, with outputs in JSON or CSV. It also provides chart-ready time series at intraday, daily, weekly, and monthly intervals, and over 50 technical indicators for analysis.

curl "https://www.alphavantage.co/query?function=TIME_SERIES_DAILY&symbol=IBM&apikey=YOUR_API_KEY"
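The JSON output is straightforward to consume from Python. A sketch with requests, mirroring the curl parameters; the helper names and the ALPHAVANTAGE_API_KEY variable are my own assumptions:

```python
import os

def alpha_vantage_params(symbol, function="TIME_SERIES_DAILY"):
    """Query parameters matching the curl example above."""
    return {"function": function, "symbol": symbol,
            "apikey": os.environ.get("ALPHAVANTAGE_API_KEY", "YOUR_API_KEY")}

def daily_series(symbol="IBM"):
    import requests  # lazy import

    resp = requests.get("https://www.alphavantage.co/query",
                        params=alpha_vantage_params(symbol), timeout=30)
    resp.raise_for_status()
    # The daily prices live under the "Time Series (Daily)" key
    return resp.json().get("Time Series (Daily)", {})

# series = daily_series("IBM")  # requires a free API key
```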

 

// 8. Yahoo Finance

Many beginners and practitioners use the yfinance API to access stock quotes, historical time series data, dividends and splits, as well as basic metadata. This allows them to create analysis-ready data frames for quick prototypes and classroom projects.

Yahoo Finance offers free stock quotes, news, portfolio tools, and coverage of international markets, enabling users to explore a wide range of market data at no direct cost.

import yfinance as yf

# Download one year of daily AAPL prices as a pandas DataFrame
print(yf.download("AAPL", period="1y").head())

 

Social and Community Data APIs

 
If you are working on a project to analyze text and community conversations from top social media platforms, then these APIs provide easy access to real social media data.

 

// 9. Reddit

Reddit offers a rich, community-driven data source, and the Python Reddit API Wrapper (PRAW) makes it simple to access the official Reddit API for tasks like fetching posts, comments, and subreddit metadata in Python.

PRAW works by sending requests to Reddit’s API under the hood and is commonly used in teaching and research to collect discussion threads for analysis.

import praw

r = praw.Reddit(
    client_id="ID",
    client_secret="SECRET",
    user_agent="myapp:ds-project:v1 (by u/yourname)"
)

print([s.title for s in r.subreddit("Python").hot(limit=5)])


 

// 10. X

X (previously known as Twitter) provides a developer platform with REST endpoints for user and content retrieval, plus streaming options for real-time data. Access generally requires authentication, adherence to rate limits and policy, and selecting an access tier appropriate for your volume and use case.

curl -H "Authorization: Bearer YOUR_BEARER_TOKEN" \
  "https://api.x.com/2/users/by/username/jack"
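The same lookup from Python is a single authenticated GET. A sketch with requests; the helper names and the X_BEARER_TOKEN variable are my own assumptions:

```python
import os

def x_user_lookup_url(username):
    """Endpoint for the user-by-username lookup shown above."""
    return f"https://api.x.com/2/users/by/username/{username}"

def get_user(username):
    import requests  # lazy import

    resp = requests.get(
        x_user_lookup_url(username),
        headers={"Authorization": f"Bearer {os.environ['X_BEARER_TOKEN']}"},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["data"]

# print(get_user("jack"))  # requires a bearer token from the developer portal
```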

 

Final Thoughts

 
These APIs provide free access to data that is often difficult to obtain, strengthening your data collection and web scraping workflows and letting you build customized datasets.

I highly recommend bookmarking this article to revisit when you need high-quality, real-time data from the web. By leveraging these APIs, you can unlock valuable insights that will aid in your research and analysis.
 
 

Abid Ali Awan (@1abidaliawan) is a certified data scientist who loves building machine learning models. Currently, he is focusing on content creation and writing technical blogs on machine learning and data science technologies. Abid holds a Master's degree in technology management and a bachelor's degree in telecommunication engineering. His vision is to build an AI product using a graph neural network for students struggling with mental illness.