10 Command-Line Tools Every Data Scientist Should Know


 

Introduction

 
Although in modern data science you will mainly find Jupyter notebooks, Pandas, and graphical dashboards, they don’t always give you the level of control you might need. On the other hand, command-line tools may not be as intuitive as you wish, but they are powerful, lightweight, and much faster at executing the specific jobs they are designed for.

For this article, I’ve tried to strike a balance between utility, maturity, and power. You’ll find some classics that are nearly unavoidable, along with more modern additions that fill gaps or optimize performance. You can even call this a 2025 version of a must-have CLI tools list. For those who aren’t familiar with CLI tools but want to learn, I’ve included a bonus section with resources in the conclusion, so scroll all the way down before you start incorporating these tools into your workflow.

 

1. curl

 
curl is my go-to for making HTTP requests like GET, POST, or PUT; downloading files; and sending/receiving data over protocols such as HTTP or FTP. It’s ideal for retrieving data from APIs or downloading datasets, and you can easily integrate it with data-ingestion pipelines to pull JSON, CSV, or other payloads. The best thing about curl is that it’s pre-installed on most Unix systems, so you can start using it right away. However, its syntax (especially around headers, body payloads, and authentication) can be verbose and error-prone. When you are interacting with more complex APIs, you may prefer an easier-to-use wrapper or Python library, but knowing curl is still an essential plus for quick testing and debugging.
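Here’s a minimal sketch of typical curl calls; the endpoint and file URLs (api.example.com, example.com) are placeholders, not real services:

```bash
# Fetch JSON from an API (placeholder endpoint)
curl -s -H "Accept: application/json" https://api.example.com/v1/records

# Download a dataset to a local file, following redirects
curl -L -o dataset.csv https://example.com/data/dataset.csv

# POST a JSON payload
curl -s -X POST -H "Content-Type: application/json" \
     -d '{"query": "sales", "limit": 100}' \
     https://api.example.com/v1/search
```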

 

2. jq

 
jq is a lightweight JSON processor that lets you query, filter, transform, and pretty-print JSON data. With JSON being a dominant format for APIs, logs, and data interchange, jq is indispensable for extracting and reshaping JSON in pipelines. It acts like “Pandas for JSON in the shell.” The biggest advantage is that it provides a concise language for dealing with complex JSON, but learning its syntax can take time, and extremely large JSON files may require additional care with memory management.
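A quick sketch of common jq patterns, assuming a hypothetical response.json whose top level contains a results array with id, name, and score fields:

```bash
# Pretty-print a JSON file
jq '.' response.json

# Extract one field from every element of the array
jq '.results[].name' response.json

# Filter and reshape: keep records with score > 0.8, emit compact objects
jq '[.results[] | select(.score > 0.8) | {id, score}]' response.json

# Flatten the array into CSV rows
jq -r '.results[] | [.id, .name, .score] | @csv' response.json
```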

 

3. csvkit

 
csvkit is a suite of CSV-centric command-line utilities for transforming, filtering, aggregating, joining, and exploring CSV files. You can select and reorder columns, subset rows, combine multiple files, convert from one format to another, and even run SQL-like queries against CSV data. csvkit understands CSV quoting semantics and headers, making it safer than generic text-processing utilities for this format. Being Python-based means performance can lag on very large datasets, and some complex queries may be easier in Pandas or SQL. If you prefer speed and efficient memory usage, consider the csvtk toolkit.
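A rough sketch of typical csvkit one-liners; sales.csv, report.xlsx, and the column names are hypothetical:

```bash
# List column names
csvcut -n sales.csv

# Select two columns and keep only rows where region is "EU"
csvcut -c region,revenue sales.csv | csvgrep -c region -m "EU"

# Quick summary statistics for every column
csvstat sales.csv

# Run SQL against a CSV file (table name defaults to the file name)
csvsql --query "SELECT region, SUM(revenue) AS total FROM sales GROUP BY region" sales.csv

# Convert an Excel sheet to CSV
in2csv report.xlsx > report.csv
```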

 

4. awk / sed

 
Classic Unix tools like awk and sed remain irreplaceable for text manipulation. awk is powerful for pattern scanning, field-based transformations, and quick aggregations, while sed excels at text substitutions, deletions, and transformations. These tools are fast and lightweight, making them perfect for quick pipeline work. However, their syntax can be non-intuitive. As logic grows, readability suffers, and you may migrate to a scripting language. Also, for nested or hierarchical data (e.g., nested JSON), these tools have limited expressiveness.
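A few illustrative one-liners; the file names are placeholders, and the \t escape in sed assumes GNU sed:

```bash
# awk: sum the third whitespace-delimited column
awk '{ total += $3 } END { print total }' metrics.txt

# awk: print rows of a comma-delimited file where column 2 exceeds a threshold
awk -F',' '$2 > 100 { print $1, $2 }' data.csv

# sed: replace tabs with commas and write to a new file
sed 's/\t/,/g' data.tsv > data.csv

# sed: drop the header line
sed '1d' data.csv > data_noheader.csv
```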

 

5. parallel

 
GNU parallel speeds up workflows by running multiple processes in parallel. Many data tasks are “mappable” across chunks of data. Let’s say you have to execute the same transformation on hundreds of files—parallel can spread work across CPU cores, speed up processing, and manage job control. You must, however, be mindful of I/O bottlenecks and system load, and quoting/escaping can be tricky in complex pipelines. For cluster-scale or distributed workloads, consider resource-aware schedulers (e.g., Spark, Dask, Kubernetes).
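A minimal sketch of common GNU parallel patterns; clean.py and urls.txt are hypothetical stand-ins for your own script and URL list:

```bash
# Compress many files, one job per CPU core
parallel gzip ::: data/*.csv

# Apply a script to each file, capping concurrency at 4 jobs
parallel -j 4 python clean.py {} ::: raw/*.json

# Download a list of URLs with up to 8 concurrent curl processes
cat urls.txt | parallel -j 8 curl -s -O {}
```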

 

6. ripgrep (rg)

 
ripgrep (rg) is a fast recursive search tool designed for speed and efficiency. It respects .gitignore by default and ignores hidden or binary files, making it significantly faster than traditional grep. It’s perfect for quick searches across codebases, log directories, or config files. Because it defaults to ignoring certain paths, you may need to adjust flags to search everything, and it isn’t always available by default on every platform.
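Some typical ripgrep invocations; the paths and patterns below are placeholders:

```bash
# Recursive search under a directory
rg "read_csv" src/

# Case-insensitive search restricted to Python files
rg -i --type py "todo"

# Include hidden files and ignore .gitignore rules
rg --hidden --no-ignore "API_KEY"

# Show two lines of context around each match
rg -C 2 "Traceback" logs/
```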

 

7. datamash

 
datamash provides numeric, textual, and statistical operations (sum, mean, median, group-by, etc.) directly in the shell via stdin or files. It’s lightweight and useful for quick aggregations without launching a heavier tool like Python or R, which makes it ideal for shell-based ETL or exploratory analysis. But it’s not designed for very large datasets or complex analytics, where specialized tools perform better. Also, grouping very high cardinalities may require substantial memory.
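A short sketch of datamash usage with hypothetical tab- and comma-separated files; note that group-by expects input sorted on the grouping column unless you pass -s:

```bash
# Mean and median of column 2 in a tab-separated file
datamash mean 2 median 2 < measurements.tsv

# Group by column 1 and sum column 3 (input pre-sorted on the group column)
sort -k1,1 sales.tsv | datamash -g 1 sum 3

# Comma-separated input with a header row; -s sorts before grouping
datamash -s -t, --header-in -g 1 mean 2 < sales.csv
```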

 

8. htop

 
htop is an interactive system monitor and process viewer that provides live insights into CPU, memory, and I/O usage per process. When running heavy pipelines or model training, htop is extremely useful for tracking resource consumption and identifying bottlenecks. It’s more user-friendly than traditional top, but being interactive means it doesn’t fit well into automated scripts. It may also be missing on minimal server setups, and it doesn’t replace specialized performance tools (profilers, metrics dashboards).
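htop is mostly driven interactively, but a few launch flags are handy; the PID below is a placeholder:

```bash
# Show only your own processes
htop -u "$USER"

# Watch a specific training process by PID
htop -p 12345

# Slow the refresh rate to every 3 seconds (delay is in tenths of a second)
htop -d 30
```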

 

9. git

 
git is a distributed version control system essential for tracking changes to code, scripts, and small data assets. For reproducibility, collaboration, branching experiments, and rollback, git is the standard. It integrates with deployment pipelines, CI/CD tools, and notebooks. Its drawback is that it’s not meant for versioning large binary data, for which Git LFS, DVC, or specialized systems are better suited. The branching and merging workflow also comes with a learning curve.
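A minimal sketch of the everyday git loop; the file names and branch names (main, feature-scaling) are placeholders:

```bash
# Start tracking a project and record the first snapshot
git init
git add analysis.py requirements.txt
git commit -m "Initial pipeline"

# Branch off to try a new feature-engineering idea
git checkout -b feature-scaling
git add analysis.py
git commit -m "Experiment with robust scaling"

# Merge the experiment back once it works
git checkout main
git merge feature-scaling
```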

 

10. tmux / screen

 
Terminal multiplexers like tmux and screen let you run multiple terminal sessions in a single window, detach and reattach sessions, and resume work after an SSH disconnect. They’re essential if you need to run long experiments or pipelines remotely. While tmux is recommended due to its active development and flexibility, its config and keybindings can be tricky for newcomers, and minimal environments may not have it installed by default.
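A typical tmux workflow for a long-running remote job might look like this; train.py and the session name are placeholders:

```bash
# Start a named session and launch the job inside it
tmux new -s training
python train.py    # your long-running command

# Detach with Ctrl-b d, then log out; the job keeps running.
# Later, reattach from a new SSH connection:
tmux attach -t training

# List all active sessions
tmux ls
```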

 

Wrapping Up

 
If you’re getting started, I’d recommend mastering the “core four”: curl, jq, awk/sed, and git. These are used everywhere. Over time, you’ll discover domain-specific CLIs like SQL clients, the DuckDB CLI, or Datasette to slot into your workflow. For further reading, check out the following resources:

  1. Data Science at the Command Line by Jeroen Janssens
  2. The Art of Command Line on GitHub
  3. Mark Pearl’s Bash Cheatsheet
  4. Communities like the unix & command-line subreddits often surface useful tricks and new tools that will expand your toolbox over time.

 
 

Kanwal Mehreen is a machine learning engineer and a technical writer with a profound passion for data science and the intersection of AI with medicine. She co-authored the ebook “Maximizing Productivity with ChatGPT”. As a Google Generation Scholar 2022 for APAC, she champions diversity and academic excellence. She’s also recognized as a Teradata Diversity in Tech Scholar, Mitacs Globalink Research Scholar, and Harvard WeCode Scholar. Kanwal is an ardent advocate for change, having founded FEMCodes to empower women in STEM fields.