SQL for Data Analysts: Essential Queries for Data Extraction & Transformation


 

Introduction

 
Data analysts work with large amounts of information stored in databases. Before they can create reports or find insights, they must first pull the right data and prepare it for use. This is where SQL (Structured Query Language) comes in. SQL is the standard language for querying relational databases, and analysts use it to retrieve data, clean it up, and organize it into the desired format.

In this article, we’ll look at the most important SQL queries that every data analyst should know.

 

1. Selecting Data with SELECT

 
The SELECT statement is the foundation of SQL. You can choose specific columns or use * to return all available fields.

SELECT name, age, salary FROM employees;

 

This query pulls only the name, age, and salary columns from the employees table.

 

2. Filtering Data with WHERE

 
WHERE narrows rows to those that match your conditions. It supports comparison and logical operators to create precise filters.

SELECT * FROM employees WHERE department = 'Finance';

 

The WHERE clause returns only employees who belong to the Finance department.
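To see comparison and logical operators working together, here is a minimal sketch that runs a filter through Python's built-in sqlite3 module. The sample rows and the salary threshold are invented for illustration; only the employees table and its column names follow the queries in this article.

```python
import sqlite3

# Hypothetical sample data in an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, department TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [
        ("Ana", "Finance", 72000),
        ("Ben", "Finance", 48000),
        ("Cleo", "Sales", 65000),
        ("Dev", "IT", 90000),
    ],
)

# IN matches any value in the list; AND requires both conditions to hold.
query = """
    SELECT name
    FROM employees
    WHERE department IN ('Finance', 'Sales')
      AND salary >= 60000
    ORDER BY name;
"""
where_rows = conn.execute(query).fetchall()
print(where_rows)
```

Ben is in Finance but falls below the threshold, and Dev earns enough but is in IT, so neither row passes both conditions.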

 

3. Sorting Results with ORDER BY

 
The ORDER BY clause sorts query results in ascending (the default) or descending order. It works on numeric, text, or date values and is often used to rank records.

SELECT name, salary FROM employees ORDER BY salary DESC;

 

This query sorts employees by salary in descending order, so the highest-paid employees appear first.
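ORDER BY also accepts more than one column, which helps when the first sort key contains ties. The sketch below, again using Python's sqlite3 module and made-up rows, sorts by salary first and then by name.

```python
import sqlite3

# Hypothetical sample data; Ana and Cleo share the same salary on purpose.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?)",
    [("Cleo", 72000), ("Ben", 48000), ("Dev", 90000), ("Ana", 72000)],
)

# Highest salary first; rows with equal salaries are ordered alphabetically.
query = """
    SELECT name, salary
    FROM employees
    ORDER BY salary DESC, name ASC;
"""
sorted_rows = conn.execute(query).fetchall()
for row in sorted_rows:
    print(row)
```

Without the second sort key, the relative order of Ana and Cleo would be left to the database.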

 

4. Removing Duplicates with DISTINCT

 
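The DISTINCT keyword returns only unique values from a column. When several columns are selected, it applies to each combination of values rather than to one column alone. It is useful when generating clean lists of categories or attributes.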

SELECT DISTINCT department FROM employees;

 

DISTINCT removes duplicate entries, returning each department name only once.

 

5. Limiting Results with LIMIT

 
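The LIMIT clause restricts the number of rows returned by a query. It is often paired with ORDER BY to display top results or sample data from large tables. LIMIT is supported by MySQL, PostgreSQL, and SQLite; SQL Server uses TOP and Oracle uses FETCH FIRST for the same purpose.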

SELECT name, salary 
FROM employees 
ORDER BY salary DESC 
LIMIT 5;

 

This retrieves the five highest-paid employees by combining ORDER BY with LIMIT.

 

6. Aggregating Data with GROUP BY

 
The GROUP BY clause groups rows that share the same values in specified columns. It is used with aggregate functions like SUM(), AVG(), or COUNT() to produce summaries.

SELECT department, AVG(salary) AS avg_salary
FROM employees
GROUP BY department;

 

GROUP BY organizes rows by department, and AVG(salary) calculates the average salary for each group.
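Several aggregate functions can be used in the same grouped query. The following sketch uses Python's sqlite3 module with invented rows to compute a headcount, an average, and a total per department.

```python
import sqlite3

# Hypothetical sample data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, department TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [
        ("Ana", "Finance", 72000),
        ("Ben", "Finance", 48000),
        ("Cleo", "Sales", 65000),
        ("Dev", "IT", 90000),
        ("Eli", "Sales", 55000),
    ],
)

# One output row per department, with three summaries computed per group.
query = """
    SELECT department,
           COUNT(*)    AS headcount,
           AVG(salary) AS avg_salary,
           SUM(salary) AS total_salary
    FROM employees
    GROUP BY department
    ORDER BY department;
"""
summary_rows = conn.execute(query).fetchall()
for row in summary_rows:
    print(row)
```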

 

7. Filtering Groups with HAVING

 
The HAVING clause filters grouped results after aggregation has been applied. It is used when conditions depend on aggregate values, such as totals or averages.

SELECT department, COUNT(*) AS num_employees
FROM employees
GROUP BY department
HAVING COUNT(*) > 10;

 

The query counts employees in each department and then filters to keep only departments with more than 10 employees.
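The difference between WHERE and HAVING is easiest to see when both appear in one query: WHERE filters individual rows before grouping, and HAVING filters the groups afterward. This is a minimal sketch with Python's sqlite3 module; the rows and both thresholds are assumptions chosen to keep the output small.

```python
import sqlite3

# Hypothetical sample data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, department TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [
        ("Ana", "Finance", 72000),
        ("Ben", "Finance", 48000),
        ("Cleo", "Sales", 65000),
        ("Dev", "IT", 90000),
        ("Eli", "Sales", 55000),
    ],
)

# WHERE drops rows under 50000 first; HAVING then keeps groups with 2+ rows left.
query = """
    SELECT department, COUNT(*) AS num_employees
    FROM employees
    WHERE salary >= 50000
    GROUP BY department
    HAVING COUNT(*) >= 2
    ORDER BY department;
"""
having_rows = conn.execute(query).fetchall()
print(having_rows)
```

Finance has two employees, but Ben is removed by WHERE before the count, so only Sales satisfies the HAVING condition.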

 

8. Combining Tables with JOIN

 
The JOIN clause combines rows from two or more tables based on a related column. It helps retrieve connected data, such as employees with their departments.

SELECT e.name, d.name AS department
FROM employees e
JOIN departments d ON e.dept_id = d.id;

 

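Here, JOIN combines employees with their matching department names. A plain JOIN is an inner join, so employees without a matching department are left out of the result.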
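When unmatched rows should stay in the result, LEFT JOIN is the usual alternative. The sketch below runs both join types side by side with Python's sqlite3 module. The rows are invented, and Dev is deliberately given no department.

```python
import sqlite3

# Hypothetical sample data; Dev has no dept_id.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE departments (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE employees (name TEXT, dept_id INTEGER)")
conn.executemany("INSERT INTO departments VALUES (?, ?)", [(1, "Finance"), (2, "Sales")])
conn.executemany(
    "INSERT INTO employees VALUES (?, ?)",
    [("Ana", 1), ("Cleo", 2), ("Dev", None)],
)

# Inner join: only employees with a matching department row.
inner_rows = conn.execute("""
    SELECT e.name, d.name AS department
    FROM employees e
    JOIN departments d ON e.dept_id = d.id
    ORDER BY e.name;
""").fetchall()

# Left join: every employee, with NULL where no department matches.
left_rows = conn.execute("""
    SELECT e.name, d.name AS department
    FROM employees e
    LEFT JOIN departments d ON e.dept_id = d.id
    ORDER BY e.name;
""").fetchall()

print(inner_rows)
print(left_rows)
```

In the left join, Dev appears with None (SQL NULL) as the department, which is often what an analyst wants when checking for incomplete records.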

 

9. Combining Results with UNION

 
UNION combines the results of two or more queries into a single dataset. Each query must return the same number of columns with compatible data types. UNION removes duplicate rows automatically; UNION ALL keeps them.

SELECT name FROM employees UNION SELECT name FROM customers;

 

This query combines names from both the employees and customers tables into a single list.
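A short sketch with Python's sqlite3 module shows how UNION and UNION ALL differ when a name appears in both tables. The names are invented for illustration.

```python
import sqlite3

# Hypothetical sample data; "Ben" is both an employee and a customer.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT)")
conn.execute("CREATE TABLE customers (name TEXT)")
conn.executemany("INSERT INTO employees VALUES (?)", [("Ana",), ("Ben",)])
conn.executemany("INSERT INTO customers VALUES (?)", [("Ben",), ("Cara",)])

# UNION removes the duplicate "Ben".
union_rows = conn.execute("""
    SELECT name FROM employees
    UNION
    SELECT name FROM customers
    ORDER BY name;
""").fetchall()

# UNION ALL keeps every row from both queries.
union_all_rows = conn.execute("""
    SELECT name FROM employees
    UNION ALL
    SELECT name FROM customers
    ORDER BY name;
""").fetchall()

print(union_rows)
print(union_all_rows)
```

UNION ALL skips the deduplication step, so it is typically the cheaper choice when duplicates are impossible or acceptable.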

 

10. String Functions

 
String functions in SQL are used to manipulate and transform text data. They help with tasks like combining names, changing case, trimming spaces, or extracting parts of a string.

SELECT CONCAT(first_name, ' ', last_name) AS full_name, LENGTH(first_name) AS name_length FROM employees;

 

This query creates a full name by combining first and last names and calculates the length of the first name. Function names vary between databases: SQL Server, for instance, uses LEN() rather than LENGTH().
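The other tasks mentioned in this section (changing case, trimming spaces, extracting part of a string) can be sketched the same way. This example runs on SQLite through Python's sqlite3 module, so it uses the standard || operator for concatenation in place of CONCAT(). The sample row is invented, with stray spaces added around the first name.

```python
import sqlite3

# Hypothetical sample data; the first name has leading and trailing spaces.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (first_name TEXT, last_name TEXT)")
conn.execute("INSERT INTO employees VALUES (?, ?)", ("  Ana ", "Lopez"))

query = """
    SELECT TRIM(first_name) || ' ' || last_name AS full_name,   -- trim, then combine
           UPPER(last_name)                     AS last_upper,  -- change case
           LENGTH(TRIM(first_name))             AS name_length, -- length after trimming
           SUBSTR(last_name, 1, 3)              AS last_prefix  -- first three characters
    FROM employees;
"""
string_row = conn.execute(query).fetchone()
print(string_row)
```

Trimming before measuring matters here: the untrimmed first name is six characters long, while the cleaned value is three.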

 

11. Date and Time Functions

 
Date and time functions in SQL let you work with temporal data for analysis and reporting. They can calculate differences, extract components like year or month, and adjust dates by adding or subtracting intervals. For example, DATEDIFF() with CURRENT_DATE can measure tenure.

SELECT name, hire_date, DATEDIFF(CURRENT_DATE, hire_date) AS days_at_company FROM employees;

 

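It calculates how many days each employee has been with the company by subtracting their hire date from today. The two-argument DATEDIFF() shown here is MySQL syntax; SQL Server's DATEDIFF() takes a date part as its first argument, and PostgreSQL subtracts dates directly.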
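SQLite has no DATEDIFF() function, so the sketch below, run through Python's sqlite3 module, expresses the same tenure calculation with julianday() and pulls out the hire year with strftime(). The rows are invented, and a fixed reference date (the as_of variable, an assumption of this example) stands in for CURRENT_DATE so the result is reproducible.

```python
import sqlite3

# Hypothetical sample data with ISO-formatted dates stored as text.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, hire_date TEXT)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?)",
    [("Ana", "2024-01-01"), ("Ben", "2023-03-01")],
)

# Fixed reference date used instead of CURRENT_DATE (assumption for repeatability).
as_of = "2024-03-01"

# julianday() converts a date to a day number, so subtracting gives elapsed days.
query = """
    SELECT name,
           CAST(julianday(?) - julianday(hire_date) AS INTEGER) AS days_at_company,
           strftime('%Y', hire_date) AS hire_year
    FROM employees
    ORDER BY name;
"""
tenure_rows = conn.execute(query, (as_of,)).fetchall()
for row in tenure_rows:
    print(row)
```

Both spans cross 29 February 2024, which is why Ana's two months come to 60 days and Ben's year comes to 366.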

 

12. Creating New Columns with CASE

 
The CASE expression creates new columns with conditional logic, similar to if-else statements. It lets you categorize or transform data dynamically within your queries.

SELECT name,
       CASE 
           WHEN age < 30 THEN 'Junior'
           WHEN age BETWEEN 30 AND 50 THEN 'Mid-level'
           ELSE 'Senior'
       END AS experience_level
FROM employees;

 

The CASE expression creates a new column called experience_level based on age ranges. Conditions are checked from top to bottom, and the first one that matches determines the result.
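One behavior worth checking is what happens at the boundaries and with missing ages. The sketch below runs the same CASE logic with Python's sqlite3 module on invented rows, including one employee whose age is NULL.

```python
import sqlite3

# Hypothetical sample data; Eli's age is missing.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?)",
    [("Ana", 25), ("Ben", 30), ("Cleo", 50), ("Dev", 51), ("Eli", None)],
)

# Comparisons against NULL are never true, so a NULL age falls through to ELSE.
query = """
    SELECT name,
           CASE
               WHEN age < 30 THEN 'Junior'
               WHEN age BETWEEN 30 AND 50 THEN 'Mid-level'
               ELSE 'Senior'
           END AS experience_level
    FROM employees
    ORDER BY name;
"""
case_rows = conn.execute(query).fetchall()
for row in case_rows:
    print(row)
```

BETWEEN includes both endpoints, so ages 30 and 50 are both 'Mid-level'. Eli is labeled 'Senior' only because no earlier condition matched; adding a WHEN age IS NULL branch would be one way to handle that case explicitly.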

 

13. Handling Missing Values with COALESCE

 
COALESCE handles missing values by returning the first non-null value from a list. It’s commonly used to replace NULL fields with a default value, such as “N/A.”

SELECT name, COALESCE(phone, 'N/A') AS contact_number FROM customers;

 

Here, COALESCE replaces missing phone numbers with “N/A.”
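Because COALESCE accepts any number of arguments, it can also express a chain of fallbacks. In this sketch, run with Python's sqlite3 module, the mobile and landline columns are hypothetical stand-ins for the single phone column used above, and the rows are invented.

```python
import sqlite3

# Hypothetical sample data with two optional contact columns.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (name TEXT, mobile TEXT, landline TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [
        ("Ana", "555-0100", None),
        ("Ben", None, "555-0199"),
        ("Cara", None, None),
    ],
)

# Arguments are checked left to right; the first non-NULL value is returned.
query = """
    SELECT name, COALESCE(mobile, landline, 'N/A') AS contact_number
    FROM customers
    ORDER BY name;
"""
contact_rows = conn.execute(query).fetchall()
for row in contact_rows:
    print(row)
```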

 

14. Subqueries

 
Subqueries are queries nested inside another query to provide intermediate results. They are used in WHERE, FROM, or SELECT clauses to filter, compare, or build datasets dynamically.

SELECT name, salary FROM employees WHERE salary > (SELECT AVG(salary) FROM employees);

 

This query returns the employees whose salary is above the company average, which the nested subquery calculates.
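A subquery can also refer to the row being evaluated in the outer query, which is known as a correlated subquery. The following sketch uses Python's sqlite3 module and invented rows to find employees who earn more than the average of their own department; the peers alias is a name chosen for this example.

```python
import sqlite3

# Hypothetical sample data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, department TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [
        ("Ana", "Finance", 72000),
        ("Ben", "Finance", 48000),
        ("Cleo", "Sales", 65000),
        ("Dev", "IT", 90000),
        ("Eli", "Sales", 55000),
    ],
)

# The inner query is evaluated per outer row, using that row's department.
query = """
    SELECT e.name, e.department
    FROM employees e
    WHERE e.salary > (
        SELECT AVG(peers.salary)
        FROM employees AS peers
        WHERE peers.department = e.department
    )
    ORDER BY e.name;
"""
above_avg_rows = conn.execute(query).fetchall()
print(above_avg_rows)
```

Dev is the only person in IT, so Dev's salary equals the department average and does not pass the strict greater-than comparison.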

 

15. Window Functions

 
Window functions perform calculations across a set of rows while still returning individual row details. They are commonly used for ranking, running totals, and comparing values between rows.

SELECT name, salary, RANK() OVER (ORDER BY salary DESC) AS salary_rank FROM employees;

 

The RANK() function assigns each employee a ranking based on salary, without grouping the rows.
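How ties are handled is a common source of confusion with ranking functions. This sketch, run with Python's sqlite3 module on invented rows, places RANK() next to DENSE_RANK() so the difference is visible. It assumes the bundled SQLite is version 3.25 or newer, the first release with window function support.

```python
import sqlite3

# Hypothetical sample data; Ana and Cleo are tied on salary.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?)",
    [("Ana", 72000), ("Ben", 48000), ("Cleo", 72000), ("Dev", 90000)],
)

# RANK() leaves a gap after a tie; DENSE_RANK() does not.
query = """
    SELECT name,
           salary,
           RANK()       OVER (ORDER BY salary DESC) AS salary_rank,
           DENSE_RANK() OVER (ORDER BY salary DESC) AS dense_salary_rank
    FROM employees
    ORDER BY salary DESC, name;
"""
rank_rows = conn.execute(query).fetchall()
for row in rank_rows:
    print(row)
```

Ana and Cleo share rank 2 under both functions, but Ben is ranked 4 by RANK() and 3 by DENSE_RANK(). Every employee still appears as a separate row, unlike with GROUP BY.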

 

Conclusion

 
SQL is one of the most valuable skills for a data analyst because it is the foundation for extracting, transforming, and interpreting data. From filtering and aggregating to joining and reshaping datasets, the queries covered here let analysts turn raw tables into results that support decision-making. Becoming fluent in them makes everyday analysis faster and less error-prone.
 
 

Jayita Gulati is a machine learning enthusiast and technical writer driven by her passion for building machine learning models. She holds a Master’s degree in Computer Science from the University of Liverpool.