In this article, you will find the most important SQL commands for basic work with relational databases. The Structured Query Language (SQL) is the most commonly used language when working with relational databases. The language can be used for much more than simple queries, despite its name. It can also be used to perform all operations necessary to create and maintain databases.
What are the advantages of Structured Query Language?
The structured query language provides many functions to read, modify or delete data. In addition, it is preferred by many analysts over other languages for the following reasons:
- It is semantically very easy to read and understand. The commands can be understood to a large extent even by beginners.
- The language can be used directly within the database environment. For basic work with information, the data does not have to be transferred from the database to another tool first. Simple calculations and queries are possible directly in the database.
- Compared to other spreadsheet tools, such as Excel, data analysis with Structured Query Language can be easily replicated and copied because everyone has access to the same data in the database. Thus, the same query always leads to the same result.
SQL provides an alternative to the kind of summaries and calculations that most perform in Excel spreadsheets, such as totals, averaging, or finding the maximum within a column. These calculations can also be performed over multiple data sets simultaneously.
What sample data do we use?
In order to be able to test the most common SQL commands live, we use a Credit Record dataset from Kaggle, which lists credit data for various, anonymous people such as their income, education, or occupation.
We load this table as Pandas DataFrame into our notebook and can then test common SQL commands on it:
How to query data?
The complete table can be queried using “SELECT”. Since no specific column is to be queried, we simply use the star “*” so that all available columns are output. With the help of “LIMIT 10”, we make sure that only the top 10 rows are returned by the SQL command, otherwise, it becomes too confusing:
Since there is a large number of columns, we limit ourselves to the “NAME_INCOME_TYPE” and “CNT_CHILDREN” columns. We can query these by specifying them explicitly:
As we can see, some income types appear twice in these lines. This can happen if multiple credit suspects can have the same income type.
To get only the unique entries for the income type, we use the additional parameter “DISTINCT”:
How to filter data?
The data can be filtered using the “WHERE” parameter. Depending on the data type, there are different queries for this:
- Numerical values can be compared using the greater than or less than sign, e.g. “AMT_INCOME_TOTAL < 427500”. These can be supplemented with an equal sign, e.g. “AMT_INCOME_TOTAL < 427500”.
- For texts or strings, the comparisons “=” or “<>” are used to check whether the texts match (“=”) or differ (“<>”).
For all credit applicants with an income of more than $427,500, we get the following SQL command:
To be able to use multiple filters in one SQL command, we can connect them either with an “AND” or an “OR”. This way we get all applicants with a high income and that are female:
How can the results be sorted?
Each output can be sorted according to a column using “ORDER BY”. Both numbers and strings can be sorted, which are then sorted alphabetically:
By default, the outputs are always sorted in ascending order. To change this, you must additionally specify “DESC” for descending:
How can records be counted?
Using the SQL command “count” you can count the values in columns or in the whole database:
In this case, we have 438,557 records in the database. To count the values within a column, we use the name of the column instead of the star “*”. Additionally, we can use the parameter “DISTINCT” to count only the unique values in the column:
Thus, there are 866 different income levels in the data set.
This is what you should take with you
- Structured Query Language is the most widely used language when working with relational databases.
- With the help of SQL commands, extensive data queries can be designed and individualized.
Thanks to Deepnote for sponsoring this article! Deepnote offers me the possibility to embed Python code easily and quickly on this website and also to host the related notebooks in the cloud.
What is the Snowflake Schema?
Explanation of the Snowflake scheme compared to the Star scheme.
What is Data Augmentation?
Use and methods of data augmentation.
What is Tableau?
Learn how to use Tableau for data visualization and analysis in our comprehensive guide.
What is the Normalization of databases?
Learn about database normalization and how it can improve your database. Maximize efficiency and minimize redundancy with normalization.
What are the Primary Key and Foreign Key?
Learn about primary and foreign keys in database management. Understand their differences, importance, and usage. Read more in this article!
What is Apache Parquet?
Learn how to optimize Big Data storage with Apache Parquet. Explore its features, benefits, and implementation in this comprehensive guide.
What are CSV files?
Learn all about CSV files, including how to they are structured, best practices and comparison to Apache Parquet.
What is the CAP Theorem?
Understanding CAP Theorem: Consistency, Availability, and Partition Tolerance in Distributed Systems. Learn the trade-offs in system design.
What is Batch Processing?
Learn about batch processing in data science. Discover how batch processing works, its advantages, and common applications.
What is the Modern Data Stack?
Discover the modern data stack: A comprehensive guide to building scalable and efficient data pipelines. Learn more now!
Other Articles on the Topic of SQL Commands
- The Olympia dataset used can be found here.