Referential integrity is a term from database theory. It describes the rules a relational database follows to keep its data consistent and correct, and it focuses primarily on the primary and foreign keys of the tables.
What are primary and foreign keys?
Each table in a database ideally has one column, or a combination of several columns, whose value is unique for every record. This column or combination of columns is called the primary key of the table. It ensures that each record, i.e. each row in the table, can be uniquely identified, which makes it possible to access individual records with an SQL query.
If no single column has unique values, several columns together can establish this uniqueness. A primary key is therefore either a simple key (one column) or a composite key (several columns).
A foreign key is an attribute, or a combination of several attributes, that references the primary key of another table. It is used to create links between tables. Depending on the primary key in the other table, the foreign key is either a single column or a composite of several columns.
It is important to note that a foreign key does not have to be unique in its own table; the same value can occur in many rows. With the help of such links, large tables can be split into several smaller tables, which reduces redundant storage and improves clarity.
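To make the two key types concrete, here is a minimal sketch using Python's built-in sqlite3 module. The customers and orders tables and their columns are hypothetical names chosen for illustration, and SQLite stands in for whichever database system is actually in use.

```python
import sqlite3

# Hypothetical example schema: table and column names are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces foreign keys when enabled
conn.executescript("""
    CREATE TABLE customers (
        customer_id INTEGER PRIMARY KEY,   -- primary key: unique for every row
        name        TEXT NOT NULL
    );
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL
            REFERENCES customers(customer_id),  -- foreign key to customers
        amount      REAL NOT NULL
    );
    INSERT INTO customers VALUES (1, 'Alice'), (2, 'Bob');
    INSERT INTO orders VALUES (10, 1, 25.0), (11, 1, 40.0), (12, 2, 15.0);
""")

# The foreign key value 1 appears twice in orders: it need not be unique there.
orders_per_customer = conn.execute("""
    SELECT c.name, COUNT(*)
    FROM customers c
    JOIN orders o ON o.customer_id = c.customer_id
    GROUP BY c.customer_id
    ORDER BY c.customer_id
""").fetchall()
print(orders_per_customer)  # [('Alice', 2), ('Bob', 1)]
```

The customer name is stored once and each order only carries the key, which is the splitting of large tables described above.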
What is referential integrity?
Referential integrity (RI) ensures that relationships between tables do not lead to inconsistencies in the database. Two rules must be followed to guarantee it:
- First RI rule (insertion): When a new record with a foreign key is inserted, a record with this key must exist in the referenced table. If it does not, the insertion must be rejected.
- Second RI rule (deletion): A record must not be deleted while it is still referenced by a foreign key in another table. The database must reject such operations.
If both rules are enforced, changes to the database cannot leave a foreign key pointing to a record that does not exist.
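The two rules can be tried out with a small sketch in Python's sqlite3 module. The schema and the helper is_rejected are hypothetical and only serve the illustration; other database systems report violations with their own error types.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # required for enforcement in SQLite
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(customer_id)
    );
    INSERT INTO customers VALUES (1, 'Alice');
    INSERT INTO orders VALUES (10, 1);
""")

def is_rejected(sql):
    """Return True if the database refuses the statement (hypothetical helper)."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(sql)
        return False
    except sqlite3.IntegrityError:
        return True

# First rule: customer 99 does not exist, so the order cannot be inserted.
insert_rejected = is_rejected("INSERT INTO orders VALUES (11, 99)")
# Second rule: customer 1 is still referenced by order 10, so it cannot be deleted.
delete_rejected = is_rejected("DELETE FROM customers WHERE customer_id = 1")
print(insert_rejected, delete_rejected)  # True True
```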
How to create referential integrity in a database?
Modern database systems, such as MySQL, reject operations that would violate referential integrity, provided the foreign keys have been declared as constraints. In case of a violation, the statement is not executed and a corresponding error message is issued. For example, in a database with a customer table and an order table, deleting a customer with the SQL statement “DELETE” would fail as long as that customer is still referenced in the order table.
Since large databases quickly become confusing and you don’t want to delete all dependent entries manually, SQL offers an option on the foreign key definition that deletes records while preserving integrity. With “ON DELETE CASCADE”, deleting the customer automatically deletes all of that customer’s orders in the order table as well, so that integrity is preserved.
In practice, of course, we would not delete customers this way, because it would distort the sales figures. Instead, we would work with a status that marks each customer as either active or inactive.
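The effect of ON DELETE CASCADE can be sketched as follows, again with Python's sqlite3 module and a hypothetical customers/orders schema. The syntax of the clause is standard SQL, but details of enforcement differ between database systems.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL
            REFERENCES customers(customer_id) ON DELETE CASCADE
    );
    INSERT INTO customers VALUES (1, 'Alice'), (2, 'Bob');
    INSERT INTO orders VALUES (10, 1), (11, 1), (12, 2);
""")

# Deleting customer 1 also removes orders 10 and 11 automatically.
with conn:
    conn.execute("DELETE FROM customers WHERE customer_id = 1")

remaining_orders = conn.execute(
    "SELECT order_id, customer_id FROM orders ORDER BY order_id"
).fetchall()
print(remaining_orders)  # [(12, 2)]
```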
What are the integrity rules for databases?
In addition to referential integrity, there are other integrity rules that a database must satisfy to be in a consistent, or truthful, state. These include:
- Range integrity (also called domain integrity): The values of an attribute, i.e. a column, must lie within a predefined range. For example, if an age may only be between 0 and 150 years, an attempt to store an age of 200 years violates range integrity.
- Entity integrity: Each record, i.e. each entity, must have a primary key value that is unique and not NULL.
- Referential integrity: Every foreign key must refer to an existing data record. In addition, a record can only be deleted if the records that reference its primary key as a foreign key are also deleted or their references are otherwise resolved.
- Logical consistency: Depending on the application, users can define their own integrity requirements, which must also be met.
The database is only in a consistent state if all these conditions are met.
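Range and entity integrity can be declared directly in the table definition. The following is a minimal sketch with Python's sqlite3 module; the persons table and the helper violates are hypothetical, and the age limits are the ones from the example above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE persons (
        person_id INTEGER PRIMARY KEY,                    -- entity integrity
        name      TEXT NOT NULL,
        age       INTEGER CHECK (age BETWEEN 0 AND 150)   -- range integrity
    );
    INSERT INTO persons VALUES (1, 'Alice', 34);
""")

def violates(sql):
    """Return True if the statement breaks a declared constraint (hypothetical helper)."""
    try:
        with conn:
            conn.execute(sql)
        return False
    except sqlite3.IntegrityError:
        return True

age_rejected = violates("INSERT INTO persons VALUES (2, 'Bob', 200)")
duplicate_key_rejected = violates("INSERT INTO persons VALUES (1, 'Carol', 50)")
print(age_rejected, duplicate_key_rejected)  # True True
```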
How do you maintain the referential integrity of a database sustainably?
Referential integrity is not only a matter of the initial setup of a database; it must also be ensured during operation, for example when data is updated or deleted. It is therefore important to decide in advance how updates and deletions of related records should be handled.
Updates
When values or rows that are part of a relationship are updated, the rules of referential integrity ensure that the existing relationships remain valid. The following options can be set in the ON UPDATE clause of a foreign key:
- Cascade updates: With CASCADE, an update to a primary key is automatically propagated to the associated foreign keys in all related tables. Use this option when a change in one table should be applied throughout the database.
- Restriction of updates: The opposite option is NO ACTION, which rejects an update to a primary key as long as records in other tables still reference it. This ensures data consistency and prevents unintentional changes to key values.
- Set null or set default: If the foreign keys have been defined with SET NULL or SET DEFAULT, an update to the primary key in the referenced table sets the foreign key values to NULL or to the defined default value.
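The difference between CASCADE and NO ACTION on updates can be sketched with Python's sqlite3 module. The customers, orders and invoices tables are hypothetical; orders uses ON UPDATE CASCADE and invoices uses ON UPDATE NO ACTION purely to contrast the two options.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL
            REFERENCES customers(customer_id) ON UPDATE CASCADE
    );
    CREATE TABLE invoices (
        invoice_id  INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL
            REFERENCES customers(customer_id) ON UPDATE NO ACTION
    );
    INSERT INTO customers VALUES (1, 'Alice'), (2, 'Bob');
    INSERT INTO orders VALUES (10, 1);
    INSERT INTO invoices VALUES (500, 2);
""")

# CASCADE: changing Alice's key is propagated to her order.
with conn:
    conn.execute("UPDATE customers SET customer_id = 100 WHERE customer_id = 1")
order_customer = conn.execute(
    "SELECT customer_id FROM orders WHERE order_id = 10"
).fetchone()[0]

# NO ACTION: changing Bob's key is refused because an invoice references it.
try:
    with conn:
        conn.execute("UPDATE customers SET customer_id = 200 WHERE customer_id = 2")
    update_rejected = False
except sqlite3.IntegrityError:
    update_rejected = True

print(order_customer, update_rejected)  # 100 True
```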
Deletions
If data records are deleted, the records that reference them can quickly become orphaned and relationships can lose their validity. The following options in the ON DELETE clause of a foreign key prevent these problems:
- Cascade deletions: Just as with updates, CASCADE causes the referencing data records in subordinate tables to be deleted as well. It should be used with caution to prevent data loss due to unintentional deletion.
- Restrict deletions: The counterpart to NO ACTION for updates is RESTRICT, which rejects the deletion of a data record as long as records in related tables still reference it. In this way, data integrity is maintained and no accidental loss of data occurs.
- Set null or set default: If the foreign keys have been set up with SET NULL or SET DEFAULT, a deletion sets the foreign keys in the related tables to NULL or to the default value.
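As an illustration of the SET NULL option on deletions, here is a small sketch with Python's sqlite3 module. The employees and tickets tables are hypothetical; note that the foreign key column must allow NULL for this option to work.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE employees (employee_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE tickets (
        ticket_id   INTEGER PRIMARY KEY,
        assignee_id INTEGER                -- nullable on purpose
            REFERENCES employees(employee_id) ON DELETE SET NULL
    );
    INSERT INTO employees VALUES (1, 'Alice'), (2, 'Bob');
    INSERT INTO tickets VALUES (100, 1), (101, 2);
""")

# Deleting employee 1 keeps ticket 100 but clears its assignee.
with conn:
    conn.execute("DELETE FROM employees WHERE employee_id = 1")

tickets = conn.execute(
    "SELECT ticket_id, assignee_id FROM tickets ORDER BY ticket_id"
).fetchall()
print(tickets)  # [(100, None), (101, 2)]
```

Unlike CASCADE, no ticket is lost here; the record survives without a reference.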
Weighing up
- CASCADE is a powerful and helpful option, but it can have major consequences. Before implementing CASCADE actions, you should therefore check which records will be changed, to prevent unintentional data loss.
- The appropriate option for ensuring referential integrity should be selected depending on the data integrity requirements.
- Complex relationships quickly become confusing and should be avoided where possible, as they can lead to problems during updates or deletions.
Best practices
- A database schema helps to document the relationships between tables and list the restrictions on referential integrity. This provides clear rules for the future maintenance and development of the database.
- Before changes are made to the production database, changes and deletions should be carried out in test environments to ensure that there are no undesirable consequences.
- If many CASCADE actions are used, the database should be checked regularly for anomalies, especially if deletions or updates have taken place.
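A regular check for anomalies could look like the following sketch, which searches for orphaned foreign keys with Python's sqlite3 module. The schema is hypothetical, and enforcement is switched off deliberately to simulate data that was loaded without checks. The LEFT JOIN query is plain SQL, whereas PRAGMA foreign_key_check is specific to SQLite.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Enforcement is disabled here so that an orphaned row can be created.
conn.execute("PRAGMA foreign_keys = OFF")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(customer_id)
    );
    INSERT INTO customers VALUES (1, 'Alice');
    INSERT INTO orders VALUES (10, 1), (11, 99);  -- 99 has no matching customer
""")

# Portable check: orders whose customer_id matches no row in customers.
orphans = conn.execute("""
    SELECT o.order_id, o.customer_id
    FROM orders o
    LEFT JOIN customers c ON c.customer_id = o.customer_id
    WHERE o.customer_id IS NOT NULL AND c.customer_id IS NULL
""").fetchall()
print(orphans)  # [(11, 99)]

# SQLite-specific check: one result row per violated foreign key.
violations = conn.execute("PRAGMA foreign_key_check").fetchall()
print(len(violations))  # 1
```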
Handling deletions and updates correctly is critical to building and maintaining a consistent and accurate database. Following the practices above helps ensure that deletions or updates do not lead to unwanted side effects and that the database remains free of inconsistencies.
What is Normalization?
Normalization is a concept from database design that eliminates redundancies, i.e. duplicated data, in the database. This saves storage space and also prevents anomalies from occurring. Integrity and normalization are closely related topics in database theory and should be considered together.
In practice, three normal forms are particularly important, because a database that satisfies them is usually well structured while requiring relatively little design effort. The benefit relative to the effort is thus comparatively high up to the third normal form. Further normal forms exist in theory, but this article is limited to the first three:
The 1st normal form is achieved when all attribute values are atomic. This means that each data field may contain only one value. It should also be ensured that each column contains only values of the same data type (numeric, text, etc.). For example, a field that stores several phone numbers separated by commas would have to be split into one value per row to bring the table into 1st normal form.
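The step to the 1st normal form can be sketched in plain Python. The sample rows, the phones field and the function to_first_normal_form are hypothetical and only illustrate how a multi-valued field is split into atomic values.

```python
# Hypothetical rows that violate 1NF: "phones" holds several values at once.
unnormalized = [
    {"customer_id": 1, "name": "Alice", "phones": "555-0100, 555-0101"},
    {"customer_id": 2, "name": "Bob", "phones": "555-0199"},
]

def to_first_normal_form(rows):
    """Split the multi-valued phones field into one row per phone number."""
    customers, phones = [], []
    for row in rows:
        customers.append((row["customer_id"], row["name"]))
        for phone in row["phones"].split(","):
            phones.append((row["customer_id"], phone.strip()))
    return customers, phones

customers, customer_phones = to_first_normal_form(unnormalized)
print(customer_phones)
# [(1, '555-0100'), (1, '555-0101'), (2, '555-0199')]
```

Each resulting row holds exactly one phone number, and the customer_id acts as a foreign key back to the customer.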
The 2nd normal form is satisfied if the first normal form is satisfied and every non-key column is fully functionally dependent on the entire primary key, not just on part of a composite key. The primary key is an attribute, or combination of attributes, that uniquely identifies a database row. Examples are the invoice number that identifies an invoice or the ID number that identifies a person.
The 3rd normal form is satisfied if the two preceding normal forms are fulfilled and there are no so-called transitive dependencies. A transitive dependency exists when a non-key attribute depends on the primary key only indirectly, via another non-key attribute.
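Removing a transitive dependency can be sketched with Python's sqlite3 module. In the hypothetical table orders_flat, the city depends on the customer and only indirectly on the order; moving it into a separate customers table brings the design into 3rd normal form. All table and column names are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    -- Not in 3NF: customer_city depends on customer_id, not on order_id.
    CREATE TABLE orders_flat (
        order_id      INTEGER PRIMARY KEY,
        customer_id   INTEGER NOT NULL,
        customer_city TEXT NOT NULL
    );
    INSERT INTO orders_flat VALUES
        (10, 1, 'Berlin'), (11, 1, 'Berlin'), (12, 2, 'Hamburg');

    -- 3NF: the city is stored once per customer.
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, city TEXT NOT NULL);
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(customer_id)
    );
    INSERT INTO customers SELECT DISTINCT customer_id, customer_city FROM orders_flat;
    INSERT INTO orders SELECT order_id, customer_id FROM orders_flat;
""")

# A move is now a single update instead of one update per order.
with conn:
    conn.execute("UPDATE customers SET city = 'Munich' WHERE customer_id = 1")

cities = conn.execute("""
    SELECT o.order_id, c.city
    FROM orders o JOIN customers c USING (customer_id)
    ORDER BY o.order_id
""").fetchall()
print(cities)  # [(10, 'Munich'), (11, 'Munich'), (12, 'Hamburg')]
```

The foreign key from orders to customers is what ties normalization back to referential integrity: the split only stays consistent as long as every order points to an existing customer.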
This is what you should take with you
- Referential integrity describes rules about foreign keys in a database that keep records consistent.
- When changing data records, it must be ensured that foreign keys refer to existing data records and that referencing data records are also deleted, or their foreign keys adjusted, when a referenced record is deleted.
- Modern database systems reject operations that would violate referential integrity and issue a corresponding error message.
- In addition to referential integrity, there is also range integrity, entity integrity, and logical consistency, which ensure that data is consistent.
Other Articles on the Topic of Referential Integrity
IBM has published an interesting article on referential integrity.
Niklas Lang
I have been working as a machine learning engineer and software developer since 2020 and am passionate about the world of data, algorithms and software development. In addition to my work in the field, I teach at several German universities, including the IU International University of Applied Sciences and the Baden-Württemberg Cooperative State University, in the fields of data science, mathematics and business analytics.
My goal is to present complex topics such as statistics and machine learning in a way that makes them not only understandable, but also exciting and tangible. I combine practical experience from industry with sound theoretical foundations to prepare my students in the best possible way for the challenges of the data world.