What is Referential Integrity?

Referential integrity is a term from database theory. It describes the rules by which a relational database keeps its data consistent and correct, and it focuses primarily on the primary and foreign keys of the tables.

What are primary and foreign keys?

Ideally, each table in a database has one column, or a combination of several columns, whose value is unique for every record. This column or combination of columns is called the primary key of the table. It ensures that each record, i.e. each row in the table, can be uniquely identified, which makes it possible to access individual records with an SQL query.

If no single column has unique values, several columns can be combined to establish uniqueness. Such a key is called a composite primary key, in contrast to a single-column primary key.

Unique primary key | Source: Author
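
To make the two kinds of primary key concrete, here is a minimal sketch using Python's built-in sqlite3 module. The tables customer and order_item and all column names are hypothetical and chosen only for illustration; other database systems use slightly different syntax.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Single-column primary key: customer_id alone identifies a row.
conn.execute("""
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL
    )
""")

# Composite primary key: neither column is unique on its own,
# but the pair (order_id, line_no) is.
conn.execute("""
    CREATE TABLE order_item (
        order_id INTEGER NOT NULL,
        line_no  INTEGER NOT NULL,
        product  TEXT NOT NULL,
        PRIMARY KEY (order_id, line_no)
    )
""")

conn.execute("INSERT INTO customer VALUES (1, 'Alice')")
conn.execute("INSERT INTO order_item VALUES (100, 1, 'Keyboard')")
conn.execute("INSERT INTO order_item VALUES (100, 2, 'Mouse')")  # same order, new line


def try_insert(sql):
    """Return True if the statement succeeds, False if a key constraint rejects it."""
    try:
        conn.execute(sql)
        return True
    except sqlite3.IntegrityError:
        return False


# Both inserts reuse an existing key value and are therefore rejected.
duplicate_single = try_insert("INSERT INTO customer VALUES (1, 'Bob')")
duplicate_composite = try_insert("INSERT INTO order_item VALUES (100, 2, 'Monitor')")

# The primary key lets a query address exactly one record.
row = conn.execute("SELECT name FROM customer WHERE customer_id = 1").fetchone()
```

After running the sketch, duplicate_single and duplicate_composite are both False, and row contains the single record for customer 1.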

A foreign key is an attribute, or a combination of several attributes, that refers to the primary key of another table. It is used to create links between tables in a database. Depending on the primary key in the other table, the foreign key consists of a single column or is a composite of several columns.

Foreign key relationship of two tables | Source: Author

It is important to note, however, that a foreign key does not have to be unique within its own table; the same value can occur several times. With the help of such links, large tables can be split into several smaller ones, which avoids storing the same information repeatedly and improves clarity.
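
The following sketch shows such a link between two tables, again with Python's sqlite3 module. The customer and orders tables and their columns are assumptions made for this example. Note that customer_id appears twice in orders, which is allowed because a foreign key does not have to be unique.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")

conn.executescript("""
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL
    );
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        -- foreign key: points at the primary key of customer
        customer_id INTEGER NOT NULL REFERENCES customer (customer_id),
        amount      REAL NOT NULL
    );
    INSERT INTO customer VALUES (1, 'Alice'), (2, 'Bob');
    -- customer 1 is referenced by two orders
    INSERT INTO orders VALUES (10, 1, 25.0), (11, 1, 40.0), (12, 2, 15.0);
""")

# The foreign key is what lets us join the small tables back together.
rows = conn.execute("""
    SELECT c.name, COUNT(*) AS n_orders, SUM(o.amount) AS total
    FROM orders AS o
    JOIN customer AS c ON c.customer_id = o.customer_id
    GROUP BY c.customer_id
    ORDER BY c.customer_id
""").fetchall()
# rows == [('Alice', 2, 65.0), ('Bob', 1, 15.0)]
```

The customer name is stored only once, in customer, no matter how many orders refer to it.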

What is referential integrity?

Referential integrity (RI) ensures that the relationships between tables do not lead to inconsistencies in the database. Two rules must be followed to guarantee it:

RI rule 1: When a new record with a foreign key is inserted, it must be ensured that a record for this key exists in the referenced table. If this is not the case, the insertion process should not be possible.
RI rule 2: A record must not be deleted as long as its primary key is still used as a foreign key in another table. The database must ensure that such transactions are not possible.

If both rules are enforced, changes to the database cannot leave a foreign key pointing to a record that does not exist.
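
Both rules can be observed directly in a small experiment. The sketch below assumes SQLite (via Python's sqlite3 module) and hypothetical customer and orders tables; the helper is_rejected is introduced only for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # enforcement is off by default in SQLite
conn.executescript("""
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL
    );
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customer (customer_id)
    );
    INSERT INTO customer VALUES (1, 'Alice');
    INSERT INTO orders VALUES (10, 1);
""")


def is_rejected(sql):
    """Return True if the database refuses the statement with an integrity error."""
    try:
        conn.execute(sql)
    except sqlite3.IntegrityError:
        return True
    return False


# Rule 1: customer 99 does not exist, so the order cannot be inserted.
orphan_insert_rejected = is_rejected("INSERT INTO orders VALUES (11, 99)")

# Rule 2: customer 1 is still referenced by order 10, so it cannot be deleted.
parent_delete_rejected = is_rejected("DELETE FROM customer WHERE customer_id = 1")

# An order for an existing customer is accepted as usual.
valid_insert_rejected = is_rejected("INSERT INTO orders VALUES (12, 1)")
```

The first two statements are rejected and the third is accepted, so the data stays consistent.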

How to create referential integrity in a database?

Modern database systems, such as MySQL, reject transactions that would violate referential integrity, provided the foreign key constraints have been declared. In case of a violation, the transaction is not executed and a corresponding error message is issued. For example, if we tried to delete a customer in our example with the SQL statement “DELETE”, an error would occur because the customer is still referenced in the order table.

Since large databases quickly become confusing and you don’t want to delete all dependent entries manually, SQL offers a special option for the foreign key definition that deletes records while preserving integrity. If the foreign key is defined with “ON DELETE CASCADE”, deleting the customer automatically deletes all of that customer’s orders in the order table as well, so that integrity is maintained.
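
As a minimal sketch of this behavior, the example below declares the foreign key with ON DELETE CASCADE in SQLite. The table and column names are assumptions. In MySQL and other systems the clause is written the same way, but the surrounding setup differs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL
    );
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL
            REFERENCES customer (customer_id) ON DELETE CASCADE
    );
    INSERT INTO customer VALUES (1, 'Alice'), (2, 'Bob');
    INSERT INTO orders VALUES (10, 1), (11, 1), (12, 2);
""")

# Deleting the customer also removes orders 10 and 11 automatically.
conn.execute("DELETE FROM customer WHERE customer_id = 1")

remaining_orders = conn.execute(
    "SELECT order_id, customer_id FROM orders ORDER BY order_id"
).fetchall()
# remaining_orders == [(12, 2)]
```

Only the order of the remaining customer survives, so no foreign key points to a deleted record.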

In reality, of course, we would not perform such a transaction, as it would distort the sales figures. Instead, we would rather work with a status that marks customers as either active or inactive.
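
One possible way to implement such a status is an extra column on the customer table. The column name is_active and the rest of the schema are hypothetical; this is a sketch of the idea rather than a recommended design.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL,
        -- 1 = active, 0 = inactive; the CHECK keeps other values out
        is_active   INTEGER NOT NULL DEFAULT 1 CHECK (is_active IN (0, 1))
    );
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customer (customer_id),
        amount      REAL NOT NULL
    );
    INSERT INTO customer (customer_id, name) VALUES (1, 'Alice'), (2, 'Bob');
    INSERT INTO orders VALUES (10, 1, 25.0), (11, 1, 40.0), (12, 2, 15.0);
""")

# Instead of deleting the customer, mark the record as inactive.
conn.execute("UPDATE customer SET is_active = 0 WHERE customer_id = 1")

active_customers = [
    name for (name,) in conn.execute(
        "SELECT name FROM customer WHERE is_active = 1 ORDER BY customer_id"
    )
]
# The historical orders are untouched, so the sales total is unchanged.
total_sales = conn.execute("SELECT SUM(amount) FROM orders").fetchone()[0]
```

The customer disappears from the list of active customers, while all three orders still count toward total_sales.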

What are the integrity rules for databases?

In addition to referential integrity, there are other integrity provisions that a database must meet to be in a consistent, or truthful, state. These include:

Range integrity: The values of an attribute, i.e. a column, must lie within a predefined range. For example, an age may only be between 0 and 150 years; an attempt to store an age of 200 years violates range integrity.
Entity integrity: Each record, i.e. each entity, must have a primary key that is unique and not empty (NULL).
Referential integrity: Every foreign key must refer to an existing data record. In addition, a record whose primary key is referenced can only be deleted if the records in which this key appears as a foreign key are deleted as well.
Logical consistency: Depending on the application, users can define their own integrity requirements, which must also be met.

The database is only in a consistent state if all these conditions are met.
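
Range and entity integrity can also be declared in the table definition. The following sketch uses a CHECK constraint for the age example and a primary key for entity integrity; the person table is hypothetical, and SQLite is assumed as the database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE person (
        person_id INTEGER PRIMARY KEY,                          -- entity integrity
        name      TEXT NOT NULL,
        age       INTEGER NOT NULL CHECK (age BETWEEN 0 AND 150) -- range integrity
    )
""")


def is_rejected(sql):
    """Return True if a constraint of the table refuses the statement."""
    try:
        conn.execute(sql)
    except sqlite3.IntegrityError:
        return True
    return False


valid_rejected = is_rejected("INSERT INTO person VALUES (1, 'Alice', 34)")
age_200_rejected = is_rejected("INSERT INTO person VALUES (2, 'Bob', 200)")
duplicate_key_rejected = is_rejected("INSERT INTO person VALUES (1, 'Carol', 50)")
```

The first insert is accepted; the age of 200 and the repeated primary key are both rejected.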

How do you maintain the referential integrity of a database in the long term?

Referential integrity is not only a matter of the initial setup of a database; it must also be ensured during operation, for example when data is updated or deleted. It is therefore important to decide in advance how these operations should be handled and to configure the database accordingly.

Updates

When values or rows that are part of a relationship are updated, the rules of referential integrity ensure that the existing relationships are not broken. The following options can be set on the foreign key:

Cascade updates: With ON UPDATE CASCADE, when a primary key value is updated, the associated foreign keys in all related tables are automatically updated as well. This option should be used when changes to a key are meant to propagate through the entire database.
Restriction of updates: The opposite option is NO ACTION, which rejects an update to a key value as long as related data records still refer to it. This ensures data consistency and prevents unintentional changes to key values.
Set null or set default: If the foreign keys have been defined with SET NULL or SET DEFAULT, an update to the primary key in the referenced table results in the foreign key values being set to NULL or to the defined default value.
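
The effect of each update option can be compared side by side. The sketch below builds the same hypothetical customer and orders tables four times in SQLite, each time with a different ON UPDATE action, and then changes the customer's primary key from 1 to 100. The function name and the placeholder customer with ID 2 (used as the default value) are assumptions for this illustration.

```python
import sqlite3


def update_parent_key(action):
    """Change a referenced primary key and report what happens to the foreign keys.

    `action` is one of the fixed strings used below; it is inserted into the
    schema text, so it must never come from user input.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(f"""
        CREATE TABLE customer (
            customer_id INTEGER PRIMARY KEY,
            name        TEXT NOT NULL
        );
        CREATE TABLE orders (
            order_id    INTEGER PRIMARY KEY,
            customer_id INTEGER DEFAULT 2
                REFERENCES customer (customer_id) ON UPDATE {action}
        );
        INSERT INTO customer VALUES (1, 'Alice'), (2, 'Placeholder');
        INSERT INTO orders VALUES (10, 1), (11, 1);
    """)
    try:
        conn.execute("UPDATE customer SET customer_id = 100 WHERE customer_id = 1")
    except sqlite3.IntegrityError:
        return "rejected"
    return [fk for (fk,) in conn.execute(
        "SELECT customer_id FROM orders ORDER BY order_id"
    )]


results = {
    action: update_parent_key(action)
    for action in ("CASCADE", "NO ACTION", "SET NULL", "SET DEFAULT")
}
# results == {'CASCADE': [100, 100], 'NO ACTION': 'rejected',
#             'SET NULL': [None, None], 'SET DEFAULT': [2, 2]}
```

With SET DEFAULT, the default value itself must refer to an existing record, which is why the placeholder customer is needed here.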

Deletions

If data records are deleted, referencing data records can quickly become orphaned and relationships can lose their validity. The following options can therefore be used to prevent these problems:

Cascade deletions: Just as with updates, CASCADE causes the referencing data records in subordinate tables to be deleted as well. This option should be used with caution to prevent data loss due to unintentional deletion.
Restrict deletions: Like NO ACTION, the RESTRICT option rejects the operation: a data record cannot be deleted as long as records in related tables still refer to it. In this way, referential integrity is maintained and no accidental loss of data occurs.
Set null or set default: If the tables and their foreign keys have been set up with SET NULL or SET DEFAULT, a deletion results in the foreign keys in the related tables being set to NULL or to the default value.
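
The deletion options can be tried out in the same manner. This sketch deletes customer 1 under each ON DELETE action in SQLite and returns the rows left in the orders table. As before, the schema, the function name, and the placeholder customer with ID 2 are hypothetical.

```python
import sqlite3


def delete_parent(action):
    """Delete a referenced customer and return the remaining (order_id, customer_id) rows.

    `action` is one of the fixed strings used below and must not be user input.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(f"""
        CREATE TABLE customer (
            customer_id INTEGER PRIMARY KEY,
            name        TEXT NOT NULL
        );
        CREATE TABLE orders (
            order_id    INTEGER PRIMARY KEY,
            customer_id INTEGER DEFAULT 2
                REFERENCES customer (customer_id) ON DELETE {action}
        );
        INSERT INTO customer VALUES (1, 'Alice'), (2, 'Placeholder');
        INSERT INTO orders VALUES (10, 1), (11, 1), (12, 2);
    """)
    try:
        conn.execute("DELETE FROM customer WHERE customer_id = 1")
    except sqlite3.IntegrityError:
        return "rejected"
    return conn.execute(
        "SELECT order_id, customer_id FROM orders ORDER BY order_id"
    ).fetchall()


after_cascade = delete_parent("CASCADE")          # orders 10 and 11 are gone
after_restrict = delete_parent("RESTRICT")        # the deletion is refused
after_set_null = delete_parent("SET NULL")        # orders kept, reference cleared
after_set_default = delete_parent("SET DEFAULT")  # orders moved to customer 2
```

Only CASCADE removes order data; RESTRICT keeps everything by refusing the deletion, and the two SET options keep the orders but detach them from the deleted customer.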

Weighing up

The CASCADE option is a powerful and helpful tool, but it can have major consequences. Therefore, before implementing CASCADE actions, you should check which records would be affected in order to prevent unintentional data loss.
The appropriate option for ensuring referential integrity should be selected depending on the data integrity requirements.
Complex relationships quickly become confusing and should be avoided, as they can otherwise lead to problems during updates or deletions.

Best practices

A database schema helps to document the relationships between tables and list the restrictions on referential integrity. This provides clear rules for the future maintenance and development of the database.
Before changes are made to the production database, changes and deletions should be carried out in test environments to ensure that there are no undesirable consequences.
If many CASCADE actions are used, the database should be checked regularly for anomalies, especially if deletions or updates have taken place.
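
A regular check for anomalies could look like the sketch below. It simulates a database in which enforcement was switched off while an orphaned order was loaded, and then finds that order in two ways: with PRAGMA foreign_key_check, which is specific to SQLite, and with a LEFT JOIN query that should work in most SQL databases. The tables and data are invented for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Enforcement is deliberately off here to simulate data that was loaded unchecked.
conn.execute("PRAGMA foreign_keys = OFF")
conn.executescript("""
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL
    );
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customer (customer_id)
    );
    INSERT INTO customer VALUES (1, 'Alice');
    INSERT INTO orders VALUES (10, 1), (11, 99);  -- customer 99 does not exist
""")

# SQLite-specific: one result row per foreign key violation.
violations = conn.execute("PRAGMA foreign_key_check").fetchall()

# Portable alternative: orders whose customer cannot be found.
orphans = conn.execute("""
    SELECT o.order_id, o.customer_id
    FROM orders AS o
    LEFT JOIN customer AS c ON c.customer_id = o.customer_id
    WHERE o.customer_id IS NOT NULL
      AND c.customer_id IS NULL
""").fetchall()
# orphans == [(11, 99)]
```

Both checks report order 11, which can then be corrected or removed before enforcement is switched back on.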

Handling deletions and updates correctly is critical to building and maintaining a consistent and accurate database. Taking the above points into account greatly reduces the risk that deletions or updates lead to unwanted side effects or inconsistencies.

What is Normalization?

Normalization is a concept from database design that aims to eliminate redundancies, i.e. duplications in the database. This saves storage space and also prevents anomalies from occurring. Integrity and normalization are two closely related topics in database theory, so they should always be considered together.

In practice, three normal forms are particularly important. Once these are satisfied, the database usually performs well, and relatively little work is needed to get there, so the benefit is high compared to the effort up to the third normal form. In theory, there are further normal forms, such as the fourth and fifth, but in this article we will limit ourselves to the first three:

The 1st normal form is achieved when all values are atomic. This means that each data field may contain only one value. It should also be ensured that each column contains only values of the same data type (numeric, text, etc.). For example, a field that stores several phone numbers at once would have to be split up so that each number is stored on its own.
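
As a small illustration of the 1st normal form, the sketch below takes invented rows in which one field holds several phone numbers and splits them so that each number gets its own row. The data and the customer and customer_phone tables are assumptions made for this example.

```python
import sqlite3

# Hypothetical unnormalized input: the third field holds several values at once.
raw_rows = [
    (1, "Alice", "555-0100, 555-0101"),
    (2, "Bob", "555-0200"),
]

# Split the data: one list for customers, one row per phone number.
customers = [(cid, name) for cid, name, _ in raw_rows]
phones = [
    (cid, number.strip())
    for cid, _, numbers in raw_rows
    for number in numbers.split(",")
]

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL
    );
    CREATE TABLE customer_phone (
        customer_id INTEGER NOT NULL REFERENCES customer (customer_id),
        phone       TEXT NOT NULL,           -- exactly one number per field
        PRIMARY KEY (customer_id, phone)
    );
""")
conn.executemany("INSERT INTO customer VALUES (?, ?)", customers)
conn.executemany("INSERT INTO customer_phone VALUES (?, ?)", phones)

alice_phones = [
    phone for (phone,) in conn.execute(
        "SELECT phone FROM customer_phone WHERE customer_id = 1 ORDER BY phone"
    )
]
# alice_phones == ['555-0100', '555-0101']
```

Each field now contains a single value, and a specific phone number can be queried directly.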

The 2nd normal form is satisfied if the first normal form is satisfied and, in addition, every column that is not part of the primary key is fully functionally dependent on the entire primary key, not just on a part of it. The primary key denotes an attribute, or combination of attributes, that can be used to uniquely identify a database row. Examples are the invoice number to identify an invoice or the ID number to identify a person.
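
The sketch below shows one way such a partial dependency can be removed. In the hypothetical table order_item_flat, the primary key is (order_id, product_id), but product_name depends only on product_id. Moving it to its own product table brings the design into 2nd normal form. All names and data are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    -- Not in 2NF: product_name depends only on part of the key (product_id).
    CREATE TABLE order_item_flat (
        order_id     INTEGER NOT NULL,
        product_id   INTEGER NOT NULL,
        quantity     INTEGER NOT NULL,
        product_name TEXT NOT NULL,
        PRIMARY KEY (order_id, product_id)
    );
    INSERT INTO order_item_flat VALUES
        (100, 1, 2, 'Keyboard'),
        (101, 1, 1, 'Keyboard'),
        (101, 2, 3, 'Mouse');

    -- 2NF: the product name is stored once per product.
    CREATE TABLE product (
        product_id   INTEGER PRIMARY KEY,
        product_name TEXT NOT NULL
    );
    CREATE TABLE order_item (
        order_id   INTEGER NOT NULL,
        product_id INTEGER NOT NULL REFERENCES product (product_id),
        quantity   INTEGER NOT NULL,
        PRIMARY KEY (order_id, product_id)
    );
    INSERT INTO product
        SELECT DISTINCT product_id, product_name FROM order_item_flat;
    INSERT INTO order_item
        SELECT order_id, product_id, quantity FROM order_item_flat;
""")

products = conn.execute(
    "SELECT product_id, product_name FROM product ORDER BY product_id"
).fetchall()
# products == [(1, 'Keyboard'), (2, 'Mouse')]
```

The name 'Keyboard' was stored twice in the flat table and is stored only once afterwards.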

The 3rd normal form exists if the two preceding normal forms are fulfilled and there are additionally no so-called transitive dependencies. A transitive dependency exists when an attribute that is not part of the primary key depends on the primary key only indirectly, via another non-key attribute.
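
A transitive dependency and its removal can be sketched as follows. In the hypothetical table employee_flat, dept_name depends on dept_id, which in turn depends on the primary key emp_id. Splitting off a department table removes the dependency, and a department can then be renamed in a single place. The schema and data are assumptions for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    -- Not in 3NF: emp_id -> dept_id -> dept_name (transitive dependency).
    CREATE TABLE employee_flat (
        emp_id    INTEGER PRIMARY KEY,
        name      TEXT NOT NULL,
        dept_id   INTEGER NOT NULL,
        dept_name TEXT NOT NULL
    );
    INSERT INTO employee_flat VALUES
        (1, 'Alice', 10, 'Sales'),
        (2, 'Bob',   10, 'Sales'),
        (3, 'Carol', 20, 'IT');

    -- 3NF: the department name lives in its own table.
    CREATE TABLE department (
        dept_id   INTEGER PRIMARY KEY,
        dept_name TEXT NOT NULL
    );
    CREATE TABLE employee (
        emp_id  INTEGER PRIMARY KEY,
        name    TEXT NOT NULL,
        dept_id INTEGER NOT NULL REFERENCES department (dept_id)
    );
    INSERT INTO department
        SELECT DISTINCT dept_id, dept_name FROM employee_flat;
    INSERT INTO employee
        SELECT emp_id, name, dept_id FROM employee_flat;
""")

# Renaming the department now touches exactly one row.
conn.execute("UPDATE department SET dept_name = 'Revenue' WHERE dept_id = 10")

rows = conn.execute("""
    SELECT e.name, d.dept_name
    FROM employee AS e
    JOIN department AS d ON d.dept_id = e.dept_id
    ORDER BY e.emp_id
""").fetchall()
# rows == [('Alice', 'Revenue'), ('Bob', 'Revenue'), ('Carol', 'IT')]
```

In the flat table, the same rename would have required changing every employee row of the department, with the risk of missing one.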

This is what you should take with you

Referential integrity describes provisions about foreign keys in a database that lead to consistent records.
When data records are changed, it must be ensured that foreign keys refer to existing data records and that a referenced record is only deleted if all records referring to it are deleted as well.
Modern database systems reject transactions that would violate referential integrity and issue a corresponding error message.
In addition to referential integrity, there is also range integrity, entity integrity, and logical consistency, which ensure that data is consistent.