Integrity constraints in a relational database - should we overlook them?
If, as stated in your post, the intention is to create a relational database (RDB, for brevity) that is therefore expected to function as such, the short answer is:
- No, you should not overlook data integrity constraints.
The primary objective should be to manage the pertinent data as what it is: a very valuable organizational asset. A reliable way to achieve said objective is to employ technical means that are supported by sound theory.
Thus, as a database professional, you can take advantage of the state-of-the-art and elegant relational model mechanisms supplied by Dr. E. F. Codd to enforce business rules, and avoid the (technical and organizational) problems that would eventually arise if they are not utilized.
In this respect, I will share (a) my overall take on constraints and also (b) several considerations about the state of affairs of the database and the work environment at issue as follows.
FOREIGN KEY constraints, data relationships and referential integrity
A RDB must reflect the characteristics of the business context of interest with high accuracy, which definitely requires an in-depth conceptual-level analysis led by a modeler or designer who follows best practices, with the indispensable assistance of the business experts. That analysis must yield the correct identification and formulation of the applicable business rules.
Consequently, if such a modeler has identified that there exist interrelationships between the data of relevance, he or she must configure the corresponding logical-level restrictions so that the database management system (DBMS) can guarantee, at all times, that the data remains consistent with the exact characteristics and rules determined in the analysis referred to above.
Regarding the database under discussion, one can infer that the pertinent interrelationships have been identified, since you mention that there is a procedural (and easy-to-circumvent) attempt to enforce them from outside the DBMS facilities, by dint of application program code (which is a pre-relational approach) that, in any case, has to “touch” the database to try to validate the integrity of said interrelationships.
However, as you know, that is not the optimal technique for protecting referential integrity, because relational science has prescribed a very powerful instrument for this purpose, i.e., FOREIGN KEY (FK) constraints. These constraints are very easy to create (via the superior declarative approach), as each is a single sentence that avoids resorting to unnecessary and error-prone ad hoc procedures. It is very useful to note that the execution speed of FK constraints has been highly optimized by specialized programmers (and the major platform vendors have worked on it for decades now).
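For instance, a referential rule that is currently re-checked in every application program can be declared once, at the database level. The sketch below uses hypothetical employee/department tables purely for illustration:

    -- Hypothetical tables; names and columns are illustrative only.
    CREATE TABLE department (
      department_number INT         NOT NULL,
      department_name   VARCHAR(60) NOT NULL,
      CONSTRAINT department_pk PRIMARY KEY (department_number)
    );

    CREATE TABLE employee (
      employee_number   INT         NOT NULL,
      employee_name     VARCHAR(60) NOT NULL,
      department_number INT         NOT NULL,
      CONSTRAINT employee_pk PRIMARY KEY (employee_number),
      -- One declarative sentence replaces the procedural checks in the app code.
      CONSTRAINT employee_to_department_fk
        FOREIGN KEY (department_number)
        REFERENCES department (department_number)
    );

Once declared, the DBMS rejects any employee row that points to a nonexistent department, no matter which application program issues the write.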
Furthermore, since a RDB must be an independent (self-protective, self-describing, etc.) software component that is capable of being accessed by multiple application programs (desktop, automatic, web, mobile, combinations thereof), it should not be “coupled” with the code of any of these apps.
Likewise, the data —being a significant organizational resource— naturally tends to outlive application programs, application programmers, application development platforms and programming paradigms.
PRIMARY KEY constraints and implications of duplicate rows
When —conceptually speaking— a particular kind of thing has been deemed significant in a business environment, a database modeler has to (1) determine its relevant characteristics —i.e., its properties—, (2) confirm said kind of thing as a prototype of entity instances —i.e., an entity type— and (3) represent it by way of a table made up of one or more columns in a logical design.
Then, just as it is paramount to distinguish each individual instance of a given entity type in the real-world business, each corresponding row enclosed in a table must be uniquely distinguishable as well. If a table does not have any KEY declared, it will eventually retain duplicates; and if two or more rows hold exactly the same values, then they all carry the same meaning: they all represent the same fact.
On that point, duplicate rows should be discarded for multiple reasons. From a theoretical perspective, the designer has to make sure that each row is always unique so that the tables work as relationally as the SQL data sub-language permits (which has important repercussions on data manipulation operations). Besides, from an informational perspective, if multiple rows represent the same fact, recording them is not only superfluous but harmful, as exemplified below:
- Suppose that someone has inserted two identical rows in a certain table.
- Later, someone else comes and updates only one occurrence of the duplicates. As a consequence, the other occurrence is not up-to-date anymore.
- Subsequently, another person updates the occurrence that had not been modified so far. In this manner, both duplicates have undergone different changes at distinct points in time.
- After that, when someone is interested in selecting the information conveyed by the rows in question, he or she can find two different “versions” of it.
In this way:
- Which “version” can be considered the correct, reliable one?
- Which one reflects the real world accurately?
As you know, this phenomenon can even have legal implications, a circumstance that surely is of enormous importance.
Besides, the time and effort that have to be employed to handle such contradictions (perhaps through some kind of “update synchronization”) would be better devoted to tasks that actually produce value for your organization. So, retaining contradictory rows ought to be avoided by design to keep the consistency of the database intact.
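The anomaly described above can be reproduced in a few statements; the table, columns and values below are, of course, hypothetical:

    -- A table declared without any KEY, so exact duplicates are accepted.
    CREATE TABLE contact (
      contact_name VARCHAR(60),
      phone_number VARCHAR(20)
    );

    INSERT INTO contact VALUES ('Ana Perez', '555-0101');
    INSERT INTO contact VALUES ('Ana Perez', '555-0101'); -- duplicate, silently retained

    -- A later "correction" that (via MySQL's LIMIT clause) touches only one of
    -- the duplicates leaves two contradictory "versions" of the same fact.
    UPDATE contact
       SET phone_number = '555-0199'
     WHERE contact_name = 'Ana Perez'
     LIMIT 1;

    SELECT * FROM contact; -- which phone number is the correct one?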
That is why the identification of a PRIMARY KEY (PK) and the declaration of the respective constraint should always be performed by the database designer. It must be mentioned, too, that a table may have more than one column or combination of columns that hold values which uniquely identify every row; as a consequence, besides setting up a PK constraint (ideally established as PRIMARY for pragmatic reasons), the designer must also declare one or more ALTERNATE KEYs (usually defined via one or more UNIQUE plus NOT NULL constraints) when applicable (which is pretty common).
Another advantageous property of PKs is that, when “migrated” to other tables to take part in single or composite FKs, they can help enforce the cardinality ratios of the relationships/associations that exist among the data. All this, yes, by means of simple and efficient declarative settings, invariably ensured by the DBMS.
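As an illustration (with invented table and column names), a table can carry both a PRIMARY KEY and an ALTERNATE KEY, and the PK can then be “migrated” into a composite KEY of a related table:

    CREATE TABLE product (
      product_code VARCHAR(10) NOT NULL,
      product_name VARCHAR(60) NOT NULL,
      barcode      VARCHAR(13) NOT NULL,
      CONSTRAINT product_pk PRIMARY KEY (product_code),
      -- ALTERNATE KEY: also identifies every row uniquely.
      CONSTRAINT product_ak UNIQUE (barcode)
    );

    CREATE TABLE order_item (
      order_number INT         NOT NULL,
      product_code VARCHAR(10) NOT NULL,
      quantity     INT         NOT NULL,
      -- The composite PK, which includes the migrated product_code, guarantees
      -- that a given product appears at most once per order (cardinality ratio).
      CONSTRAINT order_item_pk PRIMARY KEY (order_number, product_code),
      CONSTRAINT order_item_to_product_fk
        FOREIGN KEY (product_code) REFERENCES product (product_code)
      -- An additional FK from order_number to an "orders" table is omitted for brevity.
    );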
(Current) CHECK constraints and single-row validation
Let us not forget the relevance of (current) CHECK constraints which, by declaratively restricting the valid set of column values of a row (which may appear simple, but is in fact a fundamental feature of a relational DBMS), also help to make certain that the rules of the business context are always reflected with precision.
As you marked your question with the MySQL tag: starting with version 8.0.16 (see also this MySQL server team blog post), that platform finally enforces this type of constraint. In this regard, it must be mentioned that all of its prior versions did permit declaring CHECK constraints but ignored their enforcement, a situation that, understandably, had been reported as a bug since 2004.
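So, on 8.0.16 or later, a single-row rule can be declared and enforced directly; for example (hypothetical table and rule):

    -- Enforced from MySQL 8.0.16 onwards; silently ignored by earlier versions.
    CREATE TABLE salary_payment (
      payment_id INT            NOT NULL,
      amount     DECIMAL(10, 2) NOT NULL,
      CONSTRAINT salary_payment_pk PRIMARY KEY (payment_id),
      CONSTRAINT salary_payment_amount_ck CHECK (amount > 0)
    );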
So, if you are using a legacy version, you would have to take care of this factor by other means, e.g., ACID TRANSACTIONS, TRIGGERS, or other methods within the DBMS itself (see this answer by @ypercubeᵀᴹ for information on this subject), so that the data remains consistent.
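For instance, a minimal trigger-based substitute for the CHECK sketched above might look as follows (again, the table and rule are invented; a matching BEFORE UPDATE trigger would normally be needed as well):

    DELIMITER //
    -- Rejects invalid rows before insertion, emulating the declarative CHECK.
    CREATE TRIGGER salary_payment_amount_check
    BEFORE INSERT ON salary_payment
    FOR EACH ROW
    BEGIN
      IF NEW.amount <= 0 THEN
        SIGNAL SQLSTATE '45000'
          SET MESSAGE_TEXT = 'amount must be greater than zero';
      END IF;
    END//
    DELIMITER ;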
ASSERTION constraints: setting up further multi-row and multi-table business rules declaratively
One aspect that, for whatever reasons, is very poorly supported —if at all— by the different SQL DBMSs, including MySQL, is the ability to set up multi-row and multi-table constraints in a declarative fashion —beyond PKs and FKs, evidently—.
For its part, the SQL standard has included ASSERTIONs for many years now. I do not know which rules of your business environment would benefit from that logical-level validation approach but, as a database designer, I consider that it would be pretty handy to constrain data with one or more ASSERTIONs when required, although I have to mention that, from the point of view of the DBMS developers, this paramount kind of tool has been difficult to implement at the physical level of abstraction.
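To give an idea of what such a declaration looks like, here is standard SQL syntax expressing an invented multi-table rule; note that MySQL (and most other current DBMSs) will not accept it:

    -- Standard SQL only; shown to illustrate the idea, NOT runnable on MySQL.
    -- Hypothetical rule: a project may never have a larger budget than its department.
    CREATE ASSERTION project_budget_within_department CHECK (
      NOT EXISTS (
        SELECT *
          FROM project    AS p
          JOIN department AS d
            ON d.department_number = p.department_number
         WHERE p.project_budget > d.department_budget
      )
    );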
It appears that the Oracle vendor and/or developers have been evaluating ASSERTION support since 2016, and that would make that DBMS more relationally compliant and, hence, more robust and competitive. I guess that, if (i) their customers keep pushing and (ii) Oracle succeeds in the implementation, then (iii) other DBMS vendors/communities will have to enable ASSERTIONs too, and their usage will start to spread. Certainly, that would be huge progress in the data management field and, since this is one of the most distinctive tools envisioned by Dr. Codd, I personally hope that we will see it happening soon.
Data consistency and the decision-making process
As discussed above, one of the most important aspects of a RDB is that it guarantees by itself the consistency of the data it retains, and said consistency is only met when the RDB complies with the integrity constraints declared by the modeler.
In this respect, it is compulsory to have base tables (those established in the DDL structure) whose integrity is protected, in order to be able to create derived tables (e.g., a SELECT statement or view that retrieves columns from multiple tables) that are trustworthy, because derived tables necessarily have to be produced in terms of base tables.
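In other words, a derived table is only as reliable as the base tables it draws from. Reusing the hypothetical employee/department tables sketched earlier, a view like the following simply exposes data whose consistency the declared constraints have already guaranteed:

    -- A derived table: its trustworthiness rests entirely on the constrained base tables.
    CREATE VIEW employee_with_department AS
    SELECT e.employee_number,
           e.employee_name,
           d.department_name
      FROM employee   AS e
      JOIN department AS d
        ON d.department_number = e.department_number;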
It is pretty obvious that people use information as the main tool in the organizational (and in the ordinary) decision-making process. Then, if the information presented by a database is not coherent and accurate, the decisions based on such information will not be sound (to say the least). That is why a RDB must be carefully designed and implemented: it should be built to become a reliable resource that can assist its users to make well-founded decisions.
“Denormalization”
Alas, “a ‘denormalized’ database is faster than a normalized one” is a widespread misconception, although it is also an “argument” that can be refuted on logical, physical and pragmatic grounds.
Firstly, denormalization necessarily implies that a base table has been previously normalized (by virtue of a formal, science-based procedure carried out at the logical level of abstraction of a database).
So, assuming that said table was in fact normalized correctly, “denormalizing” it (which, in contrast to the formal meaning of the word, involves appending to it, in an ad hoc fashion, columns that belong in, and are also part of, other tables) might help, e.g., to speed up (at the physical level) the processing of only one or a few particular SELECT statements, while such a course of action might, at the same time, be undermining the execution of many other associated data manipulation operations (e.g., several INSERT, UPDATE, DELETE and SELECT statements, or combinations thereof, enclosed within a single or multiple ACID TRANSACTIONS).
In addition, denormalization (be it formal or informal) would introduce update/modification anomalies that deteriorate the coherence of the database, a problem that “may” be handled by complex, costly and error-prone procedures, when all this can be prevented from the very beginning.
Physical-level scaffoldings supporting normalized and “denormalized” tables
A logical (abstract) layout (SQL-DDL design) that is meant to be utilized in the real world clearly holds physical (concrete) repercussions that must be considered.
In this manner, a “denormalized” table would necessarily be “wider” (holding additional columns), which means that its rows would necessarily be heavier (requiring more and larger physical-level components), so the underlying computing processes (e.g., those that have to do with the hard drive or memory) can easily become slower.
In contrast, a normalized table, which is of course “narrower” (having fewer columns), would be a “lighter” element (served by fewer and smaller physical components) that “behaves faster”, which would speed up the series of actions related to, e.g., data writing and reading.
That being so, it is very convenient to (a) normalize the relevant tables formally and prudently, keeping them as such, and then (b) make use of any physical-level resource that can optimize data retrieval and modification speed, e.g., implementing a careful and efficient indexing strategy, enabling proper software and hardware server configurations, upgrading network bandwidth capabilities, etc.
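For instance, rather than widening a table with redundant columns, a supporting index on the columns that the slow SELECTs filter and join on is usually the first physical-level measure to try (column names are hypothetical):

    -- A physical-level optimization only: the logical design stays normalized.
    CREATE INDEX employee_department_ix
              ON employee (department_number, employee_name);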
The functioning of the database under consideration and approaching your colleagues
The following paragraphs of your question have to do with the speed of the data retrieval operations:
[A]s the product “works”, there is hesitation to enhance the database; nevertheless, the first thing I noticed is one page taking 1 minute to load (yes, 60 seconds!).
If loading a certain page takes that long, it is evident that the users of the system are not receiving an appropriate service; therefore, even when it “works”, its functioning does not seem to be optimal at all, a point that demonstrates that your intentions to make the entire computerized information system (database and apps) more efficient are well sustained, and shows a very constructive attitude.
Then, even though science definitely supports you and hence you should maintain a firm posture, I suggest approaching the situation in a diplomatic way, since at the end of the day your employers, coworkers and you are JOINing efforts to make the whole organization more successful. Thus, one argument you should stress is that, while they are doing other things more than well, improving general and specific data management practices can considerably help in producing more organizational and individual growth.
Most of the relevant queries include JOIN operations, which makes them run very, very, very slow with large amounts of data (the database contains millions of rows).
It is worth noting that the JOIN operator is an essential and powerful element of the relational manipulation of data. Then, although more robust platforms serve it with comparatively faster executions, the circumstance you describe is most probably a symptom of an inefficient design (at the conceptual, logical and physical levels of abstraction). So, my first-sight estimations are:
- The INDEX settings may require improvement.
- The PK and FK column type and size definitions need to be reviewed (and I totally agree with @Rick James regarding his PK considerations, as composite KEYs tend to be much more efficient than appended surrogates in the appropriate cases).
- Further (formal, science-based) normalization might help to alleviate these problems, on account of the fact that, in the right circumstances (i.e., carried out in a well-designed RDB), JOINs are executed very fast.
Moreover, yes, as @TommCatt mentions in his answer, sometimes a (logical) rewrite of a query modifies its (physical) execution plan, accelerating data reading/writing, which is a factor that should decidedly be taken into account.
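A quick way to confirm that a rewrite (or a new index) actually changed something is to compare the execution plans before and after; for example (hypothetical query):

    -- EXPLAIN reports the chosen plan without executing the query; run it on the
    -- original statement and on the rewritten one, then compare the output.
    EXPLAIN
    SELECT e.employee_name, d.department_name
      FROM employee   AS e
      JOIN department AS d
        ON d.department_number = e.department_number
     WHERE d.department_name = 'Accounting';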
The basic premise of your developers is absolutely wrong. Foreign keys will slightly impact the performance of the DML of your system. They are not used at all in queries, so they have no effect on query performance. So your developers don't know what they're talking about and are the very last people you should consider taking advice from.
Foreign keys play a critical role in maintaining the integrity of your data. This is much more important than any tiny performance improvement gained by removing them (even if that were true).
Do not, under any circumstances, remove FKs from an OLTP database.
Also, denormalizing will sometimes speed up some queries. It, as they say, depends. Still, even if there is some speed improvement, it is generally not worth the extra effort needed to maintain data integrity.
It is very rare that simple tuning can't get you a lot more speed improvement than denormalizing. This is where a good DBA can (finally) earn his pay. You can also tune your queries. I once took a query that returned an answer in no less than 30 minutes and got it to work in under 8 seconds. No changes to the database, just rewrote the query. Granted, this is my personal best record, so your mileage may vary, but denormalizing should be the very last thing you try.
You may also want to keep the more complicated queries from being written by the developers. Ask them what data they want and in what format they want it. Then provide views to give it to them. The complicated queries will be the views. The developers then only have to write:
select <something> from <SomeView> where <whatever>;
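For example, the complicated join lives in a view that you control (the names below are made up); the developers only ever query the view:

    -- The DBA owns the complicated query; application code sees a simple view.
    CREATE VIEW OrderSummary AS
    SELECT o.order_number,
           c.customer_name,
           SUM(oi.quantity * p.unit_price) AS order_total
      FROM orders     AS o
      JOIN customer   AS c  ON c.customer_id   = o.customer_id
      JOIN order_item AS oi ON oi.order_number = o.order_number
      JOIN product    AS p  ON p.product_code  = oi.product_code
     GROUP BY o.order_number, c.customer_name;

    -- A developer then needs nothing more than:
    SELECT * FROM OrderSummary WHERE order_number = 42;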
I am also assuming your database is otherwise well designed. A poor design of the database, or even small parts of it, can really slow things down. I've often worked with Very Large Tables (billions of records each) with queries that joined them together left and right and expected (and got) answers in fractions of a second. The size of a table is not determinative of the speed of the query.
I really cringe when someone says, "because the product 'works' there is hesitation to enhance the database." If this "hesitation" is more like "not on my watch, pal!" then you may even want to start updating your resume. Nothing good ever comes from such an environment and you will get the blame for every future failure even though you may have lobbied for hours to make a change that would have prevented the failure. You'll hear, "Now is not a good time to make changes" over and over. Right. Good luck.
Changing the title changes the question. FOREIGN KEYs are optional. They do:
- An FK implicitly creates an INDEX in one of the tables. Such an index can be manually added. (So an FK is not required for this.)
- An FK checks for integrity. This is the FK's main claim to fame. An FK is not required, since your application can do similar checks, or decide that a check is not needed. So...
- The integrity check costs something in performance; so it slows down the processing. (This is usually not a big deal.)
- FKs don't do everything that everyone wants; this forum is littered with "why can't FKs do X" questions. In particular, the CHECK option is not acted on.
- FKs can CASCADE things. (Personally, I prefer to stay in control, and not assume that the FK will 'do the right thing'.)
Bottom line for FKs: Some people insist on FKs; some products live perfectly well without them. You decide.
Getting rid of PRIMARY KEY in InnoDB is a big mistake. On the other hand, getting rid of a surrogate AUTO_INCREMENT and using a "natural" PK made up of one (or more) columns is often the right thing to do. A simple, common case is a many:many mapping table, as discussed here.
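A minimal sketch of what such a mapping table might look like (table and column names invented):

    -- many:many mapping between students and classes; no AUTO_INCREMENT id needed.
    CREATE TABLE student_class (
      student_id INT NOT NULL,
      class_id   INT NOT NULL,
      PRIMARY KEY (student_id, class_id),  -- natural, composite PK
      INDEX (class_id, student_id)         -- covers lookups in the other direction
      -- FOREIGN KEY clauses to the student and class tables are omitted here.
    ) ENGINE = InnoDB;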
Based on personal experience, I suggest that 2/3 of tables are better off using a 'natural' PK instead of an auto_inc PK.