Normalizing a table with a field that generally uniquely identifies a row, but is sometimes null
Provided that Sku and ItemNumber will always imply unique values
I think you had already found the answer when you determined that, conceptually speaking, ItemNumber is an optional property, i.e., that it does not apply to each and every occurrence (represented by logical-level rows) of the Product entity type. Therefore, the item_number column should not be declared as an ALTERNATE KEY (AK for brevity) in the product table, as you rightly pointed out.
In this respect, your Scenario B is quite reasonable, as the following conceptual-level formulation demonstrates:
- A product may or may not have an item number.
In other words, there is a one to zero or one (1:0/1) cardinality ratio between Product and ItemNumber.
Then, yes, you should introduce a new table to deal with the optional column, and I agree that product_item_number is a very descriptive name for it. This table should have sku constrained as its PRIMARY KEY (PK), so as to ensure that no more than one row with the same sku value is inserted into it, just like you did.
It is important to mention that product_item_number.sku should also be constrained as a FOREIGN KEY (FK) that references product.sku.
Here is a sample SQL-DDL logical-level design that illustrates the previous suggestions:
-- You should determine which are the most fitting
-- data types and sizes for all your table columns
-- depending on your business context characteristics.
-- Also, you should make accurate tests to define
-- the most convenient INDEXing strategies.
CREATE TABLE product (
sku TEXT NOT NULL,
name TEXT NOT NULL,
price NUMERIC NOT NULL,
quantity NUMERIC NOT NULL,
--
CONSTRAINT product_PK PRIMARY KEY (sku),
CONSTRAINT product_AK UNIQUE (name), -- AK.
CONSTRAINT valid_price_CK CHECK (price > 0),
CONSTRAINT valid_quantity_CK CHECK (quantity > 0)
);
CREATE TABLE product_item_number (
sku TEXT NOT NULL, -- To be constrained as PK and FK to ensure the 1:0/1 correspondence ratio between the relevant rows.
item_number TEXT NOT NULL,
--
CONSTRAINT product_item_number_PK PRIMARY KEY (sku),
CONSTRAINT product_item_number_AK UNIQUE (item_number), -- In this context, ‘item_number’ is an AK.
CONSTRAINT product_item_number_TO_product_FK FOREIGN KEY (sku)
REFERENCES product (sku)
);
Tested on PostgreSQL 11 in this db<>fiddle.
Moreover, there is another conceptual formulation that guides in shaping the database design presented above:
- If it exists, the ItemNumber of a Product must be unique.
So, the place where the item_number column should actually be declared as an AK is the product_item_number table, because said column requires uniqueness protection only when the pertinent value is provided; hence the UNIQUE and NOT NULL constraints have to be configured accordingly.
Missing values and the “Closed World Interpretation”
The logical SQL-DDL arrangement previously described is an example of the relational approach to handle missing values, although it is not the most popular —or usual—. This approach is related to the “Closed World Interpretation” —or “Assumption”—. Adopting this position, (a) the information recorded in the database is always deemed true, and (b) the information that is not recorded in it is, at all times, deemed false. In this way, one is exclusively retaining facts that are known.
In the present business scenario, when a user supplies all the data points that make up a product row, you have to INSERT that row into the product table and, if and only if the user also made the item_number datum available, you have to INSERT the product_item_number counterpart as well. If the item_number value is unknown or simply does not apply, you do not INSERT a product_item_number row, and that is it.
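The insertion rule just described could look like the following sketch. It assumes the product and product_item_number tables from the first DDL listing; the SKUs, names, and item number are made-up sample values.

```sql
-- Hypothetical sample data; the table definitions are the ones shown earlier.
BEGIN;

-- A product whose item number is known: one row in each table.
INSERT INTO product (sku, name, price, quantity)
VALUES ('SKU-0001', 'Red widget', 9.99, 10);

INSERT INTO product_item_number (sku, item_number)
VALUES ('SKU-0001', 'D57-RED');

-- A product whose item number is unknown or inapplicable:
-- only the product row is inserted.
INSERT INTO product (sku, name, price, quantity)
VALUES ('SKU-0002', 'Blue widget', 4.50, 3);

COMMIT;

-- Reading the data back: the LEFT JOIN returns every product, with a
-- NULL mark in the result set where no item number was recorded.
SELECT p.sku, p.name, pin.item_number
  FROM product AS p
  LEFT JOIN product_item_number AS pin
    ON pin.sku = p.sku
 ORDER BY p.sku;
```

Note that the base tables hold no NULL marks in this arrangement; they appear only in the derived result of the outer join.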
With this method you avoid holding NULL marks/markers in your base tables (and the logical-level consequences that I will detail in the next section), but you should be aware that this is a “controversial” topic in the database administration field. On this point, you might find value in the answers to the Stack Overflow question entitled:
- “How can I avoid NULLs in my database, while also representing missing data?”
The popular course of action
I guess, however, that the popular (or common) course of action would be to have a single product table that includes the item_number column which, in turn, would be set as NULLable and, at the same time, defined with a UNIQUE constraint. The way I see it, this approach would make your database and the applicable data manipulation operations less elegant (as shown, e.g., in this outstanding Stack Overflow answer), but it is a possibility.
See the following DDL statement, which exemplifies this course of action:
CREATE TABLE product (
sku TEXT NOT NULL,
name TEXT NOT NULL,
price NUMERIC NOT NULL,
quantity NUMERIC NOT NULL,
item_number TEXT NULL, -- Accepting NULL marks.
--
CONSTRAINT product_PK PRIMARY KEY (sku),
CONSTRAINT product_AK1 UNIQUE (name), -- AK.
CONSTRAINT product_AK2 UNIQUE (item_number), -- Being ‘NULLable’, this is not an AK.
CONSTRAINT valid_price_CK CHECK (price > 0),
CONSTRAINT valid_quantity_CK CHECK (quantity > 0)
);
Tested on PostgreSQL 11 in this db<>fiddle.
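As a brief sketch of how this single-table variant behaves in PostgreSQL, assuming the product table just defined and made-up sample rows: several rows may leave item_number as NULL, because NULLs are not considered equal for the purposes of a UNIQUE constraint, while a repeated non-NULL value is rejected.

```sql
-- Hypothetical sample data for the single-table design.
INSERT INTO product (sku, name, price, quantity, item_number)
VALUES ('SKU-0001', 'Red widget',   9.99, 10, 'D57-RED'),
       ('SKU-0002', 'Blue widget',  4.50,  3, NULL),
       ('SKU-0003', 'Green widget', 4.50,  7, NULL);  -- A second NULL is accepted.

-- This statement fails with a unique-violation error raised by the
-- UNIQUE constraint on item_number, because 'D57-RED' is already present.
INSERT INTO product (sku, name, price, quantity, item_number)
VALUES ('SKU-0004', 'Pink widget', 5.00, 1, 'D57-RED');
```

This is the default behaviour; PostgreSQL 15 and later also accept UNIQUE NULLS NOT DISTINCT, which would allow at most one NULL in the column.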
So, having established item_number as a column that can contain NULLs, it is not correct to say, logically speaking, that it is an AK. Furthermore, you would be storing ambiguous NULL marks (which are not values, no matter if the PostgreSQL documentation labels them that way), so it can be argued that the table would not be a proper representation of an adapted mathematical relation and that normalization rules cannot be applied to it.
Since a NULL indicates that a column value is (1) unknown or (2) inapplicable, it cannot be rightly stated that said mark belongs to the valid domain of values of item_number. As you know, this kind of mark tells something about the “status” of a real value, but it is not a value itself and, naturally, it does not behave as such. By the way, it is worth mentioning that NULLs behave differently across the distinct SQL database management systems, even across distinct versions of the same database management system.
Then, if (i) the domain of values of a certain column and (ii) the meaning that said column carries are not entirely clear as a result of the inclusion of NULLs:
- How could one evaluate and define the relevant functional dependencies?
- How can it be identified and declared as a PRIMARY or ALTERNATE KEY (as in the case of item_number)?
Despite both the theoretical and practical (e.g., regarding data manipulation) implications that concern the retention of NULL marks in a database, this is the approach to handle missing data that you will find in the vast majority of the databases built on SQL platforms, since it permits attaching columns for optional values to the base tables of significance and, as an effect, avoids the creation of (a) a complementary table and (b) the associated tasks.
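The data manipulation implications can be illustrated with a short sketch. It assumes the single-table product design with a NULLable item_number column and some rows where that column is NULL; 'D57-RED' is a made-up value.

```sql
-- Comparing anything with NULL yields NULL (unknown), not true or false.
SELECT NULL = NULL AS equals_result;

-- Rows whose item_number is NULL are NOT returned here, because
-- "item_number <> 'D57-RED'" evaluates to NULL for them.
SELECT sku
  FROM product
 WHERE item_number <> 'D57-RED';

-- To include those rows, the missing case has to be handled explicitly:
SELECT sku
  FROM product
 WHERE item_number <> 'D57-RED'
    OR item_number IS NULL;

-- PostgreSQL also offers a NULL-safe comparison predicate:
SELECT sku
  FROM product
 WHERE item_number IS DISTINCT FROM 'D57-RED';
```

With the two-table design there is no NULLable column in a base table, so predicates on product_item_number.item_number do not need this extra handling.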
The decision
I have presented the two alternatives so that you can determine by yourself which one is more suitable to achieve your objectives.
Assuming that the Sku and ItemNumber values can eventually be duplicated
There are some points of your question that caught my attention in a particular way, so I list them here:
Sometimes (maybe 3% to 5% of the time), the item_number is actually equal to the SKU. That is, one of my suppliers in particular affixes to their products what I suspect is not a globally unique SKU, fashioned after their item number.
[…] there may be cases where a supplier recycles a catalog number with a different sku (maybe?), or situations where two manufacturer's both make a "d57-red" or something like that. In that case, I think I'd have to programmatically prefix offending item_numbers with manufacturer names or something like that.
A sku will always be unique in my domain (The small amount of non-globablly unique supplier-provided SKUs are unlikely to ever collide).
Those points can have remarkable repercussions because they seem to suggest that:
- The ItemNumber values can eventually become duplicated and, when that happens, you might evaluate combining two different pieces of information that bear different meanings in the same column.
- It is probable that, after all, the Sku values might be repeated (even if it is a small amount of repeated Sku instances).
In this regard, it is worth noting that two paramount objectives of a data modelling exercise are (1) determining each individual datum of significance and (2) preventing the retention of more than one of them in the same column. These factors, e.g., facilitate the delineation of a stable and versatile database structure and assist in the avoidance of duplicated information (which helps to keep the data values consistent with the business rules, via the respective constraints).
Alternative to handle Sku duplicates: Introducing a manufacturer table to the scenario
Consequently, on condition that the same Sku value can be shared across different Manufacturers, you could make use of a composite PK constraint in the product table, made up of (i) the manufacturer PK column and (ii) sku. E.g.:
CREATE TABLE manufacturer (
manufacturer_number INTEGER NOT NULL, -- This could be something more meaningful, e.g., ‘manufacturer_code’.
name TEXT NOT NULL,
--
CONSTRAINT manufacturer_PK PRIMARY KEY (manufacturer_number),
CONSTRAINT manufacturer_AK UNIQUE (name) -- AK.
);
CREATE TABLE product (
manufacturer_number INTEGER NOT NULL,
sku TEXT NOT NULL,
name TEXT NOT NULL,
price NUMERIC NOT NULL,
quantity NUMERIC NOT NULL,
--
CONSTRAINT product_PK PRIMARY KEY (manufacturer_number, sku), -- Composite PK.
CONSTRAINT product_AK UNIQUE (name), -- AK.
CONSTRAINT product_TO_manufacturer_FK FOREIGN KEY (manufacturer_number)
REFERENCES manufacturer (manufacturer_number),
CONSTRAINT valid_price_CK CHECK (price > 0),
CONSTRAINT valid_quantity_CK CHECK (quantity > 0)
);
And, if the ItemNumber demands uniqueness preservation when it is applicable, then the product_item_number table can be structured as follows:
CREATE TABLE product_item_number (
manufacturer_number INTEGER NOT NULL,
sku TEXT NOT NULL,
item_number TEXT NOT NULL,
--
CONSTRAINT product_item_number_PK PRIMARY KEY (manufacturer_number, sku), -- Composite PK.
CONSTRAINT product_item_number_AK UNIQUE (item_number), -- AK.
CONSTRAINT product_item_number_TO_product_FK FOREIGN KEY (manufacturer_number, sku)
REFERENCES product (manufacturer_number, sku)
);
Tested on PostgreSQL 11 in this db<>fiddle.
If ItemNumber does not require preventing duplicates, you simply remove the UNIQUE constraint declared for that column, as shown in the next DDL statement:
CREATE TABLE product_item_number (
manufacturer_number INTEGER NOT NULL,
sku TEXT NOT NULL,
item_number TEXT NOT NULL, -- In this case, ‘item_number’ does not require a UNIQUE constraint.
--
CONSTRAINT product_item_number_PK PRIMARY KEY (manufacturer_number, sku), -- Composite PK.
CONSTRAINT product_item_number_TO_product_FK FOREIGN KEY (manufacturer_number, sku)
REFERENCES product (manufacturer_number, sku)
);
On the other hand, supposing that ItemNumber values must not be repeated only within the scope of the associated Manufacturer, you can set up a composite UNIQUE constraint consisting of manufacturer_number and item_number, as demonstrated in the code lines below:
CREATE TABLE product_item_number (
manufacturer_number INTEGER NOT NULL,
sku TEXT NOT NULL,
item_number TEXT NOT NULL,
--
CONSTRAINT product_item_number_PK PRIMARY KEY (manufacturer_number, sku), -- Composite PK.
CONSTRAINT product_item_number_AK UNIQUE (manufacturer_number, item_number), -- Composite AK.
CONSTRAINT product_item_number_TO_product_FK FOREIGN KEY (manufacturer_number, sku) -- Composite FK.
REFERENCES product (manufacturer_number, sku)
);
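To illustrate what the composite AK accepts and rejects, here is a sketch that assumes the manufacturer table, the product table with the composite PK, and the product_item_number variant just shown. All manufacturer names, SKUs, and item numbers are made-up sample values.

```sql
-- Hypothetical sample data.
INSERT INTO manufacturer (manufacturer_number, name)
VALUES (1, 'Acme'),
       (2, 'Globex');

INSERT INTO product (manufacturer_number, sku, name, price, quantity)
VALUES (1, 'SKU-0001', 'Acme red widget',   9.99, 10),
       (2, 'SKU-0001', 'Globex red widget', 8.99,  5),  -- Same sku, other manufacturer.
       (1, 'SKU-0002', 'Acme blue widget',  4.50,  3);

-- Accepted: the same item_number under two different manufacturers.
INSERT INTO product_item_number (manufacturer_number, sku, item_number)
VALUES (1, 'SKU-0001', 'D57-RED'),
       (2, 'SKU-0001', 'D57-RED');

-- Rejected with a unique-violation error: manufacturer 1 already uses
-- 'D57-RED' for another product.
INSERT INTO product_item_number (manufacturer_number, sku, item_number)
VALUES (1, 'SKU-0002', 'D57-RED');
```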
When Sku values are always unique but a specific ItemNumber value can be shared among distinct Manufacturers
If you can guarantee that Product.Sku will never imply duplicates but an ItemNumber might be used by distinct Manufacturers, you can configure your database as shown here:
CREATE TABLE manufacturer (
manufacturer_number INTEGER NOT NULL,
name TEXT NOT NULL,
--
CONSTRAINT manufacturer_PK PRIMARY KEY (manufacturer_number),
CONSTRAINT manufacturer_AK UNIQUE (name) -- AK.
);
CREATE TABLE product (
sku TEXT NOT NULL,
name TEXT NOT NULL,
price NUMERIC NOT NULL,
quantity NUMERIC NOT NULL,
--
CONSTRAINT product_PK PRIMARY KEY (sku),
CONSTRAINT product_AK UNIQUE (name), -- AK.
CONSTRAINT valid_price_CK CHECK (price > 0),
CONSTRAINT valid_quantity_CK CHECK (quantity > 0)
);
CREATE TABLE product_item_number (
sku TEXT NOT NULL,
manufacturer_number INTEGER NOT NULL,
item_number TEXT NOT NULL,
--
CONSTRAINT product_item_number_PK PRIMARY KEY (sku, manufacturer_number),
CONSTRAINT product_item_number_AK UNIQUE (manufacturer_number, item_number), -- In this context, ‘manufacturer_number’ and ‘item_number’ compose an AK.
CONSTRAINT product_item_number_TO_product_FK FOREIGN KEY (sku)
REFERENCES product (sku),
CONSTRAINT product_item_number_TO_manufacturer_FK FOREIGN KEY (manufacturer_number)
REFERENCES manufacturer (manufacturer_number)
);
Tested on PostgreSQL 11 in this db<>fiddle.
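With this structure, a product can be looked up by the combination of manufacturer and item number. The query below is only a sketch that assumes the three tables just defined; the manufacturer number 1 and the item number 'D57-RED' are made-up values.

```sql
-- Find the product that a given manufacturer identifies by a given item number.
-- The (manufacturer_number, item_number) AK guarantees at most one match.
SELECT p.sku,
       p.name,
       m.name AS manufacturer_name,
       pin.item_number
  FROM product_item_number AS pin
  JOIN product             AS p ON p.sku = pin.sku
  JOIN manufacturer        AS m ON m.manufacturer_number = pin.manufacturer_number
 WHERE pin.manufacturer_number = 1
   AND pin.item_number = 'D57-RED';
```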
Physical-level considerations
We have not discussed the exact type and size of the product.sku column but, if it is “big” in terms of bytes, then it may end up undermining the data retrieval speed of your system (due to aspects of the physical level of abstraction associated with, e.g., the sizes of the indexes and disk space usage).
Therefore, you might like to assess the incorporation of an INTEGER column, which can offer a faster response than a possibly “heavy” TEXT one (but it all depends on the precise features of the compared columns). It may well be a product_number that, as expected, would represent a numeric value in a sequence standing for the set of recorded products.
An expository arrangement that incorporates this new element is the one that follows:
CREATE TABLE product (
product_number INTEGER NOT NULL,
sku TEXT NOT NULL,
name TEXT NOT NULL,
price NUMERIC NOT NULL,
quantity NUMERIC NOT NULL,
--
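CONSTRAINT product_PK PRIMARY KEY (product_number), -- Must be a key so that product_item_number can reference it.
CONSTRAINT product_AK1 UNIQUE (sku), -- AK.
CONSTRAINT product_AK2 UNIQUE (name), -- AK.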
CONSTRAINT valid_price_CK CHECK (price > 0),
CONSTRAINT valid_quantity_CK CHECK (quantity > 0)
);
CREATE TABLE product_item_number
(
product_number INTEGER NOT NULL,
item_number TEXT NOT NULL,
--
CONSTRAINT product_item_number_PK PRIMARY KEY (product_number),
CONSTRAINT product_item_number_AK UNIQUE (item_number), -- AK.
CONSTRAINT product_item_number_TO_product_FK FOREIGN KEY (product_number)
REFERENCES product (product_number)
);
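One possible way to obtain such sequence values, sketched here on the assumption that PostgreSQL 10 or later is in use, is an identity column. The table name product_sketch and the sample row are hypothetical and serve only to keep the example self-contained.

```sql
-- Hypothetical variant in which the DBMS generates product_number.
CREATE TABLE product_sketch (
    product_number INTEGER GENERATED ALWAYS AS IDENTITY,
    sku            TEXT    NOT NULL,
    name           TEXT    NOT NULL,
    --
    CONSTRAINT product_sketch_PK PRIMARY KEY (product_number),
    CONSTRAINT product_sketch_AK UNIQUE (sku)  -- sku remains an AK.
);

-- product_number is omitted from the column list;
-- RETURNING hands back the generated value.
INSERT INTO product_sketch (sku, name)
VALUES ('SKU-0001', 'Red widget')
RETURNING product_number;
```

Keeping the UNIQUE constraint on sku preserves the business rule that a Sku identifies a product, while the narrower INTEGER key is the one that the FK in product_item_number would reference.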
I highly recommend carrying out substantial testing sessions with a considerable data load in order to decide which keys are more convenient —physically speaking—, always taking into account the overall database features (the number of columns of all the tables, the types and sizes of the columns, the constraints and the underlying indexes, etc.).
Similar scenario
Your business environment of interest bears a certain resemblance to the scenario dealt with in these posts, so you might find some of the points discussed there relevant.
If your attribute item_number is unique, you can leave it in your original table even if it can have null values. In fact, the PostgreSQL manual says:
For the purpose of a unique constraint, null values are not considered equal.
So this could be the right solution:
CREATE TABLE product (
sku text PRIMARY KEY,
name text UNIQUE NOT NULL,
price numeric NOT NULL CHECK (price > 0),
quantity numeric NOT NULL CHECK (quantity > 0),
item_number text UNIQUE
);
which is more efficient than solution B, and simpler to program than solution C.
Note that this solution is normalized, so you have neither redundancy nor insertion/deletion anomalies.
Addition
For a relation to be formally in Boyce-Codd Normal Form (which is stricter than the Third Normal Form), the determinant of each non-trivial dependency must be a (super)key. But first note that normalization theory usually does not treat null values. See for instance the book by Elmasri and Navathe, “Fundamentals of Database Systems”, 6th Edition, 2010:
There is no fully satisfactory relational design theory as yet that includes NULL values
In this case we have at least the dependency:
sku → name, price, quantity, item_number
and in fact sku is a key for the relation.
Then, supposing that there are no null values, if you want item_number to be unique, there exists another dependency:
item_number → sku, name, price, quantity
and so item_number is another key.
In this relation there are no other functional dependencies apart from those derived from these two, and neither of these dependencies violates BCNF (both determinants are keys). So the relation is in Boyce-Codd Normal Form.
On the other hand, if you consider that item_number can have null values, you could assume that the second dependency does not hold, so that the relation is again in BCNF.