What is atomicity in dbms
In the context of 1NF, atomicity is not about atomic transactions; it is about how columns are defined and what they contain.
"Atomic" means "cannot be divided or split in smaller parts". Applied to 1NF this means that a column should not contain more than one value. It should not compose or combine values that have a meaning of their own.
This typically concerns 2 very common mistakes made by database designers:
1. multiple values in one column (list columns)
Columns that contain a list of values, typically space or comma separated, like this blog post table:
id  title      date_posted  content  tags
1   new idea   2014-05-23   ...      tag1,tag2,tag3
2   why this?  2014-05-24   ...      tag2,tag5
3   towel day  2014-05-26   ...      tag42
or this contacts table:
id  room  phones
4   432   111-111-111 222-222-222
5   456   999-999-999
6   512   888-888-8888 333-3333-3333
This type of denormalization is rare, as most database designers see that it cannot be a good thing. But you do find tables like this. They usually come from modifications to the database, where it may seem simpler to widen a column and stuff multiple values into it than to add a normalized related table (which often breaks existing applications).
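As a rough illustration of the "normalized related table" alternative, here is a minimal sketch using Python's built-in sqlite3 module. The table name post_tags and its columns are my own choice for the example, not something from a real schema, and the content column is left out for brevity. The tag lists from the blog post table above are split into one row per (post, tag) pair.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE posts (id INTEGER PRIMARY KEY, title TEXT, date_posted TEXT);
    -- Hypothetical related table: one row per tag attached to a post.
    CREATE TABLE post_tags (
        post_id INTEGER NOT NULL REFERENCES posts(id),
        tag     TEXT    NOT NULL,
        PRIMARY KEY (post_id, tag)
    );
    INSERT INTO posts VALUES
        (1, 'new idea',  '2014-05-23'),
        (2, 'why this?', '2014-05-24'),
        (3, 'towel day', '2014-05-26');
""")

# The original comma-separated lists, split into one row per (post, tag).
list_columns = {1: "tag1,tag2,tag3", 2: "tag2,tag5", 3: "tag42"}
rows = [
    (post_id, tag)
    for post_id, tags in list_columns.items()
    for tag in tags.split(",")
]
conn.executemany("INSERT INTO post_tags VALUES (?, ?)", rows)

# Finding posts by tag is now an equality test, not a substring search.
titles_with_tag2 = [
    title
    for (title,) in conn.execute(
        """
        SELECT p.title
        FROM posts AS p
        JOIN post_tags AS t ON t.post_id = p.id
        WHERE t.tag = ?
        ORDER BY p.id
        """,
        ("tag2",),
    )
]
print(titles_with_tag2)
```

With the list column, the same lookup would need something like a LIKE pattern on the tags string, which can also match unrelated tags that merely contain the same characters.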
2. complex multi-part columns
In this case one column contains different bits of information and could maybe be designed as a set of separate columns.
Typical examples are fullname and address columns:
id  fullname        address
1   Mark Tomers     56 Tomato Road
2   Fred Askalong   3277 Hadley Drive
3   May Anne Brice  225 Century Avenue - apartment 43/a
These types of denormalizations are very common, as it is quite difficult to draw the line between what is atomic and what is not. Depending on the application, a multi-part column could very well be the best solution in some cases. It is less structured, but simpler.
Structuring an address in many atomic columns may mean having more complex code to handle results for output. Another complexity comes from the structure not being adequate to fit all types of addresses. Using one single VARCHAR column does not pose this problem, but may pose others, typically with searching and sorting.
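To make the sorting problem concrete, here is a small sketch (again Python's sqlite3) that stores the same three people both ways. The table names people_single and people_split are made up for the example, and so is the split of each full name into given names and surname; in particular, treating "May Anne" as given names is my assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical tables: the same people as one column and as two.
    CREATE TABLE people_single (id INTEGER PRIMARY KEY, fullname TEXT);
    CREATE TABLE people_split  (id INTEGER PRIMARY KEY, given_names TEXT, surname TEXT);
    INSERT INTO people_single VALUES
        (1, 'Mark Tomers'), (2, 'Fred Askalong'), (3, 'May Anne Brice');
    INSERT INTO people_split VALUES
        (1, 'Mark', 'Tomers'), (2, 'Fred', 'Askalong'), (3, 'May Anne', 'Brice');
""")

# With a single column the database can only sort on the whole string,
# which effectively means sorting by first name.
by_fullname = [
    row[0]
    for row in conn.execute("SELECT fullname FROM people_single ORDER BY fullname")
]

# With separate columns, sorting by surname is a plain ORDER BY.
by_surname = [
    f"{given} {surname}"
    for given, surname in conn.execute(
        "SELECT given_names, surname FROM people_split ORDER BY surname, given_names"
    )
]

print(by_fullname)
print(by_surname)
```

The two orderings differ, and getting a surname ordering out of the single column would require parsing the string, which is exactly the kind of client-side work the split design avoids.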
An extreme case of multi-part columns is dates and times. Most RDBMSs provide date and time data types, along with functions for date and time arithmetic and for extracting the various parts (month, hour, etc.). Few people would consider it convenient to have separate year, month and day columns in a relational database. But I've seen it, and with good reason: the use case was birthdates for a justice department database. They had to handle many immigrants with few or no documents. Sometimes you just knew a person was born in a certain year, but you would not know the day or month of birth. You can't handle that type of information with a single date column.
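A sketch of what such a partial-birthdate design might look like, using Python's sqlite3. This is not the schema of the system described above; the table name persons, the column names and the constraints are all assumptions made for illustration. The year is required, while month and day may be unknown (NULL).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical schema: year is always known, month and day may be NULL.
    CREATE TABLE persons (
        id          INTEGER PRIMARY KEY,
        birth_year  INTEGER NOT NULL,
        birth_month INTEGER CHECK (birth_month BETWEEN 1 AND 12),
        birth_day   INTEGER CHECK (birth_day BETWEEN 1 AND 31),
        -- a known day without a known month is rejected
        CHECK (birth_day IS NULL OR birth_month IS NOT NULL)
    );
    INSERT INTO persons VALUES
        (1, 1980, 3,    14),    -- fully known date
        (2, 1975, NULL, NULL),  -- only the year is known
        (3, 1975, 11,   NULL);  -- year and month are known
""")

# Queries on the known parts still work for partially known dates.
born_1975 = [
    row[0]
    for row in conn.execute(
        "SELECT id FROM persons WHERE birth_year = 1975 ORDER BY id"
    )
]

# Rows whose birthdate could be converted to a real date value.
fully_known = [
    row[0]
    for row in conn.execute(
        """
        SELECT id FROM persons
        WHERE birth_month IS NOT NULL AND birth_day IS NOT NULL
        ORDER BY id
        """
    )
]

print(born_1975, fully_known)
```

The price is that the database no longer validates the combination as a calendar date (nothing here stops 31 February), so that check has to live somewhere else.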
"Every column should be atomic."
Chris Date says, "Please note very carefully that it is not just simple things like the integer 3 that are legitimate values. On the contrary, values can be arbitrarily complex; for example, a value might be a geometric point, or a polygon, or an X ray, or an XML document, or a fingerprint, or an array, or a stack, or a list, or a relation (and so on)."[1]
He also says, "A relvar is in 1NF if and only if, in every legal value of that relvar, every tuple contains exactly one value for each attribute."[2]
He generally discourages the use of the word atomic, because it has confusing connotations. Single value is probably a better term to use.
For example, a date like '2014-01-01' is a single value. It's not indivisible; on the contrary, it quite clearly is divisible. But the dbms does one of two things with single values that have parts. The dbms either returns those values as a whole, or the dbms provides functions to manipulate the parts. (Clients don't have to write code to manipulate the parts.)[3]
In the case of dates, SQL can
- return dates as a whole (SELECT CURRENT_DATE),
- return one or more parts of a date (EXTRACT(YEAR FROM CURRENT_DATE)),
- add and subtract intervals (CURRENT_DATE + INTERVAL '1' DAY),
- subtract one date from another (CURRENT_DATE - DATE '2014-01-01'),
and so on. In this (narrow) respect, SQL is quite relational.
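For something you can run without a database server, here is a rough equivalent using Python's sqlite3. Note that SQLite spells these operations differently from the standard SQL above: it uses the date(), strftime() and julianday() functions instead. The fixed date '2014-05-23' replaces CURRENT_DATE only so that the results are reproducible.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# One query exercising each kind of operation on a single date value.
whole, year, next_day, days_between = conn.execute(
    """
    SELECT date('2014-05-23'),                                -- the date as a whole
           CAST(strftime('%Y', '2014-05-23') AS INTEGER),     -- one part of it
           date('2014-05-23', '+1 day'),                      -- add an interval
           julianday('2014-05-23') - julianday('2014-01-01')  -- date minus date
    """
).fetchone()

print(whole, year, next_day, days_between)
```

In every case the client hands the DBMS a single value and gets a single value back; the parts are manipulated by the DBMS, not by client code.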
[1] An Introduction to Database Systems, 8th ed, p 113. Emphasis in the original.
[2] Ibid, p 358.
[3] In the case of a "user-defined" type, the "user" is presumed to be a database programmer, not a client of the database.
Re "atomic"
In Codd's original 1969 and 1970 papers he defined relations as having a value for every attribute in a row. The value could be anything, including a relation. This used no notion of "atomic". He explained that "atomic" meant not relation-valued (ie not table-valued):
So far, we have discussed examples of relations which are defined on simple domains--domains whose elements are atomic (nondecomposable) values. Nonatomic values can be discussed within the relational framework. Thus, some domains may have relations as elements.
He used "simple", "atomic" and "nondecomposable" as informal expository notions. He understood that a relation has rows of which each column has an associated name and value; attributes are by definition "single-valued"; the value is of any type. The only structural property that matters relationally is being a relation. It is also just a value, but you can query it relationally. Then he used "nonsimple" etc meaning relation-valued.
By the time of Codd's 1990 book The Relational Model for Database Management: Version 2:
From a database perspective, data can be classified into two types: atomic and compound. Atomic data cannot be decomposed into smaller pieces by the DBMS (excluding certain special functions). Compound data, consisting of structured combinations of atomic data, can be decomposed by the DBMS.
In the relational model there is only one type of compound data: the relation. The values in the domains on which each relation is defined are required to be atomic with respect to the DBMS. A relational database is a collection of relations of assorted degrees. All of the query and manipulative operators are upon relations, and all of them generate relations as results. Why focus on just one type of compound data? The main reason is that any additional types of compound data add complexity without adding power.
"In the relational model there is only one type of compound data: the relation."
Sadly, "atomic = non-relation" is not what you're going to hear. (Unfortunately Codd was not the clearest writer, and his expository remarks get confused with his bottom line.) Virtually all presentations of the relational model get no further than what was for Codd merely a stepping stone. They promote an unhelpful, confused, fuzzy notion canonicalized/canonized as "atomic" determining "normalized". Sometimes they wrongly use it to define relation. Codd, by contrast, used the everyday "nonatomic" to introduce the relational "nonatomic", which he defined as relation-valued, and defined "normalized" as free of relation-valued domains.
(Nor is "not a repeating group" helpful as a definition of "atomic": it defines the term as the absence of something that is not even a relational notion. And sure enough in 1970 Codd says "terms attribute and repeating group in present database terminology are roughly analogous to simple domain and nonsimple domain, respectively".)
Eg: This misinterpretation was promoted for a long time from early on by Chris Date, honourable early relational explicator and proselytizer, primarily in his seminal still-current book An Introduction to Database Systems, which now (2004, 8th edition) thankfully presents the helpful relationally-oriented extended notion of distinguishing relation, row and "scalar" (non-relation non-row) domains:
This definition merely states that all [relation variables] are in 1NF
Eg: Maier's classic The Theory of Relational Databases (1983):
The definition of atomic is hazy; a value that is atomic in one application could be non-atomic in another. For a general guideline, a value is non-atomic if the application deals with only a part of the value.
Eg: The current Wikipedia article on First NF (Normal Form) section Atomicity actually quotes from the introductory parts above. And then ignores the precise meaning. (Then it says something unintelligible about when the nonatomic turtles should stop.):
Codd states that the "values in the domains on which each relation is defined are required to be atomic with respect to the DBMS." Codd defines an atomic value as one that "cannot be decomposed into smaller pieces by the DBMS (excluding certain special functions)" meaning a field should not be divided into parts with more than one kind of data in it such that what one part means to the DBMS depends on another part of the same field.
Re "normalized" and "1NF"
When Codd used "normalize" in 1970, he meant eliminate relation-valued ("non-simple") domains from a relational database:
For this reason (and others to be cited below) the possibility of eliminating nonsimple domains appears worth investigating. There is, in fact, a very simple elimination procedure, which we shall call normalization.
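A minimal sketch of that elimination step in plain Python, with relations modelled as lists of dicts. The attribute names are loosely based on the employee/jobhistory example in Codd's 1970 paper; the data values and the helper name eliminate_nonsimple are invented for illustration. The idea is that the relation-valued attribute is pulled out into its own relation, and each of its tuples is extended with the key of the parent tuple it came from.

```python
# Unnormalized: each employee tuple has a relation-valued attribute, jobhistory.
employees = [
    {"man_no": 1, "name": "Smith", "jobhistory": [
        {"jobdate": "1968-01-01", "title": "clerk"},
        {"jobdate": "1969-06-01", "title": "analyst"},
    ]},
    {"man_no": 2, "name": "Jones", "jobhistory": [
        {"jobdate": "1969-03-15", "title": "engineer"},
    ]},
]


def eliminate_nonsimple(parent_rows, key, nested):
    """Split rows into a parent relation without `nested` and a child
    relation whose tuples carry the parent's key (hypothetical helper)."""
    parents, children = [], []
    for row in parent_rows:
        # Parent tuple: everything except the relation-valued attribute.
        parents.append({k: v for k, v in row.items() if k != nested})
        # Child tuples: the nested tuples, each prefixed with the parent key.
        for child in row[nested]:
            children.append({key: row[key], **child})
    return parents, children


employee, jobhistory = eliminate_nonsimple(employees, "man_no", "jobhistory")
print(employee)
print(jobhistory)
```

After the split, neither relation has a relation-valued column, which is all that "normalized" meant in this original sense.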
Later the notion of "higher NFs" (involving FDs (functional dependencies) & then JDs (join dependencies)) arose and "normalize" took on a different meaning. Since Codd's original normalization paper, normalization theory has always given results relevant to all relations, not just those in Codd's 1NF. So one can "normalize" in the original sense of going from just relations to a "normalized" "1NF" without relation-valued columns. And one can "normalize" in the normalization-theory sense of going from a just-relations "1NF" to higher NFs while ignoring whether domains are relations. And "normalization" is commonly also used for the "hazy" notion of eliminating values with "parts". And "normalization" is also wrongly used for designing a relational version of a non-relational database (whether just relations and/or some other sense of "1NF").
The relational spirit is to eschew multiple columns with the same meaning or domains with interesting parts in favour of another base table. But we must always come to an informal ergonomic decision about when to stop representing parts and just treat a column as "atomic" (non-relation-valued) vs "nonatomic" (relation-valued).