Are there disadvantages to using a generic varchar(255) for all text-based fields?
In storage, VARCHAR(255) is smart enough to store only the length you need on a given row, unlike CHAR(255), which would always store 255 characters.
But since you tagged this question with MySQL, I'll mention a MySQL-specific tip: as rows are copied from the storage engine layer to the SQL layer, VARCHAR fields are converted to CHAR to gain the advantage of working with fixed-width rows. So the strings in memory become padded out to the maximum length of your declared VARCHAR column.
When your query implicitly generates a temporary table, for instance while sorting or processing a GROUP BY, this can use a lot of memory. If you use a lot of VARCHAR(255) fields for data that doesn't need to be that long, this can make the temporary table very large.
You may also like to know that this "padding out" behavior means that a string declared with the utf8 character set pads out to three bytes per character, even for strings you store with single-byte content (e.g. ASCII or latin1 characters). Likewise, the utf8mb4 character set causes the string to pad out to four bytes per character in memory.
So a VARCHAR(255) in utf8 storing a short string like "No opinion" takes 11 bytes on disk (ten single-byte characters, plus one byte for length) but it takes 765 bytes (255 × 3) in memory, and thus in temp tables or sorted results.
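To make the arithmetic concrete, here is a minimal Python sketch of the two figures above. The function names and the `MAX_BYTES_PER_CHAR` table are made up for illustration, and the one-byte length prefix simply follows the figure quoted in this answer; the exact on-disk overhead depends on the storage engine and row format.

```python
# Bytes reserved per character when a value is padded to its declared
# maximum, using the per-charset figures quoted in the answer.
MAX_BYTES_PER_CHAR = {"latin1": 1, "utf8": 3, "utf8mb4": 4}


def on_disk_bytes(value: str, charset: str = "utf8", length_prefix: int = 1) -> int:
    """Bytes for the encoded string plus an assumed length prefix."""
    codec = "utf-8" if charset.startswith("utf8") else charset
    return len(value.encode(codec)) + length_prefix


def padded_bytes(declared_length: int, charset: str = "utf8") -> int:
    """Bytes used once the value is padded out to the declared length."""
    return declared_length * MAX_BYTES_PER_CHAR[charset]


print(on_disk_bytes("No opinion"))    # 11
print(padded_bytes(255, "utf8"))      # 765
print(padded_bytes(255, "utf8mb4"))   # 1020
print(padded_bytes(20, "utf8"))       # 60
```

Declaring the same column as VARCHAR(20) would not change the on-disk size of "No opinion", but under this model the padded size drops from 765 bytes to 60.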
I have helped MySQL users who unknowingly created 1.5GB temp tables frequently and filled up their disk space. They had lots of VARCHAR(255) columns that in practice stored very short strings.
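A rough way to see how this adds up across a whole temporary table is sketched below in Python. The column lists and the row count are hypothetical, and the estimate counts only the padded string columns; it ignores per-row overhead and any non-string columns, so treat it as a lower bound rather than a measurement.

```python
# Bytes reserved per character once a VARCHAR is padded to full width.
MAX_BYTES_PER_CHAR = {"latin1": 1, "utf8": 3, "utf8mb4": 4}


def padded_row_bytes(varchar_lengths, charset="utf8"):
    """Padded width of one row, counting only its VARCHAR columns."""
    return sum(n * MAX_BYTES_PER_CHAR[charset] for n in varchar_lengths)


def temp_table_bytes(row_count, varchar_lengths, charset="utf8"):
    """Estimated temp table size for row_count padded rows."""
    return row_count * padded_row_bytes(varchar_lengths, charset)


# Hypothetical result set: one million rows, four text columns.
rows = 1_000_000
generic = temp_table_bytes(rows, [255, 255, 255, 255])
sized = temp_table_bytes(rows, [2, 10, 20, 50])

print(generic // 10**6, "MB")  # 3060 MB
print(sized // 10**6, "MB")    # 246 MB
```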
It's best to define the column based on the type of data that you intend to store. Doing so helps enforce application-related constraints, as other folks have mentioned. It also has the physical benefit of avoiding the memory waste I described above.
It's hard to know what the longest postal address is, of course, which is why many people choose a long VARCHAR that is certainly longer than any address. And 255 is customary because it is the maximum length of a VARCHAR for which the length can be encoded with one byte. It was also the maximum VARCHAR length in MySQL older than 5.0.
In addition to the size and performance considerations of setting the size of a varchar (and possibly more important, as storage and processing get cheaper every second), the disadvantage of using varchar(255) "just because" is reduced data integrity.
Defining maximum limits for strings is a good thing to do to prevent longer than expected strings from entering the RDBMS and causing buffer overruns or exceptions/errors later when retrieving and parsing values from the database that are longer (more bytes) than expected.
For example, if you have a field that accepts two-character strings for country abbreviations then you have no conceivable reason to expect your users (in this context, programmers) to input full country names. Since you don't want them to enter "Antigua and Barbuda" (AG) or "Heard Island and McDonald Islands" (HM), you don't allow it at the database layer. Also, it is likely some programmers have not yet RTFMed the design documentation (which surely exists) to know not to do this.
Set the field to accept two characters and let the RDBMS deal with it (either gracefully by truncating or ungracefully by rejecting their SQL with an error).
Examples of real data that has no reason to exceed a certain length:
- Canadian Postal Codes are of the format A1A1A1 and are always 6 characters in length, even for Santa Claus (6 characters excludes the space that can be specified for legibility).
- email addresses - up to 64 bytes before the @, up to 255 bytes after. Never more, lest you break the Internet.
- North American Phone Numbers are never more than 10 digits (excluding the country code).
- Computers running (recent versions of) Windows cannot have computer names longer than 63 bytes, though more than 15 is not recommended and will break your Windows NT server farm.
- State abbreviations are 2 characters (like the country codes in the example above).
- UPS tracking numbers are either 18-, 12-, 11-, or 9-characters long. The 18-character numbers start with "1Z" and the 11-character numbers start with "T" which makes you wonder how they deliver all those packages if they don't know the difference between letters and numbers.
And so on...
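A few of the limits listed above can be written down as simple checks. The Python sketch below is only an illustration: the pattern names and helper functions are invented here, and the rules encode nothing beyond what the list states (for example, it does not verify that a postal code is actually in use).

```python
import re

# Canadian postal code stored without the optional space: A1A1A1.
CANADIAN_POSTAL_CODE = re.compile(r"[A-Z][0-9][A-Z][0-9][A-Z][0-9]")
# Two-letter state or country abbreviation.
TWO_LETTER_CODE = re.compile(r"[A-Z]{2}")
# North American phone number without the country code: 10 digits.
NORTH_AMERICAN_PHONE = re.compile(r"[0-9]{10}")

# Lengths taken from the list above.
UPS_TRACKING_LENGTHS = {18, 12, 11, 9}


def normalize_postal_code(raw: str) -> str:
    """Drop the legibility space and upper-case the letters."""
    return raw.replace(" ", "").upper()


def is_ups_tracking_number(value: str) -> bool:
    """Check length, plus the prefixes named for 18- and 11-character numbers."""
    if len(value) not in UPS_TRACKING_LENGTHS:
        return False
    if len(value) == 18:
        return value.startswith("1Z")
    if len(value) == 11:
        return value.startswith("T")
    return True


santa = normalize_postal_code("H0H 0H0")
print(santa, bool(CANADIAN_POSTAL_CODE.fullmatch(santa)))            # H0H0H0 True
print(bool(TWO_LETTER_CODE.fullmatch("Antigua and Barbuda")))        # False
print(is_ups_tracking_number("1Z" + "9" * 16))                       # True
```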
Take the time to think about your data and its limits. If you're an architect, developer, or programmer, it's your job, after all.
By using a varchar(n) instead of varchar(255) you eliminate the problem where users (end-users, programmers, other programs) enter unexpectedly long data that will come back to haunt your code later.
And I didn't say you shouldn't also implement this restriction in the business logic code used by your application.
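As a sketch of what the matching check in application code might look like, the Python below mirrors a set of column limits and rejects over-long values before they reach the database. The field names, the limits, and the `FieldTooLongError` class are all hypothetical; in a real application the limits would be kept in sync with the schema.

```python
# Hypothetical limits, mirroring the VARCHAR(n) sizes declared in the schema.
MAX_LENGTHS = {"country_code": 2, "postal_code": 6, "phone": 10}


class FieldTooLongError(ValueError):
    """Raised when a value would not fit its column."""


def check_lengths(record: dict) -> None:
    """Raise FieldTooLongError for the first field that exceeds its limit."""
    for field, limit in MAX_LENGTHS.items():
        value = record.get(field, "")
        if len(value) > limit:
            raise FieldTooLongError(
                f"{field}: {len(value)} characters, limit is {limit}"
            )


check_lengths({"country_code": "AG", "postal_code": "H0H0H0"})  # passes silently

try:
    check_lengths({"country_code": "Antigua and Barbuda"})
except FieldTooLongError as exc:
    print(exc)  # country_code: 19 characters, limit is 2
```

Rejecting the value here gives the caller a clear error, while the column definition remains the last line of defense.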
I'm with you. Fussy attention to detail is a pain in the neck and has limited value.
Once upon a time, disk was a precious commodity and we used to sweat bullets to optimize it. The price of storage has fallen by a factor of 1,000, making the time spent on squeezing every byte less valuable.
If you use only CHAR fields, you can get fixed-length rows. This can save some disk real estate if you picked accurate sizes for fields. You might get more densely packed data (fewer I/Os for table scans) and faster updates (easier to locate open spaces in a block for updates and inserts).
However, if you overestimate your sizes, or your actual data sizes are variable, you'll wind up wasting space with CHAR fields. The data will wind up less densely packed (leading to more I/Os for big retrievals).
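The trade-off can be sketched with a little arithmetic. The Python below assumes a single-byte character set and a one-byte length prefix for variable-length values; both are simplifying assumptions for illustration, and the sample data and function names are made up.

```python
def char_bytes(values, n):
    """Fixed-width storage: every value occupies n bytes (single-byte charset)."""
    return len(values) * n


def varchar_bytes(values, length_prefix=1):
    """Variable-width storage: actual length plus an assumed length prefix."""
    return sum(len(v) + length_prefix for v in values)


# Accurately sized, uniform data: fixed width wins.
codes = ["CA", "US", "AG", "HM"]
print(char_bytes(codes, 2), varchar_bytes(codes))    # 8 12

# Variable data with an over-estimated size: fixed width wastes space.
names = ["Canada", "Chad", "Antigua and Barbuda",
         "Heard Island and McDonald Islands"]
print(char_bytes(names, 40), varchar_bytes(names))   # 160 66
```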
Generally, the performance benefits from attempting to put a size on variable fields are minor. You can easily benchmark by using VARCHAR(255) compared with CHAR(x) to see if you can measure the difference.
However, sometimes, I need to provide a "small", "medium", "large" hint. So I use 16, 64, and 255 for the sizes.