Drupal - What's a good balance between reusing fields versus creating new ones in the context of fields scalability?
The amount of data in a field usually isn't a problem. If you're worried about that, look into alternative field storage plugins or write your own. For example MongoDB, which can deal with pretty much anything you put into it. It is for example used on http://examiner.com.
A real problem however is the number of fields you have. Because currently in Drupal 7, the complete field configuration of all fields, no matter if they're loaded or not, is fetched from the cache on every single request.
I've seen sites with 250+ fields, where loading and unserializing the field configuration takes 13MB+ memory.
Edit: The field info cache has been improved (see http://drupal.org/node/1040790 for details) with Drupal 7.22, only the fields of bundles that are displayed on a certain page are loaded from the cache and they're separate cache entries. That only works if there are no wrong API calls that request instances across multiple bundles.
I totaly agree with berdir. Here are my experiences with a project with millions of rows and 30-40 fields on some node types.
- The number of rows in a field table isn't a big problem for the read performance, as all fields are fetched by primary key.
- The number of fields per node type can quickly grow into big performance problems when writing new nodes. Having 30+ fields for one node type results into 60+ INSERT statements when you create a new node. This takes seconds to complete. If you're users creating a lot of data this will hit your performance. Bulk inserts of 1000 nodes will take almost an hour. If you have to update 100'000 nodes, this is a big problem.
- If you think the number of fields problem is going to hit you, you should seriously think about writing your own field storage or just don't use fields. (You can still make your node working with views with some extra effort.)
- A word about MongoDB. It's a very interesting project and I hope it's making it into the olymp of the big DBs. Unfortunately compared to the maturity of MySql or PgSql it's a baby. Be prepared to deal with a very young product.
If you are really worried about what will happen, then I think a simulation is in order.
Get an account at Rackspace Cloud, Amazon, Linode, or anywhere else you can easily spin up a VPS. Make two identical instances. Install Drupal on each. Create some dummy content types, and set up the fields one way in one system, and other other way in the other. Use the devel module to create a boatload of content. Adjust performance settings to make sure Drupal is caching as needed. Run mysqltuner and adjust MySQL on each per recomendations. Double check PHP and APC settings so that you aren't hitting swap and that you aren't churning the APC cache.
Once you get a good baseline configuration for each, start simulating traffic (both normal visitors and admin updates) with wget and drush, and then profile.
Simulations are never perfect, but they can get you going in the right direction.