Understanding MongoDB BSON Document size limit
Posting a clarifying answer here for those who get directed here by Google.
The document size includes everything in the document, including subdocuments, nested objects, etc.
So a document of:
{
    "_id": {},
    "na": [1, 2, 3],
    "naa": [
        { "w": 1, "v": 2, "b": [1, 2, 3] },
        { "w": 5, "b": 2, "h": [{ "d": 5, "g": 7 }, {}] }
    ]
}
has a maximum size of 16 MB.
Subdocuments and nested objects are all counted towards the size of the document.
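If you want to check what a document will weigh before inserting it, here is a minimal sketch using the bson package that ships with PyMongo (assuming a recent PyMongo; the document is just the example above):

import bson  # the bson package bundled with PyMongo

doc = {
    "_id": {},
    "na": [1, 2, 3],
    "naa": [
        {"w": 1, "v": 2, "b": [1, 2, 3]},
        {"w": 5, "b": 2, "h": [{"d": 5, "g": 7}, {}]},
    ],
}

# bson.encode() serializes the whole document, subdocuments and all,
# so len() of the result is the size the 16 MB limit is checked against.
size_bytes = len(bson.encode(doc))
print(size_bytes, "of", 16 * 1024 * 1024, "bytes")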
First off, this actually is being raised in the next version to 8 MB or 16 MB ... but I think to put this into perspective, Eliot from 10gen (the company that developed MongoDB) puts it best:

EDIT: The size has been officially 'raised' to 16 MB.
So, on your blog example, 4 MB is actually a whole lot. For example, the full uncompressed text of "War of the Worlds" is only 364k (html): http://www.gutenberg.org/etext/36
If your blog post is that long with that many comments, I for one am not going to read it :)
For trackbacks, if you dedicated 1MB to them, you could easily have more than 10k (probably closer to 20k)
So except for truly bizarre situations, it'll work great. And in the exceptional case of spam, I really don't think you'd want a 20 MB object anyway. I think capping trackbacks at 15k or so makes a lot of sense no matter what for performance. Or at least special-casing it if it ever happens.
-Eliot
I think you'd be pretty hard-pressed to reach the limit ... and over time, if you upgrade ... you'll have to worry less and less.
The main point of the limit is so you don't use up all the RAM on your server (as you need to load all the MBs of the document into RAM when you query it).
So the limit is some % of normal usable RAM on a common system ... which will keep growing year on year.
Note on Storing Files in MongoDB
If you need to store documents (or files) larger than 16 MB, you can use the GridFS API, which will automatically break up the data into segments and stream them back to you (thus avoiding the issue with size limits/RAM).
Instead of storing a file in a single document, GridFS divides the file into parts, or chunks, and stores each chunk as a separate document.
GridFS uses two collections to store files. One collection stores the file chunks, and the other stores file metadata.
You can use this method to store images, files, videos, etc. in the database, much as you might in a SQL database. I have even used this to store multi-gigabyte video files.
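As a rough sketch of what that looks like with PyMongo's gridfs module (the database name and file names here are placeholders):

import gridfs
from pymongo import MongoClient

client = MongoClient()      # assumes a mongod on localhost
db = client["media"]        # placeholder database name
fs = gridfs.GridFS(db)      # backed by the fs.files and fs.chunks collections

# put() splits the stream into chunks (255 kB each by default), stores
# each chunk as its own document in fs.chunks, and writes one metadata
# document to fs.files.
with open("lecture.mp4", "rb") as f:    # placeholder file
    file_id = fs.put(f, filename="lecture.mp4")

# get() returns a file-like object that streams the chunks back in order.
data = fs.get(file_id).read()

No single document here ever comes near the 16 MB cap, which is the whole point.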
Many in the community would prefer no limit, with warnings about performance; see this comment for a well-reasoned argument: https://jira.mongodb.org/browse/SERVER-431?focusedCommentId=22283&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-22283
My take: the lead developers are stubborn about this issue because they decided it was an important "feature" early on. They're not going to change it anytime soon because their feelings are hurt that anyone questioned it. Another example of personality and politics detracting from a product in open-source communities, but this is not really a crippling issue.
I have not yet seen a problem with the limit that did not involve large files stored within the document itself. There are already a variety of databases which are very efficient at storing/retrieving large files; they are called operating systems. The database exists as a layer over the operating system. If you are using a NoSQL solution for performance reasons, why would you want to add additional processing overhead to the access of your data by putting the DB layer between your application and your data?
JSON is a text format, so if you are accessing your data through JSON, this is especially true for binary files, which have to be encoded as uuencode, hexadecimal, or Base64. The conversion path might look like
binary file <> JSON (encoded) <> BSON (encoded)
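To put a number on that overhead, here is a quick sketch (the 1 MB payload is made up) showing the roughly one-third inflation Base64 adds before the data even reaches the BSON layer:

import base64
import os

raw = os.urandom(1_000_000)      # 1 MB of stand-in binary data
encoded = base64.b64encode(raw)

# Base64 emits 4 output bytes for every 3 input bytes, so the payload
# grows by about 33% on top of any JSON/BSON framing.
print(len(raw), "->", len(encoded))   # 1000000 -> 1333336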
It would be more efficient to put the path (URL) to the data file in your document and keep the data itself in binary.
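A minimal sketch of that pattern (field names and paths are made up): the document carries only a reference, and the bytes stay on disk or behind a URL.

from pymongo import MongoClient

client = MongoClient()
videos = client["app"]["videos"]     # placeholder database/collection

# Store metadata plus the location of the file; the binary itself
# never passes through JSON or BSON encoding at all.
videos.insert_one({
    "title": "demo",
    "path": "/var/media/demo.mp4",   # could equally be an S3/HTTP URL
    "size_bytes": 104857600,
})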
If you really want to keep these files of unknown length in your DB, then you would probably be better off putting them in GridFS rather than risk killing your concurrency when the large files are accessed.