Multiple threads and memory

If there are really no writes in your 1MB block then yeah, each core can read from its own cache line without any problem as no writes are being committed and therefore no cache coherency problems arise.

In a multicore architecture, basically there is a cache for each core and a "Cache Coherence Protocol" which invalidates the cache on some cores which do not have the most up to date information. I think most processors implement the MOESI protocol for cache coherency.

Cache coherency is a complex topic that has been largely discussed (I specially like some articles by Joe Duffy here and here). The discussion nonetheless revolves around the possible performance penalties of code that, while being apparently lock free, can slow down due to the cache coherency protocol kicking in to maintain coherency across the processors caches, but, as long as there are no writes there's simply no coherency to maintain and thus no lost on performance.

Just to clarify, as said in the comment, RAM can't be accessed simultaneously since x86 and x64 architectures implement a single bus which is shared between cores with SMP guaranteeing the fairness accessing main memory. Nonetheless this situation is hidden by each core cache which allows each core to have its own copy of the data. For 1MB of data it would be possible to incur on some contention while the core update its cache but that would be negligible.

Some useful links:

  • Cache Coherence Protocols
  • Cache Coherence

Not only are different cores allowed to read from the same block of memory, they're allowed to write at the same time too. If it's "safe" or not, that's an entirely different story. You need to implement some sort of a guard in your code (usually done with semaphores or derivates of them) to guard against multiple cores fighting over the same block of memory in a way you don't specifically allow.

About the size of the memory a core reads at a time, that's usually a register's worth, 32 bits on a 32bit cpu, 64 bits for a 64bit cpu and so on. Even streaming is done dword by dword (look at memcpy for example).

About how concurrent multiple cores really are, every core uses a single bus to read and write to the memory, so accessing any resources (ram, external devices, the floating point processing unit) is one request at a time, one core at a time. The actual processing inside the core is completely concurrent however. DMA transfers also don't block the bus, concurrent transfers get queued and processed one at a time (I believe, not 100% sure on this).

edit: just to clarify, unlike the other reply here, I'm talking only about a no-cache scenario. Of course if the memory gets cached read-only access is completely concurrent.