Any faster way of copying arrays in C#?
If running on .NET Core, you may consider using source.AsSpan().CopyTo(destination)
(beware on Mono though).
Method | Job | Runtime | Mean | Error | StdDev | Ratio | RatioSD |
---------------- |----- |-------- |----------:|----------:|----------:|------:|--------:|
ArrayCopy | Clr | Clr | 60.08 ns | 0.8231 ns | 0.7699 ns | 1.00 | 0.00 |
SpanCopy | Clr | Clr | 99.31 ns | 0.4895 ns | 0.4339 ns | 1.65 | 0.02 |
BufferBlockCopy | Clr | Clr | 61.34 ns | 0.5963 ns | 0.5578 ns | 1.02 | 0.01 |
| | | | | | | |
ArrayCopy | Core | Core | 63.33 ns | 0.6843 ns | 0.6066 ns | 1.00 | 0.00 |
SpanCopy | Core | Core | 47.41 ns | 0.5399 ns | 0.5050 ns | 0.75 | 0.01 |
BufferBlockCopy | Core | Core | 59.89 ns | 0.4713 ns | 0.3936 ns | 0.94 | 0.01 |
| | | | | | | |
ArrayCopy | Mono | Mono | 149.82 ns | 1.6466 ns | 1.4596 ns | 1.00 | 0.00 |
SpanCopy | Mono | Mono | 347.87 ns | 2.0589 ns | 1.9259 ns | 2.32 | 0.02 |
BufferBlockCopy | Mono | Mono | 61.52 ns | 1.1691 ns | 1.0364 ns | 0.41 | 0.01 |
Use Buffer.BlockCopy. Its entire purpose is to perform fast (see Buffer):
This class provides better performance for manipulating primitive types than similar methods in the System.Array class.
Admittedly, I haven't done any benchmarks, but that's the documentation. It also works on multidimensional arrays; just make sure that you're always specifying how many bytes to copy, not how many elements, and also that you're working on a primitive array.
Also, I have not tested this, but you might be able to squeeze a bit more performance out of the system if you bind a delegate to System.Buffer.memcpyimpl
and call that directly. The signature is:
internal static unsafe void memcpyimpl(byte* src, byte* dest, int len)
It does require pointers, but I believe it's optimized for the highest speed possible, and so I don't think there's any way to get faster than that, even if you had assembly at hand.
Update:
Due to requests (and to satisfy my curiosity), I tested this:
using System;
using System.Diagnostics;
using System.Reflection;
unsafe delegate void MemCpyImpl(byte* src, byte* dest, int len);
static class Temp
{
//There really should be a generic CreateDelegate<T>() method... -___-
static MemCpyImpl memcpyimpl = (MemCpyImpl)Delegate.CreateDelegate(
typeof(MemCpyImpl), typeof(Buffer).GetMethod("memcpyimpl",
BindingFlags.Static | BindingFlags.NonPublic));
const int COUNT = 32, SIZE = 32 << 20;
//Use different buffers to help avoid CPU cache effects
static byte[]
aSource = new byte[SIZE], aTarget = new byte[SIZE],
bSource = new byte[SIZE], bTarget = new byte[SIZE],
cSource = new byte[SIZE], cTarget = new byte[SIZE];
static unsafe void TestUnsafe()
{
Stopwatch sw = Stopwatch.StartNew();
fixed (byte* pSrc = aSource)
fixed (byte* pDest = aTarget)
for (int i = 0; i < COUNT; i++)
memcpyimpl(pSrc, pDest, SIZE);
sw.Stop();
Console.WriteLine("Buffer.memcpyimpl: {0:N0} ticks", sw.ElapsedTicks);
}
static void TestBlockCopy()
{
Stopwatch sw = Stopwatch.StartNew();
sw.Start();
for (int i = 0; i < COUNT; i++)
Buffer.BlockCopy(bSource, 0, bTarget, 0, SIZE);
sw.Stop();
Console.WriteLine("Buffer.BlockCopy: {0:N0} ticks",
sw.ElapsedTicks);
}
static void TestArrayCopy()
{
Stopwatch sw = Stopwatch.StartNew();
sw.Start();
for (int i = 0; i < COUNT; i++)
Array.Copy(cSource, 0, cTarget, 0, SIZE);
sw.Stop();
Console.WriteLine("Array.Copy: {0:N0} ticks", sw.ElapsedTicks);
}
static void Main(string[] args)
{
for (int i = 0; i < 10; i++)
{
TestArrayCopy();
TestBlockCopy();
TestUnsafe();
Console.WriteLine();
}
}
}
The results:
Buffer.BlockCopy: 469,151 ticks
Array.Copy: 469,972 ticks
Buffer.memcpyimpl: 496,541 ticks
Buffer.BlockCopy: 421,011 ticks
Array.Copy: 430,694 ticks
Buffer.memcpyimpl: 410,933 ticks
Buffer.BlockCopy: 425,112 ticks
Array.Copy: 420,839 ticks
Buffer.memcpyimpl: 411,520 ticks
Buffer.BlockCopy: 424,329 ticks
Array.Copy: 420,288 ticks
Buffer.memcpyimpl: 405,598 ticks
Buffer.BlockCopy: 422,410 ticks
Array.Copy: 427,826 ticks
Buffer.memcpyimpl: 414,394 ticks
Now change the order:
Array.Copy: 419,750 ticks
Buffer.memcpyimpl: 408,919 ticks
Buffer.BlockCopy: 419,774 ticks
Array.Copy: 430,529 ticks
Buffer.memcpyimpl: 412,148 ticks
Buffer.BlockCopy: 424,900 ticks
Array.Copy: 424,706 ticks
Buffer.memcpyimpl: 427,861 ticks
Buffer.BlockCopy: 421,929 ticks
Array.Copy: 420,556 ticks
Buffer.memcpyimpl: 421,541 ticks
Buffer.BlockCopy: 436,430 ticks
Array.Copy: 435,297 ticks
Buffer.memcpyimpl: 432,505 ticks
Buffer.BlockCopy: 441,493 ticks
Now change the order again:
Buffer.memcpyimpl: 430,874 ticks
Buffer.BlockCopy: 429,730 ticks
Array.Copy: 432,746 ticks
Buffer.memcpyimpl: 415,943 ticks
Buffer.BlockCopy: 423,809 ticks
Array.Copy: 428,703 ticks
Buffer.memcpyimpl: 421,270 ticks
Buffer.BlockCopy: 428,262 ticks
Array.Copy: 434,940 ticks
Buffer.memcpyimpl: 423,506 ticks
Buffer.BlockCopy: 427,220 ticks
Array.Copy: 431,606 ticks
Buffer.memcpyimpl: 422,900 ticks
Buffer.BlockCopy: 439,280 ticks
Array.Copy: 432,649 ticks
or, in other words: they're very competitive; as a general rule, memcpyimpl
is fastest, but it's not necessarily worth worrying about.
You can use Array.Copy
.
EDIT
Array.Copy
does work for multidimensional arrays: see this topic.
For primitive type arrays (like double
) you can copy fast, even for multidimensional array with pointers.
In the code below I initialize a 2D array A[10,10]
with the values 1 through 100. Then I copy these values into a 1D array B[100]
unsafe class Program
{
static void Main(string[] args)
{
double[,] A = new double[10, 10];
for(int i = 0; i < 10; i++)
{
for(int j = 0; j < 10; j++)
{
A[i, j] = 10 * i + j + 1;
}
}
// A has { { 1 ,2 .. 10}, { 11, 12 .. 20}, .. { .. 99, 100} }
double[] B = new double[10 * 10];
if (A.Length == B.Length)
{
fixed (double* pA = A, pB = B)
{
for(int i = 0; i < B.Length; i++)
{
pB[i] = pA[i];
}
}
// B has {1, 2, 3, 4 .. 100}
}
}
}
How fast is it. My testing has shown it to be many times faster then native C# copy and Buffer.BlockCopy()
. You try it for your case and let us know.
Edit 1 I compared copying with four methods. 1) Two Nested loops, 2) One Serial loop, 3) Pointers, 4) BlockCopy. I measured the # of copies per tick for various size arrays.
N = 10x 10 (cpy/tck) Nested = 50, Serial = 33, Pointer = 100, Buffer = 16
N = 20x 20 (cpy/tck) Nested = 133, Serial = 40, Pointer = 400, Buffer = 400
N = 50x 50 (cpy/tck) Nested = 104, Serial = 40, Pointer = 2500, Buffer = 2500
N = 100x 100 (cpy/tck) Nested = 61, Serial = 41, Pointer = 10000, Buffer = 3333
N = 200x 200 (cpy/tck) Nested = 84, Serial = 41, Pointer = 40000, Buffer = 2666
N = 500x 500 (cpy/tck) Nested = 69, Serial = 41, Pointer = 125000, Buffer = 2840
N = 1000x1000 (cpy/tck) Nested = 33, Serial = 45, Pointer = 142857, Buffer = 1890
N = 2000x2000 (cpy/tck) Nested = 30, Serial = 43, Pointer = 266666, Buffer = 1826
N = 5000x5000 (cpy/tck) Nested = 21, Serial = 42, Pointer = 735294, Buffer = 1712
It is clear here who is the winner. Pointer copy is orders of magnitudes better than any other method.
Edit 2 Apparently I was unfairly taking advantage of a compiler/JIT optimization because when I moved the loops behind delegates to equalize the playing field the numbers changed dramatically.
N = 10x 10 (cpy/tck) Nested = 0, Serial = 0, Pointer = 0, Buffer = 0
N = 20x 20 (cpy/tck) Nested = 80, Serial = 14, Pointer = 100, Buffer = 133
N = 50x 50 (cpy/tck) Nested =147, Serial = 15, Pointer = 277, Buffer = 2500
N = 100x 100 (cpy/tck) Nested = 98, Serial = 15, Pointer = 285, Buffer = 3333
N = 200x 200 (cpy/tck) Nested =106, Serial = 15, Pointer = 272, Buffer = 3076
N = 500x 500 (cpy/tck) Nested =106, Serial = 15, Pointer = 276, Buffer = 3125
N = 1000x1000 (cpy/tck) Nested =101, Serial = 11, Pointer = 199, Buffer = 1396
N = 2000x2000 (cpy/tck) Nested =105, Serial = 9, Pointer = 186, Buffer = 1804
N = 5000x5000 (cpy/tck) Nested =102, Serial = 8, Pointer = 170, Buffer = 1673
The buffered copy is top here (thanks to @Mehrdad) with pointer copy second. The question now is why isn't pointer copy as fast as buffer methods?