How Slow Is Python Really (Part II)?
C++ x150 x450 x530
Instead of arrays I used bits (and dark magic).
Thanks to @ace for the faster random function.
How does it work? The first 15 bits of the integer `s` represent the array `S[15]`: the zeroes represent -1, the ones represent +1. The array `F` is built in a similar way, but with two bits for each symbol:
- 00 represents -1
- 01 and 10 represent 0
- 11 represents 1

Because `S` and `F` have different representations, I have to interleave `S` with itself to make it comparable with `F`:
- 0 (-1) becomes 00 (-1 in the representation of `F`)
- 1 (+1) becomes 11 (+1 in the representation of `F`)
Now we can simply use a Karnaugh map to compute the inner product. Remember that an `S` variable can only take the value 00 or 11:
0 . 00 = 11 (-1 * -1 = +1)
0 . 01 = 10 (-1 * 0 = 0)
0 . 10 = 01 (-1 * 0 = 0)
0 . 11 = 00 (-1 * +1 = -1)
1 . 00 = 00 (+1 * -1 = -1)
1 . 10 = 10 (+1 * 0 = 0)
1 . 01 = 01 (+1 * 0 = 0)
1 . 11 = 11 (+1 * +1 = +1)
Looks like an XNOR to me. :)
Summing the ones is just a game of shifting and masking, nothing really complex.
#include <array>
#include <cstdint>
#include <cstdio>
#include <ctime>
// From Stanford bit hacks:
// http://graphics.stanford.edu/~seander/bithacks.html
inline int32_t interleaveBit(int32_t x)
{
static const uint32_t B[] = { 0x55555555, 0x33333333, 0x0F0F0F0F, 0x00FF00FF };
x = (x | ( x << 8)) & B[3];
x = (x | ( x << 4)) & B[2];
x = (x | ( x << 2)) & B[1];
x = (x | ( x << 1)) & B[0];
return x | (x << 1);
}
inline int32_t sumOnes(int32_t v)
{
static int b[] = { 1, 0, 0, 1};
int s = 0;
for( int i = 0; i < 8; ++i)
{
const int a = 3&(v>>(i*2));
s += b[a];
}
return s;
}
inline int32_t sumArray(int32_t v)
{
static int b[] = { -1, 0, 0, 1};
int s = 0;
for( int i = 0; i < 8; ++i)
{
const int a = 3&(v>>(i*2));
s += b[a];
}
return s;
}
uint32_t x, y = 24252, z=57768, w=1564; //PRNG seeds
int32_t myRand()
{
uint32_t t;
t = x ^ (x<<1);
x = y;
y = z;
z = w;
w = w ^ ( w >> 19) ^ t ^ (t >> 8);
return w;
}
int main()
{
std::array<int, 8> leadingZero{0};
x = static_cast<int32_t>(time(nullptr)); // seed PRNG
const int maxS = 1 << 15;
for(int s = 0; s < maxS; ++s)
{
const int32_t x = interleaveBit(s);
for(int i = 0; i < 1000; ++i)
{
int32_t random;
do
{
random = 0xFFFF & myRand();
}while(sumOnes(random) == 0);
int j = 7;
while( j >= 0 )
{
const int32_t h = (x >> (j*2));
const int32_t l = 0xFFFF & (~(random ^ h)); // inner product
if(sumArray(l) == 0)
{
leadingZero[j]++;
} else
{
break;
}
j--;
}
}
}
for(int i = 7; i >= 0; --i)
{
printf("%d ", leadingZero[i]);
}
printf("\n");
return 0;
}
Here is a sample output:
6332350 2525218 1041716 438741 192917 87159 41023 19908
real 0m0.372s
user 0m0.371s
sys 0m0.001s
The program has been compiled with:
gcc -std=c++11 -O3 -msse4.2 -Wall -lstdc++ 26371.cpp -o fastPy
on Fedora 20 with GCC 4.8.2. The CPU is an 8-core i7.
I can probably gain a few milliseconds by tweaking the compiler parameters.
For comparison, this is the OP's solution time on my machine:
time pypy 26371.py
[6330609, 2523914, 1040885, 439303, 192708, 86987, 40710, 19498]
real 0m53.061s
user 0m53.016s
sys 0m0.022s
Edit:
Just by adding OpenMP and swapping the order of the two for loops, I get a x3 gain, leading to a x450 performance improvement over the OP's code. :D
In this case the `leadingZero`
array must be atomic. The global PRNG variables get data races... but they are random, so they will just be more random.
#pragma omp parallel for
for(int i = 0; i < 1000; ++i)
{
int32_t random;
do
{
random = 0xFFFF & myRand();
}while(sumOnes(random) == 0);
for(int s = 0; s < maxS; ++s)
{
const int32_t x = interleaveBit(s);
int j = 7;
while( j >= 0 )
{
const int32_t h = (x >> (j*2));
const int32_t l = 0xFFFF & (~(random ^ h)); // inner product
if( sumArray(l) == 0 )
{
leadingZero[j]++;
} else
{
break;
}
j--;
}
}
}
You need to add -fopenmp
to the compiler flags.
Edit 2: As suggested by user71404, I changed the sumOnes and sumArray functions and now it's uber fast.
real 0m0.101s
user 0m0.101s
sys 0m0.000s
With OpenMP it is slower, because the atomics add too much overhead.
real 0m0.253s
user 0m1.870s
sys 0m0.001s
Without the atomics it is even faster, but I get wrong results.
2137992 1147218 619297 321243 155815 70946 32919 15579
real 0m0.048s
user 0m0.338s
sys 0m0.001s
To understand sumArray, consider that 16 bits represent an array of 8 numbers:
- 00 has no 1s and represents -1
- 01 and 10 have one 1 and represent 0
- 11 has two 1s and represents 1

So the built-in counts the number of bits set to 1 (the Hamming weight, http://en.wikipedia.org/wiki/Hamming_weight), and we subtract 1 for each of the 8 groups. Cool.
sumOnes is just black magic.
Here are the latest compile flags and code:
gcc -std=c++11 -mfpmath=sse -O3 -flto -march=native -funroll-loops -Wall -lstdc++
#include <cstdint>
#include <cstdio>
#include <ctime>
inline int32_t interleaveBit(int32_t x)
{
static const uint32_t B[] = { 0x55555555, 0x33333333, 0x0F0F0F0F, 0x00FF00FF };
x = (x | ( x << 8)) & B[3];
x = (x | ( x << 4)) & B[2];
x = (x | ( x << 2)) & B[1];
x = (x | ( x << 1)) & B[0];
return x | (x << 1);
}
inline int32_t sumOnes(int32_t v)
{
/* 0xAAAA == 0b1010 1010 1010 1010 */
return !!(0xAAAA & (v ^ ~(v << 1)));
}
inline int32_t sumArray(int32_t v)
{
return __builtin_popcount(v) - 8;
}
uint32_t x, y = 24252, z = 57768, w = 1564; //PRNG seeds
int32_t myRand()
{
uint32_t t;
t = x ^ (x << 1);
x = y;
y = z;
z = w;
w = w ^ ( w >> 19) ^ t ^ (t >> 8);
return w;
}
int main()
{
int leadingZero[8] = { 0 };
x = static_cast<int32_t>(time(nullptr)); // seed PRNG
const int maxS = 1 << 15;
for( int i = 0; i < 1000; ++i )
{
int32_t random;
do
{
random = 0xFFFF & myRand();
} while(sumOnes(random) == 0 );
for( int s = 0; s < maxS; ++s )
{
const int32_t x = interleaveBit(s);
int j = 7;
while( j >= 0 )
{
const int32_t h = (x >> (j * 2));
const int32_t l = 0xFFFF & (~(random ^ h)); // inner product
if( sumArray(l) == 0 )
{
leadingZero[j]++;
} else
{
break;
}
j--;
}
}
}
printf("[%d, %d, %d, %d, %d, %d, %d, %d]\n",
leadingZero[7], leadingZero[6],
leadingZero[5], leadingZero[4],
leadingZero[3], leadingZero[2],
leadingZero[1], leadingZero[0]);
return 0;
}
C++ bit magic
~16 ms multithreaded, 56 ms single-threaded. ~4000x speedup.
(The speedup is based on the multithreaded code on my i7-2820QM and the 1 min 9 s mentioned in the question. Since the OP's system has worse single-threaded performance than my CPU but better multi-threaded performance, I expect this number to be accurate.)
The multithreaded part is quite inefficient due to the spawning of threads. I could probably do better by leveraging my custom job library, but that one has bugs under unix systems. For an explanation and almost identical code without threading, refer to https://codegolf.stackexchange.com/a/26485/20965.
Edit
I gave each thread its own RNG and cut the bit length down to 32, which reduced the runtime by a few ms.
#include <iostream>
#include <bitset>
#include <random>
#include <chrono>
#include <stdint.h>
#include <cassert>
#include <array>
#include <tuple>
#include <memory>
#include <thread>
#include <future>
#include <string.h>
#ifdef _MSC_VER
uint32_t popcnt( uint32_t x ){ return _mm_popcnt_u32(x); }
#else
uint32_t popcnt( uint32_t x ){ return __builtin_popcount(x); }
#endif
void convolve()
{
static const unsigned threadCount = 32;
static const unsigned n = 8;
static const unsigned m = 8;
static const unsigned totalIters = 1000;
static_assert( n <= 16, "packing of F fails when n > 16.");
static uint32_t fmask = (1 << n) -1; fmask |= fmask << 16;
std::array< uint32_t, m * threadCount > out;
std::vector< std::future<void> > threads;
for( int threadId = 0; threadId < threadCount; threadId++)
{
threads.emplace_back( std::async( [&, threadId]
{
std::random_device rd;
std::knuth_b gen(rd());
uint32_t nextRandomNumber = gen();
const unsigned iters = totalIters / threadCount;
std::array< uint32_t, m > leadingZeros;
for( auto& x : leadingZeros )
x = 0;
for( unsigned i = 0; i < iters; i++ )
{
// generate random bit mess
uint32_t F;
do {
// this funky looking construction shortens the dependency chain of F
F = nextRandomNumber & fmask;
nextRandomNumber = gen();
} while ( 0 == ((F % (1 << n)) ^ (F >> 16 )) );
// Assume F is an array with interleaved elements such that F[0] || F[16] is one element
// here MSB(F) & ~LSB(F) returns 1 for all elements that are positive
// and ~MSB(F) & LSB(F) returns 1 for all elements that are negative
// this results in the distribution ( -1, 0, 0, 1 )
// to ease calculations we generate r = LSB(F) and l = MSB(F)
uint32_t r = F % ( 1 << n );
// modulo is required because the behaviour of the leftmost bit is implementation defined
uint32_t l = ( F >> 16 ) % ( 1 << n );
uint32_t posBits = l & ~r;
uint32_t negBits = ~l & r;
assert( (posBits & negBits) == 0 );
uint32_t mask = posBits | negBits;
uint32_t totalBits = popcnt( mask );
// if the amount of -1 and +1's is uneven, sum(S*F) cannot possibly evaluate to 0
if ( totalBits & 1 )
continue;
uint32_t adjF = posBits & ~negBits;
uint32_t desiredBits = totalBits / 2;
uint32_t S = (1 << (n + m -1));
// generate all possible N+1 bit strings
// 1 = +1
// 0 = -1
while ( S-- )
{
for( int shift = 0; shift < m; shift++ )
{
uint32_t s = (S >> shift) % ( 1 << n );
auto firstBits = (s & mask) ^ adjF;
if ( desiredBits == popcnt( firstBits ) )
{
leadingZeros[shift] = leadingZeros[shift] + 1;
}
else
{
break;
}
}
}
}
memcpy( out.data() + (threadId * m), leadingZeros.data(), sizeof( leadingZeros[0] ) * m );
} ));
};
std::array< uint32_t, m > leadingZeros;
for( auto& x : leadingZeros )
x = 0;
for( auto& thread : threads )
{
thread.wait();
}
for( int i = 0; i < (threadCount * m); i++ )
{
leadingZeros[i % m] += out[i];
}
for( auto x : leadingZeros )
std::cout << x << ", ";
std::cout << std::endl;
}
int main()
{
typedef std::chrono::high_resolution_clock clock;
int rounds = 100;
// do some rounds to get the cpu up to speed..
for( int i = 0; i < rounds / 10; i++ )
{
convolve();
}
auto start = clock::now();
for( int i = 0; i < rounds; i++ )
convolve();
auto end = clock::now();
double seconds = std::chrono::duration_cast< std::chrono::microseconds >( end - start ).count() / 1000000.0;
std::cout << seconds/rounds*1000 << " msec/round" << std::endl;
return 0;
}
Sample output:
6317312, 2515072, 1034368, 434048, 190144, 85200, 39804, 19168,
6226944, 2481408, 1031168, 438080, 192896, 86816, 40484, 19490,
6321152, 2524672, 1045376, 442880, 195680, 88464, 41656, 20212,
6330624, 2517504, 1031104, 430208, 187696, 83976, 38976, 18708,
6304768, 2510336, 1030720, 433056, 190880, 86824, 40940, 19840,
6272512, 2494720, 1028160, 432352, 189168, 84752, 39540, 19052,
6233600, 2507520, 1046912, 447008, 198224, 89984, 42092, 20292,
Julia: 0.7s, 120x faster
As user20768 demonstrated, a straightforward port of the code to Julia is about twice as fast as PyPy. But we can do a lot better than that.
function pleadingzerocounts(; n = 8,
m = 8,
iters = 1000)
@parallel (+) for S = 1:2^(8+8-1)
leading_counts = zeros(Int, m)
F = Array(Int, n)
for i = 1:iters
flag = 0
while flag == 0
for i = 1:n
v = (1-(rand(Int8)&3))%2
@inbounds F[i] = v
flag += v & 1
end
end
for j = 1:m
sum = 0
for i = 1:n
@inbounds sum += S & (1 << (j + i - 2)) > 0 ? F[i] : -F[i]
end
sum == 0 ?
(leading_counts[j] += 1) :
break
end
end
leading_counts
end
end
function main()
# Warm up the JIT
pleadingzerocounts()
# Then go for real
println(@time pleadingzerocounts())
end
You can run this using julia -p 8 -e 'require("golf.jl");main()'
(the 8 is the number of processes; you might want to play around with it). On the latest Julia prerelease this takes 0.7 s, vs. 1 min 22 s for PyPy.
If you have enough cores on your computer, and perhaps spin up a few AWS instances, you should be able to shave off some more. :)