How slow is Python really? (Or how fast is your language?)
Python 2.7 + Numpy 1.8.1: 10.242 s
Fortran 90+: 0.010 s (superseded timings from earlier edits: 0.029 s, 0.003 s, 0.022 s)
Damn straight you lost your bet! Not a drop of parallelization here either, just straight Fortran 90+.
EDIT: I've taken Guy Sirton's algorithm for permuting the array S (good find :D). I apparently also had the -g -traceback compiler flags active, which were slowing this code down to about 0.017 s. Currently, I am compiling this as
ifort -fast -o convolve convolve_random_arrays.f90
For those who don't have ifort, you can use
gfortran -O3 -ffast-math -o convolve convolve_random_arrays.f90
EDIT 2: The earlier decrease in run-time was because I was doing something wrong previously and got an incorrect answer; doing it the right way is apparently slower. I still can't believe that C++ is faster than mine, so I'm probably going to spend some time this week trying to tweak the crap out of this to speed it up.
EDIT 3: By simply changing the RNG section to one based on BSD's RNG (as suggested by Sampo Smolander) and eliminating the constant divide by m1, I cut the run-time to the same as the C++ answer by Guy Sirton. Using static arrays (as suggested by Sharpie) drops the run-time under the C++ run-time! Yay Fortran! :D
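For illustration, here is that recurrence as a standalone C++ sketch (not the benchmark code itself): the constants mirror the Fortran code below, and the multiply by the precomputed reciprocal mi replaces the divide by m1.

#include <cstdint>
#include <cstdio>

// Sketch of the BSD-style LCG from EDIT 3: b' = (a1*b + c1) mod m1,
// then scale to [0,1) by multiplying with the precomputed reciprocal
// mi = 1/m1 instead of dividing by m1 on every call.
int main() {
    const int64_t a1 = 1103515245, c1 = 12345, m1 = 2147483648LL;
    const float mi = 4.656612873e-10f; // 1/m1, computed once
    int64_t b = 3141592653LL;          // seed, as in the Fortran code
    for (int i = 0; i < 3; ++i) {
        b = (a1 * b + c1) % m1;
        printf("%f\n", (float)b * mi); // multiply, don't divide
    }
    return 0;
}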
EDIT 4: Apparently this doesn't compile (with gfortran) and run correctly (incorrect values) because the integers are overflowing their limits. I've made corrections to ensure it works, but this requires either ifort 11+ or gfortran 4.7+ (or another compiler that supports iso_fortran_env and the F2008 int64 kind).
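A minimal C++ sketch of the overflow EDIT 4 is about (the example state value is made up): the LCG product alone already needs more than 32 bits.

#include <cstdint>
#include <cstdio>

// Why EDIT 4 needs a 64-bit kind: the product a1*b far exceeds the
// signed 32-bit range, so computing it in 32-bit integers overflows
// before the mod by m1 can bring it back down.
int main() {
    const int64_t a1 = 1103515245;  // LCG multiplier from the code below
    const int64_t b  = 12345;       // a small example state
    const int64_t wide = a1 * b;    // needs 64 bits to hold
    printf("a1*b = %lld, fits in int32: %s\n",
           (long long)wide, wide <= INT32_MAX ? "yes" : "no");
    return 0;
}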
Here's the code:
program convolve_random_arrays
   use iso_fortran_env
   implicit none
   integer(int64), parameter :: a1 = 1103515245_int64
   integer(int64), parameter :: c1 = 12345_int64
   integer(int64), parameter :: m1 = 2147483648_int64
   real, parameter :: mi = 4.656612873e-10 ! 1/m1
   integer, parameter :: n = 6
   integer :: p, pmax, iters, i, nil(0:1), seed
   !integer, allocatable :: F(:), S(:), FS(:)
   integer :: F(n), S(n+1), FS(2)

   !n = 6
   !allocate(F(n), S(n+1), FS(2))
   iters = 1000
   nil = 0
   !call init_random_seed()
   S = -1
   pmax = 2**(n+1)
   do p=1,pmax
      do i=1,iters
         ! draw F from {-1,0,0,1}**n, rejecting the all-zero array
         F = rand_int_array(n)
         do while(all(F==0))
            F = rand_int_array(n)
         enddo
         FS = convolve(F,S)
         if(FS(1) == 0) then
            nil(0) = nil(0) + 1
            if(FS(2) == 0) nil(1) = nil(1) + 1
         endif
      enddo
      call permute(S)
   enddo
   print *,"first zero:",nil(0)
   print *," both zero:",nil(1)

contains
   pure function convolve(x, h) result(y)
      !x is the signal array (length n)
      !h is the noise/impulse array (length n+1)
      integer, dimension(:), intent(in) :: x, h
      integer, dimension(abs(size(x)-size(h))+1) :: y
      ! both alignments of x against h must use length-n slices of h
      y(1) = dot_product(x, h(1:n  ))
      y(2) = dot_product(x, h(2:n+1))
   end function convolve

   pure subroutine permute(x)
      ! advance x through {-1,+1}**size(x) like a binary counter,
      ! with -1 playing the role of 0
      integer, intent(inout) :: x(:)
      integer :: i
      do i=1,size(x)
         if(x(i)==-1) then
            x(i) = 1
            return
         endif
         x(i) = -1
      enddo
   end subroutine permute

   function rand_int_array(i) result(x)
      integer, intent(in) :: i
      integer :: x(i), j
      real :: y
      do j=1,i
         y = bsd_rng()
         if(y <= 0.25) then
            x(j) = -1
         else if(y >= 0.75) then
            x(j) = +1
         else
            x(j) = 0
         endif
      enddo
   end function rand_int_array

   function bsd_rng() result(x)
      real :: x
      integer(int64), save :: b = 3141592653_int64
      b = mod(a1*b + c1, m1)
      x = real(b)*mi
   end function bsd_rng
end program convolve_random_arrays
I suppose the question now is: will you stop using slow-as-molasses Python and use fast-as-electrons-can-move Fortran? ;)
Python 2.7: 0.283 s (superseded: 0.882 s; OP's original: 6.404 s)
Edit: Added Steven Rumbalski's optimization of precomputing the F values. With this optimization, CPython beats PyPy's 0.365 s.
import itertools
import operator
import random

n = 6
iters = 1000
firstzero = 0
bothzero = 0

choicesF = filter(any, itertools.product([-1, 0, 0, 1], repeat=n))

for S in itertools.product([-1, 1], repeat=n+1):
    for i in xrange(iters):
        F = random.choice(choicesF)
        if not sum(map(operator.mul, F, S[:-1])):
            firstzero += 1
            if not sum(map(operator.mul, F, S[1:])):
                bothzero += 1

print "firstzero", firstzero
print "bothzero", bothzero
OP's original code uses such tiny arrays that there is no benefit to using Numpy, as this pure Python implementation demonstrates. But see also this numpy implementation, which is three times faster again than my code.
I also optimize by skipping the rest of the convolution if the first result isn't zero.
C++ bit magic
0.84 ms with simple RNG, 1.67 ms with C++11 std::knuth_b
0.16 ms with a slight algorithmic modification (see edit below)
The Python implementation runs in 7.97 seconds on my rig, so this is 4772 to 9488 times faster depending on which RNG you choose.
#include <iostream>
#include <bitset>
#include <random>
#include <chrono>
#include <vector>
#include <stdint.h>
#include <cassert>
#include <tuple>

#if 0
// C++11 random
std::random_device rd;
std::knuth_b gen(rd());

uint32_t genRandom()
{
    return gen();
}
#else
// bad, fast, random.
uint32_t genRandom()
{
    static uint32_t seed = std::random_device()();
    auto oldSeed = seed;
    seed = seed*1664525UL + 1013904223UL; // numerical recipes, 32 bit
    return oldSeed;
}
#endif

#ifdef _MSC_VER
#include <intrin.h> // for _mm_popcnt_u32
uint32_t popcnt( uint32_t x ){ return _mm_popcnt_u32(x); }
#else
uint32_t popcnt( uint32_t x ){ return __builtin_popcount(x); }
#endif

std::pair<unsigned, unsigned> convolve()
{
    const uint32_t n = 6;
    const uint32_t iters = 1000;
    unsigned firstZero = 0;
    unsigned bothZero = 0;

    uint32_t S = (1 << (n+1));
    // generate all possible N+1 bit strings
    // 1 = +1
    // 0 = -1
    while ( S-- )
    {
        uint32_t s1 = S % ( 1 << n );
        uint32_t s2 = (S >> 1) % ( 1 << n );
        uint32_t fmask = (1 << n) -1; fmask |= fmask << 16;
        static_assert( n < 16, "packing of F fails when n > 16.");

        for( unsigned i = 0; i < iters; i++ )
        {
            // generate random bit mess
            uint32_t F;
            do {
                F = genRandom() & fmask;
            } while ( 0 == ((F % (1 << n)) ^ (F >> 16 )) );

            // Assume F is an array with interleaved elements such that F[0] || F[16] is one element
            // here MSB(F) & ~LSB(F) returns 1 for all elements that are positive
            // and ~MSB(F) & LSB(F) returns 1 for all elements that are negative
            // this results in the distribution ( -1, 0, 0, 1 )
            // to ease calculations we generate r = LSB(F) and l = MSB(F)
            uint32_t r = F % ( 1 << n );
            // modulo is required because the behaviour of the leftmost bit is implementation defined
            uint32_t l = ( F >> 16 ) % ( 1 << n );

            uint32_t posBits = l & ~r;
            uint32_t negBits = ~l & r;
            assert( (posBits & negBits) == 0 );

            // calculate which bits in the expression S * F evaluate to +1
            unsigned firstPosBits = ((s1 & posBits) | (~s1 & negBits));
            // idem for -1
            unsigned firstNegBits = ((~s1 & posBits) | (s1 & negBits));

            if ( popcnt( firstPosBits ) == popcnt( firstNegBits ) )
            {
                firstZero++;

                unsigned secondPosBits = ((s2 & posBits) | (~s2 & negBits));
                unsigned secondNegBits = ((~s2 & posBits) | (s2 & negBits));

                if ( popcnt( secondPosBits ) == popcnt( secondNegBits ) )
                {
                    bothZero++;
                }
            }
        }
    }

    return std::make_pair(firstZero, bothZero);
}

int main()
{
    typedef std::chrono::high_resolution_clock clock;

    int rounds = 1000;
    std::vector< std::pair<unsigned, unsigned> > out(rounds);

    // do 10000 warm-up rounds to get the cpu up to speed..
    for( int i = 0; i < 10000; i++ )
    {
        convolve();
    }

    auto start = clock::now();
    for( int i = 0; i < rounds; i++ )
    {
        out[i] = convolve();
    }
    auto end = clock::now();
    double seconds = std::chrono::duration_cast< std::chrono::microseconds >( end - start ).count() / 1000000.0;

#if 0
    for( auto pair : out )
        std::cout << pair.first << ", " << pair.second << std::endl;
#endif

    std::cout << seconds/rounds*1000 << " msec/round" << std::endl;

    return 0;
}
Compile in 64-bit for extra registers. When using the simple random generator, the loops in convolve() run without any memory access; all variables are kept in registers.
How it works: rather than storing S and F as in-memory arrays, they are stored as bits in a uint32_t. For S, the n least significant bits are used, where a set bit denotes +1 and an unset bit denotes -1.
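For illustration, here is a hypothetical decoder (not part of the answer's code) that unpacks such an S back into its +1/-1 elements:

#include <cstdint>
#include <cstdio>

// Hypothetical decoder for the S packing described above:
// bit i set -> element i is +1, bit i clear -> element i is -1.
int main() {
    const uint32_t n = 6;
    const uint32_t S = 0b0101101;      // an example (n+1)-bit packed S
    int s[n + 1];
    for (uint32_t i = 0; i <= n; ++i)
        s[i] = ((S >> i) & 1) ? +1 : -1;
    for (uint32_t i = 0; i <= n; ++i)
        printf("%+d ", s[i]);
    printf("\n");
    return 0;
}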
F requires at least 2 bits to create the distribution [-1, 0, 0, 1]. This is done by generating random bits and examining the 16 least significant bits (called r) and the 16 most significant bits (called l). If l & ~r, we assume that F is +1; if ~l & r, we assume that F is -1; otherwise F is 0. This generates the distribution we're looking for.
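A tiny sketch of that mapping for a single element, with a made-up decode helper enumerating all four equally likely (l, r) pairs:

#include <cstdio>

// Made-up helper: one (l, r) bit pair of F maps onto the distribution
// (-1, 0, 0, 1): l=1,r=0 -> +1; l=0,r=1 -> -1; l==r -> 0.
int decode(unsigned l, unsigned r) {
    if ((l & ~r) & 1) return +1;
    if ((~l & r) & 1) return -1;
    return 0; // two of the four equally likely pairs give 0
}

int main() {
    for (unsigned l = 0; l < 2; ++l)
        for (unsigned r = 0; r < 2; ++r)
            printf("l=%u r=%u -> %+d\n", l, r, decode(l, r));
    return 0;
}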
Now we have S, posBits with a set bit on every location where F == 1, and negBits with a set bit on every location where F == -1.
We can prove that F * S (where * denotes multiplication) evaluates to +1 under the condition (S & posBits) | (~S & negBits). We can also generate similar logic for all cases where F * S evaluates to -1. Finally, we know that sum(F * S) evaluates to 0 if and only if there is an equal number of -1's and +1's in the result. This is very easy to calculate by simply comparing the number of +1 bits and -1 bits.
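A minimal self-check of that claim, using made-up example values for S, posBits, and negBits and cross-checking the popcount trick against a plain element-by-element sum:

#include <cstdint>
#include <cstdio>

// Check: sum(F*S) == 0 iff the count of positions where F*S = +1
// equals the count where F*S = -1. Uses the GCC/Clang builtin popcount.
static uint32_t popcnt(uint32_t x) { return __builtin_popcount(x); }

int main() {
    const int n = 6;
    uint32_t S       = 0b101101; // example packed S: 1 = +1, 0 = -1
    uint32_t posBits = 0b000110; // elements of F equal to +1
    uint32_t negBits = 0b100001; // elements of F equal to -1
    uint32_t plus  = (S & posBits) | (~S & negBits); // products equal to +1
    uint32_t minus = (~S & posBits) | (S & negBits); // products equal to -1
    // cross-check against a straightforward element-by-element sum
    int sum = 0;
    for (int i = 0; i < n; ++i) {
        int s = ((S >> i) & 1) ? +1 : -1;
        int f = ((posBits >> i) & 1) ? +1 : (((negBits >> i) & 1) ? -1 : 0);
        sum += s * f;
    }
    printf("popcnt(+1)=%u popcnt(-1)=%u direct sum=%d\n",
           popcnt(plus), popcnt(minus), sum);
    return 0;
}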
This implementation uses 32-bit ints, and the maximum n accepted is 16. It is possible to scale the implementation to 31 bits by modifying the random-generation code, and to 63 bits by using uint64_t instead of uint32_t.
edit: The following convolve function:
std::pair<unsigned, unsigned> convolve()
{
    const uint32_t n = 6;
    const uint32_t iters = 1000;
    unsigned firstZero = 0;
    unsigned bothZero = 0;
    uint32_t fmask = (1 << n) -1; fmask |= fmask << 16;
    static_assert( n < 16, "packing of F fails when n > 16.");

    for( unsigned i = 0; i < iters; i++ )
    {
        // generate random bit mess
        uint32_t F;
        do {
            F = genRandom() & fmask;
        } while ( 0 == ((F % (1 << n)) ^ (F >> 16 )) );

        // Assume F is an array with interleaved elements such that F[0] || F[16] is one element
        // here MSB(F) & ~LSB(F) returns 1 for all elements that are positive
        // and ~MSB(F) & LSB(F) returns 1 for all elements that are negative
        // this results in the distribution ( -1, 0, 0, 1 )
        // to ease calculations we generate r = LSB(F) and l = MSB(F)
        uint32_t r = F % ( 1 << n );
        // modulo is required because the behaviour of the leftmost bit is implementation defined
        uint32_t l = ( F >> 16 ) % ( 1 << n );

        uint32_t posBits = l & ~r;
        uint32_t negBits = ~l & r;
        assert( (posBits & negBits) == 0 );

        uint32_t mask = posBits | negBits;
        uint32_t totalBits = popcnt( mask );
        // if the number of -1's and +1's is odd, sum(S*F) cannot possibly evaluate to 0
        if ( totalBits & 1 )
            continue;

        uint32_t adjF = posBits & ~negBits;
        uint32_t desiredBits = totalBits / 2;

        uint32_t S = (1 << (n+1));
        // generate all possible N+1 bit strings
        // 1 = +1
        // 0 = -1
        while ( S-- )
        {
            // calculate which bits in the expression S * F evaluate to +1
            auto firstBits = (S & mask) ^ adjF;
            auto secondBits = (S & ( mask << 1 ) ) ^ ( adjF << 1 );

            bool a = desiredBits == popcnt( firstBits );
            bool b = desiredBits == popcnt( secondBits );
            firstZero += a;
            bothZero += a & b;
        }
    }

    return std::make_pair(firstZero, bothZero);
}
cuts the runtime to 0.160-0.161 ms. Manual loop unrolling (not shown above) makes that 0.150 ms. The less trivial n=10, iter=100000 case runs in under 250 ms. I'm sure I can get it under 50 ms by leveraging additional cores, but that's too easy.
This is done by making the inner loop branch-free and swapping the F and S loops; see the sketch below.
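The branch-free accumulation pattern boils down to this (a toy sketch with stand-in conditions instead of the real popcnt tests):

#include <cstdio>

// Toy sketch of the branch-free inner-loop pattern: the comparisons
// produce 0 or 1, which are added directly instead of branching.
int main() {
    unsigned firstZero = 0, bothZero = 0;
    for (int s = 0; s < 8; ++s) {
        bool a = (s % 2 == 0); // stand-in for: desiredBits == popcnt(firstBits)
        bool b = (s % 4 == 0); // stand-in for: desiredBits == popcnt(secondBits)
        firstZero += a;        // no if-statement, no branch misprediction
        bothZero  += a & b;
    }
    printf("%u %u\n", firstZero, bothZero);
    return 0;
}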
If bothZero is not required, I can cut the run time down to 0.02 ms by sparsely looping over all possible S arrays.