Why is x**4.0 faster than x**4 in Python 3?

Why is x**4.0 faster than x**4 in Python 3*?

Python 3 int objects are full-fledged objects designed to support arbitrary size; because of that, they are handled as such at the C level (see how all variables are declared as PyLongObject * in long_pow). This also makes their exponentiation a lot trickier and more tedious, since you need to work with the ob_digit array that an int uses to represent its value. (The source of long_pow is there for the brave. See also: "Understanding memory allocation for large integers in Python" for more on PyLongObjects.)
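To make the "arbitrary size" point concrete, here is a small sketch you can run in CPython. It only uses sys.int_info and sys.getsizeof from the standard library; the names small, large and exact are just illustrative. The exact sizes printed depend on your build, so treat them as indicative.

```python
import sys

# CPython stores an int as a variable-length array of "digits";
# sys.int_info reports how many bits each digit holds (commonly 30 or 15).
bits_per_digit = sys.int_info.bits_per_digit

small = 2 ** 10
large = 2 ** 1000

# The result stays exact however large it gets: no overflow, no rounding.
exact = large ** 4

print(bits_per_digit)
print(sys.getsizeof(small), sys.getsizeof(large))
print(exact == 2 ** 4000)
```

The larger value needs more digits, so it occupies more memory, and every power operation has to walk that digit array instead of issuing a single machine instruction.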

Python float objects, on the contrary, can be transformed to a C double type (by using PyFloat_AsDouble) and operations can be performed using those native types. This is great because, after checking for relevant edge cases, it allows Python to use the platform's pow (C's pow, that is) to handle the actual exponentiation:

/* Now iv and iw are finite, iw is nonzero, and iv is
 * positive and not equal to 1.0.  We finally allow
 * the platform pow to step in and do the rest.
 */
errno = 0;
PyFPE_START_PROTECT("pow", return NULL)
ix = pow(iv, iw);

where iv and iw are our original PyFloatObjects as C doubles.
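If you want to reproduce the difference yourself, a rough benchmark along these lines should do. The helper name time_power and the base value 1234 are my own choices, not something from the question, and the absolute numbers will vary with your machine and CPython version. The exponent is passed in as a variable so that nothing can be precomputed at compile time.

```python
import timeit

def time_power(exponent, number=200_000):
    """Return the seconds needed to evaluate x ** exponent `number` times."""
    return timeit.timeit("x ** e", globals={"x": 1234, "e": exponent}, number=number)

int_seconds = time_power(4)      # int ** int goes through long_pow
float_seconds = time_power(4.0)  # int ** float ends up in float_pow

print(f"x ** 4   : {int_seconds:.4f} s")
print(f"x ** 4.0 : {float_seconds:.4f} s")
```

Run it a few times before drawing conclusions; a single measurement can be noisy.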

For what it's worth: Python 2.7.13 for me is a factor 2~3 faster, and shows the inverse behaviour.

The previous fact also explains the discrepancy between Python 2 and 3, so I thought I'd address this comment too because it is interesting.

In Python 2, you're using the old int object, which differs from the int object in Python 3 (all int objects in 3.x are of PyLongObject type). In Python 2, there's a distinction that depends on the value of the object (or on whether you use the L/l suffix):

# Python 2
type(30)  # <type 'int'>
type(30L) # <type 'long'>

The <type 'int'> you see here does the same thing floats do: it gets safely converted into a C long when exponentiation is performed on it (int_pow also hints to the compiler to put the values in a register if it can, which could make a difference):

static PyObject *
int_pow(PyIntObject *v, PyIntObject *w, PyIntObject *z)
{
    register long iv, iw, iz=0, ix, temp, prev;
/* Snipped for brevity */

This allows for a good speed gain.

To see how sluggish <type 'long'>s are in comparison to <type 'int'>s, if you wrapped the x name in a long call in Python 2 (essentially forcing it to use long_pow as in Python 3), the speed gain disappears:

# <type 'int'>
(python2) ➜ python -m timeit "for x in range(1000):" " x**2"       
10000 loops, best of 3: 116 usec per loop
# <type 'long'> 
(python2) ➜ python -m timeit "for x in range(1000):" " long(x)**2"
100 loops, best of 3: 2.12 msec per loop

Note that although one snippet converts the int to a long while the other does not (as pointed out by @pydsinger), this conversion is not what drives the slowdown; the implementation of long_pow is. (Time the statements with only long(x) to see.)

[...] it doesn't happen outside of the loop. [...] Any idea about that?

This is CPython's peephole optimizer folding the constants for you. You get exactly the same timings in either case, since no computation is needed at run time to find the result of the exponentiation, only loading of values:

dis.dis(compile('4 ** 4', '', 'exec'))
  1           0 LOAD_CONST               2 (256)
              3 POP_TOP
              4 LOAD_CONST               1 (None)
              7 RETURN_VALUE

Identical byte-code is generated for '4 ** 4.' with the only difference being that the LOAD_CONST loads the float 256.0 instead of the int 256:

dis.dis(compile('4 ** 4.', '', 'exec'))
  1           0 LOAD_CONST               3 (256.0)
              2 POP_TOP
              4 LOAD_CONST               2 (None)
              6 RETURN_VALUE

So the times are identical.
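You can check the folding without reading disassembly by looking at the constants stored on the compiled code object. This is a sketch with a made-up helper name (folded_constants). It compiles in 'eval' mode rather than 'exec', on the assumption that some newer CPython versions simply discard a bare constant expression statement, which would hide the folded value.

```python
def folded_constants(source):
    """Compile an expression and return its constants as (type name, value) pairs."""
    code = compile(source, "<demo>", "eval")
    return [(type(c).__name__, c) for c in code.co_consts]

int_consts = folded_constants("4 ** 4")     # both operands are literals
float_consts = folded_constants("4 ** 4.")  # literal int and literal float
name_consts = folded_constants("x ** 4")    # x is only known at run time

print(int_consts)
print(float_consts)
print(name_consts)
```

With literals on both sides the result (256 or 256.0) is already sitting in the constants, whereas with a name on the left the power has to be computed every time the code runs.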

* All of the above applies solely to CPython, the reference implementation of Python. Other implementations might perform differently.

If we look at the bytecode, we can see that the two expressions compile to identical instructions. The only difference is the type of the constant that becomes an argument of BINARY_POWER. So it's most certainly due to an int being converted to a floating point number down the line.

>>> def func(n):
...    return n**4
... 
>>> def func1(n):
...    return n**4.0
... 
>>> from dis import dis
>>> dis(func)
  2           0 LOAD_FAST                0 (n)
              3 LOAD_CONST               1 (4)
              6 BINARY_POWER
              7 RETURN_VALUE
>>> dis(func1)
  2           0 LOAD_FAST                0 (n)
              3 LOAD_CONST               1 (4.0)
              6 BINARY_POWER
              7 RETURN_VALUE
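The same comparison can be done programmatically with dis.get_instructions. The helper binary_ops below is hypothetical, and it filters on the "BINARY_" prefix because, as far as I know, newer CPython versions name the instruction BINARY_OP rather than BINARY_POWER.

```python
import dis

def func(n):
    return n ** 4

def func1(n):
    return n ** 4.0

def binary_ops(function):
    """List the (opname, argrepr) pairs of the binary-operation instructions."""
    return [
        (ins.opname, ins.argrepr)
        for ins in dis.get_instructions(function)
        if ins.opname.startswith("BINARY_")
    ]

print(binary_ops(func))
print(binary_ops(func1))
print(type(func(3)).__name__, type(func1(3)).__name__)
```

Both functions use the same power instruction; what differs is the type of the operand it receives, and that type decides which C routine ends up doing the work.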

Update: let's take a look at Objects/abstract.c in the CPython source code:

PyObject *
PyNumber_Power(PyObject *v, PyObject *w, PyObject *z)
{
    return ternary_op(v, w, z, NB_SLOT(nb_power), "** or pow()");
}

PyNumber_Power calls ternary_op, which is too long to paste here; it lives in the same file, Objects/abstract.c.

It calls the nb_power slot of x, passing y as an argument.
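At the Python level the slot shows up as __pow__ and __rpow__, so you can watch the hand-off happen. This is only a rough view of what ternary_op does in C, not a reimplementation of it; the variable names are mine.

```python
# int's power method does not know what to do with a float exponent...
int_attempt = (3).__pow__(4.0)

# ...so the float side gets its turn through the reflected method.
float_attempt = (4.0).__rpow__(3)

print(int_attempt)
print(float_attempt)
print(3 ** 4.0)
```

The int side declines with NotImplemented, and the float side then computes the result as a float.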

Finally, in float_pow() at line 686 of Objects/floatobject.c we see that arguments are converted to a C double right before the actual operation:

static PyObject *
float_pow(PyObject *v, PyObject *w, PyObject *z)
{
    double iv, iw, ix;
    int negate_result = 0;

    if ((PyObject *)z != Py_None) {
        PyErr_SetString(PyExc_TypeError, "pow() 3rd argument not "
            "allowed unless all arguments are integers");
        return NULL;
    }

    CONVERT_TO_DOUBLE(v, iv);
    CONVERT_TO_DOUBLE(w, iw);
    ...
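Two things in this snippet are observable from Python: the third argument is rejected as soon as a float is involved, and an int operand is converted so that the result is a float. A small sketch (try_pow is a hypothetical helper):

```python
def try_pow(base, exponent, modulus):
    """Return pow(base, exponent, modulus), or the exception name if it fails."""
    try:
        return pow(base, exponent, modulus)
    except TypeError as exc:
        return type(exc).__name__

print(try_pow(2, 3, 5))       # all integers: modular exponentiation works
print(try_pow(2.0, 3.0, 5))   # floats: the z != Py_None check fires
print(2 ** 4.0, 2.0 ** 4.0)   # the int base is converted to a double first
```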
