I gave a talk on “Python specialized bytecode” at Pykonik #70, where I also walked through the “pycjail returns” challenge from ångstrom CTF 2024. The video can be found here and its slides here.

In this blog post, I am going to give a TL;DR of the idea of specialized bytecode in Python. For details on “Python jails” or the “pycjail returns” challenge solution, check out the talk linked above :).

Python specialized bytecode

Apart from going over a capture-the-flag cybersecurity competition challenge, the main point of the talk is that there is an ongoing effort to make CPython faster, and a big part of it is detailed in PEP 659. Some of it is already implemented in Python 3.11 and 3.12. The gist of it is that CPython added “specialized bytecode” instructions along with tracing of how functions are used. Now, when a function is “hot”, i.e. it is called many times, its bytecode is rewritten with “specialized bytecode” instructions.

This can be seen below:

In [1]: import dis  # import module for disassembling Python bytecode

In [2]: def add(x, y):
   ...:     return x + y
   ...:

In [3]: dis.dis(add, adaptive=True, show_caches=True)
  1           0 RESUME                   0

  2           2 LOAD_FAST__LOAD_FAST     0 (x)
              4 LOAD_FAST                1 (y)
              6 BINARY_OP                0 (+)
              8 CACHE                    0 (counter: 17)
             10 RETURN_VALUE

When we first disassemble this function, one of its opcodes has already been optimized: LOAD_FAST became LOAD_FAST__LOAD_FAST. This is a “superinstruction” (also called a “superoperator” or “superopcode”) which is faster than executing two separate LOAD_FAST operations. The second LOAD_FAST instruction must be kept, since LOAD_FAST__LOAD_FAST reads its second load argument from it (the name of the variable to fetch, which is y; this can be seen here in the CPython code).
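To see the difference yourself, you can compare the default disassembly (the bytecode as emitted by the compiler) with the adaptive one (the bytecode the interpreter actually executes). A minimal sketch, assuming CPython 3.11+ where the `adaptive` parameter of the dis module exists:

```python
import dis
import sys

def add(x, y):
    return x + y

# Default view: the bytecode exactly as emitted by the compiler.
plain = [ins.opname for ins in dis.get_instructions(add)]
print(plain)

# Adaptive view: the (possibly quickened) bytecode that is actually
# executed; the `adaptive` parameter exists only on CPython 3.11+.
if sys.version_info >= (3, 11):
    adaptive = [ins.opname for ins in dis.get_instructions(add, adaptive=True)]
    print(adaptive)
```

Note that the exact instruction names differ between CPython versions (e.g. the double-underscore superinstructions are specific to 3.11), so treat the printed names as version-dependent.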

Now, let's see what happens when we execute the add function with int arguments lots of times:

In [4]: for i in range(1000): add(i, i)

In [5]: dis.dis(add, adaptive=True, show_caches=True)
  1           0 RESUME                   0

  2           2 LOAD_FAST__LOAD_FAST     0 (x)
              4 LOAD_FAST                1 (y)
              6 BINARY_OP_ADD_INT        0 (+)
              8 CACHE                    0 (counter: 832)
             10 RETURN_VALUE

The BINARY_OP instruction was replaced with BINARY_OP_ADD_INT, which adds two integers faster. Of course, the instruction still checks the argument types; if they aren't both integers, execution is deoptimized back to the generic BINARY_OP opcode (which dispatches based on the argument types). This can be seen in CPython's C implementation of this opcode:

        TARGET(BINARY_OP_ADD_INT) {
            PyObject *right = stack_pointer[-1];
            PyObject *left = stack_pointer[-2];
            PyObject *sum;
            #line 385 "Python/bytecodes.c"
            // HERE we deoptimize the opcode if both args 
            // are not integers (CPython's PyLong type)
            DEOPT_IF(!PyLong_CheckExact(left), BINARY_OP);
            DEOPT_IF(Py_TYPE(right) != Py_TYPE(left), BINARY_OP);
            // (...)
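These DEOPT_IF guards mean the specialization is purely an optimization: a function whose bytecode was specialized for ints still works correctly for any other argument types, because the guard fails and the generic BINARY_OP path runs instead. A small sketch of this from the Python side:

```python
def add(x, y):
    return x + y

# Warm the function up with int arguments so CPython (3.11+) can
# specialize its BINARY_OP to BINARY_OP_ADD_INT.
for i in range(1000):
    add(i, i)

# The DEOPT_IF guards make specialization transparent: argument
# types the specialized instruction can't handle simply fall back
# to the generic BINARY_OP dispatch, so results stay correct.
print(add(1.5, 2.5))   # -> 4.0
print(add([1], [2]))   # -> [1, 2]
```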

Now, what happens if we execute the same function with string arguments many times?

In [6]: for i in range(1000): add("Hello", " world")

In [7]: dis.dis(add, adaptive=True, show_caches=True)
  1           0 RESUME                   0

  2           2 LOAD_FAST__LOAD_FAST     0 (x)
              4 LOAD_FAST                1 (y)
              6 BINARY_OP_ADD_UNICODE    0 (+)
              8 CACHE                    0 (counter: 832)
             10 RETURN_VALUE

As we can see, the function got re-specialized for the case where both arguments are unicode strings, and so BINARY_OP_ADD_UNICODE is used.

Want to learn more?

If you want to learn more about all of this, I recommend going through the talk as well as reading the Python 3.11 release's “What's New” notes, which also describe all the speedups achieved with this approach.
