Writing Sequential Code

In the software Introduction, we showed how ordinary-looking Python code could automatically be parallelized. For example,

>>> import pydfmux
>>> hwm = pydfmux.load_session('''
... !HardwareMap
... - !Dfmux { serial: "004" }
... - !Dfmux { serial: "019" }
... ''')

>>> dfmuxes = hwm.query(pydfmux.Dfmux)
>>> print dfmuxes.get_fir_stage()
[6, 6]

In this example, we dispatched the same get_fir_stage() call to two IceBoards in parallel. This is a simple but ideal case for dispatching calls. The same example could have been coded sequentially:

>>> dfmuxes = hwm.query(pydfmux.Dfmux)
>>> for d in dfmuxes:
...   print d.get_fir_stage()
6
6

As you can imagine, the parallel version performs better with a large number of Dfmuxes.

This example, however, is too trivial to be really useful. In Algorithms, we described how to succinctly parallelize code, dispatching it asynchronously to a number of IceBoard resources. In the following sections, we focus on the semantics of sequential dispatch, which for interactions with a single board is often more efficient than blind parallelization.

Sequential Calls

Issuing calls in parallel is not always the best approach. For example: let’s say we wish to reset a bunch of channel parameters to 0 on one or more Dfmux boards. In this case, we’re likely to make a huge number of calls to each board, and it’s most efficient to arrange these calls board-by-board.

Important

Blindly parallelizing is not optimal because it stresses the ARM. There are several possible bottlenecks in the system, but one of the prime suspects is the ability of the ARM to manage huge bursts of network traffic. The ARM’s application server uses a small number of threads, and serves one request at a time. It is much better to minimize the number of requests, and to make each one as useful as possible.

For example, consider the following sequential code:

>>> import pydfmux
>>> hwm = pydfmux.load_session('!HardwareMap [ !Dfmux { serial: "004" } ]')

>>> d = hwm.query(pydfmux.Dfmux).one()

>>> def clear_channel(d):
...    for mezz in d.mezzanines:
...        for m in mezz.modules:
...            for c in m.channels:
...                c.set_frequency(0, c.iceboard.UNITS.HZ, d.TARGET.CARRIER)
...                c.set_frequency(0, c.iceboard.UNITS.HZ, d.TARGET.DEMOD)
...                c.set_amplitude(0, c.iceboard.UNITS.NORMALIZED, d.TARGET.CARRIER)
...                c.set_amplitude(0, c.iceboard.UNITS.NORMALIZED, d.TARGET.NULLER)
>>> clear_channel(d)

This invocation takes 3.7s on a single board. It could have been dispatched on multiple boards as follows:

>>> dfmuxes = hwm.query(pydfmux.Dfmux)
>>> dfmuxes.call_with(clear_channel)

When called against multiple boards, we would still get parallel behaviour. However, in both single- and multiple-board cases, the calls are extremely inefficient.

Why is this example inefficient? Simple: the loop makes four calls per channel, initiating 4,096 calls per board. Each of these calls involves setting up and tearing down a short-lived, call-specific HTTP session, incurring significant overhead on the PC and especially on the IceBoard’s ARM core.

It’s also possible to re-use a single HTTP session, batching many commands into one request. This can be done as follows:

>>> import pydfmux
>>> hwm = pydfmux.load_session('!HardwareMap [ !Dfmux { serial: "004" } ]')

>>> d = hwm.query(pydfmux.Dfmux).one()

>>> def clear_channel(d):
...    with d.tuber_context() as ctx:
...        for (i, mezz) in d.mezzanine.items():
...            for (j, m) in mezz.module.items():
...                for (k, c) in m.channel.items():
...                    ctx.set_frequency(0, c.iceboard.UNITS.HZ, d.TARGET.CARRIER, k, j, i)
...                    ctx.set_frequency(0, c.iceboard.UNITS.HZ, d.TARGET.DEMOD, k, j, i)
...                    ctx.set_amplitude(0, c.iceboard.UNITS.NORMALIZED, d.TARGET.CARRIER, k, j, i)
...                    ctx.set_amplitude(0, c.iceboard.UNITS.NORMALIZED, d.TARGET.NULLER, k, j, i)
>>> clear_channel(d)

The with statement creates a Context Manager, which Python provides as a mechanism to “wrap” blocks of code with entry and exit behaviour. Within this context block, any function calls made against the context variable ctx are not actually executed; they are merely queued, each returning a placeholder. When we leave the context, the queued calls are dispatched as a single HTTP interaction, with much lower overhead.
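
To make the queuing mechanics concrete, here is a minimal sketch of the pattern. It is not pydfmux’s actual implementation (the real tuber_context() serializes the queue into a single HTTP request rather than executing anything locally), but it shows how a context object can capture calls via __getattr__, hand back Future placeholders, and dispatch the whole batch on exit:

from concurrent.futures import Future

class BatchingContext(object):
    """Illustrative only: queue attribute calls and dispatch them
    together on exit. A real tuber context sends the queue to the
    board as one HTTP request; this sketch just runs the queued
    calls locally in a single pass."""

    def __init__(self, target):
        self._target = target
        self._queue = []

    def __getattr__(self, name):
        def queue_call(*args, **kwargs):
            # Nothing executes here; we only record the call and
            # return a placeholder for its eventual result.
            future = Future()
            self._queue.append((name, args, kwargs, future))
            return future
        return queue_call

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # This is where the "single HTTP interaction" would happen.
        for (name, args, kwargs, future) in self._queue:
            try:
                future.set_result(getattr(self._target, name)(*args, **kwargs))
            except Exception as e:
                future.set_exception(e)

Within the with block, every ctx.some_method(...) call merely grows the queue; all of the actual work is deferred to __exit__.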

This invocation takes 1.2s, for a speedup of ~3x. More importantly, it imposes less load on each ARM, the network, and the dispatching PC, all of which are potential bottlenecks as the experiment grows.

You can, of course, still issue the top-level call in parallel as before:

>>> dfmuxes = hwm.query(pydfmux.Dfmux)
>>> dfmuxes.call_with(clear_channel)

This combination (parallel calls at the top level, with context managers issuing serial calls on each dfmux) is efficient. Other considerations aside, it’s best to parallelize calls across boards, and to combine the calls on a single board using a context manager. Of course, it’s more important to produce legible code that performs adequately.

Using Tuber Contexts

Above, we used a context manager to speed up calls. However, in this example, none of the calls actually returned anything. What if we needed the result? For example, imagine querying temperatures from a number of on-board sensors:

>>> import pydfmux
>>> hwm = pydfmux.load_session('!HardwareMap [ !Dfmux { serial: "004" } ]')
>>> d = hwm.query(pydfmux.Dfmux).one()

>>> sensors = (
...     d.TEMPERATURE_SENSOR.MB_PHY, d.TEMPERATURE_SENSOR.MB_ARM,
...     d.TEMPERATURE_SENSOR.MB_FPGA, d.TEMPERATURE_SENSOR.MB_FPGA_DIE,
...     d.TEMPERATURE_SENSOR.MB_POWER)

>>> with d.tuber_context() as ctx:
...     results = {s: ctx.get_motherboard_temperature(s) for s in sensors}

What does “results” contain in this case? It can’t be a dictionary of temperatures, since the temperatures themselves weren’t actually available when the dictionary was created.

>>> print results
{'MOTHERBOARD_TEMPERATURE_ARM': <Future at 0x7f0258e61ad0 state=finished returned float>,
 'MOTHERBOARD_TEMPERATURE_FPGA': <Future at 0x7f0258e61dd0 state=finished returned float>,
 'MOTHERBOARD_TEMPERATURE_FPGA_DIE': <Future at 0x7f0258e61150 state=finished returned float>,
 'MOTHERBOARD_TEMPERATURE_PHY': <Future at 0x7f0257e56690 state=finished returned float>,
 'MOTHERBOARD_TEMPERATURE_POWER': <Future at 0x7f0258e61fd0 state=finished returned float>}

Each function call returns a Future object instead of the expected numeric type. Futures are a standard way (used by tornado, concurrent.futures, and Python 3.4’s asyncio package) of providing a placeholder for a result that isn’t available yet.
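
Since the same Future type ships with the standard library, the semantics are easy to experiment with away from any hardware. This sketch uses concurrent.futures directly, not pydfmux:

>>> from concurrent.futures import ThreadPoolExecutor
>>> pool = ThreadPoolExecutor(max_workers=1)
>>> f = pool.submit(lambda: 41.0)   # returns a Future immediately
>>> f.result()                      # blocks, if needed, until the value exists
41.0
>>> f
<Future at 0x... state=finished returned float>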

The numeric results can be retrieved by querying the Future objects:

>>> actual_results = {k: v.result() for k, v in results.items()}
>>> print actual_results
{'MOTHERBOARD_TEMPERATURE_FPGA': 30.5, 'MOTHERBOARD_TEMPERATURE_FPGA_DIE': 57.7989151000977, 'MOTHERBOARD_TEMPERATURE_PHY': 39.5, 'MOTHERBOARD_TEMPERATURE_ARM': 41.0, 'MOTHERBOARD_TEMPERATURE_POWER': 31.0}

If a call generates an exception, this exception will be raised either when the context closes, or when the corresponding Future’s result() method is invoked (whichever happens first). The exception can also be retrieved using the Future’s exception() method.
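
The standard-library Future shows this exception behaviour in isolation as well (again a concurrent.futures sketch, not a tuber context):

>>> from concurrent.futures import ThreadPoolExecutor
>>> pool = ThreadPoolExecutor(max_workers=1)
>>> f = pool.submit(lambda: 1.0 / 0.0)   # this "call" fails
>>> f.exception()                        # retrieve without raising
ZeroDivisionError('float division by zero',)
>>> f.result()                           # re-raises the stored exception
Traceback (most recent call last):
  ...
ZeroDivisionError: float division by zero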

Advanced Tuber Contexts

Behind the scenes, our asynchronous code (e.g. Tuber contexts, and parallel dispatch within the HardwareMap code) makes heavy use of Tornado event loops. Occasionally, it is useful to expose these event loops directly. The following example is taken from the on-board event-monitoring code.

Tip

In general, you do not have to write code like this directly. However, at least some developers within the collaboration need to be aware that it exists and is occasionally well-motivated.

import logging
import tornado.gen
import tornado.ioloop

logger = logging.getLogger(__name__)

# Note: 'ib' (the IceBoard object) and 'fpga_panic' are defined elsewhere
# in the on-board code this excerpt is taken from.

@tornado.gen.coroutine
def check_motherboard():
    try:
        # It might eventually be useful to have multiple set points per rail here.
        MB_TEMPERATURE_ACTIONS = {
            ib.TEMPERATURE_SENSOR.MB_PHY: (80, None),
            ib.TEMPERATURE_SENSOR.MB_ARM: (80, None),
            ib.TEMPERATURE_SENSOR.MB_FPGA: (80, fpga_panic),
            ib.TEMPERATURE_SENSOR.MB_FPGA_DIE: (80, fpga_panic),
            ib.TEMPERATURE_SENSOR.MB_POWER: (80, power_panic),
        }

        # Build 'results': a dictionary describing each pending temperature query.
        results = {}
        with ib.tuber_context() as ctx:
            for (sensor, (limit, action)) in MB_TEMPERATURE_ACTIONS.items():
                results[sensor] = {
                        "future": ctx.get_motherboard_temperature(sensor),
                        "sensor": sensor,
                        "limit": limit,
                        "action": action,
                }
            yield ctx._tuber_flush_async()

        # Check against permitted limits
        for (sensor, d) in results.items():
            temp = d['future'].result()
            limit = d['limit']
            action = d['action']

            if temp > limit:
                print "Eek! %f > %f" % (temp, limit)

                if action:
                    action()

    except Exception as e:
        logger.critical("Eek! %r" % e)

@tornado.gen.coroutine
def power_panic():
    logger.critical("Power supply panic!")

    # Make sure mezzanines are powered off.
    with ib.tuber_context() as ctx:
        ctx.set_mezzanine_power(False, 1)
        ctx.set_mezzanine_power(False, 2)
        yield ctx._tuber_flush_async()

if __name__=='__main__':
   iol = tornado.ioloop.IOLoop.instance()

   # Install a temperature checker
   temp_cb = tornado.ioloop.PeriodicCallback(check_motherboard, 10000)
   temp_cb.start()

   # Enter event loop (never returns)
   iol.start()

There are two critical features in this example:

  1. First, the __main__ function is structured asynchronously; it creates an I/O loop, and calls methods decorated with the @coroutine decorator. This is the conventional way of writing asynchronous code, common to most such Python frameworks.

    Important

    We go to some trouble to hide event loops, even though it is recommended that an asynchronous program’s top level conform to the above pattern (with “iol.start()” controlling dispatch, and with @coroutine-decorated methods reaching down wherever an asynchronous control path exists.) It is unclear to me whether we would be better off requiring top-level scripts to be explicitly asynchronous – I expect it would be an uphill battle.

  2. Second, the context manager code includes odd-looking yield statements. Rather than hiding the context manager magic, these yields expose it to higher-level code.

    In this example, the _tuber_flush_async() method converts the context manager’s queue of pending functions into an asynchronous call, and returns control to the top-level Python dispatcher (where it can continue running other scheduled tasks, e.g. monitoring voltage rails).

    The top-level dispatcher will resume execution of our code block once another piece of code yields control and our asynchronous results are available.
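
To see this interleaving in isolation, consider the following self-contained sketch. It uses tornado.gen.sleep as a hypothetical stand-in for an asynchronous operation like ctx._tuber_flush_async(); while one coroutine is suspended at its yield, the IOLoop is free to run the other:

import tornado.gen
import tornado.ioloop

@tornado.gen.coroutine
def poll(name, delay):
    # Suspend here; control returns to the IOLoop, which can run
    # other scheduled coroutines in the meantime.
    yield tornado.gen.sleep(delay)
    print("%s finished" % name)

@tornado.gen.coroutine
def main():
    # Yielding a list of futures waits for all of them; the two
    # polls run concurrently on a single thread.
    yield [poll("temperatures", 0.2), poll("voltage rails", 0.1)]

if __name__ == '__main__':
    tornado.ioloop.IOLoop.instance().run_sync(main)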