Blog

Performance Counter Performance

Most of my time these past few months has been taken up almost completely by preparations for the GNU Radio 3.7 release. It's a huge task that has needed a lot of work and careful scrutiny.

But I managed to find some time today to look into a question that's been bugging me for a bit now. A few months ago, we introduced Performance Counters into GNU Radio. These are a way of internally probing statistics about the running radio system. We currently define 5 PCs for the GNU Radio blocks, and each time a block's work function is called, the instantaneous value, average, and variance of each of the five PCs are calculated and stored. The PCs are really useful for performance analysis of the radio, possibly even leading to an understanding of how to dynamically adjust the flowgraph to alleviate performance issues. But the question that keeps coming up is, "what is the performance cost of calculating the performance counters?"

So I sat down to figure that out, along with a little side project I was also interested in. My methodology was just to create a simple, finite flowgraph that would take some time to process. I would then compare the run time of the flowgraph with the performance counters disabled at compile time, disabled at runtime, and enabled. Disabling at compile time uses #ifdef statements to remove the calls from the scheduler completely, so it's like the PCs aren't there at all (in fact, they aren't). Disabling at runtime means that the PCs are compiled into GNU Radio but they can be disabled with a configuration file or environmental variable. This means we do a simple runtime if() check on this configuration value to determine if we calculate the counters or not.
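For reference, the runtime toggle is just a preferences entry. A sketch of the relevant snippet in the GNU Radio config file (e.g. ~/.gnuradio/config.conf) — section and key names quoted from memory, so treat them as an assumption:

```
[PerfCounters]
on = True
```

If I recall the naming convention correctly, an environment variable of the form GR_CONF_PERFCOUNTERS_ON can override the file setting.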

The hypothesis here is that disabling at compile time is the baseline control where we add no extra computation. Compiling them in but turning them off at runtime through the config file will take a small hit because of the extra time to perform the if check. Finally, actually computing the PCs will take the most time to run the flowgraph because of all of the calculations required for the three statistics of each of the 5 PCs for every block.
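To make the per-call cost concrete, here's a rough Python sketch (not GNU Radio's actual implementation, which lives in the C++ scheduler) of the kind of running-statistics update each counter needs on every work() call — an online (Welford-style) mean and variance:

```python
class PerfCounter:
    """Running statistics for one performance counter (sketch only)."""

    def __init__(self):
        self.inst = 0.0   # instantaneous (most recent) value
        self.avg = 0.0    # running average
        self._m2 = 0.0    # sum of squared deviations from the mean
        self.n = 0        # number of work() calls seen

    def update(self, value):
        """Update all three statistics with one new observation."""
        self.n += 1
        self.inst = value
        delta = value - self.avg
        self.avg += delta / self.n
        # Welford's online update: uses the mean both before and after
        # this sample, which keeps the variance numerically stable.
        self._m2 += delta * (value - self.avg)

    @property
    def var(self):
        """Population variance of all values seen so far."""
        return self._m2 / self.n if self.n else 0.0
```

This is the work being done three statistics at a time, five counters at a time, per block, on every single call to work — which is exactly why the cost question matters.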

The flowgraph used is shown here and is a very simple script:

-------------------------------------------------------------------------------

import numpy
from gnuradio import gr, blocks, filter

def main():
    N = 1e9  # total number of items to push through the graph
    taps = numpy.random.random(100)  # 100 arbitrary filter taps

    src = blocks.null_source(gr.sizeof_gr_complex)
    hed = blocks.head(gr.sizeof_gr_complex, int(N))
    op  = filter.fir_filter_ccf(1, taps)
    snk = blocks.null_sink(gr.sizeof_gr_complex)

    # Give the FIR filter its own core; the cheap blocks share core 6.
    op.set_processor_affinity([7,])
    src.set_processor_affinity([6,])
    hed.set_processor_affinity([6,])
    #snk.set_processor_affinity([6,])

    tb = gr.top_block()
    tb.connect(src, hed, op, snk)
    tb.run()

if __name__ == "__main__":
    main()

-------------------------------------------------------------------------------

It sets up a null source and sink around a FIR filter with 100 arbitrary taps. The flowgraph runs for N items (1 billion here). I then used the Linux time command to run the flowgraph, keeping the real time for each of 10 runs.


Run times in seconds; % Diff. is relative to the PCs Disabled column.

              PCs Disabled   PCs On    PCs Off   Affinity Off
              54.61          54.632    55.141    62.648
              54.35          54.655    54.451    61.206
              54.443         55.319    54.388    61.787
              54.337         54.718    54.348    62.355
              54.309         54.415    55.467    61.729
              55.055         54.318    54.314    61.526
              54.281         54.651    54.581    62.036
              54.935         54.487    55.226    61.788
              54.316         54.595    54.283    61.817
              54.956         54.785    54.593    62.255
 Avg.         54.5592        54.6575   54.6792   61.9147
 Min.         54.281         54.318    54.283    61.206
 Max.         55.055         55.319    55.467    62.648

 % Diff. Avg  0.000          0.180     0.220     13.482
 % Diff. Min  0.000          0.068     0.004     12.758
 % Diff. Max  0.000          0.480     0.748     13.792
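For clarity, the % Diff. rows are computed relative to the PCs Disabled (baseline) column. For example, for the averages:

```python
def pct_diff(value, baseline):
    """Percent difference of a run time relative to the baseline column."""
    return 100.0 * (value - baseline) / baseline

# Averages from the table above; PCs Disabled (54.5592 s) is the baseline.
print(f"PCs On:       {pct_diff(54.6575, 54.5592):.3f}%")   # 0.180%
print(f"PCs Off:      {pct_diff(54.6792, 54.5592):.3f}%")   # 0.220%
print(f"Affinity Off: {pct_diff(61.9147, 54.5592):.3f}%")   # 13.482%
```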

The results were just about as predicted, and surprising only in a good way. As predicted, having the PCs compiled out of the scheduler was in fact the fastest. Looking only at the minimum run times of the 10, turning the PCs off at runtime was the next fastest, and actually computing the PCs was the slowest. But what's nice to see here is that we're talking much less than 1% difference in speed. Somewhat surprisingly, though, on the average and maximum values, disabling at runtime performed worse than leaving the PCs enabled. I can only guess that poor branch prediction on the runtime check is to blame.

Whatever the reasons for all of this, the take-away is clear: the Performance Counters barely impact the performance of a flowgraph, at least for small graphs. But if we're calculating the PCs on all blocks all the time, what happens when we increase the number of blocks in the flowgraph? I'll address that in a second.

First, I wanted to look at what we can do with the thread affinity concept, also recently introduced. I have an 8-core machine, and the source, sink, and head block take very little processing power. So I pushed all of those onto one core and gave the FIR filter its own core to use. All of the PC tests were done with this thread affinity set. I then turned the thread affinity off while rerunning the compile-time-disabled experiment. The results are fairly shocking: we see a decrease in the speed of this flowgraph of around 13%, just from no longer forcing the OS to keep the threads locked to specific cores. We still need to study this more, such as what happens when we have more blocks than cores as well as how best to map the blocks to the cores, but the evidence here of its benefits is pretty exciting.

Now, back to the many-block problem. For this, I generate 20 FIR filters, each with different taps (so we can't have cache hiding issues). Because we now have more blocks than I have cores, I'm not going to study the thread affinity concept here; I'll save that for later. Also, I reduced N to 100 million to save a bit of time, since the results have been pretty consistent.
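The 20-filter graph is just the earlier script with the single FIR filter replaced by a chain. A small helper sketches the wiring — build_chain is a name I'm making up for illustration, not a GNU Radio API — where each call to make_filter returns a fresh filter block with its own random taps:

```python
def build_chain(tb, src, snk, make_filter, n_filters=20):
    """Wire src -> filter_0 -> ... -> filter_{n-1} -> snk on top block tb.

    make_filter(i) must return a NEW block for stage i, e.g.
    filter.fir_filter_ccf(1, numpy.random.random(100)), so that every
    stage gets different taps and nothing hides in the cache.
    """
    last = src
    for i in range(n_filters):
        op = make_filter(i)
        tb.connect(last, op)  # connect the previous stage to this one
        last = op
    tb.connect(last, snk)
```

With the blocks from the script above, this would be called as build_chain(tb, hed, snk, lambda i: filter.fir_filter_ccf(1, numpy.random.random(100))) after connecting src to hed.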


Run times in seconds; % Diff. is relative to the PCs Disabled column.

              PCs Disabled   PCs Off   PCs On
              29.670         29.868    29.725
              29.754         29.721    29.787
              29.702         29.735    29.800
              29.708         29.700    29.767
              29.592         29.932    30.023
              29.595         29.877    29.784
              29.564         29.964    29.873
              29.685         29.926    29.794
              29.719         29.973    29.902
              29.646         29.881    29.861
 Avg.         29.664         29.858    29.832
 Min.         29.564         29.700    29.725
 Max.         29.754         29.973    30.023

 % Diff. Avg  0.000          0.655     0.567
 % Diff. Min  0.000          0.460     0.545
 % Diff. Max  0.000          0.736     0.904

These results show us that even for a fairly complex flowgraph of more than 20 blocks, the performance penalty is around half a percent on average and just under one percent in the worst case.
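As a rough back-of-the-envelope on that worst case (assuming 23 blocks: the 20 filters plus the null source, head, and null sink):

```python
# Worst-case overhead with PCs compiled in: 0.904% of the baseline run.
baseline = 29.664    # avg "PCs Disabled" run time, seconds
overhead_s = baseline * 0.904 / 100.0
n_blocks = 23        # 20 filters + null_source + head + null_sink
per_block_ms = overhead_s / n_blocks * 1000.0
print(f"total overhead ~{overhead_s:.3f} s, ~{per_block_ms:.1f} ms per block")
```

So even in the worst run, the PC bookkeeping amounts to roughly a quarter of a second spread across the whole 30-second, 23-block graph.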

My takeaway from this is that we are probably being a bit overzealous by having both the compile-time and runtime configuration of the performance counters. It also looks as though the runtime toggle almost hurts more than it helps. This might make me change the defaults so the PCs are enabled at compile time by default instead of disabled. Users who want to reclaim that half a percent of efficiency in their graphs can go to the trouble of disabling them manually.