Benchmarking Compilers on a Cortex-M85 MCU with the SIMD Helium Instruction Set

Have you heard about the Helium instruction set in Cortex-M52, M55, and M85 MCUs? Helium, also known as Arm’s M-Profile Vector Extension (MVE), is a SIMD (Single Instruction, Multiple Data) instruction set: a single instruction operates on several data elements packed into a 128-bit vector instead of on one element at a time. This is particularly advantageous for UI-related tasks, where thousands of pixels need to be filled or blended efficiently. Let’s explore how well different compilers can use this instruction set!
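To make the pixel-filling and blending workload concrete, here is a minimal sketch in plain C. The function names `fill_rgb565` and `blend_u8` are hypothetical and are not LVGL's actual routines. Each loop iteration is independent of the others, which is the property a vectorizing compiler needs in order to map the loop onto Helium's 128-bit vectors (eight 16-bit pixels or sixteen 8-bit channels at a time). Whether a given compiler actually does so depends on its version and options.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: fill `count` RGB565 pixels with one color.
 * Every iteration performs the same independent store, so the loop is a
 * natural candidate for SIMD code generation. */
void fill_rgb565(uint16_t *restrict dst, uint16_t color, size_t count)
{
    assert(dst != NULL || count == 0);
    for (size_t i = 0; i < count; i++) {
        dst[i] = color;
    }
}

/* Hypothetical helper: blend 8-bit source channels over the destination
 * with a constant opacity (0 keeps the destination, 255 takes the source). */
void blend_u8(uint8_t *restrict dst, const uint8_t *restrict src,
              uint8_t opa, size_t count)
{
    assert((dst != NULL && src != NULL) || count == 0);
    for (size_t i = 0; i < count; i++) {
        /* Weighted average of source and destination; the weights sum to 255. */
        dst[i] = (uint8_t)((src[i] * opa + dst[i] * (255 - opa)) / 255);
    }
}
```

The `restrict` qualifiers tell the compiler that the buffers do not overlap, which removes one common obstacle to auto-vectorization.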

Glossary

Let’s start with a glossary to clarify some key terms:

Test Configuration

Hardware

We tested the Helium instruction set on a Renesas EK-RA8D1 development board, featuring:

Software

During the benchmark, we tested the GCC, LLVM, and Ac6 (Arm Compiler 6) compilers in various configurations. Ready-to-use LVGL projects for the Renesas EK-RA8D1 with each compiler can be found at the following links:

LVGL

For maximum speed, LVGL was configured in partial rendering mode with a 64 kB buffer placed in TCM (tightly coupled memory). Placing the buffer in other memory regions was slower and left Helium unable to reach its full potential.
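To get a feel for what a 64 kB partial buffer means in practice, the following sketch computes how many display rows fit in the buffer and how many chunks are needed per full-screen refresh. The helper names are hypothetical, and the 16-bit (2 bytes per pixel) color depth is an assumption, since the color format is not stated here; the 480x854 resolution is the one used in the benchmark.

```c
#include <assert.h>
#include <stdint.h>

/* Number of complete display rows that fit in a partial render buffer. */
uint32_t rows_per_buffer(uint32_t buf_bytes, uint32_t hor_res,
                         uint32_t bytes_per_px)
{
    assert(hor_res > 0 && bytes_per_px > 0);
    return buf_bytes / (hor_res * bytes_per_px);
}

/* Number of buffer-sized chunks needed to cover the whole screen height,
 * rounded up so that the last, shorter chunk is counted too. */
uint32_t chunks_per_frame(uint32_t ver_res, uint32_t rows)
{
    assert(rows > 0);
    return (ver_res + rows - 1) / rows;
}
```

Under these assumptions, `rows_per_buffer(64 * 1024, 480, 2)` gives 68 rows, and `chunks_per_frame(854, 68)` gives 13 chunks for a full-screen redraw.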

LV_USE_OS was set to LV_OS_NONE to eliminate any overhead from FreeRTOS in LVGL’s rendering pipeline.

We used lv_demo_benchmark to measure performance.

LVGL was slightly modified to measure rendering times with 0.1 ms precision instead of 1 ms.
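The exact modification is not shown here. As one possible approach, elapsed time could be taken from a free-running CPU cycle counter and converted to 0.1 ms units. The sketch below shows only that conversion. The function name is hypothetical, and the clock frequency is a parameter because the actual core clock depends on the board configuration.

```c
#include <assert.h>
#include <stdint.h>

/* Convert an elapsed CPU cycle count to tenths of a millisecond
 * (100 us units), rounded to the nearest unit.
 * Assumption: `cycles` comes from a free-running cycle counter that did
 * not wrap during the measured interval. */
uint32_t cycles_to_tenths_ms(uint32_t cycles, uint32_t cpu_hz)
{
    assert(cpu_hz >= 10000u);
    uint64_t cycles_per_tenth = cpu_hz / 10000u; /* cycles in 0.1 ms */
    return (uint32_t)((cycles + cycles_per_tenth / 2u) / cycles_per_tenth);
}
```

For example, assuming a 480 MHz core clock, 4,800,000 cycles convert to 100 units, that is, 10.0 ms.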

Results

The following chart shows the differences in average rendering times using the LVGL benchmark demo:

Benchmark results in various configurations

So, what do the results tell us?

  1. GCC doesn’t support Helium, so it served as our baseline reference.
  2. LLVM 17 and LLVM 18 both support Helium. Both were slightly faster than GCC, though LLVM 18 was marginally slower than LLVM 17.
  3. Enabling Arm2D provided a performance boost of roughly 20%. In this case, LLVM 18 was slightly faster than LLVM 17.
  4. Ac6, with Helium support disabled (similar to GCC), was slightly faster than GCC.
  5. Ac6 with Helium support but without Arm2D utilized Helium better than LLVM 17 or 18.
  6. Ac6 combined with Arm2D resulted in a 26% performance boost.
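The percentages above can be read as reductions in average rendering time relative to a reference configuration. That reading is an assumption, since the exact formula is not given. A minimal sketch of the calculation, with a hypothetical helper name:

```c
#include <assert.h>

/* Relative improvement in percent: how much shorter the measured average
 * render time is compared with the baseline configuration.
 * A negative result means the measured configuration is slower. */
double boost_percent(double baseline_ms, double measured_ms)
{
    assert(baseline_ms > 0.0);
    return (baseline_ms - measured_ms) / baseline_ms * 100.0;
}
```

With illustrative values only (not measurements from this benchmark), a drop from 10.0 ms to 8.0 ms corresponds to a boost of about 20%.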

Conclusion

Rendering times were around 10 ms across all configurations on a 480x854 screen. This indicates that even without a GPU, software rendering on this MCU is sufficiently fast for most use cases.

This suggests that MCUs, even those designed for non-UI applications (e.g., motor control), can effectively drive higher-resolution screens with rich graphics. To maximize the performance of your MCU, you can switch to LLVM and add Arm2D, both of which are available for free. For those needing even more performance, Arm’s commercial Ac6 compiler is an excellent option (an evaluation version is available).