PYNQ overlay to accelerate some Python functions.
This project aims at accelerating Python functions from the OpenCV library using PYNQ. We implemented the threshold function in binary mode, which sets each element of the input array to a predefined value if its intensity exceeds a given threshold, and to zero otherwise. The erode function is a work in progress.
The main goals of this project are:
- Develop a custom IP block to compute a function;
- Manipulate large arrays using the DMA;
- Integrate the IP block into an overlay for a specific target platform;
- Develop a driver to run the accelerated function;
- 1) Context
- 2) Requirements
- 3) Content
- 4) Running the App
- 5) Results
- 6) Using PYNQ peripherals
- 7) Axes of improvement
PYNQ (Python Productivity for Zynq) is an open-source project from Xilinx® that makes it possible to use the Python language and its libraries through Jupyter Notebook. We use the PYNQ-compatible PYNQ-Z2 board. It contains programmable logic equivalent to an Artix-7 FPGA (Field-Programmable Gate Array), which we are going to configure.
As defined in the PYNQ documentation, "overlays, or hardware libraries, are programmable FPGA designs that extend the user application from the Processing System of the Zynq into the Programmable Logic. They can be used to accelerate a software application [...]. An FPGA overlay is a virtual reconfigurable architecture that overlays on top of the physical FPGA configurable fabric".
This tutorial explains how to develop your own overlay with Vivado and PYNQ. This documentation page may help.
List of requirements for this project :
Actions :
- Setup the PYNQ-Z2 board by following this guide. The boot files are located here.
- Install the PYNQ-Z2 board files in Vivado. In case the link is broken, try this one. You may need to use Win32 Disk Imager, Putty and a micro SD to SD adapter.
In the Jupyter notebook that corresponds to our application, we first call the OpenCV function for thresholding. Then, we reimplemented the same operation in plain Python (the "remake"), without calling the library.
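
To make the idea of a library-free remake concrete, here is a minimal sketch of a binary threshold written in plain Python. It is not the notebook's actual code: the function name `threshold_binary`, its parameters, and the sample values are assumptions chosen for illustration.

```python
def threshold_binary(pixels, thresh, maxval):
    """Binary threshold on a flat sequence of grayscale intensities.

    Each value strictly greater than `thresh` becomes `maxval`;
    every other value becomes 0.
    """
    out = [0] * len(pixels)
    for i, value in enumerate(pixels):
        if value > thresh:
            out[i] = maxval
    return out


# Tiny hand-made example (illustrative values, not data from the project).
sample = [0, 100, 127, 128, 255]
result = threshold_binary(sample, thresh=127, maxval=255)
print(result)
```

A per-element loop like this is easy to translate to C++ later, which is one reason to write the remake without any library call.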
Our goal is to configure an overlay that contains a custom IP block that accelerates this function. We call this overlay in the notebook. The Vivado files for the custom IP block can be found here, and the ones for the whole overlay here.
We followed these steps to turn the Python code into kernel code:
- Identify the function to accelerate ('threshold' in our case);
- Convert the code to a Python function that doesn't call any library;
- Convert this raw Python function to C++ (check the program here);
- Adapt the C++ code to the target board (check the program here);
The Zynq-7000 device is referenced by the code xc7z020clg400-1.
The following steps describe how to create and use the overlay :
- Synthesize the custom IP block from the kernel code and export the RTL (Register-Transfer Level) as a Vivado IP;
- Integrate the custom IP block in a PYNQ-Z2 overlay;
- Export the block design (check the files here);
- Call the overlay in the notebook;
- Develop a Python Driver class to easily interact with the overlay.
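
As an illustration of the last step, the sketch below shows one possible shape for such a driver class. It is not the project's driver. It assumes 8-bit pixels, a DMA object exposing `sendchannel` and `recvchannel` with `transfer()` and `wait()` (as the PYNQ DMA class does), and an `allocate` function shaped like `pynq.allocate`. The class name, the fake DMA, and the fixed threshold of 127 are hypothetical, and are there only so the example runs off-board.

```python
class ThresholdDriver:
    """Hypothetical driver that streams a flat pixel array through the IP via DMA."""

    def __init__(self, dma, allocate):
        # `dma` must expose `sendchannel` and `recvchannel`;
        # `allocate(shape=..., dtype=...)` must return a writable buffer.
        self._dma = dma
        self._allocate = allocate

    def threshold(self, flat_pixels):
        n = len(flat_pixels)
        # Assumption: the IP consumes and produces 8-bit values ("u1").
        in_buf = self._allocate(shape=(n,), dtype="u1")
        out_buf = self._allocate(shape=(n,), dtype="u1")
        in_buf[:] = flat_pixels
        # Start both transfers, then wait for both to complete.
        self._dma.sendchannel.transfer(in_buf)
        self._dma.recvchannel.transfer(out_buf)
        self._dma.sendchannel.wait()
        self._dma.recvchannel.wait()
        return list(out_buf)


class _FakeChannel:
    """Stand-in for a DMA channel: runs a callback instead of moving data."""

    def __init__(self, on_transfer):
        self._on_transfer = on_transfer

    def transfer(self, buf):
        self._on_transfer(buf)

    def wait(self):
        pass  # Nothing to wait for in software.


class FakeThresholdDma:
    """Software model of DMA + IP (hypothetical fixed threshold parameters)."""

    def __init__(self, thresh=127, maxval=255):
        self._thresh, self._maxval = thresh, maxval
        self._pending = []
        self.sendchannel = _FakeChannel(self._send)
        self.recvchannel = _FakeChannel(self._recv)

    def _send(self, buf):
        self._pending = list(buf)

    def _recv(self, buf):
        for i, value in enumerate(self._pending):
            buf[i] = self._maxval if value > self._thresh else 0


def fake_allocate(shape, dtype):
    # Plain list instead of a physically contiguous buffer.
    return [0] * shape[0]


driver = ThresholdDriver(FakeThresholdDma(), fake_allocate)
driver_output = driver.threshold([0, 127, 128, 255])
print(driver_output)

# On the board, the same class could be wired up roughly like this
# (the bitstream file name and DMA instance name are hypothetical):
#   from pynq import Overlay, allocate
#   overlay = Overlay("threshold.bit")
#   driver = ThresholdDriver(overlay.threshold.axi_dma_0, allocate)
```

Injecting the DMA and the allocator keeps the driver testable on a laptop, at the cost of a slightly longer constructor.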
This tutorial video shows how to create the custom IP block. Click on the thumbnail to play the video. You may have to download the video.
The overlay is composed of three main IP blocks :
This tutorial video shows how to integrate the custom IP block to the PYNQ-Z2 overlay. Click on the thumbnail to play the video. You may have to download the video.
Here is the final block design :
It contains a hierarchy ('threshold') for our custom IP block :
Before calling the threshold function, we preprocess the image: we convert it to grayscale, turn it into an array, flatten it, and get its length. The postprocessing step reshapes the output data and converts it back to an image.
Once the PYNQ-Z2 board is set up, connect to the Jupyter notebook through the network. Connect the PYNQ-Z2 to Ethernet, and connect the HDMI-in port to a machine through an HDMI cable.
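
The preprocessing and postprocessing steps can be sketched without any library, as below. This is only an illustration: the notebook may well rely on OpenCV or NumPy for these steps, the helper names are made up, and the grayscale weights (0.299, 0.587, 0.114) are the usual BT.601 luma coefficients, assumed here rather than taken from the project.

```python
def to_grayscale(rgb_rows):
    """Convert rows of (R, G, B) tuples to rows of 8-bit luminance values."""
    return [
        [int(round(0.299 * r + 0.587 * g + 0.114 * b)) for (r, g, b) in row]
        for row in rgb_rows
    ]


def flatten(rows):
    """Turn a 2-D list of pixels into a flat 1-D list (row-major order)."""
    return [value for row in rows for value in row]


def reshape(flat, height, width):
    """Rebuild a 2-D image from a flat list; inverse of `flatten`."""
    if len(flat) != height * width:
        raise ValueError("flat length does not match height * width")
    return [flat[r * width:(r + 1) * width] for r in range(height)]


# Hand-made 2x2 RGB image (illustrative values only).
image = [
    [(255, 255, 255), (0, 0, 0)],
    [(255, 0, 0), (0, 0, 255)],
]
gray = to_grayscale(image)
flat = flatten(gray)
length = len(flat)            # number of elements the accelerator would receive
restored = reshape(flat, 2, 2)
print(flat, length, restored)
```

The flat array and its length are what a streaming interface typically needs; the original height and width must be kept aside for the reshape.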
In our case :
- URL address : 10.104.210.46:9090
- Login/password : xilinx/xilinx
You just have to download the archive of the notebook project and place it in your own Jupyter space.
For now, only the threshold function of binary type is implemented.
- Input image :
- Output image :
Let's compare the overlay's performance to the original OpenCV function and the Python remake, using the same input image and parameters. We measure the duration of each function across 5 runs.
OpenCV | Remake | Overlay |
---|---|---|
5.5 ms | 37146.5 ms | 2846.7 ms |
6.4 ms | 37387.0 ms | 2901.8 ms |
5.5 ms | 36541.3 ms | 2890.1 ms |
6.9 ms | 37202.1 ms | 2881.7 ms |
6.8 ms | 37069.5 ms | 2887.6 ms |
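
Per-run durations like the ones above could be collected with a small timing harness. The sketch below is an assumption about how such a measurement might look, not the notebook's code; `time_runs` and `dummy_threshold` are hypothetical names, and the dummy function merely stands in for the three implementations.

```python
import time


def time_runs(func, *args, runs=5):
    """Call `func(*args)` `runs` times and return each duration in milliseconds."""
    durations_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        func(*args)
        durations_ms.append((time.perf_counter() - start) * 1000.0)
    return durations_ms


def dummy_threshold(pixels):
    # Placeholder workload; replace with the function under test.
    return [255 if value > 127 else 0 for value in pixels]


durations = time_runs(dummy_threshold, list(range(256)) * 10)
print([f"{d:.3f} ms" for d in durations])
```

`time.perf_counter()` is used because it is a monotonic, high-resolution clock, which suits short measurements better than wall-clock time.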
The average durations after this test are the following :
OpenCV | Remake | Overlay |
---|---|---|
6.2 ms | 37069.3 ms | 2881.6 ms |
We can deduce that in this test, the overlay function is almost 13 times faster than the remake function. However, it is still around 465 times slower than the original OpenCV implementation. This result was expected given that OpenCV is an optimized library.
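
The averages and ratios quoted above can be re-derived from the per-run table. The short script below uses only the measurements listed in this section; the variable names are illustrative, and the averages are rounded to one decimal before computing the ratios, as in the table.

```python
from statistics import mean

# Per-run durations in milliseconds, copied from the table above.
runs_ms = {
    "OpenCV": [5.5, 6.4, 5.5, 6.9, 6.8],
    "Remake": [37146.5, 37387.0, 36541.3, 37202.1, 37069.5],
    "Overlay": [2846.7, 2901.8, 2890.1, 2881.7, 2887.6],
}

# Average per implementation, rounded to one decimal like the table.
averages = {name: round(mean(values), 1) for name, values in runs_ms.items()}

# How many times faster the overlay is than the pure-Python remake.
speedup_vs_remake = averages["Remake"] / averages["Overlay"]
# How many times slower the overlay is than OpenCV.
slowdown_vs_opencv = averages["Overlay"] / averages["OpenCV"]

print(averages)
print(f"overlay vs remake: {speedup_vs_remake:.1f}x faster")
print(f"overlay vs OpenCV: {slowdown_vs_opencv:.0f}x slower")
```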
Furthermore, we obtain the same output data for the three functions, which means that the accelerated results are accurate. To check it, we computed the mean absolute difference between the output arrays, taken pairwise.
To explore other PYNQ capabilities, we replaced the step that loads an image from a folder with an input stream coming from the HDMI port. To do so, we connected the HDMI-in port to a computer displaying a web page, and followed the HDMI-in tutorial. We also worked with GPIO using the RGB LED, whose driver source code can be found here.
- Make changes to the kernel code of the threshold function to further reduce the execution time;
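
The pairwise check described above can be sketched as follows. The three output arrays here are small placeholders, not the project's data, and `mean_abs_diff` is a hypothetical helper name; the sketch assumes non-empty arrays of equal length.

```python
from itertools import combinations


def mean_abs_diff(a, b):
    """Mean of |a[i] - b[i]| over two equally sized, non-empty sequences."""
    if len(a) != len(b):
        raise ValueError("arrays must have the same length")
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


# Placeholder outputs standing in for the three implementations.
outputs = {
    "opencv": [0, 0, 255, 255],
    "remake": [0, 0, 255, 255],
    "overlay": [0, 0, 255, 255],
}

# Compare every pair once: (opencv, remake), (opencv, overlay), (remake, overlay).
diffs = {
    (first, second): mean_abs_diff(outputs[first], outputs[second])
    for first, second in combinations(outputs, 2)
}
print(diffs)
```

A mean absolute difference of zero for every pair means the arrays are element-wise identical, since each term of the sum is non-negative.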
- Improve the design of the Python driver class;
- Make the threshold function more customizable by allowing the choice of the thresholding technique;
- Implement the IP block of the erode function;
- Combine the custom IP blocks with the RGB LED and the HDMI in a single overlay to avoid switching between them. Either start from the whole PYNQ-Z2 base overlay and add the custom IP, or start from scratch. This tutorial may help;
- Configure the HDMI-out peripheral to display the results on an external screen;
- Apply the function continuously on a video stream.
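
For the item about supporting several thresholding techniques, a software reference model can help validate a future kernel. The sketch below follows the semantics of OpenCV's five basic threshold types (binary, inverted binary, truncate, to-zero, inverted to-zero). The function signature and mode names are assumptions made for illustration and are not part of the project.

```python
def threshold(pixels, thresh, maxval, mode="binary"):
    """Reference model for several thresholding techniques on a flat array."""
    operations = {
        "binary": lambda v: maxval if v > thresh else 0,
        "binary_inv": lambda v: 0 if v > thresh else maxval,
        "trunc": lambda v: thresh if v > thresh else v,
        "tozero": lambda v: v if v > thresh else 0,
        "tozero_inv": lambda v: 0 if v > thresh else v,
    }
    if mode not in operations:
        raise ValueError(f"unknown thresholding mode: {mode!r}")
    apply = operations[mode]
    return [apply(value) for value in pixels]


# Illustrative values only.
sample = [50, 127, 200]
results = {
    mode: threshold(sample, thresh=127, maxval=255, mode=mode)
    for mode in ("binary", "binary_inv", "trunc", "tozero", "tozero_inv")
}
for mode, values in results.items():
    print(mode, values)
```

In hardware, the mode would more likely become an extra input or register of the IP block than a dictionary lookup; this model is only meant as a comparison baseline.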