Converter
The `Converter` class provides static methods for image format conversions, memory allocation, and configuration settings.
Supported Conversions:
- RGB, BGR, RGBA, BGRA <=> YUV I420
- YUV NV12 => RGB, BGR, RGBA, BGRA
- YUV NV12 => YUV I420
The `Converter` class is designed to provide efficient image conversion and memory management. It leverages native SIMD implementations (AVX2, SSE, NEON) for fast conversions and offers various configuration options. Its features include:
- RGB/BGR/RGBA/BGRA to YUV420P conversion.
- YUV420P to RGB conversion.
- YUV NV12 to RGB conversion.
- YUV NV12 to YUV I420 conversion.
- Image downscaling.
- Aligned native memory allocation and freeing.
- Global configuration settings.
- SIMD optimizations (SSE, NEON, AVX2).
Method | Return Type | Description |
---|---|---|
`SetConfig(ConverterConfig config)` | void | Sets the global configuration for the converter. If a feature is not supported, its flag will be overwritten. |
`GetCurrentConfig()` | ConverterConfig | Gets the current configuration from the native side. |
`SetOption(ConverterOption option, int value)` | void | Sets a single configuration option. Useful when you only need to change one parameter. |
`AllocAllignedNative(int size)` | IntPtr | Allocates 64-byte-aligned native memory. |
`FreeAllignedNative(IntPtr p)` | void | Frees native memory allocated by `AllocAllignedNative`. |
`Rgb2Yuv(RgbImage from, YuvImage yuv)` | void | Converts RGB, BGR, RGBA or BGRA to YUV420P. |
`Yuv2Rgb(YuvImage yuv, RgbImage image)` | void | Converts a YUV420P image to RGB, BGR, RGBA or BGRA format. |
`Yuv2Rgb(YUVImagePointer yuv, RgbImage image)` | void | Converts a YUV420P image to RGB format using `YUVImagePointer`. |
`Yuv2Rgb(YUVNV12ImagePointer yuv, RgbImage image)` | void | Converts a YUV NV12 image to RGB format. |
`YuvNV12toYV12(YUVNV12ImagePointer nv12, YuvImage yv12)` | void | Converts a YUV NV12 image to YUV I420. |
`Downscale(RgbImage from, RgbImage to, int multiplier)` | void | Downscales an image by the given factor; the width and height are divided by the factor. |
Configuration Options:
Name | Description |
---|---|
NumThreads | Number of chunks the image is divided into and sent to the thread pool. Defaults to 1 on ARM systems.
EnableSSE | Allows use of SSE SIMD implementations of Converter operations. Does nothing on ARM. |
EnableNeon | Allows use of NEON SIMD implementations of Converter operations. Does nothing on x86 systems. |
EnableAvx2 | Allows use of AVX2 SIMD implementations of Converter operations. Does nothing on ARM. |
EnableAvx512 | Not supported yet. |
EnableCustomThreadPool | Enables the custom thread pool. On Windows you can alternatively use the Windows pool provided by ppl.h; depending on hardware, performance may vary. Does nothing on other platforms, which always use the custom pool.
EnableDebugPrints | Enables debug prints.
ForceNaiveConversion | For test purposes only. When no SIMD is enabled, uses the fixed-point approximation naive converter.
```csharp
// Read, modify, and apply the global configuration.
ConverterConfig config = Converter.GetCurrentConfig();
config.NumThreads = 4;
Converter.SetConfig(config);

// Or set individual options directly.
Converter.SetOption(ConverterOption.EnableAvx2, 1);
Converter.SetOption(ConverterOption.NumThreads, 8);
```

```csharp
// RGB to YUV420P conversion.
RgbImage rgbImage = new RgbImage(ImageFormat.Rgb, 800, 600);
YuvImage yuvImage = new YuvImage(800, 600);
Converter.Rgb2Yuv(rgbImage, yuvImage);
```

```csharp
// YUV420P to RGB conversion.
YuvImage yuvImage = new YuvImage(800, 600);
RgbImage rgbImage = new RgbImage(ImageFormat.Rgb, 800, 600);
Converter.Yuv2Rgb(yuvImage, rgbImage);
```

```csharp
// Downscale by a factor of 2 (1600x1200 -> 800x600).
RgbImage fromImage = new RgbImage(ImageFormat.Rgb, 1600, 1200);
RgbImage toImage = new RgbImage(ImageFormat.Rgb, 800, 600);
Converter.Downscale(fromImage, toImage, 2);
```

```csharp
// Aligned native memory allocation and release.
IntPtr nativePtr = Converter.AllocAllignedNative(1024 * 1024); // allocates 1 MB, aligned to 64 bytes
Converter.FreeAllignedNative(nativePtr);
```
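The NV12 entry points take pointer-based wrappers around externally owned memory. The sketch below illustrates one possible way to drive them; note that the `YUVNV12ImagePointer` constructor arguments shown here (plane pointers plus dimensions) are an assumption for illustration, so check the actual signature before use.

```csharp
// Hedged sketch: converting an NV12 frame held in native memory.
// NV12 layout: a full-resolution Y plane followed by an interleaved,
// half-resolution UV plane.
int width = 800, height = 600;
IntPtr yPlane = Converter.AllocAllignedNative(width * height);
IntPtr uvPlane = Converter.AllocAllignedNative(width * height / 2);

// Assumed constructor shape (plane pointers plus dimensions); verify against the real API.
YUVNV12ImagePointer nv12 = new YUVNV12ImagePointer(yPlane, uvPlane, width, height);

// NV12 -> RGB.
RgbImage rgb = new RgbImage(ImageFormat.Rgb, width, height);
Converter.Yuv2Rgb(nv12, rgb);

// NV12 -> I420.
YuvImage i420 = new YuvImage(width, height);
Converter.YuvNV12toYV12(nv12, i420);

Converter.FreeAllignedNative(yPlane);
Converter.FreeAllignedNative(uvPlane);
```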
Parallelization:
- The color format conversion process (RGB to YUV and vice versa) supports optional parallelization via thread configuration.
- Using a single thread minimizes CPU cycle consumption (reduces context switching, cache thrashing) and maximizes efficiency, but may result in slower conversion times.
- Conversion performance is highly dependent on factors such as image size, system memory speed, L3 cache size, core IPC (instructions per clock), cache performance, and other system-specific characteristics. Therefore, performance may vary depending on the processor architecture.
- Especially on ARM systems, the NEON implementations are very efficient, so the bottleneck will be memory speed and cache. In my tests, parallelization on ARM systems gave little to no benefit.
- On Windows there is an option to switch between the Windows thread pool (ppl.h) and the custom thread pool. The custom thread pool performs significantly better on the AMD systems I have tested and offers the same performance on Intel systems. Non-Windows platforms always use the custom pool.
- Setting NumThreads to 1 or 0 disables the thread pool (see the example below).
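For example, on a multi-core x86 desktop the options below would split each image across the available cores and opt into the custom pool. This is only a sketch, and it assumes `ConverterOption` exposes members matching the configuration names listed above (NumThreads, EnableCustomThreadPool); whether it pays off depends on the hardware caveats discussed here.

```csharp
// Sketch: split work across the machine's core count and opt into the custom pool.
// Effectiveness is hardware dependent; on ARM a single thread is often just as fast.
Converter.SetOption(ConverterOption.NumThreads, Environment.ProcessorCount);
Converter.SetOption(ConverterOption.EnableCustomThreadPool, 1);

// Setting NumThreads back to 1 (or 0) disables the thread pool entirely.
Converter.SetOption(ConverterOption.NumThreads, 1);
```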
RGB to YUV Conversion SIMD Support:
- SIMD (Single Instruction, Multiple Data) implementations are significantly faster than Naïve or compiler auto vectorized implementations.
- SIMD support can be configured for RGB to YUV conversions (see the snippet after this list).
- By default, the highest supported instruction set (e.g., AVX2, SSE) is automatically selected at runtime; i.e., if AVX2 is available, the SSE version will not be executed.
- NEON instruction sets are utilized on ARM architectures and are inactive on x86 systems, and vice versa.
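As an illustration, the snippet below disables AVX2 so the SSE path is selected, and then turns SIMD off entirely to exercise the naive fixed-point converter for testing. It assumes `ConverterOption` mirrors the configuration names in the table above.

```csharp
// Force the SSE path on an AVX2-capable x86 machine by disabling AVX2.
Converter.SetOption(ConverterOption.EnableAvx2, 0);

// For correctness testing only: with all SIMD paths disabled,
// ForceNaiveConversion routes conversions through the fixed-point naive implementation.
Converter.SetOption(ConverterOption.EnableSSE, 0);
Converter.SetOption(ConverterOption.EnableNeon, 0);
Converter.SetOption(ConverterOption.ForceNaiveConversion, 1);
```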
H264Sharp conversion operations are up to 2.9x faster than OpenCV implementations.
1080p 5000 Iterations of RGB -> YUV and YUV -> RGB, CustomThreadPool
AMD Ryzen 7 3700X Desktop CPU
#Threads | OpenCV (ms) | H264Sharp (ms) |
---|---|---|
1 | 11919 | 4899 |
2 | 6205 | 2479 |
4 | 3807 | 1303 |
8 | 2543 | 822 |
16 | 2462 | 824 |
Intel i7 10600U Laptop CPU
1080p 5000 Iterations of RGB -> YUV and YUV -> RGB, CustomThreadPool
#Threads | OpenCV (ms) | H264Sharp (ms) |
---|---|---|
1 | 11719 | 6010 |
2 | 6600 | 3210 |
4 | 4304 | 2803 |
8 | 3560 | 1839 |
1080p, 1000 iterations.
Pixel 6 Pro, Google Tensor SOC
#Threads | Yuv2Rgb (ms) | Rgb2Yuv (ms) |
---|---|---|
1 | 523 | 634 |
2 | 402 | 635 |
4 | 429 | 638 |
8 | 466 | 653 |
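For reference, a timing loop along the following lines could be used to reproduce a comparable measurement on your own hardware. The resolution and iteration count simply mirror the tables above, and the namespace import is assumed.

```csharp
using System;
using System.Diagnostics;
using H264Sharp; // assumed namespace of the library

// Rough benchmark sketch: times repeated RGB -> YUV -> RGB round trips at 1080p.
RgbImage rgb = new RgbImage(ImageFormat.Rgb, 1920, 1080);
YuvImage yuv = new YuvImage(1920, 1080);
RgbImage rgbOut = new RgbImage(ImageFormat.Rgb, 1920, 1080);

var sw = Stopwatch.StartNew();
for (int i = 0; i < 5000; i++)
{
    Converter.Rgb2Yuv(rgb, yuv);
    Converter.Yuv2Rgb(yuv, rgbOut);
}
sw.Stop();
Console.WriteLine($"5000 iterations took {sw.ElapsedMilliseconds} ms");
```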
This development was extremely fun and challenging. The SIMD gains from instruction-level parallelism over naïve, auto-vectorized, or table-based implementations were outstanding.
- On the Pixel 6, comparing the NEON and default implementations: encoding and decoding 1000 frames takes 7640 ms with NEON and 12688 ms with the default implementation. Note that this is an encode-and-decode operation, and the bulk of the time is spent in the encoder/decoder.
I have implemented an expandable thread pool where tasks (regions of the image) are delegated to threads and work stealing is employed. Each thread gets only one chunk per image, and when a thread finishes its job it can steal from other threads' tasks.
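The sketch below is only a simplified C# illustration of that scheme, not the library's native pool: each worker owns a queue of image-region tasks and, once its own queue is empty, steals leftover tasks from the other workers instead of idling. (A production work-stealing pool would use per-worker deques with lock-free stealing from the opposite end.)

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;

// Toy work-stealing illustration (not the library's native pool).
class WorkStealingSketch
{
    // Stand-in for converting a horizontal region of the image.
    static void ConvertRows(int rowStart, int rowEnd)
        => Console.WriteLine($"converted rows {rowStart}..{rowEnd}");

    static void Main()
    {
        int workers = Environment.ProcessorCount;
        var queues = new ConcurrentQueue<Action>[workers];
        for (int i = 0; i < workers; i++) queues[i] = new ConcurrentQueue<Action>();

        // One chunk (horizontal strip of the image) per worker, as described above.
        int height = 1080, strip = height / workers;
        for (int i = 0; i < workers; i++)
        {
            int rowStart = i * strip;
            int rowEnd = (i == workers - 1) ? height : rowStart + strip;
            queues[i].Enqueue(() => ConvertRows(rowStart, rowEnd));
        }

        var threads = new Thread[workers];
        for (int w = 0; w < workers; w++)
        {
            int self = w;
            threads[w] = new Thread(() =>
            {
                Action task;
                // Drain the worker's own queue first...
                while (queues[self].TryDequeue(out task)) task();
                // ...then steal leftover tasks from the other workers.
                for (int v = 0; v < workers; v++)
                {
                    if (v == self) continue;
                    while (queues[v].TryDequeue(out task)) task();
                }
            });
            threads[w].Start();
        }
        foreach (var t in threads) t.Join();
    }
}
```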
There are some interesting observations I would like to share from benchmarking across various platforms:
- On AMD systems such as Ryzen, for reasons I could not figure out, the custom thread pool with futex-based thread wake-up (and its Windows equivalent) was significantly faster than the Microsoft thread pool implemented in ppl.h.
- For example, 5000 iterations of YUV->RGB->YUV conversion with 16 threads take 894 ms on the custom pool and 2125 ms on the Microsoft pool on an AMD Ryzen 7 3700X, and this trend continued on later generations of Ryzen.
- On Intel systems, both pools perform similarly.
- On ARM systems where cache is limited, such as the Raspberry Pi 5 (which is miles behind the Google Tensor chip in the Pixel), parallelization actually yields worse performance than a single thread (the more threads, the worse it gets).
- Interestingly, the thread wake-up time on Ryzen systems (WaitOnAddress and WakeByAddressSingle) was faster (about 20-25 microseconds) than on a high-end Intel i9 13800hx (about 50-60, and it can go up to 200). Yet Intel wins by sheer power.
- And in the test with the Raspberry Pi, which runs Ubuntu Server as its OS, the latency was surprisingly lower (8-9 microseconds).
If you have read up until this point, you are awesome. Cheers!