Converter
The `Converter` class provides static methods for image format conversions, memory allocation, and configuration settings.
Supported Conversions:
- RGB, BGR, RGBA, BGRA <=> YUV I420
- YUV NV12 => RGB, BGR, RGBA, BGRA
- YUV NV12 => YUV I420
The `Converter` class is designed to provide efficient image conversion and memory management. It leverages native SIMD implementations (AVX2, SSE, NEON) for fast conversions and offers various configuration options. Its features include:
- RGB/BGR/RGBA/BGRA to YUV420P conversion.
- YUV420P to RGB conversion.
- YUV NV12 to RGB conversion.
- YUV NV12 to YUV I420 conversion.
- Image downscaling.
- Aligned native memory allocation and freeing.
- Global configuration settings.
- SIMD optimizations (SSE, NEON, AVX2).
Method | Return Type | Description |
---|---|---|
`SetConfig(ConverterConfig config)` | void | Sets the global configuration for the converter. If a feature is not supported, its flag will be overwritten. |
`GetCurrentConfig()` | ConverterConfig | Gets the current configuration from the native side. |
`SetOption(ConverterOption option, int value)` | void | Sets a single configuration option. Useful when you only need to change one parameter. |
`AllocAllignedNative(int size)` | IntPtr | Allocates 64-byte-aligned native memory. |
`FreeAllignedNative(IntPtr p)` | void | Frees native memory allocated by `AllocAllignedNative`. |
`Rgb2Yuv(RgbImage from, YuvImage yuv)` | void | Converts RGB, BGR, RGBA or BGRA to YUV420P. |
`Yuv2Rgb(YuvImage yuv, RgbImage image)` | void | Converts a YUV420P image to RGB, BGR, RGBA or BGRA format. |
`Yuv2Rgb(YUVImagePointer yuv, RgbImage image)` | void | Converts a YUV420P image to RGB format using `YUVImagePointer`. |
`Yuv2Rgb(YUVNV12ImagePointer yuv, RgbImage image)` | void | Converts a YUV NV12 image to RGB format. |
`YuvNV12toYV12(YUVNV12ImagePointer nv12, YuvImage yv12)` | void | Converts a YUV NV12 image to YUV I420. |
`Downscale(RgbImage from, RgbImage to, int multiplier)` | void | Downscales an image by the given factor; the width and height are divided by the factor. |
Configuration Options:
Name | Description |
---|---|
NumThreads | Number of chunks the image is divided into and sent to the thread pool. Defaults to 1 on ARM systems.
EnableSSE | Allows use of SSE SIMD implementations of Converter operations. Does nothing on ARM. |
EnableNeon | Allows use of NEON SIMD implementations of Converter operations. Does nothing on x86 systems. |
EnableAvx2 | Allows use of AVX2 SIMD implementations of Converter operations. Does nothing on ARM. |
EnableAvx512 | Not supported yet. |
EnableCustomThreadPool | Enables the custom thread pool. On Windows you can alternatively use the Windows pool provided by ppl.h; depending on hardware, performance may vary. Does nothing on other platforms, which always use the custom pool.
EnableDebugPrints | Enables debug prints.
ForceNaiveConversion | For test purposes only. When no SIMD is enabled, uses the fixed-point approximation naive converter.
```csharp
// Read, modify, and apply the global configuration.
ConverterConfig config = Converter.GetCurrentConfig();
config.NumThreads = 4;
Converter.SetConfig(config);

// Or set individual options directly.
Converter.SetOption(ConverterOption.EnableAvx2, 1);
Converter.SetOption(ConverterOption.NumThreads, 8);
```

```csharp
// RGB to YUV420P conversion.
RgbImage rgbImage = new RgbImage(ImageFormat.Rgb, 800, 600);
YuvImage yuvImage = new YuvImage(800, 600);
Converter.Rgb2Yuv(rgbImage, yuvImage);
```

```csharp
// YUV420P to RGB conversion.
YuvImage yuvImage = new YuvImage(800, 600);
RgbImage rgbImage = new RgbImage(ImageFormat.Rgb, 800, 600);
Converter.Yuv2Rgb(yuvImage, rgbImage);
```

```csharp
// Downscale by a factor of 2 (1600x1200 -> 800x600).
RgbImage fromImage = new RgbImage(ImageFormat.Rgb, 1600, 1200);
RgbImage toImage = new RgbImage(ImageFormat.Rgb, 800, 600);
Converter.Downscale(fromImage, toImage, 2);
```

```csharp
// Aligned native memory allocation and release.
IntPtr nativePtr = Converter.AllocAllignedNative(1024 * 1024); // allocates 1 MB, aligned to 64 bytes
Converter.FreeAllignedNative(nativePtr);
```
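The NV12 entry points take pointer-based wrappers around externally owned memory. The sketch below illustrates one possible way to drive them; note that the `YUVNV12ImagePointer` constructor arguments shown here (plane pointers plus dimensions) are an assumption for illustration, so check the actual signature before use.

```csharp
// Hedged sketch: converting an NV12 frame held in native memory.
// NV12 layout: a full-resolution Y plane followed by an interleaved,
// half-resolution UV plane.
int width = 800, height = 600;
IntPtr yPlane = Converter.AllocAllignedNative(width * height);
IntPtr uvPlane = Converter.AllocAllignedNative(width * height / 2);

// Assumed constructor shape (plane pointers plus dimensions); verify against the real API.
YUVNV12ImagePointer nv12 = new YUVNV12ImagePointer(yPlane, uvPlane, width, height);

// NV12 -> RGB.
RgbImage rgb = new RgbImage(ImageFormat.Rgb, width, height);
Converter.Yuv2Rgb(nv12, rgb);

// NV12 -> I420.
YuvImage i420 = new YuvImage(width, height);
Converter.YuvNV12toYV12(nv12, i420);

Converter.FreeAllignedNative(yPlane);
Converter.FreeAllignedNative(uvPlane);
```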
Parallelization:
- The color format conversion process (RGB to YUV and vice versa) supports optional parallelization via thread configuration.
- Using a single thread minimizes CPU cycle consumption (reduces context switching, cache thrashing) and maximizes efficiency, but may result in slower conversion times.
- Conversion performance is highly dependent on factors such as image size, system memory speed, L3 cache size, core IPC (instructions per clock), cache performance, and other system-specific characteristics. Therefore, performance may vary depending on the processor architecture.
- Especially on ARM systems, the NEON implementations are very efficient, so the bottleneck will be memory speed and cache. In my tests, parallelization on ARM systems gave little to no benefit.
- On Windows there is an option to switch between the Windows thread pool (ppl.h) and the custom thread pool. The custom thread pool performs significantly better on the AMD systems I have tested and offers the same performance on Intel systems. Non-Windows platforms always use the custom pool.
- Setting NumThreads to 1 or 0 disables the thread pool (see the example below).
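For example, on a multi-core x86 desktop the options below would split each image across the available cores and opt into the custom pool. This is only a sketch, and it assumes `ConverterOption` exposes members matching the configuration names listed above (NumThreads, EnableCustomThreadPool); whether it pays off depends on the hardware caveats discussed here.

```csharp
// Sketch: split work across the machine's core count and opt into the custom pool.
// Effectiveness is hardware dependent; on ARM a single thread is often just as fast.
Converter.SetOption(ConverterOption.NumThreads, Environment.ProcessorCount);
Converter.SetOption(ConverterOption.EnableCustomThreadPool, 1);

// Setting NumThreads back to 1 (or 0) disables the thread pool entirely.
Converter.SetOption(ConverterOption.NumThreads, 1);
```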
RGB to YUV Conversion SIMD Support:
- SIMD (Single Instruction, Multiple Data) implementations are significantly faster than Naïve or compiler auto vectorized implementations.
- SIMD support can be configured for RGB to YUV conversions (see the snippet after this list).
- By default, the highest supported instruction set (e.g., AVX2, SSE) is automatically selected at runtime; i.e., if AVX2 is available, the SSE version will not be executed.
- NEON instruction sets are utilized on ARM architectures and are inactive on x86 systems, and vice versa.
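As an illustration, the snippet below disables AVX2 so the SSE path is selected, and then turns SIMD off entirely to exercise the naive fixed-point converter for testing. It assumes `ConverterOption` mirrors the configuration names in the table above.

```csharp
// Force the SSE path on an AVX2-capable x86 machine by disabling AVX2.
Converter.SetOption(ConverterOption.EnableAvx2, 0);

// For correctness testing only: with all SIMD paths disabled,
// ForceNaiveConversion routes conversions through the fixed-point naive implementation.
Converter.SetOption(ConverterOption.EnableSSE, 0);
Converter.SetOption(ConverterOption.EnableNeon, 0);
Converter.SetOption(ConverterOption.ForceNaiveConversion, 1);
```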
H264Sharp conversion operations are up to 2.9x faster than OpenCV implementations.
1080p 5000 Iterations of RGB -> YUV and YUV -> RGB, CustomThreadPool
AMD Ryzen 7 3700X Desktop CPU
#Threads | OpenCV (ms) | H264Sharp (ms) |
---|---|---|
1 | 11919 | 4899 |
2 | 6205 | 2479 |
4 | 3807 | 1303 |
8 | 2543 | 822 |
16 | 2462 | 824 |
Intel i7 10600U Laptop CPU
1080p 5000 Iterations of RGB -> YUV and YUV -> RGB, CustomThreadPool
#Threads | OpenCV (ms) | H264Sharp (ms) |
---|---|---|
1 | 11719 | 6010 |
2 | 6600 | 3210 |
4 | 4304 | 2803 |
8 | 3560 | 1839 |
1080p, 1000 iterations.
Pixel 6 Pro, Google Tensor SOC
#Threads | Yuv2Rgb (ms) | Rgb2Yuv (ms) |
---|---|---|
1 | 523 | 634 |
2 | 402 | 635 |
4 | 429 | 638 |
8 | 466 | 653 |
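For reference, a timing loop along the following lines could be used to reproduce a comparable measurement on your own hardware. The resolution and iteration count simply mirror the tables above, and the namespace import is assumed.

```csharp
using System;
using System.Diagnostics;
using H264Sharp; // assumed namespace of the library

// Rough benchmark sketch: times repeated RGB -> YUV -> RGB round trips at 1080p.
RgbImage rgb = new RgbImage(ImageFormat.Rgb, 1920, 1080);
YuvImage yuv = new YuvImage(1920, 1080);
RgbImage rgbOut = new RgbImage(ImageFormat.Rgb, 1920, 1080);

var sw = Stopwatch.StartNew();
for (int i = 0; i < 5000; i++)
{
    Converter.Rgb2Yuv(rgb, yuv);
    Converter.Yuv2Rgb(yuv, rgbOut);
}
sw.Stop();
Console.WriteLine($"5000 iterations took {sw.ElapsedMilliseconds} ms");
```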
This development was extremely fun and challenging. The SIMD gains from instruction-level parallelism over naïve, auto-vectorized, or table-based implementations were outstanding.
- On the Pixel 6, comparing the NEON and default implementations: encoding and decoding 1000 frames takes 7640 ms with NEON and 12688 ms with the default implementation. Note that this is an encode-and-decode operation, and the bulk of the time is spent in the encoder/decoder.
I have implemented an expandable thread pool where tasks (regions of the image) are delegated to threads and work stealing is employed. Each thread gets only one chunk per image, and when a thread finishes its job it can steal from other threads' tasks.
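The sketch below is only a simplified C# illustration of that scheme, not the library's native pool: each worker owns a queue of image-region tasks and, once its own queue is empty, steals leftover tasks from the other workers instead of idling. (A production work-stealing pool would use per-worker deques with lock-free stealing from the opposite end.)

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;

// Toy work-stealing illustration (not the library's native pool).
class WorkStealingSketch
{
    // Stand-in for converting a horizontal region of the image.
    static void ConvertRows(int rowStart, int rowEnd)
        => Console.WriteLine($"converted rows {rowStart}..{rowEnd}");

    static void Main()
    {
        int workers = Environment.ProcessorCount;
        var queues = new ConcurrentQueue<Action>[workers];
        for (int i = 0; i < workers; i++) queues[i] = new ConcurrentQueue<Action>();

        // One chunk (horizontal strip of the image) per worker, as described above.
        int height = 1080, strip = height / workers;
        for (int i = 0; i < workers; i++)
        {
            int rowStart = i * strip;
            int rowEnd = (i == workers - 1) ? height : rowStart + strip;
            queues[i].Enqueue(() => ConvertRows(rowStart, rowEnd));
        }

        var threads = new Thread[workers];
        for (int w = 0; w < workers; w++)
        {
            int self = w;
            threads[w] = new Thread(() =>
            {
                Action task;
                // Drain the worker's own queue first...
                while (queues[self].TryDequeue(out task)) task();
                // ...then steal leftover tasks from the other workers.
                for (int v = 0; v < workers; v++)
                {
                    if (v == self) continue;
                    while (queues[v].TryDequeue(out task)) task();
                }
            });
            threads[w].Start();
        }
        foreach (var t in threads) t.Join();
    }
}
```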
There are some interesting observations I would like to share from benchmarking across various platforms:
- On AMD systems such as Ryzen, for reasons I could not figure out, the custom thread pool with futex-based thread wake-up (and its Windows equivalent) was significantly faster than the Microsoft thread pool implemented in ppl.h.
- For example, 5000 iterations of YUV->RGB->YUV conversion with 16 threads take 894 ms on the custom pool and 2125 ms on the Microsoft pool on an AMD Ryzen 7 3700X, and this trend continued on later generations of Ryzen.
- On Intel systems, both pools perform similarly.
- On ARM systems where cache is limited, such as the Raspberry Pi 5 (which is miles behind the Google Tensor chip in the Pixel), parallelization actually yields worse performance than a single thread (the more threads, the worse it gets).
- Interestingly, the thread wake-up time on Ryzen systems (WaitOnAddress and WakeByAddressSingle) was faster (about 20-25 microseconds) than on a high-end Intel i9 13800hx (about 50-60, and it can go up to 200). Yet Intel wins by sheer power.
- And in the test with the Raspberry Pi, which runs Ubuntu Server as its OS, the latency was surprisingly lower (8-9 microseconds).
If you have read up until this point, you are awesome. Cheers!