
ReferenceType edited this page Mar 14, 2025 · 11 revisions

Converter Class

The Converter class provides static methods for image format conversions, memory allocation, and configuration settings.

Supported Conversions:

  • RGB, BGR, RGBA, BGRA <=> YUV I420
  • YUV NV12 => RGB, BGR, RGBA, BGRA
  • YUV NV12 => YUV I420


Overview

The Converter class provides efficient image conversion and memory management. It leverages native SIMD implementations (AVX2, SSE, NEON) for fast conversions and offers various configuration options for tuning.

Features:

  • RGB/BGR/RGBA/BGRA to YUV420P conversion.
  • YUV420P to RGB conversion.
  • YUV NV12 to RGB conversion.
  • YUV NV12 to YUV I420 conversion.
  • Image downscaling.
  • Aligned native memory allocation and freeing.
  • Global configuration settings.
  • SIMD optimizations (SSE, NEON, AVX2).

Static Methods

| Method | Return Type | Description |
| --- | --- | --- |
| SetConfig(ConverterConfig config) | void | Sets the global configuration for the converter. If a feature is not supported, its flag is overwritten. |
| GetCurrentConfig() | ConverterConfig | Gets the current configuration from the native side. |
| SetOption(ConverterOption option, int value) | void | Sets a configuration option. Useful when you only need to change a single parameter. |
| AllocAllignedNative(int size) | IntPtr | Allocates 64-byte-aligned native memory. |
| FreeAllignedNative(IntPtr p) | void | Frees native memory allocated by AllocAllignedNative. |
| Rgb2Yuv(RgbImage from, YuvImage yuv) | void | Converts RGB, BGR, RGBA or BGRA to YUV420P. |
| Yuv2Rgb(YuvImage yuv, RgbImage image) | void | Converts a YUV420P image to RGB, BGR, RGBA or BGRA format. |
| Yuv2Rgb(YUVImagePointer yuv, RgbImage image) | void | Converts a YUV420P image to RGB format using a YUVImagePointer. |
| Yuv2Rgb(YUVNV12ImagePointer yuv, RgbImage image) | void | Converts a YUV NV12 planar image to RGB format. |
| YuvNV12toYV12(YUVNV12ImagePointer nv12, YuvImage yv12) | void | Converts a YUV NV12 planar image to YUV I420. |
| Downscale(RgbImage from, RgbImage to, int multiplier) | void | Downscales an image by the given factor; width and height are divided by the factor. |

Converter Options

| Name | Description |
| --- | --- |
| NumThreads | Number of chunks the image is divided into and dispatched to the thread pool. Defaults to 1 on ARM systems. |
| EnableSSE | Allows use of SSE SIMD implementations of Converter operations. Does nothing on ARM. |
| EnableNeon | Allows use of NEON SIMD implementations of Converter operations. Does nothing on x86 systems. |
| EnableAvx2 | Allows use of AVX2 SIMD implementations of Converter operations. Does nothing on ARM. |
| EnableAvx512 | Not supported yet. |
| EnableCustomThreadPool | Enables the custom thread pool. On Windows you can otherwise use the Windows pool provided by ppl.h; performance varies depending on hardware. Does nothing on other platforms. |
| EnableDebugPrints | Enables debug prints. |
| ForceNaiveConversion | For test purposes only; when no SIMD is enabled, uses the fixed-point-approximation naïve converter. |

Usage Examples

Setting Converter Configuration

ConverterConfig config = Converter.GetCurrentConfig();
config.NumThreads = 4;
Converter.SetConfig(config);

Setting Converter Options

Converter.SetOption(ConverterOption.EnableAvx2, 1);
Converter.SetOption(ConverterOption.NumThreads, 8);

RGB to YUV Conversion

RgbImage rgbImage = new RgbImage(ImageFormat.Rgb, 800, 600);
YuvImage yuvImage = new YuvImage(800, 600);
Converter.Rgb2Yuv(rgbImage, yuvImage);

YUV to RGB Conversion

YuvImage yuvImage = new YuvImage(800, 600);
RgbImage rgbImage = new RgbImage(ImageFormat.Rgb, 800, 600);
Converter.Yuv2Rgb(yuvImage, rgbImage);

Image Downscaling

RgbImage fromImage = new RgbImage(ImageFormat.Rgb, 1600, 1200);
RgbImage toImage = new RgbImage(ImageFormat.Rgb, 800, 600);
Converter.Downscale(fromImage, toImage, 2);

Allocating and Freeing Native Memory

IntPtr nativePtr = Converter.AllocAllignedNative(1024 * 1024); // Allocate 1 MB, aligned to 64 bytes
Converter.FreeAllignedNative(nativePtr);

Remarks

  • Parallelization:

    • The color format conversion process (RGB to YUV and vice versa) supports optional parallelization via thread configuration.
    • Using a single thread minimizes CPU cycle consumption (less context switching and cache thrashing) and maximizes efficiency, but may result in slower conversion times.
    • Conversion performance depends heavily on factors such as image size, system memory speed, L3 cache size, core IPC (instructions per clock), cache performance, and other system-specific characteristics, so it varies with processor architecture.
    • On ARM systems in particular, the NEON implementations are very efficient and the bottleneck is memory speed and cache; in my tests, parallelization on ARM systems gave little to no benefit.
    • On Windows, an option switch selects either the Windows thread pool (ppl.h) or the custom thread pool. The custom thread pool performed significantly better on the AMD system I tested and the same on the Intel system. Non-Windows platforms always use the custom pool.
    • Setting NumThreads to 1 or 0 disables the thread pool.
  • RGB to YUV Conversion SIMD Support:

    • SIMD (Single Instruction, Multiple Data) implementations are significantly faster than naïve or compiler auto-vectorized implementations.
    • SIMD support can be configured for RGB to YUV conversions.
    • By default, the highest supported instruction set (e.g., AVX2, SSE) is automatically selected at runtime; if AVX2 is available, the SSE version will not be executed.
    • NEON instruction sets are used on ARM architectures and are inactive on x86 systems; likewise, SSE/AVX2 are inactive on ARM.

Converter Benchmarks

H264Sharp conversion operations are up to 2.9x faster than OpenCV implementations.

1080p 5000 Iterations of RGB -> YUV and YUV -> RGB, CustomThreadPool
AMD Ryzen 7 3700X Desktop CPU

| #Threads | OpenCV (ms) | H264Sharp (ms) |
| --- | --- | --- |
| 1 | 11919 | 4899 |
| 2 | 6205 | 2479 |
| 4 | 3807 | 1303 |
| 8 | 2543 | 822 |
| 16 | 2462 | 824 |

Intel i7 10600U Laptop CPU
1080p 5000 Iterations of RGB -> YUV and YUV -> RGB, CustomThreadPool

| #Threads | OpenCV (ms) | H264Sharp (ms) |
| --- | --- | --- |
| 1 | 11719 | 6010 |
| 2 | 6600 | 3210 |
| 4 | 4304 | 2803 |
| 8 | 3560 | 1839 |

Arm Parallelization Performance

1080p, 1000 iterations.
Pixel 6 Pro, Google Tensor SOC

| #Threads | Yuv2Rgb (ms) | Rgb2Yuv (ms) |
| --- | --- | --- |
| 1 | 523 | 634 |
| 2 | 402 | 635 |
| 4 | 429 | 638 |
| 8 | 466 | 653 |

Developer Notes:

This development was extremely fun and challenging. The SIMD gains from instruction-level parallelism over naïve, auto-vectorized or table-based implementations were outstanding.

  • On the Pixel 6, encoding and decoding 1000 frames takes 7640 ms with the NEON implementation versus 12688 ms with the default implementation. Note that this is a full encode and decode operation, and the bulk of the time is spent in the encoder and decoder.

Custom Thread Pool

I have implemented an expandable thread pool where tasks (regions of the image) are delegated to threads and work stealing is employed. Each thread initially gets only one chunk per image, and once it finishes its own job it can steal from other threads' tasks.

There are some interesting observations I would like to share from benchmarking across various platforms:

Custom Thread Pool vs Microsoft Thread Pool

  • On AMD systems such as Ryzen, for some reason I could not pin down, the custom thread pool with futex (and its Windows equivalent) thread wake-up was significantly faster than the Microsoft thread pool implemented in ppl.h.

    • For example, 5000 iterations of YUV->RGB->YUV conversion with 16 threads takes 894 ms on the custom pool versus 2125 ms on the Microsoft pool on an AMD Ryzen 7 3700X, and this trend continued on later Ryzen generations.
  • On Intel systems, both pools perform similarly.

  • On ARM systems where cache is limited, such as the Raspberry Pi 5 (which is miles behind the Google Tensor chip of the Pixel), parallelization actually yields worse performance than a single thread (the more threads, the worse it gets).

Thread wake up times

  • Interestingly, thread wake-up time on Ryzen systems (WaitOnAddress and WakeByAddressSingle) was faster, about 20-25 microseconds, than on a high-end Intel i9 13800hx (about 50-60, occasionally up to 200). Yet Intel wins by sheer power.

  • And in the Raspberry Pi test, which runs Ubuntu Server as its OS, the latency was surprisingly low (8-9 microseconds).

If you have read up until this point, you are awesome. Cheers!