pocket-ai -- A Portable Toolkit for building AI Infra.
https://github.com/cjmcv/pocket-ai
- engine/cl: A small computing framework based on OpenCL, designed to help you quickly call the OpenCL API for the calculations you need.
- engine/vk: A small computing framework based on Vulkan, designed to help you quickly call Vulkan's compute API for the calculations you need.
- engine/graph: A small multitasking scheduler that can quickly build efficient pipelines for multiple tasks.
- engine/infer: A tiny inference engine for microprocessors, with a library size of only 10K+.
- eval/llm: A small tool for quickly verifying that end-to-end results remain correct while you accelerate and optimize a large language model (LLM) inference engine.
- Other small tools.
 
sglang, lighteval, cutlass, vllm, mlc-llm
cux -- An experimental framework for performance analysis and optimization of CUDA kernel functions.
https://github.com/cjmcv/hpc/tree/master/0-frameworks/cux
tag: cuda / simd / openmp.
mrpc -- Mini-RPC, based on asio.
https://github.com/cjmcv/hpc/tree/master/0-frameworks/mrpc
tag: distributed computing.
cuda
- base_graph : Record the basic usage of cuda graph.
 - base_unified_memory : A simple task consumer using threads and streams with all data in Unified Memory.
 - base_zero_copy : Zero Copy.
 - gemm_fp16_wmma : Gemm fp16 - wmma
 - gemm_fp32 : Gemm fp32 - cuda core
 - reduce_fp32.cu : Based on warp reduce.
 - marlin_kernel : Reading notes on the Marlin Kernel.
 
vulkan
- gemm_fp32 : Gemm fp32.
 
opencl
- basic_demo : Introduce the basic calling method and process of OpenCL API (without using pocket-ai).
 - gemm_f32 : Gemm fp32 for discrete graphics cards.
 - gemm_mobile_f32 : Gemm fp32 for integrated graphics cards.
 
neon
- gemm_fp32 : Gemm fp32.
 - gemm_int8 : Gemm int8.
 - matrix_transpose : Matrix Transpose.
 
sse/avx
- linear : Linear operator (fp32/bf16/int8)
 - matrix_transpose : Matrix Transpose (int32/fp32)
 - vector_scan : Scan. Prefix Sum.
 
mpi/mpi4py
- alg_matrix_multiply : gemm: C = A * B.
 - base_broadcast_scatter_gather : Record the basic usage of Bcast, Scatter, Gather and Allgather.
 - base_group : Group communication.
 - base_hello_world : Environment Management Routines.
 - base_reduce_alltoall_scan : Record the basic usage of Reduce, Allreduce, Alltoall, Scan and Exscan.
 - base_send_recv : Record the basic usage of MPI_Send/MPI_Recv and MPI_Isend/MPI_Irecv.
 - base_type_contiguous : Send and receive custom types of data by using MPI_Type_contiguous.
 - base_type_struct : Send and receive custom types of data by using MPI_Type_struct.
 - util_bandwidth_test : Test bandwidth by point-to-point communications.
 - py_base_broadcast_scatter_gather : Record the basic usage of Bcast, Scatter, Gather and Allgather.
 - py_base_reduce_scan : Record the basic usage of Reduce and Scan.
 - py_base_send_recv : Record the basic usage of Send and Recv.
 
std
- alg_quick_sort: Quick sort using std::thread.
 - alg_vector_dot_product: Vector dot product: h_result = SUM(A * B). Record the basic usage of std::thread and std::sync.
 - base_async: Record the basic usage of std::async.
 - util_blocking_queue: Blocking queue. Mainly implemented by thread, queue and condition_variable.
 - util_internal_thread: Internal Thread. Mainly implemented by std::thread.
 - util_thread_pool: Thread Pool. Mainly implemented by thread, queue, future and condition_variable.
 
openmp
- alg_matrix_multiply : gemm: C = A * B.
 - alg_pi_calculate : Calculate PI using parallel, for and reduction.
 - base_flush : Records the basic usage of flush.
 - base_mutex : Mutex operation in openmp, including critical, atomic, lock.
 - base_parallel_for : Parallel and For.
 - base_schedule : Records the basic usage of schedule.
 - base_sections_single : Records the basic usage of Sections and Single.
 - base_synchronous : Synchronization in openmp, including barrier, ordered and master.
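
The PI example above relies on a parallel loop with a reduction: each worker accumulates a private partial sum, and the partial sums are combined at the end. The sketch below mirrors that structure in Python. It is an illustration only, not the repository's OpenMP code: `partial_sum` and `calculate_pi` are assumed names, and the integrand 4/(1+x^2) with the midpoint rule is one common way to pose the problem.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_sum(start, stop, step):
    """Midpoint-rule sum of 4/(1+x^2) over sub-intervals start..stop-1."""
    total = 0.0
    for i in range(start, stop):
        x = (i + 0.5) * step
        total += 4.0 / (1.0 + x * x)
    return total


def calculate_pi(num_steps=100_000, num_workers=4):
    step = 1.0 / num_steps
    # Split the iteration space into contiguous chunks, one per worker,
    # similar in spirit to a static schedule.
    chunk = (num_steps + num_workers - 1) // num_workers
    ranges = [(lo, min(lo + chunk, num_steps))
              for lo in range(0, num_steps, chunk)]
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partials = pool.map(lambda r: partial_sum(r[0], r[1], step), ranges)
        # The final sum plays the role of the reduction step.
        return step * sum(partials)
```

Because of the interpreter lock in CPython, threads here show the decomposition rather than a speedup. The private-accumulator-then-combine shape is what the `reduction` clause automates in OpenMP.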
 
tbb
- base_allocator : The basic use of allocator.
 - base_atomic : The basic use of atomic.
 - base_concurrent_hash_map : The basic use of concurrent_hash_map.
 - base_concurrent_queue : The basic use of concurrent queue.
 - base_mutex : The basic use of mutex in tbb.
 - base_parallel_for : The basic use of parallel_for.
 - base_parallel_reduce : The basic use of parallel_reduce.
 - base_parallel_scan : The basic use of parallel_scan.
 - base_parallel_sort : The basic use of parallel_sort.
 - base_task_scheduler : The basic use of task_scheduler.
 - count_strings : Count strings using concurrent_hash_map.
 
libco
asyncio
- base_future: Record the basic usage of future.
 - base_gather: Use gather to execute tasks in parallel.
 - base_hello_world: Hello world. Record the basic usage of async, await and loop.
 - base_loop_chain: Executes nested coroutines.
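
The asyncio topics listed above (futures, `gather`, and nested coroutines) fit into one short sketch. The function names `square`, `sum_of_squares`, `set_later`, `main` and `run` are assumptions for illustration and do not come from the repository. Only standard-library asyncio calls are used.

```python
import asyncio


async def square(x):
    await asyncio.sleep(0)  # yield control to the event loop once
    return x * x


async def sum_of_squares(values):
    # gather runs the coroutines concurrently and returns results in input order.
    results = await asyncio.gather(*(square(v) for v in values))
    return sum(results)


async def set_later(future, value):
    await asyncio.sleep(0)
    future.set_result(value)  # completes the future for whoever awaits it


async def main():
    loop = asyncio.get_running_loop()
    future = loop.create_future()
    task = asyncio.create_task(set_later(future, "done"))
    total = await sum_of_squares([1, 2, 3])  # coroutine awaiting other coroutines
    status = await future                    # resumes once set_result has run
    await task
    return total, status


def run():
    # asyncio.run creates an event loop, runs main() to completion, and closes it.
    return asyncio.run(main())
```

Calling `run()` returns the tuple `(14, "done")`: 1 + 4 + 9 from the gathered coroutines, plus the value stored in the future by the background task.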