Learning and practice of high performance computing and AI infra

Application

pocket-ai -- A Portable Toolkit for building AI Infra.

https://github.com/cjmcv/pocket-ai

engine/cl: A small computing framework based on OpenCL, designed to help you quickly call the OpenCL API for the calculations you need.
engine/vk: A small computing framework based on Vulkan, designed to help you quickly call Vulkan's compute API for the calculations you need.
engine/graph: A small multitasking scheduler for quickly building efficient pipelines across multiple tasks.
engine/infer: A tiny inference engine for microprocessors, with a library size of only 10K+.
eval/llm: A small tool for quickly verifying that end-to-end results are still correct while accelerating and optimizing a large language model (LLM) inference engine.
Other small tools.

Reading Notes

ai-infra-notes

sglang, lighteval, cutlass, vllm, mlc-llm

Practice

cux -- An experimental framework for performance analysis and optimization of CUDA kernel functions.

https://github.com/cjmcv/hpc/tree/master/0-frameworks/cux

tag: cuda / simd / openmp.

mrpc -- Mini-RPC, based on asio.

https://github.com/cjmcv/hpc/tree/master/0-frameworks/mrpc

tag: distributed computing.

Learning

Heterogeneous computing

cuda

base_graph : Record the basic usage of cuda graph.
base_unified_memory : A simple task consumer using threads and streams with all data in Unified Memory.
base_zero_copy : Zero Copy.
gemm_fp16_wmma : Gemm fp16 - wmma
gemm_fp32 : Gemm fp32 - cuda core
reduce_fp32.cu : Based on warp reduce.
marlin_kernel : Reading notes on the Marlin Kernel.

vulkan

gemm_fp32 : Gemm fp32.

opencl

basic_demo : Introduces the basic calling sequence of the OpenCL API (without using pocket-ai).
gemm_f32 : Gemm fp32 for discrete graphics cards.
gemm_mobile_f32 : Gemm fp32 for integrated graphics cards.

SIMD

neon

gemm_fp32 : Gemm fp32.
gemm_int8 : Gemm int8.
matrix_transpose : Matrix Transpose.

sse/avx

linear : Linear operator (fp32/bf16/int8)
matrix_transpose : Matrix Transpose (int32/fp32)
vector_scan : Scan. Prefix Sum.
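
For readers new to the scan entry above, here is a minimal portable sketch of an inclusive prefix sum written in the log-step (shift and add) pattern that SIMD scans typically build on. It is not the code in vector_scan: the function name InclusiveScanLogStep is hypothetical, and plain loops stand in for the SSE/AVX intrinsics.

```cpp
#include <cstddef>
#include <vector>

// Inclusive prefix sum in the log-step (Hillis-Steele) pattern.
// Hypothetical illustration only: each pass adds the element `offset`
// positions to the left, and `offset` doubles every pass. A SIMD version
// would perform the inner loop with a register shift and a vector add.
std::vector<int> InclusiveScanLogStep(std::vector<int> data) {
    const std::size_t n = data.size();
    std::vector<int> prev(n);
    for (std::size_t offset = 1; offset < n; offset *= 2) {
        prev = data;  // snapshot so every lane reads the previous pass
        for (std::size_t i = offset; i < n; ++i) {
            data[i] = prev[i] + prev[i - offset];
        }
    }
    return data;
}
```

For the input {1, 2, 3, 4, 5} the passes use offsets 1, 2 and 4, and the result is {1, 3, 6, 10, 15}.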

Distributed computing

mpi/mpi4py

alg_matrix_multiply : gemm: C = A * B.
base_broadcast_scatter_gather : Record the basic usage of Bcast, Scatter, Gather and Allgather.
base_group : Group communication.
base_hello_world : Environment Management Routines.
base_reduce_alltoall_scan : Record the basic usage of Reduce, Allreduce, Alltoall, Scan and Exscan.
base_send_recv : Record the basic usage of MPI_Send/MPI_Recv and MPI_Isend/MPI_Irecv.
base_type_contiguous : Send and receive custom types of data by using MPI_Type_contiguous.
base_type_struct : Send and receive custom types of data by using MPI_Type_struct.
util_bandwidth_test : Test bandwidth by point-to-point communications.
py_base_broadcast_scatter_gather : Record the basic usage of Bcast, Scatter, Gather and Allgather.
py_base_reduce_scan : Record the basic usage of Reduce and Scan.
py_base_send_recv : Record the basic usage of Send and Recv.

Thread

std

alg_quick_sort: Quick sort using std::thread.
alg_vector_dot_product: Vector dot product: h_result = SUM(A * B). Record the basic usage of std::thread and std::sync.
base_async: Record the basic usage of std::async.
util_blocking_queue: Blocking queue. Mainly implemented by thread, queue and condition_variable.
util_internal_thread: Internal Thread. Mainly implemented by std::thread.
util_thread_pool: Thread Pool. Mainly implemented by thread, queue, future and condition_variable.
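
As a rough illustration of the util_blocking_queue idea (thread, queue and condition_variable), here is a minimal sketch. It is not the repository's implementation: the class name BlockingQueue, its Push/Pop methods and the SumThroughQueue helper are all assumptions made for this example.

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>

// Hypothetical minimal blocking queue: Pop() sleeps until an item arrives.
template <typename T>
class BlockingQueue {
public:
    void Push(T value) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            queue_.push(std::move(value));
        }
        not_empty_.notify_one();  // wake one waiting consumer
    }

    T Pop() {
        std::unique_lock<std::mutex> lock(mutex_);
        // The predicate guards against spurious wakeups.
        not_empty_.wait(lock, [this] { return !queue_.empty(); });
        T value = std::move(queue_.front());
        queue_.pop();
        return value;
    }

private:
    std::mutex mutex_;
    std::condition_variable not_empty_;
    std::queue<T> queue_;
};

// Hypothetical usage: one producer thread pushes 1..n while the caller
// pops exactly n items and sums them.
int SumThroughQueue(int n) {
    BlockingQueue<int> queue;
    std::thread producer([&queue, n] {
        for (int i = 1; i <= n; ++i) queue.Push(i);
    });
    int sum = 0;
    for (int i = 0; i < n; ++i) sum += queue.Pop();
    producer.join();
    return sum;
}
```

Releasing the lock before notify_one avoids waking a consumer that would immediately block on the mutex again. A thread pool can be layered on the same pattern by queueing callable tasks instead of integers.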

openmp

alg_matrix_multiply : gemm: C = A * B.
alg_pi_calculate : Calculate PI using parallel, for and reduction.
base_flush : Records the basic usage of flush.
base_mutex : Mutex operation in openmp, including critical, atomic, lock.
base_parallel_for : Parallel and For.
base_schedule : Records the basic usage of schedule.
base_sections_single : Records the basic usage of Sections and Single.
base_synchronous : Synchronous operation in openmp, including barrier, ordered and master.

tbb

base_allocator : The basic use of allocator.
base_atomic : The basic use of atomic.
base_concurrent_hash_map : The basic use of concurrent_hash_map.
base_concurrent_queue : The basic use of concurrent queue.
base_mutex : The basic use of mutex in tbb.
base_parallel_for : The basic use of parallel_for.
base_parallel_reduce : The basic use of parallel_reduce.
base_parallel_scan : The basic use of parallel_scan.
base_parallel_sort : The basic use of parallel_sort.
base_task_scheduler : The basic use of task_scheduler.
count_strings : Count strings. Use the concurrent_hash_map.

Coroutines

libco

asyncio

base_future: Record the basic usage of future.
base_gather: Use gather to execute tasks in parallel.
base_hello_world: Hello world. Record the basic usage of async, await and loop.
base_loop_chain: Executes nested coroutines.
