Commit ee82cf1 — Add CommandBlock, use device VBO
1 parent 23f4b50

11 files changed: +411 −15 lines

guide/src/SUMMARY.md (3 additions, 1 deletion)

@@ -39,4 +39,6 @@
 - [Memory Allocation](memory/README.md)
   - [Vulkan Memory Allocator](memory/vma.md)
   - [Buffers](memory/buffers.md)
-  - [Host Vertex Buffer](memory/host_vertex_buffer.md)
+  - [Vertex Buffer](memory/vertex_buffer.md)
+  - [Command Block](memory/command_block.md)
+  - [Device Buffers](memory/device_buffers.md)

guide/src/memory/command_block.md (new file: 84 additions)

# Command Block

Long-lived vertex buffers perform better when backed by Device memory, especially for 3D meshes. Data is transferred to device buffers in two steps:

1. Allocate a host buffer and copy the data to its mapped memory
1. Allocate a device buffer, record a Buffer Copy operation, and submit it

The second step requires a command buffer and queue submission (_and_ waiting for the submitted work to complete). Encapsulate this behavior in a class; it will also be used later for creating images:

```cpp
class CommandBlock {
  public:
    explicit CommandBlock(vk::Device device, vk::Queue queue,
                          vk::CommandPool command_pool);

    [[nodiscard]] auto command_buffer() const -> vk::CommandBuffer {
        return *m_command_buffer;
    }

    void submit_and_wait();

  private:
    vk::Device m_device{};
    vk::Queue m_queue{};
    vk::UniqueCommandBuffer m_command_buffer{};
};
```

The constructor takes an existing command pool created for such ad-hoc allocations, and the queue for later submission. This way a Command Block can be passed around after creation and used by other code.

```cpp
CommandBlock::CommandBlock(vk::Device const device, vk::Queue const queue,
                           vk::CommandPool const command_pool)
    : m_device(device), m_queue(queue) {
    // allocate a UniqueCommandBuffer which will free the underlying command
    // buffer from its owning pool on destruction.
    auto allocate_info = vk::CommandBufferAllocateInfo{};
    allocate_info.setCommandPool(command_pool)
        .setCommandBufferCount(1)
        .setLevel(vk::CommandBufferLevel::ePrimary);
    // all the current VulkanHPP functions for UniqueCommandBuffer allocation
    // return vectors.
    auto command_buffers = m_device.allocateCommandBuffersUnique(allocate_info);
    m_command_buffer = std::move(command_buffers.front());

    // start recording commands before returning.
    auto begin_info = vk::CommandBufferBeginInfo{};
    begin_info.setFlags(vk::CommandBufferUsageFlagBits::eOneTimeSubmit);
    m_command_buffer->begin(begin_info);
}
```

`submit_and_wait()` resets the unique command buffer at the end, freeing it back to its command pool:

```cpp
void CommandBlock::submit_and_wait() {
    if (!m_command_buffer) { return; }

    // end recording and submit.
    m_command_buffer->end();
    auto submit_info = vk::SubmitInfo2KHR{};
    auto const command_buffer_info =
        vk::CommandBufferSubmitInfo{*m_command_buffer};
    submit_info.setCommandBufferInfos(command_buffer_info);
    auto fence = m_device.createFenceUnique({});
    m_queue.submit2(submit_info, *fence);

    // wait for the submit fence to be signaled.
    static constexpr auto timeout_v =
        static_cast<std::uint64_t>(std::chrono::nanoseconds(30s).count());
    auto const result = m_device.waitForFences(*fence, vk::True, timeout_v);
    if (result != vk::Result::eSuccess) {
        std::println(stderr, "Failed to submit Command Buffer");
    }
    // free the command buffer.
    m_command_buffer.reset();
}
```

## Multithreading considerations

Instead of blocking the main thread on every Command Block's `submit_and_wait()`, you might wonder whether command block usage could be multithreaded. It can, with some extra work: each thread requires its own command pool. Just using one owned (unique) pool per Command Block (with no need to free the buffer) is a good starting point. All queue operations need to be synchronized, i.e. wrapped in a critical section protected by a mutex; this includes Swapchain acquire/present calls as well as Queue submissions. A `class Queue` value type that stores a copy of the `vk::Queue` and a pointer/reference to its `std::mutex`, and wraps the submit call, can be passed to Command Blocks. Just this much enables asynchronous asset loading and the like: each loading thread uses its own command pool, and all queue submissions become critical sections. `VmaAllocator` is internally synchronized (this can be disabled at build time), so performing allocations through the same allocator on multiple threads is safe.

For multi-threaded rendering, use a Secondary command buffer per thread to record rendering commands, then accumulate and execute them in the main (Primary) command buffer currently in `RenderSync`. This is not particularly helpful unless you have thousands of expensive draw calls or dozens of render passes; recording even a hundred draws will likely be faster on a single thread.

guide/src/memory/device_buffers.md (new file: 139 additions)

# Device Buffers

This guide only uses device buffers for vertex buffers, with both vertex and index data strung together in a single VBO. The create function can thus take the data and perform the buffer copy operation before returning; in essence the return value is a "GPU const" buffer. To allow passing separate spans for vertices and indices (instead of forcing allocation of a contiguous bytestream and copying the data into it), the create function takes a slightly awkward span of spans:

```cpp
// disparate byte spans.
using ByteSpans = std::span<std::span<std::byte const> const>;

// returns a Device Buffer with each byte span sequentially written.
[[nodiscard]] auto create_device_buffer(VmaAllocator allocator,
                                        vk::BufferUsageFlags usage,
                                        CommandBlock command_block,
                                        ByteSpans const& byte_spans) -> Buffer;
```

Implement `create_device_buffer()`:

```cpp
auto vma::create_device_buffer(VmaAllocator allocator,
                               vk::BufferUsageFlags usage,
                               CommandBlock command_block,
                               ByteSpans const& byte_spans) -> Buffer {
    auto const total_size = std::accumulate(
        byte_spans.begin(), byte_spans.end(), 0uz,
        [](std::size_t const n, std::span<std::byte const> bytes) {
            return n + bytes.size();
        });

    // create staging Host Buffer with TransferSrc usage.
    auto staging_buffer = create_host_buffer(
        allocator, vk::BufferUsageFlagBits::eTransferSrc, total_size);

    // create the Device Buffer, ensuring TransferDst usage.
    usage |= vk::BufferUsageFlagBits::eTransferDst;
    auto allocation_ci = VmaAllocationCreateInfo{};
    allocation_ci.usage = VMA_MEMORY_USAGE_AUTO_PREFER_DEVICE;
    allocation_ci.flags =
        VMA_ALLOCATION_CREATE_HOST_ACCESS_SEQUENTIAL_WRITE_BIT;
    auto ret = create_buffer(allocator, allocation_ci, usage, total_size);

    // can't do anything if either buffer creation failed.
    if (!staging_buffer.get().buffer || !ret.get().buffer) { return {}; }

    // copy byte spans into the staging buffer.
    auto dst = staging_buffer.get().mapped_span();
    for (auto const bytes : byte_spans) {
        std::memcpy(dst.data(), bytes.data(), bytes.size());
        dst = dst.subspan(bytes.size());
    }

    // record the buffer copy operation.
    auto buffer_copy = vk::BufferCopy2{};
    buffer_copy.setSize(total_size);
    auto copy_buffer_info = vk::CopyBufferInfo2{};
    copy_buffer_info.setSrcBuffer(staging_buffer.get().buffer)
        .setDstBuffer(ret.get().buffer)
        .setRegions(buffer_copy);
    command_block.command_buffer().copyBuffer2(copy_buffer_info);

    // submit and wait.
    // waiting here is necessary to keep the staging buffer alive while the GPU
    // accesses it through the recorded commands.
    // this is also why the function takes ownership of the passed CommandBlock
    // instead of just referencing it / taking a vk::CommandBuffer.
    command_block.submit_and_wait();

    return ret;
}
```

Add a command block pool to `App`, and a helper function to create command blocks:

```cpp
void App::create_cmd_block_pool() {
    auto command_pool_ci = vk::CommandPoolCreateInfo{};
    command_pool_ci
        .setQueueFamilyIndex(m_gpu.queue_family)
        // this flag indicates that the allocated Command Buffers will be
        // short-lived.
        .setFlags(vk::CommandPoolCreateFlagBits::eTransient);
    m_cmd_block_pool = m_device->createCommandPoolUnique(command_pool_ci);
}

auto App::create_command_block() const -> CommandBlock {
    return CommandBlock{*m_device, m_queue, *m_cmd_block_pool};
}
```

Update `create_vertex_buffer()` to create a quad with indices:

```cpp
template <typename T>
[[nodiscard]] constexpr auto to_byte_array(T const& t) {
    return std::bit_cast<std::array<std::byte, sizeof(T)>>(t);
}

// ...

void App::create_vertex_buffer() {
    // vertices of a quad.
    static constexpr auto vertices_v = std::array{
        Vertex{.position = {-0.5f, -0.5f}, .color = {1.0f, 0.0f, 0.0f}},
        Vertex{.position = {0.5f, -0.5f}, .color = {0.0f, 1.0f, 0.0f}},
        Vertex{.position = {0.5f, 0.5f}, .color = {0.0f, 0.0f, 1.0f}},
        Vertex{.position = {-0.5f, 0.5f}, .color = {1.0f, 1.0f, 0.0f}},
    };
    static constexpr auto indices_v = std::array{
        0u, 1u, 2u, 2u, 3u, 0u,
    };
    static constexpr auto vertices_bytes_v = to_byte_array(vertices_v);
    static constexpr auto indices_bytes_v = to_byte_array(indices_v);
    static constexpr auto total_bytes_v =
        std::array<std::span<std::byte const>, 2>{
            vertices_bytes_v,
            indices_bytes_v,
        };
    // we want to write total_bytes_v to a Device VertexBuffer | IndexBuffer.
    m_vbo = vma::create_device_buffer(m_allocator.get(),
                                      vk::BufferUsageFlagBits::eVertexBuffer |
                                          vk::BufferUsageFlagBits::eIndexBuffer,
                                      create_command_block(), total_bytes_v);
}
```

Update `draw()`:

```cpp
void App::draw(vk::CommandBuffer const command_buffer) const {
    m_shader->bind(command_buffer, m_framebuffer_size);
    // single VBO at binding 0 at no offset.
    command_buffer.bindVertexBuffers(0, m_vbo.get().buffer, vk::DeviceSize{});
    // u32 indices after an offset of 4 vertices.
    command_buffer.bindIndexBuffer(m_vbo.get().buffer, 4 * sizeof(Vertex),
                                   vk::IndexType::eUint32);
    // m_vbo has 6 indices.
    command_buffer.drawIndexed(6, 1, 0, 0, 0);
}
```
![VBO Quad](./vbo_quad.png)
File renamed without changes.

guide/src/memory/host_vertex_buffer.md renamed to guide/src/memory/vertex_buffer.md (3 additions, 3 deletions)

@@ -1,6 +1,6 @@
-# Host Vertex Buffer
+# Vertex Buffer

-The goal here is to move the hard-coded vertices in the shader to application code. For the time being we will use an ad-hoc Host type `vma::Buffer` and focus more on the rest of the infrastructure like vertex attributes.
+The goal here is to move the hard-coded vertices in the shader to application code. For the time being we will use an ad-hoc Host `vma::Buffer` and focus more on the rest of the infrastructure like vertex attributes.

 First add a new header, `vertex.hpp`:

@@ -97,6 +97,6 @@ command_buffer.bindVertexBuffers(0, m_vbo->get_raw().buffer,
 command_buffer.draw(3, 1, 0, 0);
 ```

-You should see the same triangle as before. But now we can use whatever set of vertices we like! The Primitive Topology is Triangle List by default, so every three vertices in the array is drawn as a triangle, e.g. for 9 vertices: `[[0, 1, 2], [3, 4, 5], [6, 7, 8]]`, where each inner `[]` represents a triangle comprised of the vertices at those indices.
+You should see the same triangle as before. But now we can use whatever set of vertices we like! The Primitive Topology is Triangle List by default, so every three vertices in the array is drawn as a triangle, e.g. for 9 vertices: `[[0, 1, 2], [3, 4, 5], [6, 7, 8]]`, where each inner `[]` represents a triangle comprised of the vertices at those indices. Try playing around with customized vertices and topologies, and use RenderDoc to debug unexpected output.

 Host Vertex Buffers are useful for primitives that are temporary and/or frequently changing, such as UI objects. A 2D framework can use such VBOs exclusively: a simple approach would be a pool of buffers per virtual frame, where for each draw a buffer is obtained from the current virtual frame's pool and vertices are copied in.

src/app.cpp (44 additions, 11 deletions)

@@ -1,5 +1,6 @@
 #include <app.hpp>
 #include <vertex.hpp>
+#include <bit>
 #include <cassert>
 #include <chrono>
 #include <fstream>

@@ -12,6 +13,11 @@ namespace lvk {
 using namespace std::chrono_literals;

 namespace {
+template <typename T>
+[[nodiscard]] constexpr auto to_byte_array(T const& t) {
+    return std::bit_cast<std::array<std::byte, sizeof(T)>>(t);
+}
+
 [[nodiscard]] auto locate_assets_dir() -> fs::path {
     // look for '<path>/assets/', starting from the working
     // directory and walking up the parent directory tree.

@@ -83,6 +89,7 @@ void App::run() {
     create_render_sync();
     create_imgui();
     create_shader();
+    create_cmd_block_pool();

     create_vertex_buffer();

@@ -254,26 +261,49 @@ void App::create_shader() {
     m_shader.emplace(shader_ci);
 }

+void App::create_cmd_block_pool() {
+    auto command_pool_ci = vk::CommandPoolCreateInfo{};
+    command_pool_ci
+        .setQueueFamilyIndex(m_gpu.queue_family)
+        // this flag indicates that the allocated Command Buffers will be
+        // short-lived.
+        .setFlags(vk::CommandPoolCreateFlagBits::eTransient);
+    m_cmd_block_pool = m_device->createCommandPoolUnique(command_pool_ci);
+}
+
 void App::create_vertex_buffer() {
-    // vertices previously hard-coded in the vertex shader.
+    // vertices of a quad.
     static constexpr auto vertices_v = std::array{
         Vertex{.position = {-0.5f, -0.5f}, .color = {1.0f, 0.0f, 0.0f}},
         Vertex{.position = {0.5f, -0.5f}, .color = {0.0f, 1.0f, 0.0f}},
-        Vertex{.position = {0.0f, 0.5f}, .color = {0.0f, 0.0f, 1.0f}},
+        Vertex{.position = {0.5f, 0.5f}, .color = {0.0f, 0.0f, 1.0f}},
+        Vertex{.position = {-0.5f, 0.5f}, .color = {1.0f, 1.0f, 0.0f}},
     };
-    // we want to write vertices_v to a Host VertexBuffer.
-    m_vbo = vma::create_host_buffer(m_allocator.get(),
-                                    vk::BufferUsageFlagBits::eVertexBuffer,
-                                    sizeof(vertices_v));
-
-    // host buffers have a memory-mapped pointer available to memcpy data to.
-    std::memcpy(m_vbo.get().mapped, vertices_v.data(), sizeof(vertices_v));
+    static constexpr auto indices_v = std::array{
+        0u, 1u, 2u, 2u, 3u, 0u,
+    };
+    static constexpr auto vertices_bytes_v = to_byte_array(vertices_v);
+    static constexpr auto indices_bytes_v = to_byte_array(indices_v);
+    static constexpr auto total_bytes_v =
+        std::array<std::span<std::byte const>, 2>{
+            vertices_bytes_v,
+            indices_bytes_v,
+        };
+    // we want to write total_bytes_v to a Device VertexBuffer | IndexBuffer.
+    m_vbo = vma::create_device_buffer(m_allocator.get(),
+                                      vk::BufferUsageFlagBits::eVertexBuffer |
+                                          vk::BufferUsageFlagBits::eIndexBuffer,
+                                      create_command_block(), total_bytes_v);
 }

 auto App::asset_path(std::string_view const uri) const -> fs::path {
     return m_assets_dir / uri;
 }

+auto App::create_command_block() const -> CommandBlock {
+    return CommandBlock{*m_device, m_queue, *m_cmd_block_pool};
+}
+
 void App::main_loop() {
     while (glfwWindowShouldClose(m_window.get()) == GLFW_FALSE) {
         glfwPollEvents();

@@ -450,7 +480,10 @@ void App::draw(vk::CommandBuffer const command_buffer) const {
     m_shader->bind(command_buffer, m_framebuffer_size);
     // single VBO at binding 0 at no offset.
     command_buffer.bindVertexBuffers(0, m_vbo.get().buffer, vk::DeviceSize{});
-    // m_vbo has 3 vertices.
-    command_buffer.draw(3, 1, 0, 0);
+    // u32 indices after offset of 4 vertices.
+    command_buffer.bindIndexBuffer(m_vbo.get().buffer, 4 * sizeof(Vertex),
+                                   vk::IndexType::eUint32);
+    // m_vbo has 6 indices.
+    command_buffer.drawIndexed(6, 1, 0, 0, 0);
 }
 } // namespace lvk

src/app.hpp (5 additions, 0 deletions)

@@ -1,4 +1,5 @@
 #pragma once
+#include <command_block.hpp>
 #include <dear_imgui.hpp>
 #include <gpu.hpp>
 #include <resource_buffering.hpp>

@@ -38,9 +39,11 @@ class App {
     void create_imgui();
     void create_allocator();
     void create_shader();
+    void create_cmd_block_pool();
     void create_vertex_buffer();

     [[nodiscard]] auto asset_path(std::string_view uri) const -> fs::path;
+    [[nodiscard]] auto create_command_block() const -> CommandBlock;

     void main_loop();

@@ -70,6 +73,8 @@ class App {
     std::optional<Swapchain> m_swapchain{};
     // command pool for all render Command Buffers.
     vk::UniqueCommandPool m_render_cmd_pool{};
+    // command pool for all Command Blocks.
+    vk::UniqueCommandPool m_cmd_block_pool{};
     // Sync and Command Buffer for virtual frames.
     Buffered<RenderSync> m_render_sync{};
     // Current virtual frame index.
