Foreword

Maybe it was to test the limits of my current abilities, to prove to myself that I haven’t forgotten the essence of engines, and to prepare for spring recruitment; maybe it was inspired by Animal Well, Games104, and the like; or maybe it was because I had recently dissected Godot’s and Jolt’s designs and they didn’t seem that difficult. In any case, I started writing my own small engine, and at the turn of the new year it reached some first results, hence this series.

Most readers here are probably insiders, so I’ll introduce the features briefly: tree-structured organization of objects, components for adding functionality, and scene serialization for saving and loading. Rendering uses OpenGL with GPU Instancing; simulation uses a thread pool with work stealing. A word on the name and icon: the name matches the Pokémon Ditto, because I initially planned to focus on physics simulation, as I personally value interactive experiences in games, which technically translates to simulation and animation in graphics. The icon is mainly inspired by Duoling from Rock Kingdom, a pet released around the time I quit that browser game; I really liked its backstory as “the legendary collector of souls, shaping its own body by absorbing the souls of other pets.” When designing the engine’s functional architecture, I likewise drew inspiration from mature engines.

img

Preparation

First, before writing anything, you should have a basic understanding of engine architecture and of your own capabilities. According to Games104, an engine contains five layers:

  • Function Layer: Provides the engine’s surface-level services: rendering, the game tick, gameplay values, etc.
  • Tool Layer: Provides interactive editing content for developers.
  • Core Layer: Provides underlying support. For example, memory management and vector calculations commonly used in 3D engines.
  • Resource Layer: Manages authored assets (models, textures, audio, and so on).
  • Platform Layer: Provides cross-platform support.

The platform layer is obviously beyond the scope of a small engine like mine. Handling the many image, video, audio, and model formats in the resource layer would be very time-consuming and labor-intensive; libraries could help, but let’s set that aside for now. The remaining three layers are essential for the engine to run, and the current project code is organized along these lines.

Ditto/
├── Engine/
│   ├── Core/        # Core layer
│   ├── Graphics/    # Graphics/rendering system (function layer)
│   ├── Physics/     # Physics system (function layer)
│   └── Resources/   # Resource layer
├── Editor/          # Tool layer
├── Assets/          # Project assets
├── 3rdParty/        # Third-party libraries
└── ...

Interface

Foreword

With the basic architecture in place, the first step is naturally rendering: whatever you build only becomes visible once it is drawn. I worked on the UI before the scene, because the UI sits on top of the scene and is the first thing you see, and having an editor interface makes the whole thing look much more like an engine.

Preparation

Implementing text, buttons, and input boxes by myself would be too time-consuming and labor-intensive, and the aesthetics would be questionable. So I went with Dear ImGui, which supports several rendering backends (OpenGL, DirectX, Vulkan). All controls are declared in code every frame, and performance is excellent. There is also a dedicated demo site showing what it can achieve. Basic usage looks like the following, somewhat similar to writing Unity Editor code.

ImGui::Begin("Settings Panel", nullptr, ImGuiWindowFlags_AlwaysAutoResize);

static bool showExtra = false;
ImGui::Checkbox("Show Advanced Options", &showExtra);

static float brightness = 0.8f;
ImGui::SliderFloat("Brightness", &brightness, 0.0f, 1.0f, "%.2f");

static int quality = 2;
const char* levels[] = { "Low", "Medium", "High" };
ImGui::Combo("Quality", &quality, levels, IM_ARRAYSIZE(levels));

if (showExtra) {
    ImGui::Separator();
    static float color[3] = { 1.0f, 1.0f, 1.0f };
    ImGui::ColorEdit3("Background Color", color);
    if (ImGui::Button("Reset")) {
        color[0] = color[1] = color[2] = 1.0f;
    }
    ImGui::SameLine();
    ImGui::Text("Click to restore default");
}

ImGui::End();

Drawing

Generally speaking, an engine has panels like Tool, Scene, Game, Hierarchy, File, Inspector, etc. I don’t want to implement the resource layer, so the File panel is out; I don’t want to implement multiple views, so Scene and Game are merged. Roughly speaking, I only need the Tool toolbar, the Hierarchy, the Scene view, and the Inspector.

void Editor::Draw()
{
    ImGui_ImplOpenGL3_NewFrame();
    ImGui_ImplGlfw_NewFrame();
    ImGui::NewFrame();

    DrawToolbar();
    if (showHierarchy) DrawHierarchy();
    if (showScene) DrawScene();
    if (showInspector) DrawInspector();
    DrawPopups();

    ImGui::Render();
    ImGui_ImplOpenGL3_RenderDrawData(ImGui::GetDrawData());
}

Drawing Toolbar

Starting from the top toolbar, since various extensions are not considered, the toolbar only has functions for saving/loading scenes and showing/hiding interfaces.

void Editor::DrawToolbar()
{
    if (ImGui::BeginMainMenuBar())
    {
        if (ImGui::BeginMenu("File"))
        {
            if (ImGui::MenuItem("Save Scene", "Ctrl+S")) showSavePopup = true;
            if (ImGui::MenuItem("Load Scene", "Ctrl+L")) showLoadPopup = true;
            ImGui::EndMenu();
        }
        if (ImGui::BeginMenu("View"))
        {
            if (ImGui::MenuItem("Toggle Hierarchy", NULL, showHierarchy)) showHierarchy = !showHierarchy;
            if (ImGui::MenuItem("Toggle Scene", NULL, showScene)) showScene = !showScene;
            if (ImGui::MenuItem("Toggle Inspector", NULL, showInspector)) showInspector = !showInspector;
            ImGui::EndMenu();
        }

        ImGui::SameLine(ImGui::GetWindowWidth() * 0.4f);
        if (engine->state == Engine::State::Edit)
        {
            ImGui::PushStyleColor(ImGuiCol_Button, ImVec4(0.2f, 0.7f, 0.2f, 1.0f));
            ImGui::PushStyleColor(ImGuiCol_ButtonHovered, ImVec4(0.3f, 0.8f, 0.3f, 1.0f));
            if (ImGui::Button("Play")) engine->SetEngineState(Engine::State::Play);
            ImGui::PopStyleColor(2);
        }
        else
        {
            ImGui::PushStyleColor(ImGuiCol_Button, ImVec4(0.8f, 0.2f, 0.2f, 1.0f));
            ImGui::PushStyleColor(ImGuiCol_ButtonHovered, ImVec4(0.9f, 0.3f, 0.3f, 1.0f));
            if (ImGui::Button("Stop")) engine->SetEngineState(Engine::State::Edit);
            ImGui::PopStyleColor(2);
        }

        ImGui::SameLine();
        if (engine->state != Engine::State::Stop)
        {
            if (ImGui::Button("Pause")) engine->SetEngineState(Engine::State::Stop);
        }
        else if (ImGui::Button("Conti"))
            engine->SetEngineState(Engine::State::Play);

        float windowWidth = ImGui::GetWindowWidth(), infoWidth = 300.0f;
        ImGui::SameLine(windowWidth - infoWidth);
        ImGui::Text("Scene:"); ImGui::SameLine();

        ImGui::PushItemWidth(150.0f);
        if (ImGui::InputText("##SceneName", sceneNameBuffer, sizeof(sceneNameBuffer), ImGuiInputTextFlags_EnterReturnsTrue))
            if (engine && engine->scene) engine->scene->name = sceneNameBuffer;
        ImGui::PopItemWidth();

        ImGui::EndMainMenuBar();
    }
}

Drawing Functional Panels

After the toolbar, the next three functional panels are Hierarchy, Scene view, and Inspector.

First is the Hierarchy panel. Since I use a tree hierarchy, the logic here is a bit more complex.

void Editor::DrawHierarchy()
{
    float menuBarHeight = ImGui::GetFrameHeight();
    float windowWidth = ImGui::GetIO().DisplaySize.x;
    float windowHeight = ImGui::GetIO().DisplaySize.y - menuBarHeight;
    float hierarchyWidth = 0.125f * windowWidth; // takes 1/8 width

    ImGui::SetNextWindowPos(ImVec2(0, menuBarHeight));
    ImGui::SetNextWindowSize(ImVec2(hierarchyWidth, windowHeight));
    ImGui::Begin("Hierarchy");

    if (ImGui::BeginPopupContextWindow())
    {
        if (ImGui::MenuItem("Create Cube"))
        {
            GameObject* cube = new GameObject("Cube");
            cube->AddComponent<RendererComponent>(RendererComponent::Type::Cube);
            engine->scene->gameObjects.push_back(cube);
            selectedObject = cube;
        }
        if (ImGui::MenuItem("Create Sphere"))
        {
            GameObject* sphere = new GameObject("Sphere");
            sphere->AddComponent<RendererComponent>(RendererComponent::Type::Sphere);
            engine->scene->gameObjects.push_back(sphere);
            selectedObject = sphere;
        }
        ImGui::EndPopup();
    }

    if (ImGui::BeginDragDropTarget())
    {
        if (const ImGuiPayload* payload = ImGui::AcceptDragDropPayload("GAMEOBJECT"))
        {
            GameObject* droppedObj = *(GameObject**)payload->Data;
            if (droppedObj)
            {
                if (droppedObj->parent) droppedObj->RemoveFromParent();
                else
                {
                    auto& rootList = engine->scene->gameObjects;
                    auto it = std::find(rootList.begin(), rootList.end(), droppedObj);
                    if (it != rootList.end()) rootList.erase(it);
                }
                engine->scene->gameObjects.push_back(droppedObj);
                droppedObj->parent = nullptr;
            }
        }
        ImGui::EndDragDropTarget();
    }

    for (GameObject* obj : engine->scene->gameObjects) DrawGameObjectNode(obj);

    ImGui::End();
}

Next is the Scene view. There’s no special logic beyond recording FPS and PPF (physics time per frame); since the engine focuses on physics and involves multithreading, fine-grained comparison matters.

void Editor::DrawScene()
{
    float menuBarHeight = ImGui::GetFrameHeight();
    float windowWidth = ImGui::GetIO().DisplaySize.x;
    float windowHeight = ImGui::GetIO().DisplaySize.y - menuBarHeight;
    float hierarchyWidth = 0.125f * windowWidth;
    float sceneWidth = 0.625f * windowWidth; // takes 5/8 width

    ImGui::SetNextWindowPos(ImVec2(hierarchyWidth, menuBarHeight));
    ImGui::SetNextWindowSize(ImVec2(sceneWidth, windowHeight));
    ImGui::Begin("Scene");
    ImGui::Text("Scene View");

    frame++; deltaTime += engine->deltaTime;
    if (deltaTime > 1.0f)
    {
        fps = frame / deltaTime;
        ppf = 1e6f * engine->physicsTime / engine->physicsCnt; // microseconds per physics step
        frame = 0; deltaTime = 0; engine->physicsCnt = 0; engine->physicsTime = 0;
    }
    else if (deltaTime < 0) deltaTime = 0;
    ImVec2 windowPos = ImGui::GetWindowPos();
    ImVec2 windowSize = ImGui::GetWindowSize();

    ImGui::GetForegroundDrawList()->AddText(
        ImVec2(windowPos.x + windowSize.x - 80, windowPos.y + 20),
        IM_COL32(0, 255, 0, 255), ("FPS: " + std::to_string((int)fps)).c_str()
    );
    if (engine->state == Engine::State::Play)
        ImGui::GetForegroundDrawList()->AddText(
            ImVec2(windowPos.x + windowSize.x - 80, windowPos.y + 40),
            IM_COL32(0, 255, 0, 255), ("PPF: " + std::to_string((int)ppf)).c_str()
        );
    ImGui::End();
}

Finally, the Inspector panel. Specific components are drawn by the object itself.

void Editor::DrawInspector()
{
    float menuBarHeight = ImGui::GetFrameHeight();
    float windowWidth = ImGui::GetIO().DisplaySize.x;
    float windowHeight = ImGui::GetIO().DisplaySize.y - menuBarHeight;
    float hierarchyWidth = 0.125f * windowWidth;
    float sceneWidth = 0.625f * windowWidth;
    float inspectorWidth = 0.25f * windowWidth;

    ImGui::SetNextWindowPos(ImVec2(hierarchyWidth + sceneWidth, menuBarHeight));
    ImGui::SetNextWindowSize(ImVec2(inspectorWidth, windowHeight));
    ImGui::Begin("Inspector");

    if (!selectedObject) { ImGui::End(); return; }

    if (engine->state == Engine::State::Play) ImGui::BeginDisabled();
    selectedObject->OnInspectorGUI();
    if (engine->state == Engine::State::Play) ImGui::EndDisabled();

    ImGui::End();
}

Afterword

After that, the prototypes of the various panels emerge. Here you could use ImGui’s main branch or its docking branch; with docking you also get window docking and the like. But I’m lazy, so for now it stays like this.

Rendering

Foreword

With the interface written, it’s time to fill it with content. Since I’m more familiar with OpenGL, I use it directly for rendering. If you’re unfamiliar, refer to LearnOpenGL, which I think is simpler than DirectX, not to mention Vulkan. For rendering models, simple geometric shapes could just hardcode their vertex/normal data, but that would gut the resource layer. So I wrote a basic reader for Blender-exported models and render them with OpenGL.

Model Reading

When exporting an obj from recent Blender versions, ignore the accompanying mtl file (it stores material factors). The format is roughly: vertices v, normals vn, texture coordinates vt, face indices f. Note that current modeling software seems to default to quads, so you need to explicitly enable triangulated output.

v 0.500000 0.500000 -0.500000
...
vn -0.0000 1.0000 -0.0000
...
vt 0.875000 0.500000
...
f 5/1/1 3/2/1 1/3/1
...

Textures are not considered for now. When passing data to OpenGL, the same position can appear with different normals, so an EBO doesn’t help much; the engine simply stores the expanded vertices in a vector.

struct ModelData
{
    std::string modelName; int vertexCount;
    std::vector<float> vertexData;
    ModelData(const std::string& path);
    struct FaceIndices { int posIdx, texIdx, normIdx; };
    FaceIndices ParseFaceIndices(const std::string& token);
};

ModelData::ModelData(const std::string& path)
{
    std::ifstream file(path);

    std::vector<glm::vec3> positions, normals;
    //std::vector<glm::vec2> texCoords;

    std::string line;
    int lineNum = 0;

    while (std::getline(file, line))
    {
        lineNum++;

        if (!line.empty() && line.back() == '\r') line.pop_back();
        if (line.empty() || line[0] == '#') continue;

        std::istringstream lineStream(line);
        std::string prefix;
        lineStream >> prefix;

        if (prefix == "v")
        {
            glm::vec3 pos;
            if (lineStream >> pos.x >> pos.y >> pos.z) positions.push_back(pos);
        }
        else if (prefix == "vn")
        {
            glm::vec3 norm;
            if (lineStream >> norm.x >> norm.y >> norm.z) normals.push_back(glm::normalize(norm));
        }
        else if (prefix == "f")
        {
            std::vector<std::string> faceTokens;
            std::string token;

            while (lineStream >> token) faceTokens.push_back(token);

            for (const auto& faceToken : faceTokens)
            {
                FaceIndices indices = ParseFaceIndices(faceToken);

                glm::vec3 pos = glm::vec3(0.0f);
                if (indices.posIdx >= 0 && indices.posIdx < positions.size()) pos = positions[indices.posIdx];
                vertexData.push_back(pos.x); vertexData.push_back(pos.y); vertexData.push_back(pos.z);

                glm::vec3 norm = glm::vec3(0.0f, 1.0f, 0.0f);
                if (indices.normIdx >= 0 && indices.normIdx < normals.size()) norm = normals[indices.normIdx];
                vertexData.push_back(norm.x); vertexData.push_back(norm.y); vertexData.push_back(norm.z);
            }
        }
    }

    file.close();
    vertexCount = vertexData.size() / 6;
}
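ParseFaceIndices is declared above but its body isn’t shown. A minimal sketch of my own (not necessarily the engine’s exact code) that handles the `v`, `v/vt`, `v//vn`, and `v/vt/vn` token forms, converting OBJ’s 1-based indices to 0-based and leaving missing fields at -1:

```cpp
#include <string>

struct FaceIndices { int posIdx = -1, texIdx = -1, normIdx = -1; };

// Parse one face token such as "5/1/1", "5//1", "5/1", or "5".
// OBJ indices are 1-based; convert to 0-based and keep -1 for missing fields.
FaceIndices ParseFaceIndices(const std::string& token)
{
    FaceIndices out;
    int values[3] = { 0, 0, 0 };
    int field = 0;
    std::string number;

    auto flush = [&]() {
        if (!number.empty() && field < 3) values[field] = std::stoi(number);
        number.clear();
        ++field;
    };

    for (char c : token)
        if (c == '/') flush();
        else number += c;
    flush();

    if (values[0] != 0) out.posIdx  = values[0] - 1;
    if (values[1] != 0) out.texIdx  = values[1] - 1;
    if (values[2] != 0) out.normIdx = values[2] - 1;
    return out;
}
```

Note that negative (relative) OBJ indices are not handled here, since Blender’s default export doesn’t emit them.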

Scene Rendering

Main Flow

In each frame’s render loop, per-instance data (model matrices and colors) are first collected into instancing batches by geometry type, then uploaded to GPU-side Shader Storage Buffer Objects (SSBOs) for use by the vertex shader.

void Scene::Render(Shader* shader, const glm::mat4& view, const glm::mat4& projection,
    const glm::vec3& viewPos, int viewportWidth, int viewportHeight)
{
    CollectRenderData();
    UpdateSSBOs();

    glUseProgram(shader->id);
    shader->SetUniformMat4("view", view);
    shader->SetUniformMat4("projection", projection);
    shader->SetUniformVec3("viewPos", viewPos);
    shader->SetUniformVec3("lightCol", GetLightColor());
    shader->SetUniformVec3("lightDir", GetLightDirection());
    shader->SetUniform1f("lightIntensity", GetLightIntensity());

    for (auto& pair : geometryBatches)
    {
        GeometryInstances* batch = pair.second;
        if (batch->instanceCount == 0) continue;
        auto geoIt = baseGeometries.find(batch->type);
        if (geoIt == baseGeometries.end()) continue;

        const BaseGeometry& geometry = geoIt->second;

        glBindBuffer(GL_SHADER_STORAGE_BUFFER, batch->modelSSBO);
        glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 0, batch->modelSSBO);

        glBindBuffer(GL_SHADER_STORAGE_BUFFER, batch->colorSSBO);
        glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 1, batch->colorSSBO);
        glBindBuffer(GL_SHADER_STORAGE_BUFFER, 0);

        glBindVertexArray(geometry.VAO);

        if (geometry.indexCount > 0)
            glDrawElementsInstanced(GL_TRIANGLES, geometry.indexCount, GL_UNSIGNED_INT, 0, batch->instanceCount);
        else
            glDrawArraysInstanced(GL_TRIANGLES, 0, geometry.vertexCount, batch->instanceCount);

        glBindVertexArray(0);
    }
}

Object Components

When it comes to rendering, we have to mention the engine’s GameObject and Component design, which mainly uses compile-time constraints (concepts), perfect forwarding, and variadic templates. Each object/component maintains its own data and provides its own specialized interface and serialization methods.

struct Component { ... };

template<typename T>
concept DerivedFromComponent = std::derived_from<T, Component>;

struct GameObject
{
    template<DerivedFromComponent T, typename... Args>
    T* AddComponent(Args&&... args)
    {
        T* newComp = new T(std::forward<Args>(args)...);
        newComp->gameObject = this;
        components.push_back(newComp);
        compMask += newComp->index;
        return newComp;
    }

    template<DerivedFromComponent T>
    T* GetComponent() const
    {
        for (Component* comp : components)
            if (T* result = dynamic_cast<T*>(comp))
                return result;
        return nullptr;
    }
};

struct TransformComponent : Component
{
    glm::vec3 position, rotation, scale, forward;
    glm::mat4 model; mutable glm::mat4 worldModel;
};

struct LightComponent : Component
{
    glm::vec3 color;
    float intensity;
};

struct RendererComponent : Component
{
    enum Type { Cube, Sphere };
    Type type; glm::vec4 color;
};

struct RigidbodyComponent : Component
{
    enum Type { Static, Dynamic };
    Type type; float mass; bool useGravity; float damp, angularDamp;
    glm::vec3 velocity, angularVelocity; glm::mat3 inertia, inverseInertia;
};

GPU Instancing

The engine currently supports only a limited set of geometries, and physics simulation involves many identical objects, so GPU Instancing is definitely worth trying. Data is split into base geometry and per-instance model matrices/colors, organized with unordered_map; the collection step is omitted here.

struct Scene
{
    std::vector<GameObject*> gameObjects;

    std::unordered_map<RendererComponent::Type, BaseGeometry> baseGeometries;
    std::unordered_map<RendererComponent::Type, GeometryInstances*> geometryBatches;
};

struct BaseGeometry
{
    GLuint VAO = 0, VBO = 0, EBO = 0;
    uint32_t vertexCount = 0, indexCount = 0;
};

struct GeometryInstances
{
    RendererComponent::Type type;
    std::vector<glm::mat4> modelMatrices;
    std::vector<glm::vec4> instanceColors;
    GLuint modelSSBO = 0, colorSSBO = 0;
    size_t instanceCount = 0;
};

Shader Processing

For the specific shader encapsulation, refer to LearnOpenGL; here I’ll only introduce the SSBO used for GPU Instancing. SSBOs were introduced in OpenGL 4.3; they are slightly slower than UBOs but don’t require a fixed size declared up front. Both UBOs and SSBOs come with special alignment rules (std140 and std430, respectively) that need extra attention when writing.
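The classic std430 pitfall is worth a concrete illustration: a vec3 is aligned to 16 bytes, so a GLSL `vec3` array strides 16 bytes per element while a tightly packed CPU array strides 12, drifting out of sync after the first element. This is one reason the engine’s SSBOs use mat4/vec4, whose CPU and std430 layouts match exactly. A small sketch (Vec3 and InstanceData are hypothetical names, not from the engine):

```cpp
#include <cstddef>

struct Vec3 { float x, y, z; };

// CPU-side mirror of a hypothetical std430 block { vec3 position; float scale; }.
// std430 aligns vec3 to 16 bytes; alignas reproduces that on the CPU so the
// struct can be memcpy'd into the SSBO directly.
struct alignas(16) InstanceData
{
    alignas(16) Vec3 position; // bytes 0..11
    float scale;               // byte 12: a scalar may occupy the vec3's padding
};

static_assert(offsetof(InstanceData, scale) == 12, "scale fills the vec3 padding");
static_assert(sizeof(InstanceData) == 16, "one 16-byte std430 slot per instance");
```

With mat4 (four vec4 columns, 64 bytes) and vec4 (16 bytes) there is no such mismatch, so `glBufferData(..., matrices.size() * sizeof(glm::mat4), matrices.data(), ...)` uploads are safe.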

The vertex shader needs no special handling: positions go through MVP, and normals through the transpose of the inverse model matrix.

#version 460 core

layout(location = 0) in vec3 aPos;
layout(location = 1) in vec3 aNormal;

layout(std430, binding = 0) readonly buffer ModelMatrices { mat4 model[]; };
layout(std430, binding = 1) readonly buffer InstanceColors { vec4 color[]; };

out vec3 pos;
out vec3 normal;
out vec4 vertexColor;

uniform mat4 view;
uniform mat4 projection;

void main()
{
    mat4 instanceModel = model[gl_InstanceID];
    vertexColor = color[gl_InstanceID];

    vec4 worldPos = instanceModel * vec4(aPos, 1.0); pos = worldPos.xyz;
    normal = normalize(transpose(inverse(mat3(instanceModel))) * aNormal);

    gl_Position = projection * view * worldPos;
}

The fragment shader uses basic Phong shading, since PBR would call for roughness/metalness textures.

#version 460 core

in vec3 pos;
in vec3 normal;
in vec4 vertexColor;
out vec4 col;

uniform vec3 lightCol;
uniform vec3 lightDir;
uniform vec3 viewPos;
uniform float lightIntensity;

const float ambientStrength = 0.3;
const float diffuseStrength = 0.5;
const float specularStrength = 0.2;
const int shininess = 32;

void main()
{
    vec3 ambient = ambientStrength * lightCol;
    vec3 diffuse = diffuseStrength * max(0.0, dot(normal, -lightDir)) * lightCol;

    vec3 viewDir = normalize(viewPos - pos);
    vec3 reflectDir = reflect(lightDir, normal);
    float spec = pow(max(0.0, dot(viewDir, reflectDir)), shininess);
    vec3 specular = specularStrength * spec * lightCol;

    vec3 lighting = (ambient + diffuse + specular) * lightIntensity;
    col = vec4(lighting * vertexColor.xyz, vertexColor.w);
}

Afterword

With this, the Edit module of the engine is basically complete. Next is the performance after clicking Play, i.e., the physics layer. We physics programmers are like this: first set up various rendering environments before starting work, eventually becoming proficient in both physics and rendering…

Simulation

Foreword

As mentioned earlier, we need to handle the behavior during Play, so let’s get to it. Currently only rigid bodies are supported, because soft bodies would deform the mesh and break the GPU Instancing written earlier (GPU Instancing: hey, what would I eat then?). So for now the main work of the physics layer is collision handling. The broad phase uses AABBs; for these simple geometries the narrow phase would ideally use SAT, but for extensibility I went with GJK-EPA.

Preparation

Here’s a brief introduction to the aforementioned collision detection algorithms. For convenience, we’ll use 2D convex polygon examples and only cover the algorithm ideas.

Separating Axis Theorem (SAT)

The idea is simple: find an axis on which the two objects’ projections are separated; if no such axis exists, they collide. Using the objects’ own edges (and face normals) as candidate axes is fastest. In 3D, the complexity is O(n+m), where n and m are the edge counts.

img
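The 2D version of the idea can be sketched in a few lines (illustration only, not the engine’s code; V2 and the function names are my own):

```cpp
#include <vector>
#include <algorithm>

struct V2 { float x, y; };

// Project a convex polygon onto an axis, returning the [lo, hi] interval.
static void Project(const std::vector<V2>& poly, V2 axis, float& lo, float& hi)
{
    lo = hi = poly[0].x * axis.x + poly[0].y * axis.y;
    for (const V2& p : poly)
    {
        float d = p.x * axis.x + p.y * axis.y;
        lo = std::min(lo, d); hi = std::max(hi, d);
    }
}

// Two convex polygons overlap iff no edge normal of either separates
// their projections.
bool SAT_Overlap(const std::vector<V2>& a, const std::vector<V2>& b)
{
    auto hasSeparatingEdge = [&](const std::vector<V2>& poly) {
        for (size_t i = 0; i < poly.size(); ++i)
        {
            V2 e = { poly[(i + 1) % poly.size()].x - poly[i].x,
                     poly[(i + 1) % poly.size()].y - poly[i].y };
            V2 axis = { -e.y, e.x }; // edge normal (unnormalized is fine)
            float aLo, aHi, bLo, bHi;
            Project(a, axis, aLo, aHi);
            Project(b, axis, bLo, bHi);
            if (aHi < bLo || bHi < aLo) return true; // separating axis found
        }
        return false;
    };
    return !(hasSeparatingEdge(a) || hasSeparatingEdge(b));
}
```

In 3D the candidate axes additionally include cross products of edge pairs, which is what makes 3D SAT considerably more tedious.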

GJK-EPA

GJK is built on simplices and the Minkowski difference. For collision detection between the blue and green regions, we form the orange region (their Minkowski difference). The orange region contains the origin, so a collision occurs. Its shape comes from the shapes of the two regions (imagine the blue region as an empty frame and the green region as a constraint: pin a vertex of the blue region to an edge of the green one, then slide the blue region along the green region’s edges to sweep out the red shape), and its position comes from the objects’ relative positions (the difference between C and D is (1,1), so point J is at (1,1)). In 3D, the complexity is near-constant in practice, reaching O(n+m) only in extreme cases.

EPA is similar to GJK, but instead of searching for the origin it searches for the polytope face (edge, in 2D) closest to the origin, which yields the penetration depth and normal. In 3D, the complexity is O(k(n+m+f)).

img
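The primitive underpinning both GJK and EPA is the support function: the Minkowski difference never needs to be built explicitly, because its farthest point along a direction d is simply supportA(d) - supportB(-d). A sketch for convex point clouds (V3 and the function names are my own, not the engine’s):

```cpp
#include <vector>

struct V3 { float x, y, z; };

static float Dot(V3 a, V3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }

// Farthest vertex of a convex point cloud along direction d.
static V3 Support(const std::vector<V3>& verts, V3 d)
{
    V3 best = verts[0];
    for (const V3& v : verts)
        if (Dot(v, d) > Dot(best, d)) best = v;
    return best;
}

// Support point of the Minkowski difference A - B along d.
// Note: the difference set itself is never constructed.
V3 MinkowskiSupport(const std::vector<V3>& a, const std::vector<V3>& b, V3 d)
{
    V3 pa = Support(a, d);
    V3 pb = Support(b, { -d.x, -d.y, -d.z });
    return { pa.x - pb.x, pa.y - pb.y, pa.z - pb.z };
}
```

This is also why GJK generalizes so easily beyond polyhedra: any shape with a support function (spheres, capsules, swept volumes) plugs straight in.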

Physics Stepping

Main Flow

The main flow is roughly as follows. Since we’re dealing with rigid bodies, we use the impulse approach, with position correction as an auxiliary measure to prevent continuous penetration.

void Physics::UpdatePhysics(float dt)
{
    t += dt; if (t < deltaTime) return;
    int steps = (int)glm::min(3.0f, t / deltaTime); t = fmod(t, deltaTime);

    for (int step = 0; step < steps; ++step)
    {
        // 1. Clear the previous frame's collision data
        collisionData.clear(); colliderPairs.clear();
        // 2. Integrate forces (update linear and angular velocity)
        IntegrateForce(deltaTime);
        // 3. Update the BVH tree
        if (bvhTree) bvhTree->UpdateBVHTree();
        // 4. Broad-phase collision detection
        HandleBroadCollisions();
        // 5. Narrow-phase collision detection
        HandleNarrowCollisions();
        // 6. Several iterations of sequential impulse solving
        for (int iter = 0; iter < iterations; ++iter) SolveCollisions(iter);
        // 7. Apply position corrections
        ApplyPositionCorrections();
    }
    // Physics currently runs at 60 FPS while rendering is uncapped,
    // so refresh the model matrices after each physics update.
    for (auto collider : colliders) collider->transform->UpdateTransform();
}

Updating the BVH Tree

Here’s a brief look at the BVH update. For broad-phase AABB tests I built an AABB BVH based on SAH (the Surface Area Heuristic). Each physics frame it refreshes the dirty boxes and re-inserts one sampled leaf, keeping the tree in a relatively optimal shape over time.
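For reference, SAH scores a candidate split by its expected traversal cost relative to the parent’s surface area; a minimal sketch (the cost constants and names are my own, not the engine’s):

```cpp
struct AABB { float min[3], max[3]; };

static float SurfaceArea(const AABB& b)
{
    float dx = b.max[0] - b.min[0], dy = b.max[1] - b.min[1], dz = b.max[2] - b.min[2];
    return 2.0f * (dx * dy + dy * dz + dz * dx);
}

// Expected cost of splitting a parent (surface area saParent) into children
// holding nLeft / nRight primitives: the probability a ray hits a child is
// proportional to its surface area. Lower is better; compare against the
// leaf cost nTotal * costIntersect to decide whether to split at all.
float SAHCost(const AABB& left, int nLeft, const AABB& right, int nRight, float saParent)
{
    const float costTraverse = 1.0f, costIntersect = 2.0f;
    return costTraverse
         + (SurfaceArea(left)  / saParent) * nLeft  * costIntersect
         + (SurfaceArea(right) / saParent) * nRight * costIntersect;
}
```

During incremental reinsertion the same idea applies: a leaf is reinserted at the sibling position that increases total surface area the least.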

void BVHTree::UpdateBVHTree()
{
    if (!root) return;
    UpdateAllAABBs(root);
    SampleAndRebuild();
}

void BVHTree::UpdateAllAABBs(BVHNode* node)
{
    if (!node) return;

    if (node->isLeaf)
    {
        Collider* collider = node->data.leaf.collider;
        if (collider && collider->isDirty) { collider->UpdateWorldAABB(); node->aabb = collider->aabb; }
    }
    else
    {
        UpdateAllAABBs(node->data.child.left);
        UpdateAllAABBs(node->data.child.right);
        node->UpdateAABB();
    }
}

void BVHTree::SampleAndRebuild()
{
    if (leafNodes.empty()) return;
    BVHNode* nodeToReinsert = leafNodes[currentSampleIndex];
    ReinsertNode(nodeToReinsert);
    currentSampleIndex = (currentSampleIndex + 1) % leafNodes.size();
}

Broad Phase Detection

With the BVH in place, each collider’s AABB can be queried against the tree directly. To avoid recording the same collision pair twice, each candidate pair is checked against the existing list before being added.

void Physics::HandleBroadCollisions()
{
    for (Collider* collider : colliders)
    {
        if (collider->rigidbody->type == RigidbodyComponent::Dynamic)
        {
            std::vector<Collider*> potentialCollisions;
            if (bvhTree) potentialCollisions = bvhTree->Query(collider->aabb);

            for (Collider* other : potentialCollisions)
            {
                if (other == collider) continue;

                bool alreadyExists = false;
                for (const auto& pair : colliderPairs)
                    if ((pair.first == collider && pair.second == other) || (pair.first == other && pair.second == collider))
                    {
                        alreadyExists = true; break;
                    }

                if (!alreadyExists)
                    if (other->rigidbody->type == RigidbodyComponent::Dynamic || other->rigidbody->type == RigidbodyComponent::Static)
                        colliderPairs.push_back({ collider, other });
            }
        }
    }
}

Narrow Phase Detection

For the narrow phase, we use the aforementioned GJK-EPA. Compared to 2D, however, 3D is significantly more complex. For example, when updating the simplex we must handle each dimensional case separately:

  • 2 points (line segment): Determine the origin’s position relative to the line segment, update direction or degenerate to a single point.
  • 3 points (triangle): Determine the origin’s position relative to each edge of the triangle, update simplex and direction.
  • 4 points (tetrahedron): Check if the origin is inside the tetrahedron (i.e., relative to each face’s normal direction). If the origin is inside, return true; otherwise, discard the farthest face and update direction.
void Physics::HandleNarrowCollisions()
{
    for (auto& pair : colliderPairs)
    {
        Collider* colliderA = pair.first, * colliderB = pair.second;
        CollisionInfo collisionInfo = GJK_CheckCollision(colliderA, colliderB);

        if (collisionInfo.flag && collisionInfo.depth > 1e-3)
            collisionData.push_back(CollisionData(colliderA, colliderB, collisionInfo));
    }
}

CollisionInfo GJK_CheckCollision(Collider* colliderA, Collider* colliderB)
{
    CollisionInfo result;
    glm::vec3 direction = glm::vec3(1, 0, 0); // Initial direction

    std::vector<SupportPoint> simplex;
    simplex.push_back(GetMinkowskiSupport(colliderA, colliderB, direction));
    direction = -simplex[0].point;

    for (int i = 0; i < 50; ++i) // Limit iterations to prevent an endless loop
    {
        SupportPoint next = GetMinkowskiSupport(colliderA, colliderB, direction);
        if (glm::dot(next.point, direction) < 0) return result; // Support point didn't pass the origin: no collision

        simplex.push_back(next);
        // Simplex contains the origin: hand over to EPA for depth and normal
        if (UpdateSimplex(simplex, direction)) return EPA(simplex, colliderA, colliderB);
    }
    return result;
}

Impulse Solving

After obtaining collision information (collision points and depth), we need to use impulses to push back. Here are the steps in more detail:

  • Sequential Solving and Iteration: Each frame, collisions are processed in descending order of penetration depth, and multiple iterations (e.g., 4) are used to gradually converge and reduce error accumulation.
  • Normal Impulse: Calculate the normal impulse based on relative velocity, restitution coefficient, and penetration depth, used to separate objects and compensate for penetration.
  • Tangential Friction: Calculate static and dynamic friction separately based on relative tangential velocity, limiting the tangential impulse within the friction cone to simulate surface resistance.
  • Rigid Body Motion Update: Impulses change both linear and angular velocities, achieving realistic rotational responses through world-space inertia tensor conversion.
void Physics::SolveCollisions(int iter)
{
    for (auto& data : collisionData) data.processed = false;

    std::sort(collisionData.begin(), collisionData.end(),
        [](const CollisionData& a, const CollisionData& b) { return a.info.depth > b.info.depth; });

    for (auto& data : collisionData)
    {
        if (data.processed) continue;

        ApplyImpulse(data.colliderA, data.colliderB, data.info.normal,
            data.info.contactPointA, data.info.contactPointB, data.info.depth, iter);

        data.processed = true;
    }
}

void Physics::ApplyImpulse(Collider* a, Collider* b, const glm::vec3& normal,
    const glm::vec3& contactPointA, const glm::vec3& contactPointB,
    float penetrationDepth, int iteration)
{
    float invMassA = (a->rigidbody->type == RigidbodyComponent::Dynamic) ? 1.0f / a->rigidbody->mass : 0.0f;
    float invMassB = (b->rigidbody->type == RigidbodyComponent::Dynamic) ? 1.0f / b->rigidbody->mass : 0.0f;

    glm::mat3 invInertiaA = CalculateWorldInverseInertia(a), invInertiaB = CalculateWorldInverseInertia(b);
    glm::vec3 rA = contactPointA - a->transform->position, rB = contactPointB - b->transform->position;

    glm::vec3 velA = a->rigidbody->velocity + glm::cross(a->rigidbody->angularVelocity, rA);
    glm::vec3 velB = b->rigidbody->velocity + glm::cross(b->rigidbody->angularVelocity, rB);
    glm::vec3 relativeVel = velB - velA;

    float normalVel = glm::dot(relativeVel, normal);
    if (normalVel > 0.2f) return; // already separating

    glm::vec3 crossA = glm::cross(rA, normal);
    glm::vec3 crossB = glm::cross(rB, normal);

    glm::vec3 invInertiaCrossA = invInertiaA * crossA;
    glm::vec3 invInertiaCrossB = invInertiaB * crossB;

    float termA = invMassA + glm::dot(crossA, invInertiaCrossA);
    float termB = invMassB + glm::dot(crossB, invInertiaCrossB);

    float denominator = termA + termB;
    if (denominator == 0.0f) return;

    float biasFactor = 0.3f * (1.0f - float(iteration) / iterations);
    float bias = biasFactor * penetrationDepth / deltaTime;
    float j = -(1.0f + restitution) * normalVel + bias;
    j = glm::max(0.0f, j / denominator);

    glm::vec3 impulse = j * normal;

    // The normal points from A to B: B receives +impulse, A receives -impulse,
    // and the angular updates carry the same signs as the linear ones.
    if (a->rigidbody->type == RigidbodyComponent::Dynamic)
    {
        a->rigidbody->velocity -= impulse * invMassA;
        a->rigidbody->angularVelocity -= invInertiaA * glm::cross(rA, impulse);
    }

    if (b->rigidbody->type == RigidbodyComponent::Dynamic)
    {
        b->rigidbody->velocity += impulse * invMassB;
        b->rigidbody->angularVelocity += invInertiaB * glm::cross(rB, impulse);
    }

    glm::vec3 tangent = relativeVel - normal * normalVel;
    float tangentLen = glm::length(tangent);

    if (tangentLen > 1e-3)
    {
        tangent = glm::normalize(tangent);
        float tangentVel = glm::dot(relativeVel, tangent);

        glm::vec3 crossAT = glm::cross(rA, tangent);
        glm::vec3 crossBT = glm::cross(rB, tangent);

        glm::vec3 invInertiaCrossAT = invInertiaA * crossAT;
        glm::vec3 invInertiaCrossBT = invInertiaB * crossBT;

        float termAT = invMassA + glm::dot(crossAT, invInertiaCrossAT);
        float termBT = invMassB + glm::dot(crossBT, invInertiaCrossBT);

        float denominatorT = termAT + termBT;

        if (denominatorT != 0.0f)
        {
            float jt = -tangentVel / denominatorT;

            // Friction cone: clamp the tangential impulse by the normal impulse
            float friction = (fabs(tangentVel) < 0.01f) ? staticFriction : dynamicFriction;
            float maxFriction = friction * fabs(j);
            jt = glm::clamp(jt, -maxFriction, maxFriction);

            glm::vec3 tangentImpulse = jt * tangent;

            if (a->rigidbody->type == RigidbodyComponent::Dynamic)
            {
                a->rigidbody->velocity -= tangentImpulse * invMassA;
                a->rigidbody->angularVelocity -= invInertiaA * glm::cross(rA, tangentImpulse);
            }
            if (b->rigidbody->type == RigidbodyComponent::Dynamic)
            {
                b->rigidbody->velocity += tangentImpulse * invMassB;
                b->rigidbody->angularVelocity += invInertiaB * glm::cross(rB, tangentImpulse);
            }
        }
    }
}

Position Correction

The final step, pushing objects apart along the penetration direction, is not strictly necessary; it only corrects the residual penetration that the impulse method leaves behind. A physics engine could rely on impulses alone or go fully position-based (PBD), but since the impulse method is the more physically grounded of the two, this engine uses impulses as its primary solver.

void Physics::ApplyPositionCorrections()
{
    for (auto& data : collisionData)
    {
        if (data.info.depth > 1e-3f)
        {
            Collider* a = data.colliderA;
            Collider* b = data.colliderB;
            const CollisionInfo& info = data.info;

            float invMassA = (a->rigidbody->type == RigidbodyComponent::Dynamic) ? 1.0f / a->rigidbody->mass : 0.0f;
            float invMassB = (b->rigidbody->type == RigidbodyComponent::Dynamic) ? 1.0f / b->rigidbody->mass : 0.0f;
            float totalInvMass = invMassA + invMassB;
            if (totalInvMass == 0.0f) continue; // both static: avoid divide by zero

            // Split the push-out by inverse mass so heavier bodies move less.
            glm::vec3 correction = info.depth / totalInvMass * info.normal * positionCorrectionFactor;

            a->transform->position -= correction * invMassA;
            b->transform->position += correction * invMassB;

            a->isDirty = true; b->isDirty = true;
        }
    }
}
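As a sanity check on the inverse-mass split, here is a scalar version of the per-body correction (plain floats instead of glm; the function name and the factor value are mine, not the engine's): each body moves in proportion to its inverse mass, so a static body does not move at all.

```cpp
#include <cassert>

// Scalar sketch of the split used in ApplyPositionCorrections. The factor
// passed in plays the role of positionCorrectionFactor; its value here is
// an assumption, not the engine's actual setting.
float CorrectionFor(float invMassSelf, float invMassOther, float depth, float factor)
{
    float totalInvMass = invMassSelf + invMassOther;
    if (totalInvMass == 0.0f) return 0.0f; // both bodies static: nothing moves
    // Magnitude of this body's displacement along the contact normal.
    return depth / totalInvMass * factor * invMassSelf;
}
```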

Written behind

At this point, clicking Play finally makes things move, which is encouraging. But for a game/physics engine this is far from enough; at the very least we need to try multithreading. Currently, in the release build with 2000 collision pairs, a physics frame takes 330 microseconds and FPS drops to 400+, which gives us a baseline for later comparison.

Parallelism

Written before

We have already implemented the basic physics simulation pipeline, but its performance with a large number of collision pairs is still unsatisfactory, so the next step is parallelization. The first step is, of course, a thread pool to manage threads and tasks; after that, graph coloring will let us apply impulses and position corrections without locks.

Thread Pool

Since we’re going with a thread pool, we might as well go all the way and implement work stealing to balance the load. Here’s the general structure of the thread pool.

struct ThreadPool
{
    struct Worker
    {
        std::deque<std::function<void()>> tasks; // Task queue (a deque so others can steal from the tail)
        std::mutex mutex;                        // Mutex protecting this queue
    };

    std::vector<std::unique_ptr<Worker>> workers; // Local data for all worker threads
    std::vector<std::thread> threads;             // Worker threads
    std::atomic<bool> stop;                       // Stop flag
    std::atomic<size_t> next_thread_index;        // Index for round-robin task assignment
    std::mutex cv_mutex;                          // Mutex for the condition variable
    std::condition_variable cv;                   // Global condition variable to signal new tasks
};
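The post does not show the pool's constructor and destructor, so here is one plausible lifecycle sketch built around the same members: the constructor spawns the workers, the destructor raises the stop flag, wakes everyone, and joins. The name `MiniThreadPool` and the simplified worker loop (a timed wait instead of the `has_any_task()` predicate) are mine, chosen so the sketch stays self-contained.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <deque>
#include <functional>
#include <memory>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical lifecycle for a pool with the members shown above.
struct MiniThreadPool
{
    struct Worker { std::deque<std::function<void()>> tasks; std::mutex mutex; };

    std::vector<std::unique_ptr<Worker>> workers;
    std::vector<std::thread> threads;
    std::atomic<bool> stop{false};
    std::atomic<size_t> next_thread_index{0};
    std::mutex cv_mutex;
    std::condition_variable cv;

    explicit MiniThreadPool(size_t n)
    {
        for (size_t i = 0; i < n; ++i) workers.push_back(std::make_unique<Worker>());
        for (size_t i = 0; i < n; ++i) threads.emplace_back([this, i] { WorkerLoop(i); });
    }

    ~MiniThreadPool()
    {
        stop = true;      // ask workers to drain and exit
        cv.notify_all();  // wake any sleepers
        for (auto& t : threads) t.join();
    }

    void Enqueue(std::function<void()> task)
    {
        size_t target = next_thread_index.fetch_add(1, std::memory_order_relaxed) % workers.size();
        {
            std::lock_guard<std::mutex> lock(workers[target]->mutex);
            workers[target]->tasks.push_back(std::move(task));
        }
        cv.notify_one();
    }

    void WorkerLoop(size_t myIndex)
    {
        while (true)
        {
            std::function<void()> task;
            // Own queue head first, then other queues' tails (stealing).
            for (size_t offset = 0; offset < workers.size() && !task; ++offset)
            {
                Worker& w = *workers[(myIndex + offset) % workers.size()];
                std::lock_guard<std::mutex> lock(w.mutex);
                if (w.tasks.empty()) continue;
                if (offset == 0) { task = std::move(w.tasks.front()); w.tasks.pop_front(); }
                else             { task = std::move(w.tasks.back());  w.tasks.pop_back();  }
            }
            if (task) { task(); continue; }
            if (stop) return; // quit only once no task was found anywhere
            std::unique_lock<std::mutex> lock(cv_mutex);
            cv.wait_for(lock, std::chrono::milliseconds(1)); // timed wait: no predicate needed
        }
    }
};
```

The timed wait trades a little idle wakeup cost for simplicity; the engine's real loop below uses a predicate-guarded wait instead.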

Chunked Synchronization

Evenly divide the range into chunks, then create tasks and submit them to the thread pool.

void ThreadPool::ParallelFor(size_t start, size_t end, std::function<void(size_t)> func)
{
    if (start >= end) return;
    size_t total = end - start;
    size_t numThreads = workers.size();
    size_t chunkSize = (total + numThreads - 1) / numThreads; // ceiling division: one chunk per thread

    std::atomic<size_t> remaining(0);
    std::mutex local_mutex;
    std::condition_variable local_cv;

    for (size_t t = 0; t < numThreads; ++t)
    {
        size_t blockStart = start + t * chunkSize;
        size_t blockEnd = std::min(blockStart + chunkSize, end);
        if (blockStart >= blockEnd) break;

        remaining++;
        Enqueue([blockStart, blockEnd, &func, &remaining, &local_cv, &local_mutex]() {
            for (size_t i = blockStart; i < blockEnd; ++i)
                func(i);
            // The last chunk to finish wakes the caller; notifying under the
            // lock keeps the wakeup from racing the caller's predicate check.
            std::unique_lock<std::mutex> lock(local_mutex);
            if (--remaining == 0)
                local_cv.notify_one();
        });
    }

    // Block the caller until every chunk has run. Note: calling ParallelFor
    // from inside a pool task would deadlock, since the caller only waits.
    std::unique_lock<std::mutex> lock(local_mutex);
    local_cv.wait(lock, [&remaining] { return remaining == 0; });
}
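The ceiling-division chunking can be checked in isolation. This standalone helper (the name `Chunk` is mine) reproduces the block computation from ParallelFor:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Reproduces ParallelFor's chunking arithmetic: ceiling division yields at
// most numThreads blocks covering [start, end), the last possibly shorter.
std::vector<std::pair<size_t, size_t>> Chunk(size_t start, size_t end, size_t numThreads)
{
    std::vector<std::pair<size_t, size_t>> blocks;
    if (start >= end || numThreads == 0) return blocks;
    size_t total = end - start;
    size_t chunkSize = (total + numThreads - 1) / numThreads;
    for (size_t t = 0; t < numThreads; ++t)
    {
        size_t blockStart = start + t * chunkSize;
        size_t blockEnd = std::min(blockStart + chunkSize, end);
        if (blockStart >= blockEnd) break; // fewer blocks than threads
        blocks.emplace_back(blockStart, blockEnd);
    }
    return blocks;
}
```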

Task Submission

When a task is submitted, the thread pool picks a target worker in round-robin order and pushes the task onto the back of that worker's queue.

void ThreadPool::Enqueue(std::function<void()> task)
{
    // Round-robin select a target thread's local queue
    size_t target = next_thread_index.fetch_add(1, std::memory_order_relaxed) % workers.size();
    {
        std::unique_lock lock(workers[target]->mutex);
        workers[target]->tasks.push_back(std::move(task));
    }
    cv.notify_one(); // Wake up one waiting thread
}

Worker Loop

In the main loop, it preferentially takes tasks from the head of its own local queue. If its local queue is empty, it attempts to steal tasks from the tail of other threads’ queues, then executes the task or waits.

void ThreadPool::WorkerLoop(size_t myIndex)
{
    Worker& myWorker = *workers[myIndex];

    while (!stop)
    {
        std::function<void()> task;
        {
            // try_to_lock: if our queue is contended, fall through to stealing
            std::unique_lock lock(myWorker.mutex, std::try_to_lock);
            if (lock.owns_lock() && !myWorker.tasks.empty())
            {
                task = std::move(myWorker.tasks.front()); // own queue: take from the head
                myWorker.tasks.pop_front();
            }
        }

        if (!task)
        {
            size_t numWorkers = workers.size();
            for (size_t offset = 1; offset < numWorkers; ++offset)
            {
                size_t victimIdx = (myIndex + offset) % numWorkers;
                Worker& victim = *workers[victimIdx];
                std::unique_lock lock(victim.mutex, std::try_to_lock);
                if (lock.owns_lock() && !victim.tasks.empty())
                {
                    task = std::move(victim.tasks.back()); // steal from the tail
                    victim.tasks.pop_back();
                    break;
                }
            }
        }

        if (task) task();
        else
        {
            // Nothing anywhere: sleep until Enqueue notifies or the pool stops.
            std::unique_lock<std::mutex> lock(cv_mutex);
            cv.wait(lock, [this] { return stop || has_any_task(); });
        }
    }
}
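The `has_any_task()` used in the wait predicate never appears in the post; a plausible implementation simply scans every local queue under its mutex. The minimal surrounding struct here (named `PoolQueues` to keep it standalone) repeats only the members the function touches.

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <memory>
#include <mutex>
#include <vector>

// Sketch of the missing has_any_task() predicate: true if any worker's
// local queue holds a pending task.
struct PoolQueues
{
    struct Worker { std::deque<std::function<void()>> tasks; std::mutex mutex; };
    std::vector<std::unique_ptr<Worker>> workers;

    bool has_any_task()
    {
        for (auto& w : workers)
        {
            std::lock_guard<std::mutex> lock(w->mutex);
            if (!w->tasks.empty()) return true;
        }
        return false;
    }
};
```

Because the predicate re-reads the queues under their mutexes, a task enqueued just before the worker blocks is still observed, which is what makes the predicate-guarded wait safe.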

Parallel Simulation

Merging Data

All of the simulation steps can now be rewritten on top of the thread pool's chunked synchronization. One caveat: the broad and narrow phases write their results into shared arrays, so when they run in parallel, each chunk collects results into a thread-local buffer and the buffers are merged afterwards.

void ParallelPhysics::HandleNarrowCollisions()
{
    collisionData.clear();
    size_t numPairs = colliderPairs.size();
    if (numPairs == 0) return;

    size_t numThreads = threadPool.workers.size();
    size_t chunkSize = (numPairs + numThreads - 1) / numThreads;
    size_t numBlocks = (numPairs + chunkSize - 1) / chunkSize;

    // One scratch buffer per block so chunks never touch shared state.
    std::vector<std::vector<CollisionData>> localData(numBlocks);

    threadPool.ParallelFor(0, numBlocks, [&](size_t blockIdx) {
        size_t start = blockIdx * chunkSize;
        size_t end = std::min(start + chunkSize, numPairs);
        auto& local = localData[blockIdx];
        for (size_t i = start; i < end; ++i)
        {
            auto& pair = colliderPairs[i];
            CollisionInfo info = GJK_CheckCollision(pair.first, pair.second);
            if (info.flag && info.depth > 1e-3f)
                local.emplace_back(pair.first, pair.second, info);
        }
    });

    // Serial merge of the per-block buffers into the shared array.
    for (auto& local : localData)
        collisionData.insert(collisionData.end(), local.begin(), local.end());
}

Graph Coloring Grouping

For lock-free impulse solving and position correction we use graph-coloring groups: treat collision pairs as edges of a graph and rigid bodies as vertices, and assign each edge a color (group number) such that edges sharing a vertex never get the same color. The engine uses a greedy approximation to build the groups quickly. Collision pairs within one group then share no rigid body, so each group can be solved in parallel safely; the groups themselves run one after another.

void ParallelPhysics::BuildCollisionGroups() {
    collisionGroups.clear();
    if (collisionData.empty()) return;

    // Solve the deepest contacts first.
    std::sort(collisionData.begin(), collisionData.end(),
        [](const CollisionData& a, const CollisionData& b) {
            return a.info.depth > b.info.depth;
        });

    // Per-body bitmask of occupied groups: bit g set means the body already
    // has a pair in group g. A single group index per body is not enough,
    // since a body with several contacts occupies several groups at once.
    // Epochs let us skip clearing the arrays every frame.
    static std::vector<uint64_t> bodyMask;
    static std::vector<int> bodyEpoch;
    static int globalEpoch = 0;

    size_t maxId = 0;
    for (auto& data : collisionData) {
        maxId = std::max(maxId, (size_t)data.colliderA->id);
        maxId = std::max(maxId, (size_t)data.colliderB->id);
    }
    if (bodyMask.size() <= maxId) {
        bodyMask.resize(maxId + 1, 0);
        bodyEpoch.resize(maxId + 1, 0);
    }

    ++globalEpoch;

    for (auto& data : collisionData) {
        int aId = data.colliderA->id;
        int bId = data.colliderB->id;
        // Lazily reset masks left over from previous frames.
        if (bodyEpoch[aId] != globalEpoch) { bodyMask[aId] = 0; bodyEpoch[aId] = globalEpoch; }
        if (bodyEpoch[bId] != globalEpoch) { bodyMask[bId] = 0; bodyEpoch[bId] = globalEpoch; }

        // Greedy: take the lowest group free for both bodies. Capped at 64
        // groups; in the (unlikely) case every bit is taken, pairs spill
        // into group 63 and lose the no-sharing guarantee.
        uint64_t occupied = bodyMask[aId] | bodyMask[bId];
        int groupIdx = 0;
        while (groupIdx < 63 && ((occupied >> groupIdx) & 1)) ++groupIdx;

        if (groupIdx >= (int)collisionGroups.size())
            collisionGroups.emplace_back();
        collisionGroups[groupIdx].push_back(&data);

        uint64_t bit = 1ull << groupIdx;
        bodyMask[aId] |= bit;
        bodyMask[bId] |= bit;
    }
}
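The coloring idea is easy to verify in isolation. This standalone sketch (the function name and pair representation are mine) colors edges greedily with a per-vertex occupancy bitmask; a triangle of three mutually-touching bodies therefore needs three groups, while non-adjacent pairs share one.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Greedy edge coloring: pairs are edges, bodies are vertices. Each body keeps
// a bitmask of the groups it already occupies; every pair goes into the lowest
// group that is free for both of its bodies (capped at 64 groups here).
std::vector<std::vector<size_t>> ColorPairs(
    const std::vector<std::pair<int, int>>& pairs, int numBodies)
{
    std::vector<uint64_t> occupiedGroups(numBodies, 0);
    std::vector<std::vector<size_t>> groups;
    for (size_t e = 0; e < pairs.size(); ++e)
    {
        uint64_t used = occupiedGroups[pairs[e].first] | occupiedGroups[pairs[e].second];
        int g = 0;
        while (g < 63 && ((used >> g) & 1)) ++g; // lowest free color
        if ((size_t)g >= groups.size()) groups.emplace_back();
        groups[g].push_back(e);
        occupiedGroups[pairs[e].first]  |= 1ull << g;
        occupiedGroups[pairs[e].second] |= 1ull << g;
    }
    return groups;
}
```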

Written behind

With this approach, the release build with 2000 collision pairs spends 90 microseconds per physics frame at 1500+ FPS, versus 330 microseconds and 400+ FPS for the serial version, which is a significant improvement. Of course, I have to complain that multithreading is really hard to get right. Initially I didn't want to write it myself, so I tried OpenMP, TaskFlow, and PPL in turn, but their performance was comparable to serial. In the end I had to implement the multithreading myself. After the current version I also tried C++20 barriers, hoping for a further improvement, but performance was again comparable to serial. Tired, so let's leave it at that for now.