Maybe it was to test the limits of my current abilities, prove to myself that I haven’t forgotten the essence of engines, and prepare for spring recruitment; maybe it was inspired by Animal Well, Games104, etc.; or maybe it was because recently I dissected Godot and Jolt’s design and it didn’t seem that difficult. Anyway, I started writing my own small engine, and at the turn of the new year, I achieved some initial results, hence this series.
Most readers here should be insiders, so I’ll briefly introduce the features: tree-structured organization of objects, components adding functionality, scene serialization for saving and loading. Rendering uses OpenGL + GPU Instancing, simulation uses a thread pool with work stealing. Let me also explain the name and icon. The name is the same as the Pokémon Ditto, because I initially planned to focus on physics simulation, as I personally value interactive experiences in games, which technically translates to simulation and animation in graphics. The icon image is mainly inspired by Duoling from Rock Kingdom, a pet released around the time I quit that browser game. I really liked its background setting as “the legendary collector of souls, shaping its own body by absorbing the souls of other pets.” When designing the engine’s functional architecture, I often drew inspiration from mature engines.
Preparation
First, before writing, you should have a basic understanding of engine architecture and your own capabilities. According to Games104, an engine should contain 5 layers:
Function Layer: Provides the engine's surface-level services: rendering, the game tick, gameplay values, etc.
Tool Layer: Provides interactive editing content for developers.
Core Layer: Provides underlying support. For example, memory management and vector calculations commonly used in 3D engines.
Resource Layer: Manages game assets (import, storage, lifetime).
Platform Layer: Provides cross-platform support.
The platform layer is obviously beyond the scope of a small engine like mine. Handling various images, videos, audio, models in the resource layer would be very time-consuming and labor-intensive; maybe using libraries could help, but let’s set that aside for now. The other three layers are essential for the engine to run. The current project code is also organized according to this architecture.
With the basic architecture in place, the first step is naturally rendering: whatever the engine does, it is only visible once rendered. Between the scene and the UI, I started with the UI, because the UI sits on top of the scene and is seen first, and having an editor interface makes the whole thing look more like an engine.
Preparation
Implementing text, buttons, and input boxes myself would be too time-consuming and labor-intensive, and the aesthetics would be questionable. So I went with Dear ImGui, which supports multiple rendering backends such as OpenGL, DirectX, and Vulkan. All widgets are specified in code each frame, and performance is excellent. There is also a dedicated interactive demo site showing what it can achieve. The basic usage is as follows, somewhat similar to writing Unity Editor code.
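The original snippet is not shown here, so below is a minimal sketch of typical Dear ImGui usage; the window title, labels, and the surrounding function are illustrative, not the engine's actual code (it assumes a backend has already been initialized and this runs once per frame):

```cpp
#include "imgui.h"

// Hypothetical example window; every widget is declared immediately,
// each frame, and returns its interaction state directly.
void DrawExampleWindow(bool& showDemo)
{
    ImGui::Begin("Example");                               // open a window
    ImGui::Text("FPS: %.1f", ImGui::GetIO().Framerate);    // plain text
    if (ImGui::Button("Toggle Demo"))                      // true on click
        showDemo = !showDemo;
    static char name[64] = "";
    ImGui::InputText("Name", name, sizeof(name));          // text input
    ImGui::End();
}
```

This immediate-mode style is what makes it feel close to Unity's `OnInspectorGUI` pattern: no retained widget objects, just calls per frame.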
Generally speaking, an engine has panels like Tool, Scene, Game, Hierarchy, File, Inspector, etc. I don't want to implement the resource layer, so the File panel is out; I don't want to implement multiple views, so Scene and Game are merged. That leaves just the Tool toolbar, the Hierarchy, the Scene view, and the Inspector.
Starting from the top toolbar, since various extensions are not considered, the toolbar only has functions for saving/loading scenes and showing/hiding interfaces.
```cpp
if (ImGui::BeginDragDropTarget())
{
    if (const ImGuiPayload* payload = ImGui::AcceptDragDropPayload("GAMEOBJECT"))
    {
        GameObject* droppedObj = *(GameObject**)payload->Data;
        if (droppedObj)
        {
            if (droppedObj->parent)
                droppedObj->RemoveFromParent();
            else
            {
                auto& rootList = engine->scene->gameObjects;
                auto it = std::find(rootList.begin(), rootList.end(), droppedObj);
                if (it != rootList.end()) rootList.erase(it);
            }
            engine->scene->gameObjects.push_back(droppedObj);
            droppedObj->parent = nullptr;
        }
    }
    ImGui::EndDragDropTarget();
}

for (GameObject* obj : engine->scene->gameObjects)
    DrawGameObjectNode(obj);

    ImGui::End();
}
```
Next is the Scene view. There’s no special logic, just recording FPS and PPF (physics time per frame), since the engine focuses on physics and involves multithreading, requiring fine-grained comparison.
```cpp
if (engine->state == Engine::State::Play) ImGui::BeginDisabled();
selectedObject->OnInspectorGUI();
if (engine->state == Engine::State::Play) ImGui::EndDisabled();

ImGui::End();
}
```
Afterword
After that, the prototypes of the various panels come together. You could use either IMGUI's main branch or its docking branch; the docking branch also supports dockable windows and similar features. But I'm lazy, so for now it stays like this.
Simulation
Foreword
As mentioned earlier, we need to handle the behavior during Play, so let's get right to it. Currently only rigid bodies are supported, because soft bodies would deform the mesh, which would break the previously written GPU Instancing (GPU Instancing: hey, what would I eat then?). So for now the main work of the physics system is collision handling. The broad phase uses AABBs; for these simple geometric shapes the narrow phase could ideally use SAT, but for extensibility I went with GJK-EPA.
Preparation
Here’s a brief introduction to the aforementioned collision detection algorithms. For convenience, we’ll use 2D convex polygon examples and only cover the algorithm ideas.
Separating Axis Theorem (SAT)
The algorithm idea is simple: find an axis that separates the two objects; if no such axis exists, they collide. Using the objects' edge normals as the candidate axes is fastest. In 2D this gives O(n+m) candidate axes, where n and m are the edge counts of the two polygons; in 3D, edge-edge cross products must also be tested, which pushes the candidate count up toward O(nm).
GJK-EPA
GJK involves simplices and the Minkowski difference. For collision detection between the blue and green regions, taking their Minkowski difference yields the orange region. The orange region contains the origin, so a collision occurs. The shape of the orange region is determined by the shapes of the two inputs (imagine the blue region as an empty frame and the green region as a constraint: pin a vertex of the blue region to an edge of the green region and slide the blue region along the green region's edges to sweep out the orange region's shape), and its position comes from the relative positions of the two objects (the difference between C and D is (1,1), so point J is at (1,1)). In 3D, the cost is near constant per query, reaching O(n+m) only in extreme cases.
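The one primitive GJK actually needs from the Minkowski difference is its support function: the identity support(A − B, d) = support(A, d) − support(B, −d) means the difference never has to be built explicitly. A hedged sketch with illustrative names, assuming convex point clouds:

```cpp
#include <vector>

struct Vec3 { float x, y, z; };

static float Dot(Vec3 a, Vec3 b) { return a.x*b.x + a.y*b.y + a.z*b.z; }

// Farthest point of a convex point cloud in a given direction.
static Vec3 Support(const std::vector<Vec3>& verts, Vec3 dir)
{
    Vec3 best = verts[0];
    float bestDot = Dot(best, dir);
    for (const Vec3& v : verts) {
        float d = Dot(v, dir);
        if (d > bestDot) { bestDot = d; best = v; }
    }
    return best;
}

// Support of the Minkowski difference A - B:
//   support(A - B, d) = support(A, d) - support(B, -d)
Vec3 MinkowskiSupport(const std::vector<Vec3>& a, const std::vector<Vec3>& b, Vec3 dir)
{
    Vec3 pa = Support(a, dir);
    Vec3 pb = Support(b, { -dir.x, -dir.y, -dir.z });
    return { pa.x - pb.x, pa.y - pb.y, pa.z - pb.z };
}
```

Each GJK iteration calls this once to grow the simplex toward the origin.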
EPA is similar to GJK but instead of finding the origin, it finds the edge closest to the origin to determine penetration depth. In 3D, the complexity is O(k(n+m+f)).
Physics Stepping
Main Flow
The main flow is roughly as follows. Since we’re dealing with rigid bodies, we use the impulse approach, with position correction as an auxiliary measure to prevent continuous penetration.
```cpp
void Physics::UpdatePhysics(float dt)
{
    t += dt;
    if (t < deltaTime) return;
    int steps = (int)glm::min(3.0f, t / deltaTime); // cap steps to avoid a spiral of death
    t = fmod(t, deltaTime);

    for (int step = 0; step < steps; ++step)
    {
        // 1. Clear previous frame's collision data
        collisionData.clear();
        colliderPairs.clear();
        // 2. Integrate forces (update velocity and angular velocity)
        IntegrateForce(deltaTime);
        // 3. Update BVH tree
        if (bvhTree) bvhTree->UpdateBVHTree();
        // 4. Broad phase collision detection
        HandleBroadCollisions();
        // 5. Narrow phase collision detection
        HandleNarrowCollisions();
        // 6. Multiple iterations of sequential impulse solving
        for (int iter = 0; iter < iterations; ++iter)
            SolveCollisions(iter);
        // 7. Apply position correction
        ApplyPositionCorrections();
    }

    // Physics currently runs at 60 FPS while rendering is uncapped,
    // so refresh the model matrices after each physics update.
    for (auto collider : colliders)
        collider->transform->UpdateTransform();
}
```
Updating the BVH Tree
Here’s a brief look at the BVH update process. For broad-phase AABB detection, I built an AABB BVH based on the surface area heuristic (SAH). Each physics frame the tree is updated, and it is periodically sampled for reconstruction to keep the structure near optimal.
With the BVH in place, each AABB can be queried against the tree directly. But to avoid emitting duplicate collision pairs, a flag restricts each query so an AABB is only paired with itself and the AABBs processed before it.
```cpp
                if (!alreadyExists)
                    if (other->rigidbody->type == RigidbodyComponent::Dynamic ||
                        other->rigidbody->type == RigidbodyComponent::Static)
                        colliderPairs.push_back({ collider, other });
            }
        }
    }
}
```
Narrow Phase Detection
For narrow phase, we use the previously mentioned GJK-EPA. However, compared to 2D, 3D is significantly more complex. For example, when constructing the simplex, we need to handle different dimensional cases separately:
2 points (line segment): Determine the origin’s position relative to the line segment, update direction or degenerate to a single point.
3 points (triangle): Determine the origin’s position relative to each edge of the triangle, update simplex and direction.
4 points (tetrahedron): Check if the origin is inside the tetrahedron (i.e., relative to each face’s normal direction). If the origin is inside, return true; otherwise, discard the farthest face and update direction.
```cpp
CollisionInfo GJK_CheckCollision(Collider* colliderA, Collider* colliderB)
{
    CollisionInfo result;
    glm::vec3 direction = glm::vec3(1, 0, 0); // Initial direction

    std::vector<SupportPoint> simplex;
    simplex.push_back(GetMinkowskiSupport(colliderA, colliderB, direction));
    direction = -simplex[0].point;

    for (int i = 0; i < 50; ++i) // Limit iterations to prevent deadlock
    {
        SupportPoint next = GetMinkowskiSupport(colliderA, colliderB, direction);
        if (glm::dot(next.point, direction) < 0)
            return result; // Didn't pass the origin, no collision
        // ... (simplex update: the line/triangle/tetrahedron cases follow)
```
After obtaining collision information (collision points and depth), we need to use impulses to push back. Here are the steps in more detail:
Sequential Solving and Iteration: Each frame, collisions are processed in descending order of penetration depth, and multiple iterations (e.g., 4) are used to gradually converge and reduce error accumulation.
Normal Impulse: Calculate the normal impulse based on relative velocity, restitution coefficient, and penetration depth, used to separate objects and compensate for penetration.
Tangential Friction: Calculate static and dynamic friction separately based on relative tangential velocity, limiting the tangential impulse within the friction cone to simulate surface resistance.
Rigid Body Motion Update: Impulses change both linear and angular velocities, achieving realistic rotational responses through world-space inertia tensor conversion.
The final step of pushing objects out along the penetration direction is not strictly necessary; it only corrects the residual penetration the impulse method leaves behind. One could use impulses alone, or a PBD approach instead, but since the impulse method is more physically grounded, the engine primarily uses impulses.
```cpp
void Physics::ApplyPositionCorrections()
{
    for (auto& data : collisionData)
    {
        if (data.info.depth > 1e-3)
        {
            Collider* a = data.colliderA;
            Collider* b = data.colliderB;
            const CollisionInfo& info = data.info;
            // ... (push a and b apart along info.normal, in proportion to inverse mass)
```
At this point, the engine also shows some behavior when Play is clicked, which is commendable. But as a game/physics engine, this is far from enough; at the very least, we need to try multithreading. Currently, with 2000 collision pairs in the release version, the physics frame time is 330 microseconds, and FPS drops to 400+, leaving room for subsequent comparison.
Parallelism
Foreword
We have already implemented the basic physics simulation process, but its performance when dealing with a large number of collision pairs is still not satisfactory. Therefore, the next step is to try parallelization. The first step is, of course, to write a thread pool to manage threads and tasks, and then use graph coloring to avoid locking when applying impulses and position corrections.
Thread Pool
Since we’re going with a thread pool, we might as well go all the way and implement work stealing to balance the load. Here’s the general structure of the thread pool.
```cpp
struct ThreadPool
{
    struct Worker
    {
        std::deque<std::function<void()>> tasks; // Task queue (deque so others can steal)
        std::mutex mutex;                        // Mutex protecting the queue
    };

    std::vector<std::unique_ptr<Worker>> workers; // Local data for all worker threads
    std::vector<std::thread> threads;             // Worker threads
    std::atomic<bool> stop;                       // Stop flag
    std::atomic<size_t> next_thread_index;        // Index for round-robin task assignment
    std::mutex cv_mutex;                          // Mutex for the condition variable
    std::condition_variable cv;                   // Global condition variable to notify new tasks
};
```
Chunked Synchronization
Evenly divide the range into chunks, then create tasks and submit them to the thread pool.
```cpp
    for (size_t t = 0; t < numThreads; ++t)
    {
        size_t blockStart = start + t * chunkSize;
        size_t blockEnd = std::min(blockStart + chunkSize, end);
        if (blockStart >= blockEnd) break;

        remaining++;
        Enqueue([this, blockStart, blockEnd, &func, &remaining, &local_cv, &local_mutex]()
        {
            for (size_t i = blockStart; i < blockEnd; ++i)
                func(i);
            std::unique_lock<std::mutex> lock(local_mutex);
            if (--remaining == 0)
                local_cv.notify_one();
        });
    }

    // Wait for all chunks to complete
    std::unique_lock<std::mutex> lock(local_mutex);
    local_cv.wait(lock, [&remaining] { return remaining == 0; });
}
```
Task Submission
When a task is added, the thread pool first performs round-robin selection and then pushes it to the back.
```cpp
void ThreadPool::Enqueue(std::function<void()> task)
{
    // Round-robin select a target thread's local queue
    size_t target = next_thread_index.fetch_add(1, std::memory_order_relaxed) % workers.size();
    {
        std::unique_lock lock(workers[target]->mutex);
        workers[target]->tasks.push_back(std::move(task));
    }
    cv.notify_one(); // Wake up one waiting thread
}
```
Worker Loop
In the main loop, it preferentially takes tasks from the head of its own local queue. If its local queue is empty, it attempts to steal tasks from the tail of other threads’ queues, then executes the task or waits.
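This loop can be sketched end to end. The self-contained mini pool below mirrors the ThreadPool/Worker layout shown earlier, but it is only a sketch under assumptions; the engine's actual code may handle waiting and shutdown differently (here a short timed wait sidesteps the lost-wakeup problem instead of a predicate over all queues):

```cpp
#include <atomic>
#include <chrono>
#include <condition_variable>
#include <deque>
#include <functional>
#include <memory>
#include <mutex>
#include <thread>
#include <vector>

struct MiniPool {
    struct Worker {
        std::deque<std::function<void()>> tasks;
        std::mutex mutex;
    };
    std::vector<std::unique_ptr<Worker>> workers;
    std::vector<std::thread> threads;
    std::atomic<bool> stop{false};
    std::atomic<size_t> next{0};
    std::mutex cv_mutex;
    std::condition_variable cv;

    explicit MiniPool(size_t n) {
        for (size_t i = 0; i < n; ++i) workers.push_back(std::make_unique<Worker>());
        for (size_t i = 0; i < n; ++i) threads.emplace_back([this, i] { WorkerLoop(i); });
    }

    void Enqueue(std::function<void()> task) {
        size_t t = next.fetch_add(1, std::memory_order_relaxed) % workers.size();
        {
            std::lock_guard<std::mutex> lock(workers[t]->mutex);
            workers[t]->tasks.push_back(std::move(task));
        }
        cv.notify_one();
    }

    void WorkerLoop(size_t index) {
        while (!stop.load()) {
            std::function<void()> task;
            { // 1. Pop from the FRONT of our own local queue.
                std::lock_guard<std::mutex> lock(workers[index]->mutex);
                if (!workers[index]->tasks.empty()) {
                    task = std::move(workers[index]->tasks.front());
                    workers[index]->tasks.pop_front();
                }
            }
            if (!task) { // 2. Steal from the BACK of another thread's queue.
                for (size_t i = 0; i < workers.size() && !task; ++i) {
                    if (i == index) continue;
                    std::lock_guard<std::mutex> lock(workers[i]->mutex);
                    if (!workers[i]->tasks.empty()) {
                        task = std::move(workers[i]->tasks.back());
                        workers[i]->tasks.pop_back();
                    }
                }
            }
            if (task) { task(); continue; }
            // 3. Nothing anywhere: sleep briefly until new work or shutdown.
            std::unique_lock<std::mutex> lock(cv_mutex);
            cv.wait_for(lock, std::chrono::milliseconds(1));
        }
    }

    ~MiniPool() {
        stop = true;
        cv.notify_all();
        for (auto& t : threads) t.join();
    }
};
```

Popping one's own queue from the front while thieves take from the back keeps the two ends mostly contention-free and preserves rough FIFO order for the owner.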
Afterwards, every step can be rewritten on top of the thread pool's chunked synchronization. Note, however, that the broad and narrow phases write their results into shared arrays; to run them in parallel, each chunk writes into a thread-local buffer and the buffers are merged afterwards.
```cpp
    threadPool.parallel_for(0, numBlocks, [&](size_t blockIdx)
    {
        size_t start = blockIdx * chunkSize;
        size_t end = std::min(start + chunkSize, numPairs);
        auto& local = localData[blockIdx];
        for (size_t i = start; i < end; ++i)
        {
            auto& pair = colliderPairs[i];
            CollisionInfo info = GJK_CheckCollision(pair.first, pair.second);
            if (info.flag && info.depth > 1e-3f)
                local.emplace_back(pair.first, pair.second, info);
        }
    });

    for (auto& local : localData)
        collisionData.insert(collisionData.end(), local.begin(), local.end());
}
```
Graph Coloring Grouping
And for lock-free impulse processing and position correction, we use graph coloring grouping. Roughly treat collision pairs as edges in a graph, and rigid bodies as vertices. Each edge needs to be assigned a color (group number) such that adjacent edges do not have the same color. The engine uses an approximate greedy algorithm to quickly generate parallelizable groups. This ensures that collision pairs within each group do not share any rigid bodies, so they can be safely solved in parallel within the group.
With this approach, the release build with 2000 collision pairs spends 90 microseconds per physics frame at 1500+ FPS, versus the serial version's 330 microseconds and 400+ FPS, which is a significant improvement. Of course, I have to complain that multithreading is really hard to get right. Initially I didn't want to write it myself, so I tried OpenMP, Taskflow, and PPL in turn, but their performance was comparable to serial. In the end I had to implement the threading myself. After the current version I also tried C++20 barriers (std::barrier), hoping for a further improvement, but performance was again comparable to serial. Tired, so let's leave it at that for now.