Foreword

  The previous article, Godot Physics Source Code Analysis, examined Godot’s built-in 3D physics pipeline. The natural follow-up is the Jolt physics engine integrated in version 4.4. I initially considered a separate article introducing it, but on review, Godot’s integration of Jolt is relatively thin: functionally, Jolt has been aligned with and limited to Godot’s existing physics capabilities; performance-wise, an adapter layer routes its work back to Godot’s generic thread pool for parallelization. Rather than introducing Jolt through Godot, it seems more insightful to discuss the two independently and compare their approaches.

  What differentiates physics engines the most? In my opinion, it’s not the solving algorithms (impulse-based, PBD) but the parallelization strategies: how constraint islands are constructed and partitioned, and how the thread pool is designed. These choices directly determine efficiency and can yield drastically different performance across platforms. The following sections compare these two aspects.

Constraint Construction

  Let’s start with constraint construction, the simpler of the two, since Godot’s logic here is quite straightforward. As mentioned previously, it is split into three separate code paths for Areas, RigidBodies, and SoftBodies. Simply put, the Area path just adds every associated object (no solving is performed for Areas); the RigidBody path detects associated RigidBodies and SoftBodies; and the SoftBody path detects only associated RigidBodies.
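
  To make that concrete, here is a minimal standalone sketch of the general island-building idea: any two bodies linked by a constraint end up in the same island. It is illustrative only (built on union-find for brevity) and is not Godot’s actual code.

// Illustrative island construction (not Godot's actual code): bodies joined
// by a constraint are merged into the same island via union-find.
#include <cstdio>
#include <numeric>
#include <vector>

struct IslandBuilder {
    std::vector<int> parent;

    explicit IslandBuilder(int body_count) : parent(body_count) {
        std::iota(parent.begin(), parent.end(), 0); // every body starts alone
    }

    int find(int b) { return parent[b] == b ? b : parent[b] = find(parent[b]); }

    // Called once per constraint linking two dynamic bodies.
    void link(int body_a, int body_b) { parent[find(body_a)] = find(body_b); }
};

int main() {
    IslandBuilder builder(5);
    builder.link(0, 1); // constraint between bodies 0 and 1
    builder.link(1, 2); // bodies 0, 1, 2 now share an island
    builder.link(3, 4); // bodies 3 and 4 form a second island

    for (int b = 0; b < 5; ++b)
        std::printf("body %d -> island %d\n", b, builder.find(b));
    return 0;
}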

  In contrast, Jolt’s handling is more complex: it implements graph partitioning with an 8-bit split mask, 7 bits for parallel splits and 1 bit for the non-parallel split, which maps onto typical 8-core CPUs. To improve parallel efficiency, when dynamic splitting is enabled, Jolt splits any constraint island whose constraint count exceeds a split threshold (default: 128). It then checks the resulting splits: any split that fails to reach a merge threshold (default: 16) is merged into the non-parallel split; otherwise, a new parallel split is created. The detailed flow is as follows, with a sketch of the mask-coloring step after the outline:

bool LargeIslandSplitter::SplitIsland()
{
    // 1. Obtain island data → check the size threshold → if fewer than 128 constraints, return false
    // 2. Reset object masks → count constraints per split region
    // 3. Split contact points (mask-coloring algorithm)
    //    - Get the two objects associated with the contact point
    //    - Find the first common, unused bit in the masks of these two objects
    //    - Set that bit to 1 and record the split index
    // 4. Allocate split operations for constraints (invokes constraint-specific logic)
    // 5. Create the split structure and merge small splits
    //    - Iterate through the 8 split bits
    //    - If a split's constraint plus contact count is less than 16, merge it into the non-parallel split (bit 7)
    //    - Otherwise, create a new parallel split
    // 6. Reorder data by split
    //    - Group and store contact points and constraint indices by split
    //    - Improves cache hit rate
}
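
  The heart of step 3 is a small bit trick: find the first split that is still free in both bodies’ masks. Below is a minimal sketch of that idea; the identifiers are mine, not Jolt’s, and it assumes C++20 for std::countr_zero.

// Sketch of the mask-coloring step (identifiers are illustrative, not Jolt's):
// each body carries a bitmask of the splits it already belongs to; a contact
// must land in a split that is free in *both* bodies, so we take the lowest
// zero bit of the combined mask.
#include <bit>
#include <cstdint>
#include <cstdio>

int assign_split(uint32_t &mask_a, uint32_t &mask_b, int non_parallel_split) {
    uint32_t used = mask_a | mask_b;    // splits already taken by either body
    uint32_t free_bits = ~used & 0x7Fu; // the 7 parallel splits live in bits 0..6
    if (free_bits == 0)
        return non_parallel_split;      // no common free bit: solve serially
    int split = std::countr_zero(free_bits); // lowest free split index
    mask_a |= 1u << split;              // mark the split as used by both bodies
    mask_b |= 1u << split;
    return split;
}

int main() {
    uint32_t a = 0b001, b = 0b011; // body A is in split 0; body B in splits 0 and 1
    std::printf("contact goes to split %d\n", assign_split(a, b, 7)); // prints 2
    return 0;
}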

Constraint Solving

  Next is the solving stage. As mentioned earlier, this discussion does not cover specific solving algorithms but focuses on thread pool scheduling. Godot uses a generic thread pool with high/low-priority queue scheduling, waiting on semaphores plus condition variables, whereas Jolt uses a physics-specific thread pool scheduled through a task dependency graph with barrier waits.

Godot Generic Thread Pool

  First, the general workflow of Godot’s thread pool: initialization, task submission, thread execution, cooperative waiting, and cleanup. The specific flow is as follows, with a usage sketch after the list:

1. Initialization Phase:
- Create a specified number of worker threads
- Set the proportion of low-priority threads
- Establish thread index mapping
2. Task Submission:
- Create a task object (allocate memory)
- Place it into the corresponding queue based on priority
- Notify idle threads
3. Thread Execution:
- Fetch a task from the queue
- Execute the task (script/native function)
- Handle task group synchronization
- Send completion notification
4. Cooperative Waiting:
- Worker threads can cooperatively process other tasks while waiting
- Includes deadlock prevention mechanisms
- Supports pause/resume mechanisms
5. Cleanup Phase:
- Gradual shutdown
- Wait for all tasks to complete
- Destroy the thread pool
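
  For reference, this is roughly what driving that flow looks like from engine code. The WorkerThreadPool calls are Godot 4’s public C++ API; the task body and names are invented for illustration.

// Sketch of submitting physics-style work to Godot's WorkerThreadPool.
// The API calls are real; solve_island() and its data are hypothetical.
#include "core/object/worker_thread_pool.h"

static void solve_island(void *p_userdata, uint32_t p_index) {
    // Hypothetical per-island solve; p_index selects the island.
}

void step_islands(void *p_islands, int p_island_count) {
    WorkerThreadPool *pool = WorkerThreadPool::get_singleton();

    // Task submission: one group task fans out over p_island_count elements.
    WorkerThreadPool::GroupID group = pool->add_native_group_task(
            &solve_island, p_islands, p_island_count,
            -1 /* one task per thread */, true /* high priority */, "SolveIslands");

    // Cooperative waiting: a pool thread calling this keeps executing queued
    // tasks instead of blocking.
    pool->wait_for_group_task_completion(group);
}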

Basic Structure

  The basic structure of Godot’s generic thread pool is roughly as follows, using a dual-queue system to balance responsiveness and throughput, and paged allocation to reduce memory fragmentation.

// Class structure and main members
class WorkerThreadPool : public Object {
    GDCLASS(WorkerThreadPool, Object)

    // TaskID and GroupID serve as the task identifiers
    typedef int64_t TaskID;
    typedef int64_t GroupID;

private:
    struct Task; // Task structure
    struct Group; // Task group structure
    struct ThreadData; // Per-thread data

    // Core data structures
    PagedAllocator<Task, false, TASKS_PAGE_SIZE> task_allocator;
    PagedAllocator<Group, false, GROUPS_PAGE_SIZE> group_allocator;

    SelfList<Task>::List low_priority_task_queue; // Low-priority queue
    SelfList<Task>::List task_queue; // High-priority queue
    BinaryMutex task_mutex; // Mutex protecting the task queues

    TightLocalVector<ThreadData> threads; // Thread data array
    HashMap<Thread::ID, int> thread_ids; // Thread ID to index map
    HashMap<TaskID, Task *> tasks; // Map of all tasks
    HashMap<GroupID, Group *> groups; // Map of all task groups
};

// Memory management for tasks and groups:
// the paged allocator reduces memory fragmentation.
static const uint32_t TASKS_PAGE_SIZE = 1024;
static const uint32_t GROUPS_PAGE_SIZE = 256;

// Obtain a Task from the allocator instead of constructing one directly.
Task *task = task_allocator.alloc();
// Return it after use.
task_allocator.free(task);

Cooperative Waiting

  The most interesting aspect is likely the cooperative waiting mechanism: worker threads don’t idle while waiting but keep processing other tasks. This is particularly useful for handling task dependencies, as it prevents deadlocks and improves CPU utilization. A usage sketch follows the snippet.

void WorkerThreadPool::_wait_collaboratively(ThreadData *p_caller_pool_thread, Task *p_task) {
    while (true) {
        MutexLock lock(task_mutex);

        // Check whether the awaited task has finished
        if (p_task->completed) break;

        // Try to do other tasks in the meantime
        if (task_queue.first()) {
            Task *next_task = task_queue.first()->self();
            task_queue.remove(task_queue.first());
            lock.unlock();
            _process_task(next_task);
        } else {
            // Nothing to steal: wait on the condition variable
            p_caller_pool_thread->awaited_task = p_task;
            p_caller_pool_thread->cond_var.wait(lock);
            p_caller_pool_thread->awaited_task = nullptr;
        }
    }
}
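
  The practical payoff is that a task running on a pool thread can safely block on another task. A short sketch using the same public API as above (the task bodies are invented):

// A pool task waiting on a sub-task would deadlock a naive fixed-size pool;
// here the waiting thread keeps draining the queue instead. API calls are
// Godot 4's; the task bodies are invented for illustration.
#include "core/object/worker_thread_pool.h"

static void subtask(void *) { /* some parallel work */ }

static void parent_task(void *) {
    WorkerThreadPool *pool = WorkerThreadPool::get_singleton();
    WorkerThreadPool::TaskID sub = pool->add_native_task(&subtask, nullptr, false, "Subtask");
    // Even on a pool thread this cannot starve the pool:
    // _wait_collaboratively() lets this thread process other queued tasks
    // (possibly subtask itself) until `sub` completes.
    pool->wait_for_task_completion(sub);
}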

Jolt Physics Thread Pool

  Next is Jolt’s physics thread pool. Its general workflow is: initialization, task submission, thread execution, barrier synchronization, and cleanup. The specific flow is as follows, with a caller-side sketch after the list:

1. Initialization Phase:
- Initialize task free list and barrier free list
- Create a specified number of worker threads
- Set thread initialization/exit callback functions
2. Task Creation:
- Allocate a task object from the free list
- Set task name, color, execution function, and dependency count
- If dependency count is 0, enqueue the task immediately
3. Task Enqueuing:
- Add the task to the circular queue
- Use atomic operations to manage queue head and tail
- Release semaphore to wake worker threads
4. Worker Thread Loop:
- Wait for semaphore
- Fetch tasks from the queue and execute them
- After task execution completes, decrement the dependency count of its successor tasks
- If a successor task's dependency count becomes 0, enqueue it for processing
5. Barrier Synchronization:
- Use barriers to wait for a set of tasks to complete
- The main thread can also participate in task execution (wait at the barrier)
6. Cleanup Phase:
- Stop running threads
- Wait for all threads to finish
- Release allocated resources
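
  From the caller’s side, the flow above looks roughly like this. The classes and calls are Jolt’s public API; the job bodies and the job/barrier counts are invented for illustration.

// Two dependent jobs synchronized through a barrier. Classes and calls are
// Jolt's public API; job bodies and the job/barrier counts are illustrative.
#include <Jolt/Jolt.h>
#include <Jolt/Core/JobSystemThreadPool.h>

using namespace JPH;

void run_two_dependent_jobs() {
    RegisterDefaultAllocator();

    // Initialization: worker thread count defaults to hardware threads - 1.
    JobSystemThreadPool job_system(1024 /* max jobs */, 8 /* max barriers */);

    // Create the successor first, with one unsatisfied dependency so it is
    // not enqueued yet.
    JobSystem::JobHandle second = job_system.CreateJob("Second", Color::sGreen,
            [] { /* runs after First */ }, 1);

    // First has zero dependencies, so it is enqueued immediately; finishing
    // its work satisfies Second's dependency, which enqueues Second.
    JobSystem::JobHandle first = job_system.CreateJob("First", Color::sRed,
            [second]() mutable { /* work */ second.RemoveDependency(); }, 0);

    // Barrier synchronization: the calling thread also executes jobs while
    // it waits.
    JobSystem::Barrier *barrier = job_system.CreateBarrier();
    barrier->AddJob(first);
    barrier->AddJob(second);
    job_system.WaitForJobs(barrier);
    job_system.DestroyBarrier(barrier);
}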

Basic Structure

  The basic structure of Jolt’s physics thread pool is roughly as follows. It uses a ring-based lock-free queue and fixed-size memory pools.

class JobSystemThreadPool final : public JobSystemWithBarrier {
private:
    // Job memory pool (fixed size)
    FixedSizeFreeList<Job> mJobs;

    // Array of worker threads
    Array<thread> mThreads;

    // Circular queue (lock-free design)
    static constexpr uint32 cQueueLength = 1024;
    atomic<Job *> mQueue[cQueueLength];

    // Per-thread local head pointers + a global tail pointer
    atomic<uint> *mHeads = nullptr; // One head per thread
    atomic<uint> mTail = 0; // Global tail

    // Semaphore used to wake worker threads
    Semaphore mSemaphore;

    // Exit flag
    atomic<bool> mQuit = false;
};

Barrier Synchronization

  In my view, Jolt’s barrier is a physics-specialized version of Godot’s cooperative waiting, suitable for synchronizing physics phases.

  The barrier structure is as follows, with the read and write indices aligned to separate cache lines to avoid false sharing between cores.

class BarrierImpl : public Barrier {
private:
    static constexpr uint cMaxJobs = 2048; // Maximum number of jobs (a power of 2)
    atomic<Job *> mJobs[cMaxJobs]; // Circular buffer storing the jobs

    // Read/write indices (each on its own cache line to avoid false sharing)
    alignas(JPH_CACHE_LINE_SIZE) atomic<uint> mJobReadIndex { 0 };
    alignas(JPH_CACHE_LINE_SIZE) atomic<uint> mJobWriteIndex { 0 };

    // Synchronization control
    atomic<int> mNumToAcquire { 0 }; // Number of semaphore acquisitions still required
    Semaphore mSemaphore; // Job completion semaphore

    atomic<bool> mInUse { false };
};

  The logic for adding multiple tasks is as follows. The barrier supports batch task addition to reduce synchronization overhead:

void BarrierImpl::AddJobs(const JobHandle *inHandles, uint inNumHandles) {
    bool release_semaphore = false;

    for (const JobHandle *handle = inHandles;
            handle < inHandles + inNumHandles; ++handle) {
        Job *job = handle->GetPtr();

        if (job->SetBarrier(this)) {
            mNumToAcquire++;

            // If any job is already executable, count one extra acquisition and
            // release the semaphore once so a waiting thread can start on it
            if (!release_semaphore && job->CanBeExecuted()) {
                release_semaphore = true;
                mNumToAcquire++;
            }

            // Add the job to the circular buffer, stalling if it is full
            job->AddRef();
            uint write_index = mJobWriteIndex++;
            while (write_index - mJobReadIndex >= cMaxJobs) {
                std::this_thread::sleep_for(std::chrono::microseconds(100));
            }
            mJobs[write_index & (cMaxJobs - 1)] = job;
        }
    }

    // Notify once after the whole batch has been added
    if (release_semaphore) mSemaphore.Release();
}

  The core waiting mechanism is as follows. It also employs cooperative waiting, balancing efficiency and correctness.

void BarrierImpl::Wait() {
    // Loop until all jobs are completed
    while (mNumToAcquire > 0) {
        bool has_executed;

        do {
            has_executed = false;

            // A. Release finished jobs from the front of the buffer
            while (mJobReadIndex < mJobWriteIndex) {
                atomic<Job *> &job = mJobs[mJobReadIndex & (cMaxJobs - 1)];
                Job *job_ptr = job.load();

                if (job_ptr == nullptr || !job_ptr->IsDone()) break;

                // Job is finished, release it
                job_ptr->Release();
                job = nullptr;
                ++mJobReadIndex;
            }

            // B. Search for and execute the first runnable job
            for (uint index = mJobReadIndex; index < mJobWriteIndex; ++index) {
                const atomic<Job *> &job = mJobs[index & (cMaxJobs - 1)];
                Job *job_ptr = job.load();

                if (job_ptr != nullptr && job_ptr->CanBeExecuted()) {
                    job_ptr->Execute();
                    has_executed = true;
                    break;
                }
            }

        } while (has_executed); // Keep going as long as there are runnable jobs

        // C. Wait on the semaphore
        int num_to_acquire = max(1, mSemaphore.GetValue());

        // Acquire in batches (reduces context switching)
        mSemaphore.Acquire(num_to_acquire);

        // Decrease the outstanding count
        mNumToAcquire -= num_to_acquire;
    }

    // D. Final cleanup (ensure all jobs have been released)
    while (mJobReadIndex < mJobWriteIndex) {
        atomic<Job *> &job = mJobs[mJobReadIndex & (cMaxJobs - 1)];
        Job *job_ptr = job.load();
        job_ptr->Release();
        job = nullptr;
        ++mJobReadIndex;
    }
}

Summary

  Comparing the physics parallelization implementations of Godot and upstream Jolt shows clear differences in design goals. Godot adopts a highly generic thread pool and task system, favoring overall engine consistency and maintainability. Jolt, in contrast, builds a dedicated job system, dependency graph, and barrier mechanism around physics computation itself to maximize parallel efficiency in the physics phase. In Godot’s current integration, Jolt’s tasks are ultimately adapted back onto Godot’s generic thread pool, which works functionally but means Jolt’s native parallelization advantages are not fully leveraged.

  In comparison, Unreal Engine (Chaos) decomposes physics tasks and integrates them into the engine-level scheduling system via a Task Graph, giving it stronger adaptability across different hardware platforms. This “task-centric rather than thread-centric” design makes it easier to fully utilize multi-core CPUs in complex scenarios. Truly, UE is unmatched. I give up.