Foreword

  Unity DOTS is the Data-Oriented Technology Stack that Unity built on the ECS architecture. It includes the Burst compiler and the Job System, and aims to exploit SIMD and multithreading so that the advantages of ECS are fully realized. I am not yet able to explore ECS in depth, but Burst and Jobs are only loosely coupled to it, so they can be pulled out and played with on their own.

Preparation

Package preparation

  Enable the display of preview packages in Unity's Package Manager, then install the Jobs and Burst packages.


General usage

using Unity.Jobs;         // defines IJob and IJobParallelFor
using Unity.Burst;        // defines [BurstCompile]
using Unity.Collections;  // defines the NativeArray container
using UnityEngine.Jobs;   // defines IJobParallelForTransform
using Unity.Mathematics;  // SIMD math library

// "xx" is a placeholder name in each of the three skeletons below.

[BurstCompile] // Burst acceleration; requires unmanaged data only
struct xx : IJob
{
    public void Execute() {}
}

[BurstCompile] // In practice: no managed Unity types such as GameObject or Transform; use Native containers
struct xx : IJobParallelFor
{
    public void Execute(int i) {}
}

[BurstCompile] // Do the math with float3 and float4x4 from the math library
struct xx : IJobParallelForTransform
{
    public void Execute(int i, TransformAccess t) {}
}

Small test

  Since everything is now parallel, let's start by filling an array and comparing the result with the traditional main-thread loop.

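// Main thread: fill a managed array in a plain loop.
a = new int3[dataCount];
time = Time.realtimeSinceStartup;
for (int i = 0; i < dataCount; ++i)
    a[i] = new int3(i, i, i);
Debug.Log("Sequential direct assignment of " + dataCount + " items took " + (Time.realtimeSinceStartup - time) + " seconds");

// Job System: fill a NativeArray in parallel, in batches of 64.
b = new NativeArray<int3>(dataCount, Allocator.TempJob);
JobHandle orderHandle = new CountInOrder() { data = b }.Schedule(dataCount, 64);
// Note: the timer starts after Schedule(), so it only measures the wait inside Complete().
time = Time.realtimeSinceStartup;
orderHandle.Complete();
Debug.Log("Parallel direct assignment of " + dataCount + " items took " + (Time.realtimeSinceStartup - time) + " seconds");
b.Dispose(); // TempJob allocations have to be disposed manually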

  With dataCount set to 1e7, the time dropped from about 0.03 seconds to 0.01 seconds or even less, which is an outstanding result (measured on an entry-level 2020 gaming laptop with 12 threads).
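
The CountInOrder job scheduled in the test is not listed in this post. A minimal sketch of what it might look like, assuming it simply mirrors the main-thread loop and writes int3(i, i, i) into slot i (the body is a reconstruction, not the original code):

```csharp
using Unity.Burst;
using Unity.Collections;
using Unity.Jobs;
using Unity.Mathematics;

// Hypothetical reconstruction: the post only shows how this job is scheduled.
[BurstCompile]
struct CountInOrder : IJobParallelFor
{
    // Each iteration writes only its own slot, so the default
    // parallel-for safety restriction is satisfied.
    public NativeArray<int3> data;

    public void Execute(int i)
    {
        data[i] = new int3(i, i, i);
    }
}
```

Schedule(dataCount, 64) splits the index range into batches of 64 iterations that worker threads pick up. For a job body this cheap, a larger batch size may reduce scheduling overhead.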


Advanced test

  Multithreading looks very powerful, so let's try a bigger test. As I mentioned before, DynamicBone [^1] is well suited to multithreading, so that is the target. Before version 1.2, many predecessors struggled with its performance and multithreaded it themselves; the official multithreaded version, 1.3, only arrived much later. My own ToyDynamicBone probably cannot compete with the official one, so the first goal is simply to produce consistent results. For performance, I follow the practice of those predecessors [^2] and batch the processing and scheduling of all dynamic bones: a manager collects the dynamic bone data in the scene and simulates it in one place.

Prepare()

[BurstCompile]
struct Prepare : IJobParallelForTransform
{
    public NativeArray<Particle> ps;

    public void Execute(int i, TransformAccess t)
    {
        Particle p = ps[i];
        // reset the transform to its initial local pose
        t.localPosition = p.m_InitLocalPosition;
        t.localRotation = p.m_InitLocalRotation;
        // cache the transform data the simulation jobs need
        p.m_TransformPosition = t.position;
        p.m_TransformLocalPosition = t.localPosition;
        p.m_TransformLocalToWorldMatrix = t.localToWorldMatrix;
        ps[i] = p;
    }
}
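
The Particle struct itself is not listed in this post. As a rough sketch, the layout below is inferred only from the fields the jobs read and write. The field types are assumptions, chosen so that the assignments in the jobs compile through the implicit conversions Unity.Mathematics provides to and from Vector3, Quaternion and Matrix4x4:

```csharp
using Unity.Mathematics;

// Assumed layout, inferred from how the jobs in this post use the fields.
// Only unmanaged types are used so that Burst can compile the jobs.
struct Particle
{
    public int m_ParentIndex;                       // -1 marks a root particle
    public float m_Damping;
    public float m_Elasticity;
    public float m_Stiffness;

    public float3 m_Position;                       // simulated position
    public float3 m_PrevPosition;                   // previous position, used by Verlet integration

    public float3 m_InitLocalPosition;              // initial local pose of the bone
    public quaternion m_InitLocalRotation;

    public float3 m_TransformPosition;              // cached by the Prepare job
    public float3 m_TransformLocalPosition;
    public float4x4 m_TransformLocalToWorldMatrix;
}
```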

UpdateParticles()

[BurstCompile]
struct UpdateParticles : IJobParallelFor
{
    // Each iteration also reads its parent's entry, which lies outside the index it owns.
    // With safety checks enabled this likely needs [NativeDisableParallelForRestriction] on the array.
    public NativeArray<Particle> ps;

    public void Execute(int i)
    {
        Particle p = ps[i];
        if (p.m_ParentIndex == -1)
        {
            // root particle: follow the transform directly
            p.m_PrevPosition = p.m_Position;
            p.m_Position = p.m_TransformPosition;
            ps[i] = p; // p is a copy, so write it back before returning
            return;
        }
        Particle p0 = ps[p.m_ParentIndex];

        // verlet integration
        float3 v = p.m_Position - p.m_PrevPosition;
        p.m_PrevPosition = p.m_Position;
        p.m_Position += v * (1 - p.m_Damping);

        float restLen = math.length(p0.m_TransformPosition - p.m_TransformPosition);

        // keep shape
        float4x4 m0 = p.m_TransformLocalToWorldMatrix;
        m0.c3.xyz = p0.m_Position;
        float3 restPos = math.mul(m0, new float4(p.m_TransformLocalPosition, 1)).xyz;
        float3 d = restPos - p.m_Position;
        p.m_Position += d * p.m_Elasticity;
        d = restPos - p.m_Position; // recompute after the elasticity step, as SkipUpdateParticles does
        float len = math.length(d);
        float maxlen = restLen * (1 - p.m_Stiffness) * 2;
        if (len > maxlen)
            p.m_Position += d * ((len - maxlen) / len);

        // keep length
        float3 dd = p0.m_Position - p.m_Position;
        float leng = math.length(dd);
        if (leng > 0)
            p.m_Position += dd * ((leng - restLen) / leng);

        ps[i] = p;
    }
}

SkipUpdateParticles()

[BurstCompile]
struct SkipUpdateParticles : IJobParallelFor
{
    public NativeArray<Particle> ps;

    public void Execute(int i)
    {
        Particle p = ps[i];
        if (p.m_ParentIndex >= 0)
        {
            Particle p0 = ps[p.m_ParentIndex];
            float restLen = math.length(p0.m_TransformPosition - p.m_TransformPosition);

            // keep shape
            float4x4 m0 = p.m_TransformLocalToWorldMatrix;
            m0.c3.xyz = p0.m_Position;
            float3 restPos = math.mul(m0, new float4(p.m_TransformLocalPosition, 1)).xyz;
            float3 d = restPos - p.m_Position;
            p.m_Position += d * p.m_Elasticity;
            d = restPos - p.m_Position;
            float len = math.length(d);
            float maxlen = restLen * (1 - p.m_Stiffness) * 2;
            if (len > maxlen)
                p.m_Position += d * ((len - maxlen) / len);

            // keep length
            float3 dd = p0.m_Position - p.m_Position;
            float leng = math.length(dd);
            if (leng > 0)
                p.m_Position += dd * ((leng - restLen) / leng);
        }
        else
        {
            p.m_PrevPosition = p.m_Position;
            p.m_Position = p.m_TransformPosition;
        }
        ps[i] = p;
    }
}
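
The manager that collects the bones and schedules these jobs is not shown in the post. The following is only a sketch of how the jobs above could be chained, assuming a Particle struct with the fields they use. The class name ToyDynamicBoneManager, the Build method and the batch size of 64 are all hypothetical. The step that writes the simulated positions back to the transforms is not part of this post and is left out here.

```csharp
using Unity.Collections;
using Unity.Jobs;
using UnityEngine;
using UnityEngine.Jobs;

// Hypothetical manager; names and structure are illustrative only.
class ToyDynamicBoneManager : MonoBehaviour
{
    NativeArray<Particle> particles;   // one entry per bone transform
    TransformAccessArray transforms;   // same order as particles
    JobHandle handle;

    // Assumed to be called once, after all bones in the scene have been collected.
    public void Build(Transform[] bones, Particle[] initial)
    {
        particles = new NativeArray<Particle>(initial, Allocator.Persistent);
        transforms = new TransformAccessArray(bones);
    }

    void Update()
    {
        if (!particles.IsCreated) return;
        // Prepare caches transform data; UpdateParticles depends on it and runs afterwards.
        JobHandle prepare = new Prepare { ps = particles }.Schedule(transforms);
        handle = new UpdateParticles { ps = particles }.Schedule(particles.Length, 64, prepare);
    }

    void LateUpdate()
    {
        // Wait for the simulation before anything on the main thread reads the particles.
        handle.Complete();
    }

    void OnDestroy()
    {
        handle.Complete();
        if (particles.IsCreated) particles.Dispose();
        if (transforms.isCreated) transforms.Dispose();
    }
}
```

Passing the Prepare handle as the dependency of UpdateParticles lets the Job System order the two jobs without blocking the main thread between them.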

Performance comparison

  During my internship I wrote this on and off for a few days, then spent even more days finishing the multithreaded version with various simplifications. I set up a simple scene and moved things around to see the effect. The result matches, as expected, since it was written based on version 1.3.

(Figure: version 1.2 on the left, version 1.3 in the middle, and ToyDynamicBone on the right)

  I created 121 such tails with a script and profiled them with the Unity Profiler. The results were outstanding.

| Version | CPU time (ms per frame) | GPU time (ms per frame) |
| --- | --- | --- |
| DynamicBone 1.2 | 9.4 | 4.4 |
| DynamicBone 1.3 | 8.4 | 5.0 |
| ToyDynamicBone | 7.0 | 4.5 |

Conclusion

  Overall, Unity's Job System and Burst are indeed powerful tools for performance optimization, and getting started is not difficult. You just need to restructure your data with parallelism in mind and pay attention to the small details of the calculation (surreal.jpg).

Reference

[^1]: DynamicBone principle
[^2]: Implementation of high-performance DynamicBone