100 avatars in a browser tab: Optimizations in rendering for massive events and beyond
The Metaverse Festival was envisioned to be a milestone in the platform’s history, and user experience had to be up to par: improving the avatars’ rendering performance was paramount.
When planning for the Metaverse Festival started, only 20 avatars could be spawned simultaneously around the player and, if they were all rendered on screen at the same time, performance degraded significantly. That hard cap was due to web browser rendering limitations and to the old communications protocol, which was later improved with the new Archipelago solution. A cap of 20 avatars would not provide a realistic social festival experience, so Decentraland contributors concluded that they had to increase the number of users that could be rendered on screen for the festival to be successful.
Explorer contributors began by putting a theory to the test: that rendering (along with CPU skinning) was the main culprit of performance issues caused by having multiple avatars on screen. The theory was confirmed after profiling performance with 100-200 avatar bots in a controlled environment.
In light of those findings, the goal was to increase the maximum number of avatars displayed from 20 to 100. Three efforts were made towards that goal.
- A new impostor system was introduced. This system avoids rendering and animating distant avatars by replacing them with a single look-alike billboard when needed.
- A custom GPU skinning implementation was put into place, effectively reducing the CPU-bound skinning bottlenecks by a huge margin.
- The avatar rendering pipeline was re-implemented from scratch, reducing the draw calls per avatar from 10 or more to a single draw call in the best-case scenario. Complex combinations of wearables could push the draw call count a bit higher due to render state switches, but it wouldn’t go over three or four calls in the worst cases.
According to benchmarking tests, these combined efforts improved the avatar rendering performance by around 180%. This is the first of many efforts to prepare the platform for mass events like the Metaverse Festival, which more than 20,000 people attended.
After implementing a tool for spawning avatar bots and another one for profiling performance in the web browser, the contributors were ready to take that profiling data and create the avatar impostor system.
While awesome scenes were being crafted leading up to the festival, a scene was built for testing purposes. It contained certain factors desired for the environment: a stadium-like structure, some objects being constantly updated, and at least two different video streaming sources. All of this was done using the Decentraland SDK.
With everything else in place, the proof of concept for avatar impostors was started.
After having the core logic running, the contributors started working on the “visual part” of the feature.
First, a sprite atlas with default impostors was used for randomizing impostors for every avatar.
Later on, several experiments were done using snapshots captured at runtime from each avatar’s angle relative to the camera. However, texture manipulation at runtime in the browser proved to be extremely heavy on performance, so that option was discarded.
In the end, each user’s body snapshot, which already exists in the content servers, was used as their impostor, and the default sprite atlas was applied to bots and to users with no profile.
Last but not least, some final effects regarding position and distance were implemented and tweaked.
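The selection logic behind the impostor system can be sketched roughly as follows. This is a minimal TypeScript illustration, not the explorer's actual code (which lives in the Unity client). The names `AvatarState`, `chooseRepresentations`, `maxFullAvatars` and `impostorDistance` are hypothetical, and the rule "the nearest avatars within a budget are fully rendered, everyone else becomes a billboard" is an assumption about how "when needed" might be decided.

```typescript
// Hypothetical types and names, for illustration only.
interface Vec3 { x: number; y: number; z: number }

interface AvatarState {
  id: string;
  position: Vec3;
  // Identifier of the body snapshot stored in the content server, if the user has a profile.
  bodySnapshot?: string;
}

type Representation =
  | { kind: "full" }
  | { kind: "impostor"; texture: string };

// Placeholder name for the default impostor sprite atlas.
const DEFAULT_ATLAS = "default-impostor-atlas";

function distance(a: Vec3, b: Vec3): number {
  const dx = a.x - b.x, dy = a.y - b.y, dz = a.z - b.z;
  return Math.sqrt(dx * dx + dy * dy + dz * dz);
}

function chooseRepresentations(
  camera: Vec3,
  avatars: AvatarState[],
  maxFullAvatars: number,   // assumed budget of fully rendered avatars
  impostorDistance: number  // assumed distance beyond which avatars are always impostors
): Record<string, Representation> {
  // Sort a copy so the nearest avatars come first.
  const ranked = avatars
    .map((avatar) => ({ avatar, dist: distance(camera, avatar.position) }))
    .sort((a, b) => a.dist - b.dist);

  const result: Record<string, Representation> = {};
  ranked.forEach((entry, rank) => {
    const tooFar = entry.dist > impostorDistance;
    const overBudget = rank >= maxFullAvatars;
    if (tooFar || overBudget) {
      // Prefer the user's own snapshot; fall back to the default atlas for bots
      // and for users with no profile.
      const snapshot = entry.avatar.bodySnapshot;
      result[entry.avatar.id] = {
        kind: "impostor",
        texture: snapshot !== undefined ? snapshot : DEFAULT_ATLAS,
      };
    } else {
      result[entry.avatar.id] = { kind: "full" };
    }
  });
  return result;
}
```

An impostor skips both skinning and the per-wearable draw calls, so in a scheme like this the cost of a crowd is bounded by the budget of fully rendered avatars rather than by the number of avatars present.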
Highly frequented scenes like Wondermine proved invaluable when testing with real users.
👉 Impostor system contributions are public and available on the following PRs (1, 2, 3, 4)
Unity’s skinning implementation for the WebGL/WASM target forces the skinning computations to run on the main thread and, moreover, lacks the SIMD improvements present on other platforms. When rendering many avatars, this overhead piles up and becomes a performance issue, taking up to 15% of frame time (or more!).
In most of the GPU skinning implementations described online, the animation data is packed into textures and then fed into the skinning shader. This is good for performance on paper, but it has its own limitations: blending between animations is complicated, and a custom animator has to be written to handle the animation state.
Since skinning is so poorly optimized in the WASM target, contributors found that a performance improvement of roughly 200% can be observed even without packing the animation data into textures. This means that a simple implementation that uploads the bone matrices to a skinning shader every frame is enough. The optimization was taken further by throttling the farthest avatars, whose bone matrices are not uploaded on every frame. All in all, this approach gave avatars a performance boost, avoided rewriting the Unity animation system, and kept the animation state blending support.
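The throttling idea can be illustrated with a short sketch. The TypeScript below is an assumption-laden illustration rather than the explorer's implementation: the distance tiers, the `uploadInterval` function and the `ThrottledSkinner` class are all hypothetical, and the `upload` callback stands in for whatever call sends the bone matrices to the skinning shader (in a raw WebGL renderer, for example, it could wrap `gl.uniformMatrix4fv`).

```typescript
// Hypothetical distance tiers; the values are illustrative, not measured.
interface ThrottleTier { maxDistance: number; everyNFrames: number }

const TIERS: ThrottleTier[] = [
  { maxDistance: 10, everyNFrames: 1 },       // close avatars: upload every frame
  { maxDistance: 30, everyNFrames: 2 },       // mid-range: every other frame
  { maxDistance: Infinity, everyNFrames: 4 }, // farthest: every fourth frame
];

function uploadInterval(distanceToCamera: number): number {
  for (const tier of TIERS) {
    if (distanceToCamera <= tier.maxDistance) return tier.everyNFrames;
  }
  return 1; // only reached for NaN distances; fall back to no throttling
}

class ThrottledSkinner {
  // Infinity forces an upload on the very first update.
  private framesSinceUpload: number = Infinity;
  private upload: (boneMatrices: Float32Array) => void;

  constructor(upload: (boneMatrices: Float32Array) => void) {
    this.upload = upload;
  }

  // Call once per rendered frame. Returns true when the matrices were uploaded.
  update(boneMatrices: Float32Array, distanceToCamera: number): boolean {
    this.framesSinceUpload++;
    if (this.framesSinceUpload < uploadInterval(distanceToCamera)) {
      return false; // the shader keeps using the previously uploaded pose
    }
    this.upload(boneMatrices);
    this.framesSinceUpload = 0;
    return true;
  }
}
```

Because the animation system still evaluates the pose as usual and only the upload is skipped, a distant avatar in this sketch simply appears to animate at a lower rate, while animation state blending is left untouched.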
The throttled GPU skinning can be seen in action on the farthest avatars in these videos at the Metaverse Festival:
👉 The GPU skinning contributions are public and available on the following PRs (1, 2, 3, 4).
Avatars Rendering Overhaul
Avatars used to be rendered as a series of different skinned mesh renderers that shared their bone data. Also, some wearables needed a separate material to account for the skin color and the emission. This meant that some wearables needed two or three draw calls. In this scenario, having to draw an entire avatar could involve 10+ draw calls. Draw calls are very costly in WebGL, easily comparable to mobile GPU draw call costs—or perhaps even worse. When benchmarking with more than 20 avatars on screen, the frame rate started dropping considerably.
The new avatar rendering pipeline works by merging all the wearable primitives into a single mesh, encoding sampler and uniform data in the vertex stream in a way that allows the packing of each of the wearable materials into a single one as well.
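To make the merging step more concrete, here is a minimal TypeScript sketch of combining several primitives into one mesh while tagging each vertex with the texture slot it should sample from. The data layout (`Primitive`, `CombinedMesh`, a per-vertex `textureSlots` stream) is a simplified assumption for illustration; the actual combiner works on Unity meshes and encodes more data than shown here.

```typescript
// Hypothetical, simplified mesh layout for illustration.
interface Primitive {
  positions: number[];  // x, y, z per vertex
  uvs: number[];        // u, v per vertex
  indices: number[];    // triangle list, relative to this primitive
  textureSlot: number;  // sampler slot assigned to this primitive's texture
}

interface CombinedMesh {
  positions: number[];
  uvs: number[];
  textureSlots: number[]; // one entry per vertex, read by the shader to pick a sampler
  indices: number[];
}

function combinePrimitives(primitives: Primitive[]): CombinedMesh {
  const out: CombinedMesh = { positions: [], uvs: [], textureSlots: [], indices: [] };

  for (const prim of primitives) {
    // Index of this primitive's first vertex inside the combined mesh.
    const baseVertex = out.positions.length / 3;
    const vertexCount = prim.positions.length / 3;

    for (const p of prim.positions) out.positions.push(p);
    for (const uv of prim.uvs) out.uvs.push(uv);

    // Material data that used to live in per-material uniforms now travels
    // with every vertex, so a single material can serve all primitives.
    for (let v = 0; v < vertexCount; v++) out.textureSlots.push(prim.textureSlot);

    // Indices must be shifted because the vertices were appended after the
    // vertices of the previously merged primitives.
    for (const i of prim.indices) out.indices.push(i + baseVertex);
  }
  return out;
}
```

With everything in one vertex and index buffer, the whole avatar can be submitted in one draw call, as long as the textures it needs are all bound at the same time.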
You may be wondering how the wearable textures are packed. As wearables are dynamic by nature, they don’t have the benefit of atlasing, and texture sharing between them is poor, almost non-existent. The most obvious optimizations are:
- Packing all textures into a runtime generated atlas.
- Using a 2D texture array and copying all the wearable textures into it at runtime.
The issue with these approaches is that generating and copying texture pixels at runtime is very expensive due to the CPU-GPU memory bottleneck. The texture array approach has another limitation: its elements can’t simply reference other textures. They have to be copied, and all the textures in the array have to be the same size.
In addition, creating a new texture for each avatar is memory intensive, and the current client heap is limited to 2 GB by emscripten. It can be extended to 4 GB, but only from Unity 2021 onwards, which ships an updated emscripten. As the contributors’ goal was to support at least a hundred avatars at the same time, this was not an option.
To avoid the texture copy issues altogether, a simpler but more efficient approach was taken.
In the avatar shader, a texture sampler pool provides 12 sampler slots for the avatar textures. When rendering, the needed sampler is indexed by using custom UV data. This data is laid out so that albedo and emission textures are identified through different UV channels. This approach allows very efficient packing. For instance, the same material could use 6 albedo and 6 emissive textures, or 11 albedo and a single emissive, and so on. The avatar combiner takes advantage of this and tries to pack all the wearables used in the most efficient way possible. The tradeoff is that branching has to be used in the shader code, but fragment performance is not an issue, so the investment returns are exceedingly positive.
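A slot allocator in the spirit of this description could look like the following TypeScript sketch. It is not the combiner's real code: `Wearable`, `Batch` and `packWearables` are hypothetical names, textures are represented by plain string ids, and opening a new batch (that is, an extra material and draw call) when the 12 slots run out is an assumption made for the sake of the example.

```typescript
const MAX_SAMPLERS = 12; // sampler slots available in the avatar shader

// Hypothetical wearable description: every wearable has an albedo texture
// and may have an emission texture.
interface Wearable { id: string; albedo: string; emission?: string }

interface SlotAssignment { albedoSlot: number; emissionSlot: number } // -1 = no emission

interface Batch {
  slots: string[]; // texture ids; the array index is the sampler slot
  assignments: Record<string, SlotAssignment>;
}

// Returns the slot already holding the texture, or claims the next free one.
function slotFor(batch: Batch, texture: string): number {
  const existing = batch.slots.indexOf(texture);
  if (existing >= 0) return existing;
  batch.slots.push(texture);
  return batch.slots.length - 1;
}

function packWearables(wearables: Wearable[]): Batch[] {
  let current: Batch = { slots: [], assignments: {} };
  const batches: Batch[] = [current];

  for (const w of wearables) {
    // Count how many slots this wearable would newly occupy in the current batch.
    let newSlots = 0;
    if (current.slots.indexOf(w.albedo) < 0) newSlots++;
    if (
      w.emission !== undefined &&
      w.emission !== w.albedo &&
      current.slots.indexOf(w.emission) < 0
    ) {
      newSlots++;
    }

    // Assumption: when the pool is exhausted, start a new batch (one more draw call).
    if (current.slots.length + newSlots > MAX_SAMPLERS) {
      current = { slots: [], assignments: {} };
      batches.push(current);
    }

    const albedoSlot = slotFor(current, w.albedo);
    const emissionSlot = w.emission !== undefined ? slotFor(current, w.emission) : -1;
    current.assignments[w.id] = { albedoSlot, emissionSlot };
  }
  return batches;
}
```

Since albedo and emission textures draw from the same pool, the split between them is free to vary per avatar. The slot numbers produced here are what would be written into the per-vertex UV data so that the shader can branch to the right sampler.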
Top image, before the improvement (120ms / 8FPS). Bottom image, afterwards (50ms / 20FPS).
👉 The avatar rendering contributions are public and available on the following PRs (1, 2)
After these improvements, a performance increase of 180% (from an average of 10FPS to an average of 28FPS) was observed when having 100 avatars on screen.
Some of these improvements, like the throttled GPU skinning, may be applied to other animated meshes within Decentraland in the future.
These enormous efforts to allow 5x more avatars on screen (from 20 to 100) for the Festival are now a part of the Decentraland explorer for good and they will continue to add value to in-world social experiences.