Real-time WebGL video manipulation

Szabolcs Damján
Byborg Engineering
Feb 22, 2022

live!

This is part III of the “Manipulating video in a browser” series. In the previous articles, we experimented with the new “Insertable Streams for MediaStreamTrack” API to modify the webcam’s video stream before sending it further — to the WebRTC module, for example.

The first attempt used JavaScript to process the video frames: https://medium.com/docler-engineering/manipulating-video-in-a-browser-5b37f8149d9b

The second one used WebAssembly to do the same operation: https://medium.com/docler-engineering/video-manipulation-with-webassembly-3477a0c8524d

Our example is a simple green screen effect where the application swaps the green background to a nice summer scene.

original scene
modified scene

The results were somewhat disappointing: a lower-performance PC or an average mobile device was not able to deliver a solid 30 frames/sec at 720p (HD)!

Raise the bet!

There is a hidden workhorse in nearly every computer, regardless of whether it’s a smartphone or a high-end PC:

the graphics processor a.k.a. the famous GPU

Now, I’ll show how to get the GPU to process video frames in a browser environment.

WebGL to the rescue

In modern browsers, our key to GPU processing is the WebGL API. Let’s dust off our previous example and swap the frame transformer function for a WebGL-based one!

GPU processing works a little differently from our previous examples. We will need to prepare and upload special programs to the GPU: with the help of the vertex and fragment shader programs, we can transform our video frames. Each frame will be uploaded to the GPU as a texture, then the modified frame will be passed further down the video stream.

process flow
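As a quick refresher, the surrounding pipeline is the same insertable-streams setup as in the previous parts. Roughly like this (a sketch; `transformFrame` stands for the WebGL-based transformer we will build below):

```javascript
// Sketch of the surrounding pipeline (same idea as in the previous parts).
// MediaStreamTrackProcessor / MediaStreamTrackGenerator are currently Chromium-only.
const stream = await navigator.mediaDevices.getUserMedia({ video: { width: 1280, height: 720 } });
const [track] = stream.getVideoTracks();

const processor = new MediaStreamTrackProcessor({ track });
const generator = new MediaStreamTrackGenerator({ kind: 'video' });

// `transformFrame` is the WebGL-based frame transformer discussed below.
const transformer = new TransformStream({
  async transform(videoFrame, controller) {
    const outputFrame = await transformFrame(videoFrame);
    videoFrame.close();
    controller.enqueue(outputFrame);
  },
});

processor.readable.pipeThrough(transformer).pipeTo(generator.writable);
const processedStream = new MediaStream([generator]); // ready to hand over to WebRTC
```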

I must mention that this article won’t explain all the details of WebGL, but it will demonstrate a WebGL-based implementation of a certain technical problem. I have added some links to learning material in the appendix if you wish to dig deeper into the topic.

Prepare for WebGL magic

To use WebGL, you will need the following components:

  • An HTML canvas element
    It is not necessary to attach this element to the DOM; it can simply exist in memory. This will be the running context of your WebGL programs, and will hold the resulting image after every processing round.
  • The vertex shader
    Since our image is a simple, static rectangle in the 3D space provided by WebGL, this component will not do much beyond coordinate-system transformations (see the sketch after this list).
  • The fragment shader
    This piece of program will do the image manipulation by running on the GPU directly.
  • WebGL initialization
    Check out the details in the upcoming sections.
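To give an idea of how little the vertex shader has to do, here is a minimal pass-through version for a full-screen quad. Treat it as a sketch, assuming a WebGL2 context; the attribute names are just illustrative:

```javascript
// Minimal pass-through vertex shader for a full-screen quad (WebGL2 / GLSL ES 3.00).
const vertexShaderSource = `#version 300 es
  in vec2 a_position;   // clip-space quad corners, e.g. (-1,-1) .. (1,1)
  in vec2 a_texCoord;   // matching texture coordinates, (0,0) .. (1,1)
  out vec2 v_texCoord;

  void main() {
    v_texCoord = a_texCoord;                 // hand the coordinates to the fragment shader
    gl_Position = vec4(a_position, 0.0, 1.0); // no real geometry transformation needed
  }
`;
```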

WebGL initialization

Before getting the GPU to do what we want, we need to initialize the WebGL subsystem. The initialization code is mostly generic; the only exception is the interesting way we need to upload the YUV color format video frames.

The full initialization script can be found in the Appendix.

Apart from the basic initialization process, we also need to upload the background image to the GPU as a normal RGBA texture. However, the video frames coming from the camera are in the YUV color format, which is not directly “understandable” for the GPU. We will need to delegate the color space conversion to the GPU as well, and upload the YUV encoded frame data directly as a texture.
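Uploading the background is the standard WebGL texture upload. Roughly like this, assuming the photo has already been loaded into an image element or ImageBitmap and the texture was created during initialization:

```javascript
// Upload the background photo as a regular RGBA texture (sketch).
gl.activeTexture(gl.TEXTURE1);                    // texture unit reserved for the background
gl.bindTexture(gl.TEXTURE_2D, backgroundTexture);
gl.texImage2D(gl.TEXTURE_2D, 0, gl.RGBA, gl.RGBA, gl.UNSIGNED_BYTE, backgroundImage);
```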

A generic RGBA texture is not exactly suitable for this purpose, because the Y, U and V channels are packed in separate sections of the data buffer, and each output pixel needs data from all three sections. We need a simple array representation of the data, and the most appropriate texture format for this purpose is a pure monochrome format: a LUMINANCE-only texture.

Data arrangement in the different texture formats
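In practice, this means copying the raw frame bytes into a single-channel texture whose height covers the Y, U and V planes together. A sketch of the idea, assuming an I420 (4:2:0 planar) frame:

```javascript
// Copy the raw I420 frame data into a LUMINANCE (single-channel) texture (sketch).
// The Y plane is width × height, the U and V planes are (width/2) × (height/2),
// so the whole buffer fits into a width × (height * 1.5) monochrome texture.
const frameData = new Uint8Array(videoFrame.allocationSize());
await videoFrame.copyTo(frameData);

gl.activeTexture(gl.TEXTURE0);                 // texture unit reserved for the camera frame
gl.bindTexture(gl.TEXTURE_2D, yuvTexture);
gl.pixelStorei(gl.UNPACK_ALIGNMENT, 1);
gl.texImage2D(
  gl.TEXTURE_2D, 0, gl.LUMINANCE,
  videoFrame.codedWidth, videoFrame.codedHeight * 1.5, 0,
  gl.LUMINANCE, gl.UNSIGNED_BYTE, frameData
);
```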

Now that we have both the background image and the video frames as textures, it’s time to examine how the GPU will process this data to create the desired video effect.

Fragment shader

We kept the implementation of the shader as close to the original JavaScript solution as possible to keep it understandable.

The most interesting part is the YUV to RGBA conversion. The following image illustrates how a shader program processes the YUV format data. We can see that, unlike the RGBA format, the information is technically spread across three different sub-images, and each pixel receives its luminance and color information from all three of them.

Because the U and V channels are halved in both width and height, their even and odd rows end up packed next to each other when the buffer is sampled at full resolution. We have to live with this small drawback, because we have tricked the WebGL subsystem into “thinking” we provided a full-resolution monochrome texture, while in reality we sent a full-resolution monochrome channel ( Y ) and two half-resolution color channels ( U and V ) in the same texture buffer.
In the shader program, the first four definitions are used to look up the YUV channel data. The rest should be pretty straightforward, since it’s really analogous to the JavaScript version.
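A simplified sketch of such a fragment shader is shown below. The plane lookup assumes the I420 packing described above, and the green-screen thresholds and names are only illustrative:

```javascript
// Sketch of the green-screen fragment shader (WebGL2 / GLSL ES 3.00).
// Assumes the I420 buffer was packed row by row into a width × (height * 1.5) LUMINANCE texture.
const fragmentShaderSource = `#version 300 es
  precision highp float;

  uniform sampler2D u_yuvFrame;    // packed camera frame
  uniform sampler2D u_background;  // background photo (RGBA)
  uniform ivec2 u_frameSize;       // coded width / height of the video frame
  in vec2 v_texCoord;
  out vec4 outColor;

  void main() {
    int w = u_frameSize.x;
    int h = u_frameSize.y;
    ivec2 pix = ivec2(v_texCoord * vec2(u_frameSize));

    // Y plane: full resolution, texture rows [0, h)
    float y = texelFetch(u_yuvFrame, ivec2(pix.x, pix.y), 0).r;

    // U and V planes: half resolution; two half-width plane rows share one texture row
    int planeRow = pix.y / 2;
    int planeCol = pix.x / 2;
    int packedX  = (planeRow % 2) * (w / 2) + planeCol;
    float u = texelFetch(u_yuvFrame, ivec2(packedX, h + planeRow / 2), 0).r - 0.5;
    float v = texelFetch(u_yuvFrame, ivec2(packedX, h + h / 4 + planeRow / 2), 0).r - 0.5;

    // Approximate BT.601 YUV -> RGB conversion
    vec3 rgb = vec3(y + 1.402 * v, y - 0.344 * u - 0.714 * v, y + 1.772 * u);

    // Naive chroma key: "green enough" pixels are replaced by the background
    bool isGreen = rgb.g > 0.3 && rgb.g > 1.4 * rgb.r && rgb.g > 1.4 * rgb.b;
    vec3 background = texture(u_background, v_texCoord).rgb;
    outColor = vec4(isGreen ? background : rgb, 1.0);
  }
`;
```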

Video stream transformer

We are now at the last step to get a working test. We need to modify the transformer function from our previous versions so that the application uploads the frames as textures, then gets the results back before creating the outgoing video frame.

First attempt

We are going to read the rendered pixels back from the framebuffer with the “readPixels” function.
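In code, the first attempt looks roughly like this (a sketch; `uploadFrameToTexture()` stands for the YUV texture upload described earlier):

```javascript
// First attempt (sketch): render with WebGL, then read the pixels back to the CPU.
async function transformFrame(videoFrame) {
  await uploadFrameToTexture(videoFrame);   // copy the YUV data into the LUMINANCE texture
  gl.drawArrays(gl.TRIANGLE_STRIP, 0, 4);   // run the shaders on the full-screen quad

  const { codedWidth: width, codedHeight: height } = videoFrame;
  const pixels = new Uint8Array(width * height * 4);
  // Note: readPixels() returns rows bottom-up, so the quad's texture coordinates
  // (or the result) have to compensate for the vertical flip.
  gl.readPixels(0, 0, width, height, gl.RGBA, gl.UNSIGNED_BYTE, pixels); // <- the bottleneck

  return new VideoFrame(pixels, {
    format: 'RGBA',
    codedWidth: width,
    codedHeight: height,
    timestamp: videoFrame.timestamp,
  });
}
```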

The application works, but the performance is far below the expected level. Even on a high-performance PC, the frame processing time is around 10 ms. After inspecting the performance metrics, we quickly find that the “readPixels” function eats up the resources!

We can do better!

Second attempt

By carefully checking the VideoFrame constructor’s documentation, we find that it can accept canvas elements as well, so let’s also try the version below. The following changes are needed:
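In essence, the readPixels() round trip disappears and the canvas itself becomes the source of the outgoing frame. A sketch:

```javascript
// Second attempt (sketch): hand the WebGL canvas straight to the VideoFrame constructor.
async function transformFrame(videoFrame) {
  await uploadFrameToTexture(videoFrame);
  gl.drawArrays(gl.TRIANGLE_STRIP, 0, 4);

  // No readPixels() round trip: the canvas holding the rendered result
  // is used directly as the image source of the new frame.
  return new VideoFrame(canvas, { timestamp: videoFrame.timestamp });
}
```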

🎉 Hurrah! 🎉

The speed has increased a lot! Now, the frames are processed in 1ms on the same PC. Excellent!!!

Benchmarks

To compare the different technologies, I have measured the frame processing time on different devices with all the options discussed in my series of articles.

All of the devices ran Chrome v97, and the resolution was set to HD (1280 × 720).

Frame processing time

At 30 frames/sec, the available time between frames is 33 ms. In the spreadsheet above, the red values indicate the cases where the frame rate dropped due to extended processing time.

It’s obvious that GPU processing is far more performant than any other solution, and is capable of real-time image processing on slower (mobile) devices as well.

Another interesting result is the performance of WebAssembly, which seems to be relatively faster on slower devices compared to JavaScript.

Appendix

Useful links

Some of my favorite WebGL resources:

https://webgl2fundamentals.org/
https://thebookofshaders.com/

WebGL initialization script

You may find the details interesting…
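A condensed sketch of that initialization (shader compilation, the full-screen quad geometry and the two textures), using the names from the sketches above:

```javascript
// Condensed WebGL initialization (sketch).
const canvas = document.createElement('canvas');   // never attached to the DOM
canvas.width = 1280;
canvas.height = 720;
const gl = canvas.getContext('webgl2');

function compileShader(type, source) {
  const shader = gl.createShader(type);
  gl.shaderSource(shader, source);
  gl.compileShader(shader);
  if (!gl.getShaderParameter(shader, gl.COMPILE_STATUS)) {
    throw new Error(gl.getShaderInfoLog(shader));
  }
  return shader;
}

// Compile and link the two shaders shown earlier
const program = gl.createProgram();
gl.attachShader(program, compileShader(gl.VERTEX_SHADER, vertexShaderSource));
gl.attachShader(program, compileShader(gl.FRAGMENT_SHADER, fragmentShaderSource));
gl.linkProgram(program);
gl.useProgram(program);

// Full-screen quad: clip-space positions and texture coordinates, interleaved
const quad = new Float32Array([
  // x,  y,  u,  v
  -1, -1,  0,  1,
   1, -1,  1,  1,
  -1,  1,  0,  0,
   1,  1,  1,  0,
]);
gl.bindBuffer(gl.ARRAY_BUFFER, gl.createBuffer());
gl.bufferData(gl.ARRAY_BUFFER, quad, gl.STATIC_DRAW);

const positionLoc = gl.getAttribLocation(program, 'a_position');
const texCoordLoc = gl.getAttribLocation(program, 'a_texCoord');
gl.enableVertexAttribArray(positionLoc);
gl.vertexAttribPointer(positionLoc, 2, gl.FLOAT, false, 16, 0);
gl.enableVertexAttribArray(texCoordLoc);
gl.vertexAttribPointer(texCoordLoc, 2, gl.FLOAT, false, 16, 8);

// Two textures: unit 0 for the packed YUV frame, unit 1 for the background
function createTexture(unit) {
  const texture = gl.createTexture();
  gl.activeTexture(gl.TEXTURE0 + unit);
  gl.bindTexture(gl.TEXTURE_2D, texture);
  gl.texParameteri(gl.TEXTURE_2D, gl.TEXTURE_MIN_FILTER, gl.NEAREST);
  gl.texParameteri(gl.TEXTURE_2D, gl.TEXTURE_MAG_FILTER, gl.NEAREST);
  gl.texParameteri(gl.TEXTURE_2D, gl.TEXTURE_WRAP_S, gl.CLAMP_TO_EDGE);
  gl.texParameteri(gl.TEXTURE_2D, gl.TEXTURE_WRAP_T, gl.CLAMP_TO_EDGE);
  return texture;
}
const yuvTexture = createTexture(0);
const backgroundTexture = createTexture(1);
gl.uniform1i(gl.getUniformLocation(program, 'u_yuvFrame'), 0);
gl.uniform1i(gl.getUniformLocation(program, 'u_background'), 1);
gl.uniform2i(gl.getUniformLocation(program, 'u_frameSize'), canvas.width, canvas.height);

gl.viewport(0, 0, canvas.width, canvas.height);
```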
