This is a case study of a project that takes a Full HD 30fps camera stream and scales and encodes it for UDP streaming, but also takes occasional JPEG snapshots, occasionally records to file, and can act as a USB camera. It can also play back the recorded files and network streams. All of this has to run on an i.MX53. H.264 and JPEG encoding/decoding can be done on the VPU; scaling and color conversion can be done on the IPU. To put all of this together, you end up with GStreamer.
For the camera, there’s already a capture device and v4l2src.
For the output, there’s v4l2sink.
For the IPU and VPU, it's more complex. Hardware acceleration is really needed: using the software videoscale element, for example, gives 100% CPU load and about 1 fps. There is an existing Freescale encoder/decoder plugin, but it is GStreamer 0.10 only and uses a kernel interface that is never going to be upstreamed. For scaling and colorspace conversion, there was nothing. A v4l2 mem2mem device is essentially a capture and an output device in one, so the idea was to create a V4l2Filter element that can be used for any hardware-accelerated operation.
V4l2Filter can't use GstBaseTransform because it has to be asynchronous, and it can't use GstVideoDecoder or GstVideoEncoder because that's not what it does. That's why Michael asked in an earlier session whether there shouldn't be a base class for asynchronous transforms; as it is, he had to write all the boilerplate himself.
Caps negotiation is tricky, because the element is an encoder and a decoder wrapped into one: both the sink and the source pad can handle JPEG or H.264, which makes it look as if the element could transcode, which it can't. QoS is implemented following the approach of GstVideoDecoder. Event and meta handling for the encoder/decoder is tricky because it reorders frames, so timestamps and other metadata somehow have to be preserved, and in-flow events have to be sent out after the correct frame. There is also some story around latency calculation.
The remaining problem is copying: the v4l2 buffer pool cannot work with memory that was not allocated by the same element. v4l2 has two mechanisms for zero-copy: USERPTR and DMABUF. USERPTR looks simpler but is not reliable, because the v4l2 driver has to figure out whether it can use a given pointer or not. DMABUF has a clean API and should work with other plugins as well; a buffer could, for example, be imported by a GL plugin. There was already a dmabuf GstMemory and GstAllocator (which needed some fixes), so the remaining work was to use them and teach GstV4l2BufferPool to create dmabuf buffers. There is still an ugly hack in there, because the pool was not designed to handle foreign buffers. A kernel patch was needed to allow allocation of writable dmabuf memory.
The dynamic pipeline was first implemented on 0.10, and it was absolutely not reliable. With 1.0 it was much easier: the dynamic pipeline worked reliably and half of the code could be removed. There are still some issues, though. When the pipeline is changed, extra buffers may be needed; this was a problem until VIDIOC_CREATE_BUFS was added. [Some more issues here which I didn't understand.]
With all the acceleration, the pipeline can run with almost no CPU load; only when audio is added is there some. At some point it becomes critical anyway, because the accelerators are still accessing memory and memory bandwidth is the limiting factor.
One conclusion from this project is that it is still difficult to debug things. Your pipeline can get stuck, and it is difficult to find out why.