Building a Custom Redundant Audio Data (RED) Encoder to Prevent WebRTC Audio Quality Degradation


This article is for Day 15 of the NTT Communications Advent Calendar 2021.
I am @shinyoshiaki, a member of the SkyWay development team who joined in 2020.

In this article, I will write about RED, a technology for preventing audio quality degradation due to packet loss in WebRTC.

Note: As of December 14, 2021, the content of this article does not work on browsers other than Chrome.

Introduction

Starting with Chrome M96, RFC 2198 - RTP Payload for Redundant Audio Data (RED) is officially enabled.

RED achieves redundancy by packing not only the latest media packet but also the previous N media packets into the RTP payload.
This N is called the "Distance". A larger Distance increases redundancy and improves audio quality when packets are lost. In exchange, the audio traffic grows to roughly Distance + 1 times what it would be without RED.
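
To make the overhead concrete, here is a rough sketch of the RED payload size for a given Distance. It assumes every frame has the same size and that the header blocks cost 4 bytes per redundant block plus 1 byte for the primary block (the layout described later in this article). `redPayloadBytes` is a hypothetical helper name, not part of any library.

```typescript
// Rough estimate only: assumes constant-size frames and a full set of redundant blocks.
function redPayloadBytes(frameBytes: number, distance: number): number {
  const redundantHeaderBytes = 4 * distance; // one 4-byte header block per redundant frame
  const primaryHeaderBytes = 1; // the last header block is a single byte
  const dataBytes = frameBytes * (distance + 1); // N redundant frames plus the latest frame
  return redundantHeaderBytes + primaryHeaderBytes + dataBytes;
}

// Example: 80-byte frames with Distance = 2
console.log(redPayloadBytes(80, 2)); // 249 bytes, versus 80 bytes without RED
```

Once the header blocks are counted, the payload is therefore slightly more than Distance + 1 times the original size.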

As explained in this Public Service Announcement (PSA), the Distance for RED in Chrome is currently fixed at 1.

As mentioned in the webrtc hacks article, in communication environments with high packet loss rates (e.g., 60% in the webrtc hacks example), a Distance value of 2 provides far better audio quality than 1.

Although a larger Distance increases traffic accordingly, the amount is still small compared to video traffic. In use cases where audio quality is the priority, you will therefore often want a Distance greater than 1.

While the Distance for RED in Chrome is currently fixed at 1, the aforementioned PSA states that by using the browser's encoded insertable streams API to create a custom encoder that wraps Opus frames in the RFC 2198 format, the Distance value can be set to any desired value.

In the PSA's own words: "It is possible to use the encoded insertable streams API to write a custom encoder that wraps opus frames in the RFC 2198 format for applications that require more flexibility with respect to the amount of redundancy."

So, in this article, I will try to create a custom RED encoder that can set an arbitrary Distance value.

An article applying the contents of this post to SkyWay's js-sdk is scheduled to be published later on SkyWay's Note site.

Sample Code

The sample code is written in TypeScript and runs in the browser. (It should also be possible to write it in another language by way of WebAssembly.)

The code upon which the samples in this article are based is available on GitHub.

RED Packet

The first thing a custom encoder needs is the ability to serialize and deserialize RED packets.

Let's look at the RED packet specifications.

The diagram below is an example of a RED packet from Section 7 of RFC2198.

    0                   1                    2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3  4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |V=2|P|X| CC=0  |M|      PT     |   sequence number of primary  |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |              timestamp  of primary encoding                   |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |           synchronization source (SSRC) identifier            |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |1| block PT=7  |  timestamp offset         |   block length    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |0| block PT=5  |                                               |
   +-+-+-+-+-+-+-+-+                                               +
   |                                                               |
   +                LPC encoded redundant data (PT=7)              +
   |                (14 bytes)                                     |
   +                                               +---------------+
   |                                               |               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+               +
   |                                                               |
   +                                                               +
   |                                                               |
   +                                                               +
   |                                                               |
   +                                                               +
   |                DVI4 encoded primary data (PT=5)               |
   +                (84 bytes, not to scale)                       +
   /                                                               /
   +                                                               +
   |                                                               |
   +                                                               +
   |                                                               |
   +                                               +---------------+
   |                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Within this diagram, the following part corresponds to the RED packet.
The RED packet is contained within the RTP Payload area.

  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |1| block PT=7  |  timestamp offset         |   block length    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |0| block PT=5  |                                               |
   +-+-+-+-+-+-+-+-+                                               +
   |                                                               |
   +                LPC encoded redundant data (PT=7)              +
   |                (14 bytes)                                     |
   +                                               +---------------+
   |                                               |               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+               +
   |                                                               |
   +                                                               +
   |                                                               |
   +                                                               +
   |                                                               |
   +                                                               +
   |                DVI4 encoded primary data (PT=5)               |
   +                (84 bytes, not to scale)                       +
   /                                                               /
   +                                                               +
   |                                                               |
   +                                                               +
   |                                                               |
   +                                               +---------------+
   |                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Within the diagram above, the following part corresponds to the RED packet header.

  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |1| block PT=7  |  timestamp offset         |   block length    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |0| block PT=5  |
   +-+-+-+-+-+-+-+-+

First, let's look at the RED packet header.

RED Packet Header

A RED packet header consists of multiple header blocks, as shown in the following diagram.

    0                   1                    2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3  4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |F|   block PT  |  timestamp offset         |   block length    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

The fields of a header block are defined as follows:

  • F
    • 1 bit
    • Indicates whether another header block follows. A value of 1 means another header block follows; 0 means this is the last header block.
  • block PT
    • 7 bits
    • The RTP payload type of the data carried in this block.
      • In practice, this is the payloadType listed in the ${payloadType}/${payloadType} part of the a=fmtp line that immediately follows the RED a=rtpmap line in the SDP, for example:
        a=rtpmap:63 red/48000/2
        a=fmtp:63 111/111
        
  • timestamp offset
    • 14 bits
    • The unsigned difference between this block's timestamp and the timestamp in the RTP header.
      • The timestamp of a redundant packet must always be older than the timestamp in the RTP header.
    • Omitted if the F bit is 0.
  • block length
    • 10 bits
    • The length in bytes of the data block corresponding to this header block, excluding the header block part.
    • Omitted if the F bit is 0.
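
The relationship between the SDP lines and block PT can be sketched as follows. `findRedPayloadTypes` is a hypothetical helper written for this illustration (it is not a browser or werift API), and it assumes the SDP contains a single RED a=rtpmap line of the form shown above.

```typescript
// Hypothetical helper: locate the RED payload type in an SDP and list the
// block payload types announced on its a=fmtp line.
function findRedPayloadTypes(
  sdp: string
): { redPT: number; blockPTs: number[] } | undefined {
  const lines = sdp.split(/\r?\n/);

  // e.g. "a=rtpmap:63 red/48000/2" -> RED payload type 63
  let redPT: number | undefined;
  for (const line of lines) {
    const m = /^a=rtpmap:(\d+) red\//i.exec(line);
    if (m) {
      redPT = Number(m[1]);
      break;
    }
  }
  if (redPT === undefined) return undefined;

  // e.g. "a=fmtp:63 111/111" -> block payload types [111, 111]
  const prefix = `a=fmtp:${redPT} `;
  const fmtp = lines.find((line) => line.startsWith(prefix));
  const blockPTs = fmtp
    ? fmtp.slice(prefix.length).trim().split("/").map(Number)
    : [];
  return { redPT, blockPTs };
}
```

For the two SDP lines in the list above, this returns a RED payload type of 63 and block payload types 111 and 111.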

If the F bit is 0, the block is the last one, that is, the primary (latest) packet rather than a redundant one. In this case, the timestamp offset and block length are omitted, leaving an 8-bit (1-byte) header block as shown below:

                      0 1 2 3 4 5 6 7
                     +-+-+-+-+-+-+-+-+
                     |0|   Block PT  |
                     +-+-+-+-+-+-+-+-+
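
As a cross-check of the bit layout above, the two header block forms can also be packed and unpacked with plain bitwise operations. This is a minimal sketch that does not use a BitStream class. The names `packRedundantHeaderBlock` and `unpackHeaderBlock` are hypothetical and exist only for this illustration.

```typescript
interface RedHeaderBlock {
  fBit: number;
  blockPT: number;
  timestampOffset?: number; // 14 bits, redundant blocks only
  blockLength?: number; // 10 bits, redundant blocks only
}

// Pack a 4-byte header block (F = 1): 1 + 7 + 14 + 10 = 32 bits.
function packRedundantHeaderBlock(
  blockPT: number,
  timestampOffset: number,
  blockLength: number
): Uint8Array {
  if (blockPT < 0 || blockPT > 0x7f) throw new RangeError("blockPT must fit in 7 bits");
  if (timestampOffset < 0 || timestampOffset > 0x3fff)
    throw new RangeError("timestampOffset must fit in 14 bits");
  if (blockLength < 0 || blockLength > 0x3ff)
    throw new RangeError("blockLength must fit in 10 bits");
  return new Uint8Array([
    0x80 | blockPT, // F bit set, then 7 bits of block PT
    timestampOffset >> 6, // upper 8 bits of the offset
    ((timestampOffset & 0x3f) << 2) | (blockLength >> 8), // lower 6 offset bits + upper 2 length bits
    blockLength & 0xff, // lower 8 bits of the length
  ]);
}

// Unpack either form; the 1-byte form is returned when the F bit is 0.
function unpackHeaderBlock(bytes: Uint8Array): RedHeaderBlock {
  if (bytes.length < 1) throw new RangeError("empty header block");
  const fBit = bytes[0] >> 7;
  const blockPT = bytes[0] & 0x7f;
  if (fBit === 0) return { fBit, blockPT };
  if (bytes.length < 4) throw new RangeError("truncated header block");
  const timestampOffset = (bytes[1] << 6) | (bytes[2] >> 2);
  const blockLength = ((bytes[2] & 0x03) << 8) | bytes[3];
  return { fBit, blockPT, timestampOffset, blockLength };
}
```

For example, a redundant block with block PT 111, a timestamp offset of 960, and a length of 80 bytes packs to the bytes 0xEF 0x0F 0x00 0x50.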

Based on these specifications, a program to deserialize/serialize RED header blocks would look like this:

rtp/src/rtp/red/packet.ts
interface RedHeaderField {
  fBit: number;
  blockPT: number;
  /** 14 bits */
  timestampOffset?: number;
  /** 10 bits */
  blockLength?: number;
}

export class RedHeader {
  fields: RedHeaderField[] = [];

  static deSerialize(buf: Buffer) {
    let offset = 0;
    const header = new RedHeader();

    for (;;) {
      const field: RedHeaderField = {} as any;
      header.fields.push(field);

      const bitStream = new BitStream(buf.slice(offset));
      field.fBit = bitStream.readBits(1);
      field.blockPT = bitStream.readBits(7);

      offset++;

      // The fBit of the last header block (latest packet) is 0
      if (field.fBit === 0) {
        break;
      }

      field.timestampOffset = bitStream.readBits(14);
      field.blockLength = bitStream.readBits(10);

      offset += 3;
    }

    return [header, offset] as const;
  }

  serialize() {
    let buf = Buffer.alloc(0);
    for (const field of this.fields) {
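      // Redundant packet blocks have timestampOffset and blockLength.
      // Compare against undefined so that a legitimate value of 0 is not mistaken for "absent".
      if (field.timestampOffset !== undefined && field.blockLength !== undefined) {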
        const bitStream = new BitStream(Buffer.alloc(4))
          .writeBits(1, field.fBit)
          .writeBits(7, field.blockPT)
          .writeBits(14, field.timestampOffset)
          .writeBits(10, field.blockLength);
        buf = Buffer.concat([buf, bitStream.uint8Array]);
      }
      // Latest packet
      else {
        // 1-byte header block
        const bitStream = new BitStream(Buffer.alloc(1))
          .writeBits(1, 0)
          .writeBits(7, field.blockPT);
        buf = Buffer.concat([buf, bitStream.uint8Array]);
      }
    }
    return buf;
  }
}

RED Packet Data Block

Immediately following the last header block, the data blocks are stored in the same order as the headers.

   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |1| block PT=7  |  timestamp offset         |   block length    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |0| block PT=5  |                                               |
   +-+-+-+-+-+-+-+-+                                               +
   |                                                               |
   +                LPC encoded redundant data (PT=7)              +
   |                (14 bytes)                                     |
   +                                               +---------------+
   |                                               |               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+               +
   |                                                               |
   +                                                               +
   |                                                               |
   +                                                               +
   |                                                               |
   +                                                               +
   |                DVI4 encoded primary data (PT=5)               |
   +                (84 bytes, not to scale)                       +
   /                                                               /
   +                                                               +
   |                                                               |
   +                                                               +
   |                                                               |
   +                                               +---------------+
   |                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

The length of each redundant packet's data block matches the blockLength field in its header block.

The length of the primary packet's data block is the remainder of the RED packet length after subtracting the header blocks and the redundant packet data blocks.
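
The subtraction can be sketched as below. `primaryBlockLength` is a hypothetical helper for illustration, and it assumes the header blocks have already been parsed into objects whose blockLength is undefined for the primary block.

```typescript
// Only the field needed for the length calculation is modeled here.
interface ParsedHeaderBlock {
  blockLength?: number; // undefined for the primary (last) header block
}

function primaryBlockLength(
  redPayloadLength: number,
  headers: ParsedHeaderBlock[]
): number {
  let consumed = 0;
  for (const header of headers) {
    consumed +=
      header.blockLength === undefined
        ? 1 // 1-byte primary header block, its data length is what we are solving for
        : 4 + header.blockLength; // 4-byte header block plus its redundant data
  }
  const remaining = redPayloadLength - consumed;
  if (remaining < 0) {
    throw new RangeError("RED payload is shorter than its header blocks declare");
  }
  return remaining;
}

// RFC 2198 example: 4 + 1 header bytes, 14 bytes of redundant data, 103 bytes in total
console.log(primaryBlockLength(103, [{ blockLength: 14 }, {}])); // 84
```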

Now that we have all the rules for deserializing/serializing RED packets, we can complete the program to do so.

rtp/src/rtp/red/packet.ts
export class Red {
  header: RedHeader;
  blocks: {
    block: Buffer;
    blockPT: number;
    /** 14 bits */
    timestampOffset?: number;
  }[] = [];

  static deSerialize(buf: Buffer) {
    const red = new Red();
    let offset = 0;
    [red.header, offset] = RedHeader.deSerialize(buf);

    red.header.fields.forEach(({ blockLength, timestampOffset, blockPT }) => {
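      // Compare against undefined so that a zero-length redundant block is not treated as the primary block
      if (blockLength !== undefined && timestampOffset !== undefined) {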
        // Redundant packet length is blockLength
        const block = buf.slice(offset, offset + blockLength);
        red.blocks.push({ block, blockPT, timestampOffset });
        offset += blockLength;
      } else {
        // Primary packet length is the entire remaining area
        const block = buf.slice(offset);
        red.blocks.push({ block, blockPT });
      }
    });

    return red;
  }

  serialize() {
    this.header = new RedHeader();

    for (const { timestampOffset, blockPT, block } of this.blocks) {
      // Redundant packet (only redundant blocks carry a timestampOffset)
      if (timestampOffset !== undefined) {
        this.header.fields.push({
          fBit: 1,
          blockPT,
          blockLength: block.length,
          timestampOffset,
        });
      }
      // Primary packet
      else {
        this.header.fields.push({ fBit: 0, blockPT });
      }
    }

    let buf = this.header.serialize();

    // Pack data blocks
    for (const { block } of this.blocks) {
      buf = Buffer.concat([buf, block]);
    }

    return buf;
  }
}

RED Custom Encoder

Now that we can read and write RED packets, the next step is a RED encoder that packs the specified number (Distance) of past packets into each RED packet for redundancy.

rtp/src/rtp/red/encoder.ts
export class RedEncoder {
  cache: { block: Buffer; timestamp: number; blockPT: number }[] = [];
  // Maximum number of packets to hold. This size will be the maximum distance
  cacheSize = 10;

  // Default distance is set to 1
  constructor(public distance = 1) {}

  // Store the latest packet in the cache
  push(payload: { block: Buffer; timestamp: number; blockPT: number }) {
    this.cache.push(payload);
    // Discard old packets
    if (this.cache.length > this.cacheSize) {
      this.cache.shift();
    }
  }

  // Create a RED packet
  build() {
    const red = new Red();

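    const redundantPayloads = this.cache.slice(-(this.distance + 1));
    const presentPayload = redundantPayloads.pop();
    // build() is only meaningful after at least one push()
    if (!presentPayload) {
      throw new Error("RedEncoder.build() called with an empty cache");
    }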

    // Pack redundant packets
    redundantPayloads.forEach((redundant) => {
      // Perform calculation considering that the RTP Header timestamp is 32-bit
      const timestampOffset = uint32Add(
        presentPayload.timestamp,
        -redundant.timestamp
      );
      // The timestamp offset field is only 14 bits wide, so skip blocks whose offset does not fit
      // https://bugs.chromium.org/p/webrtc/issues/detail?id=13182
      if (timestampOffset >= (1 << 14)) {
        return;
      }
      red.blocks.push({
        block: redundant.block,
        blockPT: redundant.blockPT,
        timestampOffset,
      });
    });
    // Pack the latest packet
    red.blocks.push({
      block: presentPayload.block,
      blockPT: presentPayload.blockPT,
    });
    return red;
  }
}

The push method stores each received RTP payload, together with its RTP header timestamp, in the encoder's cache. The build method then generates a RED packet with the configured distance from that cache.
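
The 14-bit limit on the timestamp offset also caps how far back a redundant block can reach. The sketch below shows the wraparound-safe subtraction and the resulting cap. The function names are hypothetical, and the figure of 960 ticks per frame is an assumption (a 48 kHz RTP clock with 20 ms frames), not something fixed by RED itself.

```typescript
// Largest value that fits in the 14-bit timestamp offset field.
const MAX_TIMESTAMP_OFFSET = (1 << 14) - 1; // 16383

// Unsigned 32-bit difference, so it stays correct when the RTP timestamp wraps around.
function timestampOffsetOf(primaryTimestamp: number, redundantTimestamp: number): number {
  return (primaryTimestamp - redundantTimestamp) >>> 0;
}

// Whether a redundant block can be described by a RED header block at all.
function fitsInRedHeader(offset: number): boolean {
  return offset <= MAX_TIMESTAMP_OFFSET;
}

// Largest distance whose oldest block still fits, given a constant frame duration in RTP ticks.
function maxUsableDistance(ticksPerFrame: number): number {
  return Math.floor(MAX_TIMESTAMP_OFFSET / ticksPerFrame);
}

// Assumption: 48 kHz clock, 20 ms frames -> 960 ticks per frame
console.log(maxUsableDistance(960)); // 17
```

Under that assumption, a Distance above 17 would have no effect, because the older blocks are skipped by the overflow check in build.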

Using the Custom Encoder with Encoded Insertable Streams

Finally, we get to the main topic. We will run the custom encoder in the browser by combining the encoder we just created with insertable streams.

I have prepared sample code that incorporates the custom encoder into a simple use case where a sending Peer transmits audio in one direction to a receiving Peer.

rtp/examples/browser/customEncoder/main.ts
import { buffer2ArrayBuffer, Red, RedEncoder } from "werift-rtp";

(async () => {
  // Set the custom encoder's distance to 3
  const redEncoder = new RedEncoder(3);

  // Enable encodedInsertableStreams
  const sender = new RTCPeerConnection({
    encodedInsertableStreams: true,
  } as any);
  const receiver = new RTCPeerConnection({
    encodedInsertableStreams: true,
  } as any);

  const [track] = (
    await navigator.mediaDevices.getUserMedia({ audio: true })
  ).getTracks();

  const rtpSender = sender.addTrack(track);

  // Configure the sender side for insertableStreams
  const senderTransform = (sender: RTCRtpSender) => {
    //@ts-ignore
    const senderStreams = sender.createEncodedStreams();
    const readableStream = senderStreams.readable;
    const writableStream = senderStreams.writable;
    const transformStream = new TransformStream({
      transform: (encodedFrame, controller) => {
        if (encodedFrame.data.byteLength > 0) {
          // Deserialize RTP Payload (RED packet)
          const packet = Red.deSerialize(encodedFrame.data);
          // Extract the latest packet (non-redundant packet) and pass it to the custom encoder
          const latest = packet.blocks.at(-1);
          redEncoder.push({
            block: latest.block,
            blockPT: latest.blockPT,
            timestamp: encodedFrame.timestamp,
          });
          // Have the custom encoder create a RED packet
          const red = redEncoder.build();
          // Replace the RTP Payload with the RED packet created by the custom encoder
          encodedFrame.data = buffer2ArrayBuffer(red.serialize());
        }
        controller.enqueue(encodedFrame);
      },
    });
    readableStream.pipeThrough(transformStream).pipeTo(writableStream);
  };
  senderTransform(rtpSender);

  const [transceiver] = sender.getTransceivers() as any;
  const { codecs } = RTCRtpSender.getCapabilities("audio");
  // Declare the usage of RED
  transceiver.setCodecPreferences([
    codecs.find((c) => c.mimeType.includes("red")),
    ...codecs,
  ]);

  await sender.setLocalDescription(await sender.createOffer());
  await new Promise<void>((r) => {
    sender.onicecandidate = ({ candidate }) => {
      if (!candidate) r();
    };
  });

  // Configure the receiver side for insertableStreams
  const receiverTransform = (receiver: RTCRtpReceiver) => {
    //@ts-ignore
    const receiverStreams = receiver.createEncodedStreams();
    const readableStream = receiverStreams.readable;
    const writableStream = receiverStreams.writable;
    const transformStream = new TransformStream({
      transform: (encodedFrame, controller) => {
        if (encodedFrame.data.byteLength > 0) {
          // Deserialize RTP Payload (RED packet)
          const red = Red.deSerialize(encodedFrame.data);
          // Display the distance value
          console.log("distance", red.blocks.length - 1);
        }
        controller.enqueue(encodedFrame);
      },
    });
    readableStream.pipeThrough(transformStream).pipeTo(writableStream);
  };
  receiver.ontrack = (e) => {
    receiverTransform(e.receiver);
  };

  await receiver.setRemoteDescription(sender.localDescription);
  await receiver.setLocalDescription(await receiver.createAnswer());
  await sender.setRemoteDescription(receiver.localDescription);
})();

The distance of each RED packet received by the receiver is printed to the browser console.
This confirms that an arbitrary RED distance can be set from within the browser.

Conclusion

Before the advent of the encoded insertable streams API, achieving the kind of functionality discussed in this article required modifying the libwebrtc source code. It was a valuable experience to see firsthand how the arrival of the encoded insertable streams API has made it possible to handle these relatively low-layer processes flexibly on the browser side.

RED is a robust solution for enhancing audio quality, and I hope it contributes to better communication experiences for users in environments where audio quality is currently degraded due to packet loss.

While this article demonstrated a browser-side custom RED encoder to modify the Distance value, similar results can be achieved on the SFU side by de-encapsulating RED packets and performing logic similar to what we did in our custom encoder. For P2P use cases that do not involve an SFU, however, the approach described in this article is currently the only viable option.
