In the previous blog we introduced the RTP format. In this blog, we will cover how RTP packets are sent across networks in the real world, and some of the challenges that can follow.
RTP Transmission
While most internet traffic is TCP-based, real-time media is generally sent over UDP instead. There are some key differences between TCP and UDP:
| Property | TCP | UDP |
|---|---|---|
| Connection Status | Requires a connection to be established before any data can be sent. | Connectionless; packets are transmitted by the sender as it chooses. |
| Method | Data is sent as a continuous stream, with no built-in segmentation or boundaries. | Data is sent as discrete packets with defined boundaries. |
| Reliability | Delivery of data is guaranteed, with built-in acknowledgement and retransmission. | Delivery of packets is not guaranteed; acknowledgement and retransmission are not built in. |
| Ordering | Data is guaranteed to be received in the order in which it was sent. | Packets can be received in a different order from that in which they were sent. |
Choosing UDP over TCP
When comparing TCP to UDP in the chart above, we see that UDP is a much simpler protocol, with none of the feedback and acknowledgement that makes TCP reliable and ordered. This means that media sent over UDP will generally have significantly lower latency than media sent over TCP, so in cases where latency is important UDP is used in preference to TCP, even though doing so introduces the significant complexity of coping with the lossy, out-of-order nature of UDP (see the Resilience section when it becomes available). TCP may still be used as a fallback if UDP is unavailable, and may be used preferentially in cases where latency requirements are less stringent, such as unidirectional streaming over platforms such as Facebook Live.
Due to the connectionless nature of UDP there is a risk that firewalls or NATs will block the media. The most robust solution to this problem is a set of techniques called Interactive Connectivity Establishment (ICE), which may be covered in a later blog series.
However, one simple way to mitigate this issue is through symmetric ports: implementations should send RTP for a media stream on the same port as they use to receive that media. The reason for this is that most NATs and firewalls, when sending outgoing traffic, will also temporarily allow incoming traffic on the same port, and route it back to the IP address of the sender within the firewall. These temporarily open ports are known as ‘pinholes’. When debugging, if bidirectional streams such as main audio and video work, but unidirectional streams such as content video fail, the culprit is often a NAT or firewall allowing the bidirectional streams through due to pinholing while blocking the unidirectional stream, which does not have an open pinhole.
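As a minimal sketch of symmetric RTP in Python (the addresses and port numbers are illustrative assumptions, not values from any real deployment): bind one UDP socket and use it for both directions, so the source port of outgoing packets matches the port on which we listen.

```python
import socket

# Symmetric RTP: one UDP socket for both directions, so the source port of
# outgoing packets is the same port we listen on. The pinhole opened by our
# outgoing traffic then routes the far end's RTP back to us.
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.bind(("0.0.0.0", 50000))           # illustrative local RTP port

remote = ("198.51.100.7", 50002)        # hypothetical far-end address from SDP
rtp_packet = b"..."                     # placeholder for a real RTP packet

sock.sendto(rtp_packet, remote)         # sent from port 50000...
data, addr = sock.recvfrom(2048)        # ...and received on port 50000
```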
Implementations normally assign ports dynamically at the point of use from within the ephemeral port range. RFC6335 recommends the use of 49152-65535 for dynamic ephemeral ports. Though use of this port range is quite common, others may use the Linux port range of 32768-60999 or some other range. The port range should generally be configurable by administrators to allow them to comply with enterprise policy. However, for cloud-based services that are accessible to many customers, it is recommended to use a considerably narrower range as many enterprises dislike opening access to a large number of external ports for security or policy reasons. If possible, in such cases, it is best to offer support for using only a single, consistent port for all traffic via multiplexing to minimise the ports an enterprise must open.
When multiplexing is not in use, each media stream in each concurrent call requires its own port, along with a port for RTCP. By convention these are usually a pair, with RTP on an even-numbered port and RTCP on the port numbered one higher. When binding these pairs, one simple technique is to bind a UDP socket within the dynamic range without specifying a port, and once that succeeds, attempt to bind the port one higher (if the port bound initially is even) or one lower (if the port bound is odd). If the second bind fails, which should only happen if another application is contending for ports in the same range, the initial port should be released and a new one tried, until a pair is successfully bound or some upper limit of attempts is reached.
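A sketch of this pairing technique in Python; the retry limit is an illustrative assumption, and binding to port 0 lets the OS pick from its dynamic range, as described above.

```python
import socket

def bind_rtp_rtcp_pair(max_attempts=20):
    """Bind an (RTP, RTCP) UDP port pair: RTP even, RTCP one higher.

    A sketch of the technique described above; the retry limit is an
    illustrative assumption.
    """
    for _ in range(max_attempts):
        first = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        first.bind(("0.0.0.0", 0))          # let the OS pick an ephemeral port
        port = first.getsockname()[1]
        # If we landed on an even port, the partner one higher is RTCP;
        # if odd, treat the port one lower as RTP and this one as RTCP.
        partner = port + 1 if port % 2 == 0 else port - 1
        second = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        try:
            second.bind(("0.0.0.0", partner))
        except OSError:
            # Partner port already taken; release both and try a fresh pair.
            first.close()
            second.close()
            continue
        rtp, rtcp = (first, second) if port % 2 == 0 else (second, first)
        return rtp, rtcp
    raise RuntimeError("could not bind an RTP/RTCP port pair")
```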
Note that, according to the rules of Offer/Answer SDP (RFC3264), as soon as an implementation has sent an SDP it should be ready to receive any of the media streams that SDP advertises support for. Indeed, because the media path is normally simpler and faster than the signalling path, it is more common than not for media packets to begin arriving before the SDP Answer or ACK. For unencrypted media, the implementation should ideally begin rendering the media as soon as it is received. This is often not possible with encrypted media because the encryption keys will not yet have been received, but at the very least the system should receive the packets and discard them gracefully until they can be handled, while not generating indications such as ICMP unreachable responses.
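To illustrate the last point, a minimal receive loop, where `keys_available()` and `decrypt_and_render()` are hypothetical stand-ins for real key state and media pipeline hooks:

```python
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.bind(("0.0.0.0", 50000))  # bound as soon as the SDP is sent

while True:
    # Consuming packets here means the OS never answers them with an
    # ICMP port-unreachable, even while we cannot yet decrypt them.
    packet, addr = sock.recvfrom(2048)
    if not keys_available():    # hypothetical SRTP key-state check
        continue                # discard gracefully until keying completes
    decrypt_and_render(packet)  # hypothetical media pipeline entry point
```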
Sending RTP Packets
Packet Sizes and MTU
Every network link has a maximum packet size, known as the Maximum Transmission Unit, or MTU. Rather than being a standard value, different network links have different maximums; in practical terms, the smallest MTU of any hop along a given path determines the path MTU.
It is particularly important to ensure that RTP packets do not exceed the path MTU. If IPv4 is in use then packets that exceed the path MTU may be fragmented, and may be re-assembled automatically before being delivered to the far end, but IP fragmentation is notoriously unreliable for a range of implementation and security reasons (see RFC8900: “IP Fragmentation Considered Fragile” for more details), and oversized packets are quite likely never to be received at their intended destination. However, the lower the MTU, the greater the overhead imposed by packet headers relative to the RTP payload data, and the less efficient transmission becomes, so the closer packets can be to the maximum, the better.
Unfortunately, while there are techniques for automatic discovery of the path MTU (Path MTU Discovery, or PMTUD), they are not always effective, and will delay the start of media if run first. As such, most implementations rely on a configurable MTU with a relatively conservative default that can be overridden by the user or admin. The widely supported upper limit for Ethernet packets is 1500 bytes (meaning a maximum RTP packet size of 1472 bytes, given a minimum 20 byte IP header and the 8 byte UDP header), but some network hops may introduce additional overheads or have somewhat lower limits; 1400 is generally the default for devices designed for enterprise networks, while implementations looking for maximum interoperability go even lower: Chrome currently uses a maximum RTP packet size of 1200 for WebRTC, for instance.
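The header arithmetic above is easy to capture in code; a small sketch assuming IPv4 with no options:

```python
IP_HEADER = 20    # minimum IPv4 header, no options
UDP_HEADER = 8

def max_rtp_packet_size(mtu: int) -> int:
    """Largest RTP packet (header + payload) that fits in one UDP datagram."""
    return mtu - IP_HEADER - UDP_HEADER

print(max_rtp_packet_size(1500))  # 1472 -- the Ethernet upper limit
print(max_rtp_packet_size(1400))  # 1372 -- a common enterprise default
```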
Packet Pacing
Note that, while the maximum size of an RTP packet is less than 1.5kB, media frames can, and will, be split across RTP packets; it is not uncommon for video keyframes to be split across dozens or even hundreds of packets. Exactly how this is done is dependent on the codec and factors such as, in H.264, the packetization mode. In contrast, audio frames are much smaller and will generally fit within a single packet. Indeed, some implementations include multiple audio frames in a single packet to save bandwidth by reducing the overhead per frame, though this will increase latency since the first frame cannot be sent until the subsequent frames in the packet are sampled and encoded, and hence this technique is usually not worth the bandwidth savings.
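As a deliberately simplified sketch of frame splitting (real packetizers are codec-specific, e.g. H.264 FU-A fragmentation under packetization mode 1, and would wrap each chunk in an RTP header):

```python
MAX_RTP_PAYLOAD = 1200  # per-packet size budget, per the MTU discussion

def packetize(frame: bytes, max_payload: int = MAX_RTP_PAYLOAD):
    """Split one encoded frame into payload-sized chunks.

    A simplified sketch: a real packetizer is codec-specific, and each
    chunk would be wrapped in an RTP header, with the marker bit set on
    the final packet of the frame.
    """
    return [frame[i:i + max_payload] for i in range(0, len(frame), max_payload)]

# A 100 kB keyframe becomes 84 packets at a 1200-byte payload budget.
chunks = packetize(b"\x00" * 100_000)
print(len(chunks))  # 84
```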
The consequence of having frames split across a large number of packets is a need for packet pacing. The simplest implementation of real-time media just puts packets onto the wire as soon as the encoder produces them, but this can be problematic for video, particularly at higher resolutions, where keyframes or periods of sudden change can lead to frames that span hundreds of RTP packets. Sending all of these packets in a sudden burst can briefly but greatly exceed the negotiated bandwidth limits, which can congest the network and lead to buffer bloat or packet loss.
Packet pacing often involves adding a small FIFO buffer after the encoder to pace out the packets onto the wire in a more uniform fashion to mitigate this burstiness. There are a number of ways this can be implemented; one common pattern is a leaky bucket algorithm which allows control over bandwidth and burstiness. Each media stream can be paced independently, or there can be a single buffer combining multiple egress streams; this will be more complex to implement but can be designed to ensure that smaller streams such as audio are given priority when there is high contention. Note that, rather than a native implementation, some services use the Linux tc (traffic control) module to enable packet pacing.
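A minimal sketch of such a pacer, implemented token-bucket style (a common way of realising the leaky-bucket metering described above); the rate and burst values, and the `send` callable, are illustrative assumptions:

```python
import time
from collections import deque

class LeakyBucketPacer:
    """Drain a FIFO of packets onto the wire at a bounded byte rate.

    A minimal sketch: the rate and burst values are illustrative, and a
    real pacer would run on its own thread or timer and prioritise audio
    when multiple streams share one bucket.
    """

    def __init__(self, send, rate_bps=2_000_000, burst_bytes=10_000):
        self.send = send              # callable that puts bytes on the wire
        self.rate = rate_bps / 8      # bytes per second
        self.burst = burst_bytes      # bucket capacity: bounds burstiness
        self.tokens = burst_bytes
        self.last = time.monotonic()
        self.queue = deque()

    def enqueue(self, packet: bytes):
        self.queue.append(packet)

    def poll(self):
        """Call frequently (e.g. from an event loop) to drain what the rate allows."""
        now = time.monotonic()
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        while self.queue and len(self.queue[0]) <= self.tokens:
            packet = self.queue.popleft()
            self.tokens -= len(packet)
            self.send(packet)
```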
Quality of Service (QoS)
One further step that can be taken to improve reliability and latency is setting appropriate Quality of Service (QoS) markings on packets. QoS is the practice of allowing routers and switches to classify traffic so as to prioritise it appropriately. In real-time media, this is generally done using Differentiated Services Code Point (DSCP) markings, as defined in RFC2474: a 6-bit field in the IP header that helps routers and switches classify the type of traffic within the packet.
Implementations will not generally need to worry about the mechanics of DiffServ or writing the IP packet header, which will normally be handled at a lower level; the important part for a real-time media implementation is choosing appropriate values for each stream and ensuring these are set on the sockets from which the packets are sent.
There are several RFCs that discuss how best to mark various packet flows (RFC4594 generally, RFC8837 in the context of WebRTC specifically). They generally assign the highest level of priority and latency-intolerance to the audio stream, recommending it be marked for Expedited Forwarding (EF) with a value of 46, while video should be marked for Assured Forwarding (AF) with a value of 34 (for AF41), with lower values for other packet flows. Many implementations follow these recommendations, though any implementation targeted at larger enterprises should also allow admins to configure these values, as they may have their own specific QoS policies they want all their applications to follow.
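Applying these markings typically amounts to a single socket option per stream. A sketch for IPv4 sockets, using the EF and AF41 values above; the DSCP occupies the top six bits of the legacy TOS byte, hence the shift:

```python
import socket

DSCP_EF = 46    # Expedited Forwarding: audio
DSCP_AF41 = 34  # Assured Forwarding 41: video

def set_dscp(sock: socket.socket, dscp: int) -> None:
    """Mark outgoing packets on an IPv4 UDP socket with the given DSCP.

    The DSCP occupies the upper 6 bits of the old TOS byte, so it is
    shifted left by 2. Note that some platforms (notably Windows)
    restrict or ignore IP_TOS, requiring OS-specific QoS APIs instead.
    """
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, dscp << 2)

audio_sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
set_dscp(audio_sock, DSCP_EF)
```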
It is possible to set DSCP at a more fine-grained level, applying different markings to different packets within a given media stream depending on their specific payload, such as applying lower DSCP values for packets encoding discardable layers in H.264 streams encoded with temporal or spatial scalability; however, few implementations go this far.
It is worth noting that the common practice of setting different levels of QoS for audio and video can lead them to take different paths through the network, causing them to arrive out of sync and requiring one to be delayed to achieve lipsync with the other (see the next blog on lipsync for more details). As such, there are those who argue that if an audio stream and a video stream are meant to be synchronised, they should have the same DSCP markings applied, to try to minimise differential delay between the streams, rather than marking audio as more important and delay-intolerant than its matching video.