Friday, May 18, 2012

Bufferbloat

Ever get the feeling that this Internet thing isn't performing as well as it should? You have faster computers and more bandwidth than you did just 5 years ago, yet things seem to be just as slow or even slower. If so, you're not alone. In fact, some Internet gurus have noticed the problem too, and they think they may have found the source: too many good intentions and the human ability to sometimes be too clever for our own good, resulting in this thing called "bufferbloat".

Over the years your PCs, laptops, and smartphones have gotten cheaper while simultaneously getting more memory (RAM, flash, etc.). It's been a great thing. But the same has been happening to the network infrastructure: everything from your home WiFi router to the big iron routers interconnecting huge networks has also benefited from falling memory prices, resulting in more and more memory on board.

As it turns out, this memory increase in network routing equipment might not be a good thing. Network routers have always had some sort of memory for buffering network traffic, but the answer for smoothing out network traffic flows and congestion has always been in the computers on the sending and receiving ends. The protocols have built-in mechanisms for closing the valves when the pipes are overflowing, so to speak. When a computer sends out data, at some point it knows to shut up and send no more until it has heard from the other side. But when the routers buffer more and more of that data instead of dropping it, the sender never gets a timely signal that the pipe is full, and the computer and the router get into this game of waiting on each other to act.
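
To get a feel for the numbers: a full buffer adds a delay of roughly its size divided by the speed of the link it drains into. Here is a minimal Python sketch of that arithmetic. The buffer sizes and the 1 Mbit/s uplink are made-up illustrative values, not measurements from any particular router.

```python
def worst_case_queue_delay_s(buffer_bytes, link_bits_per_s):
    """Seconds needed to drain a completely full buffer over the link."""
    return buffer_bytes * 8 / link_bits_per_s


# Hypothetical home uplink of 1 Mbit/s (an assumption for illustration).
UPLINK_BITS_PER_S = 1_000_000

for buffer_kib in (16, 64, 256, 1024):
    delay = worst_case_queue_delay_s(buffer_kib * 1024, UPLINK_BITS_PER_S)
    print(f"{buffer_kib:5d} KiB buffer -> up to {delay:.2f} s of added latency")
```

On that assumed link, a 256 KiB buffer takes about 2 seconds to drain, so a packet arriving at the back of a full queue waits that long before it even leaves the router.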

Jim Gettys, the guy credited with finally articulating the problem, has a video demonstrating it. He can actually get better network performance by reducing buffer sizes.


As he states in a blog post on the topic:
The buffers are confusing TCP’s RTT estimator; the delay caused by the buffers is many times the actual RTT on the path.  Remember, TCP is a servo system, which is constantly trying to “fill” the pipe. So by not signalling congestion in a timely fashion, there is *no possible way* that TCP’s algorithms can possibly determine the correct bandwidth it can send data at (it needs to compute the delay/bandwidth product, and the delay becomes hideously large). TCP increasingly sends data a bit faster (the usual slow start rules apply), reestimates the RTT from that, and sends data faster. Of course, this means that even in slow start, TCP ends up trying to run too fast. Therefore the buffers fill (and the latency rises).

It has been a particularly devilish problem to diagnose because isolating the variables, something any good scientist would do, actually exacerbates the problem. The more you try to take out interference and noise and other things that are hard to account for, the worse the problem gets. Again, Jim Gettys:
Ironically, I have realized that you don’t see the full glory of TCP RTT confusion caused by buffering if you have a bad connection as it reset TCP’s timers and RTT estimation; packet loss is always considered possible congestion. This is a situation where the “cleaner” the network is, the more trouble you’ll get from bufferbloat. The cleaner the network, the worse it will behave. And I’d done so much work to make my cable as clean as possible…

And it's not just your home router that has the problem. The problem is everywhere, even in the big iron in your ISP's data center and the even bigger iron used to connect your ISP to other ISPs. Here's a video where researchers isolate the problem and show that backing off the buffer size actually makes things better.
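
Why would a smaller buffer help? Here is a toy Python model, not Gettys's or the researchers' experiment, of a single sender that adds one packet to its window every round trip and halves the window whenever the router's buffer overflows. Everything in it (the `simulate` function, the 100-packet link capacity, the buffer sizes) is a simplifying assumption for illustration; real TCP and real routers are far more complicated.

```python
def simulate(buffer_pkts, capacity_pkts=100, rounds=2000):
    """Toy AIMD sender over one bottleneck; one round = one base RTT.

    Returns (mean queueing delay in base RTTs, link utilization 0..1).
    """
    cwnd = 1.0            # sender's window, in packets
    delay_sum = 0.0
    delivered_sum = 0.0
    for _ in range(rounds):
        # Packets beyond what the link carries per round wait in the buffer.
        excess = max(0.0, cwnd - capacity_pkts)
        queue = min(excess, buffer_pkts)
        delay_sum += queue / capacity_pkts
        delivered_sum += min(cwnd, capacity_pkts)
        if excess > buffer_pkts:
            # Buffer overflowed: a packet is dropped, so the sender backs off.
            cwnd = max(1.0, cwnd / 2)
        else:
            # No loss seen: keep probing for more bandwidth.
            cwnd += 1.0
    return delay_sum / rounds, delivered_sum / (rounds * capacity_pkts)


for buffer_pkts in (10, 100, 1000):
    delay, utilization = simulate(buffer_pkts)
    print(f"buffer={buffer_pkts:4d} pkts  "
          f"mean queue delay={delay:.2f} RTTs  utilization={utilization:.0%}")
```

In this model the sender only slows down once the buffer is full, so the bigger the buffer, the longer packets sit in the queue before anyone notices. A very small buffer keeps delay low at some cost in link utilization, while an oversized one adds several round trips of latency without delivering anything faster.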


So there. It's not you. You are not crazy. Things are not as they should be. But don't worry, your friendly, neighborhood Internet gurus are working on the problem.

5 comments:

  1. I could actually follow this post for the most part -- layman's terms are helpful for people like me who are as dumb as a rock when it comes to computers and such. I have no idea how it all works, I just know that it does; and I have been one of those who has been questioning why things have been getting slower instead of faster with all the technology. Thanks much!

    1. Layman's terms is one of the issues this industry has trouble with -- a lot of smart people unable to communicate normally. I'm glad you found this helpful.

  2. Yep, and another thing it can cause with buffer overflow is loss of connectivity, especially over T-3/OC3 links, because the response times exceed latency parameters and the router drops the connection to the server (just had that happen at work yesterday). Router was buffering legacy email server to the point that pings never made it to the server, and server was not sending traffic due to "errors" on the transport level. I'm going to shoot this to our techs for their SA.
    Thanks!

    1. Huh! That's a good point. If the route coordination protocols can't communicate, then this will cause route instability.

    2. Yep, and I'm betting it's happening MUCH more frequently than we realize!