Do You Want to Retry? Handling and Testing Network Failures – Anton Marchukov

[slides are online]

Anton works on CI in Red Hat’s oVirt Community Infra team. The talk reports a real story that is not finished yet. The CI system runs thousands of jobs that depend on many network services, so jobs sometimes fail because of the network. A failed job will eventually run again, so this is not a big problem, but false failures are annoying. A simple retry could be added, but does that solve the problem? The primary goal of this project is to make things reproducible.

Even if a single network failure is very unlikely, a long dependency chain of network transactions makes the chance of at least one failure non-negligible. If the chain is repeated often enough, a failure is practically certain to happen.
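The arithmetic behind this can be sketched in a few lines of Python. The sketch assumes that transactions fail independently of each other, and the failure probability and chain lengths are made-up illustrative values, not numbers from the talk.

```python
def chain_failure_probability(p_single, n_transactions):
    """Probability that at least one of n independent transactions fails."""
    # The chain succeeds only if every single transaction succeeds.
    return 1.0 - (1.0 - p_single) ** n_transactions


# Hypothetical numbers: each transaction fails once in a thousand times.
for n in (1, 10, 100, 1000):
    print(n, round(chain_failure_probability(0.001, n), 4))
```

Even with a tiny per-transaction failure rate, the result grows quickly with the length of the chain.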

In order to try things on a laptop, Anton set up a test environment with a container that runs an HTTP server serving a JSON file. A network emulator (NetEm) between the container and the host simulates failures. Because it is also possible that the network doesn’t work at all, a UDP probe is tried before the test. To make things reproducible, the traffic is captured with dumpcap into CSV files that can be analysed in Octave.

NetEm can introduce a number of impairments. Anton only uses packet loss, applied to outgoing packets, because all the errors look the same to the Python scripts anyway. The NetEm parameters are set with tc, which can also report statistics.
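As a rough illustration of what such a tc setup looks like, the sketch below builds the command lines from Python without running them (running them needs root). The device name veth0 and the 10% loss figure are assumptions for the example; the talk's actual parameters are not given here.

```python
def netem_loss_command(device, loss_percent):
    """Build the tc command that drops a percentage of outgoing packets."""
    return ["tc", "qdisc", "add", "dev", device, "root",
            "netem", "loss", f"{loss_percent}%"]


def netem_stats_command(device):
    """Build the tc command that prints qdisc statistics for a device."""
    return ["tc", "-s", "qdisc", "show", "dev", device]


# Hypothetical device and loss rate, printed instead of executed.
print(" ".join(netem_loss_command("veth0", 10)))
print(" ".join(netem_stats_command("veth0")))
```

The lists can be handed to subprocess.run() as they are, which avoids shell quoting issues.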

First test: it just stopped working after a while. It turns out that the Python requests library has no timeout by default, so a request simply hangs when the network fails. After adding a timeout, he had a working test. With the chosen parameters, 75% of the tests were successful.
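A minimal sketch of a request with an explicit timeout follows. The function name, the timeout values and the optional session argument are illustrative choices, not the code shown in the talk.

```python
import requests


def fetch_json(url, session=None, timeout=(3.05, 10)):
    """GET a URL and decode the JSON body without ever waiting forever.

    timeout is a (connect seconds, read seconds) pair; the values are
    arbitrary examples.
    """
    session = session or requests.Session()
    # Without timeout=..., requests waits indefinitely for the server.
    response = session.get(url, timeout=timeout)
    response.raise_for_status()
    return response.json()
```

When a timeout expires, requests raises requests.exceptions.Timeout, so the failure becomes visible to the caller instead of hanging the test.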

To increase the success probability, just retry. However, a retry is only safe if the request is idempotent, or if nothing happened at all, e.g. the connection was never established. requests has a built-in feature for HTTP retries, and you can give it a filter for which methods may be retried. With that enabled, 14% of the tests still failed. It turns out that there are separate retry counters for connect, read and redirect, and only the total had been changed, so connect was still attempted only once.
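A session with every retry counter set explicitly could look like the sketch below. It assumes urllib3 1.26 or newer, where the method filter is called allowed_methods (older releases used method_whitelist). The attempt count and the function name are illustrative.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def make_retrying_session(attempts=5):
    """Return a Session that retries idempotent requests."""
    retry = Retry(
        total=attempts,
        connect=attempts,    # failures before the request reached the server
        read=attempts,       # failures after the request was sent
        redirect=attempts,
        allowed_methods=frozenset({"GET", "HEAD"}),  # idempotent methods only
    )
    adapter = HTTPAdapter(max_retries=retry)
    session = requests.Session()
    session.mount("http://", adapter)
    session.mount("https://", adapter)
    return session
```

Restricting the method filter to GET and HEAD keeps non-idempotent requests such as POST from being silently repeated.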

This doesn’t solve all network errors. If the network is down for 5 minutes, then 3 retries in 10 seconds are not going to help. So to increase the time between retries, use exponential backoff, which is supported by requests. But still 14% failed. It turns out it still didn’t retry after a connection error, because requests was checking for a different failure than the exception raised by urllib3. The fix was to implement their own retry in addition to the one in requests.
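A hand-written retry loop with exponential backoff might look like the following. This is a generic sketch, not the code from the talk: the function name, the default exception types and the delays are assumptions, and the sleep function is a parameter only so that the loop can be exercised without waiting.

```python
import time


def retry_with_backoff(operation, attempts=5, base_delay=1.0,
                       retry_on=(ConnectionError, TimeoutError),
                       sleep=time.sleep):
    """Call operation() until it succeeds or the attempts are used up.

    The delay doubles after each failure: base_delay, 2*base_delay, ...
    """
    for attempt in range(attempts):
        try:
            return operation()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: let the caller see the error
            sleep(base_delay * 2 ** attempt)
```

With requests, one would pass retry_on=(requests.exceptions.ConnectionError, requests.exceptions.Timeout), since those classes do not derive from the built-in ConnectionError and TimeoutError.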

There was still 1 failure: the HTTP response was empty because the connection was terminated early. It turns out that urllib3 doesn’t check the Content-Length header, and in some cases there is no Content-Length anyway. So another way is needed to check the validity of the response: add an exception handler for invalid JSON and retry.
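Retrying on an invalid body can be sketched as below. The fetch_text callable is a hypothetical stand-in for whatever performs the HTTP request and returns the body as a string, and the attempt count is arbitrary.

```python
import json


def fetch_valid_json(fetch_text, attempts=3):
    """Fetch a body until it parses as JSON, up to a number of attempts."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for _ in range(attempts):
        body = fetch_text()
        try:
            return json.loads(body)
        except ValueError as error:  # json.JSONDecodeError is a ValueError
            # An empty or truncated body ends up here, so fetch again.
            last_error = error
    raise last_error
```

As with any retry, this is only appropriate when repeating the request is safe, for example a GET of a static file.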

Providing retries is not easy. A library like requests can offer them, but in many cases whether a retry is correct depends on how the library is used (cf. idempotence, data validity). So when you provide a library, make retries flexible and try to take the use cases into account.

