One of my responsibilities in my current project is the analysis of "hard" bugs – problems that are not easily reproducible, or whose cause is so mysterious that they remain unresolved (or worse: resolved as "not reproducible") even after a few developer <-> integration team iterations. A problem visible in the production environment but not reproducible by the developer is fairly simple: you have to read the logs carefully and match them against the source code to derive results. A random problem with unknown reproduction steps, visible only sometimes in the PROD environment, is harder – there is not even an easy way to validate our fixes.
Recently I hit one of the most mysterious problems I have ever seen. After some time the application logs get filled with the "too many jobs" error message. A quick Google search tracked the source of this message to the libupnp library.
Libupnp implements UPnP (Universal Plug and Play), a protocol used for auto-discovery of network services (printers, streaming applications, …). When a new multicast announcement is received, a UPnP client downloads an XML description of that service over HTTP. This download is done in a newly created, separate thread. Libupnp limits the number of threads to 100 in order not to abuse system resources.
And here is where the problem lies: with frequent multicast announcements and a very slow (or unavailable) HTTP service, libupnp can end up with all 100 threads waiting for HTTP responses, at which point no new threads can be created. I proved that by:
- starting a local UPnP service (mediatomb) on my laptop: service mediatomb start
- blocking the HTTP port that serves the description: iptables -A INPUT -p tcp --dport 49152 -j DROP ("DROP" prevents any response from being sent, so the peer doesn't know what's going on)
- generating some UPNP_* events (hello/bye): while true; do service mediatomb restart; sleep 1; done
- watching the logs
The solution seems very easy: lower the HTTP timeout so that "unavailable" UPnP devices (note the quotes – in my opinion this is a network configuration error) will not block our device. After searching the source code I located the line:
#define HTTP_DEFAULT_TIMEOUT 30
I lowered it to 2 seconds, rebuilt libupnp and ran my test again. And it didn't work.
I analysed where this timeout is applied, checked every socket API call and found the following line:
ret_code = connect( handle->sock_info.socket, ( struct sockaddr * )&peer->hostport.IPv4address, sizeof( struct sockaddr_in ) );
Note that this is a blocking call whose timeout is set by the operating system (on Linux it depends on /proc/sys/net/ipv4/tcp_syn_retries).
So even though the timeouts are applied properly once the HTTP (TCP) port is open, it is the connection-establishment time that introduces the big delay (>20 seconds in our case). The solution was to use a non-blocking connect() together with a proper select() call (pseudo code below):
// set non-blocking mode
fcntl(sock, F_SETFL, flags | O_NONBLOCK);
// start connecting in background
connect(sock, (struct sockaddr *)&sa, size);
// wait until connection is made or timeout is reached
ret = select(sock + 1, &rset, &wset, NULL, (timeout) ? &ts : NULL);
// put socket back in blocking mode
fcntl(sock, F_SETFL, flags);
After that change, devices unavailable due to a network error or a local firewall were detected quickly (within 2 seconds).