Dariusz on Software: entries from May 2013

About This Site

Software development stuff

Entries from May 2013.

Sat, 25 May 2013 10:55:19 +0000

After recent OS upgrade one of my unit tests started to fail (to be precise it started to hang). Quickly check showed me that CGI process started by os.popen() hanged. The old source code:



f = os.popen("./cgi_script.cgi > /dev/null", "w")
f.write(postBody)
f.flush()
f.close()

As os.popen() is deprecated now (I know, it's a very old codebase that started with Python 1.5) I've moved to new subprocess module:



fNull = file("/dev/null", "w")
p = subprocess.Popen("./cgi_script.cgi", shell=False, bufsize=1024, stdin = subprocess.PIPE, stdout = fNull)
fw = p.stdin
fw.write(postBody)
fw.flush()
fw.close()
del p

As you can see it's more verbose now but I've eliminated shell (slightly faster operation).

Some notes found during migration:

without "del p" process may be not terminated causing problems with DB state (CGI proces updates database and test checks this state later)
I/O configuration is more flexible than os.popen() - you can make pipes more easily

Tags: python.

Server flood automatic detection

Wed, 22 May 2013 23:04:14 +0000

My current customer develops embedded devices used by many end users. In order to save server load devices use multicasts for downloading data: every device registers itself on multicast channel using IGMP and listens to UDP packets. No connections to be managed results in lower overhead.

However, some data (or some requests) cannot be downloaded from multicasts channels and HTTP/HTTPS must be used to interact with server. As the number of devices is very high special methods have been used in order not to overload backend servers (randomised delays, client software optimization).

Consequently, small bug in client software that will trigger more events than usual can be very dangerous to whole system stability (because the effect of thousands of devices - perfect DDOS may kill back-end). In order to catch such errant behaviour as soon as possible I've developed daily report that controls server usage in my customer office.

First of all, we need to locate the most "interesting" device by IP address from logs (we list top 20 IPs based on server usage):

    ssh $server "cat /path-to-logs/localhost_access_log.$date.log" | awk '
        {
            t[$1 " " $7]++
            ip[$1]++
        }
        END {
            for(a in t) { print t[a], a }
            max = 0
            max_ip = ""
            for(a in ip) { if(ip[a] > max) { max=ip[a]; max_ip=a; } }
            print max_ip > "/tmp/max_ip"
        }
    ' | sort -n | tail -20
    IP="`cat /tmp/max_ip`"

Then selected IP will be examined hour-by-hour to locate patterns in behavior:



ssh $server "cat /path-to-logs/localhost_access_log.$date.log" | awk -v "SEARCH=$IP" '
{ sub(/:[0-9][0-9]:[0-9][0-9]$/, "", $4); TIME=$4; s[TIME]; }
$0 ~ SEARCH { s[TIME]++;}
END { for(a in s) { print a, s[a] } }
' $* | sort

Sample results with included requested URLs:



Number of calls on 2013-05-22, server: tm2

3 192.168.4.101 /path/nPVR getseries
3 192.168.4.101 /path/reminder get
3 192.168.4.101 /path/rentedItems
3 192.168.4.101 /path/stbProperties get
3 192.168.4.101 /path/subscriberInfo
3 192.168.4.140 /path/autoConfig
6 192.168.4.249 /path/authenticate
6 192.168.4.249 /path/favorite get
6 192.168.4.249 /path/rentedItems
6 192.168.4.249 /path/stbProperties get
7 192.168.4.249 /path/reminder get
7 192.168.4.249 /path/subscriberInfo
8 192.168.4.140 /path/authenticate
8 192.168.4.249 /path/nPVR get
8 192.168.4.249 /path/nPVR getseries
16 192.168.4.254 /path/subscriberInfo
25 192.168.4.254 /path/nPVR get
25 192.168.4.254 /path/nPVR getseries
83 192.168.4.254 /path/favorite get
98 192.168.4.254 /path/reminder get

192.168.4.254 activity per hour:

[22/May/2013:00
[22/May/2013:01
[22/May/2013:02
[22/May/2013:03
[22/May/2013:04
[22/May/2013:05
[22/May/2013:06
[22/May/2013:07
[22/May/2013:08
[22/May/2013:09
[22/May/2013:10
[22/May/2013:11
[22/May/2013:12 8
[22/May/2013:13 4
[22/May/2013:14 16
[22/May/2013:15 18
[22/May/2013:16 12
[22/May/2013:17 50
[22/May/2013:18 24
[22/May/2013:19 24
[22/May/2013:20 24
[22/May/2013:21 24
[22/May/2013:22 24
[22/May/2013:23 22

We can see that 192.168.4.254 device started to spam server 24 times per hour from 18:00 until midnight (imagine thousands of devices that synchronously send requests to back-end the same way). This device logs will be thoroughly examined for possible reason for that behaviour.

At Aplikacja.info, we believe that software problems must be located as early as possible (even during development process). In above example we re-use existing development environment to catch problems that might be discovered in production with huge user base.

The general idea is to move detection and neutralisation of a problem into this direction: 1. end user -> 2. customer's QA -> 3. development company's QA -> 4. developer (runtime) -> 5. developer (build time) The more distance from end user a bug is detected and fixed the better for us. I hope you would love to squash problems the same way.

Happy hunting!

Tags: linux, monitoring.

Google.com performance analysis over few years

Wed, 22 May 2013 22:19:02 +0000

Time is money. No doubt about that. The more time your customer waits for your service response the less coins may hit your pocket. You may ask: why? If your page loading is too slow your "almost" customer hits back in his browser and selects another link from sponsored links Google provides him. You loose this customer in favour of another supplier.

What about Google? Was their service (no kidding, it is a sevrice that searches information, paid by ads) always so blazingly fast? Let's check this site performance monitored from London server over few years:

As we can see there were some small problems in 2009 (~5% of requests required more than 1 second to be processed). Not so bad.

In order to double check our measurement results there are monitor result from Newark, USA:

As you can see there were also (small) performance problems in 2009, in some months ~2.5% of requests were processed in more than 10 seconds. Still pretty damn good.

You can see current stats for google here, thanks to site-uptime.net - site uptime monitor.

Tags: google.

HTTP(S) exchange analysis using Wireshark

Wed, 01 May 2013 11:33:09 +0000

Wireshark is a tool that allows to scan network packets and make analysis of network connection without direct access to server or client. Today we will show simple method to analyse TCP connections using this tool.

TCP connection is composed of many IP packets, connected by common strem index number. You can select particular TCP stream using Analyze / Follow TCP stream option or directly select given stream by it's index: tcp.stream eq 9 If you want track every opened connection you can check 1st packet of every TCP stream opened to particular server IP (213.75.34.114 in our example): tcp.flags.syn==1 and tcp.flags.ack==0 and ip.dst == 213.75.34.114 Note that with HTTP/1.1 things may be more complicated as this protocol supports "Persistent/Keep Alive" mode that allows multiple requests over one connection, so you may see only one packet with "tcp.flags.syn==1 and tcp.flags.ack==0". In order to scan full exchange you have to analyse protocol contents for request / response pairs.

Another complication is HTTPS (HTTP over SSL layer) - you won't be able even to count requests (if using "Keep Alive" mode). In this scenario you have to check traffic after HTTPS node or just inspect server logs.

Tags: networking, ssl.

Dariusz on Software

Methods and Tools

About This Site

Archive

Entries from May 2013.

Tags