My current customer develops embedded devices used by many end users. To reduce server load, the devices download data over multicast: every device joins a multicast group using IGMP and listens for UDP packets. With no connections to manage, the overhead is lower.
However, some data (and some requests) cannot be delivered over multicast channels, so HTTP/HTTPS must be used to interact with the server. Because the number of devices is very high, special measures (randomised delays, client software optimisation) are used to avoid overloading the back-end servers.
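One of those countermeasures, randomised delays, boils down to client-side jitter. A minimal sketch in bash (the real clients are embedded software; the `jitter_sleep` helper and the 5-minute window are made up for illustration):

```shell
#!/usr/bin/env bash
# Hypothetical jitter helper: wait a uniformly random number of seconds
# in [0, max) before polling, so thousands of devices spread their
# requests over a window instead of hitting the back-end at the same instant.
jitter_sleep() {
    local max=${1:-300}           # default window: 5 minutes
    local delay=$((RANDOM % max)) # bash RANDOM: uniform enough for load spreading
    echo "sleeping ${delay}s before polling"
    sleep "$delay"
}
```

Each device would call something like `jitter_sleep 300` before its periodic HTTP request.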
Consequently, a small bug in the client software that triggers more requests than usual can endanger the stability of the whole system: multiplied across thousands of devices, it becomes a perfect DDoS that may kill the back-end. To catch such errant behaviour as soon as possible, I developed a daily report that monitors server usage at my customer's office.
First of all, we locate the most "interesting" devices by IP address in the logs (we list the top 20 (IP, URL) pairs by server usage):
ssh $server "cat /path-to-logs/localhost_access_log.$date.log" | awk '
{
t[$1 " " $7]++   # requests per (IP, URL) pair
ip[$1]++         # requests per IP
}
END {
for(a in t) { print t[a], a }
# find the busiest IP and save it for the hour-by-hour step below
max = 0
max_ip = ""
for(a in ip) { if(ip[a] > max) { max=ip[a]; max_ip=a; } }
print max_ip > "/tmp/max_ip"
}
' | sort -n | tail -20
IP=$(cat /tmp/max_ip)
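The same top-talker lookup can also be done without the `/tmp/max_ip` temp file, by counting the first log field (assumed to be the client IP, as above) with `sort | uniq -c`. A sketch:

```shell
# Sketch: print the single busiest client IP from an access log on stdin.
# Assumes field 1 is the client IP, as in the report above.
top_ip() {
    awk '{ print $1 }' | sort | uniq -c | sort -rn | head -1 | awk '{ print $2 }'
}

# Usage (same log source as above):
# IP=$(ssh $server "cat /path-to-logs/localhost_access_log.$date.log" | top_ip)
```

This trades the second awk pass inside one script for a simple pipeline, at the cost of reading the field list twice.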
The selected IP is then examined hour by hour to look for patterns in its behaviour:
ssh $server "cat /path-to-logs/localhost_access_log.$date.log" | awk -v "SEARCH=$IP" '
{ sub(/:[0-9][0-9]:[0-9][0-9]$/, "", $4); TIME=$4; s[TIME]; }   # bucket by hour; empty hours stay visible
$0 ~ SEARCH { s[TIME]++; }                                      # count requests from the selected IP per hour
END { for(a in s) { print a, s[a] } }
' | sort
Sample results, with the requested URLs included:
Number of calls on 2013-05-22, server: tm2
3 192.168.4.101 /path/nPVR getseries
3 192.168.4.101 /path/reminder get
3 192.168.4.101 /path/rentedItems
3 192.168.4.101 /path/stbProperties get
3 192.168.4.101 /path/subscriberInfo
3 192.168.4.140 /path/autoConfig
6 192.168.4.249 /path/authenticate
6 192.168.4.249 /path/favorite get
6 192.168.4.249 /path/rentedItems
6 192.168.4.249 /path/stbProperties get
7 192.168.4.249 /path/reminder get
7 192.168.4.249 /path/subscriberInfo
8 192.168.4.140 /path/authenticate
8 192.168.4.249 /path/nPVR get
8 192.168.4.249 /path/nPVR getseries
16 192.168.4.254 /path/subscriberInfo
25 192.168.4.254 /path/nPVR get
25 192.168.4.254 /path/nPVR getseries
83 192.168.4.254 /path/favorite get
98 192.168.4.254 /path/reminder get
192.168.4.254 activity per hour:
[22/May/2013:00
[22/May/2013:01
[22/May/2013:02
[22/May/2013:03
[22/May/2013:04
[22/May/2013:05
[22/May/2013:06
[22/May/2013:07
[22/May/2013:08
[22/May/2013:09
[22/May/2013:10
[22/May/2013:11
[22/May/2013:12 8
[22/May/2013:13 4
[22/May/2013:14 16
[22/May/2013:15 18
[22/May/2013:16 12
[22/May/2013:17 50
[22/May/2013:18 24
[22/May/2013:19 24
[22/May/2013:20 24
[22/May/2013:21 24
[22/May/2013:22 24
[22/May/2013:23 22
We can see that device 192.168.4.254 started to spam the server 24 times per hour from 18:00 until midnight (now imagine thousands of devices synchronously sending requests to the back-end in the same way). This device's logs will be examined thoroughly to find the reason for that behaviour.
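The manual inspection above could be automated. A sketch that flags every (IP, hour) bucket whose request count exceeds a threshold (the `hourly_alert` name and the default limit of 20 are arbitrary choices for this sketch):

```shell
# Flag (IP, hour) pairs with more than LIMIT requests in the given log.
# Usage: hourly_alert access.log [limit]
hourly_alert() {
    awk -v LIMIT="${2:-20}" '
    {
        sub(/:[0-9][0-9]:[0-9][0-9]$/, "", $4)   # drop :MM:SS, keep [dd/Mon/yyyy:HH
        n[$1 " " $4]++                           # count requests per (IP, hour)
    }
    END {
        for (k in n)
            if (n[k] > LIMIT) print "ALERT:", k, n[k]
    }' "$1"
}
```

Run daily over each server's access log, this would have surfaced 192.168.4.254 without anyone reading the report by hand.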
At Aplikacja.info, we believe that software problems must be located as early as possible, even during the development process. In the example above, we re-used the existing development environment to catch problems that would otherwise surface in production, with a huge user base.
The general idea is to move detection and neutralisation of a problem along this path:
1. end user -> 2. customer's QA -> 3. development company's QA -> 4. developer (runtime) -> 5. developer (build time)
The further from the end user a bug is detected and fixed, the better for us. I hope you will enjoy squashing problems the same way.
Happy hunting!