Entries from February 2012.

Linux: How To Locate Duplicated Files Quickly

Locate duplicates quickly: by "quickly" I mean a size+filename check only, not an expensive MD5 sum computation:

find . -type f -printf "%f:%s:%p\n" | \
    awk -F: '
        {
            key=$1 " " $2;
            occur[key]++;
            loc[key]=loc[key] $3 " "
        }
        END {
            for(key in occur) {
                if(occur[key]>1) {
                    print key ": " loc[key]
                }
            }
        }
    ' | sort

A bit of explanation of the above magic:

  • printf: tells the find command to output file metadata (name and size) in addition to the file path (the default); this metadata is used later
  • -F:: file paths may contain spaces, so a colon is used as the field separator instead of whitespace
  • key=$1 " " $2: the file name (without directory) combined with the file size forms an ID for the file
  • occur: table mapping a file ID to its number of occurrences
  • loc: maps a file ID to the list of locations where it was found
  • occur[key]>1: we only want to show files that have duplicates
  • sort: results are sorted alphabetically for easier navigation
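When the size+name heuristic reports candidates, you may want to confirm them by content before deleting anything. A minimal sketch using md5sum (GNU uniq's -w/-D options are assumed; the demo files are invented for illustration):

```shell
#!/bin/sh
# Sketch: confirm duplicate candidates by checksum (exact, but slower).
# Demo data: two identical files and one unique file in a temp dir.
dir=$(mktemp -d)
mkdir "$dir/a" "$dir/b"
echo "same content" > "$dir/a/data.txt"
echo "same content" > "$dir/b/data.txt"
echo "other content" > "$dir/a/unique.txt"

# Hash every regular file, sort by hash, print only repeated hashes
# (md5sum output starts with a 32-character hash, hence -w32).
dups=$(find "$dir" -type f -exec md5sum {} + | sort | uniq -w32 -D)
echo "$dups"

rm -r "$dir"
```

This prints only the two data.txt paths (identical hashes); unique.txt is filtered out.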

Web2py Lighttpd Deployment

Web2py is a "full stack" Python web framework; Lighttpd is a fast, lightweight HTTP server. I'll present a method to run a web2py-based application under lighttpd.

I assume the following setup is already done:

  • A domain named "myapp.com" is configured to point to your server
  • Python and lighttpd are already installed on the server
  • Your web2py installation is placed under /var/www/web2py
  • Your web2py installation has an application named "myapproot" configured

First of all, you have to configure lighttpd to locate the web2py application. Create the file /etc/lighttpd/conf-enabled/myapp.conf:

$HTTP["host"] =~ "(www\.)?myapp\.com" {
    server.indexfiles = ( "/myapproot" )
    server.document-root = "/var/www/web2py"
    server.dir-listing = "disable"
    fastcgi.server = (
        ".fcgi" => ("localhost" => (
            "check-local" => "disable",
            "min-procs" => 1,
            "max-procs" => 2,
            "socket" => "/tmp/myapp.sock")
        )
    )

    url.rewrite-once = (
        "^/$" => "/myapproot",
        "^(/.+?/static/.+)$" => "/applications$1",
        "(^|/.*)$" => "/fcgihandler.fcgi$1",
    )
    $HTTP["url"] !~ "^(/myapproot|/fcgihandler.fcgi|/applications/myapproot/static/)" {
        url.access-deny = ("")
    }
}

Explanation:

  • (www\.)?myapp\.com: regular expression matching the domain with or without the "www." prefix
  • server.indexfiles: specifies the relative URL to serve when only the domain is given
  • server.document-root: specifies the filesystem location of the web2py installation
  • server.dir-listing: we do not want users to list our files over HTTP
  • fastcgi.server: defines the FastCGI backend (socket location, number of processes)
  • url.rewrite-once: allows elegant (short) URLs
  • url.access-deny: everything other than the static directory and the FastCGI handler should be forbidden (security)

Then you have to configure the fcgihandler.fcgi script properly:

(...)
fcgi.WSGIServer(application, bindAddress='/tmp/myapp.sock').run()

Note that /tmp/myapp.sock must be the same as specified in lighttpd configuration.

Then you have to start the fcgihandler.fcgi process and ensure it will start on every boot. That's all.
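One simple way (among many; an init script or a process supervisor would work just as well) to start the handler on every boot is a cron @reboot entry. The file name, user and paths below are examples only; adjust them to your setup:

```
# /etc/cron.d/web2py-fcgi -- example; adjust the user and paths
@reboot www-data cd /var/www/web2py && /usr/bin/python fcgihandler.fcgi
```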

Large C++ Project Build Time Optimisation

When a project grows past a certain code size, you start to observe the following sequence:

  1. Developer creates and tests a feature
  2. Before submitting commit to repository update/fetch/sync is done
  3. Developer builds the project again to check that the build / basic functionality is not broken
  4. Smoke tests
  5. Submit

During step 3 you hear "damn, slow rebuild!". One discovers that synchronization with the repository forces a rebuild of 20% of the files in the project (and that takes time when the project is really huge). Why?

The answer here is: header dependencies. Some header files are included (directly or indirectly) by many source files, and that's why so many files have to be rebuilt. You have the following options:

  • Skip build dependencies and pray the resulting build is stable / works at all
  • Reduce header dependencies

I'll explain the second option.

The first thing to do is to locate the problematic headers. Here's a script that finds the worst offenders:

#!/bin/sh

awk -v F=$1 '
/^# *include/ {
    a=$0; sub(/[^<"]*[<"]/, "", a); sub(/[>"].*/, "", a); uses[a]++;
    f=FILENAME; sub(/.*\//, "", f); incl[a]=incl[a] f " ";
}

function compute_includes(f, located,
arr, n, i, sum) {
    # print "compute_includes(" f ")"
    if (f ~ /\.c/) {
        if (f in located) {
            return 0
        }
        else {
            located[f] = 1
            return 1
        }
    }
    if (!(f in incl)) {
        return 0
    }
    # print f "->" incl[f]
    n = split(incl[f], arr)
    sum = 0
    for (i=1; i<=n; i++) {
        if (f != arr[i]) {
            sum += compute_includes(arr[i], located)
        }
    }
    return sum
}

END {
    for (a in incl) {
        n = compute_includes(a, located)
        if (F) {
            if (F in located && a !~ /^Q/) {
                print n, a
            }
        }
        else {
            if (n && a !~ /^Q/) {
                print n, a
            }
        }
        for (b in located) {
            delete located[b]
        }
    };
}

' `find . -name \*.cpp -o -name \*.h -o -name \*.c` \
| sort -n

Sample output:

266 HiddenChannelsDefinitions.h
266 nmc-hal/hallogger.h
268 favoriteitemdefinitions.h
270 nmc-hal/playback.h
279 pvrsettingsitemdefinitions.h
279 subscriberinfoquerier.h
280 isubscriberinfoquerier.h
286 notset.h
292 asserts.h

As you can see, there are header files that require ~300 source files to be rebuilt after a change. You can start optimisations with those files.

Once you have located the headers to start with, you can use the following techniques:

  • Use forward declaration (class XYZ;) instead of header inclusion (#include "XYZ.h") when possible
  • Split large header files into smaller ones, rearrange includes
  • Use PIMPL to split interfaces from implementations

Monitor Multicast UDP Traffic

Recently I was diagnosing problems with missing multicast data in some system. Multicast sends UDP packets from a single source to many registered subscribers; IPTV, for example, uses it to stream video. For unicast (1-1) transmissions you can use netcat or telnet, but what tool can test multicast data?

The answer was located pretty quickly: the socat tool can subscribe to multicast sources and pipe the output to many destinations (stdout, for example). Let's start with a sample:

socat UDP4-RECVFROM:9875,ip-add-membership=225.2.215.254:0.0.0.0,fork -\
 | grep -a '^s='

Values to change:

  • 225.2.215.254: sample multicast IP address
  • 9875: sample multicast port
  • ^s=: string to locate in the multicast data (in my case the data protocol is textual)

Explanation:

  • UDP4-RECVFROM: receive IPv4 UDP packets
  • ip-add-membership: multicast requires the receiver to register (join the group)
  • 0.0.0.0: accept multicast on any client interface (you can also specify your client's network address here)
  • grep -a: sometimes you need just a subset of the incoming data; in my case textual data was mixed with binary (hence the -a parameter for grep)

8 types of waste in Lean Software Development

Lean methodology is a "production practice that considers the expenditure of resources for any goal other than the creation of value for the end customer to be wasteful, and thus a target for elimination". Lean ideas were brought into the software development field by Mary and Tom Poppendieck. It's interesting how the general idea behind "Lean" (originally applied to manufacturing) can be applied to other fields (software development in this case).

Waste of Over-production

Have you ever developed a module that was never used in the final product? Or end users weren't using some functionality because it was too complicated? You will probably say "yes" here (like me). How can we eliminate unneeded functionality from specs? How can we validate a user interface to ensure a part of a system will be useful before it is actually built?

We have two opposite options here: prototyping with detailed design (yeah, Waterfall!), and iterative functionality delivery (bye-bye fixed-price contracts, welcome estimated iteration deliverables). The former is better for the customer ("my budget is fixed"), the latter for the development shop (lower risk of invalid estimates).

Waste of Defects

The Holy Grail of software development methodologies: to limit the number of defects. There are many approaches here:

  • Test-in quality at the end of the whole process (worst case IMHO)
  • Tests running in parallel with development (waste of repeated tests)
  • Automated system-level tests (may be hard to maintain!)
  • Automated unit-level tests (my choice)
  • Random DBC-based system-level tests (promising ...)

A defect means slower delivery, and it's a waste of resources (time, QA time, ...).

Waste of Inventory

Means "holding inventory (material and information) more than required". This looks like missing continuous integration practices. If you work on a topic for too long, the integration effort might be much bigger than syncing in smaller steps (at minimum once a day?). The only problem with frequent syncing might be hitting a regression (thus slowing work). But we can handle that with automated testing (see "Defects" above).

Waste of Over-Processing

"Processing more than required wherein a simple approach would have done" tells us we should prefer KISS (Keep It Simple, Stupid) over heavy frameworks and complicated (fat) interfaces. In my opinion it is not the completeness of an interface that designates its quality. Better interfaces are the ones that are harder to use in an incorrect way (for example: eliminate the default constructor if it does not make sense).

Waste of Transportation

"Movement of items more than required resulting in wasted efforts and energy and adding to cost", in my field possible wastes are:

  • duplicating bugtrackers within one project - needs syncing in both directions
  • duplicating version control storage (local SVNs + a central Perforce, as a real example) - requires syncing methods
  • always submitting code to more than one branch - is it better to cherry-pick a set of commits from branch A to B, then retest before submit?

Waste of Waiting

If team A is waiting for team B to produce a module that will be used by team A, it's a waste of waiting. The solution:

  • Create draft interfaces
  • Stub implementations from lower layer

Then upper-level modules can be started in parallel with lower-level ones, and the stubs can be eliminated during integration. The interface will emerge as the result of two processes (A's requirements + B's possibilities).

Waste of Motion

How many "proxies" do we have for any information flow? Is there a person who just pushes messages between customer and programmers without adding value? Maybe we can make the flows shorter (they will be faster and more effective).

Waste of Un-utilized People

Looks very similar to "Waste of Waiting". Our work strictly depends on people. We can improve our tools, but people will always be the most important factor in software development activities. Avoid "push"-style work item assignment; prefer "pull"-style.