I recently realised that one of my most common tasks is dealing with streams of structured data: perform some task on or with the data, then spit out the results. The traditional UNIX way is line-based processing, piping through awk, sed, grep and Perl one-liners, and storing the results in temporary files that are then fed into other commands. For many scenarios, though, it makes more sense to keep the data structured from end to end.

So I’m currently writing a set of Python libraries for working with streams of structured data. They use a push-based pipeline, in which each stream processor can modify, filter, or add new items to the stream before handing it on to the next processor in the pipeline.
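Roughly, the core idea looks something like this. This is only a simplified sketch of the push-based model, not the actual library code, and the class and method names are illustrative:

# Illustrative sketch only; not the library's actual API.

class Processor:
    """One stage in a push-based pipeline. Items are pushed in from
    upstream; each stage emits zero or more items downstream."""
    def __init__(self):
        self.next = None

    def push(self, item):
        # Default behaviour: pass the item through unchanged.
        self.emit(item)

    def emit(self, item):
        if self.next is not None:
            self.next.push(item)

    def finish(self):
        # Signal end-of-stream to the rest of the pipeline.
        if self.next is not None:
            self.next.finish()


class Filter(Processor):
    """Drops items that fail a predicate."""
    def __init__(self, predicate):
        super().__init__()
        self.predicate = predicate

    def push(self, item):
        if self.predicate(item):
            self.emit(item)


class Transform(Processor):
    """Modifies each item before passing it on."""
    def __init__(self, fn):
        super().__init__()
        self.fn = fn

    def push(self, item):
        self.emit(self.fn(item))


def pipeline(*stages):
    """Chain processors together and return the head of the pipeline."""
    for upstream, downstream in zip(stages, stages[1:]):
        upstream.next = downstream
    return stages[0]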

Perhaps some command line examples would show the idea best:

$ pipeline in:hosts filter:location=~ec2 parallel:16 task:ping filter:+success ssh:"uptime -a" out:host_uptimes
$ pipeline blogposts:all filter:tags=~tech task:blogpost2html concat:-
$ pipeline in:shares task:get_latest_share_price task:analyse_share_value csv:share,value,profit

All of this can also be done programmatically with relative ease.
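For example, reusing the sketch above (again purely as an illustration, not the real API), the blog-post example might look roughly like this in code:

# Illustrative usage of the sketch above; names are hypothetical.

class Printer(Processor):
    """Terminal stage: print each item as it arrives."""
    def push(self, item):
        print(item)


head = pipeline(
    Filter(lambda post: "tech" in post.get("tags", [])),
    Transform(lambda post: dict(post, html="<p>%s</p>" % post["body"])),
    Printer(),
)

for post in [{"title": "Pipelines", "tags": ["tech"], "body": "..."},
             {"title": "Holiday", "tags": ["travel"], "body": "..."}]:
    head.push(post)
head.finish()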

With the command line tool, if no output processor is given, a default one is used which saves the structured data under the name “last” in a default “database”.
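In spirit, that default output is no more than something along these lines (shelve is only a stand-in here, not necessarily the actual default “database”):

# Illustrative sketch of a fallback output stage; storage backend is assumed.
import shelve

class DefaultOutput(Processor):
    """Collect the whole stream and save it under the key "last"
    in a simple on-disk database."""
    def __init__(self, path="pipeline.db"):
        super().__init__()
        self.path = path
        self.items = []

    def push(self, item):
        self.items.append(item)

    def finish(self):
        db = shelve.open(self.path)
        try:
            db["last"] = self.items
        finally:
            db.close()
        super().finish()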

— by Robert Thomson, created 3rd Apr, 2013, last modified 3rd Apr, 2013 | Tags: Tech