Command line data crunching with Python

Every time I'm doing some data crunching on the command line, I find myself juggling between sed, awk, sort, uniq, etc. While I like the UNIX way of having one tool doing one thing well, I sometimes find it slightly boring to put all the tools together, sometimes stretching their features a bit too much.

I know that Perl and Ruby support implicit loops and prints through their -n and -p switches. Those switches make it easy to work with data on the command line, but I don't use those languages much anymore, so I always need to look something up in the online manuals before doing anything useful. And I never took the time to learn awk properly; if I had, maybe I wouldn't need all that.

By contrast, I still use Python quite a lot, and it's becoming the de facto standard for data science. Piping data in and out of it on the command line, however, isn't always easy: the -c switch lets you pass a program in as a string, but it's not always obvious whether a character is being interpreted by bash or by the Python interpreter, and Python is whitespace-sensitive, too. So a command line like:

$ python -c 'import sys;for x in sys.stdin:print x'

won't "just work":

  File "<string>", line 1
    import sys;for x in sys.stdin:print x
SyntaxError: invalid syntax
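
The failure comes from Python's grammar rather than from the shell: a compound statement such as for cannot follow a semicolon on the same line. As a small sketch (the helper name compiles is made up for illustration, and the snippets use Python 3 print calls), the built-in compile() can check both forms without running them:

```python
# The one-liner joined with ';' versus the same logic with real newlines.
one_liner = "import sys;for x in sys.stdin:print(x)"
multi_line = "import sys\nfor x in sys.stdin:\n    print(x.strip())"


def compiles(source):
    """Return True if `source` parses as a Python program (hypothetical helper)."""
    try:
        compile(source, "<string>", "exec")
        return True
    except SyntaxError:
        return False


print(compiles(one_liner))   # False: 'for' cannot follow ';'
print(compiles(multi_line))  # True: each statement sits on its own line
```

So the program passed to -c needs actual newline characters in it, not semicolons.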

But bash has a feature for exactly this: in a string quoted as $'...' (ANSI-C quoting), escape sequences such as \n are interpreted, so this works fine:

$ echo -e "hello\nworld\nthis\nis\nme" | python -c $'import sys\nfor x in sys.stdin:\n    print(x.strip())'

I find Python's string manipulation to be great and usually fast enough for not-so-large datasets, so you can do very interesting things and shell out to standard Unix commands only if and when you actually need to. As long as you rely on the standard library only, you're fairly safe on portability, too.
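
As one illustration of staying within the standard library, here is a minimal sketch of counting identical lines, roughly what sort | uniq -c | sort -rn would do. The function name count_lines and the sample data are made up for the example; in a real pipeline you would pass sys.stdin instead of the StringIO stand-in.

```python
import io
from collections import Counter


def count_lines(stream):
    """Count identical lines in a text stream, most frequent first."""
    counts = Counter(line.rstrip("\n") for line in stream)
    return counts.most_common()


# Stand-in for sys.stdin so the sketch runs on its own.
sample = io.StringIO("b\na\nb\nc\nb\na\n")
for value, n in count_lines(sample):
    print(n, value)
# prints:
# 3 b
# 2 a
# 1 c
```

Unlike the shell pipeline, lines with equal counts come out in the order they were first seen, not sorted.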

AN IMPORTANT NOTE: if you're handling non-ASCII data, I suggest you set the PYTHONIOENCODING environment variable, especially if you're using Python 3, since that version decodes input into Unicode strings wherever possible:

$ echo -e "ààà\nworld\nthis\nis\nme" | PYTHONIOENCODING='utf-8' python3 -c $'import sys\nfor x in sys.stdin:\n    print(x.strip())'
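
If you'd rather not depend on the environment, the decoding can also be made explicit inside the program. The following is only a sketch under the assumption that the input is UTF-8; the function name stripped_lines is made up, and the BytesIO object stands in for sys.stdin.buffer, which is the binary stream you would pass in a real Python 3 script.

```python
import io


def stripped_lines(binary_stream, encoding="utf-8"):
    """Decode a binary stream with an explicit encoding and yield stripped lines."""
    text = io.TextIOWrapper(binary_stream, encoding=encoding)
    for line in text:
        yield line.strip()


# Stand-in for sys.stdin.buffer so the sketch runs on its own.
raw = io.BytesIO("ààà\nworld\n".encode("utf-8"))
print(list(stripped_lines(raw)))  # ['ààà', 'world']
```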

Enjoy your command line! And if you want to become a command line data processing guru, I cannot recommend this book enough.
