[recipe] simple pipeline example
This short article is based on the author's response to a request I submitted.
json is a great tool for filtering and tweaking JSON but sometimes you want to use some of the other unix tools for your input. The most obvious one is sort. sort is powerful and efficient but, like many unix tools, only works with single line records. To use it with JSON records we can use the decorate-sort-undecorate pattern.
So the first task is to convert your JSON into decorated single line records. The decorations are the keys to be sorted on and the payload is the original JSON squeezed onto a single line. We need the output to be lines like:
k1 k2 ... { .. JSON .. }
where the keys k1 etc are extracted or derived from the individual records. We run this through sort and then use something like cut to remove the keys and we have a new stream of sorted JSON records.
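The sort and undecorate steps can be tried on their own without the json tool. This is only a sketch: the three decorated lines below are made-up data standing in for what the decorate step would emit, with TAB between the fields, and `cut -f 3-` is used as a portable way to drop the first two fields.

```shell
# Hypothetical decorated records: key1 <TAB> key2 <TAB> JSON payload.
records=$(printf '%s\t%s\t%s\n' \
  30 b '{"id": 3, "name": "third record"}' \
  10 a '{"id": 1, "name": "first record"}' \
  20 c '{"id": 2, "name": "second record"}')

# Sort numerically on the first key, then drop both keys (fields 1 and 2).
sorted=$(printf '%s\n' "$records" | sort -k1,1n | cut -f 3-)
printf '%s\n' "$sorted"
```

What comes out is the JSON payloads alone, one per line, ordered by the first key.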
Extracting the keys is simple; we just use the Lookups feature - http://trentm.com/json/#FEATURE-Lookups. To get the payload we JSONify the record, store it in the record (in an unused member!) and then use Lookup to pull it out again.
I have a daily task that needs to process RADIUS records in chronological order but they are split across several files so I do this:
gen-days-records |
json -ga -d "$(printf '\t')" -e 'this.self = JSON.stringify(this);this.ord = this["Acct-Status-Type"][2];' Timestamp ord self |
sort -k1n -k2r |
cut -f 1,2 --complement
I've added 2 fields to the original record. self is the JSON of the original input record and ord is derived from one of the fields. Timestamp and ord are the keys I want to sort on. The cut removes the sort keys leaving a sorted JSON stream.
My application likes to get JSON one line at a time but if you just want to view the output you could pipe the final stage into json again to pretty print it.
Incidentally, that's a TAB character in the -d argument because that works better for cut and some other programs; cut splits fields on TAB by default.
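A small sketch of why the delimiter matters, using one made-up decorated line whose payload contains spaces (the User-Name value is purely illustrative):

```shell
# One hypothetical decorated record; the JSON payload contains spaces.
line=$(printf '%s\t%s\t%s' 1700000000 a '{"User-Name": "alice smith"}')

# TAB-delimited (cut's default): fields 3 onward are exactly the payload.
by_tab=$(printf '%s\n' "$line" | cut -f 3-)

# Space-delimited: the only spaces are inside the payload, so it gets cut apart.
by_space=$(printf '%s\n' "$line" | cut -d ' ' -f 3-)

printf 'tab:   %s\nspace: %s\n' "$by_tab" "$by_space"
```

With TAB the payload survives intact; with a space delimiter only the tail of the JSON is left.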
Sorting can lead to other things like finding unique entries. Here's me checking to see if a session-id is ever re-used:
cat-all-records |
json -ga0 -e 'delete this.Timestamp;' -c 'this["Acct-Status-Type"] == "Start"' |
sort -u |
json -ga Acct-Session-Id |
sort |
uniq -d
The first call to json prints a single line for every Start record (minus the Timestamp). Assuming the fields in the record always occur in the same order, the rest of the pipeline will print any session-ids that occur in records with different parameters (e.g. username or host). Note that uniq -d only reports repeats on adjacent lines, so the ids need to be in sorted order when they reach it.
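The same idea can be sketched without the json tool by working on decorated lines instead. This is a variation, not the pipeline above: the records are made up, each line is assumed to be already reduced to session-id, TAB, then the JSON with the Timestamp removed, and cut stands in for the second json call.

```shell
# Hypothetical Start records: session-id <TAB> JSON without Timestamp.
records=$(printf '%s\t%s\n' \
  s1 '{"User-Name": "alice", "NAS-IP-Address": "10.0.0.1"}' \
  s2 '{"User-Name": "bob", "NAS-IP-Address": "10.0.0.1"}' \
  s1 '{"User-Name": "alice", "NAS-IP-Address": "10.0.0.1"}' \
  s3 '{"User-Name": "carol", "NAS-IP-Address": "10.0.0.2"}' \
  s2 '{"User-Name": "dave", "NAS-IP-Address": "10.0.0.1"}')

# sort -u collapses identical lines; an id that still appears twice
# must belong to records that differ somewhere in the payload.
# The second sort makes equal ids adjacent, which uniq -d requires.
reused=$(printf '%s\n' "$records" | sort -u | cut -f 1 | sort | uniq -d)
printf '%s\n' "$reused"
```

Here s1 occurs twice with identical parameters and is collapsed, while s2 occurs with two different user names and is reported.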
Another example:
cat-all-records |
json -ga -e "{this['Framed-IP-Address'] = this['Framed-IP-Address'] ? 'IP' : 'none';}" Acct-Session-Id Timestamp Framed-IP-Address |
sort -s -k1,1 -k2,2n |
awk '{ if ($1 == prevk && $3 == "none" && prevv == "IP") {print} prevk = $1; prevv = $3;}'
No need for JSON in the output this time. This example modifies the fields before extracting them. The task here is to group the records by session-id and time showing whether there was a value for the IP field or not so I can see if that value ever gets "lost". The awk script reduces to a one liner.
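The sort and awk stages can be checked on a few made-up lines in the same three-column shape (session-id, timestamp, IP/none). This is a sketch with invented data; the variable names prev_id and prev_ip are mine, and the awk program is written in pattern-action form rather than with an explicit if.

```shell
# Hypothetical extracted fields: session-id, timestamp, IP-or-none.
records='s2 200 IP
s1 300 none
s3 100 IP
s1 100 IP
s3 200 IP
s2 100 none
s1 200 IP
s3 150 none'

# Group by session-id, order by time within each session, then print any
# record that has no IP when the previous record of the same session had one.
lost=$(printf '%s\n' "$records" |
  sort -s -k1,1 -k2,2n |
  awk '$1 == prev_id && $3 == "none" && prev_ip == "IP" { print }
       { prev_id = $1; prev_ip = $3 }')
printf '%s\n' "$lost"
```

Sessions s1 and s3 each lose their IP at some point and are reported; s2 starts without one, so it is not.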
- When sorting the decorated output it's probably better to be explicit about which keys to sort on so the JSON itself is not considered. Putting the json field last seems the easiest to me.
- All the examples assume streaming input. I imagine it would work for a single nested JSON object (e.g. an array) but you'd probably have to reconstruct the delimiters(?)
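- The JSON output can have spaces so be careful about programs that simply split fields on whitespace. However it can't have literal TABs (or newlines) when squeezed onto one line, because JSON.stringify escapes them inside strings as \t and \n.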
- If you are using the record's JSON as a payload, make sure you store it after you've removed or added fields.
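As a sketch of the first note, sort can be told exactly where each key starts and stops. The records below are made up, with a hypothetical seq member so the input order is visible in the output; -t, -k and -s are standard sort options in GNU and BSD sort.

```shell
tab=$(printf '\t')

# Hypothetical decorated records: Timestamp <TAB> ord <TAB> JSON payload.
records=$(printf '%s\t%s\t%s\n' \
  200 a '{"seq": 1}' \
  100 o '{"seq": 2}' \
  100 t '{"seq": 3}' \
  100 o '{"seq": 0}')

# -t sets the field separator; -k1,1n and -k2,2r end each key at its own
# field; -s keeps input order for equal keys, so the JSON is never compared.
sorted=$(printf '%s\n' "$records" | sort -s -t "$tab" -k1,1n -k2,2r | cut -f 3-)
printf '%s\n' "$sorted"
```

The two records with identical keys (seq 2 and seq 0) stay in their input order rather than being ordered by their JSON text.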