Fun with csplit

When you need to split a text file by lines or columns there are plenty of ways to accomplish that. But what if you need to split a file by lines, where each record consists of multiple lines, and the number of lines in each record is not fixed?
If each record contains a predictable string that you can use as a record separator, you should consider csplit.

I wanted to monitor system load with thread-level detail using top in batch mode, so I ran the command top -H -b -n2 -d10, where:

-H ... display threads
-b ... batch mode (write data sequentially instead of updating screen)
-n ... number of iterations
-d ... delay between iterations

This command produces two records like the ones below; the line count of each record depends on the number of processes and threads currently running.

top - 17:51:35 up 9 days, 22:33,  1 user,  load average: 0.00, 0.03, 0.05
Tasks:  71 total,   1 running,  70 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.0 us,  6.2 sy,  0.0 ni, 93.8 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem :  1015012 total,   558032 free,    81248 used,   375732 buff/cache
KiB Swap:  2097148 total,  2097148 free,        0 used.   722184 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND
    1 root      20   0  125304   3784   2588 S  0.0  0.4   0:09.73 systemd
... arbitrary lines of process details come here!

top - 17:51:45 up 9 days, 22:33,  1 user,  load average: 0.00, 0.03, 0.05
Tasks:  71 total,   3 running,  68 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem :  1015012 total,   557916 free,    81364 used,   375732 buff/cache
KiB Swap:  2097148 total,  2097148 free,        0 used.   722072 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND
   91 root      20   0       0      0      0 R  0.1  0.0   0:27.07 kworker/0:2
    1 root      20   0  125304   3784   2588 S  0.0  0.4   0:09.73 systemd
... arbitrary lines of process details come here!

The first iteration of top is printed immediately, while the next one is delayed by 10 seconds as specified by the argument -d10 (to obtain CPU metrics averaged over 10 seconds, which is the information I actually wanted). So I needed to discard the first record and log only the second one. Because you can't know in advance how many lines the first record has, you cannot simply use head or tail to cut it off.
Fortunately, there is csplit from the coreutils package!
As each record produced by top starts with the string top - , it is easy to define a multi-line record separator using a regular expression. csplit can perform two major “pattern actions” on the input file:
either copy or skip up to a matching line (an optional +/- line OFFSET fine-tunes how many context lines are copied or ignored).

/REGEXP/[OFFSET]
copy up to but not including a matching line

%REGEXP%[OFFSET]
skip to, but not including a matching line
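
To make the difference between the two pattern types concrete, here is a minimal sketch on a made-up input. It assumes GNU csplit; the file name sample.txt, the REC marker and the prefixes copy-, skip- and offset- are hypothetical and chosen only for illustration.

```shell
# Scratch directory so the generated pieces are easy to find and delete.
tmp=$(mktemp -d)

# Hypothetical input: one preamble line, then two records starting with "REC".
printf '%s\n' 'preamble' 'REC 1' 'a' 'REC 2' 'b' 'c' > "$tmp/sample.txt"

# /REGEXP/ copies the lines before the first match into copy-00
# and the rest of the input into copy-01.
csplit --silent --prefix="$tmp/copy-" "$tmp/sample.txt" '/^REC/'

# %REGEXP% skips the lines before the first match without writing them,
# so only skip-00 is created, starting at the matching line.
csplit --silent --prefix="$tmp/skip-" "$tmp/sample.txt" '%^REC%'

# An OFFSET of +1 moves the cut one line past the match,
# so the matching line itself is skipped as well.
csplit --silent --prefix="$tmp/offset-" "$tmp/sample.txt" '%^REC%+1'

head "$tmp"/copy-* "$tmp"/skip-* "$tmp"/offset-*
```

In both cases the matching line becomes the first line of the next piece, which is what makes the patterns usable as record separators.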

So, coming back to my example, I needed to drop the first record, and print the second one. I accomplished that using this csplit command:

top -H -b -n2 -d10 | csplit --silent --prefix='csplit-' - '%^top - %' '{1}'

csplit will create one output file, csplit-00, containing only the second record.

Explanation: My pattern consists of 2 parts:

  1. %^top - % - Read and skip any lines up to the record separator top - , which matches right away at line 1, the first line of the first record.
  2. {1} - Repeat the previous action once: read and skip the first record until top - matches again, at the start of the second record. Then csplit writes the rest of the input to the output file.
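
If you want to try the pattern without waiting for top, the following sketch feeds csplit a shortened, made-up stand-in for the top output. The fake_top function and its abbreviated lines are assumptions for illustration only; the csplit arguments are the same as in the command above, apart from a prefix that points into a scratch directory. GNU csplit is assumed.

```shell
tmp=$(mktemp -d)

# Hypothetical stand-in for `top -H -b -n2 -d10`: two records of different
# lengths, each starting with a "top - " header line.
fake_top() {
  printf '%s\n' \
    'top - 17:51:35 up 9 days' \
    '    1 root systemd' \
    '' \
    'top - 17:51:45 up 9 days' \
    '   91 root kworker/0:2' \
    '    1 root systemd'
}

# Skip to the first "top - " line, then repeat once to skip to the second.
# Whatever is left over is written to the only output file, csplit-00.
fake_top | csplit --silent --prefix="$tmp/csplit-" - '%^top - %' '{1}'

cat "$tmp/csplit-00"
```

The output file should start with the 17:51:45 header, showing that the first record was dropped regardless of its length.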
Updated: 2019-04-16