CLI tools for manipulating text.
In the spirit of small and focused command-line programs that comprise
the (*)nix shell,
these utilities provide functions that I often find myself
cobbling together on-the-fly with awk and sed.
Laziness drove me to what you find here.
Select columns from a file.
Here is some test data:
> cat data
2000 AUG $3,698.14 -$1,109.44
2000 SEP $870.96 -$261.29
2001 AUG $1,676.56 -$502.97
2001 AUG $940.80 -$282.24
2001 SEP $2,070.10 -$621.03
Select some columns:
> cat data | ucol 2 1 2
AUG 2000 AUG
SEP 2000 SEP
AUG 2001 AUG
AUG 2001 AUG
SEP 2001 SEP
Here ucol is used
to extract a few columns
from the data file.
Note that the columns can be specified in any order,
and can be specified more than once.
Negative values can be specified as column numbers; these will index from the right:
> cat data | ucol -1
-$1,109.44
-$261.29
-$502.97
-$282.24
-$621.03
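The indexing rules above can be sketched in a few lines of Python. This is only an illustration of the semantics, not ucol's actual implementation; the function name select_columns is hypothetical, and rejecting column 0 is an assumption.

```python
def select_columns(line, indexes):
    """Return the requested columns of a whitespace-delimited line.

    Indexes are 1-based; negative values count from the right.
    """
    columns = line.split()
    picked = []
    for index in indexes:
        if index == 0:
            # Assumption: 0 is not a valid column number in this sketch.
            raise ValueError("column numbers start at 1 (or -1 from the right)")
        # Positive numbers are 1-based, so shift them to Python's 0-based
        # positions; negative numbers already count from the right in Python.
        position = index - 1 if index > 0 else index
        picked.append(columns[position])
    return " ".join(picked)


row = "2000 AUG $3,698.14 -$1,109.44"
print(select_columns(row, [2, 1, 2]))  # AUG 2000 AUG
print(select_columns(row, [-1]))       # -$1,109.44
```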
A range of column numbers can be specified with a plus sign (+) immediately following a column number; this selects that column and all subsequent columns:
> cat data | ucol 3+
$3,698.14 -$1,109.44
$870.96 -$261.29
$1,676.56 -$502.97
$940.80 -$282.24
$2,070.10 -$621.03
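One way to picture the plus-sign form is as an expansion into a list of plain column numbers. The sketch below assumes positive column numbers only, and the helper name expand_spec is made up for this example; it is not taken from ucol's source.

```python
def expand_spec(spec, column_count):
    """Expand a column spec such as '2' or '3+' into 1-based column numbers."""
    if spec.endswith("+"):
        # 'N+' means column N plus every column after it.
        start = int(spec[:-1])
        return list(range(start, column_count + 1))
    return [int(spec)]


row = "2000 AUG $3,698.14 -$1,109.44".split()
numbers = expand_spec("3+", len(row))
print(numbers)                                # [3, 4]
print(" ".join(row[n - 1] for n in numbers))  # $3,698.14 -$1,109.44
```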
A substring of a column can be specified
with square brackets immediately following a column number.
For example, 2[5,7] will select the 5th, 6th, and 7th characters from the second column.
The brackets contain a starting position (optional, default=1), a comma, and
an ending position (optional, default is the end of the column value). Starting and ending positions can also be specified as negative numbers, which index from the right.
To specify a substring using a negative column number, an underscore (_) must be used instead of a minus sign (-); this avoids ambiguity during option parsing. For example, _2[,3] will grab the first three characters from the second column from the right.
You may have to escape (or quote) the square bracket characters, depending on your shell:
> cat data | ucol '_2[2,-4]'
3,698
870
1,676
940
2,070
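The substring syntax can be modelled with a small parser and a slice. The following is a sketch of the rules as described above, not ucol's real parsing code; parse_spec, substring, and the regular expression are all assumptions made for illustration.

```python
import re

# column number (underscore = counted from the right), then [start,end]
SPEC = re.compile(r"^(_?\d+)\[(-?\d+)?,(-?\d+)?\]$")


def parse_spec(spec):
    """Split a spec like '_2[2,-4]' into (column, start, end)."""
    match = SPEC.match(spec)
    if match is None:
        raise ValueError(f"not a substring spec: {spec!r}")
    column, start, end = match.groups()
    number = -int(column[1:]) if column.startswith("_") else int(column)
    return number, int(start) if start else None, int(end) if end else None


def substring(value, start=None, end=None):
    """Slice with 1-based, inclusive positions; negatives count from the right."""
    first = 1 if start is None else start
    last = len(value) if end is None else end
    # Turn negative positions into the equivalent positive 1-based ones.
    if first < 0:
        first = len(value) + first + 1
    if last < 0:
        last = len(value) + last + 1
    return value[first - 1:last]


column, start, end = parse_spec("_2[2,-4]")
for line in ["2000 AUG $3,698.14 -$1,109.44", "2000 SEP $870.96 -$261.29"]:
    fields = line.split()
    # A negative column number indexes from the right, as in Python.
    field = fields[column] if column < 0 else fields[column - 1]
    print(substring(field, start, end))  # 3,698 then 870
```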
If no columns are specified, all columns will be extracted (1+).
ucol [-dDns] column-numbers [filename]
-dc use 'c' as input column delimiter
--delimiter (default whitespace)
-Dc use 'c' as output column delimiter
--output-delimiter (default space)
--csv parse lines as csv
--un-comma remove commas and/or leading dollar sign ($) from numbers
--to-json output as json (list of dict) using first row as keys
--to-sc output as sc (spreadsheet calculator) format (enables --un-comma)
-n allow null columns
--null-columns
Normally, when multiple column delimiters are
encountered in sequence, they are treated as a single
delimiter. If null columns are allowed, each column
delimiter starts a new column, and sequential delimiters
indicate zero-length columns.
--no-strip don't strip leading and trailing delimiters from line
(default=False, in other words, strip happens by default)
-s handle errors strictly
--strict
If a line is encountered that doesn't have enough columns
to satisfy the command, it is skipped. If the strict flag
is set, this condition will cause the program to stop.
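The difference between the default splitting and -n (null columns) can be shown with a short sketch. It assumes a single-character delimiter and the default stripping of leading and trailing delimiters; split_columns is a hypothetical helper, not part of ucol.

```python
def split_columns(line, delimiter, null_columns=False):
    """Split a line on a one-character delimiter.

    By default, runs of delimiters count as one; with null_columns,
    every delimiter starts a new (possibly empty) column.
    """
    # Default behaviour: leading and trailing delimiters are stripped first.
    stripped = line.strip(delimiter)
    parts = stripped.split(delimiter)
    if null_columns:
        return parts
    # Dropping empty strings collapses sequential delimiters.
    return [part for part in parts if part]


print(split_columns("2000,,AUG,", ","))                     # ['2000', 'AUG']
print(split_columns("2000,,AUG,", ",", null_columns=True))  # ['2000', '', 'AUG']
```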
Sum columns in a file.
Here is some test data:
> cat data
2000 AUG $3,698.14 -$1,109.44
2000 SEP $870.96 -$261.29
2001 AUG $1,676.56 -$502.97
2001 AUG $940.80 -$282.24
2001 SEP $2,070.10 -$621.03
Sum by the first two columns (think of SQL's GROUP BY):
> cat data | usum 1 2
2000 AUG 3698.14 -1109.44
2000 SEP 870.96 -261.29
2001 AUG 2617.36 -785.21
2001 SEP 2070.1 -621.03
Notice that the 2001 AUG records are combined. Notice also that the "$" and ","
characters are ignored (effectively stripped from the columns). The value with the most decimal places in a column
sets the precision for all of the sums of that column. This
avoids floating-point rounding artifacts; without it, the summed line above
would be 2001 AUG 2617.3599999999997 -785.21.
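The grouping and precision rules can be sketched roughly as follows. This is my reading of the behavior described above rather than usum's actual code: the names parse and group_sum are hypothetical, groups are emitted in first-seen order (an assumption), and every line is assumed to have the same number of columns.

```python
def parse(token):
    """Return (value, decimal_places); non-numeric tokens count as (0.0, 0)."""
    cleaned = token.replace("$", "").replace(",", "")
    try:
        value = float(cleaned)
    except ValueError:
        return 0.0, 0
    return value, len(cleaned.partition(".")[2])


def group_sum(lines, key_columns):
    """Sum the non-key columns of each line, grouped by the 1-based key columns."""
    sums = {}    # key tuple -> running sums (dicts keep insertion order)
    places = {}  # value-column position -> most decimal places seen
    for line in lines:
        columns = line.split()
        key = tuple(columns[k - 1] for k in key_columns)
        rest = [c for i, c in enumerate(columns, start=1) if i not in key_columns]
        totals = sums.setdefault(key, [0.0] * len(rest))
        for position, token in enumerate(rest):
            value, digits = parse(token)
            totals[position] += value
            places[position] = max(places.get(position, 0), digits)
    # Round each sum to the precision of its column to hide float noise.
    return [
        list(key) + [round(total, places[i]) for i, total in enumerate(totals)]
        for key, totals in sums.items()
    ]


data = [
    "2000 AUG $3,698.14 -$1,109.44",
    "2000 SEP $870.96 -$261.29",
    "2001 AUG $1,676.56 -$502.97",
    "2001 AUG $940.80 -$282.24",
    "2001 SEP $2,070.10 -$621.03",
]
for row in group_sum(data, [1, 2]):
    print(*row)
```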
The specified order of the columns is preserved:
> cat data | usum 2 1
AUG 2000 3698.14 -1109.44
SEP 2000 870.96 -261.29
AUG 2001 2617.36 -785.21
SEP 2001 2070.1 -621.03
Add a count for each row:
> cat data | usum 1 2 -c
2000 AUG 1 3698.14 -1109.44
2000 SEP 1 870.96 -261.29
2001 AUG 2 2617.36 -785.21
2001 SEP 1 2070.1 -621.03
The count of rows matching the group-by columns is inserted between the column values and the sums.
Sum by just one column:
> cat data | usum 1
2000 0 4569.1 -1370.73
2001 0 4687.46 -1406.24
Notice that the non-numeric values in the second column are treated as zero.
Combine with ucol:
> cat data | ucol 2 3 4
AUG $3,698.14 -$1,109.44
SEP $870.96 -$261.29
AUG $1,676.56 -$502.97
AUG $940.80 -$282.24
SEP $2,070.10 -$621.03
> cat data | ucol 2 3 4 | usum 1
AUG 6315.5 -1894.65
SEP 2941.06 -882.32
Sum each column into a single line (group by zero/nothing):
> cat data | usum 0
10003 0 9256.56 -2776.97
Sum all the numbers in a file by not providing a column number:
> cat data | usum
16482.59
All the numeric tokens in the file are summed.
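For comparison, the same total can be computed in Python. This sketch uses the standard decimal module to sidestep floating-point noise, which is a choice made for the example and not necessarily how usum does it; sum_tokens is a hypothetical name.

```python
from decimal import Decimal, InvalidOperation


def sum_tokens(text):
    """Add up every whitespace-separated token that parses as a number."""
    total = Decimal(0)
    for token in text.split():
        cleaned = token.replace("$", "").replace(",", "")
        try:
            value = Decimal(cleaned)
        except InvalidOperation:
            continue  # non-numeric tokens such as AUG contribute nothing
        if value.is_finite():  # ignore 'NaN' / 'Infinity' look-alikes
            total += value
    return total


data = """2000 AUG $3,698.14 -$1,109.44
2000 SEP $870.96 -$261.29
2001 AUG $1,676.56 -$502.97
2001 AUG $940.80 -$282.24
2001 SEP $2,070.10 -$621.03"""
print(sum_tokens(data))  # 16482.59
```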
usum [-h] [--delimiter DELIMITER] [--count] [--strict] [groupby ...]
--delimiter, -d DELIMITER
input/output column delimiter, default=' '
--count, -c
add count of items included in sum for each output line
-s handle errors strictly
--strict
If a line is encountered that doesn't have the right number
of columns to satisfy the command, it is skipped. If a column
to be summed is not a numerical value, it is treated as zero.
If a column to be summed contains "$" or ",", these characters
are ignored.
If the strict flag is set, any of these conditions will
cause the program to stop.
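The lenient and strict treatment of a single value might look something like this. It is a sketch of the rules listed above under the assumption that "stop" can be modelled as raising an exception; numeric_value is a hypothetical helper, not usum's API.

```python
def numeric_value(token, strict=False):
    """Convert one column value to a float following the rules above."""
    if strict:
        # Strict: "$", "," and non-numeric text are all errors, so the
        # raw token must parse as-is (float raises ValueError otherwise).
        return float(token)
    cleaned = token.replace("$", "").replace(",", "")
    try:
        return float(cleaned)
    except ValueError:
        return 0.0  # lenient: non-numeric values are treated as zero


print(numeric_value("$3,698.14"))  # 3698.14
print(numeric_value("AUG"))        # 0.0
```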
Format text into paragraphs.
Here is some test data:
> cat test_data.txt
When in the Course of human events, it
becomes necessary for one people to
dissolve the political bands which have
connected them with another, and to
assume among the powers of the earth,
the separate and equal station to which
the Laws of Nature and of Nature's God
entitle them, a decent respect to the
opinions of mankind requires that they
should declare the causes which impel
them to the separation.
Format the test data into paragraphs of lines not exceeding 80 characters:
> upar < test_data.txt
When in the Course of human events, it becomes necessary for one people to
dissolve the political bands which have connected them with another, and to
assume among the powers of the earth, the separate and equal station to which
the Laws of Nature and of Nature's God entitle them, a decent respect to the
opinions of mankind requires that they should declare the causes which impel
them to the separation.
Format the test data into paragraphs of lines not exceeding 60 characters:
> upar -l60 < test_data.txt
When in the Course of human events, it becomes necessary for
one people to dissolve the political bands which have
connected them with another, and to assume among the powers
of the earth, the separate and equal station to which the
Laws of Nature and of Nature's God entitle them, a decent
respect to the opinions of mankind requires that they should
declare the causes which impel them to the separation.
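Python's standard textwrap module does a similar kind of greedy re-wrapping, which makes it handy for illustrating what -l controls. The sketch below is not upar's implementation; reflow is a hypothetical name.

```python
import textwrap


def reflow(text, length=80):
    """Join all lines into one paragraph and wrap it at `length` characters."""
    # Collapse every run of whitespace (including newlines) to one space.
    words = " ".join(text.split())
    return textwrap.fill(words, width=length)


sample = """When in the Course of human events, it
becomes necessary for one people to
dissolve the political bands which have
connected them with another, and to
assume among the powers of the earth,
the separate and equal station to which
the Laws of Nature and of Nature's God
entitle them, a decent respect to the
opinions of mankind requires that they
should declare the causes which impel
them to the separation."""
print(reflow(sample, length=60))
```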
Add an indent:
> upar -i5 < test_data.txt
     When in the Course of human events, it becomes necessary for one people to
     dissolve the political bands which have connected them with another, and to
     assume among the powers of the earth, the separate and equal station to
     which the Laws of Nature and of Nature's God entitle them, a decent respect
     to the opinions of mankind requires that they should declare the causes
     which impel them to the separation.
Multi-paragraph:
> upar -i5 -l 60 < test_data_2.txt
     When in the Course of human events, it becomes
     necessary for one people to dissolve the political
     bands which have connected them with another, and to
     assume among the powers of the earth, the separate and
     equal station to which the Laws of Nature and of
     Nature's God entitle them, a decent respect to the
     opinions of mankind requires that they should declare
     the causes which impel them to the separation.

     We hold these truths to be self-evident, that all men
     are created equal, that they are endowed by their
     Creator with certain unalienable Rights, that among
     these are Life, Liberty and the pursuit of Happiness.
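Handling several paragraphs with an indent can be sketched by wrapping each paragraph separately. This assumes paragraphs are separated by blank lines, which the text above does not spell out; reflow_paragraphs is a hypothetical name and the code is not upar's own.

```python
import re
import textwrap


def reflow_paragraphs(text, length=80, indent=0):
    """Wrap each blank-line-separated paragraph, indenting every line."""
    prefix = " " * indent
    # Assumption: one or more blank lines separate paragraphs.
    paragraphs = [p for p in re.split(r"\n\s*\n", text) if p.strip()]
    wrapped = [
        textwrap.fill(
            " ".join(p.split()),
            width=length,  # the indent counts toward the line length
            initial_indent=prefix,
            subsequent_indent=prefix,
        )
        for p in paragraphs
    ]
    return "\n\n".join(wrapped)


sample = (
    "When in the Course of human events, it\n"
    "becomes necessary for one people to\n"
    "dissolve the political bands.\n"
    "\n"
    "We hold these truths to be self-evident,\n"
    "that all men are created equal.\n"
)
print(reflow_paragraphs(sample, length=60, indent=5))
```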
If indent is not specified, then the indent of the
first line is used:
> cat test_data_3.txt
We hold these truths to be self-evident,
that all men are created equal,
that they are endowed by their Creator with certain unalienable Rights,
that among these are Life,
Liberty and the pursuit of Happiness.
> upar < test_data_3.txt
We hold these truths to be self-evident, that all men are created equal,
that they are endowed by their Creator with certain unalienable Rights, that
among these are Life, Liberty and the pursuit of Happiness.
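Inheriting the indent from the first line could be done along these lines. The sample input, with a four-space indent on its first line, is hypothetical, as is the function name reflow_keep_indent; this is a sketch of the idea rather than upar's code.

```python
import textwrap


def reflow_keep_indent(text, length=80, indent=None):
    """Wrap text; if no indent is given, reuse the first line's indent."""
    if indent is None:
        first_line = text.splitlines()[0] if text else ""
        # Count the leading spaces of the first line only.
        indent = len(first_line) - len(first_line.lstrip(" "))
    prefix = " " * indent
    return textwrap.fill(
        " ".join(text.split()),
        width=length,
        initial_indent=prefix,
        subsequent_indent=prefix,
    )


# Hypothetical input: only the first line is indented.
sample = (
    "    We hold these truths to be self-evident,\n"
    "that all men are created equal,\n"
    "that they are endowed by their Creator with certain unalienable Rights,\n"
)
print(reflow_keep_indent(sample))
```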
From command mode in vi, the next 10 lines can be formatted
into a paragraph with this command:
:.,+9!upar -l75
This is helpful for cleaning up comment blocks or formatting simple text files.
upar [-li]
-ln use 'n' as the max output line length
--length (default 80)
-in indent lines by 'n', included in line length
--indent (default indent of first line)
- clone the repo
- run pip install . from the repo's top level