What constitutes a 'field' for the cut command?
The term "field" is often associated with tools such as cut and awk. A field is similar to a column's worth of data, obtained by splitting each line on a specific character. By default cut splits on a Tab, while awk splits on runs of whitespace (Spaces and Tabs).
However, as is the case with most tools, this is configurable. For example:
- awk = awk -F"," ... would separate by commas (i.e. ,).
- cut = cut -d"," ... would separate by commas (i.e. ,).
Examples
This first one shows how awk automatically splits on spaces.
$ echo "The rain in Spain." | awk '{print $1" "$4}'
The Spain.
This one shows how cut can split on spaces too, once it is told to with -d" ".
$ echo "The rain in Spain." | cut -d" " -f1,4
The Spain.
Here we have a CSV list of column data, and we use cut to return columns 1 & 4.
$ echo "col1,col2,col3,co4" | cut -d"," -f1,4
col1,co4
Awk too can do this:
$ echo "col1,col2,col3,co4" | awk -F"," '{print $1","$4}'
col1,co4
Awk is also a little more adept at dealing with a variety of separator characters. Here it handles Tabs and Spaces intermixed in the same line:
$ echo -e "The\t rain\t\t in Spain." | awk '{print $1" "$4}'
The Spain.
What about the -s switch to cut?
With respect to this switch, it simply tells cut not to print any lines that do not contain the delimiter character specified via the -d switch.
Example
Say we had this file.
$ cat sample.txt
This is a space string.
This is a space and tab string.
Thisstringcontainsneither.
NOTE: There are spaces and tabs in the 2nd string above.
Now we process these strings using cut, with and without the -s switch:
$ cut -d" " -f1-6 sample.txt
This is a space string.
This is a space
Thisstringcontainsneither.
$ cut -d" " -f1-6 -s sample.txt
This is a space string.
This is a space
In the 2nd example you can see that the -s switch has omitted from the output any strings that do not contain the delimiter, Space.
In the POSIX shell, a field is any part of a line delimited by any of the characters in IFS, the "internal field separator" (sometimes called the input field separator). Its default value is a space, followed by a horizontal tab, followed by a newline. With Bash you can run printf '%q\n' "$IFS" to see its value. Note that IFS governs the shell's own word splitting; cut does not consult it.
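As a rough illustration of IFS-based splitting in the shell itself, here is a minimal sketch. The sample text and the colon-separated record are made-up data, and the variable names (default_count, user, pass, uid, home) are only illustrative.

```shell
# Default IFS (space, tab, newline): runs of whitespace act as one separator.
text=$(printf 'The\t rain  in Spain.')
set -- $text            # unquoted on purpose, so the shell performs field splitting
default_count=$#        # The, rain, in, Spain.

# Custom IFS for a single read: split a colon-separated record.
record='alice:x:1000:/home/alice'   # hypothetical sample data
IFS=: read -r user pass uid home <<EOF
$record
EOF

echo "$default_count fields; user=$user uid=$uid home=$home"
# prints: 4 fields; user=alice uid=1000 home=/home/alice
```

Setting IFS only as a prefix to read keeps the change local to that one command, so the rest of the script still sees the default value.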
It depends on the utility in question, but for cut, a "field" starts at the beginning of a line of text and includes everything up to the first tab. The second field runs from the character after the first tab up to the next tab, and so on for the third, fourth, and later fields. In short, a field is everything between tabs, or between start-of-line and tab, or between tab and end-of-line.
That holds unless you specify a field delimiter with the "-d" option: cut -d: -f2 would get you everything between the first and second colon (':') characters.
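A small sketch of the difference between the default tab delimiter and an explicit -d follows. The sample row and the variable names are invented for the illustration.

```shell
row=$(printf 'one two\tthree four')   # hypothetical sample: one tab, two spaces

by_tab=$(printf '%s\n' "$row" | cut -f1)           # default delimiter is a tab
by_space=$(printf '%s\n' "$row" | cut -d' ' -f1)   # a single space delimits instead
by_colon=$(printf 'a:b:c\n' | cut -d: -f2)         # between first and second colon

echo "[$by_tab] [$by_space] [$by_colon]"
# prints: [one two] [one] [b]
```

The same input yields a different first field depending on the delimiter, which is why it pays to state -d explicitly whenever the data is not tab-separated.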
Other utilities have different definitions, but a tab character is a common default. awk is a good fallback if cut is too strict, as awk divides fields on runs of one or more whitespace characters. That's a little more natural in a lot of situations, but you have to know a bit of syntax. To print the second field according to awk:
awk '{print $2}'
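To see where cut is "too strict", consider a line with two consecutive spaces. The sketch below uses a made-up input string; cut counts an empty field between the two spaces, while awk collapses them.

```shell
line='a  b c'    # note the two spaces between a and b

cut_f2=$(printf '%s\n' "$line" | cut -d' ' -f2)     # empty field between the two spaces
cut_f3=$(printf '%s\n' "$line" | cut -d' ' -f3)     # b
awk_f2=$(printf '%s\n' "$line" | awk '{print $2}')  # b

echo "cut f2=[$cut_f2] cut f3=[$cut_f3] awk f2=[$awk_f2]"
# prints: cut f2=[] cut f3=[b] awk f2=[b]
```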
sort is the one that tricks me. My current sort man page says something like "non-blank to blank transition" for a field separator. For some reason it takes a few tries to get sort fields defined correctly. join apparently uses "delimited by whitespace" fields, which is what awk purports to do by default.
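One way to take the guesswork out of sort fields is to name the separator with -t and bound the key with -k. This is only a sketch on invented name:age data, not a general recipe.

```shell
data=$(printf 'carol:30\nalice:25\nbob:35')   # hypothetical name:age records

# -t: makes ':' the field separator; -k2,2n sorts numerically on field 2 only.
by_age=$(printf '%s\n' "$data" | sort -t: -k2,2n)

printf '%s\n' "$by_age"
# prints:
# alice:25
# carol:30
# bob:35
```

Writing the key as -k2,2 rather than -k2 stops it from running on to the end of the line, which is a common source of surprising orderings.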
The moral of the story is to be careful, and experiment if you don't know.