One would hope that a simple task like sorting would be relatively unambiguous. Unfortunately, it isn't. The behavior of sort can be very puzzling. I'll try to straighten out some of the confusion -- at the same time, I'll be leaving myself open to abuse by the real sort experts. I hope you appreciate this! Seriously, though: if you know of any other wrinkles to the story, please let us know and we'll add them in the next edition.
The trouble with sort is figuring out where one field ends and another begins. It's simplest if you can specify an explicit field delimiter (Section 22.3). This makes it easy to tell where fields end and begin. But by default, sort uses whitespace characters (tabs and spaces) to separate fields, and the rules for interpreting whitespace field delimiters are unfortunately complicated. As I see them, they are:
The first whitespace character you encounter is a "field delimiter"; it marks the end of the old field and the beginning of the next field.
Any whitespace character following a field delimiter is part of the new field. That is, if you have two or more whitespace characters in a row, the first one is used as a field delimiter and isn't sorted. The remainder are sorted, as part of the next field.
Every field has at least one nonwhitespace character, unless you're at the end of the line. (That is, null fields only occur when you've reached the end of a line.)
Not all whitespace is equal. Sorting is done according to the ASCII collating sequence, in which TAB (character 9) precedes space (character 32). Therefore, TABs are sorted before spaces.
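The last rule is easy to check for yourself. Here is a minimal sketch, assuming a POSIX shell and a sort that honors LC_ALL=C (which asks for plain byte-order comparison; other locales may collate differently). The two sample lines are made up for the illustration:

```shell
# Two lines that differ in their leading whitespace character.
tab=$(printf '\t')
sorted=$(printf ' leading space\n\tleading tab\n' | LC_ALL=C sort)

# Rewrite each TAB as ^I so the result is readable on screen.
shown=$(printf '%s\n' "$sorted" | sed "s/$tab/^I/g")
printf '%s\n' "$shown"
```

The line that starts with a TAB should come out first, even though it was second in the input.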
Here is a silly but instructive example that demonstrates most of the hard cases. We'll sort the file sortme, which is:
        apple   Fruit shipment
20      beta    beta test sites
 5              Something or other
All is not as it seems -- cat -t -v (Section 12.5, Section 12.4) shows that the file really looks like this:
^Iapple^IFruit shipment
20^Ibeta^Ibeta test sites
 5^I^ISomething or other
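If you'd like to follow along, here is one way to rebuild the file. It's a sketch, not the only way: I'm assuming your system has mktemp (it is common but not required by POSIX) and a cat that accepts -t and -v. The directory variable name is mine.

```shell
# Recreate the three-line file; \t is a TAB, and line 3 starts with a space.
dir=$(mktemp -d)
printf '\tapple\tFruit shipment\n20\tbeta\tbeta test sites\n 5\t\tSomething or other\n' > "$dir/sortme"

# -t shows TABs as ^I; -v shows other nonprinting characters.
shown=$(cat -t -v "$dir/sortme")
printf '%s\n' "$shown"

# Clean up the scratch directory.
rm -r "$dir"
```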
^I indicates a tab character. Before showing you what sort does with this file, let's break it into fields, being very careful to apply the rules above. In the table, we use quotes to show exactly where each field begins and ends:
| | Field 0 | Field 1 | Field 2 | Field 3 |
|---|---|---|---|---|
| Line 1 | "^Iapple" | "Fruit" | "shipment" | null (no more data) |
| Line 2 | "20" | "beta" | "beta" | "test" |
| Line 3 | " 5" | "^ISomething" | "or" | "other" |
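You can get the same breakdown mechanically. The sketch below applies the rules as I've stated them, not the internals of any particular sort: the first blank after a nonblank character becomes a visible "|" delimiter, and any whitespace after that stays with the next field. It assumes only a POSIX shell and sed.

```shell
tab=$(printf '\t')
data=$(printf '\tapple\tFruit shipment\n20\tbeta\tbeta test sites\n 5\t\tSomething or other\n')

# First sed: replace the first blank after each nonblank character with "|".
# Second sed: show the TABs that survive (they belong to a field) as ^I.
fields=$(printf '%s\n' "$data" | sed "s/\([^ $tab]\)[ $tab]/\1|/g" | sed "s/$tab/^I/g")
printf '%s\n' "$fields"
```

The output has one "|" between fields, so the leading TAB in "^ISomething" and the leading space in " 5" are easy to spot. Line 2 also shows a fifth field, "sites", which the table leaves out.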
OK, now let's try some sort commands; I've added annotations on the right, showing what character the "sort" was based on. First, we'll sort on field zero -- that is, the first field in each line:
% sort sortme                         ...sort on field zero
        apple   Fruit shipment        field 0, first character: TAB
 5              Something or other    field 0, first character: SPACE
20      beta    beta test sites       field 0, first character: 2
As I noted earlier, a TAB precedes a space in the collating sequence. Everything is as expected. Now let's try another, this time sorting on field 1 (the second field):
% sort +1 sortme                      ...sort on field 1
 5              Something or other    field 1, first character: TAB
        apple   Fruit shipment        field 1, first character: F
20      beta    beta test sites       field 1, first character: b
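A note on syntax: the +1 form counts fields from zero, and some newer versions of sort no longer accept it. POSIX spells the same request -k 2, counting fields from one. Here is a sketch of the equivalent command, assuming a sort with -k, a cat with -t and -v, and mktemp; LC_ALL=C asks for ASCII ordering.

```shell
dir=$(mktemp -d)
printf '\tapple\tFruit shipment\n20\tbeta\tbeta test sites\n 5\t\tSomething or other\n' > "$dir/sortme"

# -k 2 starts the sort key at the second field (field 1 in +N numbering).
by_field1=$(LC_ALL=C sort -k 2 "$dir/sortme" | cat -t -v)
printf '%s\n' "$by_field1"

rm -r "$dir"
```

Implementations differ on whether the delimiting whitespace itself is counted as part of the key, but for this file the order should come out the same either way.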
Again, the initial TAB causes "Something or other" to appear first. "Fruit shipment" precedes "beta" because in the ASCII table, uppercase letters precede lowercase letters. Now, let's sort on the next field:
% sort +2 sortme                      ...sort on field 2
20      beta    beta test sites       field 2, first character: b
 5              Something or other    field 2, first character: o
        apple   Fruit shipment        field 2, first character: s
No surprises here. And finally, sort on field 3 (the "fourth" field):
% sort +3 sortme                      ...sort on field 3
        apple   Fruit shipment        field 3, NULL
 5              Something or other    field 3, first character: o
20      beta    beta test sites       field 3, first character: t
The only surprise here is that the NULL field gets sorted first. That's really no surprise, though: a null field is empty, and an empty string compares before any nonempty one, so we should expect it to come first.
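To watch the null field win, here is the same sort in -k form (field 3 in +N numbering is -k 4). As before, this is a sketch that assumes a sort with -k, a cat with -t and -v, and mktemp:

```shell
dir=$(mktemp -d)
printf '\tapple\tFruit shipment\n20\tbeta\tbeta test sites\n 5\t\tSomething or other\n' > "$dir/sortme"

# Line 1 has only three fields, so its key for -k 4 is empty.
by_field3=$(LC_ALL=C sort -k 4 "$dir/sortme" | cat -t -v)
printf '%s\n' "$by_field3"

rm -r "$dir"
```

The "apple" line, the only one with nothing in the fourth field, should land on top.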
OK, this was a silly example. But it was a difficult one; a casual understanding of what sort "ought to do" won't explain any of these cases, which leads to another point. If someone tells you to sort some terrible mess of a data file, you could be heading for a nightmare. But often, you're not just sorting; you're also designing the data file you want to sort. If you get to design the format for the input data, a little bit of care will save you lots of headaches. If you have a choice, never allow TABs in the file. And be careful of leading spaces; a word with an extra space before it will be sorted before other words. Therefore, use an explicit delimiter between fields (like a colon), or use the -b option (and an explicit sort field), which tells sort to ignore initial whitespace.
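Both pieces of advice are easy to try out. The sample records below are invented for the illustration, and I'm assuming a sort that supports -t, -b, and -k; LC_ALL=C keeps the ordering to plain ASCII.

```shell
# Explicit delimiter: sort colon-separated records on their second field.
by_field=$(printf 'x:pear\ny:apple\nz:fig\n' | LC_ALL=C sort -t : -k 2,2)
printf '%s\n' "$by_field"

# Leading space: without -b, " zebra" sorts ahead of "apple".
plain=$(printf 'apple\n zebra\n' | LC_ALL=C sort)
printf '%s\n' "$plain"

# With -b and an explicit key, leading blanks are skipped when comparing.
ignoring=$(printf 'apple\n zebra\n' | LC_ALL=C sort -b -k 1,1)
printf '%s\n' "$ignoring"
```

With the colon there is never any doubt about where a field starts, so that is the approach I'd pick when I get to design the file.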
-- ML
Copyright © 2003 O'Reilly & Associates. All rights reserved.