This post briefly explains the implementation details of unix sort’s numeric sort option (–numeric-sort) by providing a Regex pattern for matching String into a numeric value format.

In one of the module I took this semester, we were asked to implement a Unix Shell as a course project. We need to implement several shell applications in Java, and one of the task, which is quite interesting, is about implementing the “sort by numeric value” (sort -n).

When I consult the man sort in the shell and look for more details, the manual just provide another one-line inforamtion:

-n, –numeric-sort: compare according to string numerical value

But what exactly is the meaning of string numerical value? Are we trying to parse each string into a raw numeric value (like type Float) and anything that fail to be parsed are considered as normal String? What is the order between special chars, characters and numbers using this string numerical value? Surprisingly, Google doesn’t provide any useful information this time about how the “numeric sort” is implemented. But luckily, GNU docs do provide us a more detailed description about how numeric sort work:

Sort numerically. The number begins each line and consists of optional blanks, an optional ‘-’ sign, and zero or more digits possibly separated by thousands separators, optionally followed by a decimal-point character and zero or more digits. An empty number is treated as ‘0’. The LC_NUMERIC locale specifies the decimal-point character and thousands separator. By default a blank is a space or a tab, but the LC_CTYPE locale can change this.
Comparison is exact; there is no rounding error.
Neither a leading ‘+’ nor exponential notation is recognized. To compare such strings numerically, use the –general-numeric-sort (-g) option.

Basically, every string now represents a “number”, and this “number” must follow the following rules:

  • It begins with optional blanks, “-“ sign (negative value), followed by…
  • Zero or more digits, followed by…
  • optional decimal part: a decimal point character and zero or more digits

Any string that violated the above rules are considered as 0. In addition, if two string has the same numeric value, order them by their ascii order.

The implementation logic is actually quite similar to the famous atoi function in C language, which parse a string into an integer value. For each string to be sorted, we first try to match it to an valid numeric value by the above-mentioned rules. If the string format doesn’t follow the rule, we treat it as a numeric value of 0.

In order to match the pattern, I created a simplified Regular Expression pattern for the above rules:

1
^\\s*(-?\\d+(\\.\\d+)?)

Below, a simplified function for parsing a string into its numeric value are given:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
/**
* Try to parse the input string for a numeric value. If failed to parse a
* valid numeric value, 0 is return by default.
*
* e.g. "1" => 1 " 1 " => 1 "-1.a" => -1 "1.1a" => 1.1 "12 34" => 12 "abc"
* => 0 ".123" => 0
*
* @param input
* Not null. The input String
* @return the parsed numeric value.
*/
public static double numericValueOf(String input) {
Matcher m = Pattern.compile("^\\s*(-?\\d+(\\.\\d+)?)").matcher(input);
if (m.find()) {
return Double.valueOf(m.group(0));
} else {
return 0;
}
}

Personally I think the sort manual are a bit too simple. It doesn’t explain clearly how exactly sort by numeric value work. It could be much more helpful if more examples could be shown somewhere in the manual to let reader understand it better.