This topic has been brought up numerous times before, but I wanted to take a shot at another solution, because I think there is still room for improvement, and I wanted a simple example to experiment with some new VB9 functionality.
Here are a bunch of links to the most recent information on the problem and a bunch of potential solutions. By spelunking through these, you ought to be able to find a solution in any language you like, and every possible combination of algorithms.
Probably the simplest example of sorting above is the following Python program which originally came from a comment posted to Ned Batchelder’s blog above…
Ian Griffiths was even able to more or less duplicate this solution in C# 3 by writing a couple of general purpose utility functions. After a little cleanup, it looks like this…
One problem with Ian’s solution is that it doesn’t do an in-place sort, which makes it not quite a fair comparison with the original. Also, OrderBy uses deferred execution, so you must be careful when comparing the performance that you force the sort to happen (e.g. Call List.GetEnumerator().GetNext())
Of course, if it’s fair to write a missing EnumerableComparer class, then why not just implement a NaturalComparer, which would make the resulting code even simpler…
lst.OrderBy(s => s, NaturalComparer.CurrentCultureIgnoreCase);
This seems to give you the best solution with the added flexibility of working with OrderBy and Sort. (Interestingly, OrderBy often seems to be slightly faster than Sort, which seems strange considering that Sort doesn’t have to allocate an entirely new structure to contain the results.)
Before I show you my implementation for NaturalComparer, here are some problems found in most of the previous solutions I’ve seen.
- They often rely on converting sequences of numeric characters to integers or some other numeric type. This is completely unnecessary, and if you think about what must be going on behind the scenes, fairly costly. Maybe I’ll post an article on how to convert strings to integers so that you can get an appreciation for how much work is going on behind that seemingly innocent call to int.Parse.
- Even if converting to Integers were a good idea, it wouldn’t be advisable to throw and catch exceptions repeatedly as part of the algorithm. This can be fixed by using int.TryParse in the code above, but there is no point since we don’t want to go down that road at all because…
- Converting to Integers doesn’t work for larger sequences of Strings. For example, try calling ToInt(new String(‘1’, 500) and see what happens.
- Many solutions use a split() function to divide each string into a list of strings that are either numbers or alphabetic strings. This is completely unnecessary, and wasteful. For example, if I want to compare “1a2a3a…Na” with “2a3a…Ma”, then this (silly, stupid, wasteful, inefficient, …) algorithm would allocate two lists of size N and M, whereas a reasonable algorithm would realize the string1 is less than string2 after comparing the first character.
- Many solutions also tend to use regular expressions to do the split(). Although, this may be convenient, it’s not a very efficient approach. I’ve seen some solutions that use a simple lookup table for the ‘0’-‘9’ to help somewhat, but the split() approach seems like a dead end anyway.
- By the way, if you were to use a Regex here, then many platforms (including .NET) let you compile it to get better performance. When I compare performance later, I’ll use a modified version of Ian’s code that uses a compiled Regex and int.TryParse.
- Most of the solutions I’ve seen ignore all culture issues with string comparison. Considering that the most likely usages are for sorting lists of user visible strings this seems pretty shortsighted.
- Many solutions are also case sensitive, which is also pretty fundamentally wrong. Any correct NaturalSorting solution should at least have the option to sort case-insensitively.
- Another requirement for NaturalSorting is to support ignoring non-alphanumeric characters. Typically you want “a ” = “a”, but “a b” != “ab”. In general terms any character that is not a number or letter should only act as a separator. Many solutions, including the python above, ignore this requirement.
- A general class of problems with almost every solution is the side effect of creating lots of stuff during comparison. Some solutions dynamically create temporary strings, lists of strings, numbers, or even more complex objects. Sometimes this creation is somewhat hidden in a call to string.substring on a platform with immutable strings. Some authors try to use something like StringBuilder to help, but this does not address the real issue. NaturalCompare can probably be implemented without any need for dynamic allocation at all.
- Another class of problems is algorithms that tend to do the same thing twice. One example is using a regex to divide an input string into a list of strings and numbers, and then having to use another pass to check whether each on is a number again. It would be better to record the fact of whether it’s a string or number on the first pass so you don’t have to do the same work again. However, I confess that my own algorithm often has to make two passes through each section of a substring, once to search for a delimiter, and once to do the comparison.
The following is the Compare function from my solution to the problem. You can download the full source from http://cid-ae9441bae91063cc.skydrive.live.com/embedrow.aspx/Public/Utilities.zip.
Public Function Compare(ByVal x As String, ByVal y As String) As Integer Implements IComparer(Of String).Compare If x Is Nothing AndAlso y Is Nothing Then Return 0 End If If x Is Nothing Then Return -1 End If If y Is Nothing Then Return 1 End If ' step through the logical characters of the two strings, where a logical character is ' either the unicode Char or a run of numeric Chars. ' All numeric characters are treated as less than any other character ' All non-alphanumerics are ignored with respect to sorting. Dim xpos, ypos As Integer Do While xpos < x.Length AndAlso ypos < y.Length ' First find the next logical characters in x and y xpos = FindFirstIndexOf(x, xpos, myIsLetOrDig) ypos = FindFirstIndexOf(y, ypos, myIsLetOrDig) If xpos = -1 AndAlso ypos = -1 Then Return 0 ElseIf xpos = -1 Then Return -1 ElseIf ypos = -1 Then Return 1 ElseIf Char.IsNumber(x(xpos)) AndAlso Char.IsNumber(y(ypos)) Then ' Skip leading zeros Dim xtmp = FindNextIndexOf(x, xpos, myIsNonZero) Dim ytmp = FindNextIndexOf(y, ypos, myIsNonZero) ' Find the end of the logical character Dim xend = FindNextIndexOf(x, xpos, myIsNotNum) Dim yend = FindNextIndexOf(y, ypos, myIsNotNum) xpos = If(xtmp = xend, xtmp - 1, xtmp) ypos = If(ytmp = yend, ytmp - 1, ytmp) If xend - xpos < yend - ypos Then Return -1 ElseIf xend - xpos > yend - ypos Then Return 1 Else Dim iy = ypos For ix = xpos To xend - 1 If x(ix) < y(iy) Then Return -1 ElseIf x(ix) > y(iy) Then Return 1 End If iy += 1 Next End If ElseIf Char.IsNumber(x(xpos)) Then Return -1 ElseIf Char.IsNumber(y(ypos)) Then Return 1 Else Dim xend = FindNextIndexOf(x, xpos, myIsNotLet) Dim yend = FindNextIndexOf(y, ypos, myIsNotLet) Dim xlen = xend - xpos Dim ylen = yend - ypos Dim r = myCmpInfo.Compare(x, xpos, xlen, y, ypos, ylen, myCmpOpt) If r <> 0 Then Return r End If xpos = xend - 1 ' -1, because we're about to +1 ypos = yend - 1 End If xpos += 1 ypos += 1 Loop If xpos >= x.Length AndAlso ypos >= y.Length Then Return 0 ElseIf xpos >= x.Length Then Return -1 ElseIf ypos >= y.Length Then Return 1 End If Return 0 End Function
The basic idea is to iterate forward through the two input strings, using the FindXXX functions to find the separators between numbers, letters, and ignored characters. We return from Compare as soon as possible without having to look at any characters following the first difference. No extra allocations are performed. For example, when comparing strings I used the form of CompareInfo.Compare that takes offsets and lengths for the two strings to avoid having to allocate substrings. Originally the code used inline lambda expressions for the arguments to FindXXX, however testing showed a significant performance increase from predefining those functions outside any individual Compare, as they are always the same. For example, here’s the definition for myIsNonZero, and the others are much the same.
Private Shared ReadOnly myIsNonZero As CharPred = Function(c) c <> "0"c
Finally, here’s what FindNextIndexOf looks like…
I manually did a few test comparisons between my solution and an optimized form of Ian’s (compiled regex and int.TryParse) I compared using two lists of strings shuffled randomly. The first list contained the numbers 1-5000, and the second contained alternating strings, numbers, and special characters of the form “string n string n string n” where n was 1-5000.
Optimized IanG, string type one (108-115ms)
Optimized IanG, string type two (291-316ms)
Mine, string type one (16-20ms)
Mine, string type two (158-171ms)
Once again, these numbers aren’t really fair to compare, because my solution handles things like case insensitivity, ignoring non alphanums, culture idiosyncracies, etc. But it’s nice to know that the extra features are more than paid for with the more efficient algorithm.
There’s still room for improvement, so I may make updates from time to time if I need to use this code on a real project.
- The code doesn’t handle floating point numbers or number separators (e.g. ‘,’). If this is added, then it should probably be optional, as it’s not clear you would always want it.
- Although I have some unit tests, I don’t have anything to verify the Culture constraints. For example, I’m trusting Char.IsLetter to work correctly.
- I also assume that numeric characters can be compared by ordinal position. (e.g. ‘0’ < ‘9’)
Feel free to use the code any way you see fit, and please let me know if you find any problems or have any other suggestions.