Comparing Strings Using Natural Ordering

This topic has been brought up numerous times before, but I wanted to take a shot at another solution, because I think there is still room for improvement, and I wanted a simple example to experiment with some new VB9 functionality.

Here are a bunch of links to the most recent information on the problem and a bunch of potential solutions. By spelunking through these, you ought to be able to find a solution in any language you like, and every possible combination of algorithms.

http://www.codinghorror.com/blog/archives/001018.html

http://www.davekoelle.com/alphanum.html

http://nedbatchelder.com/blog/200712/human_sorting.html

http://www.interact-sw.co.uk/iangblog/2007/12/13/natural-sorting

http://www.codeproject.com/KB/string/NaturalComparer.aspx

Probably the simplest example of sorting above is the following Python program which originally came from a comment posted to Ned Batchelder’s blog above…

   1: def natural_sort(lst):
   2:   to_int = lambda text: int(text) if text.isdigit() else text
   3:   alphanum_key = lambda key: [ to_int(c) for c in re.split('([0-9]+)', key) ]
   4:   lst.sort( key=alphanum_key )

Ian Griffiths was even able to more or less duplicate this solution in C# 3 by writing a couple of general purpose utility functions. After a little cleanup, it looks like this…

   1: static IOrderedEnumerable<string> NaturalSort(List<string> lst) {
   2:     Func<string, object> ToInt = s => {
   3:         try {
   4:             return int.Parse(s);
   5:         } catch {
   6:         }
   7:         return s;
   8:     };
   9:     return lst.OrderBy(s => Regex.Split(s.Replace(" ", ""), "([0-9]+)")
  10:         .Select(ToInt), new EnumerableComparer<object>());
  11: }

One problem with Ian’s solution is that it doesn’t do an in-place sort, which makes it not quite a fair comparison with the original. Also, OrderBy uses deferred execution, so you must be careful when comparing the performance that you force the sort to happen (e.g. Call List.GetEnumerator().GetNext())

Of course, if it’s fair to write a missing EnumerableComparer class, then why not just implement a NaturalComparer, which would make the resulting code even simpler…

lst.Sort(NaturalComparer.CurrentCultureIgnoreCase);

or even…

lst.OrderBy(s => s, NaturalComparer.CurrentCultureIgnoreCase);

This seems to give you the best solution with the added flexibility of working with OrderBy and Sort. (Interestingly, OrderBy often seems to be slightly faster than Sort, which seems strange considering that Sort doesn’t have to allocate an entirely new structure to contain the results.)

Before I show you my implementation for NaturalComparer, here are some problems found in most of the previous solutions I’ve seen.

  • They often rely on converting sequences of numeric characters to integers or some other numeric type. This is completely unnecessary, and if you think about what must be going on behind the scenes, fairly costly. Maybe I’ll post an article on how to convert strings to integers so that you can get an appreciation for how much work is going on behind that seemingly innocent call to int.Parse.
  • Even if converting to Integers were a good idea, it wouldn’t be advisable to throw and catch exceptions repeatedly as part of the algorithm. This can be fixed by using int.TryParse in the code above, but there is no point since we don’t want to go down that road at all because…
  • Converting to Integers doesn’t work for larger sequences of Strings. For example, try calling ToInt(new String(‘1’, 500) and see what happens.
  • Many solutions use a split() function to divide each string into a list of strings that are either numbers or alphabetic strings. This is completely unnecessary, and wasteful. For example, if I want to compare “1a2a3a…Na” with “2a3a…Ma”, then this (silly, stupid, wasteful, inefficient, …) algorithm would allocate two lists of size N and M, whereas a reasonable algorithm would realize the string1 is less than string2 after comparing the first character.
  • Many solutions also tend to use regular expressions to do the split(). Although, this may be convenient, it’s not a very efficient approach. I’ve seen some solutions that use a simple lookup table for the ‘0’-‘9’ to help somewhat, but the split() approach seems like a dead end anyway.
  • By the way, if you were to use a Regex here, then many platforms (including .NET) let you compile it to get better performance. When I compare performance later, I’ll use a modified version of Ian’s code that uses a compiled Regex and int.TryParse.
  • Most of the solutions I’ve seen ignore all culture issues with string comparison. Considering that the most likely usages are for sorting lists of user visible strings this seems pretty shortsighted.
  • Many solutions are also case sensitive, which is also pretty fundamentally wrong. Any correct NaturalSorting solution should at least have the option to sort case-insensitively.
  • Another requirement for NaturalSorting is to support ignoring non-alphanumeric characters. Typically you want “a ” = “a”, but “a b” != “ab”. In general terms any character that is not a number or letter should only act as a separator. Many solutions, including the python above, ignore this requirement.
  • A general class of problems with almost every solution is the side effect of creating lots of stuff during comparison. Some solutions dynamically create temporary strings, lists of strings, numbers, or even more complex objects. Sometimes this creation is somewhat hidden in a call to string.substring on a platform with immutable strings. Some authors try to use something like StringBuilder to help, but this does not address the real issue. NaturalCompare can probably be implemented without any need for dynamic allocation at all.
  • Another class of problems is algorithms that tend to do the same thing twice. One example is using a regex to divide an input string into a list of strings and numbers, and then having to use another pass to check whether each on is a number again. It would be better to record the fact of whether it’s a string or number on the first pass so you don’t have to do the same work again. However, I confess that my own algorithm often has to make two passes through each section of a substring, once to search for a delimiter, and once to do the comparison.

The following is the Compare function from my solution to the problem. You can download the full source from http://cid-ae9441bae91063cc.skydrive.live.com/embedrow.aspx/Public/Utilities.zip.

	Public Function Compare(ByVal x As String, ByVal y As String) As Integer Implements IComparer(Of String).Compare
		If x Is Nothing AndAlso y Is Nothing Then
			Return 0
		End If
		If x Is Nothing Then
			Return -1
		End If
		If y Is Nothing Then
			Return 1
		End If
		' step through the logical characters of the two strings, where a logical character is 
		' either the unicode Char or a run of numeric Chars. 
		' All numeric characters are treated as less than any other character
		' All non-alphanumerics are ignored with respect to sorting. 
		Dim xpos, ypos As Integer
		Do While xpos < x.Length AndAlso ypos < y.Length
			' First find the next logical characters in x and y
			xpos = FindFirstIndexOf(x, xpos, myIsLetOrDig)
			ypos = FindFirstIndexOf(y, ypos, myIsLetOrDig)
			If xpos = -1 AndAlso ypos = -1 Then
				Return 0
			ElseIf xpos = -1 Then
				Return -1
			ElseIf ypos = -1 Then
				Return 1
			ElseIf Char.IsNumber(x(xpos)) AndAlso Char.IsNumber(y(ypos)) Then
				' Skip leading zeros
				Dim xtmp = FindNextIndexOf(x, xpos, myIsNonZero)
				Dim ytmp = FindNextIndexOf(y, ypos, myIsNonZero)

				' Find the end of the logical character
				Dim xend = FindNextIndexOf(x, xpos, myIsNotNum)
				Dim yend = FindNextIndexOf(y, ypos, myIsNotNum)

				xpos = If(xtmp = xend, xtmp - 1, xtmp)
				ypos = If(ytmp = yend, ytmp - 1, ytmp)

				If xend - xpos < yend - ypos Then
					Return -1
				ElseIf xend - xpos > yend - ypos Then
					Return 1
				Else
					Dim iy = ypos
					For ix = xpos To xend - 1
						If x(ix) < y(iy) Then
							Return -1
						ElseIf x(ix) > y(iy) Then
							Return 1
						End If
						iy += 1
					Next
				End If
			ElseIf Char.IsNumber(x(xpos)) Then
				Return -1
			ElseIf Char.IsNumber(y(ypos)) Then
				Return 1
			Else
				Dim xend = FindNextIndexOf(x, xpos, myIsNotLet)
				Dim yend = FindNextIndexOf(y, ypos, myIsNotLet)
                Dim xlen = xend - xpos
                Dim ylen = yend - ypos
                Dim r = myCmpInfo.Compare(x, xpos, xlen, y, ypos, ylen, myCmpOpt)
                If r <> 0 Then
                    Return r
                End If
				xpos = xend - 1	' -1, because we're about to +1
				ypos = yend - 1
			End If
			xpos += 1
			ypos += 1
		Loop
		If xpos >= x.Length AndAlso ypos >= y.Length Then
			Return 0
		ElseIf xpos >= x.Length Then
			Return -1
		ElseIf ypos >= y.Length Then
			Return 1
		End If
		Return 0
	End Function

The basic idea is to iterate forward through the two input strings, using the FindXXX functions to find the separators between numbers, letters, and ignored characters. We return from Compare as soon as possible without having to look at any characters following the first difference. No extra allocations are performed. For example, when comparing strings I used the form of CompareInfo.Compare that takes offsets and lengths for the two strings to avoid having to allocate substrings. Originally the code used inline lambda expressions for the arguments to FindXXX, however testing showed a significant performance increase from predefining those functions outside any individual Compare, as they are always the same. For example, here’s the definition for myIsNonZero, and the others are much the same.

      Private Shared ReadOnly myIsNonZero As CharPred = Function(c) c <> "0"c

Finally, here’s what FindNextIndexOf looks like…

   1: Public Delegate Function CharPred(ByVal c As Char) As Boolean
   2:
   3: Public Function FindNextIndexOf(ByVal s As String, _
   4:     ByVal start As Integer, ByVal p As CharPred) As Integer
   5:
   6:     If s Is Nothing OrElse s.Length = 0 Then
   7:         Return s.Length
   8:     End If
   9:     For i = start To s.Length - 1
  10:         Dim c = s(i)
  11:         If p(c) Then
  12:             Return i
  13:         End If
  14:     Next
  15:     Return s.Length
  16: End Function

Some Numbers

I manually did a few test comparisons between my solution and an optimized form of Ian’s (compiled regex and int.TryParse) I compared using two lists of strings shuffled randomly. The first list contained the numbers 1-5000, and the second contained alternating strings, numbers, and special characters of the form “string n string n string n” where n was 1-5000.

Optimized IanG, string type one (108-115ms)

Optimized IanG, string type two (291-316ms)

Mine, string type one (16-20ms)

Mine, string type two (158-171ms)

Once again, these numbers aren’t really fair to compare, because my solution handles things like case insensitivity, ignoring non alphanums, culture idiosyncracies, etc. But it’s nice to know that the extra features are more than paid for with the more efficient algorithm.

There’s still room for improvement, so I may make updates from time to time if I need to use this code on a real project.

  • The code doesn’t handle floating point numbers or number separators (e.g. ‘,’). If this is added, then it should probably be optional, as it’s not clear you would always want it.
  • Although I have some unit tests, I don’t have anything to verify the Culture constraints. For example, I’m trusting Char.IsLetter to work correctly.
  • I also assume that numeric characters can be compared by ordinal position. (e.g. ‘0’ < ‘9’)

Feel free to use the code any way you see fit, and please let me know if you find any problems or have any other suggestions.

 

 

 

 

 

Advertisements
This entry was posted in Uncategorized. Bookmark the permalink.

One Response to Comparing Strings Using Natural Ordering

  1. O2William says:

    This post is a few years old now, but I thought I’d point something out for anybody who stumbles across it as I did. NaturalComparer seem to work very well, but it contains a bug that occurs when you use it to compare two purely alphabetical strings where the first string being compared is longer than the second.

    The bug is in these lines:


    Dim l = xend - xpos
    Dim r = myCmpInfo.Compare(x, xpos, l, y, ypos, l, myCmpOpt)

    Consider this example: you’re comparing the string “abc” to the string “ab”.

    When this this code executes: x = “abc”, y = “ab”, xpos = 0, ypos = 0. l will be calculated to 3, the length of string x.

    Compare() will try to pull a substring of length 3 from string y, which has a length of 2. This will cause an ArgumentOutOfRangeException.

    You can fix this by calculating an “l” value for both strings, x and y:


    Dim lx = xend - xpos
    Dim ly = yend - yos
    Dim r = myCmpInfo.Compare(x, xpos, lx, y, ypos, ly, myCmpOpt)

    Incidentally, I suggest not using the character “l” as a variable name if you’re going to share your code online. While reviewing the code I kept misreading the letter l as the number one, and it really frustrated my debugging efforts for a while. 🙂

    Thanks for making this available.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s