Time for an SSD

The time is right to buy an SSD, because they are finally at the price/performance level where it’s just ridiculous to use anything else. Naysayers will claim that it’s still too early to switch, because you can buy a terabyte hard disk for $50 while a decent 60GB SSD costs $200. However, I think this price difference is irrelevant even in the current economy, because the performance difference is enormous and most of us don’t want or need a drive that big. Those that do need more storage (for a PVR?) can use an extra $50 drive in addition to a primary fast drive.

SSD technology is improving so rapidly that there are a lot of sub-par products still available. For example, you could buy an OCZ Apex 60GB drive for $145, and it will likely have similar performance to the more expensive new OCZ Vertex drives in some benchmarks. However, it may also suffer from some stuttering problems common to this generation of drives. For more info see the recent SSD articles at http://www.anandtech.com/storage/.

The best advice is to buy only what you need as the prices are dropping quickly. Common wisdom is that the price is cut in half each year, however I paid $31/GB for a small MemoRight SLC for my laptop 1.5 years ago, and the latest drives I bought are much higher performance at 1/10th the price. No one predicted that the prices would drop this precipitously, and I see they’re still selling my drive for $21/GB even today. A year from now they’ll probably have terabyte DRAM speed drives for $5 at Walgreens. 🙂

The safest bet is probably the Intel X-25M which is available for $630 @ 160GB or about half that price for 80GB. Personally I took a little more risk and bought 4 60GB OCZ Vertex drives for $200 each. I have 2 each set up as RAID0 in my two home computers, and I believe I’m getting much better performance than the Intel drive in most cases, although I’ve had a few headaches. I’ve had to flash firmware updates to the drives to fix bugs and get new features, which erases the drives. I switched to Windows 7 RC while I was at it which has some new features to work better with an SSD.

So what are the benefits?

· SSD is completely silent

· I can read files at ~400MB/s and write at ~300MB/s

· Seek time is ~.1ms

· Applications load instantly and never stutter or freeze.

· Lower power consumption and heat generation

My 2.66Ghz Core 2 Duo NVidia 680Sli machine takes about 40 seconds to start Windows 7 64bit from the point of pushing the power button with half of that time(20.5s) consisting of BIOS stuff. Here’s a picture of a common disk benchmark. Notice that it writes much faster than it reads in many cases. That seems to be an anomaly with this particular motherboard raid controller, and was the same when I was running Vista 32bit. I haven’t noticed it in practice.

My 3.16Ghz Core 2 Duo Intel 975 machine takes 49 seconds to start Windows Vista 32 bit with 16s of that consisting of BIOS stuff (And I hit Esc to skip the memory test.) The same benchmark test on this machine shows much better (and more normal) results.

This SSD thing is also a big deal for software developers. It’s time once again to adjust your perceptions and learn the new physical reality brought on by this change. Stop designing for outdated equipment. Planning your software architecture around the performance characteristics of a mechanical hard disk now makes no more sense than planning to run your software on a Cassette Tape Drive. The rules have changed.  Databases should be re-architected. Persistence should be revisited. Why load your files into in-memory RAM objects when you may get better overall performance leaving them on the disk? Change the way you do things, and your software is likely to be better than legacy solutions. Some companies and people have a hard time coping changing rules, so now’s a great time to challenge the status quo.

Let me close with an analogy.

If a truck were available that can teleport 30 miles at a time, and gets 100,000 mpg on regular gasoline, would you still want to buy a traditional truck? What if you could get a giant dump truck for $500 and the teleporting 100,000 mpg one cost $63,000? What if a sports car version were available for $20,000 that could go 500mph when you didn’t feel like teleporting and still get 100,000mpg? Does it really matter if 5 years from now you could get a 200,000mpg version that could teleport 50 miles at a time for $10,000? City planners, architects, shipping companies, and other affected parties would have to get their act together to take advantage of “physics 2.0” or new governments, communities, and businesses would replace them.

If you are even slightly inconvenienced or annoyed by the speed, noise, stuttering, power consumption, or other problems associated with a legacy mechanical hard disk then a solution is available for as little as $200.

Update 5/20/2009 : Newegg has the G.Skill Falcon 128GB for $309. I think this is identical to the OCZ Vertex 120GB, but ~$70 less.

Posted in Computers and Internet | 9 Comments

My Blog Personality

Try out Typealyzer to find out what your blog says about your personality. Here’s mine…

INTJ – The Scientists

The long-range thinking and individualistic type. They are especially good at looking at almost anything and figuring out a way of improving it – often with a highly creative and imaginative touch. They are intellectually curious and daring, but might be pshysically hesitant to try new things.

The Scientists enjoy theoretical work that allows them to use their strong minds and bold creativity. Since they tend to be so abstract and theoretical in their communication they often have a problem communcating their visions to other people and need to learn patience and use conrete examples. Since they are extremly good at concentrating they often have no trouble working alone.

Posted in Uncategorized | Leave a comment

Quantifying Experience

It’s hard to find good developers. And it’s worse than useless to filter candidates by "experience" as included on a resume. Statements like "10 years Java experience" or "15 years SQL experience" are meaningless. The problem is that you can’t sum up real experience in a single concise number, and if you could it wouldn’t be measured in years.

We could try to come up with a formula, but any attempt to do so makes it clear that "Software Developer" encompasses a very wide variety of actual skills. What I’d really like to see on a resume, or in addition to a resume, is a real list summarizing all the things you’ve actually done, and the lessons you learned while doing them. This includes all of the following…

  1. List each book you’ve read that seems pertinent to software development, and summarize what you learned. Books are your number one resource for accumulating some subset of the experience of others. If you’re not the type to read a book, then I hope you’re able to make up for it in other ways. It’s not likely though. You might hope to get the same benefit from blogs, but the best of them are just going to restate something that’s been said better and more thoroughly in a book.
  2. List each project you have worked on, no matter how small, and what you learned from each one. There’s a lot to be learned from projects big and small. I’ve personally built full products by myself from scratch, and worked on a variety of teams ranging in size from 2 to 20 developers. I’ve even spent some time working on one of hundreds of small teams working together on a truly massive project with something like 10,000 developers spread throughout the country. I’ve learned different lessons from each of these experiences, and I feel it adds up to more than "14 years experience". Many people in our industry get sucked into giant corporations, where they work on some tiny piece of a large maintenance-mode project for years at a time. They tend to slow down over time, cranking out a new SQL report once every month or two, and gradually falling further behind. I’ve nearly been trapped in similar situations myself, so I know you have to be ever-diligent to avoid it.
  3. List the various roles you’ve played in the software development process, and what you learned from each experience. Have you ever lead a team of people? Have you taken a back seat to another tech lead? Have you ever designed a real software product by yourself? Have you written documentation, or even a book? Have you taught courses, mentored junior developers, pair-programmed, designed web pages, designed a non-web GUI, played the part of a non-technical manager, or any of dozens of other roles that developers may experience in their career? What did you learn? Did you have good mentors in your early years, or did you have to learn everything the hard way?
  4. List all the type of processes and tools you’ve experienced in real professional development, not counting hobbies you’ve tinkered with at home (Unless you built something that people actually use.) I’ve built both GUI and server-side applications, both commercial software products and internal applications, used 8 different languages on non-trivial projects, become proficient in 6 different IDE/Editors, etc. I’ve found each of these experiences valuable.
  5. List your major successes and failures. There are always some high points and low points that stand out from the rest. When you look back on projects that were less successful than you’d hoped, then take the time to think hard about what really went wrong, and what you personally could have done better. In your internal dialogue, accept full responsibility for any problems, and don’t lay the blame on incompetent management, lazy coworkers, indecisive customers, or other distractions. Try instead to imagine what you could have done to make the project a perfect success. Maybe you should have tried harder to understand the customer requirements instead of trusting a Systems Analyst. Maybe you should have found a way to work around management incompetence? Maybe you should have devoted more time to mentoring junior developers instead of forcing them through the trial-by-fire that you suffered through? Or maybe you shouldn’t have coddled those junior developers, and let them learn a few things the hard way? Then again maybe you yourself are falling short in some ways. Do you understand the business point of view? Do you over-engineer, under-engineer, fail to test, test too much, write sloppy code, spend too little time designing before coding, or fail in any of thousands of other ways?

The above list probably does give us a fairly good measure of experience. If you can work through each of the above items completely in a fairly terse writing style and finish in less than a month, then you’re probably not very experienced. Looking at this list, I can’t even estimate how long it would take me to thoroughly explore the 5 items.

I’m not sure, but it might be worth working through the exercise for yourself. Just don’t expect anyone else to read it.

My next post, …

Experience Isn’t Everything

Posted in Software Development | 1 Comment

Comparing Strings Using Natural Ordering

This topic has been brought up numerous times before, but I wanted to take a shot at another solution, because I think there is still room for improvement, and I wanted a simple example to experiment with some new VB9 functionality.

Here are a bunch of links to the most recent information on the problem and a bunch of potential solutions. By spelunking through these, you ought to be able to find a solution in any language you like, and every possible combination of algorithms.

http://www.codinghorror.com/blog/archives/001018.html

http://www.davekoelle.com/alphanum.html

http://nedbatchelder.com/blog/200712/human_sorting.html

http://www.interact-sw.co.uk/iangblog/2007/12/13/natural-sorting

http://www.codeproject.com/KB/string/NaturalComparer.aspx

Probably the simplest example of sorting above is the following Python program which originally came from a comment posted to Ned Batchelder’s blog above…

   1: def natural_sort(lst):
   2:   to_int = lambda text: int(text) if text.isdigit() else text
   3:   alphanum_key = lambda key: [ to_int(c) for c in re.split('([0-9]+)', key) ]
   4:   lst.sort( key=alphanum_key )

Ian Griffiths was even able to more or less duplicate this solution in C# 3 by writing a couple of general purpose utility functions. After a little cleanup, it looks like this…

   1: static IOrderedEnumerable<string> NaturalSort(List<string> lst) {
   2:     Func<string, object> ToInt = s => {
   3:         try {
   4:             return int.Parse(s);
   5:         } catch {
   6:         }
   7:         return s;
   8:     };
   9:     return lst.OrderBy(s => Regex.Split(s.Replace(" ", ""), "([0-9]+)")
  10:         .Select(ToInt), new EnumerableComparer<object>());
  11: }

One problem with Ian’s solution is that it doesn’t do an in-place sort, which makes it not quite a fair comparison with the original. Also, OrderBy uses deferred execution, so you must be careful when comparing the performance that you force the sort to happen (e.g. Call List.GetEnumerator().GetNext())

Of course, if it’s fair to write a missing EnumerableComparer class, then why not just implement a NaturalComparer, which would make the resulting code even simpler…

lst.Sort(NaturalComparer.CurrentCultureIgnoreCase);

or even…

lst.OrderBy(s => s, NaturalComparer.CurrentCultureIgnoreCase);

This seems to give you the best solution with the added flexibility of working with OrderBy and Sort. (Interestingly, OrderBy often seems to be slightly faster than Sort, which seems strange considering that Sort doesn’t have to allocate an entirely new structure to contain the results.)

Before I show you my implementation for NaturalComparer, here are some problems found in most of the previous solutions I’ve seen.

  • They often rely on converting sequences of numeric characters to integers or some other numeric type. This is completely unnecessary, and if you think about what must be going on behind the scenes, fairly costly. Maybe I’ll post an article on how to convert strings to integers so that you can get an appreciation for how much work is going on behind that seemingly innocent call to int.Parse.
  • Even if converting to Integers were a good idea, it wouldn’t be advisable to throw and catch exceptions repeatedly as part of the algorithm. This can be fixed by using int.TryParse in the code above, but there is no point since we don’t want to go down that road at all because…
  • Converting to Integers doesn’t work for larger sequences of Strings. For example, try calling ToInt(new String(‘1’, 500) and see what happens.
  • Many solutions use a split() function to divide each string into a list of strings that are either numbers or alphabetic strings. This is completely unnecessary, and wasteful. For example, if I want to compare “1a2a3a…Na” with “2a3a…Ma”, then this (silly, stupid, wasteful, inefficient, …) algorithm would allocate two lists of size N and M, whereas a reasonable algorithm would realize the string1 is less than string2 after comparing the first character.
  • Many solutions also tend to use regular expressions to do the split(). Although, this may be convenient, it’s not a very efficient approach. I’ve seen some solutions that use a simple lookup table for the ‘0’-‘9’ to help somewhat, but the split() approach seems like a dead end anyway.
  • By the way, if you were to use a Regex here, then many platforms (including .NET) let you compile it to get better performance. When I compare performance later, I’ll use a modified version of Ian’s code that uses a compiled Regex and int.TryParse.
  • Most of the solutions I’ve seen ignore all culture issues with string comparison. Considering that the most likely usages are for sorting lists of user visible strings this seems pretty shortsighted.
  • Many solutions are also case sensitive, which is also pretty fundamentally wrong. Any correct NaturalSorting solution should at least have the option to sort case-insensitively.
  • Another requirement for NaturalSorting is to support ignoring non-alphanumeric characters. Typically you want “a ” = “a”, but “a b” != “ab”. In general terms any character that is not a number or letter should only act as a separator. Many solutions, including the python above, ignore this requirement.
  • A general class of problems with almost every solution is the side effect of creating lots of stuff during comparison. Some solutions dynamically create temporary strings, lists of strings, numbers, or even more complex objects. Sometimes this creation is somewhat hidden in a call to string.substring on a platform with immutable strings. Some authors try to use something like StringBuilder to help, but this does not address the real issue. NaturalCompare can probably be implemented without any need for dynamic allocation at all.
  • Another class of problems is algorithms that tend to do the same thing twice. One example is using a regex to divide an input string into a list of strings and numbers, and then having to use another pass to check whether each on is a number again. It would be better to record the fact of whether it’s a string or number on the first pass so you don’t have to do the same work again. However, I confess that my own algorithm often has to make two passes through each section of a substring, once to search for a delimiter, and once to do the comparison.

The following is the Compare function from my solution to the problem. You can download the full source from http://cid-ae9441bae91063cc.skydrive.live.com/embedrow.aspx/Public/Utilities.zip.

	Public Function Compare(ByVal x As String, ByVal y As String) As Integer Implements IComparer(Of String).Compare
		If x Is Nothing AndAlso y Is Nothing Then
			Return 0
		End If
		If x Is Nothing Then
			Return -1
		End If
		If y Is Nothing Then
			Return 1
		End If
		' step through the logical characters of the two strings, where a logical character is 
		' either the unicode Char or a run of numeric Chars. 
		' All numeric characters are treated as less than any other character
		' All non-alphanumerics are ignored with respect to sorting. 
		Dim xpos, ypos As Integer
		Do While xpos < x.Length AndAlso ypos < y.Length
			' First find the next logical characters in x and y
			xpos = FindFirstIndexOf(x, xpos, myIsLetOrDig)
			ypos = FindFirstIndexOf(y, ypos, myIsLetOrDig)
			If xpos = -1 AndAlso ypos = -1 Then
				Return 0
			ElseIf xpos = -1 Then
				Return -1
			ElseIf ypos = -1 Then
				Return 1
			ElseIf Char.IsNumber(x(xpos)) AndAlso Char.IsNumber(y(ypos)) Then
				' Skip leading zeros
				Dim xtmp = FindNextIndexOf(x, xpos, myIsNonZero)
				Dim ytmp = FindNextIndexOf(y, ypos, myIsNonZero)

				' Find the end of the logical character
				Dim xend = FindNextIndexOf(x, xpos, myIsNotNum)
				Dim yend = FindNextIndexOf(y, ypos, myIsNotNum)

				xpos = If(xtmp = xend, xtmp - 1, xtmp)
				ypos = If(ytmp = yend, ytmp - 1, ytmp)

				If xend - xpos < yend - ypos Then
					Return -1
				ElseIf xend - xpos > yend - ypos Then
					Return 1
				Else
					Dim iy = ypos
					For ix = xpos To xend - 1
						If x(ix) < y(iy) Then
							Return -1
						ElseIf x(ix) > y(iy) Then
							Return 1
						End If
						iy += 1
					Next
				End If
			ElseIf Char.IsNumber(x(xpos)) Then
				Return -1
			ElseIf Char.IsNumber(y(ypos)) Then
				Return 1
			Else
				Dim xend = FindNextIndexOf(x, xpos, myIsNotLet)
				Dim yend = FindNextIndexOf(y, ypos, myIsNotLet)
                Dim xlen = xend - xpos
                Dim ylen = yend - ypos
                Dim r = myCmpInfo.Compare(x, xpos, xlen, y, ypos, ylen, myCmpOpt)
                If r <> 0 Then
                    Return r
                End If
				xpos = xend - 1	' -1, because we're about to +1
				ypos = yend - 1
			End If
			xpos += 1
			ypos += 1
		Loop
		If xpos >= x.Length AndAlso ypos >= y.Length Then
			Return 0
		ElseIf xpos >= x.Length Then
			Return -1
		ElseIf ypos >= y.Length Then
			Return 1
		End If
		Return 0
	End Function

The basic idea is to iterate forward through the two input strings, using the FindXXX functions to find the separators between numbers, letters, and ignored characters. We return from Compare as soon as possible without having to look at any characters following the first difference. No extra allocations are performed. For example, when comparing strings I used the form of CompareInfo.Compare that takes offsets and lengths for the two strings to avoid having to allocate substrings. Originally the code used inline lambda expressions for the arguments to FindXXX, however testing showed a significant performance increase from predefining those functions outside any individual Compare, as they are always the same. For example, here’s the definition for myIsNonZero, and the others are much the same.

      Private Shared ReadOnly myIsNonZero As CharPred = Function(c) c <> "0"c

Finally, here’s what FindNextIndexOf looks like…

   1: Public Delegate Function CharPred(ByVal c As Char) As Boolean
   2:
   3: Public Function FindNextIndexOf(ByVal s As String, _
   4:     ByVal start As Integer, ByVal p As CharPred) As Integer
   5:
   6:     If s Is Nothing OrElse s.Length = 0 Then
   7:         Return s.Length
   8:     End If
   9:     For i = start To s.Length - 1
  10:         Dim c = s(i)
  11:         If p(c) Then
  12:             Return i
  13:         End If
  14:     Next
  15:     Return s.Length
  16: End Function

Some Numbers

I manually did a few test comparisons between my solution and an optimized form of Ian’s (compiled regex and int.TryParse) I compared using two lists of strings shuffled randomly. The first list contained the numbers 1-5000, and the second contained alternating strings, numbers, and special characters of the form “string n string n string n” where n was 1-5000.

Optimized IanG, string type one (108-115ms)

Optimized IanG, string type two (291-316ms)

Mine, string type one (16-20ms)

Mine, string type two (158-171ms)

Once again, these numbers aren’t really fair to compare, because my solution handles things like case insensitivity, ignoring non alphanums, culture idiosyncracies, etc. But it’s nice to know that the extra features are more than paid for with the more efficient algorithm.

There’s still room for improvement, so I may make updates from time to time if I need to use this code on a real project.

  • The code doesn’t handle floating point numbers or number separators (e.g. ‘,’). If this is added, then it should probably be optional, as it’s not clear you would always want it.
  • Although I have some unit tests, I don’t have anything to verify the Culture constraints. For example, I’m trusting Char.IsLetter to work correctly.
  • I also assume that numeric characters can be compared by ordinal position. (e.g. ‘0’ < ‘9’)

Feel free to use the code any way you see fit, and please let me know if you find any problems or have any other suggestions.

 

 

 

 

 

Posted in Uncategorized | 1 Comment

128bit Encoding Explained

First, here’s the promised VB version of the code. It’s almost identical to the Java version, except that…
  1. Unsigned integers remove the need for >>>= (The java unsigned shift right operator)
  2. Pass by reference eliminates the need for the Int128 wrapper class, and simplifies the logic in the strToInt function slightly.
  3. Lack of decrement operator means two statements are required to access the character buffer and decrement the index.
  4. Support for Modules makes it clear that I’m providing utility functions. (As compared to using static methods in a final class.)
  5. Type inferencing eliminates the need to explicitly declare the type for local variables. (Not sure that I like this feature yet, because now it’s sometimes hard to tell the type at a glance.
  6. The Select Case statement used in CharToMod64() is much easier to read than the corresponding Java if statements.
Module Utils

    Const bUnder As Byte = 95
    Const bDollar As Byte = 36
    Const bQuest As Byte = 63
    Const bFirstUAlpha As Byte = 65
    Const bFirstLAlpha As Byte = 97
    Const bFirstNum As Byte = 48
    Const mask6Bits As UInt64 = 63
    Const mask4bits As UInt64 = 15
    Const mask2Bits As UInt64 = 3
    Const bitsPerChar As Integer = 6

    Dim CharMap() As Char = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz_$".ToCharArray

    Function UInt128ToStr(ByVal msb As UInt64, ByVal lsb As UInt64) As String
        Const MAX_BYTE As Integer = 31
        Dim buf(0 To MAX_BYTE) As Char
        Dim i = MAX_BYTE

        ' 64 bit number has ten 6bit encoded values plus four bits left over
        For n = 1 To 10
            If lsb = 0 AndAlso msb = 0 Then
                Exit For ' Eliminate leading zeros
            End If
            Dim b = CByte(lsb And mask6Bits)
            buf(i) = CharMap(b)
            i -= 1
            lsb >>= bitsPerChar
        Next

        ' 4 bits from the lsb, and 2 from the msb
        If lsb > 0 OrElse msb > 0 OrElse i = MAX_BYTE Then
            Dim leftOver = lsb And mask4bits
            Dim firstTwo = msb And mask2Bits
            Dim b = CByte((firstTwo << 4) Or leftOver)
            buf(i) = CharMap(b)
            i -= 1
            msb >>= 2
        End If

        Do While msb <> 0
            Dim b = CByte(msb And mask6Bits)
            buf(i) = CharMap(b)
            i -= 1
            msb >>= bitsPerChar
        Loop

        ' It's easiest to simply prefix an underscore to avoid
        ' illegal identifiers and clashes with keywords and literals.
        buf(i) = "_"c
        i -= 1

        Return New String(buf, i + 1, MAX_BYTE - i)
    End Function

    Public Sub StrToUInt128(ByVal s As String, ByRef msb As UInt64, ByRef lsb As UInt64)
        Dim buf() = ASCIIEncoding.ASCII.GetBytes(s)
        Dim maxByte = buf.Length - 1
        Dim i = maxByte
        Dim minByte = 0
        If buf(0) = bUnder Then
            minByte = 1
        End If
        msb = 0
        lsb = 0

        For n = 0 To 9
            If i < minByte Then
                Return
            End If
            Dim b = CharToMod64(buf(i))
            If b <> 0 Then
                If n <> 0 Then
                    b <<= (n * bitsPerChar)
                End If
                lsb = lsb Or b
            End If
            i -= 1
        Next

        If i >= minByte Then
            Dim b = CharToMod64(buf(i))
            If b <> 0 Then
                Dim leftOver = b And mask4bits
                msb = (b >> 4) And mask2Bits
                leftOver <<= 60
                lsb = lsb Or leftOver
            End If
            i -= 1
        End If

        For m = 0 To 10
            If i < minByte Then
                Return
            End If
            Dim b = CharToMod64(buf(i))
            If b <> 0 Then
                b <<= (m * bitsPerChar + 2)
                msb = msb Or b
            End If
            i -= 1
        Next
    End Sub

    Private Function CharToMod64(ByVal c As Byte) As Byte
        Const b10 As Byte = 10
        Select Case c
            Case bFirstNum To bFirstNum + 10
                Return c - bFirstNum
            Case bFirstUAlpha To bFirstUAlpha + 26
                Return c - bFirstUAlpha + b10
            Case bFirstLAlpha To bFirstLAlpha + 26
                Return c - bFirstLAlpha + bDollar
            Case bUnder
                Return 62
            Case bDollar
                Return 63
            Case Else
                Debug.Assert(False, "Unexpected Character " & c)
                Return 0
        End Select
    End Function

End Module

Turning Numbers Into Strings

Because neither language has native support for 128bit integers, we have to resort to using two 64 bit integers. The algorithm I devised simply encodes every 6 bits of the input integers as a single character. I thought this worked out perfectly, because my first reading of Section 3.8 of the Java Language Spec lead me to believe that Java supports A-Z, a-z, 0-9, $, and _ as the only legal characters in identifiers. I learned later that it actually supports many more, so this algorithm could probably be extended to use N bits instead of only 6, which would make the resulting strings even smaller. In the end, I decided against this, because the current choice is less likely to have problems with character sets and encoding/decoding.

With 6 bits per character and 128 bits total, you can see that (128/6= 21.3~) at most 22 characters are required to hold any number. To ensure legal Java identifiers we also prefix an underscore to each generated string.

The first step in the algorithm is to allocate a temporary buffer to hold the decoded characters.  I used 32, because I’m a power-of-two kind of guy.

Next we process the least significant 64 bit number, repeatedly masking out the lowest 6 bits, converting them to a Character using a lookup table, and then shifting right 6 bits to get the next piece. In Java, we must use the unsigned shift operator for this, because a signed shift will pull in 1’s from the left, destroying the number. Notice that we only break out of the loop when we’ve either taken the first 60 bits of the 64 bit number, or both msb and lsb are zero. If only lsb is zero, then we just repeatedly divide 0/6 until 60 bits have been processed. This ensures that we get any trailing zeros. (e.g. Because 100 <> 100000.)

The middle section of code is to handle the remaining 4 bits from the lsb, and the first 2 bits of the msb. This is accomplished using masks and shifts appropriately. The i = MAX_BYTE condition is there to ensure that the number zero is written out as "_0" instead of "_".

Finally we process the remaining 62 bits from the msb. As soon as msb = 0 we’re free to break out of the loop, because we don’t want any leading zeros to pad the string.

One nice property of this algorithm is that it encodes sequential numbers in an efficient and almost human readable format. The first 10 numbers are just encoded as "_0" through "_9", followed by "_A" through "_Z", "_a" through "_z", then "_" and "$". It then rolls over to the next digit with the next 64 encoded as "_10" through "_1$". Anyone used to hexidecimal to string encoding should find this intuitive.  Furthermore it should be obvious that for random 128bit numbers this algorithm gives an optimal encoding to 64 characters.

Turning Strings Into Numbers

Reversing the process is only slightly trickier than the initial encoding. First we convert the input string into an array of Bytes. One complication is that I didn’t want to assume in the decoder that every input string would start with an underscore, so we check for this at the start and set minByte to either 1 or 0. This was originally there, because in the original Int128ToString() function I only prepended the underscore when necessary to make a legal Java identifier. However, I removed that code due to all the complications in checking for keyword and literal conflicts, and the flexibility in the decoding seemed nice so I left it.

Just as with encoding, the decoder steps through the first 60 bits worth of characters. This time, we use a function to convert the input characters back to base64 numbers using CharToMod64. Each pass through the loop using a bitwise OR operation to combine the returned base64 number with the current value of the lsb.  The trick here is that I found it most straightforward to shift the decoded base64 number to the correct position (b <<= n * 6) and then combine with the lsb using a bitwise OR.

The middle section once again converts the 6 bits from the decoded number into the remaining 4 bits for the lsb and the first 2 bits of the msb.

The final section processes the remaining characters into the final 62 bits of the msb.

Performance

I found a few interesting things with the performance of this algorithm.

First, I wrote a simple program to time how long it takes to encode the first million numbers, encode the last million,  encode and decode the first million, and finally encode and decode the last million. Here’s the code in VB.

Public Sub Main()
    Dim startTime = DateTime.Now
    For i As UInt64 = 0 To 999999
        Dim enc = UInt128ToStr(i, i)
    Next
    Dim stopTime = DateTime.Now
    Console.WriteLine("Encode took " & (stopTime - startTime).TotalMilliseconds)

    startTime = DateTime.Now
    For i As Int64 = -1L To -1000000 Step -1
        Dim enc = UInt128ToStr(CULng(i), CULng(i))
        Dim msb, lsb As UInt64
        StrToUInt128(enc, msb, lsb)
    Next
    stopTime = DateTime.Now
    Console.WriteLine("Encode big took " & (stopTime - startTime).TotalMilliseconds)

    startTime = DateTime.Now
    For i As UInt64 = 0 To 999999
        Dim enc = UInt128ToStr(i, i)
        Dim msb, lsb As UInt64
        StrToUInt128(enc, msb, lsb)
    Next
    stopTime = DateTime.Now
    Console.WriteLine("Roundtrip took " & (stopTime - startTime).TotalMilliseconds)

    startTime = DateTime.Now
    For i As Int64 = -1L To -1000000 Step -1
        Dim enc = UInt128ToStr(CULng(i), CULng(i))
        Dim msb, lsb As UInt64
        StrToUInt128(enc, msb, lsb)
    Next
    stopTime = DateTime.Now
    Console.WriteLine("Roundtrip big took " & (stopTime - startTime).TotalMilliseconds)
End Sub

Although both Java and .NET are probably more than fast enough, Java was 2-3 times faster at encoding, but a round trip encode/decode took about the same time for both. (Further testing of decode seemed to show that the .NET version really is faster at decoding.) It’s not worth it to me at this point to figure out why, but if someone is interested I would recommend using ILDASM to inspect the generated IL assembly code for the .NET version. I don’t think this is the kind of thing that’s going to be helped by source analysis or profiling.

Java

Encode took 188

Encode big took 249

Roundtrip took 763

Roundtrip big took 985

VB

Encode took 555.6285

Encode big took 565.3935

Roundtrip took 778.2705

Roundtrip big took 917.91

For comparison I also found source code online for a fast Base64 encoder (http://migbase64.sourceforge.net/).

Base64

Encode took 375

Encode big took 374

Roundtrip took 1352

Roundtrip big took 1332

This is not a slight on Mikael Grev’s algorithm at all, because mine doesn’t even pretend to generate compliant RFC2054 Base64 encoded strings.

In the end, I think I came up with a pretty tight little algorithm to solve a particular problem. I hope someone finds it useful.

Posted in Uncategorized | Leave a comment

Encoding 128bit Numbers as Strings

My current project needed a way to convert between 128 bit numbers (e.g. java.util.UUID) and strings. A special requirement is that the strings be legal Java identifiers, because we use them for variable names in generated code. For the same reason, I wanted to ensure that the strings were of minimal size.

I was able to come up with a trivial algorithm in a few minutes that works OK, but I felt that I should be able to come up with the most optimal solution given a little more time. It turns out that the optimal solution consumed most of this weekend, but I finally got it working.

There’s no way I can bill my customer for this exercise, so I thought I’d post it here in case anyone else finds it interesting. I’m also curious to see if anyone can come up with a better or faster solution.

Basically my approach is to treat the 128 bit number as up to 21 groups of 6 bits each. This gives me 64 possible characters which works out perfectly, because Java has exactly 64 legal characters for identifiers (0-9, A-Z, a-z, _, and $). I also prepend an additional underscore to avoid illegal starting characters and clashes with keywords and literals.

For now, here’s the Java source to ponder. Tomorrow I’ll discuss the code in more detail, and post a VB version which I actually wrote first, then ported to Java.


001 
002 public final class Utils {
003 
004     private static final byte bUnder = 95;
005     private static final byte bDollar = 36;
006     private static final byte bQuest = 63;
007     private static final byte bFirstUAlpha = 65;
008     private static final byte bFirstLAlpha = 97;
009     private static final byte bFirstNum = 48;
010     private static final long mask6Bits = 63;
011     private static final long mask4Bits = 15;
012     private static final long mask2Bits = 3;
013     private static final int bitsPerChar = 6;
014 
015     private static final char[] charMap = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz_$"
016                             .toCharArray();
017 
018     public final static class Int128 {
019         public final long msb;
020         public final long lsb;
021         public Int128(long m, long l) {
022             msb = m;
023             lsb = l;
024         }
025         public Int128() {
026             this(00);
027         }
028         @Override
029         public final String toString() {
030             return "" + msb + "," + lsb;
031         }
032     }
033 
034     public static String int128ToStr(long msb, long lsb) {
035         final int MAX_BYTE = 31;
036         char[] buf = new char[32];
037         int i = MAX_BYTE;
038 
039         // Process the least significant bits
040         // 64 bit number has 10 six bit encoded values plus four bits left over
041         for (int n = 0; n < 10; ++n) {
042             if (lsb == && msb == 0) {
043                 break// Eliminate leading zeros
044             }
045             byte b = (byte) (lsb & mask6Bits);
046             buf[i--= charMap[b];
047             lsb >>>= bitsPerChar;
048         }
049 
050         // 4 bits from the lsb, and 2 from the msb
051         if (lsb != || msb != || i == MAX_BYTE) {
052             long leftOver = lsb & 15;
053             long firstTwo = msb & 3;
054             byte b = (byte) ((firstTwo << 4| leftOver);
055             buf[i--= charMap[b];
056             msb >>>= 2;
057         }
058 
059         // Process the most significant bits
060         while (msb != 0) {
061             byte b = (byte) (msb & mask6Bits);
062             buf[i--= charMap[b];
063             msb >>>= bitsPerChar;
064         }
065 
066         // It's easiest to simply prefix an underscore to avoid
067         // illegal identifiers and clashes with keywords and literals.
068         buf[i--'_';
069 
070         return new String(buf, i + 1, MAX_BYTE - i);
071     }
072 
073     public static Int128 strToInt128(String s) {
074         byte[] buf = s.getBytes();
075         int maxByte = buf.length - 1;
076         int i = maxByte;
077         int minByte = 0;
078         if (buf[0== bUnder) {
079             minByte = 1;
080         }
081         long msb = 0;
082         long lsb = 0;
083 
084         for (int n = 0; n < 10; ++n) {
085             if (i < minByte) {
086                 return new Int128(msb, lsb);
087             }
088             long b = charToMod64(buf[i]);
089             if (b != 0) {
090                 if (n != 0) {
091                     b <<= (n * 6);
092                 }
093                 lsb |= b;
094             }
095             --i;
096         }
097 
098         if (i >= minByte) {
099             long b = charToMod64(buf[i]);
100             if (b != 0) {
101                 long leftOver = b & mask4Bits;
102                 msb = (b >>> 4& mask2Bits;
103                 leftOver <<= 60;
104                 lsb |= leftOver;
105             }
106             --i;
107         }
108 
109         for (int n = 0; n < 11; ++n) {
110             if (i < minByte) {
111                 return new Int128(msb, lsb);
112             }
113             long b = charToMod64(buf[i]);
114             if (b != 0) {
115                 b <<= (n * bitsPerChar + 2);
116                 msb |= b;
117             }
118             --i;
119         }
120         return new Int128(msb, lsb);
121     }
122 
123     private static byte charToMod64(byte c) {
124         if (c >= bFirstNum && c <= bFirstNum + 10) {
125             return (byte) (c - 48);
126         else if (c >= bFirstUAlpha && c <= bFirstUAlpha + 26) {
127             return (byte) (c - bFirstUAlpha + 10);
128         else {
129             if (c >= bFirstLAlpha && c <= bFirstLAlpha + 26) {
130                 return (byte) (c - bFirstLAlpha + 26 10);
131             else if (c == bUnder) {
132                 return 62;
133             else if (c == bDollar) {
134                 return 63;
135             else {
136                 assert false;
137                 return 0;
138             }
139         }
140     }
141 }

Java2html

Posted in Uncategorized | Leave a comment

Windows Environment Editor released on SourceForge

I finally got around to submitting the Windows Environment Editor to SourceForge. You may recall this as the project originally created for the ill-fated OCI Summer of Code programming contest. We haven’t done much work on it since then, but the major bugs have been addressed, and I’ve been using it regularly, as have a few brave volunteers. I think it’s the best of the many replacement environment editors currently available, but you can judge that for yourself.

Technologies

One of my primary purposes for this project was to learn how to create GUI applications using the relatively new Windows Presentation Foundation (WPF 1.0) for .NET. This framework, while similar to others in many respects, is really something fundamentally different than libraries like Swing, WinForms, GTK, MFC, etc. For more information I recommend reading http://arstechnica.com/reviews/os/pretty-vista.ars

One of the benefits of WPF is the separation between GUI design using a markup language (XAML) similar to HTML, and a programming API for making that GUI work. In theory, this will allow designers to use programs like Microsoft Expression Blend to make applications like mine look better. For this reason, we tried to express as much of the GUI as possible using XAML, rather than reverting to code. We also wasted huge amounts of time using Blend to play with the GUI design, but in the end we gave up and went with a simple (ugly?) design that eschews all the fancy gradient effects.

Of course, for reasons best explained in my previous posts, I used Visual Basic 2005 as the programming language. All told the application is something less that 2000 lines of code&XAML.

In the future, maybe I’ll port everything to the upcoming new releases for WPF and VB.

Features

  • Resizable dialog allows viewing more variables at once.
  • System and User settings are contained in collapsible sections.
  • Search box to quickly jump to a variable for editing.
  • Modify variables directly in the search box using Variable=Value syntax.
  • Edit variables directly without having to popup an additional dialog.
  • Both keyboard and mouse interfaces are supported.
  • Keyboard shortcuts for common tasks.
  • Size, Position, and other state is remembered between runs.
  • Vista support for running as a user or administrator.
  • Many small details enhance the user experience.

Conclusion

I estimate that this application represents several hundred hours of work by myself and Ngan that may have been better spent playing Scrabble or Age Of Empires. I hope that at least a few people find it useful. If you have any questions, want to contribute features or fixes, or just want to try it out, I encourage you to check out the project.

https://sourceforge.net/projects/winenvedit/

Posted in Computers and Internet | 3 Comments