- Unsigned integers remove the need for >>>= (The java unsigned shift right operator)
- Pass by reference eliminates the need for the Int128 wrapper class, and simplifies the logic in the strToInt function slightly.
- Lack of decrement operator means two statements are required to access the character buffer and decrement the index.
- Support for Modules makes it clear that I’m providing utility functions. (As compared to using static methods in a final class.)
- Type inferencing eliminates the need to explicitly declare the type for local variables. (Not sure that I like this feature yet, because now it’s sometimes hard to tell the type at a glance.
- The Select Case statement used in CharToMod64() is much easier to read than the corresponding Java if statements.
Module Utils Const bUnder As Byte = 95 Const bDollar As Byte = 36 Const bQuest As Byte = 63 Const bFirstUAlpha As Byte = 65 Const bFirstLAlpha As Byte = 97 Const bFirstNum As Byte = 48 Const mask6Bits As UInt64 = 63 Const mask4bits As UInt64 = 15 Const mask2Bits As UInt64 = 3 Const bitsPerChar As Integer = 6 Dim CharMap() As Char = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz_$".ToCharArray Function UInt128ToStr(ByVal msb As UInt64, ByVal lsb As UInt64) As String Const MAX_BYTE As Integer = 31 Dim buf(0 To MAX_BYTE) As Char Dim i = MAX_BYTE ' 64 bit number has ten 6bit encoded values plus four bits left over For n = 1 To 10 If lsb = 0 AndAlso msb = 0 Then Exit For ' Eliminate leading zeros End If Dim b = CByte(lsb And mask6Bits) buf(i) = CharMap(b) i -= 1 lsb >>= bitsPerChar Next ' 4 bits from the lsb, and 2 from the msb If lsb > 0 OrElse msb > 0 OrElse i = MAX_BYTE Then Dim leftOver = lsb And mask4bits Dim firstTwo = msb And mask2Bits Dim b = CByte((firstTwo << 4) Or leftOver) buf(i) = CharMap(b) i -= 1 msb >>= 2 End If Do While msb <> 0 Dim b = CByte(msb And mask6Bits) buf(i) = CharMap(b) i -= 1 msb >>= bitsPerChar Loop ' It's easiest to simply prefix an underscore to avoid ' illegal identifiers and clashes with keywords and literals. buf(i) = "_"c i -= 1 Return New String(buf, i + 1, MAX_BYTE - i) End Function Public Sub StrToUInt128(ByVal s As String, ByRef msb As UInt64, ByRef lsb As UInt64) Dim buf() = ASCIIEncoding.ASCII.GetBytes(s) Dim maxByte = buf.Length - 1 Dim i = maxByte Dim minByte = 0 If buf(0) = bUnder Then minByte = 1 End If msb = 0 lsb = 0 For n = 0 To 9 If i < minByte Then Return End If Dim b = CharToMod64(buf(i)) If b <> 0 Then If n <> 0 Then b <<= (n * bitsPerChar) End If lsb = lsb Or b End If i -= 1 Next If i >= minByte Then Dim b = CharToMod64(buf(i)) If b <> 0 Then Dim leftOver = b And mask4bits msb = (b >> 4) And mask2Bits leftOver <<= 60 lsb = lsb Or leftOver End If i -= 1 End If For m = 0 To 10 If i < minByte Then Return End If Dim b = CharToMod64(buf(i)) If b <> 0 Then b <<= (m * bitsPerChar + 2) msb = msb Or b End If i -= 1 Next End Sub Private Function CharToMod64(ByVal c As Byte) As Byte Const b10 As Byte = 10 Select Case c Case bFirstNum To bFirstNum + 10 Return c - bFirstNum Case bFirstUAlpha To bFirstUAlpha + 26 Return c - bFirstUAlpha + b10 Case bFirstLAlpha To bFirstLAlpha + 26 Return c - bFirstLAlpha + bDollar Case bUnder Return 62 Case bDollar Return 63 Case Else Debug.Assert(False, "Unexpected Character " & c) Return 0 End Select End Function End Module
Turning Numbers Into Strings
Because neither language has native support for 128bit integers, we have to resort to using two 64 bit integers. The algorithm I devised simply encodes every 6 bits of the input integers as a single character. I thought this worked out perfectly, because my first reading of Section 3.8 of the Java Language Spec lead me to believe that Java supports A-Z, a-z, 0-9, $, and _ as the only legal characters in identifiers. I learned later that it actually supports many more, so this algorithm could probably be extended to use N bits instead of only 6, which would make the resulting strings even smaller. In the end, I decided against this, because the current choice is less likely to have problems with character sets and encoding/decoding.
With 6 bits per character and 128 bits total, you can see that (128/6= 21.3~) at most 22 characters are required to hold any number. To ensure legal Java identifiers we also prefix an underscore to each generated string.
The first step in the algorithm is to allocate a temporary buffer to hold the decoded characters. I used 32, because I’m a power-of-two kind of guy.
Next we process the least significant 64 bit number, repeatedly masking out the lowest 6 bits, converting them to a Character using a lookup table, and then shifting right 6 bits to get the next piece. In Java, we must use the unsigned shift operator for this, because a signed shift will pull in 1’s from the left, destroying the number. Notice that we only break out of the loop when we’ve either taken the first 60 bits of the 64 bit number, or both msb and lsb are zero. If only lsb is zero, then we just repeatedly divide 0/6 until 60 bits have been processed. This ensures that we get any trailing zeros. (e.g. Because 100 <> 100000.)
The middle section of code is to handle the remaining 4 bits from the lsb, and the first 2 bits of the msb. This is accomplished using masks and shifts appropriately. The i = MAX_BYTE condition is there to ensure that the number zero is written out as "_0" instead of "_".
Finally we process the remaining 62 bits from the msb. As soon as msb = 0 we’re free to break out of the loop, because we don’t want any leading zeros to pad the string.
One nice property of this algorithm is that it encodes sequential numbers in an efficient and almost human readable format. The first 10 numbers are just encoded as "_0" through "_9", followed by "_A" through "_Z", "_a" through "_z", then "_" and "$". It then rolls over to the next digit with the next 64 encoded as "_10" through "_1$". Anyone used to hexidecimal to string encoding should find this intuitive. Furthermore it should be obvious that for random 128bit numbers this algorithm gives an optimal encoding to 64 characters.
Turning Strings Into Numbers
Reversing the process is only slightly trickier than the initial encoding. First we convert the input string into an array of Bytes. One complication is that I didn’t want to assume in the decoder that every input string would start with an underscore, so we check for this at the start and set minByte to either 1 or 0. This was originally there, because in the original Int128ToString() function I only prepended the underscore when necessary to make a legal Java identifier. However, I removed that code due to all the complications in checking for keyword and literal conflicts, and the flexibility in the decoding seemed nice so I left it.
Just as with encoding, the decoder steps through the first 60 bits worth of characters. This time, we use a function to convert the input characters back to base64 numbers using CharToMod64. Each pass through the loop using a bitwise OR operation to combine the returned base64 number with the current value of the lsb. The trick here is that I found it most straightforward to shift the decoded base64 number to the correct position (b <<= n * 6) and then combine with the lsb using a bitwise OR.
The middle section once again converts the 6 bits from the decoded number into the remaining 4 bits for the lsb and the first 2 bits of the msb.
The final section processes the remaining characters into the final 62 bits of the msb.
I found a few interesting things with the performance of this algorithm.
First, I wrote a simple program to time how long it takes to encode the first million numbers, encode the last million, encode and decode the first million, and finally encode and decode the last million. Here’s the code in VB.
Public Sub Main() Dim startTime = DateTime.Now For i As UInt64 = 0 To 999999 Dim enc = UInt128ToStr(i, i) Next Dim stopTime = DateTime.Now Console.WriteLine("Encode took " & (stopTime - startTime).TotalMilliseconds) startTime = DateTime.Now For i As Int64 = -1L To -1000000 Step -1 Dim enc = UInt128ToStr(CULng(i), CULng(i)) Dim msb, lsb As UInt64 StrToUInt128(enc, msb, lsb) Next stopTime = DateTime.Now Console.WriteLine("Encode big took " & (stopTime - startTime).TotalMilliseconds) startTime = DateTime.Now For i As UInt64 = 0 To 999999 Dim enc = UInt128ToStr(i, i) Dim msb, lsb As UInt64 StrToUInt128(enc, msb, lsb) Next stopTime = DateTime.Now Console.WriteLine("Roundtrip took " & (stopTime - startTime).TotalMilliseconds) startTime = DateTime.Now For i As Int64 = -1L To -1000000 Step -1 Dim enc = UInt128ToStr(CULng(i), CULng(i)) Dim msb, lsb As UInt64 StrToUInt128(enc, msb, lsb) Next stopTime = DateTime.Now Console.WriteLine("Roundtrip big took " & (stopTime - startTime).TotalMilliseconds) End Sub
Although both Java and .NET are probably more than fast enough, Java was 2-3 times faster at encoding, but a round trip encode/decode took about the same time for both. (Further testing of decode seemed to show that the .NET version really is faster at decoding.) It’s not worth it to me at this point to figure out why, but if someone is interested I would recommend using ILDASM to inspect the generated IL assembly code for the .NET version. I don’t think this is the kind of thing that’s going to be helped by source analysis or profiling.
Encode took 188
Encode big took 249
Roundtrip took 763
Roundtrip big took 985
Encode took 555.6285
Encode big took 565.3935
Roundtrip took 778.2705
Roundtrip big took 917.91
For comparison I also found source code online for a fast Base64 encoder (http://migbase64.sourceforge.net/).
Encode took 375
Encode big took 374
Roundtrip took 1352
Roundtrip big took 1332
This is not a slight on Mikael Grev’s algorithm at all, because mine doesn’t even pretend to generate compliant RFC2054 Base64 encoded strings.
In the end, I think I came up with a pretty tight little algorithm to solve a particular problem. I hope someone finds it useful.