128bit Encoding Explained

First, here’s the promised VB version of the code. It’s almost identical to the Java version, except that…
  1. Unsigned integers remove the need for >>>= (The java unsigned shift right operator)
  2. Pass by reference eliminates the need for the Int128 wrapper class, and simplifies the logic in the strToInt function slightly.
  3. Lack of decrement operator means two statements are required to access the character buffer and decrement the index.
  4. Support for Modules makes it clear that I’m providing utility functions. (As compared to using static methods in a final class.)
  5. Type inferencing eliminates the need to explicitly declare the type for local variables. (Not sure that I like this feature yet, because now it’s sometimes hard to tell the type at a glance.
  6. The Select Case statement used in CharToMod64() is much easier to read than the corresponding Java if statements.
Module Utils

    Const bUnder As Byte = 95
    Const bDollar As Byte = 36
    Const bQuest As Byte = 63
    Const bFirstUAlpha As Byte = 65
    Const bFirstLAlpha As Byte = 97
    Const bFirstNum As Byte = 48
    Const mask6Bits As UInt64 = 63
    Const mask4bits As UInt64 = 15
    Const mask2Bits As UInt64 = 3
    Const bitsPerChar As Integer = 6

    Dim CharMap() As Char = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz_$".ToCharArray

    Function UInt128ToStr(ByVal msb As UInt64, ByVal lsb As UInt64) As String
        Const MAX_BYTE As Integer = 31
        Dim buf(0 To MAX_BYTE) As Char
        Dim i = MAX_BYTE

        ' 64 bit number has ten 6bit encoded values plus four bits left over
        For n = 1 To 10
            If lsb = 0 AndAlso msb = 0 Then
                Exit For ' Eliminate leading zeros
            End If
            Dim b = CByte(lsb And mask6Bits)
            buf(i) = CharMap(b)
            i -= 1
            lsb >>= bitsPerChar
        Next

        ' 4 bits from the lsb, and 2 from the msb
        If lsb > 0 OrElse msb > 0 OrElse i = MAX_BYTE Then
            Dim leftOver = lsb And mask4bits
            Dim firstTwo = msb And mask2Bits
            Dim b = CByte((firstTwo << 4) Or leftOver)
            buf(i) = CharMap(b)
            i -= 1
            msb >>= 2
        End If

        Do While msb <> 0
            Dim b = CByte(msb And mask6Bits)
            buf(i) = CharMap(b)
            i -= 1
            msb >>= bitsPerChar
        Loop

        ' It's easiest to simply prefix an underscore to avoid
        ' illegal identifiers and clashes with keywords and literals.
        buf(i) = "_"c
        i -= 1

        Return New String(buf, i + 1, MAX_BYTE - i)
    End Function

    Public Sub StrToUInt128(ByVal s As String, ByRef msb As UInt64, ByRef lsb As UInt64)
        Dim buf() = ASCIIEncoding.ASCII.GetBytes(s)
        Dim maxByte = buf.Length - 1
        Dim i = maxByte
        Dim minByte = 0
        If buf(0) = bUnder Then
            minByte = 1
        End If
        msb = 0
        lsb = 0

        For n = 0 To 9
            If i < minByte Then
                Return
            End If
            Dim b = CharToMod64(buf(i))
            If b <> 0 Then
                If n <> 0 Then
                    b <<= (n * bitsPerChar)
                End If
                lsb = lsb Or b
            End If
            i -= 1
        Next

        If i >= minByte Then
            Dim b = CharToMod64(buf(i))
            If b <> 0 Then
                Dim leftOver = b And mask4bits
                msb = (b >> 4) And mask2Bits
                leftOver <<= 60
                lsb = lsb Or leftOver
            End If
            i -= 1
        End If

        For m = 0 To 10
            If i < minByte Then
                Return
            End If
            Dim b = CharToMod64(buf(i))
            If b <> 0 Then
                b <<= (m * bitsPerChar + 2)
                msb = msb Or b
            End If
            i -= 1
        Next
    End Sub

    Private Function CharToMod64(ByVal c As Byte) As Byte
        Const b10 As Byte = 10
        Select Case c
            Case bFirstNum To bFirstNum + 10
                Return c - bFirstNum
            Case bFirstUAlpha To bFirstUAlpha + 26
                Return c - bFirstUAlpha + b10
            Case bFirstLAlpha To bFirstLAlpha + 26
                Return c - bFirstLAlpha + bDollar
            Case bUnder
                Return 62
            Case bDollar
                Return 63
            Case Else
                Debug.Assert(False, "Unexpected Character " & c)
                Return 0
        End Select
    End Function

End Module

Turning Numbers Into Strings

Because neither language has native support for 128bit integers, we have to resort to using two 64 bit integers. The algorithm I devised simply encodes every 6 bits of the input integers as a single character. I thought this worked out perfectly, because my first reading of Section 3.8 of the Java Language Spec lead me to believe that Java supports A-Z, a-z, 0-9, $, and _ as the only legal characters in identifiers. I learned later that it actually supports many more, so this algorithm could probably be extended to use N bits instead of only 6, which would make the resulting strings even smaller. In the end, I decided against this, because the current choice is less likely to have problems with character sets and encoding/decoding.

With 6 bits per character and 128 bits total, you can see that (128/6= 21.3~) at most 22 characters are required to hold any number. To ensure legal Java identifiers we also prefix an underscore to each generated string.

The first step in the algorithm is to allocate a temporary buffer to hold the decoded characters.  I used 32, because I’m a power-of-two kind of guy.

Next we process the least significant 64 bit number, repeatedly masking out the lowest 6 bits, converting them to a Character using a lookup table, and then shifting right 6 bits to get the next piece. In Java, we must use the unsigned shift operator for this, because a signed shift will pull in 1’s from the left, destroying the number. Notice that we only break out of the loop when we’ve either taken the first 60 bits of the 64 bit number, or both msb and lsb are zero. If only lsb is zero, then we just repeatedly divide 0/6 until 60 bits have been processed. This ensures that we get any trailing zeros. (e.g. Because 100 <> 100000.)

The middle section of code is to handle the remaining 4 bits from the lsb, and the first 2 bits of the msb. This is accomplished using masks and shifts appropriately. The i = MAX_BYTE condition is there to ensure that the number zero is written out as "_0" instead of "_".

Finally we process the remaining 62 bits from the msb. As soon as msb = 0 we’re free to break out of the loop, because we don’t want any leading zeros to pad the string.

One nice property of this algorithm is that it encodes sequential numbers in an efficient and almost human readable format. The first 10 numbers are just encoded as "_0" through "_9", followed by "_A" through "_Z", "_a" through "_z", then "_" and "$". It then rolls over to the next digit with the next 64 encoded as "_10" through "_1$". Anyone used to hexidecimal to string encoding should find this intuitive.  Furthermore it should be obvious that for random 128bit numbers this algorithm gives an optimal encoding to 64 characters.

Turning Strings Into Numbers

Reversing the process is only slightly trickier than the initial encoding. First we convert the input string into an array of Bytes. One complication is that I didn’t want to assume in the decoder that every input string would start with an underscore, so we check for this at the start and set minByte to either 1 or 0. This was originally there, because in the original Int128ToString() function I only prepended the underscore when necessary to make a legal Java identifier. However, I removed that code due to all the complications in checking for keyword and literal conflicts, and the flexibility in the decoding seemed nice so I left it.

Just as with encoding, the decoder steps through the first 60 bits worth of characters. This time, we use a function to convert the input characters back to base64 numbers using CharToMod64. Each pass through the loop using a bitwise OR operation to combine the returned base64 number with the current value of the lsb.  The trick here is that I found it most straightforward to shift the decoded base64 number to the correct position (b <<= n * 6) and then combine with the lsb using a bitwise OR.

The middle section once again converts the 6 bits from the decoded number into the remaining 4 bits for the lsb and the first 2 bits of the msb.

The final section processes the remaining characters into the final 62 bits of the msb.

Performance

I found a few interesting things with the performance of this algorithm.

First, I wrote a simple program to time how long it takes to encode the first million numbers, encode the last million,  encode and decode the first million, and finally encode and decode the last million. Here’s the code in VB.

Public Sub Main()
    Dim startTime = DateTime.Now
    For i As UInt64 = 0 To 999999
        Dim enc = UInt128ToStr(i, i)
    Next
    Dim stopTime = DateTime.Now
    Console.WriteLine("Encode took " & (stopTime - startTime).TotalMilliseconds)

    startTime = DateTime.Now
    For i As Int64 = -1L To -1000000 Step -1
        Dim enc = UInt128ToStr(CULng(i), CULng(i))
        Dim msb, lsb As UInt64
        StrToUInt128(enc, msb, lsb)
    Next
    stopTime = DateTime.Now
    Console.WriteLine("Encode big took " & (stopTime - startTime).TotalMilliseconds)

    startTime = DateTime.Now
    For i As UInt64 = 0 To 999999
        Dim enc = UInt128ToStr(i, i)
        Dim msb, lsb As UInt64
        StrToUInt128(enc, msb, lsb)
    Next
    stopTime = DateTime.Now
    Console.WriteLine("Roundtrip took " & (stopTime - startTime).TotalMilliseconds)

    startTime = DateTime.Now
    For i As Int64 = -1L To -1000000 Step -1
        Dim enc = UInt128ToStr(CULng(i), CULng(i))
        Dim msb, lsb As UInt64
        StrToUInt128(enc, msb, lsb)
    Next
    stopTime = DateTime.Now
    Console.WriteLine("Roundtrip big took " & (stopTime - startTime).TotalMilliseconds)
End Sub

Although both Java and .NET are probably more than fast enough, Java was 2-3 times faster at encoding, but a round trip encode/decode took about the same time for both. (Further testing of decode seemed to show that the .NET version really is faster at decoding.) It’s not worth it to me at this point to figure out why, but if someone is interested I would recommend using ILDASM to inspect the generated IL assembly code for the .NET version. I don’t think this is the kind of thing that’s going to be helped by source analysis or profiling.

Java

Encode took 188

Encode big took 249

Roundtrip took 763

Roundtrip big took 985

VB

Encode took 555.6285

Encode big took 565.3935

Roundtrip took 778.2705

Roundtrip big took 917.91

For comparison I also found source code online for a fast Base64 encoder (http://migbase64.sourceforge.net/).

Base64

Encode took 375

Encode big took 374

Roundtrip took 1352

Roundtrip big took 1332

This is not a slight on Mikael Grev’s algorithm at all, because mine doesn’t even pretend to generate compliant RFC2054 Base64 encoded strings.

In the end, I think I came up with a pretty tight little algorithm to solve a particular problem. I hope someone finds it useful.

Advertisements
This entry was posted in Uncategorized. Bookmark the permalink.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s