Tuesday, February 23, 2010

C#: The Gödel at the Starbow's End

The Gold at the Starbow's End is a short story collection by one of my favourite sci-fi authors, Frederik Pohl, author of the superb Heechee Saga, which still to this day makes me extremely annoyed that we haven't found a fleet of faster-than-light alien spaceships anywhere in the solar system yet. Still, apparently Phobos is probably hollow, so there's fresh hope!

The eponymous story in the collection is extremely bleak, and I find it quite unsettling, like so many Cold War era novels. The basic premise is that a bunch of astronauts are sent off on a long, boring journey with nothing to do but play around with maths, in the hope that they will derive some completely new physics. I can assure you that the story, and indeed the others in the book, are a good 85,000 times better than my primary school summary.

At one point in the book, the astronauts, having developed theories and modes of thought way beyond the poor suckers left on a socially disintegrating Earth, send back their research, including the theory of easy fusion power, in what is called a Gödelized form.

Maths is not my strong suit, so this was new to me. Basically, it encodes arbitrary information in the form of a single number. A big number. A very very big number. A really quite stupendously large number. A number so big ... well, you get the idea.

How it works is this:

  1. List the prime numbers, one for each of the characters in your message.
  2. Raise each prime number to the power of your character encoding - e.g. a = 1, b = 2, ...
  3. Multiply them all together.

So, to encode the word "whelk", you would calculate:

(2^23) * (3^8) * (5^5) * (7^12) * (11^11) =
679,212,979,662,721,270,806,215,943,782,400,000

To decode, you simply factor the number for each of the primes in the list: the number of times 2 divides it gives you the first character, 3 the second, 5 the third, and so on.

Note: If you are, like me, a maths muppet, then this is how you factor a number, using the first prime, 2, as an example:
  1. Divide the number by 2 and get the remainder
  2. If the remainder is 0, add one to a counter and go back to the first step with the halved value
  3. Stop when the remainder is not 0
The counter gives you the power of two, and hence your first character - e.g. if it were 26, it would be "z".
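As a rough illustration of that loop in C# (plain longs here, which only cope with tiny messages, and CountFactor is just a name I've made up for this sketch; the real class below uses BigInt):

// Counts how many times a prime divides a value - i.e. the exponent of that prime
static int CountFactor(long value, long prime)
{
    int count = 0;
    while (value % prime == 0)
    {
        value /= prime;
        count++;
    }
    return count;
}

// "hi" encodes as 2^8 * 3^9 = 5,038,848
// CountFactor(5038848, 2) == 8  -> "h"
// CountFactor(5038848, 3) == 9  -> "i"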

As you can see, this results in numbers of a size I may have already indicated as being quite large, and decoding requires huge amounts of factoring. Plus of course you need to generate lots and lots of primes in the first place.

In the story, the data size is "the contents of a standard unabridged dictionary". The number is transmitted as a sum of bases and exponents in a highly compact format that fits on one line, like this: (73^3489 + 342^43 + 67^12 + 56^4) - 78. (This is not the actual number, I will update that later when I have the book in front of me).

I really have no idea if it is possible to reduce the numbers to sums of bases and exponents like this and still maintain the information in them; it seems to me to result in such a ludicrously high compression ratio that there is no way it could work - but as I said, maths is not my strong suit. And if it is possible, I don't have a clue how to go about it. In the story, IBM require a 25-year development time before they will even bid on building a machine capable of dealing with the whole number.

So, as I don't know how to reduce the size of the numbers, it is basically a scheme which takes very short strings and turns them into very long ones. Brilliant! How incredibly useful!

However, I am a sucker for the useless, so I decided to write a class which did it.

The C# built-in types are not sufficient; using a double, any string over about 3 characters runs out of precision and results in garbage, so I downloaded the excellent freeware BigNum library from http://www.fractal-landscapes.co.uk/bigint.html.

The code itself is pretty simple; I was just messing around, so it is arbitrarily limited to lowercase alphabetic characters only and a maximum input length of 25 characters. The input length limit is because I lazily copied the first 25 primes from the Wikipedia entry. The precision in the BigInt stuff is arbitrary too; it just needs to be big. I didn't really spend a lot of time on this code :o)

// You will need the BigNum assembly on your application path and "using BigNum" in the source file

public class Godel
{
    public PrecisionSpec Precision { get { return precision; } set { precision = value; } }

    public String Chars { get { return chars; } set { chars = value; } }
    public uint[] Primes { get { return primes; } set { primes = value; } }

    // ------------------------------------------------------------------------------------------------------------------------------------------------------------
    public BigInt Encode(String s)
    {        
        if (s != null && s.Length <= primes.Length)
        {
            BigInt res = new BigInt(1, precision);

            s = s.ToLower();

            int idx;
            for (int i = 0; i < s.Length; i++)
            {
                idx = chars.IndexOf(s.Substring(i, 1));
                // Characters not found in Chars map to the last entry ("_" by default)
                idx = idx >= 0 ? idx : chars.Length - 1;

                BigInt curr = new BigInt(primes[i], precision);

                // Exponents should start at 1, then we can use zero occurrences to check for end of data
                BigInt pow = new BigInt(idx + 1, precision);

                curr.Power(pow);

                res.Mul(curr);
            }

            return res;
        }
        else if (s != null && s.Length > primes.Length)
            throw new ArgumentException("String cannot be longer than " + primes.Length.ToString() + " characters");
        else
            return new BigInt(0, precision);
    }

    public String Decode(BigInt value)
    {
        String res = "";
        for (int primeidx = 0; primeidx < primes.Length; primeidx++)
        {
            int exponent = Factor(value, primes[primeidx]);

            if (exponent <= 0)
                break;

            int cidx = exponent - 1;

            if (cidx < chars.Length)
                res += chars[cidx];
            else
                res += "[" + cidx.ToString() + "]";
        }

        return res;
    }

    // ------------------------------------------------------------------------------------------------------------------------------------------------------------
    private int Factor(BigInt value, uint prime)
    {
        int res = 0;
        int sanity = 10000;
        
        BigInt remaining = new BigInt(value);
        BigInt mod;

        BigInt.DivMod(remaining, prime, out remaining, out mod);

        while(mod.IsZero() && sanity-- > 0)
        {
            BigInt.DivMod(remaining, prime, out remaining, out mod);
        
            res++;
        }

        if (sanity <= 0)
            throw new InvalidOperationException("Sanity check reached");

        return res;
    }

    // ------------------------------------------------------------------------------------------------------------------------------------------------------------
    private uint[] primes = new uint[] { 2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97 };
    private String chars = "abcdefghijklmnopqrstuvwxyz_";
    private PrecisionSpec precision = new PrecisionSpec(10000, PrecisionSpec.BaseType.DEC);
}

Pretty easy to use:

Godel godel = new Godel();

BigInt encoded = godel.Encode("test");

String originalstring = godel.Decode(encoded);

In the highly unlikely event you find a use for this, you can set the Primes property to handle longer strings, and the Chars property to encode more than just lowercase letters. If decoding doesn't give you back the original string, you may need to increase the precision.
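As a sketch of what that might look like (the extra primes, the character set and the test string here are arbitrary choices of mine, not anything the class requires):

// Extend the character set and the prime list to handle longer, richer strings
Godel godel = new Godel();

godel.Chars = "abcdefghijklmnopqrstuvwxyz0123456789 _";
godel.Primes = new uint[] { 2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47,
                            53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101, 103, 107, 109, 113 };

// Bigger numbers need more precision to survive the round trip
godel.Precision = new PrecisionSpec(20000, PrecisionSpec.BaseType.DEC);

BigInt encoded = godel.Encode("hello world 2010");
String decoded = godel.Decode(encoded);   // gives back "hello world 2010"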

Friday, February 19, 2010

C#: Exceptions are slllooooowwwww

Exceptions, eh? I've always avoided them in high performance code, because everyone knows they're slow. I have never actually bothered testing this, though, relying on cleverer and more dedicated people than me doing the grunt work.

Currently, I am writing a high throughput application that performs actions based on a passed in "verb".

Normally, I deal with this by setting up the verbs as an enum and using Enum.Parse() to return the verb and check it is valid at the same time.

I had no idea how quick Enum.Parse() actually is, so I decided to test it.

The testing code uses the handy System.Diagnostics.Stopwatch class. It uses an enum containing 3 names, and creates a List<String> of 500,000 of each of those three values; so 1,500,000 strings in total.

The code then loops through the list and converts the strings to the enum values using either Enum.Parse() or hardcoded if / else constructs using String.Compare(String, String, StringComparison).

I also tossed a switch statement in there; usually I avoid them because I find the syntax ugly and opaque, but I read on an MS blog that switches on strings are converted to a hash comparison, so I thought if that was true it might be worth checking for speed against the if / else. The switch used String.ToUpper() in the checking loop to provide case-insensitivity.

I then repeated the tests, but this time with every string an invalid enum value. Obviously with Enum.Parse() this requires catching an exception, and with the other two just an extra condition; in all cases the handler just sets a private member bool flag.
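The original test harness isn't reproduced here, but a rough sketch of the shape of two of the variants looks like this (the Verb values and method names are my own inventions, not the real test code; you'll need using System.Collections.Generic and using System.Diagnostics):

enum Verb { Get, Put, Delete }

private bool invalid;    // the "private member bool flag" set when a verb is not valid

private TimeSpan TimeEnumParse(List<String> verbs)
{
    Stopwatch sw = Stopwatch.StartNew();

    foreach (String s in verbs)
    {
        try
        {
            Verb v = (Verb)Enum.Parse(typeof(Verb), s, true);    // true = case-insensitive
        }
        catch (ArgumentException)
        {
            invalid = true;                                      // invalid verb, just flag it
        }
    }

    sw.Stop();
    return sw.Elapsed;
}

private TimeSpan TimeSwitch(List<String> verbs)
{
    Stopwatch sw = Stopwatch.StartNew();

    foreach (String s in verbs)
    {
        switch (s.ToUpper())    // ToUpper() provides the case-insensitivity
        {
            case "GET": break;
            case "PUT": break;
            case "DELETE": break;
            default: invalid = true; break;
        }
    }

    sw.Stop();
    return sw.Elapsed;
}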

Here are the results:

DoTheEnumThing()

Enum.Parse
00:00:08.64

if/else
00:00:03.48

switch
00:00:02.53

DoTheEnumThingWithErrors()

Enum.Parse
00:05:15.11

if/else
00:00:05.63

switch
00:00:02.74

I think that pretty conclusively demonstrates that exception handling is slow. You certainly wouldn't want to use it in a high throughput situation where you were expecting a lot of invalid values to be passed. The switch is appreciably faster than the if / else, so this hardcoded solution, though ugly, is definitely the speediest way to go.

Thursday, February 18, 2010

JS: Client side templating

I originally created this for an autocomplete widget I needed for a project ages ago; eventually it became too annoying to write all the code for the millions of conditions so I just used the YUI one instead, which is a pain in the arse to set up but works perfectly when you've done it.

Anyway, while I was looking around at existing solutions, they all seemed to have the problem that the markup was either fixed and styled with CSS, or generated in a callback. Neither of these solutions was ideal; the first results in a widget that is usable most of the time but hard to customise for more unusual uses, and the second is very designer-unfriendly if you need to separate your coding and design teams.

So, my idea was to be able to write your HTML as normal, then mark an element as a template container. This template would contain databound variables. It would be removed from the DOM on page load, stored, then repeatedly generated for each item in a provided dataset.

Basically, it is inspired by the ASP.NET repeater control, .NET being my primary development environment.

I recently refactored the code to be more generally useful, and created a jQuery plugin wrapper that allows it to be used with jQuery selectors and the jQuery AJAX functionality.

It has uses beyond autocomplete, of course; any AJAXian updatable areas could use it.

Pros and cons are:

  • Keeps the template HTML in the same file as everything else; this can be a pro if the area is specific to a certain page or you are using a "master page" or "header / footer includes" style design, and a con if the template is to be used in several places. In that case you are probably better off with a server-side templating system to generate the HTML and dumping that in using the jQuery native functions.
  • Allows lightweight data transfer; instead of returning all of the HTML you can just return a JSON dataset.

The basic template code and the jQuery wrapper can be downloaded from the links below:

Usage:

First of all, you need a datasource, which will be an array of objects, for example:

[
 {id:1, name:'Alice'},
 {id:2, name:'Bob'},
 {id:3, name:'Carol'}
]

Then, you need to set up the template:

<ul id="testTemplate">
    <li class="@ItemClass">@Text</li>
</ul>

Note the SQL-style variable placeholders. When the template is created with the ID of the <ul> element, the <li> will be cleared. It is then reproduced for every item in the data array.

Note: EVERYTHING contained within the template element is treated as the data row, so you can have very complex repeating templates; as a very simple example you could have two list items for every data item should you want to. This is more useful for creating table-based templates.

The actual call to populate the template looks like this:

// Call using local data and the simple unwrappered template code
var template = new HTemplates.Template('testTemplate');
template.populate(data);

// Call using the jQuery wrapper and an AJAX data source
$('#testTemplate').hTemplate();     
$('#testTemplate').hTemplate().populate(data);

You can download an example page plus an example JSON datasource from the links below:


Note: You will need the jQuery framework; it has been tested with 1.4. You will also need to change the script paths to match your setup.

The jQuery plugin is the first time I have looked at jQuery, so I'm not 100% sure this is the best way to create a plugin; the hTemplate().populate() call looks odd to me. Also, there is probably a better way of parsing the JSON that is less fragile than my check-for-open-bracket-then-eval() method.