Tuesday, February 23, 2010

C#: The Gödel at the Starbow's End

The Gold at the Starbow's End is a short story collection by one of my favourite sci-fi authors, Frederick Pohl, author of the superb Heechee Saga, which still to this day makes me extremely annoyed that we haven't found a fleet of faster than light alien spaceships anywhere in the solar system yet. Still, apparently Phobos is probably hollow, so there's fresh hope!

The eponymous story in the collection is an extremely bleak story I find quite unsettling, like so many Cold War era novels. The basic premise is that a bunch of astronauts are sent off on a long boring journey with nothing to do but play around with maths in the hope they will derive some completely new physics. I can assure you that the story and indeed the others in the book are a good 85,000 times better than my primary school summary.

At one point in the book, the astronauts, having developed theories and modes of thought way beyond the poor suckers left on a socially disintegrating Earth, send back their research, including the theory of easy fusion power, in what is called a Gödelized form.

Maths is not my strong suit, so this was new to me. Basically, it encodes arbitrary information in the form of a single number. A big number. A very very big number. A really quite stupendously large number. A number so big ... well, you get the idea.

How it works is this:

  1. List the prime numbers, one for each of the characters in your message.
  2. Raise each prime number to the power of your character encoding - e.g. a = 1, b = 2, ...
  3. Multiply them all together.

So, to encode the word "whelk", you would calculate:

(2^23) * (3^8) * (5^5) * (7^12) * (11^11) =
679,212,979,662,721,270,806,215,943,782,400,000

To decode, you simply factor the number for each of the primes in the list. So, factoring for 2 gives you the first character, for 3 the 2nd, 5 the 3rd etc. etc.

Note: If you are like me a maths muppet, then this is how you factor a number, using the first prime, 2 as as example:
  1. Divide the number by 2 and get the remainder
  2. If the remainder is 0, add one to a counter and go back to the first step with the halved value
  3. Stop when the remainder is not 0
The counter gives you the factors of two, and hence your first character - e.g. if it were 26, it would be "z".

As you can see, this results in numbers of a size I may have already indicated as being quite large, and decoding requires huge amounts of factoring. Plus of course you need to generate lots and lots of primes in the first place.

In the story, the data size is "the contents of a standard unabridged dictionary". The number is transmitted as a sum of bases and exponents in a highl compact format that fits on one line, like this: (73^3489 + 342^43 + 67^12 + 56^4) - 78. (This is not the actual number, I will update that later when I have the book in front of me).

I really have no idea if it is possible to reduce the numbers to sums of bases and exponents like this and still maintain the information in it; it seems to me to result in such a ludicrously high compression ratio that there is no way it could work - but as I said, maths is not my strong suit. And if it is possible, I don't have a clue how to go about it. In the story, IBM require a 25 year development time before they will even bid on building a machine capable of dealing with the whole number.

So, as I don't know how to reduce the size of the numbers, it is basically a scheme which takes very short strings and turns them into very long ones. Brilliant! How incredibly useful!

However, I am a sucker for the useless, so I decided to write a class which did it.

The C# built in types are not sufficient; using a double any string over about 3 characters runs out of precision and results in garbage, so I downloaded the excellent freeware BigNum library from http://www.fractal-landscapes.co.uk/bigint.html.

The code itself is pretty simple, I was just messing around so it is arbitrarily limited to lowercase alphabetic characters only and a maximum input length of 25 characters. The input length limit is because I lazily copied the first 25 primes from the Wikipedia entry. The precision in the BigInt stuff is arbitrary too; it just needs to be big. I didn't really spend a lot of time on this code :o)

// You will need the BigNum assembly on your application path and "using BigNum" in the source file

public class Godel
{
    public PrecisionSpec Precision { get { return precision; } set { precision = value; } }

    public String Chars { get { return chars; } set { chars = value; } }
    public uint[] Primes { get { return primes; } set { primes = value; } }

    // ------------------------------------------------------------------------------------------------------------------------------------------------------------
    public BigInt Encode(String s)
    {        
        if (s != null && s.Length <= primes.Length)
        {
            BigInt res = new BigInt(1, precision);

            s = s.ToLower();

            int idx;
            for (int i = 0; i < s.Length; i++)
            {
                idx = chars.IndexOf(s.Substring(i, 1));
                idx = idx >= 0 ? idx : chars.Length - 1;

                BigInt curr = new BigInt(primes[i], precision);

                // Exponents should start at 1, then we can use zero occurrences to check for end of data
                BigInt pow = new BigInt(idx + 1, precision);

                curr.Power(pow);

                res.Mul(curr);
            }

            return res;
        }
        else if (s.Length > primes.Length)
            throw new ArgumentException("String cannot be longer than " + primes.Length.ToString() + " characters");
        else
            return new BigInt(0, precision);
    }

    public String Decode(BigInt value)
    {
        String res = "";
        for (int primeidx = 0; primeidx < primes.Length; primeidx++)
        {
            int exponent = Factor(value, primes[primeidx]);

            if (exponent <= 0)
                break;

            int cidx = exponent - 1;

            if (cidx < chars.Length)
                res += chars[cidx];
            else
                res += "[" + cidx.ToString() + "]";
        }

        return res;
    }

    // ------------------------------------------------------------------------------------------------------------------------------------------------------------
    private int Factor(BigInt value, uint prime)
    {
        int res = 0;
        int sanity = 10000;
        
        BigInt remaining = new BigInt(value);
        BigInt mod;

        BigInt.DivMod(remaining, prime, out remaining, out mod);

        while(mod.IsZero() && sanity-- > 0)
        {
            BigInt.DivMod(remaining, prime, out remaining, out mod);
        
            res++;
        }

        if (sanity <= 0)
            throw new InvalidOperationException("Sanity check reached");

        return res;
    }

    // ------------------------------------------------------------------------------------------------------------------------------------------------------------
    private uint[] primes = new uint[] { 2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97 };
    private String chars = "abcdefghijklmnopqrstuvwxyz_";
    private PrecisionSpec precision = new PrecisionSpec(10000, PrecisionSpec.BaseType.DEC);
}

Pretty easy to use:

Godel godel = new Godel();

BigInt encoded = godel.Encode("test");

String originalstring = godel.Decode(encoded);

In the highly unlikely event you find a use for this, you can set the Primes property for longer strings, and the Chars to encode more than just lowercase strings. If decoding doesn't give you back the original string, you may need to increase the precision.

1 comment:

  1. Now this brings back memories! I read the same story in my teens and tried to write a program to do this. My C-fu was weak, though, and I didn't get very far. It's a neat concept, though, I wonder if anyone has done serious work on it?

    ReplyDelete