Pickle: An interesting stack language

The pickle module provides a convenient method to add data persistence to your Python programs. How it does that, is pure magic to most people. However, in reality, it is simple. The output of a pickle is a “program” able to create Python data-structures. A limited stack language is used to write these programs. By limited, I mean you can’t write anything fancy like a for-loop or an if-statement. Yet, I found it interesting to learn. That is why I would like to share my little discovery.

Throughout this post, I use a simple interpreter to load pickle streams. Just copy-and-paste the following code in a file:

import code
import pickle
import sys

sys.ps1 = "pik> "
sys.ps2 = "...> "
banner = "Pik -- The stupid pickle loader.\nPress Ctrl-D to quit."

class PikConsole(code.InteractiveConsole):
    def runsource(self, source, filename="<stdin>"):
        if not source.endswith(pickle.STOP):
            return True  # more input is needed
        try:
            print repr(pickle.loads(source))
        except:
            self.showsyntaxerror(filename)
        return False

pik = PikConsole()
pik.interact(banner)

Then, launch it with Python:

$ python pik.py
Pik -- The stupid pickle loader.
Press Ctrl-D to quit.
pik>

So, nothing crazy yet. The easiest objects to create are the empty ones. For example, to create an empty list:

pik> ].
[]

Similarly, you can also create a dictionary and a tuple:

pik> }.
{}
pik> ).
()

Remark that every pickle stream ends with a period. That symbol pops the topmost object from the stack and returns it. So, let’s say you pile up a series of integers and end the stream. Then, the result will be last item you entered:

pik> I1
...> I2
...> I3
...> .
3

As you see, an integer starts with the symbol ‘I’ and end with a newline. Strings, and floating-point number are represented in a similar fashion:

pik> F1.0
...> .
1.0
pik> S'abc'
...> .
'abc'
pik> Vabc
...> .
u'abc'

Now that you know the basics, we can move to something slightly more complex — constructing compound objects. As you will see later, tuples are everywhere in Python, so let’s begin with that one:

pik> (I1
...> S'abc'
...> F2.0
...> t.
(1, 'abc', 2.0)

There is two new symbols in this example, ‘(‘ and ‘t’. The ‘(‘ is simply a marker. It is a object in the stack that tells the tuple builder, ‘t’, when to stop. The tuple builder pops items from the stack until it reaches a marker. Then, it creates a tuple with these items and pushes this tuple back on the stack. You can use multiple markers to construct a nested tuple:

pik> (I1
...> (I2
...> I3
...> tt.
(1, (2, 3))

You use a similar method to build a list or a dictionary:

pik> (I0
...> I1
...> I2
...> l.
[0, 1, 2]
pik> (S'red'
...> I00
...> S'blue'
...> I01
...> d.
{'blue': True, 'red': False}

The only difference is that dictionary items are packed by key/value pairs. Note that I slipped in the symbols for True and False, which looks like the integers 0 and 1, but with an extra zero.

Like tuples, you can nest lists and dictionaries:

pik> ((I1
...> I2
...> t(I3
...> I4
...> ld.
{(1, 2): [3, 4]}

There is another method for creating lists or dictionaries. Instead of using a marker to delimit a compound object, you create an empty one and add stuff to it:

pik> ]I0
...> aI1
...> aI2
...> a.
[0, 1, 2]

The symbols ‘a’ means “append”. It pops an item and a list; appends the item to the list; and finally, pushes the list back on the stack. Here how you do a nested list with this method:

pik> ]I0
...> a]I1
...> aI2
...> aa.
[0, [1, 2]]

If this is not cryptic enough for you, consider this:

pik> (lI0
...> a(lI1
...> aI2
...> aa.
[0, [1, 2]]

Instead of using the empty list symbol, ‘]’, I used a marker immediately followed by a list builder to create an empty list. That is the notation the Pickler object uses, by default, when dumping objects.

Like lists, dictionaries can be constructed using a similar method:

pik> }S'red'
...> I1
...> sS'blue'
...> I2
...> s.
{'blue': 2, 'red': 1}

However, to set items to a dictionary you use the symbol ‘s’, not ‘a’. Unlike ‘a’, it takes a key/value pair instead of a single item.

You can build recursive data-structures, too:

pik> (Vzoom
...> lp0
...> g0
...> a.
[u'zoom', [...]]

The trick is to use a “register” (or as called in pickle, a memo). The ‘p’ symbol (for “put”) copies the top item of the stack in a memo. Here, I used ’0′ for the name of the memo, but it could have been anything. To get the item back, you use the symbol ‘g’. It will copy an item from a memo and put it on top of the stack.

But, what about sets? Now, we have a small problem, since there is no special notation for building sets. The only way to build a set is to call the built-in function set() on a list (or a tuple):

pik> c__builtin__
...> set
...> ((S'a'
...> S'a'
...> S'b'
...> ltR.
set(['a', 'b'])

There is a few new things here. The ‘c’ symbol retrieves an object from a module and puts it on the stack. And the reduce symbol, ‘R’, apply a tuple to a function. Same semantic again, ‘R’ pops a tuple and a function from the stack, then pushes the result back on it. So, the above example is roughly the equivalent of the following in Python:

>>> import __builtin__
>>> apply(__builtin__.set, (['a', 'a', 'b'],))

Or, using the star notation:

>>> __builtin__.set(*(['a', 'a', 'b'],))

And, that is the same thing as writing:

>>> set(['a', 'a', 'b'])

Or shorter even, using the set notation from the upcoming Python 3000:

>>> {'a', 'a', 'b'}

These two new symbols, ‘t’ and ‘R’, allows us to execute arbitrary code from the standard library. So, you must be careful to never load untrusted pickle streams. Someone malicious could easily slip in the stream a command to delete your data. Meanwhile, you can use that power for something less evil, like launching a clock:

pik> cos
...> system
...> (S'xclock'
...> tR.

Even if the language doesn’t support looping directly, that doesn’t stop you from using the implicit loops:

pik> c__builtin__
...> map
...> (cmath
...> sqrt
...> c__builtin__
...> range
...> (I1
...> I10
...> tRtR.
[1.0, 1.4142135623730951, 1.7320508075688772, 2.0, 2.2360679774997898,
2.4494897427831779, 2.6457513110645907, 2.8284271247461903, 3.0]

I am sure you could you fake an if-statement by defining it as a function, and then load it from a module.

def my_if(cond, then_val, else_val):
    if cond:
        return then_val
    else:
        return else_val

That works well for simple cases:

>>> my_if(True, 1, 0)
1
>>> my_if(False, 1, 0)
0

However, you run into some problems if mix that with recursion:

>>> def factorial(n):
...     return my_if(n == 1,
...                  1, n * factorial(n - 1))
...
>>> factorial(2)
RuntimeError: maximum recursion depth exceeded in cmp

On the other hand, I don’t think you really want to create recursive pickle streams, unless you want to win an obfuscated code contest.

That is about all I had to say about this simple stack language. There is a few things haven’t told you about, but I sure you will be able figure them out. Just read the source code of the pickle module. And, take a look at the pickletools module, which provides a disassembler for pickle streams. As always, comments are welcome.

· RSS feed for comments on this post

10 Comments

  1. Brett said,

    June 18, 2007 @ 2:17 pm

    Nice post! I had never bothered to look at how pickle worked. It also now makes more sense why pickle exists on top of there also being marshal.

  2. Simon said,

    June 18, 2007 @ 3:13 pm

    Fun!

    This is exactly the kind of post I like to read.

  3. Jesse said,

    June 18, 2007 @ 6:46 pm

    This is really fantastic!

  4. grzywacz said,

    June 18, 2007 @ 7:11 pm

    A very nice post indeed. Thanks for sharing your knowledge. :)

  5. Scott Lamb said,

    June 18, 2007 @ 11:40 pm

    Very timely. I was just thinking that a streaming unpickler would be necessary to make git-p4‘s readP4Files not require crazy amounts of memory, and here you’ve described the format. Bookmarked.

  6. Arnar said,

    June 19, 2007 @ 4:27 am

    Cool article.

    The problem with my_if is that it’s not lazy. You can use lambda to make it lazy:

    def my_if(cond, then_val, else_val):
        if cond:
            return then_val()
        else:
            return else_val()
    
    def factorial(n):
        return my_if(n==1, lambda: 1, lambda: n * factorial(n-1))
    >>> factorial(2)
    2
    >>> factorial(15)
    1307674368000L
  7. Alexandre said,

    June 19, 2007 @ 10:47 am

    Thanks everyone for your cheerful comments. I am surprised by how this post caught up. I guess I share my knowledge more often, then.

    Arnar, yes indeed, that is a solution. That wouldn’t be usable in a pickle stream, however, since you cannot create functions in them.

  8. Eduardo Padoan said,

    June 19, 2007 @ 10:41 pm

    Why isn’t your blog aggregated at Planet Python yet? :) It seems that you are learning some cool things with GSoC! Bad suggestion: you could receive 2 strings in my_if and eval then, to make it lazy and usable in a pickle stream.

  9. Alexandre said,

    June 20, 2007 @ 9:46 am

    Eduardo, it is because I haven’t asked yet, since I don’t know whom to ask. I am already on Planet Python/SoC, though.

  10. Georg said,

    July 20, 2007 @ 8:52 am

    Nice post! I do hope you add builtin support for sets (and frozensets) to the unified pickle for Py3k, while you’re at it :)