The pickle
module provides a convenient method to add data
persistence to your Python programs. How it does that, is pure magic
to most people. However, in reality, it is simple. The output of a
pickle
is a “program” able to create Python data-structures. A
limited stack language is used to write these programs. By limited, I
mean you can’t write anything fancy like a for-loop or an
if-statement. Yet, I found it interesting to learn. That is why I
would like to share my little discovery.
Throughout this post, I use a simple interpreter to load pickle
streams. Just copy-and-paste the following code in a file:
import code
import pickle
import sys
sys.ps1 = "pik> "
sys.ps2 = "...> "
banner = "Pik -- The stupid pickle loader.\nPress Ctrl-D to quit."
class PikConsole(code.InteractiveConsole):
def runsource(self, source, filename="<stdin>"):
if not source.endswith(pickle.STOP):
return True # more input is needed
try:
print repr(pickle.loads(source))
except:
self.showsyntaxerror(filename)
return False
pik = PikConsole()
pik.interact(banner)
Then, launch it with Python:
$ python pik.py
Pik -- The stupid pickle loader.
Press Ctrl-D to quit.
pik>
So, nothing crazy yet. The easiest objects to create are the empty
ones. For example, to create an empty list:
pik> ].
[]
Similarly, you can also create a dictionary and a tuple:
pik> }.
{}
pik> ).
()
Remark that every pickle stream ends with a period. That symbol pops
the topmost object from the stack and returns it. So, let’s say you
pile up a series of integers and end the stream. Then, the result will
be last item you entered:
pik> I1
...> I2
...> I3
...> .
3
As you see, an integer starts with the symbol ‘I’ and end with a
newline. Strings, and floating-point number are represented in a
similar fashion:
pik> F1.0
...> .
1.0
pik> S'abc'
...> .
'abc'
pik> Vabc
...> .
u'abc'
Now that you know the basics, we can move to something slightly more
complex — constructing compound objects. As you will see later,
tuples are everywhere in Python, so let’s begin with that one:
pik> (I1
...> S'abc'
...> F2.0
...> t.
(1, 'abc', 2.0)
There is two new symbols in this example, ‘(‘ and ‘t’. The ‘(‘ is
simply a marker. It is a object in the stack that tells the tuple
builder, ‘t’, when to stop. The tuple builder pops items from
the stack until it reaches a marker. Then, it creates a tuple with
these items and pushes this tuple back on the stack. You can use
multiple markers to construct a nested tuple:
pik> (I1
...> (I2
...> I3
...> tt.
(1, (2, 3))
You use a similar method to build a list or a dictionary:
pik> (I0
...> I1
...> I2
...> l.
[0, 1, 2]
pik> (S'red'
...> I00
...> S'blue'
...> I01
...> d.
{'blue': True, 'red': False}
The only difference is that dictionary items are packed by key/value
pairs. Note that I slipped in the symbols for True
and False
,
which looks like the integers 0 and 1, but with an extra zero.
Like tuples, you can nest lists and dictionaries:
pik> ((I1
...> I2
...> t(I3
...> I4
...> ld.
{(1, 2): [3, 4]}
There is another method for creating lists or dictionaries. Instead
of using a marker to delimit a compound object, you create an empty one
and add stuff to it:
pik> ]I0
...> aI1
...> aI2
...> a.
[0, 1, 2]
The symbols ‘a’ means “append”. It pops an item and a list; appends
the item to the list; and finally, pushes the list back on the stack.
Here how you do a nested list with this method:
pik> ]I0
...> a]I1
...> aI2
...> aa.
[0, [1, 2]]
If this is not cryptic enough for you, consider this:
pik> (lI0
...> a(lI1
...> aI2
...> aa.
[0, [1, 2]]
Instead of using the empty list symbol, ‘]’, I used a marker
immediately followed by a list builder to create an empty list. That
is the notation the Pickler
object uses, by default, when dumping
objects.
Like lists, dictionaries can be constructed using a similar method:
pik> }S'red'
...> I1
...> sS'blue'
...> I2
...> s.
{'blue': 2, 'red': 1}
However, to set items to a dictionary you use the symbol ‘s’, not ‘a’.
Unlike ‘a’, it takes a key/value pair instead of a single item.
You can build recursive data-structures, too:
pik> (Vzoom
...> lp0
...> g0
...> a.
[u'zoom', [...]]
The trick is to use a “register” (or as called in pickle
, a memo).
The ‘p’ symbol (for “put”) copies the top item of the stack in a memo.
Here, I used ’0′ for the name of the memo, but it could have been
anything. To get the item back, you use the symbol ‘g’. It will
copy an item from a memo and put it on top of the stack.
But, what about sets? Now, we have a small problem, since there is no
special notation for building sets. The only way to build a set is to
call the built-in function set()
on a list (or a tuple):
pik> c__builtin__
...> set
...> ((S'a'
...> S'a'
...> S'b'
...> ltR.
set(['a', 'b'])
There is a few new things here. The ‘c’ symbol retrieves an object
from a module and puts it on the stack. And the reduce symbol, ‘R’,
apply a tuple to a function. Same semantic again, ‘R’ pops a tuple
and a function from the stack, then pushes the result back on it. So,
the above example is roughly the equivalent of the following in
Python:
>>> import __builtin__
>>> apply(__builtin__.set, (['a', 'a', 'b'],))
Or, using the star notation:
>>> __builtin__.set(*(['a', 'a', 'b'],))
And, that is the same thing as writing:
>>> set(['a', 'a', 'b'])
Or shorter even, using the set notation from the upcoming Python 3000:
>>> {'a', 'a', 'b'}
These two new symbols, ‘t’ and ‘R’, allows us to execute arbitrary
code from the standard library. So, you must be careful to never
load untrusted pickle streams. Someone malicious could easily slip in
the stream a command to delete your data. Meanwhile, you can use that
power for something less evil, like launching a clock:
pik> cos
...> system
...> (S'xclock'
...> tR.
Even if the language doesn’t support looping directly, that doesn’t
stop you from using the implicit loops:
pik> c__builtin__
...> map
...> (cmath
...> sqrt
...> c__builtin__
...> range
...> (I1
...> I10
...> tRtR.
[1.0, 1.4142135623730951, 1.7320508075688772, 2.0, 2.2360679774997898,
2.4494897427831779, 2.6457513110645907, 2.8284271247461903, 3.0]
I am sure you could you fake an if-statement by defining it as a
function, and then load it from a module.
def my_if(cond, then_val, else_val):
if cond:
return then_val
else:
return else_val
That works well for simple cases:
>>> my_if(True, 1, 0)
1
>>> my_if(False, 1, 0)
0
However, you run into some problems if mix that with recursion:
>>> def factorial(n):
... return my_if(n == 1,
... 1, n * factorial(n - 1))
...
>>> factorial(2)
RuntimeError: maximum recursion depth exceeded in cmp
On the other hand, I don’t think you really want to create recursive
pickle streams, unless you want to win an obfuscated code contest.
That is about all I had to say about this simple stack language.
There is a few things haven’t told you about, but I sure you will be
able figure them out. Just read the source code of the pickle
module. And, take a look at the pickletools
module,
which provides a disassembler for pickle streams. As always, comments
are welcome.