Why you cannot pickle generators

Joseph Turian wrote a post about pickling generators on his blog. In his post, he says:

However, generators become problematic when you want to persist your experiment’s state in order to later restart training at the same place. Unfortunately, you can’t pickle generators in Python. And it can be a bit of a PITA to workaround this, in order to save the training state.

This caught my attention because I was involved in the decision he cites to not allow generators to be pickled in CPython. Although Joseph’s examples are a bit convoluted, it is pretty clear why his generators cannot be pickled automatically: Python cannot pickle the operating system’s state, such as file descriptors.

Let’s ignore that problem for a moment and look at what we would need to do to pickle a generator. Since a generator is essentially a souped-up function, we would need to save its bytecode, which is not guaranteed to be backward-compatible between Python versions, and its frame, which holds the state of the generator, such as local variables, closures and the instruction pointer. The latter is rather cumbersome to accomplish, since it basically requires making the whole interpreter picklable. So, any support for pickling generators would require a large number of changes to CPython’s core.
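To make this a bit more concrete, here is a small sketch (written for Python 3; `countdown` is a made-up generator used only for illustration) that peeks at the frame a suspended generator carries around, and shows CPython refusing to pickle it:

```python
import pickle

def countdown(n):
    # Hypothetical generator, for illustration only.
    while n > 0:
        yield n
        n -= 1

gen = countdown(3)
first = next(gen)                            # run up to the first yield

# The suspended state lives in the generator's frame object.
saved_locals = dict(gen.gi_frame.f_locals)   # local variables at the yield
resume_offset = gen.gi_frame.f_lasti         # bytecode offset of the pause

try:
    pickle.dumps(gen)
    pickled = True
except TypeError:                            # CPython rejects generator objects
    pickled = False

print(first, saved_locals, pickled)
```

The locals are easy to get at; it is the frame as a whole, tied to the bytecode and the interpreter, that has no portable representation.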

Now, if an object unsupported by pickle (e.g., a file handle, a socket, a database connection, etc.) occurs in the local variables of a generator, then that generator could not be pickled automatically, regardless of any pickle support for generators we might implement. So in that case, you would still need to provide custom __getstate__ and __setstate__ methods. This problem renders any pickling support for generators rather limited.

Anyway, if you need such a feature, then look into Stackless Python, which does all of the above. And since Stackless’s interpreter is picklable, you also get process migration for free. This means you can interrupt a tasklet (the name for Stackless’s green threads), pickle it, send the pickle to another machine, unpickle it, resume the tasklet, and voilà, you’ve just migrated a process. This is a freaking cool feature!

But in my humble opinion, the best solution to this problem is to rewrite the generators as simple iterators (i.e., objects with a __next__ method). Iterators are easy to pickle, and space-efficient too, because their state is explicit. You would still need to handle objects representing some external state explicitly, however; you cannot get around this.
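As a rough sketch of what I mean (Python 3; the `LineIterator` class and its attribute names are invented for this example, and it assumes the file does not change between pickling and unpickling), here is an iterator over the lines of a file that keeps its position explicitly and reopens the file when unpickled:

```python
import os
import pickle
import tempfile

class LineIterator:
    """Iterate over the lines of a file; the state is just a path and an offset."""

    def __init__(self, path):
        self.path = path
        self.offset = 0                    # byte offset of the next unread line
        self._file = open(path, "rb")      # external state: not picklable

    def __iter__(self):
        return self

    def __next__(self):
        line = self._file.readline()
        if not line:
            raise StopIteration
        self.offset = self._file.tell()
        return line.decode("utf-8").rstrip("\n")

    def __getstate__(self):
        # Leave the file object out; keep only what is needed to reopen it.
        return {"path": self.path, "offset": self.offset}

    def __setstate__(self, state):
        self.path = state["path"]
        self.offset = state["offset"]
        self._file = open(self.path, "rb")
        self._file.seek(self.offset)

# Usage: consume one line, snapshot the iterator, then resume from the snapshot.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("a\nb\nc\n")

it = LineIterator(tmp.name)
first = next(it)
snapshot = pickle.dumps(it)
it._file.close()

restored = pickle.loads(snapshot)
rest = list(restored)                      # continues after the first line
restored._file.close()
os.remove(tmp.name)
print(first, rest)
```

The file handle is exactly the kind of external state mentioned above: it never goes into the pickle, only the information needed to recreate it.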

Porting your code to Python 3

See the plain HTML version.

The following is a write-up of the presentation I gave to a group of Python developers at Montreal Python 5 on February 26th. This is basically an HTML-ified copy of the notes I prepared before the presentation. I haven’t done any editing, so expect a few grammar mistakes here and there. My complete presentation slides are available here. A video was taped and should be released in the upcoming weeks (I will post a link here when I finally get my hands on it). Please note that if you’re looking for a more complete (and more accurate) guide to Python 3, I highly recommend that you read the What’s New In Python 3.0 document and the Python Enhancement Proposals numbered above 3000.

You may wonder why we did Python 3 after all. The motivation was simple: to fix old warts and to clean up the language before it was too late. Python 3 is not a complete rewrite of Python; it is still pretty much the good old Python you all love. But I am not going to lie. There are many changes in Python 3; many that will cause pain when you port your code; and so many that I won’t be able to cover them all in this talk. That is why I will focus only on the changes that you will need to know to port your code. If you want to learn about all the new and shiny features, you will need to visit the python.org website and the online documentation of Python 3.

In the second part of this presentation, I will go over the steps needed to port a real library to Python 3. Hopefully, this part will give you the basic knowledge and tools to tackle the problems linked to the migration.

Finally, I will give you an insider’s view of the upcoming changes in Python 3.1, which is supposed to be released later this year.


How to not switch to Dvorak

Once in a while, I practice to improve my touch typing skills. Most of the time, I just find some online stuff or use KTouch. But today, I wanted to try something different. I always hear good things about the Dvorak keyboard layout — i.e., how it’s supposedly more efficient and more comfortable than the Qwerty layout. Being a curious person, I wanted to test this out.

So when I opened up KTouch, I selected the Dvorak lecture instead of the typical Qwerty one. The first lessons were fairly easy. As I went through the lecture, I managed to keep a fairly high pace and accuracy (about 210 characters per minute with 95% accuracy). At about the fifth or sixth lesson, I said to myself: “Wow, I must have been a Dvorak typist in another life.” I was really impressed by how quickly I had learnt the basics of the layout, and I was indeed starting to believe that the Dvorak layout was vastly superior to Qwerty.

Shortly after, I was sold. At this point, I was thinking about how I was going to remap my Emacs key bindings.

However, when I got to the tenth lesson, I found something strange, very strange. The letter ‘q’ on the Dvorak layout was in the upper row on the left — exactly where it is on the Qwerty layout.

I stopped typing for a second…

…and looked at the keyboard displayed on the screen.

“asdf asdf asdf”

Oops! I had forgotten to change the actual layout of my keyboard. So, I was still using Qwerty.

Now, I realize that I have been a victim of what they call the “placebo effect”. This little anecdote has certainly taught me to be more careful, in the future, when trying something new sold as “better”.

Shell tricks: shorthands

Even with tab completion, typing long commands is tedious. But there’s something even worse: typing the same long commands again, and again, and again… So how do you solve that? It’s simple: you shorten them. Surprising, huh? Okay, enough theory; let me show you some examples.

Here’s a tedious command of Type-A:

% sudo aptitude install zsh

Look at it carefully, since you will need to hunt these long commands down until none remain. Now, let me explain how you execute such a command. Open up your personal shell initialization file (e.g., ~/.bashrc for Bash, ~/.zshrc for Zsh, etc.). Then, add the following:

alias spkgi="sudo aptitude install"

Reload your shell and finally, enjoy:

% spkgi zsh

Now I can introduce, as you can deduce, other shortened commands that you can produce and reproduce:

# Package Management
alias pkg="aptitude"
alias spkg="sudo aptitude"
alias spkgi="sudo aptitude install"
alias spkgu="sudo aptitude safe-upgrade"
alias spkgr="sudo aptitude remove"
alias spkgd="sudo apt-get build-dep"

# Miscellaneous Helpers
alias nc="rlwrap nc"
alias e=$EDITOR
alias se=sudoedit
alias reload="source ~/.zshrc"
alias g=egrep

After the Type-A tedious commands, we have the Type-S ones. To execute these, you will need some sort of special shell support. So, here are some examples of the Type-S monstrosity:

% find Lib/ -name '*.c' -print0 | xargs -0 grep ^PyErr
% find -name '*.html' -print0 | xargs -0 rename 's/\.html$/.var/'
% find -name '*.patch' -print0 | xargs -0 -I {} cp {} patches/

I hope you start to see some patterns (if you don’t, then try harder). The first one could (and should) be rewritten as:

% rgrep --include='*.c' ^PyErr Lib/

But that isn’t short enough for me, so I have a short helper:

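function rg {
    # Usage: rg FILE-PATTERN REGEX [PATH...] [GREP-OPTIONS...]
    local filepat="$1"
    local pat="$2"
    shift 2
    grep -Er --include="$filepat" "$pat" "${@:-.}"
}
# In Zsh, 'noglob' turns off globbing.
# (e.g., "noglob echo *" outputs "*")
alias rg='noglob rg'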

It is lovely to use:

% rg *.c ^PyErr Lib/
% rg *.c PyErr_Restore . -C 10 | less
% rg *.[ch] stringlib
% rg *.c ^[a-zA-Z]*_dealloc Modules/ Objects/

The second example is quite similar to the previous one. However, the find/rename combination is much less common (at least for me) than the find/grep one. This one needs to be broken into pieces. One obvious thing to factor out is the find -name, with an alias:

alias fname="noglob find -name"

Using this alias, you can rewrite the second example as:

% fname *.html -print0 | xargs -0 rename 's/\.html$/.var/'

It’s better, but it’s not short enough yet. The ugly part of this command is the -print0 | xargs -0. I hate to type that. Wouldn’t it be nice if we could define an alias for it? How about:

alias each="-print0 | xargs -0"

Unfortunately, that doesn’t work, since aliases are only expanded if they are in the command position. Luckily, Zsh has a neat feature called global aliases, which does exactly what we want.

alias -g each="-print0 | xargs -0"

With this feature of Zsh, the second example becomes:

% fname *.html each rename 's/\.html$/.var/'

Now, we can also attack the third one:

% fname *.patch each -I {} cp {} patches/

It is possible to shorten it a bit more by defining another alias combining each and -I {}, but that won’t make a big difference.

Finally, there are the Type-R tedious commands. These are hard to avoid, unless you’re careful. Here are, again, some ridiculous examples to help you recognize these redundant commands:

% gcc -o stackgrow stackgrow.c
% pkg show emacs-snapshot-bin-common emacs-snapshot-common emacs-snapshot-gtk emacs-snapshot
% cat ../lispref.patch ../lwlib.patch ../etc.patch | patch -p1

To reduce these, you don’t need to change your shell configuration; you change your habits instead. Using alternations (also known as brace expansion, which is non-standard but supported by most shells), you can rewrite the first two examples as:

% gcc -o stackgrow{,.c}
% pkg show emacs-snapshot{{-bin,}-common,-gtk,}

Now, you are surely asking yourself: “what is different about the third one?” Well, think about it. Got it? No? Ah, come on, it is easy. Here’s a hint:

% echo 'cat ../{lispref,lwlib,etc}.patch | patch -p1' | wc -c
% echo 'cat ../lispref.patch ../lwlib.patch ../etc.patch | patch -p1' | wc -c

You like my hint, don’t you? Here’s the answer:

% echo 'cat ../li\t ../lw\t ../et\t | patch -p1' | wc -c

Tab completion doesn’t work well with prefix alternations. Even if the command using alternation is shorter, it still doesn’t beat good old tab completion.

And that’s all folks. I surely have plenty of other tricks to show, but that will be for the other posts of this short series.

Pretty Emacs Reloaded

Update: If you are using Ubuntu 8.04 LTS “Hardy Heron” or Ubuntu 8.10 “Intrepid Ibex”, use the packages in the PPA of the Ubuntu Emacs Lisp team, instead of the packages referenced here. For Ubuntu 9.04 “Jaunty Jackalope” and newer, use the packages in Ubuntu repositories.

My popular[1] Pretty Emacs package just got a tad better. I transferred the package to the brand new PPA service provided by Launchpad. So, what’s new about the package? First, I am glad to announce the long-awaited amd64 support. Also, I am adding Gutsy Gibbon to the list of supported distributions.

To use the updated package on Ubuntu 6.10 “Edgy Eft”, add the following lines to your /etc/apt/sources.list file:

deb http://ppa.launchpad.net/avassalotti/ubuntu edgy main
deb-src http://ppa.launchpad.net/avassalotti/ubuntu edgy main

To use the package on Ubuntu 7.04 “Feisty Fawn”, add the following lines to your /etc/apt/sources.list file:

deb http://ppa.launchpad.net/avassalotti/ubuntu feisty main
deb-src http://ppa.launchpad.net/avassalotti/ubuntu feisty main

To use the package on the development version of Ubuntu “Gutsy Gibbon”, add the following lines to your /etc/apt/sources.list file:

deb http://ppa.launchpad.net/avassalotti/ubuntu gutsy main
deb-src http://ppa.launchpad.net/avassalotti/ubuntu gutsy main

Unfortunately, if you still use Ubuntu 6.06 "Dapper Drake", you will have to keep using the older package release from my original repository. I still support Ubuntu 6.06, but I won't update the package with newer snapshots.

After adding the repository to your software source list, upgrade your version of the package with:

sudo aptitude upgrade

If you do not have a previous version of the package already installed and you desire to install it, do this instead:

sudo aptitude install emacs-snapshot emacs-snapshot-el

When upgrading the package you might get the following warning message:

WARNING: untrusted versions of the following packages will be installed!

Untrusted packages could compromise your system's security. You should only proceed with the installation if you are certain that this is what you want to do.

This is due to a bug in the PPA system. I believe that it will be resolved quickly. So, you can safely ignore the warning message for the moment.

As a final note, thank you everyone for trusting me and giving me some great feedback about the package. I would like to give special thanks to Romain Francoise and Michael Olson for their work on emacs-snapshot and emacs22, respectively, during this summer.

  1. A rough estimate tells me there are over 30 000 people using my package, of whom 88% are Feisty Fawn users and 11% are Edgy Eft users.

Minor annoyance with Planet

Do you know how to fix Planet or WordPress, so when I edit an old post it does not pop back on Planet?

I do edit some of my posts, in particular the Pretty Emacs one, fairly often. I love to have my blog aggregated, but I would hate spamming Planet Ubuntu readers with my old posts. Therefore, if I cannot fix this little annoyance, I will have no other choice but to remove myself from Planet Ubuntu.

Summer of Code Weekly #4

All is well for me and my project. I finished the merge of cStringIO and StringIO, and I am now moving on to the more challenging cPickle/pickle merge. During the last two weeks, I mostly spent my time analyzing the pickle module and thinking about how I will clean up cPickle. My current plan is:

  1. Make cPickle’s source code conform to PEP-7.
  2. Remove the dependency on the now obsolete cStringIO.
  3. Benchmark cPickle and pickle.
  4. Add subclassing support to Pickler/Unpickler.
  5. Reduce the size of cPickle’s source code based on the bottlenecks found by the benchmarks.

Hopefully, the cPickle/pickle merge will be as smooth (and as fun) as the cStringIO/StringIO merge.

Pickle: An interesting stack language

The pickle module provides a convenient method to add data persistence to your Python programs. How it does that is pure magic to most people. However, in reality, it is simple. The output of a pickle is a “program” able to create Python data-structures. A limited stack language is used to write these programs. By limited, I mean you can’t write anything fancy like a for-loop or an if-statement. Yet, I found it interesting to learn. That is why I would like to share my little discovery.

Throughout this post, I use a simple interpreter to load pickle streams. Just copy-and-paste the following code in a file:

import code
import pickle
import sys

sys.ps1 = "pik> "
sys.ps2 = "...> "
banner = "Pik -- The stupid pickle loader.\nPress Ctrl-D to quit."

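class PikConsole(code.InteractiveConsole):
    def runsource(self, source, filename="<stdin>"):
        # A complete pickle stream always ends with the STOP symbol ('.').
        if not source.endswith(pickle.STOP):
            return True  # more input is needed
        try:
            print repr(pickle.loads(source))
        except Exception, exc:
            # Report malformed streams instead of killing the console.
            print >> sys.stderr, repr(exc)
        return False

pik = PikConsole()
pik.interact(banner)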

Then, launch it with Python:

$ python pik.py
Pik -- The stupid pickle loader.
Press Ctrl-D to quit.

So, nothing crazy yet. The easiest objects to create are the empty ones. For example, to create an empty list:

pik> ].
[]

Similarly, you can also create a dictionary and a tuple:

pik> }.
{}
pik> ).
()

Notice that every pickle stream ends with a period. That symbol pops the topmost object from the stack and returns it. So, let’s say you pile up a series of integers and end the stream. Then, the result will be the last item you entered:

pik> I1
...> I2
...> I3
...> .
3

As you can see, an integer starts with the symbol ‘I’ and ends with a newline. Floating-point numbers (‘F’), strings (‘S’) and Unicode strings (‘V’) are represented in a similar fashion:

pik> F1.0
...> .
1.0
pik> S'abc'
...> .
'abc'
pik> Vabc
...> .
u'abc'

Now that you know the basics, we can move to something slightly more complex — constructing compound objects. As you will see later, tuples are everywhere in Python, so let’s begin with that one:

pik> (I1
...> S'abc'
...> F2.0
...> t.
(1, 'abc', 2.0)

There are two new symbols in this example, ‘(’ and ‘t’. The ‘(’ is simply a marker. It is an object on the stack that tells the tuple builder, ‘t’, when to stop. The tuple builder pops items from the stack until it reaches a marker. Then, it creates a tuple with these items and pushes this tuple back on the stack. You can use multiple markers to construct a nested tuple:

pik> (I1
...> (I2
...> I3
...> tt.
(1, (2, 3))
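If you would rather not use the Pik console, the same stream can be fed straight to pickle.loads. The snippet below is a small sketch for Python 3, where streams are bytes objects rather than str; the names `stream`, `nested` and `roundtrip` are mine:

```python
import pickle

# Hand-written stream for (1, (2, 3)): two markers, two tuple builders.
stream = b"(I1\n(I2\nI3\ntt."
nested = pickle.loads(stream)

# For comparison, let pickle write the oldest text protocol itself and read it back.
roundtrip = pickle.loads(pickle.dumps((1, (2, 3)), protocol=0))
print(nested, roundtrip)
```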

You use a similar method to build a list or a dictionary:

pik> (I0
...> I1
...> I2
...> l.
[0, 1, 2]
pik> (S'red'
...> I00
...> S'blue'
...> I01
...> d.
{'blue': True, 'red': False}

The only difference is that dictionary items are packed as key/value pairs. Note that I slipped in the symbols for False and True, which look like the integers 0 and 1, but with an extra leading zero.

Like tuples, you can nest lists and dictionaries:

pik> ((I1
...> I2
...> t(I3
...> I4
...> ld.
{(1, 2): [3, 4]}

There is another method for creating lists or dictionaries. Instead of using a marker to delimit a compound object, you create an empty one and add stuff to it:

pik> ]I0
...> aI1
...> aI2
...> a.
[0, 1, 2]

The symbol ‘a’ means “append”. It pops an item and a list; appends the item to the list; and finally, pushes the list back on the stack. Here is how you build a nested list with this method:

pik> ]I0
...> a]I1
...> aI2
...> aa.
[0, [1, 2]]

If this is not cryptic enough for you, consider this:

pik> (lI0
...> a(lI1
...> aI2
...> aa.
[0, [1, 2]]

Instead of using the empty list symbol, ‘]’, I used a marker immediately followed by a list builder to create an empty list. That is the notation the Pickler object uses, by default, when dumping objects.

Like lists, dictionaries can be constructed using a similar method:

pik> }S'red'
...> I1
...> sS'blue'
...> I2
...> s.
{'blue': 2, 'red': 1}

However, to set items in a dictionary, you use the symbol ‘s’, not ‘a’. Unlike ‘a’, it takes a key/value pair instead of a single item.

You can build recursive data-structures, too:

pik> (Vzoom
...> lp0
...> g0
...> a.
[u'zoom', [...]]

The trick is to use a “register” (or, as it is called in pickle, a memo). The ‘p’ symbol (for “put”) copies the top item of the stack into a memo. Here, I used ‘0’ for the name of the memo, but it could have been anything. To get the item back, you use the symbol ‘g’. It will copy an item from a memo and put it on top of the stack.
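To tie the pieces together, the symbols covered so far fit in a toy interpreter. This is only a sketch in Python 3, and not how the pickle module is actually written: the names `run`, `pop_to_mark` and `MARK` are made up, strings are taken verbatim without unescaping, and memo names are kept as plain strings.

```python
MARK = object()  # sentinel pushed on the stack by '('

def pop_to_mark(stack):
    """Pop items down to the nearest marker; return them in push order."""
    items = []
    while True:
        top = stack.pop()
        if top is MARK:
            return items[::-1]
        items.append(top)

def run(stream):
    """Interpret a small subset of the text pickle language (str in, object out)."""
    stack, memo, pos = [], {}, 0

    def read_line():
        # Read the argument of the current symbol, up to the next newline.
        nonlocal pos
        end = stream.index("\n", pos)
        line = stream[pos:end]
        pos = end + 1
        return line

    while True:
        op = stream[pos]
        pos += 1
        if op == ".":                      # stop: return the top of the stack
            return stack.pop()
        elif op == "(":
            stack.append(MARK)
        elif op == "I":
            text = read_line()
            if text in ("00", "01"):       # spellings of False and True
                stack.append(text == "01")
            else:
                stack.append(int(text))
        elif op == "F":
            stack.append(float(read_line()))
        elif op == "S":
            stack.append(read_line()[1:-1])  # strip the quotes, ignore escapes
        elif op == "V":
            stack.append(read_line())
        elif op == "]":
            stack.append([])
        elif op == "}":
            stack.append({})
        elif op == ")":
            stack.append(())
        elif op == "t":
            stack.append(tuple(pop_to_mark(stack)))
        elif op == "l":
            stack.append(pop_to_mark(stack))
        elif op == "d":
            items = pop_to_mark(stack)
            stack.append(dict(zip(items[0::2], items[1::2])))
        elif op == "a":                    # append: pop the item, list stays on top
            item = stack.pop()
            stack[-1].append(item)
        elif op == "s":                    # setitem: pop value and key, dict stays
            value = stack.pop()
            key = stack.pop()
            stack[-1][key] = value
        elif op == "p":                    # put: remember the top item
            memo[read_line()] = stack[-1]
        elif op == "g":                    # get: push a remembered item
            stack.append(memo[read_line()])
        else:
            raise ValueError("unsupported symbol %r" % op)

nested_list = run("(lI0\na(lI1\naI2\naa.")
colors = run("}S'red'\nI1\nsS'blue'\nI2\ns.")
recursive = run("(Vzoom\nlp0\ng0\na.")
print(nested_list, colors, recursive)
```

Notice how short each branch is: every symbol either pushes something, or pops a few items and pushes the result back.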

But, what about sets? Now, we have a small problem, since there is no special notation for building sets. The only way to build a set is to call the built-in function set() on a list (or a tuple):

pik> c__builtin__
...> set
...> ((S'a'
...> S'a'
...> S'b'
...> ltR.
set(['a', 'b'])

There are a few new things here. The ‘c’ symbol retrieves an object from a module and puts it on the stack. And the reduce symbol, ‘R’, applies a function to a tuple of arguments. The semantics are the same as before: ‘R’ pops a tuple and a function from the stack, calls the function with the tuple as its arguments, then pushes the result back on the stack. So, the above example is roughly equivalent to the following in Python:

>>> import __builtin__
>>> apply(__builtin__.set, (['a', 'a', 'b'],))

Or, using the star notation:

>>> __builtin__.set(*(['a', 'a', 'b'],))

And, that is the same thing as writing:

>>> set(['a', 'a', 'b'])

Or, even shorter, using the set notation from the upcoming Python 3000:

>>> {'a', 'a', 'b'}

These two new symbols, ‘c’ and ‘R’, allow us to execute arbitrary code from the standard library. So, you must be careful to never load untrusted pickle streams. Someone malicious could easily slip a command into the stream to delete your data. Meanwhile, you can use that power for something less evil, like launching a clock:

pik> cos
...> system
...> (S'xclock'
...> tR.
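If you really have to load streams you do not fully trust, one common mitigation is to override the unpickler’s find_class method, which is what the ‘c’ symbol goes through. Below is a minimal sketch for Python 3 (where `__builtin__` is spelled `builtins`); the allow-list and the names `RestrictedUnpickler` and `restricted_loads` are choices made for this example, and this reduces the risk rather than eliminating it:

```python
import builtins
import io
import pickle

SAFE_BUILTINS = {"set", "frozenset", "range"}   # illustrative allow-list

class RestrictedUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        # Called whenever the stream asks for a global, e.g. via 'c'.
        if module == "builtins" and name in SAFE_BUILTINS:
            return getattr(builtins, name)
        raise pickle.UnpicklingError(
            "global '%s.%s' is forbidden" % (module, name))

def restricted_loads(data):
    """Like pickle.loads, but only allow-listed globals can be loaded."""
    return RestrictedUnpickler(io.BytesIO(data)).load()

# The set example still works...
allowed = restricted_loads(b"cbuiltins\nset\n((S'a'\nS'a'\nS'b'\nltR.")

# ...but the clock example is rejected before os.system is ever looked up.
try:
    restricted_loads(b"cos\nsystem\n(S'xclock'\ntR.")
    blocked = False
except pickle.UnpicklingError:
    blocked = True
print(allowed, blocked)
```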

Even if the language doesn’t support looping directly, that doesn’t stop you from using implicit loops:

pik> c__builtin__
...> map
...> (cmath
...> sqrt
...> c__builtin__
...> range
...> (I1
...> I10
...> tRtR.
[1.0, 1.4142135623730951, 1.7320508075688772, 2.0, 2.2360679774997898,
2.4494897427831779, 2.6457513110645907, 2.8284271247461903, 3.0]

I am sure you could fake an if-statement by defining it as a function, and then loading it from a module.

def my_if(cond, then_val, else_val):
    if cond:
        return then_val
    return else_val

That works well for simple cases:

>>> my_if(True, 1, 0)
1
>>> my_if(False, 1, 0)
0

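However, you run into some problems if you mix that with recursion, because Python evaluates every argument, including the recursive call, before my_if even gets a chance to run: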

>>> def factorial(n):
...     return my_if(n == 1,
...                  1, n * factorial(n - 1))
>>> factorial(2)
RuntimeError: maximum recursion depth exceeded in cmp

On the other hand, I don’t think you really want to create recursive pickle streams, unless you want to win an obfuscated code contest.
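For what it is worth, in ordinary Python code the blow-up above goes away once the branches are delayed. Here is a sketch with a made-up `lazy_if` helper that takes zero-argument functions instead of values (Python 3 syntax, and not something you could express in a pickle stream as easily):

```python
def lazy_if(cond, then_fn, else_fn):
    # The branches are zero-argument callables; only the chosen one is called.
    if cond:
        return then_fn()
    return else_fn()

def factorial(n):
    return lazy_if(n <= 1,
                   lambda: 1,
                   lambda: n * factorial(n - 1))

result = factorial(5)
print(result)
```

The recursive call is now wrapped in a lambda, so it only happens when the condition is false.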

That is about all I had to say about this simple stack language. There are a few things I haven’t told you about, but I am sure you will be able to figure them out. Just read the source code of the pickle module. And take a look at the pickletools module, which provides a disassembler for pickle streams. As always, comments are welcome.
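As a starting point, here is a small sketch of the disassembler at work on the tuple example from earlier (Python 3, so the stream is a bytes object; the variable names are mine). It writes the listing to a buffer through the `out` parameter of pickletools.dis and prints it:

```python
import io
import pickletools

stream = b"(I1\nS'abc'\nF2.0\nt."      # the (1, 'abc', 2.0) stream
buffer = io.StringIO()
pickletools.dis(stream, out=buffer)     # one line per symbol in the stream
listing = buffer.getvalue()
print(listing)
```

Each line of the listing gives the position of a symbol in the stream, the symbol itself, its official name (MARK, INT, TUPLE, STOP and so on) and its argument, if any.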

Summer of Code Weekly #3

During this third week of the Summer of Code, I found it very difficult to concentrate on my work. I have been a lightbulb instead of a laser. The result was little code done. On the other hand, I learned a lot about other things. For example, I now finally understand assembly language, how to use gdb, the basics of the design of the Linux kernel, etc.

I also read the book “Producing Open Source Software”, by Karl Fogel. It is a really good primer on the world of free software. If you have a burning desire to contribute to open source projects, just like me, I highly recommend that you get your own copy, or read it online.

Summer of Code Weekly #2

I can confirm it now: this second week of coding was even better. It was harder on my brain cells, though. I am mostly done with the StringIO merge. I now have working implementations in C of the BytesIO and the StringIO objects. The only thing remaining to do, for these two modules, is polishing the unit tests. And that shouldn’t take me very long to do. So, in basically one week of work, I completed the merge of cStringIO. I am certainly proud of that.

Now, I will need to attack the cPickle and cProfile modules. I don’t know yet which one I will work on first. cPickle still seems very scary to me, and unlike cStringIO, it’s huge. It’s about five or six times bigger. cProfile, on the other hand, is about the same size as cStringIO and well documented. I even wonder if I need to code anything for cProfile. It will be a piece of cake to merge. Now, one question remains: should I take the cake now, or keep it for the end?