More about Pickling Time Zones in Python

In my prior post on using dateutil for time zone information in Python, I half-jokingly added a bit at the end about the size of pickled pytz tzinfo instances compared to pickled dateutil tzinfo instances. Paul Ganssle later emailed me to speculate that pytz’s pickles are smaller because pytz stores only the key (name) of the time zone when you pickle one of its instances, whereas a pickled dateutil instance has all of the information about its time zone.

I’m not one to ignore the generous gift of someone’s time and knowledge, so let me dig into that a bit further. (Though Paul emailed me about four months ago, so I guess I am one to postpone acknowledgment of said gift.)

The TL;DR here is that Paul is basically right, and understanding this difference could be useful if you find yourself pickling/unpickling tzinfo instances from either library.

A Side Trip into Python Pickling

Let’s first look at what happens when we pickle a simple object in Python 2.7.15:

>>> class A(object):
...     def __init__(self, x):
...         self.x = x
...
>>> an_instance = A(42)
>>> import pickle, pickletools
>>> pickletools.dis(pickle.dumps(an_instance))
    0: c    GLOBAL     'copy_reg _reconstructor'
   25: p    PUT        0
   28: (    MARK
   29: c        GLOBAL     '__main__ A'
   41: p        PUT        1
   44: c        GLOBAL     '__builtin__ object'
   64: p        PUT        2
   67: N        NONE
   68: t        TUPLE      (MARK at 28)
   69: p    PUT        3
   72: R    REDUCE
   73: p    PUT        4
   76: (    MARK
   77: d        DICT       (MARK at 76)
   78: p    PUT        5
   81: S    STRING     'x'
   86: p    PUT        6
   89: I    INT        42
   93: s    SETITEM
   94: b    BUILD
   95: .    STOP
highest protocol among opcodes = 0

I read this as calling copy_reg._reconstructor with the class A, its base class object, and None (the NONE at offset 67), then using BUILD to apply {'x': 42} as the state of the new object. In other words, roughly:

obj = copy_reg._reconstructor(A, object, None)
obj.__dict__.update({'x': 42})
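
The disassembly can also be replayed by hand. This is only a sketch, and it is written for Python 3, where the module is named copyreg rather than the copy_reg of Python 2.7.15. Note that _reconstructor is a private helper, so treat this as a way to understand the opcodes, not as an API to build on.

```python
import copyreg  # named copy_reg in Python 2


class A(object):
    def __init__(self, x):
        self.x = x


# REDUCE: call _reconstructor(A, object, None). Because the base class is
# object, this allocates an A without running A.__init__.
obj = copyreg._reconstructor(A, object, None)

# BUILD: A defines no __setstate__, so the state dict is merged into the
# instance's __dict__.
obj.__dict__.update({'x': 42})

print(type(obj).__name__, obj.x)
```

The point of splitting it this way is that the constructor arguments are never replayed; the object is allocated empty and its attributes are restored afterwards.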

Pickling UTC

Now let’s see what we get when we pickle dateutil’s UTC instance:

>>> import dateutil.tz
>>> pickletools.dis(pickle.dumps(dateutil.tz.UTC))
    0: c    GLOBAL     'copy_reg _reconstructor'
   25: p    PUT        0
   28: (    MARK
   29: c        GLOBAL     'dateutil.tz.tz tzutc'
   51: p        PUT        1
   54: c        GLOBAL     'datetime tzinfo'
   71: p        PUT        2
   74: g        GET        2
   77: (        MARK
   78: t            TUPLE      (MARK at 77)
   79: R        REDUCE
   80: p        PUT        3
   83: t        TUPLE      (MARK at 28)
   84: p    PUT        4
   87: R    REDUCE
   88: p    PUT        5
   91: .    STOP
highest protocol among opcodes = 0

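Again, if I’m reading this right, this is roughly equivalent to the following call. Strictly speaking, the inner REDUCE at offset 79 calls datetime.tzinfo() and passes that instance, not an empty tuple, as the third argument: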

copy_reg._reconstructor(dateutil.tz.tz.tzutc, datetime.tzinfo, ())

We can test that:

>>> import copy_reg, datetime, dateutil.tz.tz
>>> copy_reg._reconstructor(dateutil.tz.tz.tzutc, datetime.tzinfo, ())
tzutc()

Looks right to me.

Now, what do you get when you pickle pytz’s UTC?

>>> import pytz
>>> pickletools.dis(pickle.dumps(pytz.UTC))
    0: c    GLOBAL     'pytz _UTC'
   11: p    PUT        0
   14: (    MARK
   15: t        TUPLE      (MARK at 14)
   16: R    REDUCE
   17: p    PUT        1
   20: .    STOP
highest protocol among opcodes = 0

I believe this says to just call pytz._UTC():

>>> pytz._UTC()
<UTC>
>>> pytz._UTC() is pytz.UTC
True

As Paul pointed out to me in his email, pytz has some custom pickling code, which is why we don’t see copy_reg._reconstructor here.

Pickling Non-UTC Time Zones

As in my original post, the difference is more dramatic for non-UTC time zones. Although dateutil uses the default Python pickling behavior for its UTC object, it does customize pickling for the tzinfo objects it reads from the tz database. Let’s try pickling US central time[1] from dateutil:

>>> dateutil_central = dateutil.tz.gettz("America/Chicago")
>>> pickletools.dis(pickle.dumps(dateutil_central))
    0: c    GLOBAL     'dateutil.tz.tz tzfile'
   23: p    PUT        0
   26: (    MARK
   27: N        NONE
   28: S        STRING     '/usr/share/zoneinfo/America/Chicago'
   67: p        PUT        1
   70: t        TUPLE      (MARK at 26)
   71: p    PUT        2
   74: R    REDUCE
   75: p    PUT        3
   78: (    MARK
   79: d        DICT       (MARK at 78)
   80: p    PUT        4
   83: S    STRING     '_trans_list'
   98: p    PUT        5
  101: (    MARK
  102: I        INT        -1633276800
  115: I        INT        -1615158000
  128: I        INT        -1601848800
[...copious output omitted...]
 7842: S    STRING     '_ttinfo_std'
 7857: p    PUT        76
 7861: g    GET        30
 7865: s    SETITEM
 7866: b    BUILD
 7867: .    STOP
highest protocol among opcodes = 0
>>> len(pickle.dumps(dateutil_central))
7868

7,868 bytes! Looking at the output, I believe this confirms my guess from my earlier post, that a pickled dateutil instance has all the time zone information it needs. I’ll confirm that in a moment, but first let’s check out pytz’s pickle size for the same time zone:

>>> pytz_central = pytz.timezone('America/Chicago')
>>> pickletools.dis(pickle.dumps(pytz_central))
    0: c    GLOBAL     'pytz _p'
    9: p    PUT        0
   12: (    MARK
   13: S        STRING     'America/Chicago'
   32: p        PUT        1
   35: I        INT        -21060
   43: I        INT        0
   46: S        STRING     'LMT'
   53: p        PUT        2
   56: t        TUPLE      (MARK at 12)
   57: p    PUT        3
   60: R    REDUCE
   61: p    PUT        4
   64: .    STOP
highest protocol among opcodes = 0
>>> len(pickle.dumps(pytz_central))
65

This is a much smaller pickle, and as I said before, Paul gave me a hint as to why: pytz just pickles the key (time zone name) along with a bit of other information. When you unpickle it, pytz loads the data for that time zone from the tz database.
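
Both sizes above come from protocol 0, the text protocol that pickle.dumps uses by default in Python 2 (hence the “highest protocol among opcodes = 0” lines). As a rough illustration of how much the protocol alone matters, this sketch pickles just the three transition times visible in the dateutil output, once with protocol 0 and once with the binary protocol 2. It is written for Python 3 rather than 2.7.15, and it only shows the direction of the effect; the numbers for a full tzfile pickle will differ.

```python
import pickle

# The first three entries of _trans_list shown in the disassembly above.
transitions = [-1633276800, -1615158000, -1601848800]

# Protocol 0 writes each integer as ASCII text, as seen in this post.
text_pickle = pickle.dumps(transitions, protocol=0)

# Protocol 2 is a binary protocol that Python 2 can also read.
binary_pickle = pickle.dumps(transitions, protocol=2)

print(len(text_pickle), len(binary_pickle))
```

A binary protocol shrinks both libraries’ pickles, so it does not change the comparison: dateutil still stores the whole transition table and pytz still stores little more than a name.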

Pickled dateutil Instances Are Usable Without the Tz Database

I believe that unpickling and using a dateutil.tz.tz.tzfile instance will not need to reference the tz database. We can actually test this by denying dateutil access to the tz database on my system (/usr/share/zoneinfo), then unpickling and using one of dateutil’s tzinfo instances.[2]

First, pickle one of its instances out to a file:

>>> pickle.dump(dateutil_central, open('/tmp/central', 'wb'))

Next I will run this script as root (necessary because it uses chroot):

import os, os.path, pickle, datetime, dateutil.tz

# A few unpickling bits are not loaded until you use them, so use them
# before chroot.
pickle.loads(pickle.dumps(('foo', u'bar')))

os.chroot('/tmp')

# Confirm that we can't load a time zone by name because the tz
# database files are no longer accessible.
assert not os.path.exists('/usr/share/zoneinfo')
print "/usr/share/zoneinfo does not exist"
try:
    central = dateutil.tz.gettz('America/Chicago')
except ImportError as ex:
    print "dateutil gave expected ImportError: %s" % (ex,)
else:
    raise Exception(
        'should not have been able to load time zone, got: %r' % (central,))

# Unpickle and use our pickled dateutil tzinfo!  Works even when the
# tz database can't be accessed.
central = pickle.load(open('/central', 'rb'))
print "I unpickled this tzinfo:  ", central
print "naively, now is:          ", datetime.datetime.now()
utc_now = datetime.datetime.utcnow().replace(tzinfo=dateutil.tz.UTC)
print "now in UTC is:            ", utc_now
print "converted to central time:", utc_now.astimezone(central)

(If I weren’t so lazy I would have made this chroot somewhere other than /tmp, which is most likely quite insecure. Don’t do this at home!)

The result:

$ sudo python chrooted_demo.py
/usr/share/zoneinfo does not exist
dateutil gave expected ImportError: No module named zoneinfo
I unpickled this tzinfo:   tzfile('/usr/share/zoneinfo/America/Chicago')
naively, now is:           2018-12-26 13:21:37.416167
now in UTC is:             2018-12-26 19:21:37.416184+00:00
converted to central time: 2018-12-26 13:21:37.416184-06:00

This confirms that an unpickled dateutil tzinfo object has all the information it needs, and thus computations using unpickled instances will yield the same results no matter what version of the tz database is available at the time, or even if no tz database is available at all.
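
If chroot feels heavy-handed, a lighter (and weaker) check is to watch for file opens from inside Python. This is only a sketch, and it assumes Python 3.8 or later, where sys.addaudithook exists; it is not available in the Python 2.7.15 used above. The names record_tz_opens and opened_tz_paths are made up for illustration, and the plain dict stands in for a pickled tzinfo so that the example runs without dateutil installed.

```python
import pickle
import sys

TZ_DIR = "/usr/share/zoneinfo"  # tz database location used in this post
opened_tz_paths = []            # hypothetical log of tz database opens


def record_tz_opens(event, args):
    # The "open" audit event carries (path, mode, flags).
    if event == "open" and isinstance(args[0], str) and args[0].startswith(TZ_DIR):
        opened_tz_paths.append(args[0])


# Audit hooks cannot be removed; this one stays for the life of the process.
sys.addaudithook(record_tz_opens)

# Stand-in for unpickling a tzinfo: any opens under TZ_DIR during this call
# would be appended to opened_tz_paths.
blob = pickle.dumps({"zone": "America/Chicago"})
restored = pickle.loads(blob)
print(opened_tz_paths)
```

To use it for real, replace the stand-in with pickle.load on the file written earlier, do some arithmetic with the result, and then inspect opened_tz_paths. An empty list suggests the tz database was not read through Python’s file-opening machinery, though unlike chroot it proves nothing about code that bypasses it.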

Conclusion

Perhaps this is a trade-off: when you unpickle and use a dateutil tzinfo instance, you’re going to get the same results as you would have when using the original instance.[3] In contrast, unpickling a pytz tzinfo instance will load the information for that time zone from whatever tz database is included with whatever version of pytz you’re using at the time.

So if you want to make sure you get the same results when using a tzinfo object no matter when or where you unpickle it, dateutil seems superior. If you instead want to always be using the most recent version of the tz database when you unpickle a tzinfo, and/or if you want smaller pickles, pytz seems superior.
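
The trade-off can be boiled down to a toy example that needs neither library. Everything here is hypothetical: tz_db is a stand-in for a tz database, the offsets are illustrative, and the four helper functions are invented names. “By value” mimics what dateutil appears to do and “by key” mimics pytz.

```python
import pickle

# Hypothetical tz database: zone name -> list of UTC offsets in seconds.
tz_db = {"America/Chicago": [-21060, -21600, -18000]}


def dump_by_value(name):
    # Store the zone's data itself in the pickle.
    return pickle.dumps(tz_db[name], protocol=0)


def load_by_value(blob):
    return pickle.loads(blob)


def dump_by_key(name):
    # Store only the name; the data is looked up again at load time.
    return pickle.dumps(name, protocol=0)


def load_by_key(blob):
    return tz_db[pickle.loads(blob)]


by_value = dump_by_value("America/Chicago")
by_key = dump_by_key("America/Chicago")

# Simulate a tz database update between pickling and unpickling.
tz_db["America/Chicago"] = [-21060, -21600, -18000, -14400]

print(load_by_value(by_value))  # data as it was when pickled
print(load_by_key(by_key))      # data as it is now
```

Neither behavior is wrong. The by-value pickle is a snapshot, and the by-key pickle is a reference that follows whatever database is installed when it is loaded.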

This is probably not something the vast majority of people need to worry themselves with, but I found it interesting to dig into these libraries a bit further, and also to play with the pickletools module and understand more about Python pickling.


  1. Yes, I did check that the Chicago Manual of Style, seventeenth edition, tells me not to capitalize time zone names.
  2. I’m on macOS today, and I kind of hate fs_usage. If I were instead using Linux today, I would have included strace in my testing so that I could more definitively confirm that no tz database files were being accessed. In particular, it seems plausible that, after unpickling a tzinfo instance, dateutil could prefer to use the information from the system’s tz database instead of what it unpickled if the tz database is available. I think this is very unlikely, though, based on what Paul has told me, and I haven’t seen a hint of it in dateutil’s code, either.
  3. When I say, “you’re going to get the same results”, I’m assuming you can unpickle it in the first place, of course, and that there have been no big changes in the underlying dateutil library.