Peterbe.com

Peter Bengtsson's blog

Filtered by Python

Page 17

Fastest way to thousands-commafy large numbers in Python/PyPy

October 13, 2012
15 comments Python

Here are two perfectly good ways to turn 123456789 into "123,456,789":



import locale

def f1(n):
    locale.setlocale(locale.LC_ALL, 'en_US.UTF-8')
    return locale.format('%d', n, True)

def f2(n):
    r = []
    for i, c in enumerate(reversed(str(n))):
        if i and (not (i % 3)):
            r.insert(0, ',')
        r.insert(0, c)
    return ''.join(r)

assert f1(123456789) == '123,456,789'
assert f2(123456789) == '123,456,789'

Which one do you think is the fastest?

Truncated! Read the rest by clicking the link below.

hastebinit - quickly paste snippets into hastebin.com

October 11, 2012
9 comments Python, Linux

I'm quite fond of hastebin.com. It's fast. It's reliable. And it's got nice keyboard shortcuts that work for my taste.

So, I created a little program to quickly throw things into hastebin. You can have one too:

First create ~/bin/hastebinit and paste in:


#!/usr/bin/python

import urllib2
import os
import json

URL = 'http://hastebin.com/documents'

def run(*args):
    if args:
        content = [open(x).read() for x in args]
        extensions = [os.path.splitext(x)[1] for x in args]
    else:
        content = [sys.stdin.read()]
        extensions = [None]

    for i, each in enumerate(content):
        req = urllib2.Request(URL, each)
        response = urllib2.urlopen(req)
        the_page = response.read()
        key = json.loads(the_page)['key']
        url = "http://hastebin.com/%s" % key
        if extensions[i]:
            url += extensions[i]
        print url


if __name__ == '__main__':
    import sys
    sys.exit(run(*sys.argv[1:]))

Then run: chmod +x ~/bin/hastebinit

Now you can do things like:

$ cat ~/myfile | hastebinit
$ hastebinit < ~/myfile
$ hastebinit ~/myfile myotherfile

Hopefully it'll one day help at least one more soul out there!

django-mongokit now compatible with Django 1.4

August 11, 2012
0 comments Python

I've finally had time to sort out the django-mongokit code so it's now fully compatible with Django 1.4.

Because I don't personally use the project I've sort of got myself lost in various patches from some awesome contributors who keep it in check.

Also, thanks to Marc Abramowitz who added a tox.ini which I've now updated to also test python 2.7 and Django 1.4

Go forth and build awesome Django apps with MongoKit which is still think is the best wrapper available on PyMongo out there.

How to use premailer as a command line script

July 13, 2012
5 comments Python

(This post is a response to Richard Patchet's request for tips on how to actually use premailer)

First of all, premailer is a Python library that converts a document of HTML and tranforms its <style> tags into inline style attributes on the HTML itself. This comes very handy when you need to take a nicely formatted HTML newletter template and prepare it before sending because when you send HTML emails you can't reference an external .css file.

So, here's how to turn it into a command line script.

First, install, then write the script:

$ pip install premailer
$ touch ~/bin/run-premailer.py
$ chmod +x ~/bin/run-premailer.py

Now, you might want to do this differently but this should get you places:


#!/usr/bin/env python

from premailer import transform

def run(files):
    try:
        base_url = [x for x in files if x.count('://')][0]
        files.remove(base_url)
    except IndexError:
        base_url = None

    for file_ in files:
        html = open(file_).read()
        print transform(html, base_url=base_url)

if __name__ == '__main__':
    import sys
    run(sys.argv[1:])

To test it, I've made a sample HTML page that looks like this:


<html>
    <head>
        <title>Test</title>
        <style>
        h1, h2 { color:red; }
        strong {
          text-decoration:none
          }
        p { font-size:2px }
        p.footer { font-size: 1px}
        p a:link { color: blue; }
        </style>
    </head>
    <body>
        <h1>Hi!</h1>
        <p><strong>Yes!</strong></p>
        <p class="footer" style="color:red">Feetnuts</p>
        <p><a href="page2/">Go to page 2</a></p>
    </body>
</html>

Cool. So let's run it: $ run-premailer.py test.html


<html>
  <head>
    <title>Test</title>
  </head>
  <body>
        <h1 style="color:red">Hi!</h1>
        <p style="font-size:2px"><strong style="text-decoration:none">Yes!</strong></p>
        <p style="{color:red; font-size:1px} :link{color:red}">Feetnuts</p>
    <p style="font-size:2px"><a href="page2/" style=":link{color:blue}">Go to page 2</a></p>
    </body>
</html>

Note that premailer supports converting relative URLs, so let's actually using that:
$ run-premailer.py test.html https://www.peterbe.com


<html>
  <head>
    <title>Test</title>
  </head>
  <body>
        <h1 style="color:red">Hi!</h1>
        <p style="font-size:2px"><strong style="text-decoration:none">Yes!</strong></p>
        <p style="{color:red; font-size:1px} :link{color:red}">Feetnuts</p>
    <p style="font-size:2px"><a href="https://www.peterbe.com/page2/" 
     style=":link{color:blue}">Go to page 2</a></p>
    </body>
</html>

I'm sure you can think of many many ways to improve that. Mayhaps use argparse or something fancy to allow for more options. Mayhaps make it so that you can supply named .css files on the command line that get automagically inserted on the fly.

Newfound love of @staticmethod in Python

July 2, 2012
6 comments Python

The @staticmethod decorator is nothing new. In fact, it was added in version 2.2. However, it's not till now in 2012 that I have genuinely fallen in love with it.

First a quick recap to remind you how @staticmethod works.


class Printer(object):

    def __init__(self, text):
        self.text = text

    @staticmethod
    def newlines(s):
        return s.replace('\n','\r')

    def printer(self):
        return self.newlines(self.text)

p = Printer('\n\r')
assert p.printer() == '\r\r'

So, it's a function that has nothing to do with the instance but still belongs to the class. It belongs to the class from an structural point of view of the observer. Like, clearly the newlines function is related to the Printer class. The alternative is:


def newlines(s):
    return s.replace('\n','\r')

class Printer(object):

    def __init__(self, text):
        self.text = text

    def printer(self):
        return newlines(self.text)

p = Printer('\n\r')
assert p.printer() == '\r\r'

It's the exact same thing and one could argue that the function has nothing to do with the Printer class. But ask yourself (by looking at your code); how many times do you have classes with methods on them that take self as a parameter but never actually use it?

So, now for the trump card that makes it worth the effort of making it a staticmethod: object orientation. How would you do this neatly without OO?


class UNIXPrinter(Printer):

    @staticmethod
    def newlines(s):
        return s.replace('\n\r', '\n')

p = UNIXPrinter('\n\r')
assert p.printer() == '\n'

Can you see it? It's ideal for little functions that should be domesticated by the class but have nothing to do with the instance (e.g. self). I used to think it looked like it's making a pure looking thing like something more complex that it needs to be. But now, I think it looks great!

Secs sell! How I cache my entire pages (server-side)

May 10, 2012
1 comment Python, Django

I've blogged before about how this site can easily push out over 2,000 requests/second using only 6 WSGI workers excluding latency. The reason that's possible is because the whole page(s) can be cached server-side. What actually happens is that the whole rendered HTML blob is stored in the cache server (Redis in my case) so that no database queries are needed at all.

I wanted my site to still "feel" dynamic in the sense that once you post a comment (and it's published), the page automatically invalidates the cache and thus, the user doesn't have to refresh his browser when he knows it should have changed. To accomplish this I used a hacked cache_page decorator that makes the cache key depend on the content it depends on. Here's the code I actually use today for the home page:


def _home_key_prefixer(request):
    if request.method != 'GET':
        return None
    prefix = urllib.urlencode(request.GET)
    cache_key = 'latest_comment_add_date'
    latest_date = cache.get(cache_key)
    if latest_date is None:
        # when a blog comment is posted, the blog modify_date is incremented
        latest, = (BlogItem.objects
                   .order_by('-modify_date')
                   .values('modify_date')[:1])
        latest_date = latest['modify_date'].strftime('%f')
        cache.set(cache_key, latest_date, 60 * 60)
    prefix += str(latest_date)

    try:
        redis_increment('homepage:hits', request)
    except Exception:
        logging.error('Unable to redis.zincrby', exc_info=True)

    return prefix


@cache_page_with_prefix(60 * 60, _home_key_prefixer)
def home(request, oc=None):
    ...
    try:
        redis_increment('homepage:misses', request)
    except Exception:
        logging.error('Unable to redis.zincrby', exc_info=True)
    ...

And in the models I then have this:


@receiver(post_save, sender=BlogComment)
@receiver(post_save, sender=BlogItem)
def invalidate_latest_comment_add_dates(sender, instance, **kwargs):
    cache_key = 'latest_comment_add_date'
    cache.delete(cache_key)

So this means:

whole pages are cached for long time for fast access
updates immediately invalidates the cache for best user experience
no need to mess with ANY SQL caching

So, the next question is, if posting a comment means that the cache is invalidated and needs to be populated, what's the ratio of hits versus hits where the cache is cleared? Glad you asked. That's why I made this page:

www.peterbe.com/stats/

It allows me to monitor how often a new blog comment or general time-out means poor django needs to re-create the HTML using SQL.

At the time of writing, one in every 25 hits to the homepage requires the server to re-generate the page. And still the content is always fresh and relevant.

The next level of optimization would be to figure out whether a particular page update (e.g. a blog comment posting on a page that isn't featured on the home page) should or should not invalidate the home page. esp

String length truncation optimization difference in Python

March 19, 2012
8 comments Python

We have a piece of code that is going to be run A LOT on a server infrastructure that needs to be fast. I know that I/O is much more important but because I had the time I wanted to figure out which is fastest:


def a(s, m):
    if len(s) > m:
        s = s[:m]
    return s

...or...


def b(s, m):
    return s[:m]

Truncated! Read the rest by clicking the link below.

When to deepcopy classes in Python

March 14, 2012
9 comments Python

When using mutables in Python you have to be careful:


>>> a = {'value': 1}
>>> b = a
>>> a['value'] = 2
>>> b
{'value': 2}

So, you use the copy module from the standard library:


>>> import copy
>>> a = {'value': 1}
>>> b = copy.copy(a)
>>> a['value'] = 2
>>> b
{'value': 1}

That's nice but it's limited. It doesn't deal with the nested mutables as you can see here:


>>> a = {'value': {'name': 'Something'}}
>>> b = copy.copy(a)
>>> a['value']['name'] = 'else'
>>> b
{'value': {'name': 'else'}}

That's when you need the copy.deepcopy function:


>>> a = {'value': {'name': 'Something'}}
>>> b = copy.deepcopy(a)
>>> a['value']['name'] = 'else'
>>> b
{'value': {'name': 'Something'}}

Now, suppose we have a custom class that overrides the dict type. That's a very common thing to do. Let's demonstrate:


>>> class ORM(dict):
...     pass
... 
>>> a = ORM(name='Value')
>>> b = copy.copy(a)
>>> a['name'] = 'Other'
>>> b
{'name': 'Value'}

And again, if you have a nested mutable object you need copy.deepcopy:


>>> class ORM(dict):
...     pass
... 
>>> a = ORM(data={'name': 'Something'})
>>> b = copy.deepcopy(a)
>>> a['data']['name'] = 'else'
>>> b
{'data': {'name': 'Something'}}

But oftentimes you'll want to make your dict subclass behave like a regular class so you can access data with dot notation. Like this:


>>> class ORM(dict):
...     def __getattr__(self, key):
...         return self[key]
... 
>>> a = ORM(data={'name': 'Something'})
>>> a.data['name']
'Something'

Now here's a problem. If you do that, you loose the ability to use copy.deepcopy since the class has now been slightly "abused".


>>> a = ORM(data={'name': 'Something'})
>>> a.data['name']
'Something'
>>> b = copy.deepcopy(a)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/Cellar/python/2.7.2/lib/python2.7/copy.py", line 172, in deepcopy
    copier = getattr(x, "__deepcopy__", None)
  File "<stdin>", line 3, in __getattr__
KeyError: '__deepcopy__'

Hmm... now you're in trouble and to get yourself out of it you have to define a __deepcopy__ method as well. Let's just do it:


>>> class ORM(dict):
...     def __getattr__(self, key):
...         return self[key]
...     def __deepcopy__(self, memo):
...         return ORM(copy.deepcopy(dict(self)))
... 
>>> a = ORM(data={'name': 'Something'})
>>> a.data['name']
'Something'
>>> b = copy.deepcopy(a)
>>> a.data['name'] = 'else'
>>> b
{'data': {'name': 'Something'}}

Yeah!!! Now we get what we want. Messing around with the __getattr__ like this is, as far as I know, the only time you have to go in and write your own __deepcopy__ method.

I'm sure hardcore Python language experts can point out lots of intricacies about __deepcopy__ but since I only learned about this today, having it here might help someone else too.

Persistent caching with fire-and-forget updates

December 14, 2011
4 comments Python, Tornado

I just recently landed some patches on toocool that implements and interesting pattern that is seen more and more these days. I call it: Persistent caching with fire-and-forget updates

Basically, the implementation is this: You issue a request that requires information about a Twitter user: E.g. http://toocoolfor.me/following/chucknorris/vs/peterbe The app looks into its MongoDB for information about the tweeter and if it can't find this user it goes onto the Twitter REST API and looks it up and saves the result in MongoDB. The next time the same information is requested, and the data is available in the MongoDB it instead checks if the modify_date or more than an hour and if so, it sends a job to the message queue (Celery with Redis in my case) to perform an update on this tweeter.

You can basically see the code here but just to reiterate and abbreviate, it looks like this:


tweeter = self.db.Tweeter.find_one({'username': username})
if not tweeter:
   result = yield tornado.gen.Task(...)
   if result:
       tweeter = self.save_tweeter_user(result)
   else:
       # deal with the error!
elif age(tweeter['modify_date']) > 3600:
   tasks.refresh_user_info.delay(username, ...)
# render the template!

What the client gets, i.e. the user using the site, is it that apart from the very first time that URL is request is instant results but data is being maintained and refreshed.

This pattern works great for data that doesn't have to be up-to-date to the second but that still needs a way to cache invalidate and re-fetch. This works because my limit of 1 hour is quite arbitrary. An alternative implementation would be something like this:


tweeter = self.db.Tweeter.find_one({'username': username})
if not tweeter or (tweeter and age(tweeter) > 3600 * 24 * 7):
    # re-fetch from Twitter REST API
elif age(tweeter) > 3600:
    # fire-and-forget update

That way you don't suffer from persistently cached data that is too old.

Python file with closing automatically

December 3, 2011
2 comments Python

Perhaps someone who knows more about the internals of python and the recent changes in 2.6 and 2.7 can explain this question that came up today in a code review.

I suggest using with instead of try: ... finally: to close a file that was written to. Instead of this:


dest = file('foo', 'w')
try:
   dest.write('stuff')
finally:
   dest.close()
print open('foo').read()  # will print 'stuff'

We can use this:


with file('foo', 'w') as dest: 
    dest.write('stuff')
print open('foo').read()  # will print 'stuff'

Why does that work? I'm guessing it's because the file() instance object has a built in __exit__ method. Is that right?

That means I don't need to use contextlib.closing(thing) right?

For example, suppose you have this class:


class Farm(object):
   def __enter__(self):
       print "Entering"
       return self
   def __exit__(self, err_type, err_val, err_tb):
       print "Exiting", err_type
       self.close()
   def close(self):
       print "Closing"

with Farm() as farm:
   pass
# this will print:
#   Entering
#   Exiting None
#   Closing

Another way to achieve the same specific result would be to use the closing() decrorator:


class Farm(object):
   def close(self):
       print "Closing"

from contextlib import closing
with closing(Farm()) as farm:
   pass
# this will print:
#   Closing

So the closing() decorator "steals" the __enter__ and __exit__. This last one can be handy if you do this:


from contextlib import closing
with closing(Farm()) as farm:
   raise ValueError

# this will print
#  Closing
#  Traceback (most recent call last):
#   File "dummy.py", line 16, in <module>
#     raise ValueError
#  ValueError

This is turning into my own loud thinking and I think I get it now. contextlib.closing() basically makes it possible to do what I did there with the __enter__ and __exit__ and it seems the file() built-in has a exit handler that takes care of the closing already so you don't have to do it with any extra decorators.