Grymt - because I didn't invent Grunt here

April 18, 2014
3 comments Python, Web development, JavaScript

grymt is a Python tool that takes a directory full of .html, .css and .js files and prepares the HTML for optimal production use.

For a teaser:

  1. Look at the "input"

  2. Look at the "output" (Note! You have to right-click and view source)

So why did I write my own tool and not use Grunt?!

Glad you asked! The reason is simple: I couldn't get Grunt to work.

Grunt is a framework. It's a place where you say which "recipes" to execute and how. It's effectively a common config framework. Like make.
However, I tried to set up a bunch of recipes in my Gruntfile.js and most of them worked well individually, but it was a hellish nightmare to get them all to work together just the way I wanted.

For example, the grunt-contrib-uglify is fine for doing the minification but it doesn't work with concatenation and it doesn't deal with taking one input file and outputting to a different file.
Basically, I spent two evenings getting things to work but I could never get exactly what I wanted. So I wrote my own, and because I'm quite familiar with this kind of stuff, I did it in Python. Not because it's better than Node but just because I had it nearby and was able to build something more quickly.

So what sweet features do you get out of grymt?

  1. You can easily make an output file have a hash in the filename. E.g. vendor-$hash.min.js becomes vendor-64f7425.min.js and thus the filename is always unique but doesn't change between deployments unless you change the files. (See the sketch after this list.)

  2. It automatically notices which files have already been minified. E.g. there's no need to minify somelib.min.js but do minify otherlib.js.

  3. You can put $git_revision anywhere in your HTML and this gets expanded automatically. For example, view the source of buggy.peterbe.com and look at the first 20 lines.

  4. Images inside CSS get rewritten to have unique names (based on the files' modification time) so they can be far-future cached aggressively too.

  5. You never have to write down any lists of file names in some Gruntfile.js equivalent file.

  6. It copies ALL files from a source directory. This is important in case you have something like $('<img>').attr('src', 'picture.jpg') inside your JavaScript code.

  7. You can choose to inline all the minified and concatenated CSS or JavaScript. Inlining CSS is neat for single page apps where you have a majority of primed cache hits. Instead of one .html and one .css you get just one .html and the number of bytes is the same. Not having to do another HTTP request can save a lot of time on web performance.

  8. The generated output (aka the "dist" directory) contains everything you need. It does not refer back to the source directory in any way. This means you can set up your apache/nginx to point directly at the root of your "dist" directory.
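
To give a feel for the first point on that list, here's roughly the idea behind the $hash expansion. This is only a sketch (it assumes an MD5 content hash truncated to 7 characters), not grymt's actual code:

import hashlib

def expand_hash(template_name, content):
    # the hash comes from the file's content, so the name only changes
    # when the content changes
    digest = hashlib.md5(content.encode('utf-8')).hexdigest()[:7]
    return template_name.replace('$hash', digest)

# e.g. expand_hash('vendor-$hash.min.js', concatenated_js)
# gives something like 'vendor-64f7425.min.js'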

So what's the catch?

  1. It's not Grunt. It's not a framework. It does only what it does and if you want it to do more you have to work on grymt itself.

  2. The files you want to analyze, process and output all have to be in a subdirectory.
    Look at how I've laid out the files in this project, for example. ALL the files you need are in one subdirectory called app. So, to run grymt I simply run: grymt app.

  3. The HTML files you throw into it have to be plain HTML files. No templates for server-side code.

How do you use it?

pip install grymt

Then you need a directory it can process, e.g. ./client/ (assumed to contain one or more .html files).

grymt ./client

For more options, check out

grymt --help

What's in the future of grymt?

If people like it and want to add features, I'm more than happy to accept pull requests. Some future potential feature work:

  • I haven't needed it myself yet, but it would be nice to add things like CoffeeScript, Less, Sass etc. as pre-processing hooks.

  • It would be easy to automatically generate and insert a reference to an appcache manifest. Since every file used and mentioned is noticed, we could very accurately generate an appcache file that is less prone to human error.

  • Spitting out some stats about the number of bytes saved and the number of files reduced.

COPYFILE_DISABLE and python distutils in python 2.6

April 12, 2014
0 comments Python

My friend and colleague Jannis (aka jezdez) Leidel saved my bacon today when I had gotten completely stuck.

So, I have this python2.6 virtualenv and whenever I ran python setup.py sdist upload it would upload a really nasty tarball to PyPI. What would happen is that when people do pip install premailer it would fail horribly and look something like this:

...
IOError: [Errno 2] No such file or directory: '/path/to/virtual-env/build/premailer/setup.py'

What?!?! If you download the tarball and unpack it you'll see that there definitely is a setup.py file in there.

Anyway. What happened, which I didn't realize, was that within the .tar.gz file there were these strange copies of files. For example, for every file.py there was a ._file.py, etc.

Here's what the tarball contents looked like after it had been created:

(premailer26)peterbe@mpb:~/dev/PYTHON/premailer (master)$ tar -zvtf dist/premailer-2.0.2.tar.gz
-rwxr-xr-x  0 peterbe staff     311 Apr 11 15:51 ./._premailer-2.0.2
drwxr-xr-x  0 peterbe staff       0 Apr 11 15:51 premailer-2.0.2/
-rw-r--r--  0 peterbe staff     280 Mar 28 10:13 premailer-2.0.2/._LICENSE
-rw-r--r--  0 peterbe staff    1517 Mar 28 10:13 premailer-2.0.2/LICENSE
-rw-r--r--  0 peterbe staff     280 Apr  9 21:10 premailer-2.0.2/._MANIFEST.in
-rw-r--r--  0 peterbe staff      34 Apr  9 21:10 premailer-2.0.2/MANIFEST.in
-rw-r--r--  0 peterbe staff     280 Apr 11 15:51 premailer-2.0.2/._PKG-INFO
-rw-r--r--  0 peterbe staff    7226 Apr 11 15:51 premailer-2.0.2/PKG-INFO
-rwxr-xr-x  0 peterbe staff     311 Apr 11 15:51 premailer-2.0.2/._premailer
drwxr-xr-x  0 peterbe staff       0 Apr 11 15:51 premailer-2.0.2/premailer/
-rwxr-xr-x  0 peterbe staff     311 Apr 11 15:51 premailer-2.0.2/._premailer.egg-info
drwxr-xr-x  0 peterbe staff       0 Apr 11 15:51 premailer-2.0.2/premailer.egg-info/
-rw-r--r--  0 peterbe staff     280 Mar 28 10:13 premailer-2.0.2/._README.md
-rw-r--r--  0 peterbe staff    5185 Mar 28 10:13 premailer-2.0.2/README.md
-rw-r--r--  0 peterbe staff     280 Apr 11 15:51 premailer-2.0.2/._setup.cfg
-rw-r--r--  0 peterbe staff      59 Apr 11 15:51 premailer-2.0.2/setup.cfg
-rw-r--r--  0 peterbe staff     280 Apr  9 21:09 premailer-2.0.2/._setup.py
-rw-r--r--  0 peterbe staff    2079 Apr  9 21:09 premailer-2.0.2/setup.py
-rw-r--r--  0 peterbe staff     280 Apr 11 15:51 premailer-2.0.2/premailer.egg-info/._dependency_links.txt
-rw-r--r--  0 peterbe staff       1 Apr 11 15:51 premailer-2.0.2/premailer.egg-info/dependency_links.txt
-rw-r--r--  0 peterbe staff     280 Apr  9 21:04 premailer-2.0.2/premailer.egg-info/._not-zip-safe
-rw-r--r--  0 peterbe staff       1 Apr  9 21:04 premailer-2.0.2/premailer.egg-info/not-zip-safe
-rw-r--r--  0 peterbe staff     280 Apr 11 15:51 premailer-2.0.2/premailer.egg-info/._PKG-INFO
-rw-r--r--  0 peterbe staff    7226 Apr 11 15:51 premailer-2.0.2/premailer.egg-info/PKG-INFO
-rw-r--r--  0 peterbe staff     280 Apr 11 15:51 premailer-2.0.2/premailer.egg-info/._requires.txt
-rw-r--r--  0 peterbe staff      23 Apr 11 15:51 premailer-2.0.2/premailer.egg-info/requires.txt
-rw-r--r--  0 peterbe staff     280 Apr 11 15:51 premailer-2.0.2/premailer.egg-info/._SOURCES.txt
-rw-r--r--  0 peterbe staff     329 Apr 11 15:51 premailer-2.0.2/premailer.egg-info/SOURCES.txt
-rw-r--r--  0 peterbe staff     280 Apr 11 15:51 premailer-2.0.2/premailer.egg-info/._top_level.txt
-rw-r--r--  0 peterbe staff      10 Apr 11 15:51 premailer-2.0.2/premailer.egg-info/top_level.txt
-rw-r--r--  0 peterbe staff     280 Apr  9 21:21 premailer-2.0.2/premailer/.___init__.py
-rw-r--r--  0 peterbe staff      66 Apr  9 21:21 premailer-2.0.2/premailer/__init__.py
-rw-r--r--  0 peterbe staff     280 Apr  9 09:23 premailer-2.0.2/premailer/.___main__.py
-rw-r--r--  0 peterbe staff    3315 Apr  9 09:23 premailer-2.0.2/premailer/__main__.py
-rw-r--r--  0 peterbe staff     280 Apr  8 16:22 premailer-2.0.2/premailer/._premailer.py
-rw-r--r--  0 peterbe staff   15368 Apr  8 16:22 premailer-2.0.2/premailer/premailer.py
-rw-r--r--  0 peterbe staff     280 Apr  8 16:22 premailer-2.0.2/premailer/._test_premailer.py
-rw-r--r--  0 peterbe staff   37184 Apr  8 16:22 premailer-2.0.2/premailer/test_premailer.py

Strangely, this only happened in a Python 2.6 environment. The problem went away when I created a brand new Python 2.7 environment with the latest setuptools.

So basically, the fault lies with OSX and a strange interaction between OSX and tar.
This superuser.com answer does a much better job explaining this "flaw".

So, the solution to the problem is to create the distribution like this instead:

$ COPYFILE_DISABLE=true python setup.py sdist

If you do that, you get a healthy-looking tarball that actually works to pip install. Thanks jezdez for pointing that out!
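
If you want to double-check a tarball before uploading it, here's a quick sketch using the standard library tarfile module that looks for those ._ copies:

import tarfile

def has_appledouble_files(path):
    # True if the tarball contains any of those OSX ._ resource fork copies
    archive = tarfile.open(path, 'r:gz')
    try:
        return any(
            name.rsplit('/', 1)[-1].startswith('._')
            for name in archive.getnames()
        )
    finally:
        archive.close()

print(has_appledouble_files('dist/premailer-2.0.2.tar.gz'))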

Buggy - A sexy Bugzilla offline webapp

March 13, 2014
1 comment Web development, Mozilla, JavaScript

Screenshot
Buggy is a single-page webapp that relies entirely on the Bugzilla Native REST API. And it works offline. Sort of. I say "sort of" because obviously without a network connection you're bound to have outdated information from the bugzilla database but at least you'll have what you had when you went offline.

When you post a comment from Buggy, the posted comment is added to an internal sync queue and if you're online it immediately processes that queue. There is, of course, always a risk that you might close a bug when you're in a tunnel or on a plane without WiFi and when you later get back online the sync fails because of some conflict.

The reason I built this was partly to scratch an itch I had ("What's the ideal way possible for me to use Bugzilla?") and also to experiment with some new techniques, namely AngularJS and localforage.

Live-search

So, the way it works is:

  1. You pick your favorite products and components.

  2. All bugs under these products and components are downloaded and stored locally in your browser (thank you localforage).

  3. When you click any bug it then proceeds to download its change history and its comments.

  4. Periodically it checks each of your chosen products and components to see if new bugs or new comments have been added.

  5. If you refresh your browser, all bugs are loaded from a local copy stored in your browser and in the background it downloads any new bugs or comments or changes.

  6. If you enter your username and password, an auth token is stored in your browser and you can thus access secure bugs.

I can has charts

Pros and cons

The main advantage of Buggy compared to Bugzilla is that it's fast to navigate. You can instantly filter bugs by status(es), components and/or by searching in the bug summary.

The disadvantage of Buggy is that you can't see all fields, file new bugs or change all fields.

The code

The code is of course open source. It's available on https://github.com/peterbe/buggy and released under an MPL 2 license.

The code requires no server. It's just an HTML page with some CSS and Javascript.

Everything is done using AngularJS. It's only my second AngularJS project, but that is also part of why I built this: to learn AngularJS better.

Much of the inspiration came from the CSS framework Pure and one of their sample layouts which I started with and hacked into shape.

The deployment

YSlow
Because Buggy doesn't require a server, this is the very first time I've been able to deploy something entirely on CDN. Not just the images, CSS and Javascript but the main HTML page as well. Before I explain how I did that, let me explain about the make.py script.

I really wanted to use Grunt but it just didn't work for me. There are many positive things about Grunt, such as the ease with which you can add plugins, and I like how you just have one "standard" file that defines how a bunch of meta tasks should be done. However, I just couldn't get the concatenation and minification and stuff to work together. Individually each tool works fine, such as the grunt-contrib-uglify plugin, but together none of them appeared to want to work. Perhaps I just required too much.

In the end I wrote a script in python that does exactly what I want for deployment. Its features are:

  • Hashes in the minified and concatenated CSS and Javascript files (e.g. vendor-8254f6b.min.js)
  • Custom names for the minified and concatenated CSS and Javascript files so I can easily set far-future cache headers (e.g. /_cache/vendor-8254f6b.min.js)
  • Ability to fold all CSS minified into the HTML (since there's only one page, there's little reason to make the CSS external)
  • A Git revision SHA into the HTML of the generated ./dist/index.html file
  • All files in ./client/static/ copied intelligently into ./dist/static/
  • Images in CSS to be given hashes so they too can have far-future cache headers (sketched after this list)
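
To illustrate that last bullet point, the image rewriting boils down to something like this. It's only a sketch (an mtime-based rename and a naive url() regex); the real make.py may do it differently:

import os
import re

def cache_bust_css(css, static_dir):
    # rewrite url(images/foo.png) -> url(images/foo.1397229063.png)
    # using the file's modification time, so far-future cache headers are safe
    def replacer(match):
        filename = match.group(1).strip('\'"')
        mtime = int(os.stat(os.path.join(static_dir, filename)).st_mtime)
        root, ext = os.path.splitext(filename)
        return 'url(%s.%s%s)' % (root, mtime, ext)
    return re.sub(r'url\(([^)]+)\)', replacer, css)

# note: the renamed copies would also have to be written into ./dist/static/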

So, the way I have it set up is that, on my server, I have it run python make.py and that generates a complete site in a ./dist/ directory. I then point Nginx to that directory and run it under http://buggy-origin.peterbe.com. Then I set up an Amazon CloudFront distribution to that domain and lastly I set up a CNAME for buggy.peterbe.com to point to the CloudFront distribution.

The future

I try my best to maintain a TODO file inside the repo. That's where I write down things to come. It also works as a changelog, since I use this file to write down what's been done.

One of the main features I want to add is the ability to add bugs that are outside your chosen products and components. It'll be a "fake" component called "Misc". This is for bugs outside the products and components you usually monitor and work in but perhaps bugs you've filed or been assigned to. Or just other bugs you're interested in in general.

Another major feature to work on is the ability to choose to see more fields and the ability to edit these too. This will require some configuration on the individual user's behalf. For example, some people use the "Target Milestone" a lot. Some use the "Importance" a lot. So, some generic solution is needed to accommodate all these non-basic fields.

And last but not least, the Bugzilla team here at Mozilla is working on a very exciting project that allows you to register a certain list of bugs with a WebSocket and have changes pushed to you as soon as these bugs change. That means that I won't have to query bugzilla every 30 seconds to see if certain bugs have changed but can instead get instant notifications when they do. That's going to be major! I confidently speculate that it will be implemented some time this summer.

Give it a go. What are you waiting for? :) Go to http://buggy.peterbe.com/, pick your favorite products and components and try to use it for a week.

Advanced live-search with AngularJS

February 4, 2014
12 comments JavaScript

For people familiar with AngularJS, it's almost frighteningly easy to make a live-search on a repeating iterator.

Here's such an example: http://jsfiddle.net/r26xm/1/

Out of the box it just works. If nothing is typed into the search field it returns everything.

A big problem with this is that the pattern matching isn't very good. For example, if you search for ter you get Teresa and Peter.
More realistically you want it to only match with a leading word delimiter. In other words, if you type ter you want it only to match Teresa but not Peter because Peter doesn't start with ter.
So, to remedy that we construct a regular expression on the fly with a leading word delimiter. I.e. \bter.

Here's an example of that: http://jsfiddle.net/f4Zkm/2/

Now, there's a problem. For every item in the list the regular expression needs to be created and compiled which, when the list is very long, can become incredibly slow.
To remedy that we use $scope.$watch to create a local regular expression which only happens once per update to $scope.search.

Here's an example of that: http://jsfiddle.net/f4Zkm/4/

That, I think, is a really good pattern. Unfortunately we've lost some of the simplicity, but we now have something snappier.

Unfortunately the example is a little bit contrived because the list of names it filters on is so small, but the list could be huge. It could also be that we want to make a more advanced regular expression. For example, you might want to allow multiple words to match, so that ter ma matches Teresa Mayers, John Mayor and Maria Connor. Then you could build a regular expression like \b(ter|ma).
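
The examples above are AngularJS, but the word-boundary trick itself isn't specific to JavaScript. Here's the same idea sketched in Python (the names list is made up):

import re

names = ['Teresa Mayers', 'Peter', 'John Mayor', 'Maria Connor']

def live_search(query, items):
    # one leading word boundary, all typed words OR:ed together,
    # compiled once per query instead of once per item
    words = [re.escape(word) for word in query.split()]
    if not words:
        return items
    regex = re.compile(r'\b(%s)' % '|'.join(words), re.IGNORECASE)
    return [item for item in items if regex.search(item)]

print(live_search('ter ma', names))
# ['Teresa Mayers', 'John Mayor', 'Maria Connor']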

For seasoned Angularnauts this is trivial stuff but it really helped me make an app much faster and smoother. I hope it helps someone else doing something similar.

Sorting mixed type lists in Python 3

January 18, 2014
4 comments Python

Because this bit me harder than I was ready for, I thought I'd make a note of it for the next victim.

In Python 2, suppose you have this:


Python 2.7.5
>>> items = [(1, 'A number'), ('a', 'A letter'), (2, 'Another number')]

Sorting them, without specifying how, will automatically notice that it contains tuples:


Python 2.7.5
>>> sorted(items)
[(1, 'A number'), (2, 'Another number'), ('a', 'A letter')]

This doesn't work in Python 3 because comparing integers and strings is not allowed. E.g.:


Python 3.3.3
>>> 1 < '1'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unorderable types: int() < str()

You have to convert them to strings first.


Python 3.3.3
>>> sorted(items, key=lambda x: str(x[0]))
[(1, 'A number'), (2, 'Another number'), ('a', 'A letter')]

If you really need 1 to sort before '1', this won't work. Then you need a more complex key function. E.g.:


Python 3.3.3
>>> items = [(1, 'A number'), ('1', 'Actually a string'), (2, 'Another number')]
>>> def keyfunction(x):
...   v = x[0]
...   if isinstance(v, int): v = '0%d' % v
...   return v
...
>>> sorted(items, key=keyfunction)
[(1, 'A number'), (2, 'Another number'), ('1', 'Actually a string')]

That's really messy but the best I can come up with at past 4PM on Friday.
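
A slightly less hacky alternative, if all you need is a deterministic order, is to sort on the type name first so integers and strings never get compared directly:

Python 3.3.3
>>> sorted(items, key=lambda x: (type(x[0]).__name__, x[0]))
[(1, 'A number'), (2, 'Another number'), ('1', 'Actually a string')]

All the ints sort before all the strs (because 'int' < 'str'), and within each group the values compare normally.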

Credit Card formatter in Javascript

November 19, 2013
28 comments JavaScript

I looked around for Javascript libs that do automatic input formatting for credit card inputs.

The first one was formatter.js, which looked promising, but it weighs over 6Kb minified and also, when you apply it, the placeholder attribute you have on the input disappears.

So, in true software engineering fashion I wrote my own:


function cc_format(value) {
  var v = value.replace(/\s+/g, '').replace(/[^0-9]/gi, '');
  var matches = v.match(/\d{4,16}/g);
  var match = (matches && matches[0]) || '';
  var parts = [];
  for (var i = 0, len = match.length; i < len; i += 4) {
    parts.push(match.substring(i, i + 4));
  }
  if (parts.length) {
    return parts.join(' ');
  } else {
    return value;
  }
}

And some tests to prove it:


assert(cc_format('1234') === '1234')
assert(cc_format('123456') === '1234 56')
assert(cc_format('123456789') === '1234 5678 9')
assert(cc_format('') === '')
assert(cc_format('1234 1234 5') === '1234 1234 5')
assert(cc_format('1234 a 1234x 5') === '1234 1234 5')

Check out the Demo

From Postgres to JSON strings

November 12, 2013
14 comments Python

No, this is not about the new JSON Type added in Postgres 9.2. This is about how you can get a record set from a Postgres database into a JSON string the best way possible using Python.

Here's the traditional way:


>>> import json
>>> import psycopg2
>>>
>>> conn = psycopg2.connect('dbname=peterbecom')
>>> cur = conn.cursor()
>>> cur.execute("""
...   SELECT
...     id, oid, root, approved, name
...   FROM blogcomments
...   LIMIT 10
... """)
>>> columns = (
...     'id', 'oid', 'root', 'approved', 'name'
... )
>>> results = []
>>> for row in cur.fetchall():
...     results.append(dict(zip(columns, row)))
...
>>> print json.dumps(results, indent=2)
[
  {
    "oid": "comment-20030707-161847",
    "root": true,
    "id": 5662,
    "name": "Peter",
    "approved": true
  },
  {
    "oid": "comment-20040219-r4cf",
    "root": true,
    "id": 5663,
    "name": "silconscave",
    "approved": true
  },
  {
    "oid": "c091011r86x",
    "root": true,
    "id": 5664,
    "name": "Rachel Jay",
    "approved": true
  },
...

This is plain and nice but it's kinda annoying that you have to write down the columns you're selecting twice.
Also, it's annoying that you have to convert the results of fetchall() into a list of dicts in an extra loop.

So, there's a trick to the rescue! You can use the cursor_factory parameter. See below:


>>> import json
>>> import psycopg2
>>> from psycopg2.extras import RealDictCursor
>>>
>>> conn = psycopg2.connect('dbname=peterbecom')
>>> cur = conn.cursor(cursor_factory=RealDictCursor)
>>> cur.execute("""
...   SELECT
...     id, oid, root, approved, name
...   FROM blogcomments
...   LIMIT 10
... """)
>>>
>>> print json.dumps(cur.fetchall(), indent=2)
[
  {
    "oid": "comment-20030707-161847",
    "root": true,
    "id": 5662,
    "name": "Peter",
    "approved": true
  },
  {
    "oid": "comment-20040219-r4cf",
    "root": true,
    "id": 5663,
    "name": "silconscave",
    "approved": true
  },
  {
    "oid": "c091011r86x",
    "root": true,
    "id": 5664,
    "name": "Rachel Jay",
    "approved": true
  },
...

Isn't that much nicer? It's shorter and only lists the columns once.

But is it much faster? Sadly, no it's not. Not much faster. I ran various benchmarks comparing various ways of doing this and basically concluded that there's no significant difference. The latter one using RealDictCursor is around 5% faster. But I suspect all the time in the benchmark is spent doing things (the I/O) that are no different between the various versions.
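
If you want to measure it on your own data, here's a rough sketch of that kind of comparison (wall clock only, and not the actual benchmark script):

import json
import time
import psycopg2
from psycopg2.extras import RealDictCursor

conn = psycopg2.connect('dbname=peterbecom')
SQL = 'SELECT id, oid, root, approved, name FROM blogcomments LIMIT 10'
COLUMNS = ('id', 'oid', 'root', 'approved', 'name')

def traditional():
    cur = conn.cursor()
    cur.execute(SQL)
    return json.dumps([dict(zip(COLUMNS, row)) for row in cur.fetchall()])

def with_dict_cursor():
    cur = conn.cursor(cursor_factory=RealDictCursor)
    cur.execute(SQL)
    return json.dumps(cur.fetchall())

def time_it(func, times=1000):
    # crude wall clock timing; most of the time is I/O anyway
    t0 = time.time()
    for _ in range(times):
        func()
    return time.time() - t0

print(time_it(traditional))
print(time_it(with_dict_cursor))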

Anyway. It's a keeper. I think it just looks nicer.

Lazy loading below the fold

October 26, 2013
2 comments Web development, JavaScript

I've started experimenting with my home page to make it load even faster.

Amazon famously does this too, which you can read more about in this Steve Souders post. They make sure everything that needs to be visible above the fold is loaded first, then it starts loading all the other "stuff" below the fold. The assumption is that the user requests the page, watches it render, and some time after it has rendered reaches for the mouse and starts scrolling down for more content. Or perhaps never bothers to scroll down at all. Either way, everything below the fold can wait. We have more time to load that in later.

What we want to avoid is a load graph like this:

big html document delays loading other stuff

The graph is deliberately zoomed out so that we don't get stuck on the details of that particular graph. But basically, you have a very heavy document to load which needs to be fully loaded (and partially rendered) before it can load all other stuff that that page entails. As you can see, the first load (the HTML document) is taking up a majority of the load time. Once that's downloaded the browser can start parsing it and start rendering it. Simultaneously it can start downloading all the mentioned resources such as images, javascript, and CSS.

On WebPagetest they call this Speed Index; "The Speed Index is the average time at which visible parts of the page are displayed."
So basically, you want to display as much as you possibly can and then load in other things that are necessary but can wait in the background.

So, how did I accomplish this on my site?

Basically, the home page uses a piece of Django code that picks up the 10 most recent blog posts and includes them in the template. Instead, I made it only pick up the first 2 and then, after window.onload, a piece of AJAX code loads the HTML for the remaining 8 blog posts.
That means that much less is required to load the home page. The page is smaller and references fewer images. The AJAX code is very crude and simple but works well enough:


onload = function() {
  microAjax("/rest/2/10/", function (res) {
    document.getElementById('rest').innerHTML = res;
  });
};
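
For the curious, the /rest/2/10/ URL on the server side just returns rendered HTML for a slice of blog posts. Roughly like this, a sketch with made-up model and template names rather than the actual view:

# urls.py (sketch)
#   url(r'^rest/(?P<start>\d+)/(?P<end>\d+)/$', views.rest_posts),

from django.shortcuts import render

from .models import BlogPost  # stand-in name for the real model

def rest_posts(request, start, end):
    # render posts `start` through `end` as a plain HTML fragment
    posts = BlogPost.objects.order_by('-pub_date')[int(start):int(end)]
    return render(request, 'rest_posts.html', {'posts': posts})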

The user probably won't notice a huge difference if she avoids looking at the loading spinner of her browser. Only if she is really really fast at scrolling down will she notice that the rest of the page (about 80% of its vertical space) comes in a little bit later.

So, did it work?

I hope so! The theory is sound. However, my home page is, unlike an Amazon.com product page, very sparse. The page weighs a total of 77Kb (excluding external resources) but now only the first 25Kb is loaded and the rest later.

Here's a measurement before and one after. It's kinda hard to compare because "fluctuations" on network I/O make measurements like this quite unpredictable. Also, there are various odd requests like New Relic and Google Analytics which cloud the waterfall view. However, what really matters is in the "First View" of the after measurement. If you look closely you'll see that now a bunch of images aren't loaded until after the "Document Complete" event has fired. That, to me, is a big win.

Below the fold

If you're interested in how it was done, check out this changeset.

Fastest database for Tornado

October 9, 2013
9 comments Python, Tornado

When you use a web framework like Tornado, which is single-threaded with an event loop (like Node.js, if you're familiar with that), and you need persistence (i.e. a database), there is one important question you need to ask yourself:

Is the query fast enough that I don't need to do it asynchronously?

If it's going to be a really fast query (for example, selecting a small recordset by key (which is indexed)) it'll be quicker to just do it in a blocking fashion. It means less CPU work to jump between the events.

However, if the query is going to be potentially slow (like a complex and data-intensive report) it's better to execute the query asynchronously, do something else and continue once the database gets back a result. If you don't, all other requests to your web server might time out.
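
To make that concrete, here's a minimal sketch of the two patterns in a Tornado app, using MongoDB via pymongo (blocking) and motor (non-blocking). The handler and collection names are made up:

import pymongo
import motor
import tornado.gen
import tornado.web

blocking_db = pymongo.MongoClient()['talks']  # blocking driver
async_db = motor.MotorClient()['talks']       # non-blocking driver built for Tornado

class FastLookupHandler(tornado.web.RequestHandler):
    def get(self, slug):
        # small, indexed lookup: doing it blocking keeps the code simple
        talk = blocking_db.talks.find_one({'slug': slug}, {'_id': False})
        self.write(talk or {})

class SlowReportHandler(tornado.web.RequestHandler):
    @tornado.gen.coroutine
    def get(self):
        # potentially slow query: yield so the IOLoop can serve other
        # requests while MongoDB is working
        talks = yield async_db.talks.find(
            {'duration': {'$gt': 0.5}}, {'_id': False}).to_list(length=100)
        self.write({'talks': talks})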

Another important question whenever you work with a database is:

Would it be a disaster if you intend to store something that ends up not getting stored on disk?

This question is related to the D in ACID and doesn't have anything specific to do with Tornado. However, the reason you're using Tornado is probably because it's much more performant than more convenient alternatives like Django. So, if performance is so important, are durable writes important too?

Let's cut to the chase... I wanted to see how different databases perform when integrating them in Tornado. But let's not just look at different databases, let's also evaluate different ways of using them; either blocking or non-blocking.

What the benchmark does is:

  • On one single Python process...
  • For each database engine...
  • Create X records of something containing a string, a datetime, a list and a floating point number...
  • Edit each of these records which will require a fetch and an update...
  • Delete each of these records...

I can vary the number of records ("X") and sum the total wall clock time it takes for each database engine to complete all of these tasks. That way you get an insert, a select, an update and a delete. Realistically, it's likely you'll get a lot more selects than any of the other operations.

And the winner is:

pymongo!! Using the blocking version without doing safe writes.

Fastest database for Tornado

Let me explain some of those engines

  • pymongo is the blocking pure python engine
  • with redis, toredis and memcache, a document ID is generated with uuid4 and the document is converted to JSON and stored under that key (see the sketch after this list)
  • toredis is a redis wrapper for Tornado
  • when it says (safe) on the engine, it means MongoDB is told not to respond until it has, with some confidence, written the data
  • motor is an asynchronous MongoDB driver specifically for Tornado
  • MySQL doesn't support arrays (unlike PostgreSQL) so instead the tags field is stored as text and transformed back and forth as JSON
  • None of these databases have been tuned for performance. They're all fresh out-of-the-box installs on OSX with homebrew
  • None of these databases have indexes, apart from ElasticSearch where everything is indexed
  • momoko is an awesome wrapper for psycopg2 which works asynchronously specifically with Tornado
  • memcache is not persistent but I wanted to include it as a reference
  • All JSON encoding and decoding is done using ultrajson which should work to memcache, redis, toredis and mysql's advantage.
  • mongokit is a thin wrapper on pymongo that makes it feel more like an ORM
  • A lot of these can be optimized by doing bulk operations but I don't think that's fair
  • I don't yet have a way of measuring memory usage for each driver+engine but that's not really what this blog post is about
  • I'd love to do more work on running these benchmarks on concurrent hits to the server. However, with blocking drivers what would happen is that each request (other than the first one) would have to sit there and wait so the user experience would be poor but it wouldn't be any faster in total time.
  • I use the official elasticsearch driver but am curious to also add Tornado-es some day which will do asynchronous HTTP calls over to ES.
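
For example, what the redis/memcache style engines do is basically this (a sketch, not the benchmark's exact code):

import json
import uuid

import redis

client = redis.StrictRedis()

def create(document):
    # generate a document ID with uuid4, serialize the document to JSON
    # and store it under that key
    key = uuid.uuid4().hex
    client.set(key, json.dumps(document))
    return key

def retrieve(key):
    return json.loads(client.get(key))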

You can run the benchmark yourself

The code is here on github. The following steps should work:

$ virtualenv fastestdb
$ source fastestdb/bin/activate
$ git clone https://github.com/peterbe/fastestdb.git
$ cd fastestdb
$ pip install -r requirements.txt
$ python tornado_app.py

Then fire up http://localhost:8000/benchmark?how_many=10 and see if you can get it running.

Note: You might need to mess around with some of the hardcoded connection details in the file tornado_app.py.

Discussion

Before the lynch mob of HackerNews kill me for saying something positive about MongoDB; I'm perfectly aware of the discussions about large datasets and the complexities of managing them. Any flametroll comments about "web scale" will be deleted.

I think MongoDB does a really good job here. It's faster than Redis and Memcache but unlike those key-value stores, with MongoDB you can, if you need to, do actual queries (e.g. select all talks where the duration is greater than 0.5). MongoDB does its serialization between Python and the database using a binary format called BSON but, mind you, the Redis and Memcache drivers also get to use a fast JSON encoder/decoder (ultrajson).

The conclusion is: be aware of what you want to do with your data and of where performance versus durability matters.

What's next

Some of those drivers will work on PyPy, which I'm looking forward to testing. It should work with cffi-based drivers like psycopg2cffi for PostgreSQL, for example.

Also, an asynchronous version of elasticsearch should be interesting.

UPDATE 1

Today I installed RethinkDB 2.0 and included it in the test.

With RethinkDB 2.0

It was added in this commit and improved in this one.

I've been talking to the core team at RethinkDB to try to fix this.

In Python you sort with a tuple

June 14, 2013
9 comments Python

My colleague Axel Hecht showed me something I didn't know about sorting in Python.

In Python you can sort with a tuple. It's best illustrated with a simple example:


>>> items = [(1, 'B'), (1, 'A'), (2, 'A'), (0, 'B'), (0, 'a')]
>>> sorted(items)
[(0, 'B'), (0, 'a'), (1, 'A'), (1, 'B'), (2, 'A')]

By default, the sort method and the sorted built-in function notice that the items are tuples, so they sort on the first element first and on the second element second.

However, notice how you get (0, 'B') appearing before (0, 'a'). That's because upper case comes before lower case characters. However, suppose you wanted to apply some "humanization" on that and sort case insensitively. You might try:


>>> sorted(items, key=str.lower)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: descriptor 'lower' requires a 'str' object but received a 'tuple'

which is an error we deserve because this won't work for the first part of each tuple.

We could try to write a lambda function (e.g. sorted(items, key=lambda x: x.lower() if isinstance(x, str) else x)) but that's not going to work because you'll only ever get to apply that to the first item.

Without further ado, here's how you do it. A lambda function that returns a tuple:


>>> sorted(items, key=lambda x: (x[0], x[1].lower()))
[(0, 'a'), (0, 'B'), (1, 'A'), (1, 'B'), (2, 'A')]

And there you have it! Thanks for sharing Axel!

As a bonus item for people still reading...
I'm sure you know that you can reverse a sort order simply by passing in sorted(items, reverse=True, ...) but what if you want different directions depending on the key that you're sorting on?

Using the technique of a lambda function that returns a tuple, here's how we sort a slightly more advanced structure:


>>> peeps = [{'name': 'Bill', 'salary': 1000}, {'name': 'Bill', 'salary': 500}, {'name': 'Ted', 'salary': 500}]

And now, sort with a lambda function returning a tuple:


>>> sorted(peeps, key=lambda x: (x['name'], x['salary']))
[{'salary': 500, 'name': 'Bill'}, {'salary': 1000, 'name': 'Bill'}, {'salary': 500, 'name': 'Ted'}]

Makes sense, right? Bill comes before Ted and 500 comes before 1000. But how do you sort it like that on the name but reverse on the salary? Simple, negate it:


>>> sorted(peeps, key=lambda x: (x['name'], -x['salary']))
[{'salary': 1000, 'name': 'Bill'}, {'salary': 500, 'name': 'Bill'}, {'salary': 500, 'name': 'Ted'}]

UPDATE

Webucator has made a video explaining this blog post.

Thanks Nat!