ElasticSearch, snowball analyzer and stop words

September 25, 2015
1 comment Python

Disclaimer: I'm an ElasticSearch noob. Go easy on me

I have an application that uses ElasticSearch's more_like_this query to find related content. It basically works like this:

>>> index(index, doc_type, {'id': 1, 'title': 'Your cool title is here'})
>>> index(index, doc_type, {'id': 2, 'title': 'About is a cool headline'})
>>> index(index, doc_type, {'id': 3, 'title': 'Titles are your big thing'})

Then you can pick one ID (1, 2 or 3) and find related ones.
We can tell by looking at these three silly examples that 1 and 2 have the words "is" and "cool" in common. 1 and 3 have "title" (stemming taken into account) and "your" in common. However, is there much value in connecting these documents on the words "is" and "your"? I think not. Those are stop words, i.e. words like "the", "this", "from", "she" etc. Basically words that are commonly used as "glue" between more unique and specific words.
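To make the point concrete, here is a minimal sketch in plain Python (no ElasticSearch involved) of which words the three titles share, with and without stop words. The `STOP_WORDS` set and the helper names are made up for this illustration, and there is no stemming, so "title" and "titles" do not match here.

```python
# Hypothetical stop word list, just for this illustration.
STOP_WORDS = {"is", "your", "a", "about", "are", "the"}

TITLES = {
    1: "Your cool title is here",
    2: "About is a cool headline",
    3: "Titles are your big thing",
}


def words(title, stop_words=frozenset()):
    """Lowercase, split on whitespace and drop any stop words."""
    return {w for w in title.lower().split() if w not in stop_words}


def common(id_a, id_b, stop_words=frozenset()):
    """Return the set of words two titles have in common."""
    return words(TITLES[id_a], stop_words) & words(TITLES[id_b], stop_words)


print(sorted(common(1, 2)))              # ['cool', 'is']
print(sorted(common(1, 2, STOP_WORDS)))  # ['cool']
print(sorted(common(1, 3)))              # ['your']
print(sorted(common(1, 3, STOP_WORDS)))  # []
```

Once the stop words are gone, 1 and 2 are only connected by "cool", which is the kind of connection you actually want.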

Anyway, if you index something in ElasticSearch as a text field you get, by default, the "standard" analyzer to analyze the incoming stuff to be indexed. Roughly speaking, the standard analyzer just splits the text into words and lowercases them. A more compelling analyzer is the Snowball analyzer (original here) which supports intelligent stemming (turning "wife" ~= "wives") and stop words.

The problem is that the snowball analyzer has a very different set of stop words. We did some digging and thought this was the list it bases its English stop words on. But this was wrong. Note that that list has words like "your" and "about" listed there.

The way to find out how your analyzer treats a string and turns it into tokens is to use the _analyze tool. For example:

curl -XGET 'localhost:9200/{myindexname}/_analyze?analyzer=snowball' -d 'about your special is a the word' | json_print
{
  "tokens": [
    {
      "end_offset": 5,
      "token": "about",
      "type": "<ALPHANUM>",
      "start_offset": 0,
      "position": 1
    },
    {
      "end_offset": 10,
      "token": "your",
      "type": "<ALPHANUM>",
      "start_offset": 6,
      "position": 2
    },
    {
      "end_offset": 18,
      "token": "special",
      "type": "<ALPHANUM>",
      "start_offset": 11,
      "position": 3
    },
    {
      "end_offset": 32,
      "token": "word",
      "type": "<ALPHANUM>",
      "start_offset": 28,
      "position": 7
    }
  ]
}

So what you can see is that it finds the tokens "about", "your", "special" and "word". But it ignored "is", "a" and "the" as stop words. Hmm... I'm not happy with that. I don't think "about" and "your" are particularly helpful words.

So, how do you define your own stop words and override the ones in the Snowball analyzer? Well, let me show you.

In code, I use pyelasticsearch so the index creation is done in Python.


STOPWORDS = (
    "a able about across after all almost also am among an and "
    "any are as at be because been but by can cannot could dear "
    "did do does either else ever every for from get got had has "
    "have he her hers him his how however i if in into is it its "
    "just least let like likely may me might most must my "
    "neither no nor not of off often on only or other our own "
    "rather said say says she should since so some than that the "
    "their them then there these they this tis to too twas us "
    "wants was we were what when where which while who whom why "
    "will with would yet you your".split()
)

def create():
    es = get_connection()
    index = get_index()
    es.create_index(index, settings={
        'settings': {
            'analysis': {
                'analyzer': {
                    'extended_snowball_analyzer': {
                        'type': 'snowball',
                        'stopwords': STOPWORDS,
                    },
                },
            },
        },
        'mappings': {
            doc_type: {
                'properties': {
                    'title': {
                        'type': 'string',
                        'analyzer': 'extended_snowball_analyzer',
                    },
                }
            }
        }
    })

With that in place, now delete your index and re-create it. Now you can use the _analyze tool again to see how it analyzes text on this particular field. But note, to do this we need to know the name of the index we used. (so replace {myindexname} in the URL):

$ curl -XGET 'localhost:9200/{myindexname}/_analyze?field=title' -d 'about your special is a the word' | json_print
{
  "tokens": [
    {
      "end_offset": 18,
      "token": "special",
      "type": "<ALPHANUM>",
      "start_offset": 11,
      "position": 3
    },
    {
      "end_offset": 32,
      "token": "word",
      "type": "<ALPHANUM>",
      "start_offset": 28,
      "position": 7
    }
  ]
}

Cool! Now we see that it considers "about" and "your" as stop words. Much better. This is handy too because you might have certain words that are globally not very common but, within your application, are repeated a lot and not very useful.
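If you want to check this in a script instead of eyeballing the JSON, here is a small sketch that works out which words the analyzer dropped. It uses the token offsets from the _analyze response shown above. The function name is my own, and it assumes the input is plain whitespace-separated words (punctuation would throw the offsets off).

```python
import re


def dropped_words(text, analyze_response):
    """Return the words in text that have no token in the _analyze response."""
    kept_spans = {
        (token["start_offset"], token["end_offset"])
        for token in analyze_response["tokens"]
    }
    return [
        match.group()
        for match in re.finditer(r"\S+", text)
        if match.span() not in kept_spans
    ]


# The response from the second _analyze call above, as a Python dict.
response = {
    "tokens": [
        {"end_offset": 18, "token": "special", "type": "<ALPHANUM>",
         "start_offset": 11, "position": 3},
        {"end_offset": 32, "token": "word", "type": "<ALPHANUM>",
         "start_offset": 28, "position": 7},
    ]
}

print(dropped_words("about your special is a the word", response))
# ['about', 'your', 'is', 'a', 'the']
```

Comparing offsets rather than the token text means it still works when stemming changes the token.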

Thank you willkg and Erik Rose for your support in tracking this down!

django-semanticui-form

September 14, 2015
2 comments Python, Django

I'm working on a (side)project in Django that uses the awesome Semantic UI CSS framework. This project has some Django forms that are rendered on the server and so I can't let Django render the form HTML or else the CSS framework can't do its magic.

The project is called django-semanticui-form and it's a fork from django-bootstrap-form.

It doesn't come with the Semantic UI CSS files at all. That's up to you. Semantic UI is available as a big fat bundle (i.e. one big .css file) but generally you just pick the components you want/need. To use it in your Django templates, simply create a django.forms.Form instance and render it like this:


{% load semanticui %}

<form>
  {{ myform | semanticui }}
</form>

The project was put together very quickly. The elements I intend to render seem to work but you might find that certain input elements don't work as nicely. However, if you want to help on the project, it's really easy to write and run tests. And Travis and automatic PyPI deployment are all set up so pull requests should be easy.

peepin - a great companion to peep

September 10, 2015
0 comments Python

I actually wrote peepin several months ago but forgot to blog about it.
It's a great library that accompanies peep which is a wrapper on top of pip. Actually, it's for pip install. When you normally do pip install -r requirements.txt the only check it does is on the version number, assuming your requirements.txt has lines in it like Django==1.8.4. With peep it does a checksum comparison of the wheel, tarball or zip file. It basically means that the installer will get EXACTLY the same package files as were used by the developer who decided to add it to requirements.txt.

If you're using pip and want strong reliability and much higher security, I strongly recommend you consider switching to peep.

Anyway, what peepin is, is an executable used to modify your requirements.txt automatically for you. It can do two things. At least one.

1) Automatically figure out what the right checksums should be.
2) It can figure out what is the latest version on PyPI.

For example:

(airmozilla):~/airmozilla (upgrade-django-bootstrap-form $)$ peepin --verbose django-bootstrap-form
* Latest version for 3.2
https://pypi.python.org/pypi/django-bootstrap-form/3.2
* Found URL https://pypi.python.org/packages/source/d/django-bootstrap-form/django-bootstrap-form-3.2.tar.gz#md5=1e95b05a12362fe17e91b962c41d139e
*   Re-using /var/folders/1x/2hf5hbs902q54g3bgby5bzt40000gn/T/django-bootstrap-form-3.2.tar.gz
*   Hash AV1uiepPkO_mjIg3AvAKUDzsw82lsCCLCp6J6q_4naM
* Editing requirements.txt

And once that's done...:

(airmozilla):~/airmozilla (upgrade-django-bootstrap-form *$)$ git diff
diff --git a/requirements.txt b/requirements.txt
index a6600f1..5f1374c 100644
--- a/requirements.txt
+++ b/requirements.txt
@@ -83,8 +83,8 @@ BeautifulSoup==3.2.1
 django_compressor==1.4
 # sha256: F3KVsUQkAMks22fo4Y-f9ZRvtEL4WBO50IN4I3IuoI0
 django-cronjobs==0.2.3
-# sha256: 2G3HpwzvCTy3dc1YE7H4XQH6ZN8M3gWpkVFR28OOsNE
-django-bootstrap-form==3.1
+# sha256: AV1uiepPkO_mjIg3AvAKUDzsw82lsCCLCp6J6q_4naM
+django-bootstrap-form==3.2
 # sha256: jiOPwzhIDdvXgwiOhFgqN6dfB8mSdTNzMsmjmbIBkfI
 regex==2014.12.24
 # sha256: ZY2auoUzi-jB0VMsn7WAezgdxxZuRp_w9i_KpCQNnrg
 

If you want to, you can open up and inspect the downloaded package and check that no hacker has meddled with it. Or, if you don't have time to do that, at least use the package locally and run your tests etc. If you now feel comfortable with the installed package you can be 100% certain that exactly that package will be installed on your server once the code goes into production.
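For the curious, the "# sha256:" values in the diff above look like URL-safe base64 of a SHA-256 digest with the "=" padding stripped. Here is a minimal sketch for computing a value in that format yourself. That this is exactly how peep encodes it is my assumption, and the function names are made up.

```python
import hashlib
from base64 import urlsafe_b64encode


def encode_digest(data):
    """SHA-256 of some bytes as URL-safe base64 without '=' padding."""
    digest = hashlib.sha256(data).digest()
    return urlsafe_b64encode(digest).decode("ascii").rstrip("=")


def hash_file(path, chunk_size=64 * 1024):
    """Same encoding, but reads a (possibly large) file in chunks."""
    sha = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            sha.update(chunk)
    return urlsafe_b64encode(sha.digest()).decode("ascii").rstrip("=")


# A 32 byte digest always becomes 43 characters, just like the hashes above.
print(len(encode_digest(b"hello")))
```

Running hash_file() on the downloaded tarball and comparing it with the line in requirements.txt is a quick sanity check.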

Be careful with using dict() to create a copy

September 9, 2015
9 comments Python

Everyone who's done Python for a while soon learns that dicts are mutable. I.e. that they can change.

One way of "forking" a dictionary into two different ones is to create a new dictionary object with dict(). E.g:


>>> first = {'key': 'value'}
>>> second = dict(first)
>>> second['key'] = 'other'
>>> first
{'key': 'value'}
>>> second
{'key': 'other'}

See, you can change the value of a key without affecting the dictionary it came from.

But, if one of the values is also mutable, beware!


>>> first = {'key': ['value']}
>>> second = dict(first)
>>> second['key'].append('second value')
>>> first
{'key': ['value', 'second value']}
>>> second
{'key': ['value', 'second value']}

This is where you need to use copy.deepcopy from the standard library.


>>> import copy
>>> first = {'key': ['value']}
>>> second = copy.deepcopy(first)
>>> second['key'].append('second value')
>>> first
{'key': ['value']}
>>> second
{'key': ['value', 'second value']}

Yay! Hope it helps someone avoid some possibly confusing bugs some day.

UPDATE

As ëRiC reminded me, there are actually three ways to make a "shallow copy" of a dictionary:

1) some_copy = dict(some_dict)

2) some_copy = some_dict.copy()

3) some_copy = copy.copy(some_dict) # after importing 'copy'
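Here's a quick sketch to convince yourself that all three really are shallow: each copy is a new dict, but the nested list inside is the very same object as in the original. Only copy.deepcopy gives you an independent list.

```python
import copy

some_dict = {"key": ["value"]}

shallow_copies = [
    dict(some_dict),
    some_dict.copy(),
    copy.copy(some_dict),
]

for some_copy in shallow_copies:
    # A new dictionary object...
    assert some_copy is not some_dict
    # ...but the mutable value inside is shared with the original.
    assert some_copy["key"] is some_dict["key"]

deep = copy.deepcopy(some_dict)
deep["key"].append("second value")
print(some_dict)  # {'key': ['value']}
print(deep)       # {'key': ['value', 'second value']}
```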

Introducing optisorl

August 18, 2015
0 comments Python

optisorl is a Python package for sorl-thumbnail which is a kick-ass Python package for Django. sorl-thumbnail is pretty popular and used by a lot of people who have images they want to display as thumbnails.

A problem you find is that oftentimes the PNG thumbnails aren't as optimized as they can be. A great tool for having a second optimization pass on a PNG file is pngquant. You basically run it like this:

$ ls -l bugzilla.png
-rw-r--r--@ 1 peterbe  staff  12188 Dec 12  2014 bugzilla.png
$ pngquant bugzilla.png
$ ls -l bugzilla-fs8.png
-rw-r--r--@ 1 peterbe  staff  6630 Aug 18 13:15 bugzilla-fs8.png

That's a 140x140 pixel PNG that became 5,558 bytes smaller (46% saving).

Anyway, this is where optisorl comes in. It's an extension to sorl-thumbnail that is able to execute pngquant on the PNG right after the thumbnail file has been created. It does so by calling out a sub-process command to pngquant. See the code here which is all the magic there is to it really.
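To give a feel for what "calling out to a sub-process" means, here is a rough sketch. It is not the actual optisorl code. The function names are mine, and I'm assuming the --force and --output options of the pngquant command line.

```python
import shutil
import subprocess


def build_command(source, destination, executable="pngquant"):
    """Build the argument list for one pngquant run (assumed options)."""
    return [executable, "--force", "--output", destination, "--", source]


def optimize_png(source, destination, executable="pngquant"):
    """Run pngquant on source. Returns True only if the command succeeded."""
    if shutil.which(executable) is None:
        # The executable isn't installed (or not on the PATH), so do nothing.
        return False
    result = subprocess.run(build_command(source, destination, executable))
    return result.returncode == 0


print(build_command("bugzilla.png", "bugzilla-optimized.png"))
```

Returning False instead of raising means a missing pngquant leaves you with the unoptimized thumbnail rather than an error.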

The reason I built this was to reduce the images on Air Mozilla. At the time I did the measurement, the PNGs total weight on the home page was 129KB and after running them all through optisorl the total weight was only 65KB.

To install it, just pip install it like so:

$ pip install optisorl

And you need to install pngquant like brew install pngquant or apt-get install pngquant.

Then, to activate it you need to set this Django setting:


THUMBNAIL_BACKEND = 'optisorl.backend.OptimizingThumbnailBackend'

If you decide to put the pngquant executable somewhere not on the PATH you can add to your settings.py file something like this:


PNGQUANT_LOCATION = '/path/to/bin/pngquant'

There's a bunch of features it doesn't have but we can work together on that. For example, there are certain PNG images that you might want to display as thumbnails but due to something about the image, e.g. its use of Alpha channels, you might want to explicitly disable optimizations.

Premailer.io

July 8, 2015
3 comments Python, Web development, AngularJS, JavaScript

Premailer is a Python library for turning HTML + CSS into HTML with all the CSS embedded as inline style attributes. This is sadly very necessary to ensure that your fancy HTML emails look spiffy across all email clients and email webapps.

So, last week I put together a little site to test the library via a browser: Premailer.io

It's just a simple webapp with a form where you can enter HTML in three different ways; textarea, by URL and by file upload.

You can also override all the possible advanced options that premailer supports.

What's kinda cool is that you can get a preview of what the HTML document will look like in an iframe that is dynamically loaded with the result from the conversion.

The webapp is of course open source and available on github.com/peterbe/premailer.io. The front-end is an AngularJS app and the build system is Lineman.js. The server is a Falcon server running on uWSGI via Nginx.

There's very little fancy here. There are no limitations or protections. I just hope it becomes handy for people to test premailer out.

The inspiration came from MailChimp's CSS Inliner Tool which is cute but very basic and doesn't allow you the same kinds of input.

If anybody with some AngularJS or highlight.js chops has time, I'd love some help figuring out why the HTML is not syntax highlighted.

Find what indentation your files use

July 7, 2015
0 comments Python

Over the years, style guides have come and gone. And contributors have come and gone.
Some people, at some times use 2-spaces indentation in JavaScript. Some prefer 4-spaces.

Even I have changed my mind over the years and now I'm content to do either. I just go by whatever the projectroot/.editorconfig config tells me.

So I wanted to clean up all the files so that they all use the same type of indentation (as dictated by the project's .editorconfig file). But which files use what indentation? I could open each file in turn and look at it and keep a tally of which is what. Or I can script it.

I wrote a script. Usage example included in the gist.
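If you don't want to dig out the gist, here's a minimal sketch of the idea. It's not the script from the gist, just one way to do it under a simple assumption: look at how many spaces the indentation grows by from one non-blank line to the next and pick the most common jump. Tabs are not handled.

```python
from collections import Counter


def guess_indentation(source):
    """Return the most common increase in leading spaces, or 0 if none."""
    jumps = Counter()
    previous = 0
    for line in source.splitlines():
        if not line.strip():
            continue  # blank lines say nothing about indentation
        current = len(line) - len(line.lstrip(" "))
        if current > previous:
            jumps[current - previous] += 1
        previous = current
    if not jumps:
        return 0
    return jumps.most_common(1)[0][0]


four = "function f() {\n    if (x) {\n        return 1;\n    }\n}\n"
two = "function f() {\n  if (x) {\n    return 1;\n  }\n}\n"
print(guess_indentation(four))  # 4
print(guess_indentation(two))   # 2
```

Loop over your .js files, call it on each file's content and print the number next to the file name.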

Now I easily see which files use what indentation. That makes it easy to file bugs for refactoring.

Some of the files in this grep search include vendor scripts that I'm not going to touch but as you can see, most files use 4 spaces but some still use 2 spaces.

4   base/static/angular/watchcounter.js
2   base/static/dropzone/dropzone.js
4   base/static/js/base.js
2   base/static/js/gallery_select.js
1   base/static/js/libs/include.js
4   base/static/js/libs/moment.js
4   base/static/select2/select2.js
4   comments/static/comments/js/comments.js
1   main/static/main/fullcalendar/gcal.js
4   main/static/main/js/autocompeter.js
4   main/static/main/js/calendar.js
4   main/static/main/js/discussion.js
4   main/static/main/js/download.js
4   main/static/main/js/edit.js
4   main/static/main/js/embed.js
4   main/static/main/js/event_video.js
4   main/static/main/js/eventstatus.js
4   main/static/main/js/include-tabzilla.js
4   main/static/main/js/jwplay.js
4   main/static/main/js/livehits.js
4   main/static/main/js/nav.js
4   main/static/main/js/playbackrate.js
4   main/static/main/js/tabzilla.js
4   main/static/main/js/tearout.js
4   manage/static/manage/js/autocompeter.js
1   manage/static/manage/js/bootstrap-datepicker.js
2   manage/static/manage/js/bootstrap-typeahead.js
4   manage/static/manage/js/channel-html-edit.js
4   manage/static/manage/js/confirm-delete.js
4   manage/static/manage/js/cronlogger.js
4   manage/static/manage/js/dashboard.js
4   manage/static/manage/js/dashboard_graphs.js
4   manage/static/manage/js/discussion-configuration.js
4   manage/static/manage/js/event-archive.js
4   manage/static/manage/js/event-assignment.js
4   manage/static/manage/js/event-edit.js
4   manage/static/manage/js/event-request.js
2   manage/static/manage/js/event-tweets.js
4   manage/static/manage/js/event-upload.js
4   manage/static/manage/js/event-vidly-submissions.js
4   manage/static/manage/js/eventmanager.js
4   manage/static/manage/js/events.js
4   manage/static/manage/js/form-errors.js
4   manage/static/manage/js/locations.js
4   manage/static/manage/js/mainmanager.js
4   manage/static/manage/js/manage.js
4   manage/static/manage/js/picture-add.js
4   manage/static/manage/js/picturegallery.js
4   manage/static/manage/js/staticpage-edit.js
4   manage/static/manage/js/suggestions.js
4   manage/static/manage/js/survey-edit.js
4   manage/static/manage/js/tagmanager.js
4   manage/static/manage/js/url-transforms.js
4   manage/static/manage/js/user-edit.js
4   manage/static/manage/js/usermanager.js
4   manage/static/manage/js/vidly-media-timings.js
4   manage/static/manage/js/vidly-media.js
4   new/static/new/js/RecordRTC.js
4   new/static/new/js/app.js
1   new/static/new/js/ccv.js
4   new/static/new/js/controllers.js
2   new/static/new/js/humanize-duration.js
4   new/static/new/js/services.js
4   starred/static/starred/js/star_event.js
4   starred/static/starred/js/starredevents.js
4   suggest/static/suggest/js/details.js
4   suggest/static/suggest/js/discussion.js
4   suggest/static/suggest/js/file.js
4   suggest/static/suggest/js/start.js
4   suggest/static/suggest/js/suggest.js
4   surveys/static/surveys/js/survey.js
2   uploads/static/uploads/js/s3upload.js
4   uploads/static/uploads/js/upload.js
2   webrtc/static/webrtc/js/camera.js
4   webrtc/static/webrtc/js/libs/RecordRTC.js
4   webrtc/static/webrtc/js/photobooth.js
4   webrtc/static/webrtc/js/summary.js
4   webrtc/static/webrtc/js/video.js
4   webrtc/static/webrtc/js/webrtc.js

How I git

June 18, 2015
1 comment Python, Linux

tl;dr I use bgg to shortcut a lot of tedious git commands.

Once a certain pattern appears where you find yourself doing the same thing over and over the first thing that should spring to mind is: let's automate that!

So a couple of years ago I started writing simple Python scripts that would wrap various git operations so I could do things like G merge or G rebase. That has helped me tremendously and when I first showed these scripts to some people I was amazed how unimpressed they were. I guess that's because they have their own scripts or a geeky reluctance to adopt someone else's shortcuts unless you've personally been a part of going from tedious to shortcut.

So, a crucial part of my work here at Mozilla is to look at a Bugzilla bug and start a topic branch based on it and, when it's done, push that into a Pull Request on GitHub.

The first command is G start. It takes a single optional argument. If an argument is provided it has to be a Bugzilla bug number. If you supply a Bugzilla ID it will fetch the title of that bug (assuming you're online) and store that so that it can be used to mention it in the git commit message. For example:

(airmozilla):~/dev/MOZILLA/AIRMOZILLA/airmozilla (master)$ G start 1174316
You're currently on branch master
Summary ["Start duration fetching when stopping a live event"]:
Switched to a new branch 'bug-1174316-start-duration-fetching-when-stopping-a-live-event'

The git branch name becomes a "slugified" version of the bug summary. But note, it merely sets the default. I could override it if I want to.
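In case "slugified" is unclear, here's a tiny sketch that produces the same branch name as in the example above. It's not the code bgg uses, the function name is made up, and the exact slug rules (lowercase, anything that isn't a letter or digit becomes a hyphen) are my assumption.

```python
import re


def branch_name(bug_id, summary):
    """Turn a bug number and its summary into a git-friendly branch name."""
    slug = re.sub(r"[^a-z0-9]+", "-", summary.lower()).strip("-")
    return "bug-{}-{}".format(bug_id, slug)


print(branch_name(1174316, "Start duration fetching when stopping a live event"))
# bug-1174316-start-duration-fetching-when-stopping-a-live-event
```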

Then you do some work on it and when you're done you type the next command; G commit. It basically runs git commit -a -m "..." using the bug number and the bug summary, optionally asks if you want to prefix the commit message with fixes and then pushes it to your fork. The example speaks for itself:

(airmozilla):~/dev/MOZILLA/AIRMOZILLA/airmozilla (bug-1174316-start-duration-fetching-when-stopping-a-live-event *)$ G commit
MSG:
    bug 1174316 - Start duration fetching when stopping a live event

OK? [Y/n]
Add the 'fixes ' prefix? [N/y] y
NOW, feel free to run:

git checkout master
git merge bug-1174316-start-duration-fetching-when-stopping-a-live-event
git branch -d bug-1174316-start-duration-fetching-when-stopping-a-live-event

OR

git push peterbe bug-1174316-start-duration-fetching-when-stopping-a-live-event

Run that push? [Y/n]
To git@github.com:peterbe/airmozilla.git
 * [new branch]      bug-1174316-start-duration-fetching-when-stopping-a-live-event -> bug-1174316-start-duration-fetching-when-stopping-a-live-event

You get the picture. It's interactive and mostly you just hit enter and it does stuff saving you copious milliseconds.

Other noteworthy commands:

G rebase - whilst on a branch, jumps over to the master branch, updates from the origin, then goes back to the branch you were on preparing you for an interactive git rebase.

G merge - goes over to the master branch, merges the branch you were on and if it works out, deletes the branch.

G getback - you're in a branch you know was merged (using GitHub's green merge button), it switches to the master branch, updates master and deletes the local topic branch (that was merged) and deletes the remote topic branch on your fork.

G cleanup [search] - you're on a branch other than the one you search for. It finds that branch (if only 1 match) and does what G getback does.

G branches [search] - lists all your branches sorted with the most recently worked on last. It also indicates how long ago you worked on each one and if it has already been merged.

The reason I'm mentioning this isn't to convince you to use my tool to do your git but perhaps to inspire you to write your own scripts that wrap things you find yourself doing repetitively.

I know my own battle isn't over. I'm still finding things that I have to do additionally on an almost perfectly predictable basis. Thankfully I now have an infrastructure to add more scripting.

Python slow-down of exception handling or condition checking

May 14, 2015
0 comments Python

It's the old problem of "Do I seek permission or ask for forgiveness?". It's rarely easy to know which one to use in Python because working with exceptions in Python is so damn easy.

Generally I prefer neither. I.e. just do. Don't write defensive code if you don't have to. Only seek permission or ask for forgiveness if you expect the problem to happen and that's normal.

Consider the following three functions:


def f0(x):
    return PI / x


def f1(x):
    if x != 0:
        return PI / x
    else:
        return -1


def f2(x):
    try:
        return PI / x
    except ZeroDivisionError:
        return -1

Which one do you think is the fastest? If I run this 1,000,000 times and never pass in x=0, will it make any difference?

Before you look at it, what do you think the result will be?


The answer is below.


Read on.


Scroll down for the results.


Have you made a guess yet?


What do you think it's going to be?


Scroll some more.


Almost there!


Ok, the results are as follows when running each of the above mentioned functions ~33,000,000 times on my MacBook:

f0 4.16087803245
f1 4.84187698364
f2 4.73760977387
(smaller is better)

Conclusion: the difference is minuscule. The fastest is to not do any exception handling or condition checking but it's generally no big difference.

This test was done with Python 2.7.9. You can try the code for yourself.
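If you'd rather not chase the link, here's a minimal sketch of a timing harness using timeit. It's not the script behind the numbers above. PI is assumed to be math.pi, the iteration count is arbitrary and the lambda wrapper adds the same small overhead to every function.

```python
import math
import timeit

PI = math.pi  # assumption: the original PI constant isn't shown in the post


def f0(x):
    return PI / x


def f1(x):
    if x != 0:
        return PI / x
    else:
        return -1


def f2(x):
    try:
        return PI / x
    except ZeroDivisionError:
        return -1


def time_all(iterations=100000):
    """Time each function with a non-zero argument, so nothing ever fails."""
    results = {}
    for func in (f0, f1, f2):
        results[func.__name__] = timeit.timeit(
            lambda func=func: func(3), number=iterations
        )
    return results


for name, seconds in sorted(time_all().items()):
    print(name, seconds)
```

Expect the absolute numbers to be different on your machine. It's the relative difference that's interesting.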

Just one more thought

As I wrote this post I started thinking more and more about the "code style aspect" rather than the performance.

Basically, I think it boils down to the following rules:

  1. If you're working with external I/O (e.g. network or a database) use the "ask for forgiveness" approach (aka. exception wrapping). I.e. don't do if requests.head(url).status_code == 200: stuff = requests.get(url)

  2. If you want to make a really user-friendly Python API, use the "seek permission" approach (aka. if-statement first). E.g. def calculate(guests): if isinstance(guests, basestring): guests = [guests]

  3. All else just do. That makes the code more Pythonic. If you have a sub-routine that sends in a variable of the totally crazy-wrong type to your function, don't change the function, change the sub-routine.
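As a rough sketch of rules 1 and 2, written for Python 3 (so str instead of basestring). Both functions are made up for illustration: load_settings reads a local JSON file as a stand-in for "external I/O" and calculate simply counts the guests.

```python
import json


def load_settings(path):
    # Rule 1: external I/O, so just try it and ask for forgiveness.
    try:
        with open(path) as f:
            return json.load(f)
    except (IOError, ValueError):
        # Missing/unreadable file or invalid JSON: fall back to no settings.
        return {}


def calculate(guests):
    # Rule 2: be friendly and accept a single name as well as a list.
    if isinstance(guests, str):
        guests = [guests]
    return len(guests)


print(load_settings("/no/such/file.json"))  # {}
print(calculate("Peter"))                   # 1
print(calculate(["Peter", "Erik"]))         # 2
```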

UPDATE

Here are the numbers for PyPy:

f0 0.369750552707
f1 0.321069081624
f2 0.411438703537
(smaller is better)

That's after averaging 15 runs of the script.

Note that the function with the extra if statement is faster.

And here are the numbers of Python 3.4.2:

f0 4.99579153742
f1 5.77459328515
f2 5.38382162367
(smaller is better)

That's averaging 10 rounds.

One almost interesting thing about these numbers is that their sums are different, which tells us a tiny story about performance for each language:

Python 2.7.9   13.74036478996
PyPy 2.4.0     1.102258337868
Python 3.4.2   16.15420644624
(smaller is better)

UPDATE 2

Here's the node equivalent version and its times:

f0 0.215509441
f1 0.228280196357
f2 0.316222934714
(smaller is better)

That means that my Node v0.10.35 is 45% faster than PyPy. But please, don't take that seriously.

premailer 2.9.0 and new rules for `base_url`

May 11, 2015
0 comments Python

I just pushed out a new release of premailer which comes with a pretty big change.

What has changed is the way the base_url and any href= or src= get combined. For example, you used to be able to set Premailer(html, base_url='http://example.com/subfolder') and combined with <img src="/images/foo.png"> it would become <img src="http://example.com/subfolder/images/foo.png">.

Not any more. The joining now works exactly like urljoin() in the Python standard library. E.g.


>>> from urllib.parse import urljoin  # python 3
>>> urljoin('https://example.com', '/image.png')
'https://example.com/image.png'
>>> urljoin('https://example.com/subfolder', '/image.png')
'https://example.com/image.png'
>>> urljoin('https://example.com/subfolder/', '/image.png')
'https://example.com/image.png'
>>> urljoin('https://example.com/subfolder/', '//image.png')
'https://image.png'
>>> urljoin('https://example.com/subfolder/', '//mycdn.com/image.png')
'https://mycdn.com/image.png'
>>> urljoin('http://example.com/subfolder/', '//mycdn.com/image.png')
'http://mycdn.com/image.png'
>>> urljoin('https://example.com/subfolder', 'image.png')
'https://example.com/image.png'
>>> urljoin('https://example.com/subfolder/', 'image.png')
'https://example.com/subfolder/image.png'

So basically, if you think you tried to do something odd with your base_url check it over carefully when you upgrade to version 2.9.0.

Thank you @ewjoachim and @graingert for your help!