In MDN I noticed a function that turns a piece of text (Python 2 unicode) into a slug. It looks like this:


    non_url_safe = ['"', '#', '$', '%', '&', '+',
                    ',', '/', ':', ';', '=', '?',
                    '@', '[', '\\', ']', '^', '`',
                    '{', '|', '}', '~', "'"]

    def slugify(self, text):
        """
        Turn the text content of a header into a slug for use in an ID
        """
        non_safe = [c for c in text if c in self.non_url_safe]
        if non_safe:
            for c in non_safe:
                text = text.replace(c, '')
        # Strip leading, trailing and multiple whitespace, convert remaining whitespace to _
        text = u'_'.join(text.split())
        return text

The code is 7-8 years old and relates to a migration when MDN was created as a Python fork from an existing PHP solution.

I couldn't help but react to the fact that `non_url_safe` is a list and that it's looped over every single time. Twice, in a sense: once to find the unsafe characters in the text and once more to `replace` each one. Python has built-in tools for this kinda stuff. Let's see if I can make it faster.

The candidates


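    import re

    # Built once, next to non_url_safe, and shared by every call.
    # Maps each unsafe code point to an empty string for unicode.translate().
    translate_table = {ord(char): u'' for char in non_url_safe}
    # Character class that matches any single unsafe character.
    non_url_safe_regex = re.compile(
        r'[{}]'.format(''.join(re.escape(x) for x in non_url_safe)))


    def _slugify1(self, text):
        # The original: collect the unsafe characters, then one replace() per hit.
        non_safe = [c for c in text if c in self.non_url_safe]
        if non_safe:
            for c in non_safe:
                text = text.replace(c, '')
        text = u'_'.join(text.split())
        return text

    def _slugify2(self, text):
        # Drop every unsafe character in a single pass over the text.
        text = text.translate(self.translate_table)
        text = u'_'.join(text.split())
        return text

    def _slugify3(self, text):
        # Same idea with a precompiled regex. The strip() keeps re.split()
        # from producing empty strings at either end.
        text = self.non_url_safe_regex.sub('', text).strip()
        text = u'_'.join(re.split(r'\s+', text))
        return text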

I wrote a thing that would call each one of the candidates, assert that their outputs always match and store how long each one took.
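
That harness isn't shown here, but a minimal sketch of the idea might look like the following. It is written for Python 3 rather than the Python 2 of the original, it uses module-level functions instead of methods, and the names `compare`, `SAMPLES` and the `slugify_*` functions are made up for illustration. The sample strings are invented too, so the timings it produces are not the ones reported below.

```python
import re
import time

NON_URL_SAFE = ['"', '#', '$', '%', '&', '+',
                ',', '/', ':', ';', '=', '?',
                '@', '[', '\\', ']', '^', '`',
                '{', '|', '}', '~', "'"]
TRANSLATE_TABLE = {ord(char): '' for char in NON_URL_SAFE}
NON_URL_SAFE_REGEX = re.compile(
    '[{}]'.format(''.join(re.escape(x) for x in NON_URL_SAFE)))


def slugify_loop(text):
    # One replace() per unsafe character found in the text.
    for c in [c for c in text if c in NON_URL_SAFE]:
        text = text.replace(c, '')
    return '_'.join(text.split())


def slugify_translate(text):
    # Single pass using the prebuilt translation table.
    return '_'.join(text.translate(TRANSLATE_TABLE).split())


def slugify_regex(text):
    # Single pass using the precompiled character class.
    text = NON_URL_SAFE_REGEX.sub('', text).strip()
    return '_'.join(re.split(r'\s+', text))


def compare(candidates, samples):
    """Return {function name: total seconds}, asserting all outputs agree."""
    totals = {func.__name__: 0.0 for func in candidates}
    for sample in samples:
        outputs = set()
        for func in candidates:
            start = time.perf_counter()
            result = func(sample)
            totals[func.__name__] += time.perf_counter() - start
            outputs.add(result)
        # Every candidate must produce the same slug for this sample.
        assert len(outputs) == 1, (sample, outputs)
    return totals


# Hypothetical sample headers, not taken from MDN.
SAMPLES = ["  What's new in CSS?  ", 'Using fetch() & promises', 'a/b: c; d']
timings = compare([slugify_loop, slugify_translate, slugify_regex], SAMPLES)
for name, seconds in timings.items():
    print('{} {:.3f}ms'.format(name, seconds * 1000))
```

Timing single calls like this is noisy, so the absolute numbers mean little. The useful part is the assertion that all three candidates agree on every input.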

The results

The slowest is fast enough. But if you're still reading, here are the results:

_slugify1 0.101ms
_slugify2 0.019ms
_slugify3 0.033ms

So using a translate table is about 5 times faster than the original loop, and the regex is about 3 times faster. But they're all sufficiently fast.

Conclusion

This is the least of your problems in a world of real I/O, such as databases, and other genuinely CPU-intensive stuff. Well, it was a fun little side-trip.

Also, aren't there better solutions that just blacklist all control characters?
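
I haven't tried that, but one reading of the question could be sketched like this in Python 3. It assumes "control characters" means the Unicode general category `Cc`, and the name `slugify_no_control` is made up. Note that it behaves differently from the functions above: punctuation like `#` or `?` is left alone.

```python
import unicodedata


def slugify_no_control(text):
    # Collapse whitespace first, so tabs and newlines become separators
    # instead of being deleted along with the other control characters.
    text = '_'.join(text.split())
    # Drop whatever control characters (category Cc) are left.
    return ''.join(ch for ch in text if unicodedata.category(ch) != 'Cc')


print(slugify_no_control('  Hello\x00 \tworld\x07 '))  # Hello_world
```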

Comments

James Bennett

I remember this one, and I'm the original author of that piece of code.

When first written, the slow looping approach was actually the simplest solution for the underlying problem, which was the specific way the previous wiki engine had encoded section titles for use in HTML IDs. The old wiki would replace these characters with a sequence of hex values of the character's UTF-8 bytes, each preceded by a dot. So a space in a section title, for example, would become '.20' in the generated ID.

At the time that had to be preserved so that existing links to specific sections of MDN documents would continue to work after the move to Django. You can see the original replacement code in the commit that introduced it:

https://github.com/mozilla/kuma/commit/be10b92234bda15a86f98a893b38fc1dce56e1a9

It would have been possible to write a function that transformed only the characters needing encoding, and map() over the input applying that, but the loop approach, while slightly less efficient, seemed clearer and more readable to me (and the extra time it took was more than lost in the noise, anyway; kuma's page rendering was a hugely expensive operation, for a variety of reasons).

Nowadays, it appears MDN no longer enforces the requirement to remain compatible with MindTouch section IDs, so it'd make sense to me to just go ahead and replace this code with a more idiomatic approach like the translation table (and then another tiny piece of code I wrote would vanish out of MDN...).
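
For readers who want to see the dot-plus-hex encoding described above in concrete terms, here is a minimal Python 3 sketch. It is not the original kuma code (the linked commit has that). The function name `dot_hex_encode`, the `needs_encoding` parameter and the choice of uppercase hex digits are all assumptions made for illustration.

```python
def dot_hex_encode(text, needs_encoding):
    """Replace each character in needs_encoding with '.XX' per UTF-8 byte."""
    pieces = []
    for char in text:
        if char in needs_encoding:
            # One '.XX' group per byte of the character's UTF-8 encoding.
            # Uppercase hex is an assumption, not confirmed by the source.
            pieces.append(''.join(
                '.{:02X}'.format(byte) for byte in char.encode('utf-8')))
        else:
            pieces.append(char)
    return ''.join(pieces)


print(dot_hex_encode('Browser compatibility', {' '}))  # Browser.20compatibility
print(dot_hex_encode('caf\u00e9', {'\u00e9'}))  # caf.C3.A9
```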

Peter Bengtsson

Thank you for posting that! That MindTouch legacy is still lurking about.

I'm still fond of my conclusion (even though it wasn't particularly surprising) that these little details don't actually matter all that much. I/O rules the latency and creating slugs isn't something that needs to be done every couple of milliseconds. Perhaps I blogged about it just to go for a walk.

Anonymous

To be fair, the `translate_table` creation should be inside the `_slugify2` function, which is the only one that uses it.

In addition, maybe you should use `timeit` to run them more than once.

upp

Not really, that's unfair because you'd recreate `translate_table` every time you call `_slugify2`.
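
Both points in this thread can be checked with `timeit`. The following is a minimal Python 3 sketch, not something from the original post: the function names, the sample string and the repeat counts are all made up. It times one variant that shares a prebuilt table and one that rebuilds the table on every call.

```python
import timeit

NON_URL_SAFE = '"#$%&+,/:;=?@[\\]^`{|}~\''
# Built once at import time and shared by every call.
TRANSLATE_TABLE = {ord(char): None for char in NON_URL_SAFE}


def slugify_shared_table(text):
    return '_'.join(text.translate(TRANSLATE_TABLE).split())


def slugify_rebuilt_table(text):
    # Rebuilds the table on every call, as suggested in the thread.
    table = {ord(char): None for char in NON_URL_SAFE}
    return '_'.join(text.translate(table).split())


SAMPLE = "  What's new in CSS?  "  # hypothetical input


def best_time(func, number=10000, repeat=5):
    """Best-of-`repeat` average seconds per call."""
    runs = timeit.repeat(lambda: func(SAMPLE), number=number, repeat=repeat)
    return min(runs) / number


for func in (slugify_shared_table, slugify_rebuilt_table):
    print('{}: {:.3f} microseconds per call'.format(
        func.__name__, best_time(func) * 1e6))
```

Taking the minimum of several repeats follows the advice in the `timeit` documentation, since the slower runs mostly reflect interference from other processes.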
