Filtered by JavaScript, Python

Page 41

Reset

html2plaintext Python script to convert HTML emails to plain text

August 10, 2007
12 comments Python

From the doc string:


A very spartan attempt of a script that converts HTML to
plaintext.

The original use for this little script was when I send HTML emails out I also
wanted to send a plaintext version of the HTML email as multipart. Instead of 
having two methods for generating the text I decided to focus on the HTML part
first and foremost (considering that a large majority of people don't have a 
problem with HTML emails) and make the fallback (plaintext) created on the fly.

This little script takes a chunk of HTML and strips out everything except the
<body> (or an elemeny ID) and inside that chunk it makes certain conversions 
such as replacing all hyperlinks with footnotes where the URL is shown at the
bottom of the text instead. <strong>words</strong> are converted to *words* 
and it does a fair attempt of getting the linebreaks right.

As a last resort, it strips away all other tags left that couldn't be gracefully
replaced with a plaintext equivalent.
Thanks for Fredrik Lundh's unescape() function things like:
   'Terms &amp;amp; Conditions' is converted to
   'Termss &amp; Conditions'

It's far from perfect but a good start. It works for me for now.

Version at the time of writing this: 0.1.

I wouldn't be surprised if I've reinvented the wheel here but I did plenty of searches and couldn't really find anything like this.

Let's run this for a while until I stumble across some bugs or other inconsistencies which I haven't quite done yet. The one thing I'm really unhappy about is the way I extract the body from the BeautifulSoup parse object. I really couldn't find another better way in the few minutes I had to spare on this.

Feel free to comment on things you think are pressing bugs.

You can download the script here html2plaintext.py version 0.1

UPDATE

I should take a second look at Aaron Swartz's html2text.py script the next time I work on this. His script seems a lot more mature and Aaron is brilliant Python developer.

Spellcorrector

April 18, 2007
3 comments Python

Spellcorrector being used on my not-yet-released web app I think a lot of Python people have seen Peter Novig's beautiful article about How to Write a Spelling Corrector. So have I and couldn't wait to write my own little version of it to fit my needs.

The changes I added were:

  • Python 2.4 compatible
  • Uses a pickleable dict instead of a collection
  • Compiled a huge list of Swedish words
  • Skipped edit distances 2 of words longer than 10 characters
  • Added a function suggestions()
  • All Unicode instead
  • A class instead of a function
  • Ability to train on your own words and to save that training persistently

Truncated! Read the rest by clicking the link below.

setAttribute('style', ...) workaround for IE

January 8, 2007
41 comments JavaScript

I knew had I heard it before but I must have completely missed it anyway and forgotten to test my new Javascript widget in IE. None of the stylesheet worked in IE and it didn't make any sense. Here's how I did it first:


var closer = document.createElement('a');
a.setAttribute('style', 'float:left; font-weight:bold');
a.onclick = function() { ...

That worked in Firefox of course but not in IE. The reason is that apparently IE doesn't support this. This brilliant page says that IE is "incomplete" on setAttribute(). Microsoft sucked again! Let's now focus on the workaround I put in place.

First I created a function to would take "font-weight:bold;..." as input and convert that to "element.style.fontWeight='bold'" etc:


function rzCC(s){
  // thanks http://www.ruzee.com/blog/2006/07/\
  // retrieving-css-styles-via-javascript/
  for(var exp=/-([a-z])/; 
      exp.test(s); 
      s=s.replace(exp,RegExp.$1.toUpperCase()));
  return s;
}

function _setStyle(element, declaration) {
  if (declaration.charAt(declaration.length-1)==';')
    declaration = declaration.slice(0, -1);
  var k, v;
  var splitted = declaration.split(';');
  for (var i=0, len=splitted.length; i<len; i++) {
     k = rzCC(splitted[i].split(':')[0]);
     v = splitted[i].split(':')[1];
     eval("element.style."+k+"='"+v+"'");

  }
}

I hate having to use eval() but I couldn't think of another way of doing it. Anybody?

Anyhow, now using it is done like this:


var closer = document.createElement('a');
//a.setAttribute('style', 'float:left; font-weight:bold');
_setStyle(a, 'float:left; font-weight:bold');
a.onclick = function() { ...

and it works in IE!

is is not the same as equal in Python

December 1, 2006
8 comments Python

Don't do the silly misstake that I did today. I improved my code to better support unicode by replacing all plain strings with unicode strings. In there I had code that looked like this:


if type_ is 'textarea':
   do something

This was changed to:


if type_ is u'textarea':
   do something

And it no longer matched since type_ was a normal ascii string. The correct wat to do these things is like this:


if type_ == u'textarea':
    do something
elif type_ is None:
    do something else

Remember:


>>> "peter" is u"peter"
False
>>> "peter" == u"peter"
True
>>> None is None
True
>>> None == None
True

Fastest way to uniqify a list in Python

August 14, 2006
92 comments Python

SEE UPDATE BELOW

--

Suppose you have a list in python that looks like this:


['a','b','a']
# or like this:
[1,2,2,2,3,4,5,6,6,6,6]

and you want to remove all duplicates so you get this result:


['a','b']
# or
[1,2,3,4,5,6]

How do you do that? ...the fastest way? I wrote a couple of alternative implementations and did a quick benchmark loop on the various implementations to find out which way was the fastest. (I haven't looked at memory usage). The slowest function was 78 times slower than the fastest function.

Truncated! Read the rest by clicking the link below.

Unicode strings to ASCII ...nicely

August 8, 2006
20 comments Python

This has been a problem for a long time for me. Whenever someone enters a title in my CMS the id of the document is derived from the title. Spaces are replaced with '- and &' is replaced with and etc. The final thing I wanted to do was to make sure the Id is ASCII encoded when it's saved. My original attempt looked like this:


>>> title = u"Klüft skräms inför på fédéral électoral große"
>>> print title.encode('ascii','ignore')
Klft skrms infr p fdral lectoral groe

But as you can see, a lot of the characters are gone. I'd much rather that a word like "Klüft" is converted to "Kluft" which will be more human readable and still correct. My second attempt was to write a big table of unicode to ascii replacements.

It looked something like this:


u'\xe4': u'a',
u'\xc4': u'A',
etc...

Truncated! Read the rest by clicking the link below.

slim, a new free web service for white space optimisation

July 25, 2006
1 comment Python

If you have some code that you need to optimise, like some Javascript code that is well commented but costs too many bytes of download for your users then you might want to use my slimmer web service. I'll let a Python example speak for itself:


>>> import xmlrpclib
>>> s=xmlrpclib.Server('https://www.peterbe.com/')
>>> css='h1 { font-family: Arial, Verdana; }'
>>> s.slim(css)
'h1{font-family:Arial,Verdana}'

Truncated! Read the rest by clicking the link below.

Nice stats added to RememberYourFriends.com

July 15, 2006
0 comments Python

Nice stats added to RememberYourFriends.com I've just added some nice stats to RememberYourFriends.com. It's basically two line charts. One of how many reminders have been sent that week and one of how many reminders have been sent, ever up to a particular week from the very start.

To do this I use the wonderful CharDirector package for Python which is really fast and very easy to implement. These graphs are created on the fly and apart from generating the image it obviously needs to generate the data. Pretty fast I'd say. Will be interesting to see how it will fair when the load starts to get interesting.

Once I get a bit more users I'll start thinking of other funky charts to draw. It's fun.

DifferenceFinder (aka. humanreadablediff.py)

July 6, 2006
4 comments Python

I've just quickly put together a little script that computes the difference between two texts in a human readable format. The result when you run diff is a bit difficult to understand for a human being and I wanted something more "humane" that quickly summarises what's different on one simple line. Eg. "Added 2 lines, change 1 line".

This little script is going to be part an undo function in our new CMS that I'm working on. Instead of just pinpointing which revision date you want to go back to you'll also be able to see what the differences were between each revision in the undo history for the CMS.

Truncated! Read the rest by clicking the link below.