Bleach is awesome. Thank you for it @willkg! It's a Python library for sanitizing text as well as "linkifying" text for HTML use. For example, consider this:
```python
>>> import bleach
>>> bleach.linkify("Here is some text with a url.com.")
'Here is some text with a <a href="http://url.com" rel="nofollow">url.com</a>.'
```
Note that sanitizing is a separate thing, but if you're curious, consider this example:
```python
>>> bleach.linkify(bleach.clean("Here is <script> some text with a url.com."))
'Here is &lt;script&gt; some text with a <a href="http://url.com" rel="nofollow">url.com</a>.'
```
With that output you can confidently interpolate the string straight into your HTML template.
Getting fancy
That's a great start but I wanted more. For one, I don't always want the `rel="nofollow"` attribute on all links, in particular not on links within the site. Secondly, a lot of things look like a domain but aren't. For example, the sentence "This is a text.at the start",
which would naively become...:
```python
>>> bleach.linkify("This is a text.at the start")
'This is a <a href="http://text.at" rel="nofollow">text.at</a> the start'
```
...because `text.at` looks like a domain.
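To see why that happens, here's a toy regex of my own (far simpler than Bleach's real URL pattern, which is much more careful) showing how anything word-dot-letters shaped gets picked up:

```python
import re

# A deliberately naive "domain" pattern: some word characters, a dot, then
# a short alphabetic suffix. Bleach's actual regex is more sophisticated,
# but since .at is a real TLD the false positive is the same.
naive_domain = re.compile(r"\b\w+\.[a-z]{2,6}\b")

match = naive_domain.search("This is a text.at the start")
# match.group(0) is "text.at" even though no link was intended
```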
So here is how I use it here on www.peterbe.com to linkify blog comments:
```python
from urllib.parse import urlparse

import bleach
import requests
from requests.exceptions import ConnectionError

from django.conf import settings


def custom_nofollow_maker(attrs, new=False):
    href_key = (None, u"href")

    if href_key not in attrs:
        return attrs

    if attrs[href_key].startswith(u"mailto:"):
        return attrs

    p = urlparse(attrs[href_key])
    if p.netloc not in settings.NOFOLLOW_EXCEPTIONS:
        # Before we add the `rel="nofollow"` let's first check that this is a
        # valid domain at all.
        root_url = p.scheme + "://" + p.netloc
        try:
            response = requests.head(root_url)
            if response.status_code == 301:
                redirect_p = urlparse(response.headers["location"])
                # If the only difference is that it redirects to https instead
                # of http, then amend the href.
                if (
                    redirect_p.scheme == "https"
                    and p.scheme == "http"
                    and p.netloc == redirect_p.netloc
                ):
                    attrs[href_key] = attrs[href_key].replace("http://", "https://")
        except ConnectionError:
            # Returning None from a linkify callback drops the link entirely.
            return None

        rel_key = (None, u"rel")
        rel_values = [val for val in attrs.get(rel_key, "").split(" ") if val]
        if "nofollow" not in [rel_val.lower() for rel_val in rel_values]:
            rel_values.append("nofollow")
        attrs[rel_key] = " ".join(rel_values)

    return attrs


html = bleach.linkify(text, callbacks=[custom_nofollow_maker])
```
This basically takes the default `nofollow` callback and extends it a bit.
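For comparison, Bleach's built-in callback looks roughly like this (paraphrased from memory of `bleach.callbacks.nofollow` — check the Bleach source for the canonical version). Note that attribute keys are `(namespace, name)` tuples:

```python
def nofollow(attrs, new=False):
    """Roughly what bleach.callbacks.nofollow does: add rel="nofollow"
    to every non-mailto link, preserving any existing rel values."""
    href_key = (None, "href")
    if href_key not in attrs:
        return attrs
    if attrs[href_key].startswith("mailto:"):
        return attrs
    rel_key = (None, "rel")
    rel_values = [val for val in attrs.get(rel_key, "").split(" ") if val]
    if "nofollow" not in [rel_val.lower() for rel_val in rel_values]:
        rel_values.append("nofollow")
    attrs[rel_key] = " ".join(rel_values)
    return attrs
```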
By the way, here is the complete code I use for sanitizing and linkifying blog comments here on this site: `render_comment_text`.
Caveats
This is slow because it requires network I/O every time a piece of text needs to be linkified (if it has domain-looking things in it), but that's best alleviated by only doing it once and either caching the result or persistently storing the cleaned and rendered output.
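One way to cache it (a sketch of my own, not what this site actually does) is to memoize the per-domain check, so repeated mentions of the same domain only cost one HEAD request per process:

```python
from functools import lru_cache


def make_cached_checker(head, maxsize=512):
    """Wrap a HEAD-request function (e.g. requests.head) so each root URL
    is only checked once per process. The function is injected rather than
    hardcoded to requests.head just to keep this sketch self-contained."""

    @lru_cache(maxsize=maxsize)
    def check(root_url):
        try:
            # With requests you would catch
            # requests.exceptions.ConnectionError here instead.
            return head(root_url).status_code < 500
        except ConnectionError:
            return False

    return check
```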
Also, the check uses `try: requests.head() except requests.exceptions.ConnectionError:` as the method to see if the domain works. I considered doing a whois lookup or something, but that felt a little wrong because just because a domain exists doesn't mean there's a website there. Either way, it could be that the domain/URL is perfectly fine but, in that very unlucky instant you checked, your own server's internet or some other DNS lookup thing is busted. Perhaps it should be wrapped in a retry, doing `try: requests.head() except requests.exceptions.RetryError:` instead.
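A retry could be as simple as this sketch (a plain loop of my own; with requests you'd more likely mount a urllib3 `Retry` adapter on a `Session`):

```python
def head_with_retry(head, url, attempts=3):
    """Retry a HEAD-request function a few times before giving up, so a
    momentary DNS or network hiccup doesn't drop a perfectly good link.
    The `head` function is injected (e.g. requests.head) to keep the
    sketch self-contained; with requests you would catch
    requests.exceptions.ConnectionError instead of the builtin."""
    last_error = None
    for _ in range(attempts):
        try:
            return head(url)
        except ConnectionError as error:
            last_error = error
    raise last_error
```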
Lastly, the business logic I chose was to rewrite all `http://` to `https://` only if the URL `http://domain` does a 301 redirect to `https://domain`. So if the original link was `http://bit.ly/redirect-slug` it leaves it as is. Perhaps a fancier version would be to look at the domain name ending. For example, `HEAD http://google.com` 301 redirects to `https://www.google.com`, so you could use the fact that `"www.google.com".endswith("google.com")`.
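That fancier check could be sketched like this (my own illustration, not code the site uses; comparing against `"." + domain` rather than a bare `endswith` so that an unrelated domain that merely ends in the same letters doesn't count):

```python
from urllib.parse import urlparse


def redirects_within_domain(original_url, redirect_location):
    """True when the redirect target is the same domain as the original
    URL, or a subdomain of it. Using "." + original avoids treating, say,
    evilgoogle.com as a subdomain of google.com."""
    original = urlparse(original_url).netloc
    target = urlparse(redirect_location).netloc
    return target == original or target.endswith("." + original)
```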
UPDATE Oct 10 2018
Moments after publishing this, I discovered a bug where it would fail badly if the text contained a URL with an ampersand in it. Turns out, it was a known bug in Bleach. It only happens when you try to pass a filter to the `bleach.Cleaner()` class.
So I simplified my code and now things work. Apparently, using `bleach.Cleaner(filters=[...])` is faster, so I'm losing that. But, for now, that's OK in my context.
Also, in another later fix, I improved the function some more by avoiding non-HTTP links (with the exception of `mailto:` and `tel:`). Otherwise it would attempt to run `requests.head('ssh://server.example.com')`, which doesn't make sense.
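That scheme handling could be sketched like this (the function name and string labels are my own, purely for illustration):

```python
from urllib.parse import urlparse


def scheme_guard(href):
    """Classify a link by scheme: mailto: and tel: links pass through
    untouched, http(s) links get the full HEAD check, and everything else
    (ssh://, ftp://, ...) is skipped entirely."""
    scheme = urlparse(href).scheme
    if scheme in ("mailto", "tel"):
        return "pass-through"
    if scheme in ("http", "https"):
        return "check"
    return "skip"
```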