Comment
Very nice.
I've been working on something like this recently, and experiencing similar problems.
Have you considered checking the URL for spaminess and malware somehow? Especially after manually reviewing lots of submitted URLs, and seeing old URLs get their domains taken over by dodgy websites. There are a few services around for this, but I'm not sure which is good. The obvious one is the Google Safe Browsing service, which Firefox uses (or used to?): https://developers.google.com/safe-browsing/
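If you want to try Safe Browsing, here's a minimal sketch of calling the v4 Lookup API with Python's requests library. The API key and the clientId value are placeholders you'd fill in yourself:

```python
import requests

SAFE_BROWSING_ENDPOINT = "https://safebrowsing.googleapis.com/v4/threatMatches:find"

def safe_browsing_matches(url, api_key):
    """Ask the Safe Browsing v4 Lookup API whether a URL is flagged.

    Returns the list of threat matches; an empty list means nothing known.
    """
    payload = {
        "client": {"clientId": "my-blog-comments", "clientVersion": "0.1"},
        "threatInfo": {
            "threatTypes": ["MALWARE", "SOCIAL_ENGINEERING", "UNWANTED_SOFTWARE"],
            "platformTypes": ["ANY_PLATFORM"],
            "threatEntryTypes": ["URL"],
            "threatEntries": [{"url": url}],
        },
    }
    resp = requests.post(
        SAFE_BROWSING_ENDPOINT, params={"key": api_key}, json=payload, timeout=10
    )
    resp.raise_for_status()
    return resp.json().get("matches", [])
```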
Speaking of domains getting taken over, or going down... having a link to archive.org is kind of nice. I haven't implemented this, but an easy way might be to put an archive icon link next to every link. Not sure how to check that 'the URL is still sort of what it should be', because content and site design can change.
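A rough sketch of that idea, assuming you render the comment HTML yourself (the regex is deliberately naive; for arbitrary user-submitted markup you'd want a real HTML parser instead):

```python
import re

# web.archive.org/web/<url> redirects to the newest snapshot of <url>
WAYBACK = "https://web.archive.org/web/"

def add_archive_links(html):
    """Insert a small "[archive]" link after each <a href="..."> element,
    pointing at the Wayback Machine's latest capture of the same URL."""
    def append_link(m):
        return '%s <a href="%s%s">[archive]</a>' % (m.group(0), WAYBACK, m.group(1))
    return re.sub(r'<a\s[^>]*href="([^"]+)"[^>]*>.*?</a>', append_link, html, flags=re.S)
```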
Expansion of shortened URLs, and HTTPS checking, would also be good (there are still lots of sites with broken HTTPS). Link shorteners are often trackers. But at least many of the mappings are now tracked by 301Works: https://archive.org/details/301works&tab=about So when a link shortener shuts down or changes its mappings, it should be possible to find out where a short link pointed before the change.
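Both checks are straightforward with the requests library; a sketch (the function names are just illustrative):

```python
import requests

def expand_url(url, timeout=10):
    """Follow redirects from a shortener and return the final destination URL."""
    resp = requests.head(url, allow_redirects=True, timeout=timeout)
    if resp.status_code >= 400:
        # some hosts reject HEAD; retry with a streamed GET (skips the body)
        resp = requests.get(url, allow_redirects=True, timeout=timeout, stream=True)
    return resp.url

def https_ok(url, timeout=10):
    """Check whether the https:// version of a URL actually responds."""
    https_url = "https://" + url.split("://", 1)[-1]
    try:
        return requests.get(https_url, timeout=timeout).ok
    except requests.RequestException:
        return False
```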
For comments, I've trained a spam classifier on a bunch of blog comments collected over the years. Additionally, limiting how many comments can be posted per hour has helped a lot by stopping bots from hammering the commenting endpoints. Finally, having moderation tools that let site admins mark comments as spam/not spam has helped.
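For what it's worth, here's a sketch of the first two pieces, assuming you've kept past moderation decisions around as labeled data. The scikit-learn pipeline and the in-memory limiter are illustrative choices, not exactly what I run:

```python
import time
from collections import defaultdict, deque

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

def train_comment_classifier(comments, labels):
    """Fit a bag-of-words spam classifier; labels are 1 = spam, 0 = ham."""
    model = make_pipeline(
        TfidfVectorizer(lowercase=True, ngram_range=(1, 2), min_df=2),
        MultinomialNB(),
    )
    model.fit(comments, labels)
    return model

class HourlyRateLimiter:
    """Allow at most `limit` comments per client per hour (sliding window)."""

    def __init__(self, limit=5, window_seconds=3600):
        self.limit = limit
        self.window = window_seconds
        self.hits = defaultdict(deque)  # client key (e.g. IP) -> timestamps

    def allow(self, client_key):
        now = time.time()
        q = self.hits[client_key]
        while q and now - q[0] > self.window:
            q.popleft()  # drop hits that fell out of the window
        if len(q) >= self.limit:
            return False
        q.append(now)
        return True
```

At submission time you'd check limiter.allow(ip) first, then run model.predict([body]) and hold anything flagged for manual review.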
Replies
In short, the world is your oyster when you have a tool like this. But one thing you can definitely do is train a separate spam classifier just on the URLs. That way, even if spammers wrap "good words" around spammy URLs, you can still catch them from the URLs alone.
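Character n-grams work well here, since spammy domains and paths have a texture of their own. A sketch, assuming labeled URLs pulled out of past spam/ham comments (the choice of LogisticRegression is mine, just for illustration):

```python
import re

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

URL_RE = re.compile(r"https?://\S+")

def train_url_classifier(urls, labels):
    """Fit a spam classifier on URLs alone (labels: 1 = spam, 0 = ham).

    Character n-grams pick up on dodgy domains, TLDs, and path patterns
    even when the surrounding comment text is made of "good words".
    """
    model = make_pipeline(
        TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5)),
        LogisticRegression(max_iter=1000),
    )
    model.fit(urls, labels)
    return model

def urls_look_spammy(model, comment_text):
    """Flag a comment if any URL it contains scores as spam."""
    urls = URL_RE.findall(comment_text)
    return bool(urls) and any(model.predict(urls))
```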
Here on my own blog, I manually moderate all comments. It sucks, but it's relatively quick; the blue links stand out and alert me to take a closer look.