Pork with Ham, Salt, Water, Sugar, Sodium Nitrite
I guess the primary lesson of the Internet is that there are people who will waste their time annoying the piss out of you if they think there’s any chance they can make a buck. We’re all familiar with email spam, but the Web opened up a new variety of ways to try to trick people into visiting some stupid online pharmacy or Texas hold-em site. While we were all busy remixing and mashing-up and all that other nonsense, spammers were busy figuring out the implications of link-based scoring systems. The rel="nofollow" trick seemed like a good idea, but its spotty deployment makes it easier for a spammer to just aim the shotgun of suck everywhere instead of figuring out which sites are actually vulnerable. We all screwed up at the beginning, so we all get to pay.
Comments are an obvious spam vector, often allowing arbitrary links to be posted on most any webpage, and this is where most of the anti-spam work goes. I don’t do anything fancy myself, and the half-assed bot test I took from NBCom has so far only allowed one spammer through. Trackbacks were a neat idea from Movable Type to allow for a blog entry to be a comment to another blog entry, but that protocol is impossible to effectively protect from spammers. The only protective measure I’ve seen is to check the commenting page for the relevant link when a trackback ping is received. This makes the implementation of pings annoying by forcing blog entry creation and publishing to be closely linked, but it is an effective and, at some level, sensible (if you’re sending me a thing saying you linked to my site, why is there no link?) device. It’s the only thing I use, and I’ve never gotten any trackback spam. This is quite possibly because my trackback links are completely broken, but I suppose that works, too.
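For the curious, here is roughly what that trackback link check could look like. This is a minimal sketch, not what any particular blog package does: the names `page_links_to` and `_LinkCollector` are made up for illustration, and it assumes you have already fetched the HTML of the pinging page some other way.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag


class _LinkCollector(HTMLParser):
    """Collects the href of every <a> tag in a document (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def _normalize(url):
    # Drop any #fragment and trailing slash so trivially different
    # spellings of the same URL compare equal.
    return urldefrag(url).url.rstrip("/")


def page_links_to(html, target_url):
    """Return True if the pinging page's HTML has an <a> link to target_url."""
    collector = _LinkCollector()
    collector.feed(html)
    wanted = _normalize(target_url)
    return any(_normalize(href) == wanted for href in collector.hrefs)
```

A ping whose source page fails this check would simply be dropped. Note that it only counts real anchor tags, so a page that merely mentions the URL in plain text does not pass.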
The form of Web spam that has most recently bothered me is referer [sic] spam. The motivation behind this spam is the same as the others. Some people post referer statistics on their sites, and since the Referer HTTP header is just an unverified chunk of text, this allows for spammers to inject their URLs into other people’s pages. I don’t make my log statistics publicly accessible, but I do like to watch them from time to time just out of curiosity. The referer stats have been completely unreadable lately because of the flood of crap, so I figured it was maybe time to do something.
The most popular means I’ve seen of blocking referer spam in Apache are to either use mod_rewrite to deny requests based on a list of undesirable strings or to filter the logs themselves, either to remove bad lines before processing or to generate new rewrite rules based on excessive hit counts. Several other people have posted a lot of good information on setting up these techniques (ReferrerCop for white/black list filtering, spam list 1, spam list 2, spam fucker script to generate rewrite rules), so I’ll just be going over some of the variations I used in my implementation. There are some RBLs similar to what things like Spamcop do for email, but I didn’t look heavily into those since I don’t think that the lookup delay is appropriate for HTTP. I went with a set of regular expressions developed from my own spam logs, but I went about it a little differently.
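The hit-count approach is easy to picture with a few lines of Python. This is only a sketch of the general idea, not the script linked above: it assumes the log is in Apache's "combined" format, and the function names and the threshold of 50 are arbitrary choices of mine.

```python
import re
from collections import Counter
from urllib.parse import urlsplit

# Tail of a "combined" log line: "request" status bytes "referer" "user-agent"
COMBINED_TAIL = re.compile(r'"[^"]*" \d{3} \S+ "(?P<referer>[^"]*)" "[^"]*"$')


def referer_hosts(lines):
    """Yield the host of each non-empty Referer found in combined-format lines."""
    for line in lines:
        match = COMBINED_TAIL.search(line.rstrip("\n"))
        if not match:
            continue  # not a combined-format line; skip it
        referer = match.group("referer")
        if referer in ("", "-"):
            continue  # no referer was sent
        try:
            host = urlsplit(referer).hostname
        except ValueError:
            continue  # malformed URL; ignore rather than crash
        if host:
            yield host


def suspicious_hosts(lines, threshold=50):
    """Return, sorted, the referer hosts seen at least `threshold` times."""
    counts = Counter(referer_hosts(lines))
    return sorted(host for host, n in counts.items() if n >= threshold)
```

You would feed it an open log file, eyeball the result, and turn the hosts you agree with into blocking rules. Blindly trusting the counts is a good way to ban a legitimate site that happens to send you a lot of traffic.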
Returning a 403 for bad hits is great and all, but those still get logged. Many directives in Apache can use environment variables as a conditional, so you can set an environment variable once based on regular expressions and use that condition to both deny requests and change logging behavior. So I use something like this to set up the spam strings:
SetEnvIfNoCase Referer "^http://([^/]*\.)?hejazholdem\.com" refspam=1
My log line now looks like this:
CustomLog /var/log/httpd/access.log combined env=!refspam
And I added the following to the <Directory> section for the main site:
Order allow,deny
Deny from env=refspam
Allow from all
This gives me enough flexibility to do what I want with the referer matches, and I can add new rules by making new SetEnvIfNoCase lines similar to how I handle blocking hotlinked images on myspace and xanga. The rules are probably best served by either a personal reflection on site statistics or the use of one of those lists from somewhere else, but I did notice a particular pattern in the spam URLs. Several of them were of the form “http://clickme/#”; that is, they had a “#” at the end of the URL. About the only time this bare anchor shows up in the wild is in bad JavaScript, and I’ve never seen any sort of URL fragment sent as a valid referer. Forbidding any use of “#” may be a bit extreme, and that sounds like a good way to find that one guy using the browser that sends fragments in referers, but I found the addition of “^http://.*#$” to my pattern list to be very useful.
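Before dropping a new pattern into the server config, it can be worth checking what it catches. Here is a small Python sketch for that, using the two patterns from this post. It assumes Python's `re` behaves like Apache's regex engine for expressions this simple, and `is_refspam` is just a name I made up.

```python
import re

# The two patterns from the post. SetEnvIfNoCase ignores case, so these are
# compiled with IGNORECASE; search() is used because Apache matches the
# pattern anywhere in the header unless it is anchored.
REFSPAM_PATTERNS = [
    re.compile(r"^http://([^/]*\.)?hejazholdem\.com", re.IGNORECASE),
    re.compile(r"^http://.*#$", re.IGNORECASE),
]


def is_refspam(referer):
    """Return True if the Referer value matches any spam pattern."""
    return any(p.search(referer) for p in REFSPAM_PATTERNS)


if __name__ == "__main__":
    samples = [
        "http://www.hejazholdem.com/",
        "http://clickme/#",
        "http://example.org/page#section",
        "http://example.org/?q=hejazholdem.com",
    ]
    for referer in samples:
        print(referer, is_refspam(referer))
```

The trailing-“#” pattern only fires when the fragment is empty, so a referer that ends in a named anchor still gets through.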
I hope this helps.