Wednesday, July 25, 2007

Harvesting the noise while it's still fresh; SPF found potentially useful

In previous installments of my greylisting and greytrapping posts, I've described how I found that after publishing my traplist, I maintained the list mainly by searching the mail server logs for "Unknown user" messages and by running a tail -f of my spamd log in a terminal window.

It dawned on me a couple of days back that finding the "Unknown user" entries in the mail server logs means I find only the backscatter bounces that have managed to clear greylisting, sent by real mail servers which are misconfigured to deliver spam to their users. Clearing greylisting may take a while, but once the IP address enters the whitelist and the machine does not try to send again to any address which is already in the traplist, it will be able to deliver its spam or backscatter.

Harvest the noise while it's fresh
Fortunately it's very easy to harvest the noise data while it's fresh. You search the greylist instead. A simple

$ sudo spamdb | grep GREY

gives you a list of all currently greylisted entries at that spamd instance, in a format which is well documented in the spamdb man page:


There will more likely be more than one, and in this format it's fairly easy to see at least two traplist candidates, and I have no idea if, or would ever have cleared greylisting, but now that and are in my traplist (and yes, at least part of the process should be very easy to automate), they'll be stuttered at, starting with the next time they try to connect and most likely until they give up.

Now it's probably still useful to tail -f your spamd log anyway, but you can leave the harvesting off until you see a marked increase in simultaneous connections to spamd, as in when the first number in parentheses starts rising sharply. Here the number is low (the second number is the number of currently blacklisted hosts):

Jul 25 22:17:16 delilah spamd[11839]: connected (12/12), lists: spamd-greytrap
Jul 25 22:17:35 delilah spamd[11839]: connected (13/13), lists: spamd-greytrap
Jul 25 22:17:36 delilah spamd[11839]: connected (14/14), lists: spamd-greytrap

When the first number rises sharply -- that's when the first wave of spam or backscatter hits, and you can harvest the noise while it's still fresh.

A good harvest means less work for your mail server.

SPF found potentially useful
One recurring theme in greylisting discussions is how to deal with sites which do not play nicely with greylisting, specifically sites with many outgoing SMTP servers and no guarantee that the retries will come from the same IP (you can find a rather informal discussion in the PF tutorial, for example). If you can't get those sites to do the do the retry magic, you probably need to whitelist them, but in the case of large sites like google, how do you find out just which machines to white list?

For well run sites the answer is simple: if they publish SPF data, you use that. After all, that data is their own list of valid outgoing SMTP senders. The solution presented itself in a recent openbsd-misc post by Darrin Chandler. If you need to whitelist a site with many potential outgoing SMTP servers, the command is

$ host -ttxt

That is, look up the text data in the domain's DNS data, which is where SPF data lives.
The answer would typically be something like descriptive text "v=spf1 mx -all"

which essentially means, "for the domain, only the mail exchangers are valid SMTP senders". The next step is easy: if the answer contained IP addresses or ip address ranges, you put those in your whitelist, in this case a

$ dig mx

would get you the data you need (possibly after a few more host commands).

Frankly it would be a lot better if those sites learned to play well with greylisting, but if you choose to whitelist them anyway, at least this way you take their word for what their valid senders are.

Sunday, July 22, 2007

The noise, we ignore it

If anybody had expected my earlier posts about harvesting addresses to a traplist and publishing it and observing the results would lead to earth-shattering discoveries or headlines in major publications worldwide, I can tell you now: They did not.

This could mean that spam is now boring and ignorable, and in fact the data I've accumulated indicate that when it comes to spam and spammers, it all falls into a category of noise we are more than happy to just ignore.

In the semi-random samplings from the noise the spammers generate, there are some interesting observations. For one thing, one or more spammer operations have picked my handful of domains purported return addresses for their messages, and they have been doing this at least since some time in June, possibly longer. Judging from the addresses in the backscatter, there is likely two or three groups actively generating or making up addresses, with distinct methods.

One is to pick a domain and a word, feed it to the program which then generates a pair of addresses. Spammer picks the keyword flaunting and the robot spits out and

The other method is to just pick a word and stick the at sign and the domain on afterwards, such as

The third one, which appears in several variations, is to generate or make up what could at a stretch look like aliases based on people's first and last names, such as and (nevermind that amidala, the now-retired laptop never ran its own mail service) or just the first name such as

Another variation which had me a bit puzzled was what is probably designed to look like our domain is testing for mail deliverability, such as

And finally, of course there's the bottom feeders who try to use message-IDs and other junk extracted from news spools or the local Microsoft Outlook user's mailbox, any of the addresses with fsf@ in them and a few others clearly fall into that category, and is a likely indicator that somebody, somewhere has news or mailing list mail which originated at my current laptop stored where spamware can find it.

Spammers have been working very hard lately. On July 17th, Bob Beck's traplist (which is generated by greytrapping at University of Alberta and which Bob makes available to anyone who wants it - see the PF tutorial for details), reached what I believe is the all time high, just a few short of a hundred thousand addresses. The actual number was 99941, at 20:00 CEST (that's 8PM in Imperial measure), it's been dropping off since then.

My more or less purely backscatter based lists during the same period grew to roughly 500 addresses in the local traplist, and the number of hosts actually trapped here as far as I can tell never went much over 400 at any time.

One interesting factlet that came up during the week is that Google, by the looks of it, is using SPF correctly. One BLUG member reported that one of my messages about the traplist to the BLUG mailing list had been tagged by Google Mail as possible spam. Exactly what triggered the classification was never revealed to me (he had already deleted the message when I asked him to take a look). But it made me go back and check the SPF records for our domains, and quite right, they were overly permissive. Editing them took a few moments, and the test messages sent only a couple of minutes later went through with headers indicating that they do check for SPF. Nice, GoogleMail! Next, might we have a chat about playing nice with greylisting?

The conclusions come rather naturally: Spam in general gets ignored or filtered correctly by the vast majority of Internet domains, the free tools are extremely effective, and even when the spammers go all-out in their efforts and possibly are even trying to actively inconvenience somebody, they're not getting much traction at all.

The main technical problem remaining is that vast number of unmaintained machines out there which bend over obediently to let the spamware install itself and run by remote. On the server side there are sites which still do not play nice with greylisting, but they will see sense eventually I hope.

We just ignore the noise, except for a few of us who see patterns in the noise and a few hopeless cases which it seems will do their traplist time indefinetely.

That chapter is a bitch to write, but getting there.

Thursday, July 19, 2007

Linux is easier than Windows, hands down

It all started a few weeks back when my wife and I decided to get her a new laptop. It was also conveniently close to her July 14th birthday and Dell had a huge campaign going in the newspapers. Dell had just started selling laptops with various Linuxes preinstalled, and Ubuntu was the environment of choice, so I felt reasonably sure that the machine would be usable even if the Linux preinstalled option is not available in Norway yet. After all, our daughter had essentially no trouble at all getting her laptop (a sleek little LG number) going with Ubuntu earlier this year.

So I ended up clicking my way on to the Dell web site and ordering an Inspiron 1520 with 1440x900 resolution and the Ruby Red option (actually it's just the lid, but it does look distinctive).

As luck and Dell's suppliers would have it, the color added some time to the delivery. The machine apparently shipped earlier than expected, and arrived here last Monday afternoon, most likely very close to when the plane with my wife and daughter touched down at Madeira.

I had already burned a 7.04 desktop iso to CD and booted from that. As with anything that comes with Microsoft preinstalled, you need to be really careful to hit the right key at the right moment unless you want to do battle with the Microsoft installer.

Eventually I succeeded, though, the thing booted, went into the traditional graphic Ubuntu startup and dropped me to a shell. Odd. Fortunately from my OpenBSD laptop I was able to find this post which explains in some detail what needs to be done. In particular, the piece about how to fix the graphics driver is quite instructive.

Getting things done right when the graphics card and the wireless network card both need proprietary drivers loaded is a little puzzling when the system is supposed to be all graphic and your default network is wireless, but by following the instructions from the forum I made it eventually.

The short version is, in addition to the oddities involved in getting proprietary drivers the machine requires, the Feisty installer kernel for some reason does not have the proper support for the SATA controller in this Dell model.

Going back to the earlier version lets you do a clean upgrade, and to get all the bits working you need to have both the universe and the multiverse repositories in your package manager's configuration. Not hard for somebody who has not yet recycled all the grey cells with Debian etched into them, just a bit tedious, and even though all steps of the process was accompanied by sensible messages from the system, our only live Ethernet is in the attic where the servers live.

Anyway by elevenish in the evening I had the system all installed with native resolution and 32 bit color depth, on the wireless network and Just Working. The process took significantly less time than putting Windows from restore CDs back on the system it came with.

Yet another evening spent on other things than I had intended (such as writing those crucial bits of the book), but seriously folks, you must never miss an opportunity to make your wife happy.

With that out of the way, I can state with even more confidence:

Linux is easier and more user friendly than Windows.

By now I consider that to be documented beyond question.

But for the things I care about, I really prefer OpenBSD or FreeBSD.

Updates: The spamtrap is growing by a few new addresses a day. I sometimes spot them flying by in the xterm where my spamd log is tail -f'd, sometimes I grep them out of the mail server's logs. Some patterns are emerging, but more later when I have more data.

Friday, July 13, 2007

Spam is a solved problem

Executive summary: Spam is a solved problem, email works again. There are a few knuckle draggers out there who haven't noticed yet, but we'll get around to dealing with those shortly.

I've been looking over my log summaries again. My regular logs get rotated out of existence after seven days, but from the summaries I do keep around, it looks like various made up addresses have been used as spammers' fake From: addresses for about a month. I was too busy with other Very! Urgent! Things! to notice at first, but it finally dawned on me when I searched my mail server logs for "Unknown" as in "Unknown user" and saw from the results that somebody, somewhere, was using that domain for generating sender addresses.

After about two weeks of observation and collecting made up or generated addresses for my traplist, my conclusion is what the title of this post says. Spam is not a problem anymore. I know, of course, that "how to cope with email and spam" self help guides are best sellers, and a recent piece even went so far as saying about email,
Now, it seems, we're drowning. There's simply too much e-mail. The tide of spam buries valuable messages.

That kind of surprises me, because it's not what we're seeing here at all. Of course we know that there's a lot of junk being sent, but ask any of the people on the sites I run on any given day how much spam they've received recently and they have to look up the date of the last one in their "Junk mail" folders.

I do see some spam myself, mainly because I still fetch and read mail I receive at an ISP address I've used a lot for USENET and mailing lists over the years. And since unfortunately no method ever has a zero error rate, occasionally a spam message or two trickles through that shouldn't have on the systems I run myself. But if the tide of spam buries valuable messages, you haven't kept up with the technology, plain and simple.

By and large, from the perspective of somebody who has been the purported sender of an unknown portion of the tide which drowns out the writer's messages, it looks like spam is treated correctly or at least in ways that do not annoy others unnecessarily at most sites. (In all fairness that piece is more about email versus other types of writing than technical matters, and certainly worth reading for that reason. The same writer has written a number of other articles which are worth your time too.)

At the last count, our main spamd running gateway had all of 316 addresses in the local spamd-greytrap table, meaning that only that many hosts have actually tried to send mail to one or more of the addresses listed at our spamtrap page during the last 24 hours. Some of the trapped machines would have been active spam senders, and most of the rest seem to have been sites which were configured to receive spam and bounce back to the From: address when the spam was not deliverable.

That is an important point to note. If your system sends a 'message undeliverable' bounce message for spam sent to a non-existent user, it is configured to deliver spam to the users you do have, and there are certainly ways to avoid that. I've decided not to plug any of my other writing directly in this post, but you should be able to find the references easily enough if you're interested.

Reading the spamd logs is sometimes quite entertaining if you're that kind of guy or girl. Here is one example of a site with clearly deficient spam and/or malware filtering, possibly their own homebrew:

Jul 14 09:13:28 delilah spamd[29851]: From:
Jul 14 09:13:28 delilah spamd[29851]: To:
Jul 14 09:13:28 delilah spamd[29851]: Subject:
Delivery Status Notification (Failure)

It does not matter much to us, but they'll be unable to get mail through to us for the next 24 hours.

The next one is clearly problematic, since whoever set up the system appears to have left back in the time when there still was a chance that spammers used real addresses. Or maybe the poor wretch stayed on and now suffers from delusions, incompentence or both:

Jul 13 14:36:50 delilah spamd[29851]:
Subject: Considered UNSOLICITED BULK EMAIL, apparently from you
Jul 13 14:36:50 delilah spamd[29851]:
From: "Content-filter at" <>
Jul 13 14:36:50 delilah spamd[29851]: To:

Would SPF have helped? Possibly. We have our records set up, but clearly these guys are not using it in any meaningful way, and after -- what is it -- five years it's still not clear which of the competing RFCs with varying degrees of proprietary content is going to come out on top.

Staying out of our traplist would have saved some resources on their side. On our side, well, we have a working system. Email works. Mail from us has to go through our mail server. Incoming mail needs to clear spamd's greylisting (and really needs to come from IP addresses which are not in any of the blacklists we use) and pass content filtering inspection by spamassassin and clamav, all of it conveniently within reach on any freely available BSD system. The content filtering packages are available on your favorite Linux as well, but on our sites, we use OpenBSD and FreeBSD.

Living spam free and unworried by malware is possible. If you make a few right choices it's actually easy, it doesn't cost much and just imagine how much of your time you stop wasting.

Monday, July 9, 2007

Hey, spammer! Here's a list for you!

Last week I started noticing from my log summaries that my mail servers had seen a lot more mail to non-existent users than usual. This usually happens when somebody has picked one of our domains as the home of their made-up return addresses for their spam run. This time, from the looks of it, the spam runs were mainly targeted at Russian and Ukrainian users. At least that's where most of the backscatter appears to have come from.

As I've written before in the PF tutorial and the malware paper (updated version available as this blog post -- that's the end of today's plugging, I promise), I've used the "Unknown user" messages as a valuable data source for my spamtrap list, just quitely adding addresses that looked really unlikely to ever become valid. After a brief airing on the OpenBSD-misc mailing list and running it by my colleagues at Datadok and Dataped, I've decided to take it a bit further.

Now that I've got a list of addresses which will never receive any legitimate mail, I really want spammers to try to send mail to those addresses. After all, if they send anything to an address which consists of a random string with one of our domains stuck on after the '@', we know it's all spam from there on.

We don't care about the rest, for the next 24 hours. Your SMTP dialogue with us (actually our spamd) will be all a-stutter, receiving answers one byte at the time until you give up. For the record, that usually takes about 400 seconds, with the really imbecile ones taking a lot longer. See the paper or the tutorial for some numbers.

The other possibility is of course that your system is set up in a way which makes it actually receive and try to deliver spam. Some of the spam will be addressed to non-existent users in your domain, so if your users receive spam, you will be trying to send bounce messages back to the purported sender for spam to non-existent users. That's tough, kid. If you're set up that way, your machine will be treated to the tarpit here for the next 24 hours. All a-stutter and all that. Repeat offenders stay there longer.

Now for the spamtrap list, I've checked that my colleagues and associates have never actually wanted to use those addresses for anything, and I made this page which wraps it all in a bit of explanation. For some reason, the list keeps growing each time I look at my log summaries.

When I get around to it and find a visually not-too-horrible way to do it, I'll include links to that page where they fit naturally on our web sites. In the meantime, here's hoping that the spammers' address harvesting robots find this list and put it to good use.

The chapter, it's improving. More later.

UPDATE 12-jul-2007: The softer side of me ponders the possibility of sending email form letters to the various postmaster@s with the URL to this blog post. On the other hand, I'm not sure I'm ready for another round of finding out that postmaster@ is in fact not deliverable at a surprising number of sites around the world.

One other thing I've noticed since I published the traplist is that bounces to addresses like have started appearing in the logs. I don't see how messages like these could be useful by themselves, but the addresses are of course obvious traplist material.

13-jul-2007: Oddly enough, there's still a stream of backscatter, and my logs tell me a few new addresses turn up every day. This morning's fresh ones were, and Another few bytes to help weed out the bad ones early, thanks to the robots out there.