Anyone using sa-learn?



dansroka
02-08-2004, 11:04 PM
Is anyone successfully using sa-learn?

I have the latest version of SpamAssassin installed, but I am still getting 30-50 spam per day. My Apple OS X Mail does a great job of filtering most of them out, but I'd love to have that Bayesian filtering happen on the server instead of locally. So I was thinking of starting to use sa-learn. But when I read the documentation on the SA website, it sounds like you need to save thousands of emails to train the filter. Ugh, what a pain.

Also, has anyone figured out a semi-automatic way to run sa-learn?

jalal
02-09-2004, 01:48 AM
It should run automatically. Or at least, it runs on my setup.
In the SpamAssassin headers of the emails that get processed, you should have a little line that says 'learn=yes' (or no) or something similar.
But it is true, until the Bayesian filters have processed a certain number of emails, the results will be meaningless. But at a few hundred a day, that didn't take too long :)
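
To see what the learner has actually been doing, you can count the autolearn values across a mailbox. This is only a rough sketch: the function name count_autolearn is made up for illustration, it assumes the header field reads autolearn=<value> (as SpamAssassin 2.6x seems to write it), and it relies on grep -o, which GNU and BSD grep support but POSIX does not require. It will also count any matching text that happens to appear in a message body.

```shell
#!/bin/sh
# Count SpamAssassin autolearn results in an mbox file.
# Assumption: $1 is the path to an mbox file whose messages carry an
# X-Spam-Status header containing "autolearn=<value>".
count_autolearn() {
    # -o prints only the matched text, so it does not matter whether the
    # header was folded onto a continuation line.
    grep -o 'autolearn=[a-z]*' "$1" | sort | uniq -c
}

# Example (hypothetical path): count_autolearn /var/mail/yourname
```

Each output line is a count followed by the value it was seen with.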

dansroka
02-09-2004, 09:12 AM
That'd be great. So, just turning it on, it will learn to recognize that everything over your score threshold is "spam", and everything below is "ham"?

I was under the assumption you had to collect the email yourself, and then run the sa-learn script on it manually. But, oh, now I see -- you only need to do that when you get false negatives or false positives. So, how do you set that up? I assume you set up two dummy mail accounts on the server, one for "ham" and one for "spam". Then, if my email client gets a message that really is spam, I forward it to my "sa-spam" mail account, then eventually run the sa-learn script on it. Is this correct? (And the same for mail marked as spam that really isn't.) SA understands that even though I forward the email, *I* am not the spammer, right?

Yes, I have been getting slammed with spam. One email address in particular gets 90% of it, and even though it is my main personal address, I am seriously considering deleting it. (Sigh.) SA helps a great deal, but I've noticed over the past few months that the spammers seem to be moving faster, and more and more spam gets by SA.

As always, I appreciate your help!

Dan

jalal
02-09-2004, 09:33 AM
My main email address gets around 300 spam a day, and the others get another 50 or so. Each day about 10 get through to me, the rest are sent off to /dev/null.
But it's been a fair bit of work to get to that stage.

I haven't done what you are suggesting, setting up ham and spam boxes. I let SA learn from what goes through the filters. If it's flagged as spam, then it also learns from it. The main reason I haven't is that I pull all my mail with fetchmail, so it doesn't stay on the server for very long.

I also thought of changing my main email, but I've been using it for nearly 8 years and it seems like giving in to the spammers to change it, so I've tackled it from a technical point of view. It's been a great learning experience and I'm glad I did it.

:)

dansroka
02-09-2004, 09:43 AM
Exactly! My email address is the only way to contact me that hasn't changed many times over the years. I'd hate to lose it to Viagra salesmen.

I'll give the Bayesian filter a try, and look into setting up the spam/ham accounts. I only send spam with a score over 10 to /dev/null, so I have a little buffer against false positives.

dansroka
02-09-2004, 10:15 AM
I did a little research, and here's what I found. (In case anyone else wants to try this.)

If you are receiving email in an email client (downloaded off the server), you can set up a ham and spam email account to better train the SA filter. However, you need to be careful that your email program doesn't change the mail in any way -- sa-learn will read the mail as is. Read more about this here: http://wiki.spamassassin.org/w/UsingAnAccountForLearning

I use Apple Mail, and its redirect command only minimally changes the mail, adding a couple of additional headers. So I can instruct SA to just ignore those headers by adding to the local.cf file:

bayes_ignore_header Resent-To
bayes_ignore_header Resent-From


So, now if any mislabeled email gets through, I can redirect the spam to my <spam@myhost.com> email address, and redirect the non-spam to <ham@myhost.com>. Then when I collect enough, I run:



/sa/bin/sa-learn --showdots --mbox --spam /var/mail/spam
/sa/bin/sa-learn --showdots --mbox --ham /var/mail/ham
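
The two commands above could be wrapped in a small script so both mailboxes get processed in one go. This is a sketch, not a tested setup: the function names learn_mbox and learn_all are made up, and the defaults simply reuse the /sa/bin/sa-learn path and the /var/mail/spam and /var/mail/ham files from this post, so change them if your install differs.

```shell
#!/bin/sh
# Feed the collected spam and ham mailboxes to sa-learn.
# Assumptions: paths below are the ones used in this thread; each can be
# overridden through the environment.
SA_LEARN=${SA_LEARN:-/sa/bin/sa-learn}
SPAM_MBOX=${SPAM_MBOX:-/var/mail/spam}
HAM_MBOX=${HAM_MBOX:-/var/mail/ham}

learn_mbox() {
    # $1 is "spam" or "ham", $2 is the mbox file to learn from.
    if [ -s "$2" ]; then
        "$SA_LEARN" --showdots --mbox "--$1" "$2"
    else
        # Nothing collected yet (file missing or empty), so skip it.
        echo "skipping $2: missing or empty" >&2
    fi
}

learn_all() {
    learn_mbox spam "$SPAM_MBOX"
    learn_mbox ham "$HAM_MBOX"
}

# Call learn_all by hand or from cron once enough mail has piled up.
```

Skipping empty files keeps the script quiet on days when nothing was redirected.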


One cool thing about Apple Mail is that it has its own Bayesian filter, which I have been training for a couple of years. To take advantage of this, I added a rule to its junk mail script, so if it thinks a message is spam, it will automatically redirect it to my spam@myhost.com email address. Two birds with one stone. We'll see how this goes!

Dan

jalal
02-09-2004, 10:29 AM
Hey, that sounds pretty cool. I could probably do something similar here (KDE, KMail, Linux).

Good idea Dan

dansroka
02-09-2004, 10:31 AM
ARGH. OK, maybe not as easy as I thought. My mail program makes many more subtle changes to the email headers (additional Received info, different X-Mailer, content-type, etc), and I don't want to tell SA to ignore all of them (since some of them may be strong spam indicators). So I don't think I can easily redirect missed spam to my SA spam email account. Argh!

I think my short term solution will be to just collect all the spam that SA misses. Since Apple Mail uses mbox format, I could just upload that "spam" mbox file to my server, and run sa-learn on that.

Does anyone else have a better way of doing this? I could try writing an AppleScript to copy all the raw mail text, but... ugh, I don't have the time!

FZ
02-10-2004, 01:55 PM
Dan,

I find the best way to do it is to use SSH - specifically, Pine and sa-learn from the command line. Of course, this would only work if the e-mail address you want to learn for is the same one you use to login with (if not, let me know - I'll guide you how to change that). A brief overview of what I do:

+ login via SSH when I see spam in my mail account (via Webmail or an e-mail checking utility - because an e-mail program downloads the mail and then deletes it off the server) that SpamAssassin hasn't caught (or when I see legitimate mail that I want to learn)

+ fire up Pine, navigate to the mail folder in question, and export each message to a mailbox file called "spam" (first export creates it; subsequent requests I just append to file). Do the same for ham (with a file called "ham", obviously).

+ then just sa-learn --showdots --spam --mbox /spam (and --ham --mbox /ham for ham).

Apart from this, you should install Net::DNS (via CPAN) - it enables network tests for SpamAssassin and so allows it to catch spam that would otherwise not generate a high enough score.

Let me know if you need help. Good luck.

dansroka
02-10-2004, 03:48 PM
Thanks Fayez. I tried my method (above) and it seemed to work. Since my email program uses the mbox file format, I just pulled the spam SA missed into an mbox file, uploaded it to my server, ssh'd in, and ran the sa-learn command on it. Processed fine.

I'm interested in anything that makes SA better! I upgraded to SA 2.63, but would love to know how to install Net::DNS. (I think I understand what it does.) Thanks!

Dan

FZ
02-10-2004, 04:30 PM
Dan,

SSH in to your account, and then type the following:

cpan

If you have not run CPAN before, it will ask you some questions - you can accept the defaults by just pushing enter after each question. After that, you should be in the CPAN "shell", meaning you should see CPAN> at the prompt. Now type the following:

install Net::DNS

A lot of text will start appearing. You don't need to read it all - just wait until it is finished and you are back at the prompt, then look a few lines above to make sure that it completed successfully. If so, type the following:

exit

spamassassin -D --lint

Have a look at the text that appears, and make sure you see something like the following:


debug: is Net::DNS::Resolver available? yes
debug: trying (3) google.com...
debug: looking up MX for 'google.com'
debug: MX for 'google.com' exists? 1
debug: MX lookup of google.com succeeded => Dns available (set dns_available to hardcode)
debug: is DNS available? 1


It won't necessarily use Google.com for you too - but otherwise the rest of the text that appears should be similar to this and indicate that network tests are now available. Now all you need to do is sit back and let SpamAssassin do its thing. Of course, since all preset scores are very low, you will need to work on incoming mail (and the new tests that will show up) and customise network test scores to make sure it is marked as Spam.
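
If you would rather not read through the whole debug dump, the check can be scripted. This is a minimal sketch that assumes the debug output contains the exact "is DNS available? 1" line shown above (other versions may word it differently); the function name dns_available is made up for illustration.

```shell
#!/bin/sh
# Report whether SpamAssassin's debug output says DNS checks are usable.
dns_available() {
    # Reads debug text on stdin; succeeds only if the DNS line reports 1.
    # In a basic regular expression the "?" is matched literally.
    grep -q 'is DNS available? 1'
}

# Debug output is written to stderr, hence the 2>&1:
# spamassassin -D --lint 2>&1 | dns_available && echo "network tests enabled"
```

The function exits non-zero when the line is missing or reports 0, so it can be used directly in an if statement.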

Let me know if you need help.

dansroka
02-10-2004, 05:55 PM
Thanks, easy enough.

When you say "all preset scores are very low", I assume you mean that SA's scores for DNS testing are set low. Which SA tests have been "empowered" by this installation of Net::DNS?

Dan

FZ
02-10-2004, 07:56 PM
Yes, I do mean SpamAssassin scores. As for exactly which scores, I don't know of a list - but check out http://www.spamassassin.org/tests.html and the first few tests where the Area Tested is header are network tests, for example the NJABL and SORBS tests. By "customising network test scores" I meant you will need to examine the score each of these network tests assigns to your mail (be it legitimate or not), and adjust it accordingly to suit your needs.

dansroka
02-10-2004, 11:00 PM
Thanks! Makes sense.

Thanks again Jalal and Fayez for all your help on setting this up.

Dan

FZ
02-11-2004, 01:32 PM
No problem :)

ccwebb
02-13-2004, 06:00 PM
After reading the posts to this thread I have a curiosity question.



From Jalal:
I let SA learn from what goes through the filters. If it's flagged as spam, then it also learns from it.


What exactly is spamassassin "learning" from processing email that it has flagged as spam?

Thanks....

Charlie

jalal
02-14-2004, 01:56 PM
Patterns of text that it can apply to its Bayesian filters.

SpamAssassin has two techniques, one is the set of filters (that you can read through if you are interested... they're somewhere under /usr/local/perl/lib/....) and the Bayesian learning system. The filters can flag something as spam from their rules, the Bayesian bit can analyse the text and learn from it.

ccwebb
02-16-2004, 06:11 AM
I installed Net::DNS per Fayez's instructions. That seemed to go well.

Then I exited - entered the next command and it failed.

Here is the log:


cpan> exit
Lockfile removed.
[webbplace][~]$ spamassassin -D --lint
sh: spamassassin: command not found
[webbplace][~]$




I think maybe because I installed spamassassin using jalal's procedure that the command has to be different.
Help!
Charlie

ccwebb
02-16-2004, 06:18 AM
I think I have figured it out

I needed to enter /sa/bin/spamassassin -D --lint

Charlie

ccwebb
02-16-2004, 06:25 AM
Can someone help me with what I want to do in order to "turn on" sa-learn? What should the command look like? Does it go into local.cf?

Thanks

Charlie

jalal
02-16-2004, 08:42 AM
It should be on by default.

When you run /sa/bin/spamassassin -D --lint < /dev/null

the output should include a couple of lines showing the status of sa-learn.

ccwebb
02-16-2004, 02:06 PM
jalal:

I do not think it is on. Here is from an email header:


X-Spam-Status: No, hits=2.6 required=4.0 tests=RCVD_IN_DYNABLOCK,
RCVD_IN_SORBS autolearn=no version=2.63


here is from the putty log:


[webbplace][~]$ /sa/bin/spamassassin -D --lint < /dev/null
debug: Score set 0 chosen.
debug: running in taint mode? yes
debug: Running in taint mode, removing unsafe env vars, and resetting PATH
debug: PATH included '/bin', keeping.
debug: PATH included '/usr/local/bin', keeping.
debug: PATH included '/usr/bin', keeping.
debug: PATH included '.', which is not absolute, dropping.
debug: PATH included '/usr/local/apache/bin', keeping.
debug: Final PATH set to: /bin:/usr/local/bin:/usr/bin:/usr/local/apache/bin
debug: ignore: using a test message to lint rules
debug: using "/sa/share/spamassassin" for default rules dir
debug: using "/saconf/mail/spamassassin" for site rules dir
debug: using "/home/webbplace/.spamassassin" for user state dir
debug: using "/home/webbplace/.spamassassin/user_prefs" for user prefs file
debug: Score set 1 chosen.
debug: Initialising learner
debug: is Net::DNS::Resolver available? yes
debug: trying (3) nytimes.com...
debug: looking up MX for 'nytimes.com'
debug: MX for 'nytimes.com' exists? 1
debug: MX lookup of nytimes.com succeeded => Dns available (set dns_available to hardcode)
debug: is DNS available? 1
debug: all '*From' addrs: ignore@compiling.spamassassin.taint.org
debug: running header regexp tests; score so far=0
debug: running body-text per-line regexp tests; score so far=1.27
debug: Razor2 is not available
debug: running raw-body-text per-line regexp tests; score so far=1.27
debug: running uri tests; score so far=1.27
debug: uri tests: Done uriRE
debug: running full-text regexp tests; score so far=1.27
debug: Razor2 is not available
debug: DCCifd is not available: no r/w dccifd socket found.
debug: Current PATH is: /bin:/usr/local/bin:/usr/bin:/usr/local/apache/bin
debug: DCC is not available: no executable dccproc found.
debug: Pyzor is not available: pyzor not found
debug: all '*To' addrs:
debug: RBL: success for 1 of 1 queries
debug: running meta tests; score so far=1.27
debug: is spam? score=1.27 required=5 tests=DATE_MISSING,NO_REAL_NAME
[webbplace][~]$


Charlie

FZ
02-16-2004, 02:32 PM
Hmm, it looks like you don't have DB_File installed - which is required if you are running the latest version of SpamAssassin (2.63, I think). If you don't have it installed, then SA (2.63 onwards) can't make use of Bayesian filtering. For me, after the line "debug: Initialising learner", I have the following:


debug: using "/home/username/.spamassassin" for user state dir
debug: bayes: 29326 tie-ing to DB file R/O /home/username/.spamassassin/bayes_toks
debug: bayes: 29326 tie-ing to DB file R/O /home/username/.spamassassin/bayes_seen
debug: bayes: found bayes db version 2
debug: bayes: Not available for scanning, only 21 ham(s) in Bayes DB < 200
debug: bayes: 29326 untie-ing
debug: bayes: 29326 untie-ing db_toks
debug: bayes: 29326 untie-ing db_seen
debug: is Net::DNS::Resolver available? yes
...

Of course, I only got it working after manually messing around with DB_File (and its prerequisite Berkeley DB) and installing them after many, many failed attempts. Have a look at http://forums.westhost.com/phpBB2/viewtopic.php?p=9127#9127 and http://forums.westhost.com/phpBB2/viewtopic.php?t=1607&postdays=0&postorder=asc&start=15 if you are using 2.63 and need Bayesian filtering.

ccwebb
02-16-2004, 05:19 PM
Fayez:

If I had DB_File installed, where would it be?

Charlie

jalal
02-17-2004, 06:28 AM
I've just realized that the default WestHost installation has 'auto_learn' turned off in the '/etc/mail/spamassassin/local.cf' file.

That will also need to be turned on.
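
A sketch of what the relevant local.cf lines might look like. Treat the option names as something to verify: as far as I can tell, the 2.6x documentation lists the setting as bayes_auto_learn (with auto_learn being the older spelling), and the threshold values below are illustrative, not recommendations. Note that these thresholds are separate from the score at which mail gets tagged as spam, which may be why a message scoring a couple of points shows autolearn=no.

```
# local.cf - sketch only; check option names against your version's docs
use_bayes 1
bayes_auto_learn 1

# Illustrative thresholds: learn as ham below the first value,
# learn as spam above the second.
bayes_auto_learn_threshold_nonspam 0.1
bayes_auto_learn_threshold_spam 12.0
```

Running spamassassin -D --lint afterwards should show whether the file still parses.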

FZ
02-20-2004, 11:46 AM
Charlie,

Sorry for the late reply. I'm not actually sure where DB_File is installed by (WestHost) default (if at all). I just went ahead and installed the latest version on my own, though - I do remember Tim saying that an older version of DB_File is installed by WestHost. It (my new installation) seemed to fail, but in actual fact seems to be working fine - refer to the links I posted earlier. Again, you only need it if you want to use Bayesian filtering with the latest version(s) of SpamAssassin.

Fayez

ccwebb
02-20-2004, 01:02 PM
Fayez:

I find myself intimidated by the process of having to install Berkeley DB, then DB_File - especially if I then have to mess around with it to get it going. You said you had a lot of trouble getting it installed.

Since I am a unix lightweight I think I am going to back off trying to install this. Maybe soon there will be an easier way for us non-techie types.

SpamAssassin 2.63, along with my white and blacklists and rule set is getting about 90% of the spam. I guess for now I will be satisfied with that.

Thanks again for your help.

Charlie

FZ
02-20-2004, 01:35 PM
Charlie,

You have technical support (this forum) at hand ;) So if you really needed to, I'd encourage you to go ahead - I'll be here to help, as will others. But, of course, you should evaluate whether you really need it - I receive 400 mails a day (399 of them spam), about 50 of them autolearned as spam, meaning I don't get much ham to show sa-learn. So, even though I've been running it for a while, I haven't been able to enable Bayesian filtering (not enough ham). I've actually never tried it either, so I don't know if it will really work better than SpamAssassin already does. I've found that after installing Net::DNS and a couple of custom rulesets, SpamAssassin's accuracy has gone up to catching about 394 of the 399 spam mails. If you'd like me to post more info on the custom rulesets, let me know.

dansroka
02-20-2004, 05:27 PM
I "force-fed" sa-learn some ham. My mail client (Apple's Mail program) uses the .mbox format, so I took one of my saved mail folders that had a good assortment of emails from different sources, uploaded it to my server, and sa-learn-ed it. The Bayesian filter is now working, although like you said, I have no idea how well it is working.

Installing Net::DNS gave me a similar boost in spam filtering. Although, nothing seems to catch the recent storm of "net pharmacy" spams. Sigh, I don't want any of your discount drugs!

FZ
02-21-2004, 06:21 AM
Maybe it's time you guys checked out some custom rulesets:

1. http://www.emtinc.net/includes/backhair.cf
2. http://www.emtinc.net/includes/chickenpox.cf
3. http://www.emtinc.net/includes/weeds_2.cf

They're easy to install - just upload them into PREFIX/spamassassin/etc/mail/spamassassin (or wherever your local.cf is located) and that's it.

dansroka, you should definitely check out the "drugs" one: http://mywebpages.comcast.net/mkettler/sa/antidrug.cf

These are just a few of the ones I use, if you want more info (and more rulesets) check out http://www.exit0.us/

Note: You'll need to monitor your mail closely after installing these as you don't want it to mark previously legitimate mail as spam - especially not if you delete mail with a certain score!

Good luck.

ccwebb
02-21-2004, 07:53 AM
Fayez:

I did follow your previous posts and did install Net::DNS.

After doing that I do think that the spamassassin accuracy improved but I am not quite sure why. What does it do?

I am constantly tinkering with rules, whitelists and blacklists. I will look at some of the rule set links you just provided to see if there is anything in there that would help. As you know, the spammers are pretty good at staying a 1/2 step ahead.

Charlie

FZ
02-21-2004, 08:22 AM
Charlie,

What Net::DNS does is it enables "network tests" - in other words, it allows SpamAssassin to contact various blacklists that list (for example) IP addresses that are known for sending spam. Now, when you get mail, it will check against these blacklists (which are "live" and updated regularly) and will then add to the score the relevant tests that the mail in question triggered. What you need to do is have a look at the test names and their scores (http://www.spamassassin.org/tests.html - just do a search on the page for "NJABL" and you'll see the network tests) as well as the description and then decide whether you want to increase or decrease the score (or keep it at default if you think that is fine). For example, if legitimate mail often triggers a particular network test, you could disable it by giving it a score of 0. On the other hand, if you find that lots of your spam (that isn't marked as such) triggers one particular network test then you could increase the score for it.
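
As a concrete illustration, score overrides go in local.cf (or user_prefs) as one line per test. The two test names below are the ones that showed up in the X-Spam-Status header earlier in this thread; the numbers are made-up examples, not suggested values.

```
# local.cf - example values only, tune against your own mail
# Disable a network test that keeps hitting legitimate mail:
score RCVD_IN_SORBS 0
# Give more weight to a test that your missed spam often triggers:
score RCVD_IN_DYNABLOCK 2.5
```

It is worth changing one score at a time and watching the results for a few days before touching the next.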

ccwebb
02-21-2004, 08:54 AM
Fayez:

Once again, thanks a lot. You have been most helpful!

Charlie

FZ
02-21-2004, 09:00 AM
No problem. We must all unite in the fight against Spam! :twisted:

dansroka
02-21-2004, 12:27 PM
Fayez, thanks for introducing me to this ruleset idea - awesome! So all you need to do is upload them next to your local.cf file, and SA will include them? I'll give them a try.

What do you do: when you start getting a certain type of spam (e.g. more drug ads), you go to that www.exit0.us site and look to see if anyone created a relevant ruleset? Are there any others you've found very useful?

(By the way, could someone explain what this "wiki" is that I keep seeing on all the SA sites?)

FZ
02-21-2004, 07:53 PM
No problem. At the moment, what I do when I start getting Spam of a certain kind that passes through SpamAssassin, is the following:

1. sa-learn it as spam
2. check the tests it scored on, and adjust the default scores (make them higher)
3. if it did not score on any tests (or scored on those that I cannot raise without marking legitimate mail as Spam) then I just delete it. Now if I start getting a lot of the same kind of mail, and it contains the same pattern of text (e.g. subject or sender) then I use Procmail to /dev/null it.
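
For the Procmail part of step 3, a recipe along these lines would do it. The subject pattern is hypothetical, picked only to show the shape of a recipe; anything it matches is discarded for good, so keep the pattern narrow.

```
# ~/.procmailrc - hypothetical pattern, adjust to the spam you actually see
:0
* ^Subject:.*net pharmacy
/dev/null
```

While testing, delivering to a junk folder instead of /dev/null makes mistakes recoverable.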

As for what a "wiki" is, I've read it (the definition) many times, but I keep forgetting: http://www.exit0.us/index.php/WikiWikiWeb