This is only the second time in almost six days that I'm wearing something other than jammies, a sweatshirt, and mukluks. My Master's Comprehensive Exam paper was due at 2pm in both electronic and hard copy format. I thought I'd emailed it at about 1:40, but I forgot to run mhn to actually attach the files (so it just showed up as the file names, and I looked like a dolt). My dad gave me a ride to school in the Cadillac, I paid $1.91 off my Buff card to print the bugger, and I dashed upstairs, handing it in at 2:05. I claim that's close enough to machine precision. Or clock skew, or something.
No matter how long in advance I know about my assignment and topic (a whole year, in this case), it seems to take me right up until the last minute. As of, say, a month ago, I knew I'd probably be up all night last night. And I did it with only one coffee mug of green tea. At the end I was frantically finding papers to reference and skimming them to include data. I'm really not satisfied with that practice, but there just aren't many (academically) published spam filtering studies. I'd also originally planned to implement several algorithms and compare them. However, as of Friday night I decided that there was already enough research comparing good statistical algorithms on a rather lame corpus, so I thought I'd do better (and have an easier time) studying the effects of different tokenization schemes and other parameters. I'll post an HTML version of the paper soon.
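To make "different tokenization schemes" concrete, here is a minimal Python sketch of two ways the same message could be split into tokens. It is an illustration only, not the code or the schemes from the paper; the function names and the sample message are made up.

```python
import re

def tokenize_words(text):
    """Scheme A (hypothetical): lowercase everything, keep only alphanumeric runs."""
    return re.findall(r"[a-z0-9]+", text.lower())

def tokenize_keep_punct(text):
    """Scheme B (hypothetical): preserve case and treat $, ', - and ! as token characters."""
    return re.findall(r"[A-Za-z0-9$'!-]+", text)

# A made-up message, just to show how the two schemes differ.
sample = "FREE!!! Win $1000 now, don't wait"
print(tokenize_words(sample))
print(tokenize_keep_punct(sample))
```

The second scheme keeps "FREE!!!" and "$1000" as distinct tokens, which gives a statistical filter more spam-specific evidence to work with, at the cost of a larger vocabulary.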
But I feel accomplished. I wrote a Naïve Bayesian spam filter which, in the best arrangement, correctly classified all of my real mail from October, with a "mere" 62 spam messages (out of 1200) misclassified. With a little work and a lot of optimizing, I'll be able to plug it into my email system. And maybe now I can finally get around to doing something about the fact that I have 7575 messages in my inbox, dating back to late 1999. My ideal email client would allow database-style searching of old messages while keeping the clutter compressed. (Since I use mh, every message is a file, and grepping my whole inbox exceeds the UNIX limit on command line length.) I've idly pondered writing a client with an MH-like interface, a BerkeleyDB (or MySQL?) backend, and Perl hooks for easy extension. Using the Mail::* and MIME::* modules makes things pretty easy. Or maybe I should follow Paul Graham and write a mail client in Lisp.
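For anyone wondering what a Naïve Bayesian filter boils down to, here is a minimal Python sketch. It is not the filter described above: the class name, the add-one smoothing, and the log-odds threshold are all assumptions chosen to keep the example short.

```python
import math
from collections import Counter

class NaiveBayesFilter:
    """Toy multinomial Naive Bayes over token lists (hypothetical, for illustration)."""

    def __init__(self):
        self.token_counts = {"spam": Counter(), "ham": Counter()}
        self.message_counts = {"spam": 0, "ham": 0}

    def train(self, tokens, label):
        # label must be "spam" or "ham"
        self.token_counts[label].update(tokens)
        self.message_counts[label] += 1

    def spam_log_odds(self, tokens):
        """Return log P(spam | tokens) - log P(ham | tokens), up to a shared constant."""
        vocab = set(self.token_counts["spam"]) | set(self.token_counts["ham"])
        total_msgs = sum(self.message_counts.values())
        score = 0.0
        for label, sign in (("spam", 1), ("ham", -1)):
            counts = self.token_counts[label]
            total = sum(counts.values())
            # Smoothed class prior.
            log_p = math.log((self.message_counts[label] + 1) / (total_msgs + 2))
            for t in tokens:
                # Add-one smoothing, with one extra slot for unseen tokens.
                log_p += math.log((counts[t] + 1) / (total + len(vocab) + 1))
            score += sign * log_p
        return score

    def is_spam(self, tokens, threshold=0.0):
        # Raising the threshold trades missed spam for fewer false positives.
        return self.spam_log_odds(tokens) > threshold


f = NaiveBayesFilter()
f.train(["free", "money", "win"], "spam")
f.train(["win", "free", "prize"], "spam")
f.train(["meeting", "tomorrow", "lunch"], "ham")
f.train(["lunch", "notes", "paper"], "ham")
print(f.is_spam(["free", "prize"]))
print(f.is_spam(["lunch", "meeting"]))
```

Working in log space avoids underflow when a message has many tokens. A threshold above zero is one simple way to bias the filter toward letting spam through rather than losing real mail.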