Re: A Plans for Spam

1

As a WordPress user, I have gotten tremendously satisfying results from Akismet. They just released a version for MT, as well.

horizontal rule
2

I am for anything that would reduce the demand/load on mt-comments.cgi to help keep the server load down. (I know - broken record!) Would this accomplish that, or not, since the spam messages would still hit mt-comments.cgi and just not get published?

horizontal rule
3

WordPress seems to have a Bayesian spam filter, but I'm guessing you guys aren't about to change blog technologies.

Here's one for MT, but it doesn't seem to recommend itself highly.

CAPTCHA+whitelist sounds more hackable than, say, requiring registration and login, which is basically the same thing. You want to authenticate trusted commenters, right?

horizontal rule
4

Is the (or, is a) problem that MT uses the same script for posting and viewing comments?

horizontal rule
5

I should note that the Akismet peeps are pretty circumspect about identifying how everything works, but it certainly sounds like it's Bayesian filtering.

horizontal rule
6

CAPTCHA+whitelist sounds more hackable than, say, requiring registration and login

Registration + login places a greater burden on new commenters, though. It took me forever to finally leave a comment at Tim Burke's blog, because of the registration process. That probably works to Burke's favor, since he is a Serious Fellow, but we here at unfogged should be more accommodating to passers-through. And while a simple-minded CAPTCHA is more hackable, the relevant question is, would it be hacked? I don't think comment spammers take their cues from what we're up to.

Is the (or, is a) problem that MT uses the same script for posting and viewing comments?

If it weren't for that, we could rename the posting-comments script.

horizontal rule
7

Oh, it's not the CAPTCHA that I'm worried about being hackable... it's the checkbox and cookie. But I'm sure there's more to your scheme that I'm not understanding yet.

And yeah, I get that renaming the mt-comments script won't fly because it would break links; what I guess I meant was, do you think the load problem is because spammers are beating on mt-comments.cgi? Or because all of us clowns keep reloading it?

horizontal rule
8

Related, in case anyone reading this can help: the reason we're getting the Internal Server Errors is that mt-comments.cgi is a memory hog and occasionally hits the limit that causes it to get auto-killed by our new host. I suppose this is preferable to what happened with our old host, which is that they would let it spin out of control until things got so bad that they locked out our site. They doubled our memory limit, which is why the ISEs are a little less frequent now, but I just spoke with them and there's no way for us to upgrade to another shared hosting plan with a higher memory limit. All of their shared plans (including their high-volume ones) have the same limit that we currently have (actually, a lower one since they raised ours) and the only way to get more memory is a dedicated server plan, which is hella expensive.

So, any tips/tweaks that people know that can help us reduce the memory used by mt-comments.cgi would be much appreciated.

horizontal rule
9

Oh, it's not the CAPTCHA that I'm worried about being hackable... it's the checkbox and cookie.

Yeah, the whole thing. (Have you seen the Fistful of Euros CAPTCHA? It's … not that complex. But apparently works fine.) Do comment spammers send along cookies?

Becks, can you share what the current memory limit is?

horizontal rule
10

Could I speak up for a new kind of captcha? I don't mind them. But the random-series-of-characters ones just don't do anything for me spiritually. Why not a graphic that requests the user to type in what it is a picture of? Or alternately a photo of a cock that asks the user to type in its state of tumescence. That would be more fun than SXXILV.

horizontal rule
11

Can I just say that I'm loving The [Choose-Your-Own Adventure] Kid's metacommentary on his comments?

horizontal rule
12

Or even just a line drawing, with instructions "Please identify this cock as flaccid or engorged." I don't think spam bots could handle that.

horizontal rule
13

Again, are we worried about breaking intra-site links or links to Unfogged from other sites? Because changing the intra-site links to point to a renamed mt-comments.cgi would be a pretty easy case of exporting the data from the database, running some data conversion commands, and reimporting it. I just ran a similar script last night to fix links to cached articles...and none of you were the wiser.

horizontal rule
14

Hey thanks for the props, Cala!

horizontal rule
15

I am the wiser!

horizontal rule
16

Ben, their terms and conditions that I was referred to state:

Users may not, through a cron job, CGI script, interactive commands, or any other means, take the following actions on pair Networks servers:
* Run any process that requires more than 16MB of memory space.
* Run any program that requires more than 30 CPU seconds to complete.

They said they doubled our memory limit, so I assume we get killed whenever we exceed 32 MB.

horizontal rule
17

What are our bandwidth/space uses?

horizontal rule
18

Hey thanks for the props, Cala!

Yes, well, I give you the opposite of props. I give you: SPORP. So you have zero net approbations.

horizontal rule
19

We need at least 2 GB of space and 40 GB bandwidth.

horizontal rule
20

If we knew what kind of CPU resources we needed, and what the hell unixshell means by a "unit", and were willing to move hosts, and to move to a host where we'd have to install and configure everything ourselves, we could use the unixshell 160 or 192 plans. That would involve some low-level mucking, of course.

horizontal rule
21

Who is this "A" who is planning for spam?

horizontal rule
22

Firefox remembers our captcha for me. I'm grateful for that feature, b/c MT's 'remember info?' cookie hardly ever works for me nowadays.

horizontal rule
23

A representative of the aesthetic stage of life.

horizontal rule
24

Who is this "A" who is planning for spam?

Surely you of all people have read Perec?

horizontal rule
25

I must say that unixshell's "we give you no support whatsoever" disclaimer gives me pause.

horizontal rule
26

That's par for the course for that kind of virtual-server setup, from what I've been able to tell; after all, you're installing your own OS+programs, so their ability to support you is somewhat hampered by your ability to do whatever the hell you like.

horizontal rule
27

Why is submitting a comment using 32M of memory?

horizontal rule
28

27 - That's what I want to know.

26 - The idea of moving servers again, I suppose I could handle. But the idea of redoing everything we've done in the last week PLUS installing everything makes me want to curl up in the corner and cry. How bare is this server we're talking about? We're not just talking "install Movable Type" -- we'd even have to do crap like configure sendmail, right?

horizontal rule
29

The checkbox thing is not a great idea, I don't think -- spammers can use cookies (I don't know if they would, but it's possible). I'd suggest making commenters periodically get a cookie with a captcha on the main page (in the sidebar, maybe? duplicate it in comments?).

This could potentially help a *lot* for reducing mt-comments load: you can filter specific requests (e.g. POST to mt-comments.cgi) within .htaccess based on the presence of a cookie (I don't know how, but I believe the ancient legends our server guy tells us -- I'm sure the unfogged technical hivemind could figure it out). That would pretty effectively prevent spammers from introducing a load to the system.
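For anyone trying to picture the check, here is a minimal sketch of the decision logic only, in Python rather than as actual .htaccess rules. The cookie name `captcha_ok` and the function name are made up for illustration; nothing here is from MT or from our host's config.

```python
from http.cookies import SimpleCookie

# Hypothetical cookie name; the thread does not specify one.
CAPTCHA_COOKIE = "captcha_ok"


def should_block(method: str, path: str, cookie_header: str) -> bool:
    """Return True when a request should be refused before the CGI runs."""
    script = path.split("?", 1)[0]  # ignore any query string
    if method.upper() != "POST" or not script.endswith("mt-comments.cgi"):
        return False  # reads, and other scripts, pass through untouched
    cookies = SimpleCookie()
    cookies.load(cookie_header or "")
    # Only POSTs that carry the captcha cookie are allowed to reach the script.
    return CAPTCHA_COOKIE not in cookies
```

The point of doing this in front of the script is that a refused request never starts mt-comments.cgi at all, so it costs next to nothing in memory.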

(atm)

horizontal rule
30

Becks: we'd have to do crap like install sendmail. And, for that matter, choose a distro and install that.

horizontal rule
31

I was going to say, how are we sure that the memory problem happens on the POST rather than on comment reads, but then I remembered that we never get the 500s on GETs, just on POSTs.

I don't have an MT installation handy where I can look at the mt-comments source, but I kind of want to check to see what that beast is doing.

horizontal rule
32

Wait, couldn't you split the posting and reading functions into two different scripts? Make mt-comments.cgi just be responsible for reading comments, thereby making sure all old links work, but have it just exit immediately on a POST request.

Then you can make an mt-post-comment.cgi, or whatever, that does all kinds of crazy CAPTCHA stuff, or whatever, and redirects to mt-comments.cgi when it's done. Eh?
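Roughly, the split could look like the sketch below. It is Python standing in for the real Perl scripts, and the `entry_id` parameter, the status codes and the function names are assumptions for illustration, not how MT actually does it.

```python
# Hypothetical names: the suggestion above only says "mt-post-comment.cgi, or whatever".
READ_SCRIPT = "mt-comments.cgi"


def read_script(method, entry_id):
    """Well-known URL: serves comment pages, refuses writes at once."""
    if method.upper() == "POST":
        return 405, {}, ""  # bail out before doing any real work
    return 200, {}, f"comments for entry {entry_id}"


def post_script(method, entry_id, captcha_ok, save_comment):
    """Renamed URL: the only place a comment can actually be saved."""
    if method.upper() != "POST":
        return 405, {}, ""
    if not captcha_ok:
        return 403, {}, ""
    save_comment(entry_id)
    # Send the browser back to the read-only script to view the thread.
    return 303, {"Location": f"{READ_SCRIPT}?entry_id={entry_id}"}, ""
```

Old links keep working because the read side never moves; only the comment form needs to point at the new name.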

horizontal rule
33

30 - Ben, I don't see myself having the time to do anything like that in the next month. I'd be pushing it just to find the time to do another data move.

horizontal rule
34

Also, this seems to be moving in the complete opposite direction of our (well, my) "I don't have time to do a lot of site maintenance so let's find an easy, low-maintenance hosting solution".

Before we bog down in the details of unixshell, I think we need to figure out (1) do we really need to change hosts and (2) if so, isn't there someone who offers more memory but not a bare-bones setup?

horizontal rule
35

32 is awesome. That is totally what you should do. Excellent idea, Mr. H.

horizontal rule
36

I am so rivetingly ignorant about all this stuff, but are there any huge-comment-volume blogs out there whose brains we could pick? I mean, we get a lot of comments, but there are plenty of blogs who get more -- do they all have these problems?

And other than that, I am totally in favor of throwing money rather than time and expertise at this.

horizontal rule
37

Changing hosts should be an option of last resort. 32MB is WAY too much memory for this process to consume. Our server guy considers an Apache thread over 16MB to be abnormal, and we're serving sites that are considerably heavier than Unfogged. It *must* be reindexing old comments, which is a totally stupid thing for it to do. There must be a way to stop it.

My suggestion would be to post a frustrated blog entry that expresses your infinite, cosmic disappointment with Mo/vab/le Ty/pe, the platform you love so well, and lament that you can no longer recommend it to the many well-heeled folks who come to you asking for blog software advice. Ja/y Al/len of Si/xA/part is pretty good about trolling technorati for MT mentions; odds of him swooping in and hooking you up with someone who can diagnose/retrofit mt-comments.cgi are pretty high, I'd say.

horizontal rule
38

might be worth noting that refreshing the comments takes a long time for me, too. maybe you should consider junking the dropdowns and seeing if entry archive-based comments perform better?

horizontal rule
39

Sorry, meant popups, not dropdowns (I'm working on some dropdowns atm)

horizontal rule
40

I know very little about this as well, but I have noticed that the recent comments sidebar looks different from different archive pages. So that the "recent" comments on say, a page from 2004 are actually comments from a few hours or more before whatever time it is you're looking at them. Does that have anything to do with the reindexing?

(Apologies if you've already noticed this, and if it's irrelevant.)

horizontal rule
41

To Tom's third sentence in 37, I say that the last thing the post() function of Comments.pm does before returning is this:
    MT::Util::start_background_task(sub {
        $app->rebuild_indexes( Blog => $blog )
            or return $app->errtrans("Rebuild failed: [_1]",
                                     $app->errstr);
        $app->_send_comment_notification($comment, $comment_link,
                                         $entry, $blog, $commenter);
        _expire_sessions($cfg->CommentSessionTimeout)
    });
It also rebuilds the entry synchronously.

But isn't that necessary for the comment to be reflected on the (static) index page (and also the archive pages, though their new comments sidebar always lags behind, for some reason)? If you think I'm going to examine what rebuild_indexes does, when I should be preparing for a presentation on Kant, well, you're wrong.

horizontal rule
42

People! Where is the love for 32? It would contribute mightily to the anti-spam battle, and it is way easy!

horizontal rule
43

I already thought of the idea proposed in 32, but didn't post about it. Consequently I have no LOVE for it, only (as is appropriate for the day) HATRED for you for stealing my THUNDARR!

horizontal rule
44

Alright... well, is "built w/ indexes" turned off for every index template that doesn't need to be updated whenever a new comment is posted? I assume it is, but want to be sure.

If the rebuild line can be isolated as the cause of the problem, it can probably be written around. It shouldn't be too hard to write some CGI updating a file (to be included by PHP in the index template) to reflect the recent comments -- this would likely be a lot more efficient than whatever involved process MT goes through to rebuild its index templates. I recently switched to doing this for my own archive section, and I like it a lot better as a setup. It appears to me that MT can be made to run a lot more efficiently if you ditch its tag/template system in favor of PHP when appropriate.

horizontal rule
45

The Geens turing test is very effective, but I don't know if it will help w bandwidth. Most spambots go straight for mt-comments.cgi, bypassing the comments box.

I do think you should try out various easily implemented things like dropping popups or turing tests before you start thinking about stuff that involves a lot of work.

horizontal rule
46

I'm not sure I get why 32 is great, so I suspect I'm misunderstanding what it would do. How would the captcha interfere with reading old comments? And is that what 32 proposes to solve?

horizontal rule
47

Are you running your MT install on mod_perl? If not, would that help? Would it be possible?

horizontal rule
48

Since (apparently) comment-spammers look for scripts called "mt-comments.cgi", if we renamed our script to something else, like mt-fuckyou.cgi, then they wouldn't find it in as great volume. But then all the links to the comments would be broken. So the proposal was to keep the old script name, in read-only fashion.

horizontal rule
49

The more I think about it, the stupider reevaluating the main page template every time there's a new comment seems.

horizontal rule
50

32 is awesome because TMK said so, and TMK is awesome. (Also, Ben: I will be retaining custody of your thunder until such time as you admit that you're just frontin'.)

(I should also note that I spent a few minutes paralyzed by the surplus of thunder/thundar/thundercats jokes I could have made.)

My idea was this: if the problem is that spammers are overloading mt-comments.cgi by posting spam comments, then we can stop that by renaming mt-comments.cgi. However, that breaks links to old comments. So my suggestion was to separate the two functions of mt-comments (posting and viewing) into two separate scripts, one of which is well-known and useless to spammers, and one of which is cleverly named and may or may not use CAPTCHAs, as the bloggers decide.

horizontal rule
51

The following templates are all rebuilt with the indexes:
Atom feed (atom.xml)
Bridgeplate feed (bridgeplate.rdf)
Dynamic Site Bootstrapper (mtview.php) [can this be unchecked?]
Full Post w/comments (comments.xml)
Main Index (index.html)
Master Archives (archives.html)
Mobile (mobile.html)
RSD (rsd.xml)
RSS 1.0 (index.rdf)
RSS 2.0 (index.xml)

horizontal rule
52

As a test, you could try turning off the re-indexing and just temporarily remove the latest comments list on the front page. If that speeds everything up, the diagnosis would be confirmed.

horizontal rule
53

If most of the time on a spam request to mt-comments is taken up by updating the page indices, then that's the problem. There's no need to update page indices if a comment was blacklisted. Otherwise I don't see how avoiding the reindexing would help that much.
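To spell out the ordering I mean, here is a small sketch in Python. The function names are placeholders passed in as arguments; whether MT already checks the blacklist before rebuilding is something I have not verified.

```python
def handle_comment(comment, is_blacklisted, save, rebuild_indexes):
    """Save and rebuild only for comments that survive the blacklist check."""
    if is_blacklisted(comment):
        return False  # rejected: nothing changed, so nothing to rebuild
    save(comment)
    rebuild_indexes()  # the expensive step, reached only by accepted comments
    return True
```

If spam is most of the traffic, this keeps the expensive step off the common path; if it is mostly us, it changes little.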

horizontal rule
54

The recent comments sidebar on the archive templates has always been hosed. It reflected what comments were recent when the archive was created or some such.

horizontal rule
55

I also like Tom's suggestions in 38/39.

horizontal rule
56

I'm of course totally in the dark on this, but why are people thinking it's a spam problem at all? Obviously spam is bad independently of page slowdowns and errors, but why would it be the cause of those?

horizontal rule
57

I agree with w/d. It could simply be caused by our own commenting practices.

horizontal rule
58

btw, the title is supposed to be a reference to the title of Graham's initial essay, "A Plan for Spam". Except I had two plans, see?

horizontal rule
59

I had gotten the impression somewhere that the spambots were causing most of the requests to mt-comments, and that the resources consumed were about the same for each request, spam or not.

horizontal rule
60

Is it just me or does the site seem kinda snappier now?

horizontal rule
61

Testing

horizontal rule
62

Not fast but a wee bit quicker? I unchecked four files from "build with indexes". (There used to be even more being built than I had listed in 51.)

horizontal rule
63

Are there any more in 51 people think I can uncheck?

horizontal rule
64

Now, I had this really good idea the other day. You could use an XmlHttpRequest in the javascript in the comments page to a page like this:

unfogged.com/new-comments.php?postid=xxx&lastcommentid=yyy

that would return all comments after comment id yyy in a thread and just be added dynamically to the page. That would probably reduce comments-refreshing bandwidth (and probably server cpu) by 95+ percent.
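The server side of that would be tiny. Here is a sketch in Python instead of PHP, with made-up comment data, and assuming comment ids only ever increase within a thread:

```python
import json


def comments_after(thread, last_comment_id):
    """Return only the comments newer than the one the browser already has."""
    return [c for c in thread if c["id"] > last_comment_id]


# Made-up sample data standing in for one thread's comments.
thread = [
    {"id": 101, "author": "a", "text": "first"},
    {"id": 102, "author": "b", "text": "second"},
    {"id": 103, "author": "c", "text": "third"},
]

# What the page would receive when it already shows everything up to 101.
payload = json.dumps(comments_after(thread, 101))
```

A reader who is already caught up gets back an empty list rather than the whole thread again.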

But I'm not sure if it would help with unfogged's problem, since I'm not sure what's taking up all the time here.

horizontal rule
65

You could do something like that to replace the recent-comments sidebar, too, as long as you had an efficient way of getting those.

Except that wouldn't help with updating the comment-count reflected for each post.

horizontal rule
66

Do we really need all of our RSS feeds? I think we need a post-only feed, but we have three of those right now: Atom, RSS 1.0, and RSS 2.0. I think the Bridgeplate feed (comments only) and post+comments feeds are good options, too, but do we really need 3 flavors of post-only?

Also, is Master Archives even used?

horizontal rule
67

There's one good way to find out!

horizontal rule
68

Testing

horizontal rule
69

Also, is Master Archives even used?

I'd be fairly surprised if it wasn't.

Do you rebuild your indexes that often anyway?

horizontal rule
70

Now only the following are rebuilt with indexes:
Atom feed (atom.xml)
Bridgeplate feed (bridgeplate.rdf)
Dynamic Site Bootstrapper (mtview.php)
Full Post w/comments (comments.xml)
Main Index (index.html)
Master Archives (archives.html)
Mobile (mobile.html)
RSD (rsd.xml)
RSS 1.0 (index.rdf)
RSS 2.0 (index.xml)

horizontal rule
71

Do you rebuild your indexes that often anyway?

Apparently, yes. See 41.

horizontal rule
72

Well, things are feeling snappier now for me.

horizontal rule
73

Testing.

horizontal rule
74

I just got an internal server error.

horizontal rule
75

And another.

horizontal rule
76

Oh!

Nothing but index.php needs to be checked, I don't think.

Wouldn't it be simpler to change comments.pm, though?

horizontal rule
77

Index.html, that is.

horizontal rule
78

testing

horizontal rule
79

Ah, ok, I get 32 now. Makes sense. I still think the real problem is the site's size, not the spam, but pursuing anti-spam measures is certainly a good idea.

The idea I proposed of turning off rebuilding on commenting was a stupid one, I now realize -- it'd break a bunch of other things. But perhaps the rebuild script could be taught to ignore all comments prior to a particular date or ID -- that might speed up the rebuild process.

The XmlHttpRequest thing is a pleasantly geeky idea, but would actually result in a lot more load on the site. The merit of the rebuilding system is that these calculations only have to be done once, then are cached on disk.

horizontal rule
80

I was thinking that the new comments info could be written to disk in a file that acted like a ring buffer (uh, somehow—look at my hands wave!) and that would be how the comment sidebar was replaced, one way or another. Though, of course, then the comment-count would never get updated, etc.
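With slightly less hand-waving, something like the following. It is a Python sketch; the JSON file format, the function name and the limit of 10 are all assumptions, and it ignores locking between simultaneous posts.

```python
import json
from collections import deque
from pathlib import Path


def record_recent_comment(path, comment, limit=10):
    """Append a comment to a small file that keeps only the newest `limit`."""
    p = Path(path)
    existing = json.loads(p.read_text()) if p.exists() else []
    recent = deque(existing, maxlen=limit)  # oldest entries fall off the front
    recent.append(comment)
    p.write_text(json.dumps(list(recent)))
    return list(recent)
```

The sidebar would then be generated from that one small file instead of from a rebuild of the whole main page.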

horizontal rule
81

Testing.

horizontal rule
82

(I just want to feel helpful)

horizontal rule
83

Armshasher gets the Spirit Award.

horizontal rule
84

And they spelled your name wrong on the trophy. Those bastards.

horizontal rule
85

Testing.

horizontal rule
86

Testing

horizontal rule
87

Testing

horizontal rule
88

More testing

horizontal rule
89

And another. I know, this is entertaining stuff.

horizontal rule
90

I just tried turning off pop-ups and it didn't seem to have an effect.

horizontal rule
91

Uhh...so is this all a function of MT rebuilding pages every time someone comments? And isn't that why Henley, among others, moved to Word Press?

NB: I wouldn't swear I know what any of the words in the above mean.

horizontal rule
92

Further evidence that the memory wall is hit when rebuilding indices is that the comment is actually posted, but no email is sent and the sidebar doesn't get rebuilt right away. However, the entry does get rebuilt. That leaves rebuilding the indices as the only candidate.

horizontal rule
93

91: wordpress actually has similar problems of its own -- instead of running periodic tasks on a cron, it has an event loop that fires on a percentage of all requests. jumps in traffic can result in much more load than is actually necessary. I'm no WP expert, but a coworker has been having big trouble with his site as a result (and similar problems finding an ISP that will tolerate the load he introduces to their system).

horizontal rule
94

test

horizontal rule
95

test

horizontal rule
96

final test

horizontal rule
97

Yeah, WP isn't perfect, but it has some decent caching options.

horizontal rule
98

testing

horizontal rule
99

Testy today, are we?

horizontal rule
100

One hundred! Test! More testing!

horizontal rule
101

Wow, it is awesome and meta to get comments spam on a thread of this title.

horizontal rule
102

You know, I just had an idea, which may be MADNESS, but I thought I would mention it.

In order to update the "Recent comments" sidebar the site has to rebuild the main page. But it is possible to make the sidebar not an integral part of the main page, but an extra blog that publishes into a file that is then included into the main page. That's how my sidebar works. (Like this.)

Would it be any use to shunt the sidebar and "Recent Comments" into an extra blog like that, so that the comments process would involve making extra entries in that blog rather than doing stuff to the sidebar? Thinking it over, I suspect not, because after making entries to the new blog you have to rebuild the main page anyway, but I thought I'd mention it.

Another thought: Would it help in any way to drop the "Recent Comments" from archive pages? Those are always messed up anyway.

I realize that Becks is celebrating transferring the reading group archives, so feel free to put this in a little envelope marked "do not open till Xmas," or to ignore it entirely.

horizontal rule