Ask HN: 2 karma minimum to lock out spam bots

ComputerGuru · on April 24, 2009

That would encourage them to try to generate spam comments in the slim hope that they'd get upvoted.... which would lead to a much worse spam situation.

For instance, they'd employ Markov Chains to re-hash comments in a post, or from older posts, or from the text of the article.

One thing I've learned: spammers don't give up. If you give them an easy way to submit that's even easier for you to clean up, that's better than starting an all-out war that'll just increase the spam content and make it more difficult to filter out.

jasonkester · on April 24, 2009

Very true.

The best you can do is allow the post to go thru and make it appear to the spammer that they have succeeded. Let them view their post on the homepage and go away with a nice warm feeling of satisfaction, secure in the knowledge that they don't need to improve their algorithm at all for this site.

It's pretty straightforward to modify your display mechanism to include spam posts if and only if they originate from the same IP address as the viewing client.

I implemented this on Blogabond, and it had two positive effects: My Spam corpus is growing at a faster rate (thus making it more effective), and the sophistication level of the average attack has dropped sharply.
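
A minimal sketch of the display rule described above, assuming each stored post records the submitter's IP and a spam flag. The `Post` type, its field names, and `visible_posts` are hypothetical and are not Blogabond's actual code.

```python
from dataclasses import dataclass


@dataclass
class Post:
    title: str
    author_ip: str   # hypothetical field: IP the post was submitted from
    is_spam: bool    # hypothetical field: set by whatever classifier is in use


def visible_posts(posts, viewer_ip):
    """Show every clean post, and show a spam post only to the IP that sent it."""
    return [p for p in posts if not p.is_spam or p.author_ip == viewer_ip]


posts = [
    Post("Interesting article", "192.0.2.1", False),
    Post("Cheap pills", "192.0.2.66", True),
]

# The spammer still sees their own post; everyone else does not.
spammer_view = [p.title for p in visible_posts(posts, "192.0.2.66")]
reader_view = [p.title for p in visible_posts(posts, "192.0.2.2")]
```

Matching on IP alone is the simplest form of the idea; a spammer who checks from a different address would see the post missing.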

philh · on April 24, 2009

The problem is detecting the spam posts in the first place. Your idea would be incompatible with the proposed method, because you don't want to do the same thing to legitimate new users.

jasonkester · on April 24, 2009

Certainly, HN must have a Bayesian classifier looking at the content of its submissions and deciding whether they are "Hacker Newsworthy". By now, the site must have sizeable corpora of good articles and spam links to use in such an assessment.

Since this site was built by Mr Bayesian himself, it never occurred to me that it might have weak spam filtering.
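
For readers unfamiliar with the idea, here is a rough sketch of a naive Bayes score over two small corpora. It is not HN's actual filter (the thread does not say what HN runs); the function names, the toy word lists, add-one smoothing, and equal priors are all assumptions made for illustration.

```python
import math
from collections import Counter


def train(docs):
    """Count word occurrences over a list of token lists."""
    counts = Counter()
    for tokens in docs:
        counts.update(tokens)
    return counts


def spam_log_odds(tokens, spam_counts, ham_counts):
    """Log-odds that the tokens are spam (add-one smoothing, equal priors).

    Positive means "looks more like the spam corpus", negative the opposite.
    """
    vocab = set(spam_counts) | set(ham_counts)
    spam_total = sum(spam_counts.values()) + len(vocab)
    ham_total = sum(ham_counts.values()) + len(vocab)
    score = 0.0
    for t in tokens:
        p_spam = (spam_counts[t] + 1) / spam_total
        p_ham = (ham_counts[t] + 1) / ham_total
        score += math.log(p_spam / p_ham)
    return score


# Toy corpora, invented for this example only.
spam_counts = train([["cheap", "pills", "online"], ["cheap", "watches"]])
ham_counts = train([["lisp", "compiler", "startup"], ["erlang", "startup"]])

spam_score = spam_log_odds(["cheap", "pills"], spam_counts, ham_counts)
ham_score = spam_log_odds(["lisp", "startup"], spam_counts, ham_counts)
```

A larger corpus on either side sharpens the word probabilities, which is why a growing spam corpus makes this kind of filter more effective.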

dagobart · on May 7, 2009

> The best you can do is allow the post to go thru and make it appear to the spammer that they have succeeded.

I read the cons of this suggestion, but what about turning the approach around: keep the spam in by default, and filter it out only for folks with higher karma? That way, the successfully submitted spam could be verified not only from the original source but also from every other node of, say, a botnet.

febeling · on April 24, 2009

Spam bots generating comments that get voted up? Reaching the right level of sophistication, these spam bots would suddenly turn into quite acceptable members of the community, even if they are completely automated. If readers in general find these comments worth up-voting, then it would not matter much what the underlying intent was, because it is not harmful any longer.

Rexxar · on April 24, 2009

A spam bot can find a link that is published at the same time on reddit (for example) and HN, take some comments with lots of upvotes and submit them to HN. It will probably get some upvotes.

ubernostrum · on April 25, 2009

There are actually already spambots that do this on reddit -- they find a highly-voted comment in a reddit thread, follow the link to the article and repost the comment there (filling in the URL they want if the site in question allows it). Unless you look carefully at what people claim as "their site", or happen to read reddit as well, they can be hard to catch.

pmjordan · on April 24, 2009

Isn't there an easier way to reach the 2 karma threshold? Like, creating 2 accounts, the second's only purpose being upvoting some throwaway comment the first one creates and subsequently deletes?

jauco · on April 24, 2009

In the proposed scheme you can't upvote unless someone upvoted you first. So you can't create a second account to upvote the first, because that second account needs its own upvote before it's usable.
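
A small sketch of the rule as described in this comment, under two assumptions not stated in the thread: new accounts start at 1 karma, and the same 2-karma threshold gates voting. `Account` and `upvote` are hypothetical names.

```python
KARMA_THRESHOLD = 2  # the "2 karma minimum" from the thread title


class Account:
    def __init__(self, name, karma=1):
        # Assumption: a freshly created account starts with 1 karma.
        self.name = name
        self.karma = karma


def upvote(voter, target):
    """Count the vote only if the voter has reached the threshold.

    Returns True if the vote was applied, False if it was ignored.
    """
    if voter.karma < KARMA_THRESHOLD:
        return False
    target.karma += 1
    return True


regular = Account("regular", karma=50)
sock_a = Account("sock_a")
sock_b = Account("sock_b")

blocked = upvote(sock_b, sock_a)    # a fresh second account cannot vote
accepted = upvote(regular, sock_a)  # an established member can
```

Two fresh accounts cannot bootstrap each other, but once one of them has earned a vote from an established member it can vote for the other.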

iamwil · on April 24, 2009

A human can use that account first to get beyond two votes, and then subsequently pass that off to a spam bot

jauco · on April 24, 2009

Yes, but that would mean that he took the effort and used his skill to present a truly interesting comment. This isn't particularly hard, but I think it's harder than what a bot or a hired non-English-speaking captcha breaker can manage.

We don't have to make an unbreakable system, just one that makes us less profitable than the rest of the internet.

yummyfajitas · on April 24, 2009

AFAIK, most captcha breakers aren't hired. They are simply bribed with porn and similar things:

"To see naked pictures of Marissa Mayer, please type the words below into the box."

dagobart · on May 7, 2009

Honestly, Marissa Mayer would appeal only to geeks & nerds. And those might be more interested in what she can do than in how she looks.

mixmax · on April 24, 2009

Would a spammer really go through that just to submit a spam link that probably wouldn't get beyond the new page anyway? It's quite a bit of work, even if you have the same framework that you can use across multiple sites. My experience is that a post that doesn't get upvoted will receive around 30 hits.

Surely there are easier ways of getting 30 hits.

mooism2 · on April 24, 2009

They only have to write the code once. Then they can get 30 hits to a different site every day.

mixmax · on April 24, 2009

I don't think it's feasible to make "write once run anywhere" code for this. Input fields will be named differently, rules will vary and spam filters will trigger on different things. Even if you only have to spend an hour grooming your script for HN it seems like a lot of work to get 30 hits. You can of course reuse the script on the site once you've made it, but it won't take much before PG figures it out and bans IP-addresses, or some other measure.

lallysingh · on April 24, 2009

Heuristics aren't terribly hard. Scan for an input text box of a minimum certain size. Then try each submit button independently to see which one gets your sample text on the page.

CalmQuiet · on April 24, 2009

Thank you: I needed to hear that. Part of my brain likes to keep scheming about "more absolute" ways to keep them out (and thus avoid any clean up).

Your points suggest that my utopia-philic thinking is probably less useful than accepting that some clean up is just a necessary overhead (cf: "Eternal vigilance is the price of liberty."). Thank you.

Dilpil · on April 24, 2009

Combating Markov Chain generated spam sounds like fun.

trickjarrett · on April 24, 2009

I suggest we just add a recaptcha for people posting links below a certain karma threshold.

chanux · on April 24, 2009

I too was thinking of a captcha to block spam on HN. But then again I thought it'd be a pain for the users.

+1 for captcha for users under certain karma level. +1 for 2 karma to be eligible to post urls.

frossie · on April 24, 2009

Also, since the OP said the spam was from accounts just created, you could add a third threshold (time since account creation before you can submit a story).

By the way, there is no reason to pick these thresholds out of thin air. Presumably one can look at, say, the karma distribution of link submitters and find the highest threshold that would inconvenience the fewest people. For example, I would be truly surprised if 95% of stories were not submitted by people with a karma of 5 or more.
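
One way to read the suggestion above, sketched under assumptions: given the karma of the account behind each past submission, pick the largest threshold that still lets a chosen fraction of submissions through. The function name, the `keep_fraction` parameter, and the sample numbers are made up for illustration; no real HN data is used.

```python
def highest_threshold(submitter_karmas, keep_fraction=0.95):
    """Largest karma threshold t such that at least keep_fraction of
    submissions came from accounts with karma >= t.

    Expects a non-empty list with one karma value per submission.
    """
    n = len(submitter_karmas)
    best = min(submitter_karmas)  # this threshold excludes nobody
    for t in sorted(set(submitter_karmas)):
        kept = sum(1 for k in submitter_karmas if k >= t)
        if kept / n >= keep_fraction:
            best = t
    return best


# Invented sample: karma of the submitter for ten past stories.
sample = [1, 1, 5, 8, 12, 20, 35, 60, 150, 400]

strict = highest_threshold(sample, keep_fraction=0.8)    # tolerate losing 20%
lenient = highest_threshold(sample, keep_fraction=0.95)  # tolerate losing 5%
```

The same calculation works for account age by passing ages instead of karma values.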

froo · on April 24, 2009

Instead of a captcha, what about something that is like a puzzle to complete, that way you can also avoid low quality submissions which might come from people not intelligent enough to solve said puzzles and would also be a delightful little game for the rest of us who like to solve puzzles/problems?

bendotc · on April 24, 2009

First, I don't think I would be delighted by an automated puzzle on HN.

Second, the idea that puzzle-solving-ability is somehow closely related to interesting/insightful writing is specious. People are smart in different ways, and as a programmer myself, the kind of smart that I want most is the kind I don't have, which is often not the problem-solving kind.

All that having been said, I'm not against a captcha. I just don't think it'd be delightful and I don't believe that it'd somehow raise the quality of posts around here (beyond removing some spam).

trickjarrett · on April 24, 2009

I am not against a captcha-level puzzle ("which box is blue", or "arrange these numbers from least to greatest"). But doing so to limit "less intelligent" members amounts to censorship and I cannot support it. The goal is to stop spammers mechanically, and allow normal people to post, whether their post is of substance or not.

DarkShikari · on April 24, 2009

I saw an interesting method used by spambots on the Doom9 forum: they would take old threads and repost them (with a link to their ad in their signature or such).

Some variant of this method might work to circumvent such a policy; find a similar thread, pick a comment from it, and post it, perhaps? It might only work 20% of the time, but that's good enough to get some spam accounts with URLs approved.

Of course this would all take effort, and might be enough to lock out a lot of would-be spammers, or at least convince them to go somewhere else.

Tichy · on April 24, 2009

Reminds me of what I see on my WordPress blog: lots of 0-content comments like "this is a cool article". At first they didn't make sense (blocked them anyway), but then I remembered the WordPress settings for comments. By default, it is set to "commentors must have been approved at least once". So I guess if I let one of those innocent comments through, they would be back with more serious stuff.

Spammers are crafty...

ashleyw · on April 24, 2009

Good idea, though I'd argue 2 is too low. If these aren't "bots" but real humans, it wouldn't be hard for them to adapt by commenting and then upvoting each other.

10+ would be nicer, anyone truly interested in HN would understand why! :)

talison · on April 24, 2009

It's a good point that some spam "bots" could be human. I was running a free email site when we saw a lot of strange account creation. We had recaptcha enabled and knew it hadn't been cracked.

It turned out (based on IP address analysis) that the accounts were being created by humans in the Philippines and then handed over to spammers in Dubai. Ah globalization...

If you have an efficient spam vector, it's not unusual to see low-wage humans manipulating the system to get around captcha.

jasonkester · on April 24, 2009

This is a lot bigger than you'd expect. Nearly all the spam that makes it into the database on my sites is human-powered. It's maybe only 1% of the total attack volume, but because simple checks knock out nearly all the noise, it becomes the most significant fraction that you have to deal with.

ErrantX · on April 24, 2009

limit it to 1 account creation per IP per hour :)

Yeh they can use a ton of proxies to get round that but I bet it cuts the account creation right down.

And it shouldn't affect 99.999999% of "real" users.
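
A minimal in-memory sketch of the one-signup-per-IP-per-hour rule. `SignupLimiter` is a hypothetical name, timestamps are plain seconds passed in by the caller, and a real site would need persistent storage and some handling for shared addresses.

```python
class SignupLimiter:
    """Allow at most one account creation per IP per time window."""

    def __init__(self, window_seconds=3600):
        self.window = window_seconds
        self.last_signup = {}  # ip -> time of the last accepted signup

    def allow(self, ip, now):
        """Return True and record the signup if this IP is outside its window."""
        last = self.last_signup.get(ip)
        if last is not None and now - last < self.window:
            return False
        self.last_signup[ip] = now
        return True


limiter = SignupLimiter()
results = [
    limiter.allow("192.0.2.1", now=0),     # first signup from this IP
    limiter.allow("192.0.2.1", now=600),   # ten minutes later, same IP
    limiter.allow("192.0.2.9", now=600),   # different IP, unaffected
    limiter.allow("192.0.2.1", now=3600),  # a full hour after the first
]
```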

jasonkester · on April 24, 2009

Ah, but there's the rub. Extrapolating from my comment above, your number one job is to make spammers feel successful when they fail. If you reject new accounts like this, you'll force them to write those little proxies to get around your system.

The better thing to do is to simply notice what they're doing and flip the IsSpammer bit on all those new accounts (including the first one.) That way you can correctly classify any content they may post from those accounts in the future.
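
A sketch of the accept-and-flag alternative, assuming the trigger is more than one signup from the same IP within an hour. The trigger, the `SilentFlagger` class, and the `is_spammer` mapping are illustrative guesses; the comment above only says to notice the behaviour and flip the bit on all the accounts involved.

```python
from collections import defaultdict


class SilentFlagger:
    """Accept every signup, but quietly flag bursts from a single IP."""

    def __init__(self, window_seconds=3600):
        self.window = window_seconds
        self.signups = defaultdict(list)  # ip -> [(time, account name)]
        self.is_spammer = {}              # account name -> flag

    def register(self, ip, account, now):
        """Record the signup and always report success to the client."""
        self.is_spammer.setdefault(account, False)
        self.signups[ip].append((now, account))
        recent = [a for t, a in self.signups[ip] if now - t < self.window]
        if len(recent) > 1:
            # Flag every account in the burst, including the first one.
            for a in recent:
                self.is_spammer[a] = True
        return True


flagger = SilentFlagger()
flagger.register("192.0.2.1", "a", now=0)
flagger.register("192.0.2.1", "b", now=60)
flagger.register("192.0.2.9", "c", now=60)
flags = dict(flagger.is_spammer)
```

Because `register` never fails, the person creating the accounts gets no signal that anything was noticed.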

ErrantX · on April 24, 2009

There's the age old argument of which system to go for: Passive (my suggestion) or Active (yours).

Probably both have merits but I am inclined to agree yours is the better way :D

chanux · on April 24, 2009

Well.. I don't really agree with you. 10+ would be a bit painful I guess. Let's first keep the bots out. Then let's find a way to keep lamers away. There's always "the erlang week" to scare away lamers anyway.

jauco · on April 24, 2009

Like I said above they can't upvote each other. Because they need the initial upvote from someone who is already "in the system" (ie. at least 2 karma) first.

So at least one of them has to say something that the original community finds useful. And if he/she does manage to upvote his friends, those upvotes will generally be cancelled out by the community (not because they are fighting spam, but because they don't agree with the upvote)

slater · on April 24, 2009

How about we keep with the intellectual stuff around here, and make 'em answer math captchas. Using Roman numerals.

eg, what's LV + IV? Answer has to be given in Roman numerals, too.
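
For concreteness, a sketch of what checking such an answer involves, using standard subtractive notation. The function names are hypothetical. The same few lines would serve a bot just as well as the site.

```python
ROMAN = [
    (1000, "M"), (900, "CM"), (500, "D"), (400, "CD"),
    (100, "C"), (90, "XC"), (50, "L"), (40, "XL"),
    (10, "X"), (9, "IX"), (5, "V"), (4, "IV"), (1, "I"),
]
VALUES = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}


def to_roman(n):
    """Convert a positive integer to standard (subtractive) notation."""
    out = []
    for value, symbol in ROMAN:
        while n >= value:
            out.append(symbol)
            n -= value
    return "".join(out)


def from_roman(s):
    """Convert a well-formed numeral string to an integer."""
    total = 0
    for i, ch in enumerate(s):
        v = VALUES[ch]
        # A smaller symbol directly before a larger one is subtracted.
        if i + 1 < len(s) and VALUES[s[i + 1]] > v:
            total -= v
        else:
            total += v
    return total


answer = to_roman(from_roman("LV") + from_roman("IV"))  # 55 + 4 = 59
```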

jrnkntl · on April 24, 2009

Some basic Erlang questions are fine with me ;)

dhimes · on April 24, 2009

We'd have to agree on how to write II + II

We could ask for a basic derivative (for example, of a polynomial), or the value of 'x' in a simple algebraic equation x+5 = 2x + 10

cperciva · on April 24, 2009

Problems like that are far too easy for computer programs to solve. Clearly the right solution here is to present new users with a set of Turing machines and ask them which of the specified machines halt.

This would probably be very effective at limiting the growth rate of HN, too.

dhimes · on April 24, 2009

I've never written a bot. I thought perhaps that the process of parsing instructions (we could put more than one variable in the equation), then writing code to solve, might be more effort than it's worth to them. Especially if we're doing something that is somewhat unique so their reward would only be one page. However, phildawes is apparently implying that once the bots have parsed the problem and know what to solve for they could go to a page that implements Mathematica and submit the problem to be solved, therefore saving the time required to write problem-solving code. I didn't realize they worked like that.

rythie · on April 24, 2009

Agreed, computers are good at solving math problems or running Turing machines (since they are them). You would be better off with a Turing test (http://en.wikipedia.org/wiki/Turing_test) that only a human can answer - and if anyone does get a computer to solve it, you get them to start a company, because they have solved a fundamental computing problem.

cperciva · on April 24, 2009

Computers are good at simulating Turing machines. Computers aren't good at determining whether Turing machines halt; in fact, it is impossible for a computer to determine in the general case whether a Turing machine will halt.

rythie · on April 24, 2009

Fair enough, though it might have the effect of not letting humans in either ;-)

phildawes · on April 24, 2009

Cool - a challenge bots can't hope to beat. Oh wait... http://72.3.253.76/webMathematica3/quickmath/page.jsp?s1=equ...

sketerpot · on April 24, 2009

The point is to try to get the spammers to go someplace that's less work to spam.

rythie · on April 25, 2009

There is an article on slate.com about CAPTCHAs http://www.slate.com/id/2216837/ (submitted here: http://news.ycombinator.com/item?id=579051 )

seejay · on April 24, 2009

Hope HN won't introduce anything similar to the burying system on digg, which the powerful users on the site use to their advantage.