Why is blog spam always written so badly?
The spammers are automatically generating new comments by taking existing comments and running them through a thesaurus program that replaces words with synonyms or related parts of speech. The result is a sentence which makes sense, but has word choices that no native speaker would ever make:
Where else may I am getting ...
is clearly not something a native speaker would write, but
Where else could she be getting...
is, and can be transformed by a simple substitution of pronouns and synonyms into the spam text.
This way, even if anti-spam forces have a huge database of known-spam comments, the spammers can generate infinitely many new ones that are plausibly English.
I long suspected this was the case but I recently got proof. I now occasionally get comment spam containing the entire substitution script; it'll be something like:
I can't [believe/understand/comprehend] the [great/superior/amazing] [content/information/data]...
Since the spammers were likely non-English speakers to begin with, they didn't notice they were sending the script rather than the output.
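To make the mechanism concrete, here is a minimal sketch of how a script like that could be expanded. It assumes the bracket-and-slash format shown above is the whole syntax (no nesting, no escapes); the function names `spin` and `count_variants` are my own, not anything taken from a real spam tool.

```python
import random
import re

# Matches one bracketed slot such as "[great/superior/amazing]".
SLOT = re.compile(r"\[([^\[\]]+)\]")

def spin(template, rng=random):
    """Replace every [a/b/c] slot with one randomly chosen alternative."""
    return SLOT.sub(lambda m: rng.choice(m.group(1).split("/")), template)

def count_variants(template):
    """Number of possible outputs, assuming the alternatives in a slot differ."""
    total = 1
    for m in SLOT.finditer(template):
        total *= len(m.group(1).split("/"))
    return total

template = ("I can't [believe/understand/comprehend] the "
            "[great/superior/amazing] [content/information/data]...")
print(spin(template))            # one random variant
print(count_variants(template))  # 27
```

Three slots of three words each already give 27 distinct comments; a longer script multiplies out very quickly, which is the whole point.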
If you examine a large enough corpus of spam, you can pretty easily figure out what algorithms they're using. It would be an interesting challenge in reverse engineering to write a program that deduces the algorithms used from the corpus.
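As a toy version of that reverse-engineering challenge, the sketch below recovers a substitution script from a handful of outputs. It leans on strong assumptions: every comment came from the same script, and each slot expands to exactly one word, so the comments line up word for word. The sample comments are made up from the script fragment quoted earlier, and `deduce_template` is a hypothetical name.

```python
def deduce_template(comments):
    """Align comments word by word and collect the variants seen at each position."""
    rows = [c.split() for c in comments]
    if len({len(r) for r in rows}) != 1:
        raise ValueError("comments do not align word for word")
    parts = []
    for column in zip(*rows):
        variants = sorted(set(column))
        if len(variants) == 1:
            parts.append(variants[0])                    # fixed word
        else:
            parts.append("[" + "/".join(variants) + "]")  # substitution slot
    return " ".join(parts)

samples = [
    "I can't believe the great content",
    "I can't understand the amazing data",
    "I can't comprehend the superior content",
]
print(deduce_template(samples))
# I can't [believe/comprehend/understand] the [amazing/great/superior] [content/data]
```

Real spam would need fuzzier alignment (multi-word phrases, pronoun swaps, dropped words), so treat this as the easy case only.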
I ask because when I first saw it, I thought perhaps they were being genuine but inarticulate.
They fooled you once. It probably won't happen again!
Commenter TildalWave points out:
none of the sample spam messages OP posted actually endorse any products, or are otherwise promoting any other cause.
Well let me give you an example: here's a comment that arrived a few minutes ago on my blog:
user name: cuisinart compact toaster review
user url: toasterovenpicks.com
user email: [email protected]
user IP: 37.59.34.218
Comment contents:
One in particular clue for that bride and groom essential their
own absolutely new everything, actually a surname burned which has a mode,
which render nearly girl thankful recognizing their refreshing surname
therefore distinctively printed.
The product is promoted in the user's metadata, not in the content of the comment. The content is just an attempt to get past the spam filter. (I suspect that in this case the text is not a mutation of an existing text but rather generated by a Markov process over a corpus of documents about wedding planning.)
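For anyone unfamiliar with the idea, a Markov text generator can be sketched in a few lines. This is only an illustration of the general technique I suspect, not the spammer's actual code; the tiny corpus is invented, and `build_chain` and `generate` are hypothetical names.

```python
import random
from collections import defaultdict

def build_chain(text, order=2):
    """Map each run of `order` consecutive words to the words seen right after it."""
    words = text.split()
    chain = defaultdict(list)
    for i in range(len(words) - order):
        chain[tuple(words[i:i + order])].append(words[i + order])
    return chain

def generate(chain, length=30, rng=random):
    """Random-walk the chain, starting from a randomly chosen state."""
    state = rng.choice(list(chain.keys()))
    out = list(state)
    for _ in range(length - len(state)):
        followers = chain.get(state)
        if not followers:      # dead end: this state was never followed by anything
            break
        out.append(rng.choice(followers))
        state = tuple(out[-len(state):])
    return " ".join(out)

# Made-up stand-in for a corpus of wedding-planning documents.
corpus = ("the bride and groom chose a venue and the bride and groom "
          "sent invitations and the groom chose a cake")
print(generate(build_chain(corpus), length=12))
```

Every short run of words in the output occurs somewhere in the corpus, so the text looks locally plausible while meaning nothing overall, much like the comment above.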
Obviously anti-spam forces are on to this one too, which is why this was in my spam filter. My spam filter (Akismet) lets through, on average, one spam for every 705 submitted. Again, that's what spammers are going for; they know that about 99.9% of their work will never be seen by anyone. They're trying to randomly explore the space of false negatives in spam filters, a space which is getting quite small indeed.
The language may have a little to do with a sig like TildalWave was talking about.
A little harmless spamdexing.
I've been getting a few of the first example on my blog. While it looks harmless, they're actually spamdexing (a little bit of "black hat SEO") by trying to associate their user account (and, by extension, their website links) with the keywords in the blog (like Xander was saying, it's marketing). When someone clicks on the link, it counts as a positive hit from the blog. If a blog generates enough positive hits for a key search term, their link gets a +1 bump from the search engines in relevance for those keywords. Most of the search engines have caught on to this and try to prevent it with relevance matching in their formulas.
The downside is that if a user comes to your site for something off-topic because of this spam and leaves (bounces), the search engines will penalize your overall ranking (for lack of substance) as well as your ranking for the page with the off-topic content. While spamdexing doesn't have a lot to do with IT security (unless they use an infected site as their own URL), it does hurt the [social] performance of the site overall if enough spammers do this and knock your site down in the rankings.
In regard to the second example, it contains a hook for a two-post spam operation (commonly found in forums). The first poster will create an account and post a question that looks like a legitimate concern.
... Where else may I am getting that kind of information written in such an ideal means? ...
A short while later (within 20 minutes or so, up to even a couple of days) another poster (usually from the same country, if not the same IP range) will create a new account and post the answer, which contains a link relevant to the original poster's question. Since most board moderators won't delete what looks like a real discussion, their spam fools someone again... it's still spamdexing though. A better-crafted marketing-style example might be:
I found a great resource for [keywords here] at [http://www.example.com/]. You should take a look since they have a lot of information related to [more keywords]. It should help you out.
Another trick is a signature image that is a transparent GIF only 1 pixel by 1 pixel, wrapped in an <a> tag. This creates a link to some other website anywhere the poster has typed out their gibberish content. Just because you can't see it doesn't mean it's not there.
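If you want to look for that trick in stored posts or signatures, something along these lines could be a starting point. It is a rough sketch using Python's standard `html.parser`; it only catches images whose `width` and `height` attributes are literally "1" (it will miss sizes set through CSS), and the class and function names are my own.

```python
from html.parser import HTMLParser

class HiddenLinkFinder(HTMLParser):
    """Collect the href of every <a> tag that wraps a 1x1 <img>."""

    def __init__(self):
        super().__init__()
        self.open_links = []  # hrefs of the <a> tags currently open
        self.hidden = []      # hrefs found wrapping a 1x1 image

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a":
            self.open_links.append(attrs.get("href"))
        elif tag == "img" and self.open_links:
            if attrs.get("width") == "1" and attrs.get("height") == "1":
                self.hidden.append(self.open_links[-1])

    def handle_endtag(self, tag):
        if tag == "a" and self.open_links:
            self.open_links.pop()

def find_hidden_links(markup):
    finder = HiddenLinkFinder()
    finder.feed(markup)
    return finder.hidden

sig = '<a href="http://www.example.com/"><img src="pixel.gif" width="1" height="1"></a>'
print(find_hidden_links(sig))  # ['http://www.example.com/']
```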
Not so harmless Spam Threats impact Server Security
Some of the worst examples of spam will actually contain a link to an infected site, or they'll install a JavaScript keylogger. (I've seen the SVG hack used in signature lines to inject malicious script.) The keylogger is the one you'll need to watch out for, because it can capture the username and password of the blog/site admin or another user with elevated privileges when they log in on the same page to delete the spam (or of any user creating an account there). Best case scenario: if the compromised user has enough access to see other users, the attacker will download the list of users' e-mail addresses and send spam e-mail to a market-targeted list.
Innocent new users can have their credentials stolen, and since most people use the same passwords and the same e-mail address everywhere, their accounts elsewhere (Facebook, LinkedIn, etc.) can now be compromised.
Worst case scenario: because most CMS developers don't expect someone with "skillz" to get into the (trusted) backend via one of these methods, they're not doing things like checking all of the admin forms for XSS or SQL injection (I've caught a few of my developers cutting corners this way). From XSS to SQL injection, the damage then depends on the security of the box, the limitations on the user accounts (don't run Apache as root), and the read/write access. Since they would be in the CMS, you can assume that the attacker can likely write anything they want to the box. Delete the database, infect the site with a backdoor... now it's an IT security issue.
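For what it's worth, the two checks that get skipped are not much work. The sketch below shows the general idea with Python's built-in `sqlite3` and `html` modules standing in for whatever database and templating a real CMS uses; the table layout and function names are invented for the example.

```python
import html
import sqlite3

def save_comment(conn, author, body):
    # Placeholders pass the values separately, so input is never parsed as SQL.
    conn.execute("INSERT INTO comments (author, body) VALUES (?, ?)", (author, body))

def render_comments(conn):
    # Escape on output so stored markup is displayed as text, not executed.
    rows = conn.execute("SELECT author, body FROM comments ORDER BY id").fetchall()
    return "\n".join(
        f"<p><b>{html.escape(author)}</b>: {html.escape(body)}</p>"
        for author, body in rows
    )

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE comments (id INTEGER PRIMARY KEY, author TEXT, body TEXT)")
save_comment(conn, "x'); DROP TABLE comments; --", "<script>alert(1)</script>")
print(render_comments(conn))
```

The hostile author string ends up stored as ordinary data and the script tag comes back out as harmless escaped text. The same applies to admin-only forms: "trusted" users are exactly the accounts a keylogger goes after.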
The company I used to work for did "spinning", which, as one of the answers above mentioned, is programmatically doing thesaurus search-and-replace on the text. However, we would do it in multiple, complex layers.
- We actually employed real, American writers to write the original copy.
- Those original writers would mark up their own document using a special syntax that we created, marking words, word groupings, phrases, and entire sentences, including the synonyms that they felt were appropriate for each case. This meant synonyms for entire phrases that could be exchanged without changing meaning. They would do this in text-editing software we created that would provide them with auto-complete suggestions.
- Each time a writer would mark up their document, we would store all of their synonyms and phrases in a dictionary and use them to add suggestions to the writer for their next assignment.
- Hit GO on the machine, and spin out hundreds/thousands of variations.
- Divvy up blocks of variations among our SEO team in the Philippines, whose sole job was to find high-PR blogs, forums, and other websites too dumb to block us.
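To give a feel for the "multiple layers" part, here is a small sketch of spinning with nested, phrase-level alternatives. The `{a|b}` syntax is a stand-in I picked for illustration, not the markup described above, and `spin_all` is a hypothetical helper that enumerates every variation rather than picking one at random.

```python
import re

# Matches an innermost group: braces that contain no other braces.
INNERMOST = re.compile(r"\{([^{}]*)\}")

def spin_all(text):
    """Return the set of every variation of text marked up with nested {a|b} groups."""
    match = INNERMOST.search(text)
    if match is None:
        return {text}  # nothing left to expand
    results = set()
    for option in match.group(1).split("|"):
        # Substitute one alternative, then keep expanding whatever remains.
        results |= spin_all(text[:match.start()] + option + text[match.end():])
    return results

copy = "{We have|Here is} a {great|{truly|really} useful} {resource|guide}"
variants = spin_all(copy)
print(len(variants))  # 12
for line in sorted(variants)[:3]:
    print(line)
```

One short sentence with three marked-up spots already yields 12 variations; mark up a whole article this way and "hundreds/thousands" falls out naturally.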
Interestingly, we never automated the actual posting part, since that was the easiest thing for machines to spot. A real human was posting that trash.
Ah, the good old days of ruining the internet for everyone.