The Guardian Is Being Swamped with 'Dark Traffic' (businessinsider.com)
53 points by xvirk on Nov 4, 2014 | 67 comments


I built a personal extension which blocks all the traffic from the hosts on Dan Pollock's list [1], then blocks all the traffic from major service providers (Google Analytics, etc.) and social networks (fb, tw, google, etc.) when I'm not on their own websites. The Referer and User-Agent headers are removed; I haven't found the need to remove other headers. Currently working on preventing (and manually allowing) all xhr/script/image requests 2 seconds after the main frame has loaded.

The internet is a lot faster for me, and the battery appears to last longer. So far, I've only had problems logging into instagram, but once the cookie is set, I can re-enable blocking ads and tracking.

I guess I am one of those 'dark traffickers' :P

[1] http://someonewhocares.org/hosts/
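For anyone curious what the blocking side of such an extension might boil down to, here is a minimal sketch in Python rather than extension JavaScript. It assumes the list is in the usual hosts-file format (a sink address followed by one or more names); the function names `parse_hosts` and `strip_headers` are made up for illustration and are not taken from the extension described above.

```python
def parse_hosts(text):
    """Return the set of hostnames that a hosts-format text maps to a sink address."""
    sink_addresses = {"0.0.0.0", "127.0.0.1"}
    blocked = set()
    for raw_line in text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop trailing comments
        if not line:
            continue
        fields = line.split()
        if len(fields) >= 2 and fields[0] in sink_addresses:
            for name in fields[1:]:
                if name != "localhost":  # keep the loopback name usable
                    blocked.add(name.lower())
    return blocked


def strip_headers(headers, unwanted=("referer", "user-agent")):
    """Return a copy of a header dict without the unwanted headers (case-insensitive)."""
    return {k: v for k, v in headers.items() if k.lower() not in unwanted}
```

A request would then be dropped when its hostname is in the set returned by `parse_hosts`, and the remaining requests would go out with the headers returned by `strip_headers`.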


Looks like that list hasn't been updated since our domain switch. If you really want to block our internal analytics (which are in practice fairly harmless) replace "hits.guardian.co.uk" with "hits.theguardian.com".


Nominated for the most classy comment of the month. That the product manager of The Guardian would take the time out from his (no doubt) busy day to help a user to block tracking is an amazing display of trust in that the user knows best what is good for them. Thank you.


It's more like he knows that helping one single user like this isn't going to make a dent in the tracking they do on virtually all other users.

Now, if they voluntarily stopped tracking all or a significant portion of their users, I would be shocked.

Of course, that isn't going to happen.


Of course there is always a way to put a negative slant on just about anything.


I wonder why companies who host their own analytics choose to use a separate request to track users?

Surely it is better just to get the data from your own content web server logs? Wouldn't you get the same information while saving on the extra HTTP request? It would make the site loading times slightly faster as well.


Server logs will track all requests from crawlers, accelerators, and cancelled navigations. An AJAX callback will only happen for real human visitors that actually view the page.
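To illustrate the log-filtering problem, here is a rough sketch that counts only hits whose user agent does not look like a crawler. The marker list and the function names are assumptions for the example; real bots often do not identify themselves, which is part of why a script callback tends to be the more reliable signal.

```python
# Hypothetical substrings that self-declared crawlers tend to put in their user agent.
BOT_MARKERS = ("bot", "crawler", "spider")


def looks_like_bot(user_agent):
    """Treat an empty user agent, or one containing a known marker, as a bot."""
    ua = (user_agent or "").lower()
    return not ua or any(marker in ua for marker in BOT_MARKERS)


def count_human_hits(user_agents):
    """Count log entries whose user agent does not look like a bot."""
    return sum(1 for ua in user_agents if not looks_like_bot(ua))
```

This only catches crawlers that announce themselves; accelerators and cancelled navigations would still show up as ordinary browser hits in the log.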


I admire the fact that you offer this kind of info :-) thanks mate, always admired the Guardian for its articles, authors & general stance towards privacy.


Considering that they tell almost 20 other domains about your visit, I would not call that stance on privacy a strong one: 2o7.net, ajax.googleapis.com, chartbeat.com, chartbeat.net, dqwufkbc3sdtr.cloudfront.net, facebook-web-clients.appspot.com, google.com, googleadservices.com, googletagservices.com, imrworldwide.com, mathtag.com, ophan.co.uk, optimizely.com, outbrain.com, scorecardresearch.com, twitter.com, wunderloop.net, www.googleapis.com


nice


I really wonder how this 10K-line list is handled by the network subsystem. Is it compiled into a big regexp? Or another form of compacted runnable logic? Or at which point does it slow local name lookup?


At a guess, by doing a match first on a substring using a hashtable. That way if something is a 'candidate' you can hit a more expensive datastructure to figure out if you really have a hit without burning a lot of cycles.
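A sketch of that two-stage idea, purely as a guess at how it could look: a cheap hash-set check on the last two labels of the name rejects most lookups, and only candidates go on to the slower check of the full name and its parent domains. The parent-domain walk is an assumption added for the example (a plain hosts file matches exact names only), and `build_index` and `is_blocked` are hypothetical names.

```python
def build_index(blocked_hosts):
    """Build a cheap candidate set (last two labels) plus the full blocklist set."""
    blocked = {h.lower() for h in blocked_hosts}
    candidates = {".".join(h.split(".")[-2:]) for h in blocked}
    return candidates, blocked


def is_blocked(host, candidates, blocked):
    """Reject cheaply on the short key, then check the name and its parents."""
    host = host.lower().rstrip(".")
    labels = host.split(".")
    if ".".join(labels[-2:]) not in candidates:
        return False  # fast path: nothing on the list shares this suffix
    # Slower path: the host itself or any parent domain is on the list.
    return any(".".join(labels[i:]) in blocked for i in range(len(labels)))
```

With a list containing "hits.theguardian.com", a lookup for "news.ycombinator.com" stops at the first check, while "www.theguardian.com" passes the first check and is then cleared by the second.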


These kinds of long blacklists seem like such a good fit for tries. I should grep *nix kernels (I wish I had the Windows > XP source too), I guess.

Linux (well, glibc, from http://unix.stackexchange.com/questions/81979/how-does-etc-h...):

http://repo.or.cz/w/glibc.git/blob/HEAD:/nss/nss_files/files...

Windows:

#tbd
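As an illustration of the trie idea (not of what glibc or Windows actually do), here is a small sketch that stores hostnames label by label in reverse order, so names sharing a domain share a path. The class name `DomainTrie` and the nested-dict layout are choices made for the example.

```python
_END = object()  # sentinel key marking "a blocked name ends here"


class DomainTrie:
    """Trie keyed on reversed hostname labels: com -> example -> ads."""

    def __init__(self):
        self.root = {}

    def add(self, host):
        node = self.root
        for label in reversed(host.lower().split(".")):
            node = node.setdefault(label, {})
        node[_END] = True

    def contains(self, host):
        node = self.root
        for label in reversed(host.lower().split(".")):
            node = node.get(label)
            if node is None:
                return False  # no blocked name shares this suffix
        return _END in node
```

Lookup cost depends on the number of labels in the name rather than on the length of the list, which is the property that makes tries attractive for a 10K-line blacklist.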


Very interesting. Please do consider putting the extension up somewhere!


This list is updated more frequently and is what I use: http://winhelp2002.mvps.org/hosts.htm

There are a few things to take out to make certain services I like work (e.g. Hulu), so I usually try it in sections to see which services break and comment those out.


Sounds interesting - care to share?


Do you do anything about plugins sharing your fonts list? I found it to be the least known / most creepy way to identify a big number of users. Unfortunately I don't know of any extensions that would block that information.


NoScript will do that for you. To access the available fonts list you need to run a bit of JS, so blocking JavaScript will take care of this.


> The Atlantic first identified "dark social" traffic back in 2012 to describe traffic coming from messaging apps that had been stripped of referrer data because messaging and email use the secure "HTTPS" system rather than the open "HTTP" system used by web pages.

Excuse me?! How does this trash get published?
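For reference, the rule that the quoted sentence mangles can be sketched roughly as follows. This assumes the long-standing browser default (often called "no-referrer-when-downgrade"): the Referer header is left out when navigating from an HTTPS page to a plain HTTP page, and is also absent when there is no referring page at all, as with a link opened from an app. The function name `sends_referer` is made up, and real browsers apply further rules and site-set policies on top of this.

```python
from urllib.parse import urlsplit


def sends_referer(source_url, target_url):
    """Simplified default: omit Referer only on an https -> http navigation."""
    if not source_url:
        return False  # no referring page (app, bookmark, typed URL)
    source_scheme = urlsplit(source_url).scheme.lower()
    target_scheme = urlsplit(target_url).scheme.lower()
    return not (source_scheme == "https" and target_scheme == "http")
```

Under this rule a click from an HTTPS webmail page to an HTTP article arrives without a referrer, while HTTPS to HTTPS and HTTP to HTTP navigations still carry one.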


How can one trust anything that website says after this kind of statement? Because clearly, these guys don't know what they are talking about.

Pretty sure they don't know what HTTP is and what a header is.

How many factual errors are there in other articles that don't deal with the web?


Fun thing: I went on twitter to tell him his editor missed fact checking on that completely. Turns out the author of the piece is "Founding editor of Business Insider U.K.".

Update: He has replied and said he would look into it and correct it. Given that he's aware of the discussion here, I suspect other things might get corrected too. :)

https://twitter.com/Jim_Edwards/status/529588985690337280


It is BSinsider, so it is BS as usual.


How any Business Insider trash gets on the front page of HN would be the more relevant question.


This is another reason why a lot of links to certain pages have the referral or campaign information hardcoded into their URLs, which removes the need for a referral header. I guess if the Guardian really wants that information they can make a deal with app publishers or Reddit or whoever to add a header like that.
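A small sketch of how a site might read that hardcoded information back out of the landing URL. The "utm_" prefix is the Google Analytics naming convention and is an assumption here, since the comment does not say which parameters are meant; `campaign_info` is a hypothetical helper name.

```python
from urllib.parse import urlsplit, parse_qs


def campaign_info(url, prefix="utm_"):
    """Return the campaign parameters found in a URL's query string."""
    query = parse_qs(urlsplit(url).query)
    return {key: values[0] for key, values in query.items() if key.startswith(prefix)}
```

For a link such as http://www.theguardian.com/story?utm_source=reddit&utm_medium=social the source and medium are recoverable even when no Referer header arrives.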

But on the other hand, given how it was the Guardian that published all of the original Snowden articles and information, they of all newspapers should applaud this rapid increase in privacy among their readers.

I can't actually read the dates on the charts, but I assume the increase started shortly after the Snowden revelations, when more sites were enabling https by default and people became more privacy-aware.

So the Guardian's a bit inconsistent here. On the one side they go "Big Brother is watching you!", on the other (in this article) they go "We're Big Brother and we can't see you anymore!"


It's the difference between the person you're phoning recording the phone call and the telco doing it.


The funny thing is that there are scores of people here who are actively trying to get the Guardian out of business, because they regard anything related to advertising as evil. Will we be more or less free without the Guardian and similar independent sources of information?


You could of course simply pay for your online news(paper), the same way you used to pay for your paper based one.

Online newspaper is a bit strange, something like a 'plastic glass'.

I'm all for it, but most business models revolve around advertising somehow.


Yes, but advertising plays a big role on paid newspapers too. They would be just too expensive without ads. Moreover, before the free newspapers on the internet, the only mainstream free way to get news was from TV, which is much more superficial and easier to control by governments. I think we are much more free and more informed thanks to ad-supported online news sources. That's why I find all this hate for that business model wrong.

Final remark: A website like Hacker News would not be possible if all the content was behind a paywall. Would that really be a better internet?


That's the beauty of hackernews, it doesn't need a paywall or advertising. The presence of the users is the payment.


But hackernews links to free information sources. No free information sources => no hackernews...


Plenty of the information sources linked to are free of advertising and not behind paywalls. In fact, those are probably the better information sources.

That's how the web started, remember: no ads. Just free sources of information.


Not behind paywalls, I agree. Free of advertising... it would be interesting to have stats at hand, but I'm not so sure.

Anyway, thank you for reminding me I'm old enough to remember how the web started :D


I am fine with ads; I am not fine with tracking and data mining at the cost of my privacy.

The internet is built on free stuff. It got big on free content. HN is free content. Look at all those comments that people leave for free. The belief that things must cost money is a fallacy!


And if there wasn't a glut of great free content on the internet, it might actually be worth paying for.


Where free means "supported by advertising"?


> So the Guardian's a bit inconsistent here. On the one side they go "Big Brother is watching you!", on the other (in this article) they go "We're Big Brother and we can't see you anymore!"

Like spam, it's only bad when other people are doing it.


Interesting, I had never made the connection between the referer and advertising, which seems pretty obvious in retrospect.

This part makes me a bit uneasy.

> The frustration here is that search, apps and HTTPS traffic all represent different types of readers arriving at The Guardian for different reasons — and not knowing that data hurts the Guardian's ability to serve those readers relevant content.

I am not sure I would be interested in a "tailored" news experience. (Or is "relevant content" a weasel word for "relevant ads"?)


You get it anyway. Are you kidding me? Were you born yesterday?


It's pretty crazy that browsers send the referrer in the first place. Getting rid of it, accidentally or not, is not a bug.

I use refcontrol[1] to spoof the referrer. I'm always visiting from the front page of the website even though I'm almost never visiting from the front page.

[1] https://addons.mozilla.org/en-US/firefox/addon/refcontrol/


Exactly. On Chrome, I use Referer Control [1], which I guess does the same.

[1] https://chrome.google.com/webstore/detail/referer-control/hn...

It's also very good for avoiding those annoying image hosts that serve you a .html page with ads when they detect that you clicked a .png file from another web page, or those that just show you an image asking you not to hotlink content.


Not that crazy. It has valid use detecting/preventing hotlinking/bandwidth leeching. Plus otherwise how would you find out what search terms were driving traffic?
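Both uses can be sketched in a few lines. This is an illustration under assumptions, not any particular server's logic: `own_hosts` is whatever set of hostnames the site considers its own, and the search-term helper assumes the search engine puts the query in a `q` parameter of the referring URL.

```python
from urllib.parse import urlsplit, parse_qs


def is_hotlink(referer, own_hosts):
    """True when an asset request clearly comes from a page on another site."""
    if not referer:
        return False  # no referrer: cannot tell, so give the benefit of the doubt
    host = (urlsplit(referer).hostname or "").lower()
    return host not in own_hosts


def search_terms(referer):
    """Return the 'q' query parameter of a referring search URL, if any."""
    if not referer:
        return None
    return parse_qs(urlsplit(referer).query).get("q", [None])[0]
```

Note that both checks fail open when the header is missing or spoofed, which is exactly what the referrer-control extensions mentioned above exploit.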


"executives at the company cannot figure out where it is coming from"

Maybe get the engineers to have a look instead.


That's what you get from forcing your app on facebook.

P.S. remember this: http://rational.pdimension.net/2011/10/11/do-not-use-the-gua...

This is backlash, enjoy it.


I will never forget it, and I still don't click on links to the guardian.


Ugh, you dolts are like Stallman, other than never having produced anything of value that is.


"<...>not knowing that data hurts the Guardian's ability to serve those readers relevant content."

Actually one of the main reasons why I use various anonymizers is that I don't want relevant content, for the same reason I don't want to see Facebook's "top stories" -- most often it turns out to be totally irrelevant, clickbait or complete bullshit. Leave me the choice of what's interesting to me and what I want to see.


But how are they (FB, Google, the Guardian) going to make money if you won't let them tell you what to think????


I saw this illustration yesterday and it perfectly illustrates why I won't ever be a "normal" visitor to spying websites. Dear journalists, please consider your integrity to a better society. If you can't publish your thoughts without selling your readers to tracking and other evil, then I won't be crying after you. :\

http://i.imgur.com/AqL7C28.jpg


Funny how the tone suggests that this is something malicious being done to the guardian.


It kind of is. The Guardian is funded by advertising and this limits their ad sales story. As the article points out, the main beneficiary of this is Google - who are essentially competing with the Guardian for ad sales. I don't always agree with the Graun, but I believe in plurality rather than a Google-dominated world.

Throwaway because of inevitable downvotes from the privacy crowd.


> The Guardian is funded by advertising and this limits their ad sales story ... inevitable downvotes from the privacy crowd.

The privacy crowd are right; and so are you. This is a huge internal conflict in the web today - how do you make it pay, keep it free and not have it track users?


Indeed. I think we need to be a bit more nuanced about what we ask for.

Simple referrer information is very valuable to the Guardian (and others who sell ads), and is in most circumstances not a significant privacy leak. Cross-site tracking via third-party cookies, however, has significant privacy implications.

I would like to see a user/advertiser understanding of "this far and no further". But unfortunately the debate is sufficiently polarised I can't see any change to the current situation, where the tech-savvy disable everything and the less experienced stick with their defaults, which effectively reduces it to an arms race between the big browser manufacturers and a handful of ad networks.


This presumes that 'keep it free' is actually desirable. Free journalism means the person reading it is the product. Perhaps the less of that, the better.


The existing model has been to subsidize newspaper sales with advertising for a very long time. This same model carried forward through radio and television.

The alternative is a pay-per-view service, or subscribing to wires. I haven't done any research into the viability of that type of service, but it is a paradigm shift.


Pick two.


It has been the "not track users" that has suffered; the other two have been picked.

The gist of this article is that users have found ways to not be tracked: this was inevitable, the person running the browser has a lot of control over what the browser sends and where.

So, that pick isn't working all the time any more.


>The Guardian is funded by advertising

Not entirely, no. It is funded by a trust, and part of the Guardian Media Group is an investment fund.


...yet when you read it, it sounds like the explanation is that someone clicks on a link in an app, which opens that link using Chrome (or whatever the WWW intent app is) and somewhere en route the referrer is lost.

The collar doesn't match the cuffs in this article, methinks.


For me as a Web Analytic Consultant this sounds so wrong:

We track campaigns through campaign-parameters but apps are a blind spot of course, that's nothing new. Relying on the referrer is stupid, most apps don't provide a referrer because they are Apps and not websites! Browsers provide a referrer if you are coming from another site. An App isn't a website, so there's no referrer.

Of course there's no referrer, it's a new browser window! Campaign parameters here would also not be very helpful. If the visitor copies the link not from a theguardian.com visit, but after coming to the story through a campaign URL, he would copy the URL with the campaign parameters and paste it into the app. This would be even more wrong, but happens daily!

We Web Analysts should get used to it: People are becoming aware of privacy more than in the past and we can't always measure everything and everyone. Get over it!
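The copied-campaign-URL problem described above is usually handled by stripping the campaign parameters before a link is shared or stored. A minimal sketch, assuming the parameters follow the "utm_" naming convention (an assumption, since the comment does not name them); `strip_campaign_params` is a hypothetical helper.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode


def strip_campaign_params(url, prefix="utm_"):
    """Return the URL with every query parameter starting with prefix removed."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if not key.startswith(prefix)
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))
```

Other query parameters are kept in their original order, so the cleaned link still points at the same page.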


Bot traffic is likely a bigger culprit for "dark traffic" than people becoming aware of privacy tools.

http://www.bbc.com/news/technology-25346235


For me as a Web Analytics Consultant this sounds very wrong:

A fellow Web Analytic Consultant doesn't know how to remove URL string parameters.

Oh wait! That makes me look better in front of clients. Nevermind, carry on.


"The frustration here is that search, apps and HTTPS traffic all represent different types of readers arriving at The Guardian for different reasons — and not knowing that data hurts the Guardian's ability to serve those readers relevant content."

I think they meant to say 'relevant advertising' there not 'relevant content' as the content should, in theory, be the same regardless of how you got there. The interesting bit is that I've seen advertising contracts where you can't advertise with unapproved networks on a referred link from a Google SERP. Only on the second click can you do that pop-under or egregious flying frisbee ad. So if you are trying to be 'safe' you don't do any of that nonsense if you can't tell the difference, and I'm guessing that cuts into revenue.


Much could also be fake traffic, intended to defraud advertisers or others measuring audiences.

Even assuming The Guardian itself is not a knowing participant in such schemes, its sites could receive such traffic when fraudsters try to make the full behavior of their sources look more legitimate.


I think it's clear that almost all of the dark traffic is simply https sites. It's not necessarily apps, it can just be gmail...

Someone really should make it a default to pass the referrer even for https.


Haha, wat?

edit (less obtuse): there should be less passing of referral headers, not more. Browsing is already such a leaky experience privacy-wise, we shouldn't be clamouring for it to become worse...



They could just randomly survey a sample of the dark traffic users to find out where they're coming from.



