I built a personal extension which blocks all the traffic from Dan Pollock's list [1], then blocks all the traffic from major service providers (Google Analytics, etc.) and social networks (fb, tw, google, etc.) when not on their website. It also removes the Referer and User-Agent headers; I haven't found the need to remove other headers.
Currently working on preventing (and manually allowing) all xhr/script/image requests 2 seconds after the main frame has loaded.
The internet is a lot faster for me, and the battery appears to last longer. So far, I've only had problems logging into instagram, but once the cookie is set, I can re-enable blocking ads and tracking.
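For anyone curious, here is a rough sketch of that decision logic in Python. This is not the actual extension: the domain sets and the function names (`should_block`, `strip_headers`) are placeholders made up for illustration, and a real version would load the full hosts list and hook into the browser's request API.

```python
from urllib.parse import urlsplit

# Hypothetical sample data; a real setup would load the full hosts list [1].
BLOCKED_HOSTS = {"hits.theguardian.com", "scorecardresearch.com"}
# Hypothetical providers that are blocked only when embedded on other sites.
THIRD_PARTY_ONLY = {"facebook.com", "twitter.com", "google-analytics.com"}
# Header names to drop, compared case-insensitively.
STRIPPED_HEADERS = {"referer", "user-agent"}


def host_matches(host, domain):
    """True if host equals domain or is a subdomain of it."""
    return host == domain or host.endswith("." + domain)


def should_block(request_url, page_url):
    """Decide whether a request made from page_url should be blocked."""
    req_host = (urlsplit(request_url).hostname or "").lower()
    page_host = (urlsplit(page_url).hostname or "").lower()
    # Rule 1: always block anything on the blocklist.
    if any(host_matches(req_host, d) for d in BLOCKED_HOSTS):
        return True
    # Rule 2: block big providers unless we are on their own site.
    for d in THIRD_PARTY_ONLY:
        if host_matches(req_host, d) and not host_matches(page_host, d):
            return True
    return False


def strip_headers(headers):
    """Return a copy of headers without Referer and User-Agent."""
    return {k: v for k, v in headers.items() if k.lower() not in STRIPPED_HEADERS}
```

With these sample sets, a Facebook widget embedded on a news page is blocked, while the same host loads normally when browsing facebook.com itself.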
Looks like that list hasn't been updated since our domain switch. If you really want to block our internal analytics (which are in practice fairly harmless) replace "hits.guardian.co.uk" with "hits.theguardian.com".
Nominated for the most classy comment of the month. That the product manager of The Guardian would take the time out from his (no doubt) busy day to help a user to block tracking is an amazing display of trust in that the user knows best what is good for them. Thank you.
I wonder why companies who host their own analytics choose to use a separate request to track users?
Surely it is better just to get the data from your own content web server logs? Wouldn't you get the same information while saving on the extra HTTP request? It would make the site loading times slightly faster as well.
Server logs will track all requests from crawlers, accelerators, and cancelled navigations. An AJAX callback will only happen for real human visitors that actually view the page.
I admire the fact that you offer this kind of info :-) thanks mate, always admired the Guardian for its articles, authors & general stance towards privacy.
Considering that they tell almost 20 other domains about your visit, I would not call that stance on privacy a strong one: 2o7.net, ajax.googleapis.com, chartbeat.com, chartbeat.net, dqwufkbc3sdtr.cloudfront.net, facebook-web-clients.appspot.com, google.com, googleadservices.com, googletagservices.com, imrworldwide.com, mathtag.com, ophan.co.uk, optimizely.com, outbrain.com, scorecardresearch.com, twitter.com, wunderloop.net, www.googleapis.com
I really wonder how this 10K-line list is handled by the network subsystem. Is it compiled into a big regexp, or some other form of compacted runnable logic? And at what point does it slow down local name lookup?
At a guess, by doing a match first on a substring using a hashtable. That way if something is a 'candidate' you can hit a more expensive datastructure to figure out if you really have a hit without burning a lot of cycles.
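To make that guess concrete, here is a minimal sketch of a two-stage lookup in Python. It is only an illustration of the idea, not how any particular resolver or OS actually does it. The helper names (`candidate_key`, `build_index`, `is_blocked`) are made up, and matching parent domains is an assumption on my part (a plain hosts file only matches exact names).

```python
from collections import defaultdict


def candidate_key(host):
    """Cheap key: the last two labels of the hostname."""
    return ".".join(host.lower().rstrip(".").split(".")[-2:])


def build_index(blocked_hosts):
    """Group blocked hosts into hash buckets by their cheap key."""
    index = defaultdict(set)
    for h in blocked_hosts:
        h = h.lower().rstrip(".")
        index[candidate_key(h)].add(h)
    return index


def is_blocked(host, index):
    host = host.lower().rstrip(".")
    candidates = index.get(candidate_key(host))
    if not candidates:
        return False  # cheap miss: most lookups stop here
    # More expensive step, only for candidates: test the host and each
    # of its parent domains against the bucket.
    labels = host.split(".")
    return any(".".join(labels[i:]) in candidates for i in range(len(labels)))
```

The point is that a lookup for an unrelated name costs one hash probe regardless of how many thousands of entries the list has.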
There are a few things to take out to make certain services I like work (e.g. Hulu), so I usually try it in sections to see which services break and comment those out.
Do you do anything about plugins sharing your fonts list? I found it to be the least known / most creepy way to identify a big number of users. Unfortunately I don't know of any extensions that would block that information.
"""The Atlantic first identified "dark social" traffic back in 2012 to describe traffic coming messaging apps that had been stripped of referrer data because messaging and email use the secure "HTTPS" system rather than the open "HTTP" system used by web pages.
Fun thing: I went on twitter to tell him his editor missed fact checking on that completely. Turns out the author of the piece is "Founding editor of Business Insider U.K.".
Update: He has replied and said he would look into it and correct it. Since he's aware of the discussion here, I suspect other things might get corrected too. :)
This is another reason why a lot of links to certain pages have the referral or campaign information hardcoded into their URLs, which removes the need for a referral header. I guess if the Guardian really wants that information they can make a deal with app publishers or Reddit or whoever to add a header like that.
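As a rough sketch of how a site might classify a visit under that scheme: check the URL for a campaign parameter first and fall back to the header only if it is absent. The `utm_source` parameter name is an assumption here (it is a common convention, but nothing in the article says what the Guardian uses), and `traffic_source` is a made-up helper.

```python
from urllib.parse import urlsplit, parse_qs


def traffic_source(url, referer=None):
    """Classify a visit from its landing URL and optional Referer header."""
    query = parse_qs(urlsplit(url).query)
    campaign = query.get("utm_source")  # assumed parameter name
    if campaign:
        # Hardcoded campaign info wins: it survives apps and HTTPS-to-HTTP hops.
        return "campaign:" + campaign[0]
    if referer:
        return "referer:" + (urlsplit(referer).hostname or "unknown")
    # Neither signal is present: this is the "dark" bucket.
    return "dark"
```

A link shared from an app with no parameter and no header lands in the "dark" bucket, which is exactly the traffic the article is about.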
But on the other hand, given how it was the Guardian that published all of the original Snowden articles and information, they of all newspapers should applaud this rapid increase in privacy among their readers.
I can't actually read the dates on the charts, but I assume the increase started shortly after the Snowden revelations, when more sites were enabling https by default and people became more privacy-aware.
So the Guardian's a bit inconsistent here. On the one side they go "Big Brother is watching you!", on the other (in this article) they go "We're Big Brother and we can't see you anymore!"
The funny thing is that there are scores of people here who are actively trying to get the Guardian out of business, because they regard anything related to advertising as evil. Will we be more or less free without the Guardian and similar independent sources of information?
Yes, but advertising plays a big role in paid newspapers too. They would be just too expensive without ads. Moreover, before the free newspapers on the internet, the only mainstream free way to get news was from TV, which is much more superficial and easier to control by governments. I think we are much more free and more informed thanks to ad-supported online news sources. That's why I find all this hate for that business model wrong.
Final remark: A website like Hacker News would not be possible if all the content was behind a paywall. Would that really be a better internet?
Plenty of the information sources linked to are free of advertising and not behind paywalls. In fact, those are probably the better information sources.
That's how the web started, remember: no ads. Just free sources of information.
I am fine with ads; I am not fine with tracking and data mining at the cost of my privacy.
The internet is built on free stuff. It got big on free content. HN is free content. Look at all those comments that people leave for free. The belief that things must cost money is a fallacy!
> So the Guardian's a bit inconsistent here, On the one side they go "Big Brother is watching you!", on the other (in this article) they go "We're Big Brother and we can't see you anymore!"
Like spam it's only bad when other people are doing it.
Interesting, I had never made the relation between referer and advertising, which seems pretty obvious in retrospect.
This part makes me a bit uneasy.
> The frustration here is that search, apps and HTTPS traffic all represent different types of readers arriving at The Guardian for different reasons — and not knowing that data hurts the Guardian's ability to serve those readers relevant content.
I am not sure I would be interested in a "tailored" news experience. (Or "relevant content" is a weasel word for "relevant ads")
It's pretty crazy that browsers send referral in the first place. Getting rid of it, accidentally or not, is not a bug.
I use refcontrol[1] to spoof the referral. I'm always visiting from the front page of the website even though I'm almost never visiting from the front page.
It's also very good for avoiding those annoying image hosts that serve you a .html page with ads when they detect you clicked a .png file from another web page, or those that just show you an image asking you not to hotlink content.
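The spoofing rule itself is simple. A sketch of the idea in Python (this is not RefControl's code, and `front_page_referer` is a made-up name): whatever URL is being requested, claim to have come from the root of that same site.

```python
from urllib.parse import urlsplit


def front_page_referer(target_url):
    """Build a Referer value that points at the target site's front page."""
    parts = urlsplit(target_url)
    # Keep only scheme and host; drop the path, query string, and fragment.
    return f"{parts.scheme}://{parts.netloc}/"
```

For a request to `http://i.example.com/images/cat.png?x=1` this yields `http://i.example.com/`, so the host sees what looks like an internal click.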
Not that crazy. It has valid use detecting/preventing hotlinking/bandwidth leeching. Plus otherwise how would you find out what search terms were driving traffic?
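A minimal sketch of the hotlinking check on the server side, assuming a hypothetical site at example.com (`ALLOWED_HOSTS` and `is_hotlink` are made-up names). Treating a missing header as allowed is a design choice, not a requirement, since plenty of legitimate clients send no Referer at all.

```python
from urllib.parse import urlsplit

# Hypothetical: the hostnames that are allowed to embed our images.
ALLOWED_HOSTS = {"example.com", "www.example.com"}


def is_hotlink(referer):
    """True if the Referer header names a host outside ALLOWED_HOSTS."""
    if not referer:
        return False  # no header: let it through rather than break direct visits
    host = (urlsplit(referer).hostname or "").lower()
    return host not in ALLOWED_HOSTS
```

Note that this only works against clients that send the header honestly, which is why spoofing it defeats the check.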
"<...>not knowing that data hurts the Guardian's ability to serve those readers relevant content."
Actually one of the main reasons why I use various anonymizers is that I don't want relevant content, for the same reason I don't want to see Facebook's "top stories": most often it turns out to be totally irrelevant, clickbait or complete bullshit. Leave me the choice of what's interesting to me and what I want to see.
I saw this illustration yesterday and it perfectly illustrates why I won't ever be a "normal" visitor to spying websites. Dear journalists, please consider your integrity to a better society. If you can't publish your thoughts without selling your readers to tracking and other evil, then I won't be crying after you. :\
It kind of is. The Guardian is funded by advertising and this limits their ad sales story. As the article points out, the main beneficiary of this is Google - who are essentially competing with the Guardian for ad sales. I don't always agree with the Graun, but I believe in plurality rather than a Google-dominated world.
Throwaway because of inevitable downvotes from the privacy crowd.
> The Guardian is funded by advertising and this limits their ad sales story ... inevitable downvotes from the privacy crowd.
The privacy crowd are right; and so are you. This is a huge internal conflict in the web today - how do you make it pay, keep it free and not have it track users?
Indeed. I think we need to be a bit more nuanced about what we ask for.
Simple referrer information is very valuable to the Guardian (and others who sell ads), and is in most circumstances not a significant privacy leak. Cross-site tracking via third-party cookies, however, has significant privacy implications.
I would like to see a user/advertiser understanding of "this far and no further". But unfortunately the debate is sufficiently polarised I can't see any change to the current situation, where the tech-savvy disable everything and the less experienced stick with their defaults, which effectively reduces it to an arms race between the big browser manufacturers and a handful of ad networks.
This presumes that 'keep it free' is actually desirable. Free journalism means the person reading it is the product. Perhaps the less of that, the better.
The existing model has been to subsidize newspaper sales with advertising for a very long time. This same model carried forward through radio and television.
The alternative is a pay-per-view service, or subscribing to wires. I haven't done any research into the viability of that type of service, but it is a paradigm shift.
It has been the "not track users" that has suffered; the other two have been picked.
The gist of this article is that users have found ways to not be tracked: this was inevitable, the person running the browser has a lot of control over what the browser sends and where.
So, that pick isn't working all the time any more.
...yet when you read it, it sounds like the explanation is that someone clicks on a link in an app, which opens that link using Chrome (or whatever the WWW intent app is) and somewhere en route the referrer is lost.
The collar doesn't match the cuffs in this article, methinks.
For me as a web analytics consultant this sounds so wrong:
We track campaigns through campaign-parameters but apps are a blind spot of course, that's nothing new.
Relying on the referrer is stupid, most apps don't provide a referrer because they are Apps and not websites! Browsers provide a referrer if you are coming from another site. An App isn't a website, so there's no referrer.
Of course there's no referrer, it's a new browser window!
Campaign-Parameters here would also not be very helpful. If the Visitor copies the link not from a guardian.com visit, but after coming to the story through a campaign-URL, he would copy the URL with the campaign parameters and paste it into the app. This would be even more wrong, but happens daily!
We Web Analysts should get used to it: People are becoming aware of privacy more than in the past and we can't always measure everything and everyone. Get over it!
"The frustration here is that search, apps and HTTPS traffic all represent different types of readers arriving at The Guardian for different reasons — and not knowing that data hurts the Guardian's ability to serve those readers relevant content."
I think they meant to say 'relevant advertising' there not 'relevant content' as the content should, in theory, be the same regardless of how you got there. The interesting bit is that I've seen advertising contracts where you can't advertise with unapproved networks on a referred link from a Google SERP. Only on the second click can you do that pop-under or egregious flying frisbee ad. So if you are trying to be 'safe' you don't do any of that nonsense if you can't tell the difference, and I'm guessing that cuts into revenue.
Much could also be fake traffic, intended to defraud advertisers or others measuring audiences.
Even assuming The Guardian itself is not a knowing participant in such schemes, its sites could receive such traffic when fraudsters try to make the full behavior of their sources look more legitimate.
edit (less obtuse): there should be less passing of referral headers, not more. Browsing is already such a leaky experience privacy-wise, we shouldn't be clamouring for it to become worse...
I guess I am one of those 'dark traffickers' :P
[1] http://someonewhocares.org/hosts/