many websites leak information. for example, most websites including a facebook like button make your browser tell facebook “hey, i’m visiting this page”. some for google and their +1 button, and for most other of these fancy little “web 2.0 buttons” you can find nowadays at many places of the web. or they make your browser feed google analytics with your data, or any other web tracking service. and usually, you don’t notice any of this, as it appears hidden in the background.
in this post, i want to shed a bit more light on these things and their (possible) consequences, and tell a bit on how to avoid these problems both as a user and as a webmaster. in the beginning of spielwiese, i already wrote a little bit about this here, and i also mentioned the problem while writing about social networks. you might want to look at the first post if you don’t really know what is happening if you access a webpage with your browser.
in the examples below, i’m assuming that standard javascript and cookie settings are used.
what’s going on?
assume that some website you are visiting includes the (standard) facebook like button (via facebook.com). as soon as you access that page, its html code will contain
<iframe src=”http://www.facebook.com/plugins/like.php?href=…” scrolling=”no” frameborder=”0″ style=”border:none; width:450px; height:80px”></iframe>
your browser automatically accesses http://www.facebook.com/plugins/like.php?href=…
to retrieve the content from that url and essentially puts it into the place of the iframe. moreover, if you’re a facebook user, and you are logged in, your browser will have cookies for *.facebook.com
(containing your user id), which will be sent automatically with this http request. so at this point, without any javascript interaction, facebook already knows whether you are logged on, and who you are if that is the case. note that facebook could also set a facebook.com cookie when none is already set, to be able to further track you. it seems like that is not the case, but if they would do, you probably won’t notice at all. now the html page sent by http://www.facebook.com/plugins/like.php?href=…
includes several javascripts, which your browser will automatically execute, and which could sent more information like your screen resolution. facebook isn’t doing that, as far as i know, but they could with only very few people noticing, if at all.
another resource are web trackers, which try to gather statistical data for the webmaster who included them, but might also use this information for other things. these work similar to the like button: the user has to somehow include them. maybe as a 1×1 transparent pixel, or as a java script, or both. the pixel will ensure that basic data is sent even in case javascript is disabled or not available, as long as images are loaded. the java script will sent additional info making identification of the visitor easier, and with both accesses to the web server of the web tracker, cookies can be retrieved or set, allowing the web tracker to identify you along different sessions. they probably don’t know who you are, but they can distinguish you from your friends, even if you share your internet connection (and the same ip address!) with them and your computers are essentially identically configured.
but there also many other sources of leaking. for example, if a youtube video is included in the web page you’re accessing. in that case, your browser will ask youtube.com
for the flash player, and that retrieve a image from ytimg.com
and, when you click the play button, will stream and play the video from other youtube servers. again, a lot of things can be leaked. if the flash plugin is disabled, this won’t happen, but most people have that one installed and active as otherwise, many sites will not work properly.
how to prevent your browser from leaking.
the radical solution is to disable javascript, disable cookies and disable all plugins like the flash plugin. but that still doesn’t solve the problem that basic http access data (browser id string, referrer, your ip address) is sent to certain included sites, for example if they are included using iframe
s or as images (maybe even transparent of size 1×1 so you won’t notice). so without add-ons, standard browsers can always be made to leak something.
for firefox, there are several helpful addons. two very helpful ones are the following:
- noscript. this addon allows to say from which sites javascripts are allowed to run and from which not. unfortunately, this does not depend on the source site, so if you allow the facebook javascripts to work (which you need if you access facebook itself), they are also allowed to run if another site includes the like button.
-
requestpolicy. this addon allows to block access from sites to other sites, for all kind of requests (loading images, scripts, even loads made by the flash plugin). as the blocking is by default depening on both source and destination of the access, this block access to
facebook.com
from any other site butfacebook.com
, hence making it impossible for facebook to track you except if you allow a site to (temporarily or always from now on) load content fromfacebook.com
.
note that both plugins require a lot of user interaction. most websites won’t work properly, and too many will look very ugly in case you don’t allow certain scripts to run or certain data to be fetched from different servers. it is annoying to find out which things you have to active without giving too much access, and can be very frustrating. but after some time, you’ll have the sites you’re using most set up properly, and most things you do in the web run without any more interaction. visits to new sites, though, are still adventurous.
how to prevent your page from making leak.
first, a few words why you should try not to leak. the first is trust: the visitors of your page trust you. in particular, they usually don’t want that you send their information to not perfectly trustable other sites. and then, there are users who block such things, like me. if i access your page and you rely, say, on google analytics to track and count your users, you won’t be able to see me. i’ll be missing in your statistics, even though i accessed your page. (and since i’m advanced enough to use addons to see where my data is supposed to be sent, i know how much you care about my data, and how much i can trust you.) so i do appreciate if sites do not leak information.
there are two basic strategies to avoid leaking:
-
include external things only when the user needs them. for example, you could use javascript yourself to display a decoy version of the facebook like button or the google+ +1 button (stored on your server!), and as soon as the user clicks on it, your script loads the real button and the associated javascripts and forwards your click to them. or you just use a decoy version of the button (again, stored on your server!) and just link it, for example to a facebook url which will then allow the facebook user to share your page. for example, instead of the like button described above, you could do something like this:
<a href=’http://www.facebook.com/sharer.php?u=http://url.of/your/site&t=Title of your site’ target=’_blank’ title=’Share!’><img alt=’Share!’ src=’images/facebook.gif’ /></a>
here,
images/facebook.gif
should obviously be a image on your server. -
act as a proxy. if you want to show information, like how many people like your page, show some faces who like it, show some statistics (how many users are online right now etc.), the usual solution is to include some javascript/
iframe
from the content provider (facebook, your web tracker, …), so the user’s browser will access the data directly from that provider. with the drawback that the provider knows that the user asked. (which is necessary for web trackers to work, but not for showing the numer of people who like a page.) there are also things which indicate your online status on skype, icq, or other services, which you can include in your page and which make the user’s browser access some other site. for most instances, one can avoid this problem by adding some kind of proxy: instead of making the user retrieve some (automatically generated) image, you link to a script on your page which retrieves the image from the provider’s server and forward it to the user. then the provider just sees that your server is asking for the picture, while the user still sees the information on the picture without giving information to the provider.
note that both strategies give you (more or less) extra work. especially setting up a proxy script on your server is very non-trivial, if you cannot just use something publicly available. and more importantly, if you do not have good enough access to the server – for example if you have a blogspot blog, or a wordpress blog running on wordpress.com
– you are severely limited in what you can do. also note that you can combine the strategies. for example, if you include many youtube videos, and you use a javascript to only start the flash player when the user clicks on it, you could use the proxy strategy to let your server automatically retrieve the picture from ytimg.com
which is shown until the user clicks. [edit: this post describes how this can be done. both spielwiese and musikwiese now use this technique.] note that it is also a good idea to check your own site for leaking, sometimes you’ll get surprises as certain plugins for wordpress for example include javascripts from random places, sometimes completely unnecessarily. for that, you can use firefox with noscript and requestpolicy installed (if you don’t like the addons, disable them but install them anyway, to be able to enable them from time to time to test your own site) and browse your site. in the firefox status bar, you’ll see a flag for requestpolicy; if it is red, then requestpolicy blocks something. click the flag to see what is blocked, and to allow (or block) certain destinations. you can also use this (as well as noscript) to test which (external) javascripts are needed for your page to be usable.
you have to decide for yourself how much data you make your visitors leak. and remember, most visitors appreciate that you don’t spread their data unnecessarily. and some visitors can even check what you’re doing, and know how much information you (try to) make them leak.
comments.
one thing i forgot to add: if you want to track your users, use your server’s log files or install some script which logs the users into a database if you cannot access the logs. there are tons of logfile analyzers out there, as well as there are scripts which do the logging for you.