skip to main content.

posts about data protection.

in my last post, i wrote about how to maximize privacy for your site’s visitor. in the meantime, i programmed a mechanism which maximizes privacy when embedding youtube or vimeo videos. the idea is simple: i embed only a picture, and use javascript to replace the picture with the video player. in case javascript is disabled, the image has a link to the video, and in case javascript is there and the image is replaced by the video player, i enable autostart for the video. for the casual user, the result looks similar to just embedding the video, but the difference is that youtube or vimeo only get the visitor’s information if he wants to watch the video. here is an example:

[[for legal reasons, i do not want to include youtube videos here anymore. please click on this link to watch the video at youtube.]]

now, how do i do this? to retrieve the image, i use a php script so that the user will not interact directly with youtube or vimeo. the script obtains the url for the picture, retrieves the picture, and delivers it to the user. here’s the script:

 1<?php
 2// Tests a file name for being valid
 3function isValidID($id)
 4{
 5    return preg_match('#^([a-zA-Z0-9-_]+)$#', $id) == 1;
 6}
 7
 8// Checks if the source is valid
 9function isValidSource($source)
10{
11    return ($source == 'youtube') || ($source == 'vimeo');
12}
13
14$id = $_GET['id'];
15$source = $_GET['source'];
16
17if (isValidID($id) && isValidSource($source))
18{
19    // Find out URL
20    if ($source == 'youtube')
21    {
22        // Youtube is easy...
23        $url = "http://img.youtube.com/vi/$id/0.jpg";
24    }
25    if ($source == 'vimeo')
26    {
27        $url = "http://vimeo.com/api/v2/video/$id.php";
28        $content = @file_get_contents($url); // avoid error message on 404
29        if ($content)
30        {
31            $content = unserialize($content);
32            $url = $content[0]['thumbnail_large'];
33        }
34        else
35            $url = false;
36    }
37}
38else
39    $url = false;
40
41// Retrieve picture
42if ($url)
43    $content = @file_get_contents($url); // avoid error message on 404
44else
45    $content = false;
46
47// Send picture, or print 404
48if ($content)
49{
50    // Deliver file
51    header('Status: 200 OK', true, 200);
52    header('HTTP/1.0 200 OK', true, 200);
53    header('Content-Type: image/jpeg');
54    header('Content-Transfer-Encoding: binary');
55    header('Content-Length: ' . strlen($content));
56    echo $content;
57    exit(); 
58}
59else
60{
61    // This results in a 404
62    require_once('index.php');
63}

note that the 404 display at the end works thanks to wordpress. if you have the script in another directory as your wordpress installation, or you are not using wordpress at all, you have to modify that part. just replace the require_once line by

1    header('Status: 404 File Not Found', true, 404);
2    header('HTTP/1.0 404 File Not Found', true, 404);
3    echo '<html><body>Error: invalid arguments or cannot retrieve picture.</body></html>';

or something similar.

ok. so let us assume that you put the script somewhere on the server, say at /tubeimage.php as i did. then, you need to insert the following javascript code into the header of your html files. in case you use wordpress, you can write a plugin to do that for you (i did that), or modify the wp-content/themes/yourtheme/header.php file, where you replace yourtheme by the name of the theme you’re using. add the following lines:

 1<script type="text/javascript">//<![CDATA[
 2function youtubeClick(id, w, h)
 3{
 4    elt = document.getElementById('youtube-' + id);
 5    elt.innerHTML = '<object class="youtube-object type="application/x-shockwave-flash" data="http://www.youtube-nocookie.com/v/' + id + '&amp;rel=0&amp;border=0&amp;autoplay=1" width="' + w + '" height="' + h + '"><param name="movie" value="http://www.youtube-nocookie.com/v/' + id + '&amp;rel=0&amp;border=0&amp;autoplay=1" /></object></div>';
 6    elt.onclick = null;
 7}
 8function vimeoClick(id, w, h)
 9{
10    elt = document.getElementById('vimeo-' + id);
11    elt.innerHTML = '<object class="youtube-object" type="application/x-shockwave-flash" data="http://vimeo.com/moogaloop.swf?clip_id=' + id + '&amp;server=vimeo.com&amp;show_title=1&amp;show_byline=1&amp;show_portrait=1&amp;color=00ADEF&amp;fullscreen=1&amp;autoplay=1&amp;loop=0" width="' + w + '" height="' + h + '"><param name="movie" value="http://vimeo.com/moogaloop.swf?clip_id=' + id + '&amp;server=vimeo.com&amp;show_title=1&amp;show_byline=1&amp;show_portrait=1&amp;color=00ADEF&amp;fullscreen=1&amp;autoplay=1&amp;loop=0"/></object>';
12    elt.onclick = null;
13}
14//]]></script>

note that the <![CDATA[...]] is the xhtml way of hiding javascript; if you’re using plain old html, you need to use <!– … –>.

now, you can add youtube or vimeo videos to your page as follows. for example, the video above was embedded as follows: <span id="vimeo-24456787" onClick="vimeoClick('24456787', 425, 341)"><a href="http://vimeo.com/24456787" rel="external" onclick="return false;"><img src="/tubeimage.php?id=24456787&amp;source=vimeo" width="425" height="341" /></a></span> note that 24456787 is the id of the vimeo video (and appears at four places in this snippet), and the player has size 425x341 (appears at two places in this snippet). for youtube videos, you write <span id="youtube-LFtSRR0xFec" onClick="youtubeClick('LFtSRR0xFec', 425, 341)"><a href=http://www.youtube.com/watch%3Fv=LFtSRR0xFec" rel="external" onclick="return false;"><img src="/tubeimage.php?id=LFtSRR0xFec&amp;source=youtube" width="425" height="341" /></a></span> here, LFtSRR0xFec is the id of the youtube video (and appears at four places in this snippet), and the player has size 425x341 (appears at two places in this snippet). if you are using wordpress, it is a good idea to create a plugin which automatically generates all the necessary code, so you can just write something like [youtube id="LFtSRR0xFec" width="425" height="341"] in your posts.

many websites leak information. for example, most websites including a facebook like button make your browser tell facebook “hey, i’m visiting this page”. some for google and their +1 button, and for most other of these fancy little “web 2.0 buttons” you can find nowadays at many places of the web. or they make your browser feed google analytics with your data, or any other web tracking service. and usually, you don’t notice any of this, as it appears hidden in the background.
in this post, i want to shed a bit more light on these things and their (possible) consequences, and tell a bit on how to avoid these problems both as a user and as a webmaster. in the beginning of spielwiese, i already wrote a little bit about this here, and i also mentioned the problem while writing about social networks. you might want to look at the first post if you don’t really know what is happening if you access a webpage with your browser.
in the examples below, i’m assuming that standard javascript and cookie settings are used.

what’s going on?

assume that some website you are visiting includes the (standard) facebook like button (via facebook.com). as soon as you access that page, its html code will contain

<iframe src=”http://www.facebook.com/plugins/like.php?href=…” scrolling=”no” frameborder=”0″ style=”border:none; width:450px; height:80px”></iframe>

your browser automatically accesses http://www.facebook.com/plugins/like.php?href=… to retrieve the content from that url and essentially puts it into the place of the iframe. moreover, if you’re a facebook user, and you are logged in, your browser will have cookies for *.facebook.com (containing your user id), which will be sent automatically with this http request. so at this point, without any javascript interaction, facebook already knows whether you are logged on, and who you are if that is the case. note that facebook could also set a facebook.com cookie when none is already set, to be able to further track you. it seems like that is not the case, but if they would do, you probably won’t notice at all. now the html page sent by http://www.facebook.com/plugins/like.php?href=… includes several javascripts, which your browser will automatically execute, and which could sent more information like your screen resolution. facebook isn’t doing that, as far as i know, but they could with only very few people noticing, if at all.
another resource are web trackers, which try to gather statistical data for the webmaster who included them, but might also use this information for other things. these work similar to the like button: the user has to somehow include them. maybe as a 1x1 transparent pixel, or as a java script, or both. the pixel will ensure that basic data is sent even in case javascript is disabled or not available, as long as images are loaded. the java script will sent additional info making identification of the visitor easier, and with both accesses to the web server of the web tracker, cookies can be retrieved or set, allowing the web tracker to identify you along different sessions. they probably don’t know who you are, but they can distinguish you from your friends, even if you share your internet connection (and the same ip address!) with them and your computers are essentially identically configured.
but there also many other sources of leaking. for example, if a youtube video is included in the web page you’re accessing. in that case, your browser will ask youtube.com for the flash player, and that retrieve a image from ytimg.com and, when you click the play button, will stream and play the video from other youtube servers. again, a lot of things can be leaked. if the flash plugin is disabled, this won’t happen, but most people have that one installed and active as otherwise, many sites will not work properly.

how to prevent your browser from leaking.

the radical solution is to disable javascript, disable cookies and disable all plugins like the flash plugin. but that still doesn’t solve the problem that basic http access data (browser id string, referrer, your ip address) is sent to certain included sites, for example if they are included using iframes or as images (maybe even transparent of size 1x1 so you won’t notice). so without add-ons, standard browsers can always be made to leak something.
for firefox, there are several helpful addons. two very helpful ones are the following:

  • noscript. this addon allows to say from which sites javascripts are allowed to run and from which not. unfortunately, this does not depend on the source site, so if you allow the facebook javascripts to work (which you need if you access facebook itself), they are also allowed to run if another site includes the like button.
  • requestpolicy. this addon allows to block access from sites to other sites, for all kind of requests (loading images, scripts, even loads made by the flash plugin). as the blocking is by default depening on both source and destination of the access, this block access to facebook.com from any other site but facebook.com, hence making it impossible for facebook to track you except if you allow a site to (temporarily or always from now on) load content from facebook.com.

note that both plugins require a lot of user interaction. most websites won’t work properly, and too many will look very ugly in case you don’t allow certain scripts to run or certain data to be fetched from different servers. it is annoying to find out which things you have to active without giving too much access, and can be very frustrating. but after some time, you’ll have the sites you’re using most set up properly, and most things you do in the web run without any more interaction. visits to new sites, though, are still adventurous.

how to prevent your page from making leak.

first, a few words why you should try not to leak. the first is trust: the visitors of your page trust you. in particular, they usually don’t want that you send their information to not perfectly trustable other sites. and then, there are users who block such things, like me. if i access your page and you rely, say, on google analytics to track and count your users, you won’t be able to see me. i’ll be missing in your statistics, even though i accessed your page. (and since i’m advanced enough to use addons to see where my data is supposed to be sent, i know how much you care about my data, and how much i can trust you.) so i do appreciate if sites do not leak information.
there are two basic strategies to avoid leaking:

  • include external things only when the user needs them. for example, you could use javascript yourself to display a decoy version of the facebook like button or the google+ +1 button (stored on your server!), and as soon as the user clicks on it, your script loads the real button and the associated javascripts and forwards your click to them. or you just use a decoy version of the button (again, stored on your server!) and just link it, for example to a facebook url which will then allow the facebook user to share your page. for example, instead of the like button described above, you could do something like this:
    <a href=’http://www.facebook.com/sharer.php?u=http://url.of/your/site&t=Title of your site’ target=’_blank’ title=’Share!’><img alt=’Share!’ src=’images/facebook.gif’ /></a>

    here, images/facebook.gif should obviously be a image on your server.

  • act as a proxy. if you want to show information, like how many people like your page, show some faces who like it, show some statistics (how many users are online right now etc.), the usual solution is to include some javascript/iframe from the content provider (facebook, your web tracker, …), so the user’s browser will access the data directly from that provider. with the drawback that the provider knows that the user asked. (which is necessary for web trackers to work, but not for showing the numer of people who like a page.) there are also things which indicate your online status on skype, icq, or other services, which you can include in your page and which make the user’s browser access some other site. for most instances, one can avoid this problem by adding some kind of proxy: instead of making the user retrieve some (automatically generated) image, you link to a script on your page which retrieves the image from the provider’s server and forward it to the user. then the provider just sees that your server is asking for the picture, while the user still sees the information on the picture without giving information to the provider.

note that both strategies give you (more or less) extra work. especially setting up a proxy script on your server is very non-trivial, if you cannot just use something publicly available. and more importantly, if you do not have good enough access to the server – for example if you have a blogspot blog, or a wordpress blog running on wordpress.com – you are severely limited in what you can do. also note that you can combine the strategies. for example, if you include many youtube videos, and you use a javascript to only start the flash player when the user clicks on it, you could use the proxy strategy to let your server automatically retrieve the picture from ytimg.com which is shown until the user clicks. [edit: this post describes how this can be done. both spielwiese and musikwiese now use this technique.] note that it is also a good idea to check your own site for leaking, sometimes you’ll get surprises as certain plugins for wordpress for example include javascripts from random places, sometimes completely unnecessarily. for that, you can use firefox with noscript and requestpolicy installed (if you don’t like the addons, disable them but install them anyway, to be able to enable them from time to time to test your own site) and browse your site. in the firefox status bar, you’ll see a flag for requestpolicy; if it is red, then requestpolicy blocks something. click the flag to see what is blocked, and to allow (or block) certain destinations. you can also use this (as well as noscript) to test which (external) javascripts are needed for your page to be usable.
you have to decide for yourself how much data you make your visitors leak. and remember, most visitors appreciate that you don’t spread their data unnecessarily. and some visitors can even check what you’re doing, and know how much information you (try to) make them leak.