, metadata, and freedom

Recently, I came across this piece on that dovetailed nicely with a lot of things that I’ve been thinking about lately.

A friend asked what we are supposed to do about it, and since I couldn’t come up with a pithy, sarcastic Facebook status, here’s my long-winded answer. I should note that I am in no way an expert on any of this, and that there is a ton of better information out there if you’re willing to look for it. This is likely going to be the first of several posts on this topic.

‘Should This Be the Last Thing You Read on’

The piece is worth reading in its entirety, but essentially, it demonstrates that posting research on is not the equivalent of open-access, particularly because (surprise) is a company.  The business model of is not unique, but it’s also not something that I think a lot of academics have paid much attention to:

“The goal is to provide trending research data to R&D institutions that can improve the quality of their decisions by 10-20%. The kind of algorithm that R&D companies are looking for is a ‘trending papers’ algorithm, analogous to Twitter’s trending topics algorithm. A trending papers algorithm would tell an R&D company which are the most impactful papers in a given research area in the last 24 hours, 7 days, 30 days, or any time period. Historically it’s been very difficult to get this kind of data. Scientists have printed papers out, and read them in their labs in un-trackable ways. As scientific activity is moving online, it’s becoming easier to track which papers are getting more attention from the top scientists. There is also an opportunity to make a large economic impact. Around $1 trillion a year is spent on R&D globally: about $200 billion in the academic sector, and about $800 billion in the private sector (pharmaceutical companies, and other R&D companies).”
– Richard Price, CEO of, quoted in Hall.


In other words, there is significant value in your activities if they can be captured. Knowing what you search for, what you read, and who you look up and contact, is potentially very lucrative for and the companies they are looking to sell this data to. For a company like, the best way to be successful is to collect as much metadata as possible on as many people as possible.


If you don’t know what metadata is, here is a good primer from Edward Snowden, and a related discussion of linkability from Jacob Appelbaum. Basically, metadata is data about data.  It is important because it can tell us a lot about a lot of things.  For example, metadata on cellphone calls would include times of phone calls, durations of phone calls, the two participants in the call, etc.  What it does not capture is the content of the call. Linkability is essentially the ability to connect metadata. With linkability and enough metadata, the content of data becomes irrelevant.
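To make the call example concrete, here is a minimal sketch in Python. Everything in it is made up for illustration: the `CallRecord` fields, the phone numbers, and the `link_by_caller` helper are my own assumptions, not how any carrier or agency actually stores or processes call data. The point is only that none of the records contain what was said, yet linking them on a shared field (the caller's number) already produces a profile of who talks to whom.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class CallRecord:
    """Hypothetical call metadata: who, when, how long. No call content."""
    caller: str
    callee: str
    start: str        # ISO 8601 timestamp
    duration_s: int   # call length in seconds


# Invented records for illustration only.
records = [
    CallRecord("555-0101", "555-0199", "2015-11-02T09:14", 340),
    CallRecord("555-0101", "555-0142", "2015-11-02T09:21", 95),
    CallRecord("555-0101", "555-0199", "2015-11-09T09:10", 410),
    CallRecord("555-0177", "555-0142", "2015-11-03T18:02", 60),
]


def link_by_caller(records):
    """Link records that share a caller into a per-caller contact list."""
    contacts = defaultdict(set)
    for r in records:
        contacts[r.caller].add(r.callee)
    return {caller: sorted(callees) for caller, callees in contacts.items()}


profiles = link_by_caller(records)
print(profiles["555-0101"])
```

Swap the caller's number for any other shared identifier and the same few lines link records across entirely different services.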


This is not new, and many internet companies are based on this model – Facebook, Google, pretty much any online service that is free. And as consumers, we are generally okay with using free services knowing that they are collecting data on us.  Here are a few reasons why we should rethink our relationship with these free services.

1) Metadata does not accurately describe anyone

Metadata is nothing more than an abstraction about you and your activities.  For example, it is not the content of your phone call, it is who you called and when.  So it cannot reveal what you and the store clerk spoke about, but it can demonstrate that you were both there at the same time. I must stress that metadata is not inherently evil, but it does perform a specific task of simplification, of abstraction, and ultimately, provides new mechanisms of control.  In ‘Seeing Like a State’, Scott articulates this point nicely:
“Certain forms of knowledge and control require a narrowing of vision. The great advantage of such tunnel vision is that it brings into sharp focus certain limited aspects of an otherwise far more complex and unwieldy reality. This very simplification, in turn, makes the phenomenon at the center of the field of vision more legible and hence more susceptible to careful measurement and calculation. Combined with similar observations, an overall, aggregate, synoptic view of a selective reality is achieved, making possible a high degree of schematic knowledge, control, and manipulation.”
(Scott 1998, p.11).


A large amount of metadata allows the behaviour of a large number of people to become legible. Once legible, it can be used: by understanding aggregate groups of individuals through metadata, these groups can be engaged with, knowledge can be extracted about them, and products can be sold to them. uses metadata to understand what academics are reading in order to scoop new technologies and understand the direction research will go in, while the American government “kills people based on metadata.”


Again, metadata is not bad in and of itself, but we should be wary of its use and our complicity in perpetuating its use.  The problem is that metadata is an abstraction, and institutions use this metadata to determine how they engage with individuals.


In other words, imagine there is a forest, and in that forest there are many trees.  A surveyor sees the forest from a plane, and says “there are two types of trees – those that have dark green leaves, and those that have light green leaves.”  This is the metadata. It leaves out a tremendous amount of information about the forest – and more importantly, it is organized according to a human being or institution that has a particular interest in organizing the forest in a particular way.  Say trees with dark green leaves have stronger wood (I don’t actually know anything about trees).


Now the forest is legible in a certain way, and because it is legible, it can be used.  Forests with more dark green leaves are cut down first for lumber, and only trees that will produce dark green leaves are replanted, without any regard to the dynamics of the forest itself.  The interesting thing is what happens with those trees that do not have dark or light green leaves – but have something in between.  They must become one or the other. They must fit the categories imposed by the metadata to be useful.


When the world is categorized in a certain way, the world can only be understood through those categories, and institutions have incentives to find ways to ensure that these categories are reproduced so that they can continue to understand the world.


This matters because while we are not merely our metadata, institutions can only understand us through it.  This may not be a huge problem for a company like that (let’s be honest) doesn’t create a lot of value for its users, but it is a serious problem when we are talking about States or institutions that are more critical to your daily life. For example, if you live in a particular neighbourhood, and you have a particular kind of job, and you are a particular age, and a particular race, a company’s assessment of the risk of insuring you is going to be based on the metadata profiles of similar people.  Ditto for bank loans.  This is not some dystopian future; it is simply the current way of doing business.

2) Metadata is not actually anonymous

States and institutions conceptualize the world in a certain way and then use that abstraction to draw inferences about individuals.  They can say confidently that 80% of 40-year-old Torontonians use the TTC, but aggregated metadata cannot say that Jon Smith uses the TTC.  In theory, metadata is anonymous, as it does not (or should not) contain identifiable information. This is one of the big arguments for why we should not be concerned with the collection of metadata.

The problem is that with enough metadata, and with linkability, it is not particularly difficult to connect metadata to individuals.  A recent MIT-led study did exactly this. They analyzed “3 months of credit card records for 1.1 million people and show that four spatiotemporal points are enough to uniquely re-identify 90% of individuals.” So while each element of metadata may not uniquely identify you, if there is enough of it, and it has elements in common (such as an IP address), then going from the abstract to the individual is not difficult.
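Here is a toy sketch of why a handful of points goes such a long way. To be clear, this is not the study's data or its method: the three people, their (place, hour) points, and the `unique_fraction` helper are all invented for illustration. It simply counts, for every possible set of k points taken from one person's trace, how often that set matches that person and nobody else.

```python
from itertools import combinations

# Invented traces: each person is a set of (place, hour) points.
traces = {
    "A": {("cafe", 9), ("gym", 18), ("grocer", 19), ("bar", 22)},
    "B": {("cafe", 9), ("gym", 18), ("cinema", 20), ("bar", 22)},
    "C": {("cafe", 9), ("library", 14), ("grocer", 19), ("bar", 22)},
}


def matching_people(traces, points):
    """Return everyone whose trace contains all of the given points."""
    return [p for p, trace in traces.items() if set(points) <= trace]


def unique_fraction(traces, k):
    """Fraction of k-point samples that single out exactly one person."""
    total = 0
    unique = 0
    for person, trace in traces.items():
        for combo in combinations(sorted(trace), k):
            total += 1
            if matching_people(traces, combo) == [person]:
                unique += 1
    return unique / total


for k in range(1, 5):
    print(k, round(unique_fraction(traces, k), 2))
```

Even though every individual point except two is shared with someone else, the fraction of samples that pin down a single person climbs quickly as k grows. Each point looks anonymous on its own; the combination is what gives you away.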

The point here is that institutions do two things – first they create sketches of swathes of groups based on metadata. But then they can also create a sketch of individuals based on their own metadata. An institution thus understands you through two overlapping abstractions. You become a distorted, simplified image – and it is with this distorted, simplified image that states and institutions act.


3) Your labour is being exploited without your knowledge and without remuneration

By participating in these services, by supplying a constant stream of metadata, you are working for these services. Clicking on a paper on creates value – and ultimately money – for the service. In short, you are performing labour and not being remunerated for it. Indeed, the most disturbing thing here is that many people do not see that they are performing labour, and do not realize that it is being exploited.


Take for instance CAPTCHA. In order to ensure that a human and not a robot is accessing certain sites, you have to perform a task that only a human can perform: reading scanned text, or identifying images that contain cake.  A lot of this is for good. For example, Google is using this data to help digitize books – arguably not a bad thing. Indeed, the motivation for doing so seems pretty legit (this is worth watching).


Importantly, when we encounter a site that has a CAPTCHA, we cannot opt out. And we do not know that we are performing labour. We do not know that our actions are being used for profit, and we do not know how. We don’t know what books we are digitizing, because we didn’t know that we were digitizing books.


Another example is GOOG-411. If you remember, you could literally call the internet, and then get Google search results over the phone. But the point of GOOG-411 wasn’t to allow you or me to call the internet, it was to collect voice data – to improve voice recognition and text-to-speech algorithms. So unwittingly, GOOG-411 users laboured for free to benefit a for-profit institution.


For some, this won’t be a problem.  You may think “whatever, I was going to do the labour anyway.” But you weren’t.  For example, CAPTCHA is designed to extract specific labour from you, labour that you weren’t going to do for fun – unless you willingly spend your afternoons selecting the pictures that contain bananas from a random set of photos?  These services are designed to extract a specific kind of labour from you, and to do so in such a way that you do not see it as labour.


Worse is that they extract from the public in order to enrich the private. As Hall notes, “But just as Airbnb and Uber are parasitic on the public ‘infrastructure and the investment’ that was ‘made by cities a generation ago’ (roads, buildings, street lighting, etc.), so has a parasitical relationship to the public education system, in that these academics are labouring for it for free to help build its privately-owned for-profit platform by providing the aggregated input, data and attention value.”

As citizens, we have helped to pay for the scaffolding that supports a service like, which in turn uses that scaffolding to extract more value from us.  It may all be in the abstract, but at the end of the day, our labour is being used to make someone else money. And it is often done without our consent or knowledge.

Metadata & Freedom

All of this matters because it negatively impacts the freedom of individuals.  The most obvious problem is that of privacy. Massive amounts of metadata and linkability destroy privacy, which last I checked is still a fundamental human right. The State’s use of metadata in the name of national security is particularly problematic, and has been written about extensively elsewhere.  But at the end of the day, without the privacy to learn, to read, to look at whatever websites we want, to speak to whoever we want, we are not free. We will self-censor, we will hesitate to discuss certain topics, we will curtail our own freedom.

Relatedly, when institutions prefer a world of dark and light green leaves on trees, we as individuals are systemically encouraged to see the world through that lens. Institutions frame how we understand the world around us, and they frame it in ways that are beneficial to the institutions, not to the citizenry.

A lot of institutions are collecting data about your online activities. The governments of the US and UK collect pretty much every single packet that gets transmitted on the internet.  That has terrifying implications. If everything is being collected and stored, then institutions have the capability to retroactively construct narratives of individuals based on metadata. The NSA’s XKEYSCORE program does just that.

So what can be done about it? It is exceptionally difficult to remain completely anonymous and private while online (not impossible though), but it’s not particularly hard to make it more difficult for people and institutions to use your data and your labour for their own gain.

Here are a few suggestions; I’ll get into more in a future post:

1) Constantly remind yourself that free services aren’t free. It is important to remember that data is always being collected about you in order to try to understand you. What is being done with this data is usually unknowable, so you should always ask yourself how necessary this service is to you before you volunteer to allow an institution to collect your data.

2) Disengage. Obviously the best thing to do is to stop using these services, especially the ones that don’t bring you much value. But for a lot of us, that’s not possible or desirable.  Even if you do disengage though, the metadata collection doesn’t end.  Facebook, for example, has ‘shadow profiles’ of all its users that contain data that others report about you. More frighteningly, they have ‘shadow profiles’ of people who do not use Facebook. So it doesn’t matter if you use Facebook or not, they’re collecting data on you. A better solution may be active engagement with an eye to disruption.

3) Participate selectively. In short, don’t participate in ways that are useful to those who are collecting your data. Much of our lives as academics are already on the internet – our affiliations, research interests, etc. So this information is not particularly valuable to an institution like They extract value by understanding your behaviour, so don’t behave.  While it may be valuable for you to have a paper posted on for exposure, it only takes a second to find someone’s e-mail address, or to find their publication elsewhere.  If you limit your interactions through the service, you help limit the data that is being collected about you.

4) Disrupt. Others’ collection of metadata cannot really be avoided, but you can do a lot to disrupt it and make it more difficult for others to know much about you.  For fun, try clicking on random things on your Facebook feed – the targeted ads will change as Facebook tries to incorporate this new behaviour with what they already ‘know’ about you.  Do this enough, and you introduce enough noise into their data that the real you fades into the background.

Of course there are a lot of better ways to disrupt and make it more difficult for others to collect data about you. Use a VPN. Block cookies. Use HTTPS. Use an encrypted text messenger. Use a password manager. If you have serious needs for anonymity, use Tor. Oh, and for the love of god, use PGP.

A lot of this seems like it’s overkill, but most of it is extremely simple to integrate, and operates in the background.  I’ll write more about these things in future posts.

Disruption is also kind of fun. It allows you to continue to use the services but you also get the satisfaction of knowing that you’re corrupting their clean data on you. I like to imagine that one day, just maybe, some poor data analyst will come across my file and won’t be able to find anything. Then they’ll be sad, and frustrated, and won’t know what to do – it will be difficult for the State to peer into my life.

And that’s exactly how it should be.