The LU Human Scanners Controversy: is there a better way to think about IT and control?
Two years after the first scanners were installed at Leiden University to monitor how its buildings and classrooms are used, what have we learnt about the implications of collecting data in public spaces, and what can we (and the University) do about it?
In August 2020, as the ongoing pandemic was making a return to the physical workplace unlikely anytime soon, the university installed cameras, or scanners – the two different terms reflecting diverging opinions about their potential (mis)use – at building entrances to count the number of people entering. About a year later, scanners (371 in total) also appeared in classrooms. Previously – and since September 2022 once again – the task of periodically counting the number of people in a room had been performed by human "counters". As the pandemic reached new heights, however, this practice became impracticable.
The costly automated alternative the university opted for was not welcomed by students and (some) staff. Articles in Mare highlighted the potential privacy violation – since these "smart" cameras could record a person's age, gender, height, and mood – and asked why the university community had not been consulted in advance; a protest at the Lipsius building on December 7 drew national attention (Volkskrant, 7/12/2021). In response, the university shut down the scanners (Trouw, 9/12/2021), organized a symposium to discuss the controversy, and commissioned a test to confirm the system's resilience to outside hackers (Leidsch Dagblad, 6/4/2022).
Six months on, it is worth contemplating what we can learn from this experience (and similar ones around the globe) about the potential dangers of turning to technological solutions for everyday problems without carefully considering the risks they may pose to the future well-being of our societies. It turns out we all have a stake in this process: from those who collect the data, whose accountability is at stake, to those whose data are collected, for whom the issue is one of consent; further, both sides share a stake in trust-building and self-determination, which can be curtailed when technology takes over. Finally, we all have a stake in language, which, through its performativity, can help shape new realities by providing a different set of terms to think and talk about data and how they can be used.
Stake #1: Accountability
The first thing undermined by the university's short-sighted decision to install the scanners was its own accountability. By outsourcing the space-monitoring function previously performed by university employees to a commercial entity outside the university – the company that manufactures the scanners and is ultimately responsible for their functioning – the university relinquished at least some control over what happens to the data collected. In other words, by bringing in a third party, the university has potentially less control over something it remains accountable for. Much as it may be bound by contractual agreements, the company can now decide to use, record, or withhold data for its own purposes. This is not unheard of: in 2020, a US company whose smart streetlights with video and audio recording capabilities had been used by the San Diego Police Department denied the city mayor's request to shut down the cameras because it had not been paid.
In the university's defence, one could counter that the scanners only collect output data – data recorded after the fact – rather than using data predictively to make decisions, which is what gives rise to algorithmic bias and algorithmic discrimination. The dangers of algorithmic discrimination became clear in the 2020 "A-level" scandal in the UK when, unable to hold actual exams because of Covid, the government decided to use an algorithm to estimate students' A-level performance instead, thereby determining which university they could attend. The algorithm adjusted grades on the basis of, among other things, each school's historical results and class size, ultimately privileging students at fee-paying schools over those at state schools. Almost 40% of students received grades lower than their teachers had predicted, sparking public outcry and legal action that led the UK government to retract the algorithm-generated grades. This raises a pertinent question for us: can the data collected by the classroom scanners be used predictively, now or in the future? As we will see later, because what counts as data is not fixed and can change at any time, this question is less easy to answer than it might at first seem.
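To make the A-level example concrete before moving on: the sketch below (in Python) shows, in deliberately simplified form, how a standardisation step that weights a school's past results can override individual achievement. The weights, grade scale, and figures are invented for illustration only; they do not reproduce the actual model used in the UK.

```python
# Toy illustration (not the actual A-level model): a "standardisation" step
# that blends a student's teacher-assessed grade with their school's
# historical average. The weight given to school history drives the unfairness.

def standardise(assessed_grade: float, school_history_avg: float,
                history_weight: float = 0.6) -> float:
    """Return a final grade pulled towards the school's past performance."""
    return (1 - history_weight) * assessed_grade + history_weight * school_history_avg

# Two students with identical teacher-assessed grades (A* = 6, A = 5, ..., E = 1)
fee_paying_student = standardise(assessed_grade=5.0, school_history_avg=5.5)
state_school_student = standardise(assessed_grade=5.0, school_history_avg=3.5)

print(fee_paying_student)    # 5.3 -> the A is preserved
print(state_school_student)  # 4.1 -> the A is pulled down towards a B
```

Identical individual achievement leads to different outcomes purely because of where each student happened to study – exactly the denial of individuality discussed under Stake #4 below.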
Stake #2: Consent
The idea behind providing consent is to give people the option of having their data collected or not: by informing them about the risks and benefits – to them and to society at large – of sharing what is ultimately their own, people can decide for themselves whether they would like to do this; should they choose not to, they avoid the risks but also forego the benefits that could come from sharing their data. This process is standard in research but also in some contexts outside academia, such as health and legal settings.
If information about risks and benefits is not provided before the data are collected, a person's right to give consent is essentially rendered meaningless. The same can happen when data are collected in a public space or at a person's workplace, where consent is often taken for granted simply because people are present – even though they have no choice but to be there. Note, however, that even CCTV and speed cameras in public spaces are signposted; so although in some cases people are not able to opt out, they are at least informed that their data are being collected. This did not happen when the classroom scanners were first installed.
However, there is another aspect which complicates the issue of consent even further. Output data, such as those collected by the classroom scanners, can be used predictively not at the time they are originally collected but later, and for purposes other than those for which they were collected. As Marianne Maeckelbergh has noted, in academic settings this is sometimes dealt with by viewing consent as a continuous process, something that has to be secured again and again as part of the trust-building between researcher and data provider. That is not always practicable, however. And while European legislation such as the General Data Protection Regulation (GDPR) states that data cannot be used for purposes for which consent has not been explicitly sought, ongoing legal action over cases such as that of the UK's National Health Service (NHS) – where an NHS trust granted researchers at Google-owned DeepMind access to 1.6 million British patient records in order to develop a life-saving kidney app, years after the data had been collected – reminds us how fragile the boundaries between private and public can be. Cases such as this are an apt reminder that collecting personal information about people is always a balancing act that involves weighing risks against benefits. If the risks outweigh the benefits, other methods of data collection should be envisaged. To come back to our case, what alternatives to cameras were considered for collecting data about classroom use? Responsible decision-making should consider several alternatives before settling on one.
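Whichever method is chosen, purpose limitation and iterative consent can be made concrete at the level of the data themselves. The sketch below is purely illustrative – the record fields, purposes, and dates are invented and do not describe any system the university actually uses – but it shows the basic idea: every use of the data is checked against the purpose and time window for which consent was given, so a new purpose (or an expired window) requires consent to be sought again.

```python
# Illustrative sketch only: a consent record that ties data use to an explicit
# purpose and an expiry date, so that any new purpose, or any use after the
# consent window closes, requires consent to be renewed.
from dataclasses import dataclass
from datetime import date

@dataclass
class ConsentRecord:
    subject: str
    purposes: set[str]   # purposes explicitly consented to
    expires: date        # consent must be renewed after this date

def use_allowed(record: ConsentRecord, purpose: str, today: date) -> bool:
    """A use is allowed only for a consented purpose and within the consent window."""
    return purpose in record.purposes and today <= record.expires

consent = ConsentRecord("room-usage counts", {"estimate room occupancy"}, date(2023, 8, 31))

print(use_allowed(consent, "estimate room occupancy", date(2022, 10, 1)))        # True
print(use_allowed(consent, "predict individual attendance", date(2022, 10, 1)))  # False: new purpose, new consent
```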
Stake #3: Trust-building
Trust is a relationship between two parties, the one who trusts and the one who is trusted, and it is the latter's competence, sincerity, and benevolence that are crucial to building that trust. During the scanner controversy, it is especially the university's benevolence – the extent to which it appears to care about the well-being of its members and to hold their interests dear – that has suffered. This is because, by not providing the opportunity for consent and not discussing the potential risks to those involved, the decision to install the scanners came across as a unilateral one, aimed at safeguarding the interests of only one side in the trust relationship while ignoring the concerns of the other.
When trust has been damaged, post-facto reporting and explanation of how data are used, together with commitments to keep providing this information, can help to slowly rebuild it. In the San Diego smart-streetlight case cited earlier, the officials involved were later required to submit an annual surveillance report about “how the tech is used, along with its costs and funding sources […as well as to...] show what type and amount of data is collected, who has access, and the race of those impacted by the surveillance program.”
A related notion is “explainability”, that is, the possibility of explaining retroactively why and how an algorithm was used to make a specific decision. This is the main approach taken in the GDPR, which stipulates that for any automated decision with “legal effects or similarly significant effects” for an individual – in short, any decision that is consequential for that individual's life – they should be able to seek an explanation from a human who can review the decision with them and explain its logic. The level of risk assessment should be proportionate to the significance of the decision-making in question, which in turn depends on the gravity of the consequences, the number of people and volume of data potentially affected, and the novelty and complexity of the algorithmic processing. This “human-in-the-loop” component provides a check on anomalous or unfair outcomes and a possibility to correct them. For the purpose of trust-building, it is important that there be a channel through which the individual affected by an automated decision can seek an explanation. This in turn highlights the role of algorithm operators and developers who, in the words of Cameron Kerry, should always be asking themselves: “will we leave some groups of people worse off as a result of the algorithm’s design or its unintended consequences?” To what extent was the potential adverse effect of the scanners on specific subgroups considered when the decision to install them was made?
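What such an explanation channel might look like in practice can be sketched, purely hypothetically, in code. The record fields, names, and thresholds below are invented for illustration and are not drawn from the GDPR text or from any Leiden system; the point is simply that every automated decision is logged with its inputs and the rule applied, so that an affected person can ask a human to retrieve, explain, and if necessary overturn it.

```python
# Hypothetical sketch of a "human-in-the-loop" explanation channel: each
# automated decision is logged with its inputs and a human-readable rule,
# and affected individuals can request a review from a human.
from dataclasses import dataclass, field

@dataclass
class DecisionRecord:
    decision_id: str
    outcome: str                 # what the system decided
    inputs: dict                 # the data the decision was based on
    logic: str                   # human-readable description of the rule applied
    review_requests: list = field(default_factory=list)

log: dict[str, DecisionRecord] = {}

def record_decision(rec: DecisionRecord) -> None:
    log[rec.decision_id] = rec

def request_review(decision_id: str, requester: str) -> DecisionRecord:
    """An affected individual asks a human to review and explain the decision."""
    rec = log[decision_id]
    rec.review_requests.append(requester)
    return rec   # handed to a human reviewer, who can explain or overturn it

record_decision(DecisionRecord(
    decision_id="2022-10-01/room-1.12",
    outcome="room flagged as under-used",
    inputs={"average_count": 4, "capacity": 40},
    logic="flag rooms whose average occupancy is below 25% of capacity",
))
print(request_review("2022-10-01/room-1.12", requester="course coordinator"))
```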
Stake #4: Individual identity
As Stefano Bellucci has already argued in this forum, treating data about persons independently from the contexts in which these data were produced and predicting behaviour based on majority patterns denies each person’s individuality. This works in two directions: i) data from individuals are first collected and stripped of context to make generalizations, ii) these generalizations are then used to make decisions about (other) individuals. In the process, two sets of individuals are denied their uniqueness.
Think of what happens to your own individuality when you are turned into an impersonal number and decisions about others are made on the basis of your behaviour – behaviour that was produced at a specific time and in a specific place to begin with. Can you be reduced to that single instance, that single data point, and how fair is it to generalize from it to others who share your demographic characteristics? In the UK A-level grading scandal, using an algorithm that put too much weight on schools’ past performance meant that students – especially those from disadvantaged backgrounds – were denied the chance to be seen as individuals.
Producing large, machine-readable datasets requires (human programmer) decisions about which aspects of reality to record and how. As Garfield Benjamin puts it, this process of data collection “defines, reduces and restricts personhood, denying certain groups […] the agency to exist as a person with lived experience embedded within relational contexts.” We are talking here about recording behaviours, but behaviours are always produced in response to particular contexts; stripped of those contexts, the same behaviours make less sense, or a different sense. As researchers, we sometimes think we can fix this by using what are called "menu-driven" identities: give people more options, or even empty text boxes, and they will be able to find something that represents them. Menu-driven identities, however, create a different problem, because they fix one’s identity at a certain point in time: the moment the data was collected. What they lack is the ability to represent individual identities in their full complexity – multifaceted and in constant flux. Indeed, how could it be otherwise, if identities are claimed in relation to those who happen to be around us at the time? Standard processes of data collection, however, produce fixed data about a person, which persist in representation and decision-making, restricting that person's future choices. A case in point is April Ashley (1935-2021), Britain’s first transgender activist, who campaigned all her life to have her formal identity changed from that of a man to that of a woman. She was 70 when this was finally confirmed on her birth certificate, after the UK Gender Recognition Act came into force in 2005.
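To see that this is more than a philosophical worry, consider a minimal, invented sketch of how a menu-driven identity ends up being stored: the available options are fixed in advance by the form's designer, the chosen value is frozen at the moment of collection, and it is that snapshot – not the person as they are now – on which later representation and decisions rest. The field names and options are illustrative only.

```python
# Illustration of a "menu-driven" identity: whatever options the form offered,
# and whatever the person selected on that day, is what persists and gets reused.
from dataclasses import dataclass
from datetime import date
from typing import Literal

@dataclass(frozen=True)   # frozen: the record cannot change after collection
class ProfileRecord:
    person_id: str
    gender: Literal["male", "female", "other"]   # only the options the designer foresaw
    collected_on: date

record = ProfileRecord("p-0142", "male", date(2005, 3, 1))

# Years later, decisions are still made on the 2005 snapshot,
# regardless of how the person identifies today.
print(record)
```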
Stake #5: Language
According to Yanni Loukissas, “the problem starts with our language: the widely used term data (pl. for datum=a given) implies something discrete, complete and readily portable. But this is not the case”. In the literature, three main narratives about data can be identified. The first is that of data as a resource. This narrative is present when we speak of data as a flood, a lake, or a pool, as oil (something with economic potential), or as property (something that can be traded). Such discourses promote an understanding of data as simply out there to be used, in the same way that environmental resources such as oil, water, wood, or livestock were exploited during colonial times and continue to be exploited today. Yet data is not treated with anything like the stewardship accorded to (some) natural resources: the fact that data can be copied endlessly removes the scarcity that makes oil or gold precious. Nor is all data of the same value: personal data is worth more than other types, depending on whom it is about and what it can be used for.
A second prevailing narrative is that of data as a discovery. This narrative implies that data simply exist in the world ready to be collected and used. However, as Loukissas emphasizes, data rely on “insights gleaned from their keepers, who use their own local knowledge to explain the contingencies of the data, which are not apparent otherwise”. A kind of Bourdieuan habitus is necessary to constitute the data and understand it in all of its richness. This labour is not merely one of collection but a kind of production.
The third narrative is that of data as an assumption. Here we need to think of "datafication", the process that turns aspects of reality into data that can be collected, traded, and so on. Datafication reflects normative assumptions about gender, race, disability, and more, and these assumptions are built into the design of databases explicitly scripted to operate in machine-readable contexts. Moreover, participation in many such datafied relations is not voluntary: users are often compelled to engage, for if they do not choose one of the available options, they may remain off the grid, be denied benefits or access, and so on. The asymmetrical decision-making in these data-collection processes perpetuates assumptions about when data should be used and for what, whose interests it should serve, and ultimately what does and does not count as data worth having in today's societies.
Toward socially equitable data
So how do we move beyond this impasse, toward a fairer way of using our technological capabilities to collect data for the common good? According to Loukissas, “we must rethink our habits around public data by learning to analyse data settings rather than data sets”. Benjamin, for his part, proposes replacing “data collection” with “data compilation”, an established term in computing, in order to emphasise the reductive process of generating data from the world and converting it into a machine-readable format, while at the same time integrating economic, editorial, and curatorial concerns (as in ‘compiling an anthology’). Rather than obscuring the role of those who write the code or collect the data, Benjamin wants us to make it visible: “we must assert”, he writes, “within the use of any term the non-neutrality of the process”. Finally, Sabina Leonelli proposes that we think of data not as a commodity but as a common good (panel on Future Data Space, 19/1/2022). This means that, like health or free time, data is something that (i) everyone has a right to, (ii) public money must be spent on, and (iii) must be equally accessible and equally benefit everyone.
In lieu of a summary, below I offer five points that can move us in this direction:
1. The terms we use to talk about data reflect and shape our understanding of the relevant rights and processes.
2. Data collection involves the exercise of power by the data collector over those from whom data is collected.
3. Because data is used to make decisions about groups, groups (not individuals) must be centrally involved in deciding what is collected, from whom, and how.
4. Data is relational: it is embedded in specific spatiotemporal contexts in relation to which it also acquires its meaning. Because of this, datafication is an interpretative process that bears the mark of the human agents performing it.
5. The process of datafication means that consent must be iterative, and that protections on the onward use of data, as well as time-based protections, are necessary.
(Big) data isn't necessarily bad. By shifting towards mutual need (for data), mutual ownership (of data), and mutual decision-making (about data), we can constitute new contexts for compiling and using data for the common good.