Data Harvesting Problem - Computerphile
Key Takeaways
Dr Richard Mortier of The University of Cambridge discusses the problems of controlling our own data while still allowing it to be mined, covering his research on human-data interaction and projects like Databox.
Full Transcript
We're all generating lots of data in the things that we do. It might be what you think of explicitly as data, the kind you generate when you put something on Facebook, or you tweet something, or you send an email, or whatever. But even activities like driving or shopping now generate quite a lot of data around them. There's the notion of a digital footprint, for example, where that's captured.

We thought a bit about how the world's developing this way, and so we came up with this notion of something we call human-data interaction. There's a field in computer science called human-computer interaction, which people have been studying for a good 50 years now. What's happening now is that it's less about interacting with a machine specifically than it is about interacting with data: data generated about you, data generated by you. And this data is used in the environment by computational devices to do things on your behalf or do things to you, to provide you with credit scoring or things like this.

The model we drew for this was the idea that you've got people, and they generate data, and that data feeds into analytics algorithms, and those analytics algorithms generate actions. Something happens in the world as a result: you get a particular credit score, or you get your mortgage or you don't get your mortgage, or whatever it might be; your insurance premium changes, or something. And those actions change your behavior, perhaps, or might involve more data, inferences from the data, being put back into the data set, which are then used subsequently to do more analytics, and so on. So you've got the opportunity for feedback loops in this: the behavior you have generates things that change your behavior, and so on.

The belief we have about this is that the way this is evolving, the way this is coming about, is missing various features that are necessary. There were three in particular that we defined: legibility,
agency, and what we call negotiability.

The notion with these was that legibility is the ability to see and understand what's going on. The observation is that, for a lot of people, it's very difficult to understand all the different sources of data that have been collected about you, and it's difficult, if you do understand them, to see what that data means and what the implications of somebody having collected that data might be.

Agency, then, is the capacity to act. If you do see what's going on, and you do have some understanding of what's going on, very often you can't actually do very much about it. You can't go in and correct incorrect data that's been collected about you and say, "No, that's a mistake, it's not what I think." Or somebody's drawn an inference about you because they've looked at your spending pattern, and they've looked at your demographic, and they've looked at your tweets, and they've decided that you are definitely a Tory voter, and you want to go and say, "No, no, actually I'm not. I'm an independent," or "I vote Liberal," or whatever it might be. People are doing that sort of analysis to form those kinds of inferences, but quite often it's difficult for you to go in and change them and correct them if the inference is incorrect. And a lot of these inferencing algorithms, while they're quite good and they might be getting 80 to 85% accuracy, still leave 15 to 20% wrong data coming out of them. So that's agency: the capacity to act.

Negotiability, then, is the idea that these systems are dynamic. A lot of the time, when you sign up to services, for example, you check the box, you don't read the terms and conditions, you just say, "Yeah, yeah, I want to do it now, give me my Gmail account," or "let me have Facebook," or something. And you don't have the ability to go back, once you've understood a bit more about what's happening, and change that. You tend to have either said yes or said no, and very often it's you've
said yes and that's it for all time. Increasingly now, some companies are getting better at giving you a chance to pull back out again, but it tends to be this complete withdrawal: I can take all my Google data out of Google and I can leave. But that's still quite binary: you either get it or you don't. You don't get so much choice over what you can pick and choose and how you deal with that.

What I always think of with this is when you go on Amazon: at some point you've bought a niece a birthday present, and then all the recommendations you get are based around that. At least on Amazon, I suppose, you've got the chance to say "don't use this". But that's a very simplistic part of the data, isn't it?

In the Amazon example, I don't know if Amazon do this, but it may well be the case that some companies on the web, for example, will share that data with other companies. So even if you do go and correct it at source, that correction doesn't get followed through to all the other companies that picked up that data set before you went and corrected it. So you have bad data about you spreading around the place.

So we felt that a lot of the way these systems are currently constructed doesn't pay attention to some of these features. There may be other things that are missing, and people may disagree about how serious some of these things are, but we felt that these things were certainly missing and that it was a problem. So we were thinking about what we could do in terms of building technology to try to create a platform on which these things could be addressed.

The observation then was that a lot of the way these systems work is that you take data from people and you feed the data into some organization. That's supposed to be a factory: some organization's machines, right? So data goes up into the cloud. Art was not my subject, as you can tell. So data gets fed into the cloud somewhere, gets taken away, gets
computed upon, inferences are drawn, and so the whole thing carries on.

Arguably the problem with this is what happens here. Once you've taken the data away from the person, you lose quite a lot about it. You lose lots of metadata and context about it: you don't know where the data came from, you don't know that the purchase was for your niece anymore, you just know that the purchase was made.

You've also created quite a technical problem, which is that you've now got millions and millions of users, and you're trying to get data from all of them and keep it, store it, manage it and manipulate it. That could be quite expensive to do, and could be quite difficult to do, depending on the size of the company involved. Companies like Google and Facebook and Microsoft have got the expertise to manage that, and they do that really well. But for companies that are a bit smaller than that, or that don't have that sort of technical background, it's potentially a difficult thing to deal with.

Potentially you've also created a honeypot effect here. It's now the case that if you attack that company, as an attacker, and you get hold of the data, you might get millions of people's data at the same time. So there's quite a value in being able to get hold of that, and that makes it an attractive attack target.

The observation is that this all starts from the idea that you're taking data away from people and taking it somewhere else to do something with it. So what we proposed in this project called Databox was: wouldn't it be good if we could provide some technical means that would allow people to keep their data, keep control over that data, and manage access to that data? We still try to provide all the same facilities in terms of the analytics that might be carried out, but we try to do so in a way that means that the people, the data
subjects, to use the technical terminology, are going to be more able to see what's happening, more able to control what's happening, and more able to make decisions about what should and shouldn't happen on a case-by-case basis, rather than having this sort of blanket "yes, I accept all the terms, this is fine".

There are other benefits as well. You have legal benefits, for example. There's recent legislation from the European Union that's coming in, I think it's due to come in before Brexit happens, however that's going to happen, and that causes, for example, data protection to be even more stringent than it already is, and the fines in place to be even more heavyweight than they already are. I'm not a legal expert, but as I understand it there was legislation in the States from Obama, a consumer privacy act, I think, although I believe that Trump either has or wishes to kick that out again. I don't know where that's going to end up. I think similar views are held in Japan as well; there's also legislation in Japan. So this is not just a European thing. There is certainly concern about this, or growing awareness of this, in a number of places in the world.

As part of that legislation there's a lot in there about trying to design for privacy: trying not to collect data you don't need to collect, and trying to make sure that the data subject is aware of the purpose for which the collection is happening, the purpose for which the processing is happening, and what's going to come out of it. It's supposed to make it all a bit more transparent, which is, I think, in the end a good thing, because it would be good if people could effectively trust what was going on and understand what was happening.

So the idea was: what can we do technically to try and assist people with that? So the
notion with the Databox is the idea that we should explore the possibility that, instead of the data flowing out and then being taken away and stuff happening with it, you keep the data locally. Then if the company (my fake factory again, not that we have factories anymore) wishes to process this data, what it might do, for example, is send some piece of computation, some app, and distribute that app to different people's Databoxes. I can't even draw a stickman. That app could then compute across that data, work out some answer, and then that single answer gets sent back. So we don't need to know your shopping history; we just need to know whether you are a big spender, a medium spender or a [unclear]. Maybe that's all we care about, right?

And so I don't need to hold all the information there. Sorry, I need to look at all the information, but I don't need to hold all the information. In fact, I may not want a copy of all that information, because that now places certain obligations on me, as somebody holding data about you: I'm obligated to do things for you now, provide you with information about what I'm doing with it, store it appropriately, keep it safe and secure, all this kind of stuff. So it may in fact be better for me not to have to do that, and not to take on that risk and that responsibility.

I'm looking at your diagram there and I'm thinking this looks a little bit like what we do with mobile phones anyway at the minute, doesn't it? We put our details into a phone and then say yes or no to apps doing things with that data. That sort of permissions model is kind of similar?

Yes. I mean, there are a lot of problems with that permissions model as well, such as: you might not understand what the permissions really mean, or, again, you don't have a chance to choose about the permissions. It's like the app demands these permissions and you either get them or you don't, you
grant it or you don't. Although there are actually things you can do with some of the phones. For example, on Android phones, I can't remember the name of it, but there's a library you can install if you root your phone, and then you can have an app that provides you with control over the permissions being granted on the phone, so you can get quite detailed about this.

So it presumably sits in between the app and the actual data?

Yeah. I mean, built into standard Android now there's the ability to control the permissions being granted, so you can take away permissions from apps. They may not work if you do that, but you can take them away, and so you've got this quite granular control over it. The idea here is sort of similar, in the sense that we'd like to enable people to exercise that sort of level of control.

There are a few things to say about this, though. One is that I'm not saying that everybody should have to exercise that level of control all the time for all the possible uses of all their data, because that would be completely infeasible and such a pain. But you should have the ability to do that if you want. And it may well be that we can build tools that use that ability to try and represent what you want to do, or give you default behaviors that make sense, or at least alert you when things have gone wrong because of releases that have happened. But there's a need for that kind of infrastructure to be made available, I think.

Also, it's not just about stopping things happening. It's not just the case that I want to stop data going into the cloud. It might be beneficial to me that, if you try and analyze my shopping habits, you don't do it from the point of view of a single shop. One of the examples I had for this was when I used to live in Cambridge: I'd shop at Tesco's and at Sainsbury's. I'd shop at Tesco's to buy the cleaning
products and the fruit and veg, because they were quite cheap. I didn't think very much of the quality of the meat and fish counter at the particular Tesco I went to, and so I tended to get that stuff from Sainsbury's, where I thought it was better. But if you analyze those data sets separately, then Sainsbury's might view me as a filthy carnivore, while Tesco views me as a fastidiously clean vegan, and neither of these things is true, right? I'm neither one nor the other; I'm a mixture of those things.

So the idea here would be that you can actually get benefits from this from both the data subject and the data processor point of view, because you possibly get access to a more complete picture of the individual in question. You can access more of their data because you're accessing it for a specific purpose, and you're not just trying to collect as much of it as you can. In all these cases it's to give you more say in the matter, so that you're not just the product, right? You are actually an individual in this and you have some say in what's being done.

Again, it's not as simple as saying that you just get everything and it's all yours. Another challenge in this is that the way I've presented this, the way we've talked about it, and the way a lot of people think about it, is the idea that you've got personal data and you own your personal data, and it's yours and it's about you. Actually, when you stop and think about it, an awful lot of personal data involves at least one other person, right? If you look at, I don't know, Internet of Things data, sensing in the home, smart homes and things like that, everybody in the household is in some sense represented in that. It's not just you, unless you really do live alone and never have any visitors, which maybe some people do, but not in most cases. Even email: an email's usually got a sender
and a receiver, right? So there's at least two people that might have a claim on the content of that email. So the notion of ownership is a bit tricky, and the notion of personal data being solely owned by a single individual is a bit tricky. Then you end up playing games like: how do you deal with that when you might have two or three people who want to say yes to something, and another person who's represented in the data wants to say no to it? What happens there?

So yeah, there are challenges in this, but it still seems like this kind of construction, where at least we're letting the data subjects have some say in these matters and have some ability to act...

I'm just thinking about that email thing. I mean, usually for me it's the same person sending and receiving: it's me sending myself an email to remind myself to do something that I'm going to forget.
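
The Databox idea described in the transcript (keep the data locally, send a small piece of computation to it, and send back only a single answer) can be illustrated with a minimal Python sketch. This is not the Databox project's actual design or API: the names `Purchase`, `DataBox`, `spender_category` and the app name `spend-profiler`, the spending thresholds, and the "small" label are all hypothetical, chosen only to make the data flow concrete.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Set


@dataclass(frozen=True)
class Purchase:
    shop: str
    item: str
    amount: float  # pounds


@dataclass
class DataBox:
    """Hypothetical local store: raw purchases stay with the data subject."""
    purchases: List[Purchase] = field(default_factory=list)
    approved_apps: Set[str] = field(default_factory=set)

    def approve(self, app_name: str) -> None:
        # The data subject grants access case by case.
        self.approved_apps.add(app_name)

    def revoke(self, app_name: str) -> None:
        # Consent can be withdrawn later (negotiability).
        self.approved_apps.discard(app_name)

    def run(self, app_name: str, app: Callable[[List[Purchase]], str]) -> str:
        # The computation comes to the data; only its answer is returned.
        if app_name not in self.approved_apps:
            raise PermissionError(f"{app_name} is not approved by the data subject")
        return app(list(self.purchases))


def spender_category(purchases: List[Purchase]) -> str:
    """Hypothetical app shipped by a company; thresholds are made up."""
    total = sum(p.amount for p in purchases)
    if total >= 500:
        return "big"
    if total >= 100:
        return "medium"
    return "small"


box = DataBox([
    Purchase("Tesco", "cleaning products", 40.0),
    Purchase("Sainsbury's", "fish", 75.0),
])
box.approve("spend-profiler")
answer = box.run("spend-profiler", spender_category)
print(answer)
```

The company receives only the category string, not the shopping history, and the app sees purchases from both shops rather than one shop's partial view. The sketch does not stop an app from smuggling raw data out in its return value; a real system would need to constrain or audit what an app is allowed to send back.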
Original Description
How do we control our own data while allowing it to be mined? Dr Richard Mortier of The University of Cambridge discusses some of the issues behind data harvesting.
Thanks to the University of Cambridge Computer Laboratory.
EXTRA BITS: https://youtu.be/6VLzTV-5Orc
The MegaProcessor: https://youtu.be/lNa9bQRPMB8
More info on Dr Mortier's research:
http://hdiresearch.org/
https://databoxproject.uk/
http://www.facebook.com/computerphile
https://twitter.com/computer_phile
This video was filmed and edited by Sean Riley.
Computer Science at the University of Nottingham: http://bit.ly/nottscomputer
Computerphile is a sister project to Brady Haran's Numberphile. More at http://www.bradyharan.com