"Anonymous" Location Data Problems - Computerphile

Computerphile · Advanced ·📄 Research Papers Explained ·5y ago

Skills: Reading ML Papers90%

Key Takeaways

Andrea Gadotti discusses the problems with anonymous location data, referencing research papers such as Unique in the Crowd and Where You Are Is Who You Are, to demonstrate how easily individuals can be identified.

Full Transcript

i think that everybody has been asked at least once to share their data with a company under the promise that the data is anonymous so the problem is that the data that is collected is very rarely truly anonymous and this is what i'm gonna explain today first of all the intuition of why this is true is that data that is supposedly anonymous can often be re-identified and a very good example of this was given by an article of the new york times which was published i think about one year ago where the journalists could get their hands on a big data set a big location data set about 12 billion and 12 million americans uh so the way this uh data set looks like is basically that uh there are different users so like let's say bob alice carlos and so on and for each of these users there is a series of location and data points so for example for bob we have t1 and l1 so time one and location one and so on teach you and so on and the same for alice and carlos and all the users and of course like we can represent this on a map as a trajectory so we have the first point for bob which would be t1 l1 and i don't i don't know like maybe this could be that you know bob was at madison square garden like you know yesterday at 8pm something like that then bob goes to t2 l2 and we have a record for that and so on and we have of course all of this for all the users in the data set now actually the data set that the new york times journalists could get access to did not have the names in it it was what we say is pseudonymous we can consider this like kind of anonymous but however uh the uh the journalists nevertheless could reidentify some people in this data set and follow their movements and actually some of the people who were reidentified were some collaborators of the u.s president so now the question is uh how did they do it how was it possible for them to identify this data set even though the names were not in it and the intuition of how you could do this is to use some background information about the person that you are trying to identify and use it to match it against the records and the trajectories in the data set so this is precisely the idea behind a study that was published in 2013 by some researchers which studied this fact from a more formal point of view so what the researchers did was to define a metric unicity which we call epsilon p which is defined as the fraction of users in the data set that are unique given p points so at this point we're using this metric they took a data set a location data set very similar to the one of the the new york times so a location data set about uh one and a half million people uh observed over a period of i think 15 months or something like that and they computed the unicity values on this data set and what they found was that epsilon 4 was 95 this means that 95 of the times only four points are necessary to uniquely identify an individual in the data set so in other words if you know you know like where i live where i work uh where i had dinner yesterday and uh maybe where i go to train in the in the weekend then most of the times these are enough to uniquely identify my trajectory in the data set and once you can find my trajectory you can see all the other places that i've been visiting for the past 15 months this study uses a pretty simple attack setting right so basically the adversary is assumed to have access to four points about the the target the victim and uses these four points directly to match them against the data in the data set so okay this is a this is a a potential a potential setting and but you could think of um more complex attack settings for example you could you could assume that the the background information that is available to the adversary uh is actually referred to a period which is different previous to the one of the data set that you are trying to reidentify we can suppose that we have two data sets one from 2018 this data set also contains the names so you know that this one is bob this one is alice the third one is carlos and so on so you have the location trajectories plus the names then you have another data set from 2020 which has again you know all the uh trajectories of all the users and so on but in this case you do not have the names you just have some pseudonyms so for example this is user one user two user three and so on now for simplicity we can also assume that the two data sets contain the same users so we can write this but we do not know which users correspond to which ones right so it might be for example that bob is u3 and alice is u1 and carlos is u2 so the attacker doesn't know this but the the aim of the attacker is precisely to find out the correct maxim between the users in the 2018 data set and the users in the 2020 data set in this case we cannot uh use the the method that we we used before right because if we try to match the points in in the data set on the left with the points in the data set of the right we find no matchings because the time periods that uh this refers to are exactly there is no overlap exactly so we need to find something more more clever and one possible way to approach to approach this problem is to proceed with two steps the first step is to find a similarity score between all the users and the second step is to use all these this scores that we have computed in order to find a good global matching between all the users on the left and all the users on the right so we have here matching so i will go now a little bit more in detail for these two steps i will assume for simplicity that there are only three users in each data set of course this is a an oversimplified setting right but it's just to to make things clearer the first step is to find similarity scores between users what you do would be to take alice's record and you once record and you compute the similarity score between these two and you maybe you find you know like that this is a 26 percent then you do the same between alice and u2 and you find that this is i don't know 35 then ali send you three and you find 87 so you know like in this case it looks quite likely that uh alice and u3 actually are the same the same person but it's not always so easy you know like because the problem is that now you have to do the same also for bob and u1 two and three and carlos and you want u2 and your three and sometimes it's quite clear which one is the right match but sometimes it's not that clear so now the main question for this step is how do we compute the similarity between two trajectories so there are a lot of ways to do this one simple one for example here for alice we compute the set of preferred locations so the locations that alice visits the most often and we do the same for u1 and then we compare them and the idea is that uh if the two sets of preferred location kind of overlap then probably this is a good match and of course like this might not always work right i mean maybe alice changed the workplace or maybe that italian restaurant that she used to love had to close down because of brexit whatever but the idea is that on average if the overlapping is good enough then probably the the matches is a correct okay so and so this this is the first step the second step is the matching so now we have all the scores between the users but we need to find what is the correct matching so here the idea is to select the matching that maximizes the overall similarity between users what you can do is to try all possible matchings and compute the similarity scores for that so for example we can start from this very simple matching in the same order we compute the similarity score for this one and we proceed with the next matching so maybe we can do this one we do this this and this and we compute the similarity matrix so this would be like similarity one similarity score two of course this is the similarity again for all the uh like the sum of all the similarities score for for all the the users and we proceed like this for all the matchings now the problem is that uh this strategy works uh in theory but it doesn't work in practice if the data set is uh too big and actually it doesn't have to be all that big because you know for those of you who love combinatorics you can see that the number of all possible matchings that you have to try is n factorial where n is the number of users in the data set if n is just three users then three factorial is just six so this is totally doable but if the number of users is one thousand for example then one thousand factorial is a huge number it takes i think 2 500 digits just to be written down so basically you know computing the global similarity for 1000 users in this way would be would take forever it would take like you know millions and millions of years so we cannot do that um fortunately at least for the attacker there is uh there is a there are well-known algorithms that do precisely this so they compute a maximum weight maximum matching in in an efficient way so one such algorithm is the hungarian algorithm i will not go into the details of this algorithm because you know like it's quite complex uh but you can find some very good uh videos online if you're if you're interested so basically once you once you have this similarities course you use the hungarian algorithm to find the the maximal matching the the most most likely matching and you use that matching as a guess for uh the identification of the the users in the in the 2020 data set and this attack uh works pretty pretty well at least when the data set is not too big so a few thousand users uh but there are you know like other solutions that are more robust and like you know work also for for uh bigger data sets so uh i hope i gave an idea of why you know anonymous data can often be re-identified so i talked about location data but actually these kind of attacks work for most types of um of a behavioral data so all social networks and so on credit card data so now one when you when you see this these attacks you could think that it's really impossible to to share data to analyze data without compromising uh privacy and this is actually a problem because you know sometimes you do want to uh give the possibility to analyze data so for example you know uh you might want to to be able to share uh health data for medical research but actually i mean there are there are a lot of examples of how to how one can use data for uh for good um so the good news is that uh some i mean many researchers and cryptographers are you know developing technologies that allow precisely to analyze data without compromising the privacy of the individuals who who contribute with their data but this is for another video and then down here for each of the value as i'm writing it out i go round in a loop around the amount of stretches i want and i basically add in new data so if i'm going from this point to this point any device which is keeping track of time in that way we'll get really confused and we basically get

Original Description

How many times have you been asked to share 'anonymous' location data? Andrea shows just how simple it can be to work out who's who. Andrea Gadotti is a researcher in the Computational Privacy Group at Imperial College London (https://cpg.doc.ic.ac.uk/) NYT Article: https://www.nytimes.com/interactive/2019/12/19/opinion/location-tracking-cell-phone.html Papers referenced by Andrea: Unique in the Crowd: The privacy bounds of human mobility (https://www.nature.com/articles/srep01376) Where You Are Is Who You Are: User Identification by Matching Statistics (https://arxiv.org/abs/1512.02896v1) https://www.facebook.com/computerphile https://twitter.com/computer_phile This video was filmed and edited by Sean Riley. Computer Science at the University of Nottingham: https://bit.ly/nottscomputer Computerphile is a sister project to Brady Haran's Numberphile. More at http://www.bradyharan.com

Watch on YouTube ↗ (saves to browser)

Sign in to unlock AI tutor explanation · ⚡30

Playlist

Uploads from Computerphile · Computerphile · 0 of 60

← Previous Next →

Follow the Cookie Trail - Computerphile

Follow the Cookie Trail - Computerphile

EXTRA BITS - Follow the Cookie Trail - Computerphile

EXTRA BITS - Follow the Cookie Trail - Computerphile

Musical Floppy Drives - Computerphile

Musical Floppy Drives - Computerphile

The Hair Algorithm - Computerphile

The Hair Algorithm - Computerphile

Getting Sorted & Big O Notation - Computerphile

Getting Sorted & Big O Notation - Computerphile

Quick Sort - Computerphile

Quick Sort - Computerphile

Hyper History and Cyber War - Computerphile

Hyper History and Cyber War - Computerphile

Entropy in Compression - Computerphile

Entropy in Compression - Computerphile

Original Elite on the BBC B - Computerphile

Original Elite on the BBC B - Computerphile

IP Addresses and the Internet - Computerphile

IP Addresses and the Internet - Computerphile

A Career in Video Games - Computerphile

A Career in Video Games - Computerphile

Error Detection and Flipping the Bits - Computerphile

Error Detection and Flipping the Bits - Computerphile

Programming BASIC and Sorting - Computerphile

Programming BASIC and Sorting - Computerphile

Birthplace of the World Wide Web - Computerphile

Birthplace of the World Wide Web - Computerphile

Punch Card Programming - Computerphile

Punch Card Programming - Computerphile

Programming Paradigms - Computerphile

Programming Paradigms - Computerphile

CERN Computing Centre (and mouse farm) - Computerphile

CERN Computing Centre (and mouse farm) - Computerphile

Error Correction - Computerphile

Error Correction - Computerphile

Home-Made Code - Computerphile

Home-Made Code - Computerphile

Security of Data on Disk - Computerphile

Security of Data on Disk - Computerphile

Gesture Controls - Computerphile

Gesture Controls - Computerphile

How Intelligent is Artificial Intelligence? - Computerphile

How Intelligent is Artificial Intelligence? - Computerphile

Encryption and Security Agencies - Computerphile

Encryption and Security Agencies - Computerphile

Virtual Machines Power the Cloud - Computerphile

Virtual Machines Power the Cloud - Computerphile

Hacking Websites with SQL Injection - Computerphile

Hacking Websites with SQL Injection - Computerphile

How Huffman Trees Work - Computerphile

How Huffman Trees Work - Computerphile

Cracking Websites with Cross Site Scripting - Computerphile

Cracking Websites with Cross Site Scripting - Computerphile

Cloud Computing (Cloudy with a Chance of Pizza) - Computerphile

Cloud Computing (Cloudy with a Chance of Pizza) - Computerphile

Texting Cabbage with a Recorder - Computerphile

Texting Cabbage with a Recorder - Computerphile

Hashing Algorithms and Security - Computerphile

Hashing Algorithms and Security - Computerphile

How YouTube Works - Computerphile

How YouTube Works - Computerphile

How NOT to Store Passwords! - Computerphile

How NOT to Store Passwords! - Computerphile

A New Golden Age of Video Games - Computerphile

A New Golden Age of Video Games - Computerphile

A Universe of Triangles - Computerphile

A Universe of Triangles - Computerphile

Cross Site Request Forgery - Computerphile

Cross Site Request Forgery - Computerphile

The True Power of the Matrix (Transformations in Graphics) - Computerphile

The True Power of the Matrix (Transformations in Graphics) - Computerphile

The Great 202 Jailbreak - Computerphile

The Great 202 Jailbreak - Computerphile

EXTRA BITS - Printing and Typesetting History - Computerphile

EXTRA BITS - Printing and Typesetting History - Computerphile

Triangles to Pixels - Computerphile

Triangles to Pixels - Computerphile

The Problem with Time & Timezones - Computerphile

The Problem with Time & Timezones - Computerphile

The Visibility Problem - Computerphile

The Visibility Problem - Computerphile

Lights and Shadows in Graphics - Computerphile

Lights and Shadows in Graphics - Computerphile

The Penguin Barcode - Computerphile

The Penguin Barcode - Computerphile

Typesetters in the '80s - Computerphile

Typesetters in the '80s - Computerphile

The Font Magicians - Computerphile

The Font Magicians - Computerphile

The Little Mac with the Big Bite - Computerphile

The Little Mac with the Big Bite - Computerphile

EXTRA BITS - More on the Original Mac at 30 - Computerphile

EXTRA BITS - More on the Original Mac at 30 - Computerphile

XP to Ubuntu with an 8yr old Hacktop - Computerphile

XP to Ubuntu with an 8yr old Hacktop - Computerphile

EXTRA BITS - Hacktop Real-Time Boot Comparison - Computerphile

EXTRA BITS - Hacktop Real-Time Boot Comparison - Computerphile

EXTRA BITS - Making a Bootable USB in Linux - Computerphile

EXTRA BITS - Making a Bootable USB in Linux - Computerphile

EXTRA BITS - Installing Ubuntu Permanently - Computerphile

EXTRA BITS - Installing Ubuntu Permanently - Computerphile

The Dawn of Desktop Publishing - Computerphile

The Dawn of Desktop Publishing - Computerphile

What is Bootstrapping? - Computerphile

What is Bootstrapping? - Computerphile

Reverse Polish Notation and The Stack - Computerphile

Reverse Polish Notation and The Stack - Computerphile

Home-Made Z80 Retro Computer - Computerphile

Home-Made Z80 Retro Computer - Computerphile

Should Everybody Learn to Code? - Computerphile

Should Everybody Learn to Code? - Computerphile

Programming in PostScript - Computerphile

Programming in PostScript - Computerphile

Heartbleed, Running the Code - Computerphile

Heartbleed, Running the Code - Computerphile

YouTube's Secret Algorithm - Computerphile

YouTube's Secret Algorithm - Computerphile

YouTube Search & Discovery - Computerphile

YouTube Search & Discovery - Computerphile

Andrea Gadotti explains how anonymous location data can be used to identify individuals, referencing research papers and demonstrating the simplicity of the process.

Key Takeaways

Read research papers on location data privacy
Understand user identification techniques
Analyse statistical data on human mobility
Recognise the limitations of anonymous location data

💡 Anonymous location data is not as anonymous as it seems, and can be easily used to identify individuals.

🔒 Pro feature: Ask AI to explain this lesson →

More on: Reading ML Papers

View skill →

Automatic Literature Review with GPT-3 - I embedded and indexed all of arXiv into a search engine!

Automatic Literature Review with GPT-3 - I embedded and indexed all of arXiv into a search engine!

Marcos Lopez Caniego - ESASky's JupyterLab widget| JupyterCon 2020

Marcos Lopez Caniego - ESASky's JupyterLab widget| JupyterCon 2020

Obsidian Zotero Integration Plugin | Streamline Your Research Paper Workflow 📝️

Obsidian Zotero Integration Plugin | Streamline Your Research Paper Workflow 📝️

This FULLY FREE Research Agent can BUILD Reports in Minutes!!!

This FULLY FREE Research Agent can BUILD Reports in Minutes!!!

Claude 3.7 Sonnet API | Build a Research Assistant

Claude 3.7 Sonnet API | Build a Research Assistant

I Built An Obsidian AI Research Assistant with Oz...

I Built An Obsidian AI Research Assistant with Oz...

Related Reads

On July 1, 2026, arXiv will spin out from Cornell University, its home for the past 25 years, to become an independent nonprofit organization. Major funding support from Simons Foundation and Schmidt Sciences. Ditching the red for their website. [N]

arXiv is becoming an independent nonprofit organization after 25 years at Cornell University, backed by major funding, which will impact the future of research and academia

Reddit r/MachineLearning

CS-NRRM™ Official Publications: Paper 1 and Paper 2 Are Now Available

Learn about the CS-NRRM's official publications on a 12-year longitudinal human observation archive and its significance in research and development

Medium · Data Science

Found a potential mistake in an ICLR 2026 blogpost [D]

Verify a potential mistake in an ICLR 2026 blog post and learn how to effectively report errors in academic publications

Reddit r/MachineLearning

Rebuttals Move Peer-Review Scores, but Initial-Review Structure Bounds the Movement

Learn how author rebuttals impact peer-review scores and the factors that influence their effectiveness in ICLR 2024-2025, using LLMs for measurement

How to get started With Drug Discovery using BioAI: Computational Biology ( 4K UHD Med Masterclass )

Sudarshan's Multiverse