Data Quality Management Techniques - The Complete Guide
Key Takeaways
The talk covers data quality management: why bad data is costly, how quality issues arise at each stage of the data lifecycle (definition, logging, transformation, analysis, and sharing), and how to prevent, diagnose, and fix them, with an emphasis on cross-functional collaboration.
Full Transcript
[Music]

Next up, we can bring up our next speaker. Welcome! Thank you so much for spending some time with us. I'm super excited to learn everything you have to share about data quality.

Awesome.

Go ahead and bring your screen up.

All right, just confirming that you're able to see the screen?

Yes, I am able to see it.

Okay, perfect. I will get started.

Hello everyone, welcome to my session. I hope you're all really pumped about data quality. I'm always very excited when conferences care about this topic. Anything with data, I think, is super important. So today we're going to be talking about active ways to prevent, diagnose, and cure bad data. My name is Shailvi, I go by she/her pronouns, and let's kick it off.

We have a few quick topics for today. We'll examine what unfit or bad data is, why anybody should care about it, and finally what we can do about it.

Before we get into all of that, I wanted to set the table stakes of what bad data even means. Any data that is inaccurate, incomplete, or misleading in any way is referred to as bad data. Some people also call it low-quality or unfit data. The problem with it is that it eventually leads to some sort of biased decisions, which is what we want to avoid. Here I always also make it clear that just because data is showing you something that you don't want to see, that is not what makes it bad. If the data is accurate, if it is complete, if it's not misleading people in any way, it is not bad data.

So why should anybody care about this? This is something that I think a lot of people, especially in data roles, want to figure out: how to best position it with their stakeholders, with their executives, the C-suite. How can you convince other people, besides people who experience data quality problems, to care about bad data?

The most important reason is cost. When data is low quality, it is a burden on your resources to
try to maintain it. You're constantly trying to guess: is this right, is this not? When you have historically had a lot of bad data, that develops almost a mistrust of accuracy, and so you end up as a team wasting a lot of effort on identifying what the issue is. There's a lot of time and money that's spent trying to reconcile information, and all the extra time, resources, and tools that you have to spend on fixing issues are ultimately a cost to the business.

There's the upfront cost that you have to pay to fix it, but you can also lose opportunity or directly lose revenue because you have data quality issues. Something simple: say you were using an automated pricing model. If that is incorrect and your product is not priced the way it's supposed to be, that can lead to revenue loss. Or the data tells you to do something and that is not what you should have done: you should have picked option A instead of option B. That is an opportunity lost where you could have made more money, and it is a problem that you didn't make it.

Losing trust. When you have data quality issues, especially if they're consistent and they keep showing up, you lose trust, and that's not just a monetary cost. If your reputation is damaged, then people don't trust your results. This can be internal: you have internal stakeholders who don't believe you, where the data is telling them something that they don't want to hear and they blame it on quality issues. But it could also be external: maybe your customers are inconvenienced, or they have seen data quality issues in the past, so when you as a product tell your customer base to do something, maybe they're not
convinced. And rest assured, your competition is always going to take advantage of the fact that you have data quality issues. Internally, people may not trust your work; they may not trust your competence. So whether it's internal or external, your brand can be affected by that.

In some industries (I myself have worked in healthcare, and in a lot of industries where there is a lot of compliance and regulation) there is a legal liability for data quality issues. In some industries there are very standard rules and guidelines, and there are monetary and other consequences if you are not compliant. Compliance problems can show up if your data quality is messed up. But even if you're not in a regulated industry, if your data quality ends up showing up to the consumer in a way that causes harm, you, as an individual or as a company, can be held liable. So there are a lot of legal standards here.

We're going to talk about this a little bit later, but there is the cost of bias and how that relates to ethical behavior. Did you know that your data quality could have caused harm in some way? All of those are potentials to think about. I think there could be a talk just on this topic, how bias in data can lead to harm. Real people can be affected by data quality issues that are essentially incorporating some hidden biases. It might be unintentional, but there is a cost. You could have some unanticipated scenarios where maybe you made a decision on your data architecture, maybe it really didn't meet the moment of what it was supposed to do, and there's a domino effect of problems that happen downstream: you're excluding
some individuals from the benefit of a product, or causing some very direct harm. I think rightfully so, the industry cares about things like this. If you have biased data decisions and they lead to harmful consequences, it is costly, and causing harm is never a good sales pitch for your product.

Productivity. This is more of an internal cost. It's just time-consuming: when you don't have clear processes for thinking about data quality and tackling data quality, then you're going to keep fixing the problem. Anytime it shows up, you're going to fix it, and it can take more time to keep fixing data quality issues than to just go upstream and have better processes that prevent data quality issues from occurring in the first place. Data scientists, typically, find the error, communicate it, hunt for the source, validate, and cross-check. All of that takes a lot of time, and it is not the best use of your resources.

That dovetails into the final piece of it, which is just morale. I truly believe that data folks, or anybody in any profession, thrive when their skills are effectively utilized. If you feel you are just doing very low-skill work in some way, as in there are more important things that you could be doing, it's like: okay, I don't want to be fixing something that could be a larger agreement within the company, that we want to prioritize this, we want to put the tools and the processes and the responsibilities behind it. If instead you feel that it just ends up being your problem because nobody else will take care of it, that is something that leads to
disillusionment, and it leads to loss of talent.

So these are hopefully some good reasons why companies should care. Of course, I think data folks care about data quality issues, but if something rang a bell, I hope that's helpful.

Okay, moving on to the more tactical pieces: what can you actually do about low data quality? If you're disciplined, this is almost a plan: what can you do before, what can you do during, and what can you do downstream. Think of it in that sense.

When we think about bad data quality, the first thing that we should think about is how you actually prevent it in the first place. Prevention is hard, but it's something that can be planned for. The first stage that I'd like to set is to actually look at the life cycle of data, because I think it's much easier to figure out how to prevent something when you understand how it flows through the sequence.

The first stage is definition. We define features and we align different teams on those definitions. At this stage, product and data teams might be working together on that alignment: what even is a piece of data? Next, you log it: you track it, you store it, and it goes through some sort of engineering process. After that is when you transform it: you are applying business rules to that data, you are pre-processing the data, you are transforming it into something that is actually useful. Next up, you are analyzing it: you can model the data, you can interpret it to solve various problems. And finally, you share out the results with stakeholders. Again, there can be many different ways of sharing it; whether it's a dashboard or a model that
predicts something, all of that is part of that sharing piece.

So that wraps up the data life cycle, and if you think about it, at every stage you can have situations occur that introduce those data quality issues.

Bad data during the definition phase: you can have something like an uneven feature definition. I still like to use a lot of healthcare examples. If you're trying to say that this disease is what I'm trying to track, when you define the disease, one person can use a very broad definition, like anybody who tests positive for this disease has that disease. Or you can have something else, like having the disease could just be that you have these markers, and that implies that you have that disease. If you have that uneven definition, where some people think this is the definition versus another, that leads to problems downstream. You could also have a very myopic definition, a very narrow definition. Take the COVID example: people initially said you have COVID if you have the Alpha variant, and then came the Delta variants and the FLiRT variants. I've lost track of where we're at with that thing, but you could keep having other things that represent that same initial thing that you were trying to track, and your original definition was too narrow. And finally, you could just have incorrect input parameters: maybe you made a typo and you misspelled something, so the definition is not reflecting what you truly wanted to track.

The next stage in the data life cycle is the logging stage. This is where you actually track the features that you have defined. There's a lot of potential for confusion and inaccuracy at this stage, because it is possible that you think you are
tracking everything, but you have something broken or something incorrect that you completely missed. The data could be coming from some other pipeline that you're not even thinking about, so you think you are logging all the data you need, but there is some piece of it that is just missing or incomplete. The other piece could be a faulty pipeline: you have every intention of tracking all the data, but then some part of it broke. Data coming from mobile is not being tracked any longer, or something like that. And then there are inconsistent time frames: how long is the data being stored, when does it get aggregated, what time zones are used? I think this is a pretty common problem. For example, some people are aggregating by day, and there is a difference in the time zone expectations between who's doing the logging and who's viewing it. This is the equivalent of there being a broken connection between the data that you want and the data that you are actually tracking, and as you can imagine, it ends up leading to a lot of unintended consequences.

The next phase is the transforming phase. This is where you are trying to pre-process your logged data into a usable format with rules. When you talk about rules, the first thing is: are the rules even understood? Does everybody think of the rules in the same way? Have you labeled data in a way that is unambiguous? This is a place where I highly recommend the use of data dictionaries and more intentional documentation of what's going on with the data, because I think it really helps to remove some of
the assumptions that people might make when they are looking at something.

Another common thing is just meaningless aggregation. Different people think about logic and algorithms in different ways, and some people jump a few steps. Maybe you start with this clean table and you're trying to get to that end point, but the aggregations you make along the way, in the step-by-step pre-processing, maybe don't rest on the same assumptions that other people would make. The situation that results is that if you spot an issue, you're trying to go back up every step to figure out where things broke, and it's just unnecessary; you have to do a lot of data gymnastics with raw data. I think this is why dbt became popular, because you could see the lineage more clearly: what's going on, what did it start with, what happened over time. Otherwise, in the transforming phase a lot of things can go wrong.

And finally, logical errors: your rules were created in a test environment, and maybe they didn't include real-world scenarios.

The analogy that I use is: I can give people the exact same raw ingredients, like flour, eggs, and butter, and everybody can take that and turn it into a different finished product. The reason is that everybody is using a different recipe. That's why, if you want that transformation to be predictable, if you want it to be consistent, you have to very clearly define and agree on those rules.

The next phase is the analyzing phase. This is where you really try to make sure that everybody is trying to answer the same question. You can ask something as innocuous as: how many users do
we have? Do you mean how many users we have in our universe, or how many users we have who are active, or how many users we have who currently still want to be associated with our platform? Ironing out those tiny wording differences can go a long way in making sure that the problem is something that everybody is aligned on.

This is my favorite part in some way: humans make errors. Even if you try to automate as many things as possible, you could still use the wrong ML model or technique, or there's a mistake in your formula. There are a lot of things that can go wrong.

And the final piece, again: biased algorithms are something that we can talk about endlessly. How you gathered your data, how you chose what to include in your data, what is related to your insights, what training data you used: all of those things can really make a difference in the final insights that you get. Mistakes in this part can lead, again, to severe quality issues that are inaccurate, incomplete, or misleading in some way.

The final phase is the sharing phase. This is an actual graphic that I found. I love finding bad charts; it's a hobby of mine. I think someone shared this with me. This is a bad chart; I think everybody can agree that it's not great. But even outside of that, even if you don't have a glaringly bad chart that doesn't even add up to a whole pie, faulty reporting is something that can happen very innocently: you had a dashboard, you stopped maintaining it actively, and then someone went and tried to get information and
thought it was the truth. It's not intentional, but there can be faulty reporting where the information is simply wrong, even if unintentionally.

Also, people often make assumptions about what they're seeing. When you have information out there and people don't interpret the results correctly (for example, they look at this chart and they don't realize that it's overlapping, that somebody could have two different worries), that's a bad thing. You want to work very hard to prevent people from misinterpreting results. You can't do that all the time, but focusing on the sharing phase and giving thought to how different people will read it allows you to prevent some of those issues.

And the final piece that can happen (and again, this is not meant to be an exhaustive list): the minute you put information out there, the minute you share it, you can't stop people from using it for something that it was not intended for. The only thing you can do is give that nuance upfront: this is what the data is good for, this is what it's not good for, please do not use it for this other thing. Having that documentation and those caveats outlined upfront helps prevent some unintended downstream use, which eventually causes a data quality issue that you don't really have control over.

I'll pause here for just a second, because I know some people like to take a snapshot of this graphic. Like I said, this is not an exhaustive list of every possible way something during your data life cycle can cause a data quality issue, but it's a starting point. I think some people
like to look at it as a checklist and at least go through some very common scenarios. Feel free to take a snapshot if you like.

All right, next up is actually diagnosing. Hopefully, once you understand how bad data gets created, it's a little easier to diagnose what exactly is going on. Here I start with: how do people even notice that there is bad data? A very common way is that people notice results that don't match. Maybe they have a source of truth or something that they trust a lot, and then some other data shows up and it doesn't match, either in absolute terms or in aggregation. Or you see something that you consider suspicious: you look at it and think, this data does not make sense, it doesn't match what I logically thought would happen. Sometimes it is actually correct, but it is still a prompt for investigating what's going on.

Either way, the first step is to look for obvious reasons: why do you think you are encountering or suspecting bad data? The first question to ask is: is it actually bad data? If you were trying to match two numbers, were they supposed to match, or were they always meant to represent two different things? For the suspicious results, what assumptions did you make which now cause you to be suspicious? Was the data set built for your use case? We discussed before that there are unintended downstream uses that people can come up with, so is it just a matter of you looking for the right answer but in the wrong place? And finally, is there a known bug or a pipeline issue that can explain the data quality issues that you're seeing? Did something happen in your pipeline that you have already documented?

If none of the obvious reasons are the answer, then it's time to go through the data life cycle, but in reverse order. For each stage, verify: does
the phase before it have the right data? That helps you pinpoint where the problem is occurring. Then, when you identify the phase, think about the hypotheses for what might be happening. For each of those, create the hypothesis, test, validate, and keep repeating that process until you find something that explains what your issue is.

Here's a sanity checklist. Some of you will recognize that this is the journalistic standard of documenting a story. I find it very useful because when you start going through the what, where, when, why, who, and how of the problem, you start documenting some of the assumptions. If you do this as a combined exercise, you might actually find through that discussion that there is a disconnect between some things that you thought were true which are not. So this is a good way to diagnose the problem.

All right, now we are on to the final piece: curing, or strengthening, data quality. Here I always say that data quality is a cross-functional effort. It is not just the data team that has to focus on it. Hopefully you can use some of the things that I outlined in the beginning as reasons for why everybody should get involved, because the answers are not just in the data team. Some of the assumptions may be product assumptions or marketing assumptions or design assumptions, and I think it is helpful when you have a bunch of people coming together to fix the issues.

I list out a few common scripts: being able to compare data across sources, being able to quickly identify which data is missing or duplicated, and my favorite, comparing data trends by dimensions. I think it's a way to quickly identify what is going wrong in your situation.

And the final piece is just some coding practices. I
am always a big fan of reusable modules, and of making sure you have documentation. Especially if you have remote teams, you want to make sure that everybody can access what you intended. And if you add alerts to your pipelines (now there are a lot of automated tools that do that for you), at least for the important stuff, you will be able to identify issues early.

And yeah, prevention is better than cure. So maintain stuff, reconcile things, automate where you can, simplify everything. And governance: I cannot stress enough that it is super important to think about this. The final thing is to have an actual plan to audit and measure your data quality issues. Hit me up later if you want ideas on how to do that.

With that, I think that's my time. Thank you all so much for attending the session. I think I'm at time, so I don't know if I have time for questions, but you can connect with me.

We have a few minutes.

Okay, cool. But otherwise people can connect with me later as well if you don't get to the questions.

Cool. So one of the questions that popped up in the chat: Rohan asked, how can continuous validation and testing be implemented to ensure data quality is maintained during real-time processing?

Yeah, it's interesting, because when you are continuously validating, I think it helps to again go back to the basics: where is the data quality issue popping up that you have noticed? Keep anchored on that. If I'm expecting that this is the part that's going to break, then you can create a process specifically in that zone that helps you catch it early and fix it early as you go along. I think there are specific techniques
for each part of the validation phase and things like that that you can use, but again, I would always encourage people to go back to the why: what can go wrong, and how do you do it?

Awesome, thank you so much. I think we're at time now, but we really appreciate it. We all learned a lot. Thanks!

Thank you.
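
As a rough illustration of the "common scripts" mentioned in the talk (comparing data across sources and spotting missing or duplicated records), here is a minimal Python sketch. The session did not show any code or name a language or tool, so the function names, the `order_id` and `amount` fields, and the sample rows are all hypothetical.

```python
from collections import Counter

# Hypothetical sample rows; the talk did not show a schema or any data.
warehouse = [
    {"order_id": 1, "amount": 10.0},
    {"order_id": 2, "amount": 25.0},
    {"order_id": 2, "amount": 25.0},   # duplicated record
    {"order_id": 3, "amount": None},   # missing value
]
billing = [
    {"order_id": 1, "amount": 10.0},
    {"order_id": 2, "amount": 20.0},   # disagrees with the warehouse
    {"order_id": 4, "amount": 5.0},    # absent from the warehouse
]


def find_duplicates(rows, key):
    """Return the key values that appear in more than one row."""
    counts = Counter(row[key] for row in rows)
    return sorted(k for k, n in counts.items() if n > 1)


def find_missing(rows, required_fields):
    """Return (row index, field) pairs where a required field is absent or None."""
    return [
        (i, field)
        for i, row in enumerate(rows)
        for field in required_fields
        if row.get(field) is None
    ]


def compare_sources(source_a, source_b, key, value):
    """Report keys present in only one source and keys whose values differ."""
    a = {row[key]: row.get(value) for row in source_a}
    b = {row[key]: row.get(value) for row in source_b}
    return {
        "only_in_a": sorted(a.keys() - b.keys()),
        "only_in_b": sorted(b.keys() - a.keys()),
        "mismatched": sorted(k for k in a.keys() & b.keys() if a[k] != b[k]),
    }


duplicates = find_duplicates(warehouse, "order_id")
missing = find_missing(warehouse, ["order_id", "amount"])
report = compare_sources(warehouse, billing, "order_id", "amount")
```

Checks like these are cheap to run at the boundary between lifecycle stages, which fits the advice in the talk to verify each phase against the one before it. In practice they would more likely run as SQL or inside a pipeline tool than as plain Python.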
Original Description
//Abstract
Uncover the secrets to harnessing quality data for amplifying business success. This talk equips you with invaluable strategies and proven frameworks to navigate the data lifecycle confidently. Learn to spot and eradicate low-quality data, fortify decision-making, and build trust with data. With streamlined prevention strategies and hands-on diagnostics, optimize efficiency and elevate your company's data-driven initiatives.
//Bio
Shailvi is a seasoned Data Leader with over seventeen years of experience growing impactful teams and building technology products used by hundreds of millions of users. Her career includes notable technology roles at Salesforce, Fitbit, and as the Head of Data at Strava. As a fractional executive, she has consulted, advised, and invested with multiple high-growth startups. Shailvi has spoken at nearly 100 global conferences, coached more than 500 individuals, and authored the best-selling book "Self-Advocacy."
A big thank you to our Premium Sponsors @Databricks, @tecton8241, & @onehouseHQ for their generous support!