Getting High-Quality Data for Computer Vision Models | Machine Learning | Community Webinar

Data Science Dojo · Beginner · 📐 ML Fundamentals · 4y ago

Skills: LLM Foundations 80% · AI Alignment Basics 70% · CV Basics 60%

Key Takeaways

The video discusses the importance of collecting and annotating high-quality data for computer vision models, highlighting the need for a data-centric approach and the potential biases in AI systems. It covers datasets, tools, and techniques for data collection, annotation, and visualization, including ImageNet, Mechanical Turk, Aquarium, Lightly, and Voxel51.
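
As a rough illustration of the bias amplification point made in the talk (a model's predictions can be more skewed than its training data), here is a minimal Python sketch. The counts are hypothetical and chosen only for illustration, and measuring disparity as the difference in share between two groups is an assumption; the talk does not define the measure. The function names `attribute_shares` and `disparity` are also made up for this example.

```python
from collections import Counter


def attribute_shares(labels):
    """Return each label's share of the total number of annotations."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


def disparity(labels, group_a, group_b):
    """Difference in share between two groups (an assumed, simple measure)."""
    shares = attribute_shares(labels)
    return shares.get(group_a, 0.0) - shares.get(group_b, 0.0)


# Hypothetical gender annotations for images of one activity:
# first in the training data, then in a trained model's predictions.
train_labels = ["female"] * 66 + ["male"] * 34
predicted_labels = ["female"] * 84 + ["male"] * 16

train_gap = disparity(train_labels, "female", "male")
predicted_gap = disparity(predicted_labels, "female", "male")

# The model amplifies the bias if its predictions are more skewed than the data.
amplified = predicted_gap > train_gap
print(f"training gap: {train_gap:.2f}, predicted gap: {predicted_gap:.2f}, amplified: {amplified}")
```

The same two helpers can be pointed at any annotated protected attribute to get a first, coarse view of how balanced a data set is, which only works if those attributes were annotated in the first place.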

Full Transcript

thanks everyone for joining today my name is bilal i'm one of the marketing managers at data science dojo i'm joined here today by eva she is founder and ceo at humans in the loop and she'll be presenting on getting high quality data for your computer vision models so i'll let her take over now thanks eva great thank you bilal hi everyone it's really great to be here with you as bilal said feel free to raise your hand ask questions anything that comes up i think we can have a really nice discussion here about what is high quality data how to get it and how to make sure that we keep improving it iteratively as we go and as we deploy our models so a bit about me as bilal mentioned i'm the founder and ceo of humans in the loop we are a professional data set collection and annotation company we provide essentially human input for artificial intelligence and computer vision more specifically so we work a lot in terms of annotating data collecting it but also providing continuous human supervision for models in order to make sure that they're high quality unbiased trustworthy and so on personally i'm very passionate about building ethical ai supply chains mitigating different types of harmful biases in ai systems as i mentioned you know human in the loop systems where humans work hand-in-hand with the ai and together you know they complement each other i'm passionate about quality data sets for quality models so quality is a big focus for me personally and for us as a company and of course making an impact so i'm going to be telling you a little bit more about our social impact as a company later on but i think you know of course everything that we do in this world we can think about it in terms of our environmental impact our social impact and make sure that we're aligned with our values you know no matter what we are building so for me you know it's really about making a social impact through the data work that we do and making 
sure that in our employment and training programs we're actually impacting people who really need this opportunity and who can also contribute with their diversity to the ai system in terms of high quality data for computer vision the question here is why should we care of course it's kind of natural to assume that we need good data in order to train good models it's actually you know already proverbial that it's you know garbage in garbage out you need good data in order to build good models and you know for me it's very important to transmit that we do not only care about good data and good models just in terms of our you know outputs in terms of accuracy and efficiency of the ai system and you know how much cost savings it's going to bring us and so on but also to consider the wider societal impact of the ai that we're building and the fact that because it produces you know decisions at scale it actually has the capability to affect people at scale so if it produces harmful or low quality decisions or biased decisions we might actually produce a lot of harm on the users of the system so we want to avoid situations in which our ai systems are exhibiting harmful biases in terms of for example racial and ethnic biases these are just some of the you know most notorious examples of what's happened and you know for me the most curious one of these is actually the one on the bottom where we have two hands holding a thermometer and in the first case the ai classifier interprets this object as a gun while in the second case it says it's a monocular just because of the you know different skin color of the person so this is one example of how nuanced it is if we endeavor to build systems that are really unbiased it is not enough to just think about you know the humans that appear in the distribution of faces and so on we have to think even if it's an object-focused model what are the co-occurrences and how 
many you know people how many hands of different races do we have holding these objects and so on so it's really non-trivial i would say to build such datasets that are completely unbiased maybe there is even no such thing and we also want to avoid gender bias of course i mean these are just two of the many potential biases but these are two of the most harmful ones and two of the most frequently seen ones as well and you know we have gender biases in terms of how images of you know congress people are being classified women are always related to their facial and outward appearance like their smile and their hairstyle and their chin while men are classified in terms of you know business and spokesperson and official so these are again quite subtle we might not be able to prevent all of them but at least we want to make sure that our data set is representative and it's balanced enough in all different types of co-occurrences so as to make sure that it treats everybody equally so i already mentioned you know this principle is well known garbage in garbage out but it gets even worse because a lot of models actually amplify those biases that exist in the training data and here i'm citing you know this paper which treats i think it was a model for keywords and classification and they found not only that the data set for the task contains significant gender bias but also that the models trained on this data set further amplify those biases so for example if we have images of people in their daily activities and one activity is cooking in the training data set it appears you know 33 percent more likely to include females afterwards when the model is trained on that data set it amplifies the disparity to 68 percent at test time so again it's even worse and we need to find a way to prevent this of course there are different types of algorithmic ways in which we can try to mitigate biases but the consensus right now 
is that they can only get you so far so for example the group which started working on the imagenet data set we're actually going to be talking more about it later on but it's one of the of course most canonical data sets the first big data set which was really you know collected and annotated at large scale and most recently there have been attempts to rebalance it and to make it more fair however you know the authors of the paper have concluded that even though algorithmic interventions are possible in order to make sure that the outcomes of the models are less biased than the actual data set it's unlikely that such interventions are going to be the most effective path so we need to think on the data level even if we're using an algorithmic approach we still require the protected attributes to be explicitly annotated so we still need to think about the data just because we won't be able to know what the distribution of our data is and whether it's biased or unbiased unless we have these protected attributes actually encoded into the data and annotated in it so again we have to think about how do we annotate the data and how do we collect it in a more representative way so this is actually a good example of this new shift in the industry that has been happening in recent years from model-centric ai to more data-centric ai and right now you know everybody's talking about data-centric methods and ways in which ai can be trained one of the you know thought leaders in the space andrew ng has actually reiterated that model tweaking and working on the algorithmic level can improve models perhaps marginally and this is more in the realm of competitions and so on but for real life data and for real life models if you improve the data this can be the true game changer in order to ensure higher quality so the shift is the following in the model-centric way you collect what data you can and then you just build the 
model good enough to deal with the noise in the data while if you're doing this in a data-centric way you need high consistency and high-quality data so if you have this high quality data set you can train different types of models on it and it will be much easier in the model-centric way we hold the data fixed and then we iteratively improve the code of the model but when we switch to a data-centric mindset we actually hold the code fixed and then we iteratively improve the data so in the case of errors for example we fix the data and we focus on ensuring that the data is good as compared to for example the model-centric way in which you know we see some errors we then tune the model and we focus on just having enough data and having a lot of data instead of having perhaps a smaller data set but with really good high quality and consistent data so given this data-centric mindset that we're adopting right now we really need to think about what does high quality data mean and then afterwards of course how do we go about obtaining it and actually using it in our model training in terms of high quality data and what does that even mean we have the take of the eu actually which is quite interesting just because the european commission has recently set out to create a new ai act regulation this is specifically targeted towards high-risk ai applications so these are applications that might affect fundamental human rights so this will not affect all different types of ai applications but still it's quite interesting because it's the first such legal document which imposes some minimum requirements on the data the quality the human oversight the quality management system the governance and so on of ai systems and of course it talks a lot about the data so what it says is that basically training validation and testing data sets should be sufficiently relevant representative free of errors and complete in view of the intended 
purpose of the system so of course these are quite abstract definitions and we're going to be discussing what exactly that means but you know these are just some guidelines that the commission is proposing you know in order to create high quality models we need this type of data and also interestingly they point out that the data set should have the appropriate statistical properties including as regards the persons or groups which the high-risk ai system is intended to be used on so for example if my system is meant to be used in a very you know diverse place or in a variety of places i actually need to have enough statistical properties in my data to make sure that it's treating everybody in the same way so some people would say especially in you know in the eu we're talking a lot about gdpr and whether we're even allowed to be processing such types of data like the gender of people or their ethnicity and disability status and so on but they actually clarify here that in order to protect the users from this bias the providers will be able to process such types of special categories just in order to ensure the bias monitoring detection and correction so again you know we really need to have all of these protected attributes in your data set in order to make sure that it's fair and representative relevant and so on and this of course poses a bigger challenge to people who are working with data sets because they have to make sure that all these datasets are annotated in terms of their ethnicity you know so every person that appears in the data set has to be annotated in terms of their ethnicity in terms of their gender in terms of their you know different other types of special categories and that might be quite challenging because sometimes you don't know what the gender or ethnicity of a particular person is usually the best approach in this case would be to allow people to self-annotate their own data and to 
self-classify just because it's a really bad practice to have an external perhaps you know a crowd worker or another annotator classify you or me in terms of what our gender appearance is there might be a lot of bias that is transmitted through this decision but of course if we're building a large-scale data set it might be really hard to actually ask each person who appears in that data set how do you want to be classified what is your gender what is your ethnicity and so on so you know all of this is kind of you know legalese these are just requirements that the commission is imposing but how do we actually put them in practice it's a completely you know different topic so you know for example when we talk about the data should be representative what does that mean what are the qualities or how do we even evaluate or quantify representativeness there are of course some ways and you know there are some really notorious examples of canonical datasets that have been used in the industry and they're actually really not representative at all for example labeled faces in the wild it was sourced through images of notable people in yahoo news and it is estimated to contain more than 77 percent male faces and 83 percent white individuals so this is you know a rookie mistake you cannot possibly publish such a large-scale data set and have such an imbalance in your data and one company called voxel51 which provides data set visualization actually did a very interesting analysis of the composition of the data set and they found out that there is a big distortion in terms of the representation for asian or black ethnicity white ethnicities as we mentioned are much more represented and males are much more represented than females as well here the interesting thing is that actually in labeled faces in the wild these attributes are not binary these are not you know male or female it's something called a floating point value so that was interesting just because 
you know different people might have different floating point values depending on how much they exhibit this attribute so this is quite an interesting approach but of course these were annotated by crowdsourced workers so it was based on the judgment of the crowdsourced worker and again it's not completely reliable and it's also you know when talking about the taxonomy it's also quite problematic to just have asian black white and you know some spectrum of male or female so of course when building such data sets do think about what are your classes what are you classifying people into is it going to be this type of like asian black and white simplistic classification would you like to get even more perhaps granular about the different regions that you're targeting how easy would it be to estimate that these are all difficult questions and they don't have a clear cut answer but after you know these revelations actually another version of this data set was released and it was called racial faces in the wild so this time it was a little bit more balanced in terms of the representation of different classes but again they actually featured i think just four different ethnicities white indian east asian and black so again you know even though your data set might be representative of these four classes it might be balanced within them what about other classes which are excluded from these or ethnicities which are in between these or perhaps you know if we're talking about the east asian ethnicity class how is that balanced in terms of for example korean japanese chinese you know the different diversity of people living within these classes so again non-trivial we have to think very hard about what it means to have a representative data set there is no easy answer for this then in terms of free of errors we want to have data sets which are free of errors of course that's kind of you 
know natural but it's quite subjective so this is again an experiment that was performed on google's open images data set again you know a very large scale data set and the outcome was actually that most of the false positives were annotation errors so errors in the ground truth data and just because it's such a large-scale data set it has quite a lot of different classes a taxonomy that's very complex a lot of these stem from that or just you know human error so for example in this case we have a car which was detected in the right way the front wheel is detected and there was a ground truth label for it and then the back wheel was detected by the model that was trained on the open images data set but it actually wasn't present in the ground truth data so it came up as a false positive and this can only be discovered after review of the actual results another type of error which appeared was for example differences in the classes that were detected versus the ground truth for example the ground truth featured both of these i think these are meerkats they were both labeled as animals while the model trained on the same data set actually detected one of them as a monkey and the other one as a carnivore which is not technically incorrect it just is not animal so because these classes overlap a lot in how they're structured you know a carnivore is also an animal in that case you know it came up as an error while in reality it's not necessarily an error sorry i wanted to show you this other example as well where we have a contradiction in terms of how items are grouped so in one case you might have different labels for each item of zucchini and on the other hand for the corn it was actually labeled as a group you know all three pieces of corn together in the ground truth while the model actually detected each one of them separately so again this came up as an error while it didn't have 
to and it was just an error of the ground truth in terms of how consistently the data is labeled are we labeling each food item separately or are we labeling them as a group so these types of errors in terms of consistency being on the same page in terms of what classes are to be used in different times and different cases this is actually crucial and of course in our work as a data set annotation company we work a lot to make sure that all of the people who are annotating the data set they're doing this in a consistent way and they're all on the same page but of course it's not always easy to guarantee that as you see you know this is a really large scale data set that's been through various rounds of qc and again you know there are a lot of these errors and in terms of completeness i wanted to mention this very quickly because the goal here is for our data set to include as many potential edge cases as possible so to make sure that if something unusual comes up we're going to be prepared you know the model will be robust enough to handle it so for example in a lot of cases especially for objects it was found that canonical data sets like imagenet they actually feature a diversity of objects but in very canonical situations so you see on the left side under imagenet you have different types of chairs and they all appear the same right that's from more or less the same standpoint and so on another data set was actually released where the focus was on creating a much higher diversity in terms of the rotation of the object the background the viewpoint you know different ways in which t-shirts or teapots might appear you know perhaps upside down or in a very strange location in the house so the goal was to create models which are much more robust to all of these you know different unusual situations that they might face you know especially when we're talking about applications in the wild so this is one way 
in which we can make sure that our data set is more complete the other one is to pay special attention to co-occurrences and again to canonical situations for example in imagenet it has been observed that the skateboard class almost always appears together with a human and whenever there is a skateboard without a human actually you know models cannot recognize a skateboard just because they're so used to not seeing a skateboard without a person so again you know when thinking about the representation of different objects and also you know humans in our data set we want to make sure that we're managing the co-occurrences and that there are a lot of unusual cases where different objects appear in situations that are not so canonical let's say okay so we've talked a little bit about what high quality means how do we go about getting high quality data the sad truth is that this is actually a very time intensive and labor intensive process as i mentioned you know we have to consider so many factors and frequently there is not enough resource for that there is not enough time and it's not as fun so this is one paper published by google research which has established that everyone wants to do the model work and not the data work so they have identified different types of data cascades so these are different events that cause negative effects from data issues and these are just conventional practices that practitioners use in the ai and ml world but they undervalue the data quality right they focus much more on building the best type of model and playing around with the weights and so on and they don't care so much about the quality of the data set so they have identified that data cascades are pervasive invisible and delayed but often avoidable and that's across the industry you know also in terms of conferences where you know the data focused publications are usually downgraded and the 
ones that are focusing on novel models and approaches are actually given more prestige so this is across the entire industry but they have found that you know the data-centric approach is going to bring much more value especially if data work is uplifted and it's given more you know more good pr more reputation so that people are actually attracted to do it and they invest much more effort and money in it this is a survey that was conducted back in 2017 but i think it's actually quite interesting they asked data scientists about the different tasks that they enjoy the most versus the least and of course you know building and modeling data refining the algorithms mining the data for patterns these are the most enjoyable tasks while collecting the data labeling it cleaning it organizing it this is the most dreaded task of all so again you know this is just to represent how just because it's more time consuming it's more manual it really is not something that most people prefer to spend time on but if we are to adopt a data-centric mindset we need to transcend that this is another question from the same survey about where data scientists usually obtain their data and actually a lot of them 41 percent say that they use publicly available data sets there are some other approaches such as collecting the data on your own or generating the data from your internal systems which is quite appropriate and it usually is the best approach when you have the capability and your model will be applied again in your own operations the other options are to collect the data internally or to outsource it so i will be discussing mostly the two most problematic approaches here which are to use publicly available data sets and to outsource the collection just because whenever you or your team collects data or when you do it internally when you generate it from internal systems it's much more reliable you have much 
more control over it while in the first and last case it's a little bit more tricky when it comes to large-scale data sets i would say that right now they're not a good idea i mean there have been so many publications and articles about how error-ridden they are the fact that they exhibit multiple biases in them and we've already seen quite a lot of examples in terms of imagenet labeled faces in the wild these are data sets that are very frequently used in academia in industry as well but they have a lot of bias issues they have a lot of copyright issues as well and also consent issues because a lot of them were generated through image extraction from flickr or google images or just some online sources and none of the people who appear in those data sets actually consented to appearing in them in terms of imagenet a lot has been said about its taxonomy the fact that it's based on wordnet which is a taxonomy of different types of words which includes also racial slurs and offensive words so essentially imagenet because it's based on wordnet it actually emulates this entire structure and it has a lot of images that are either offensive or are completely non-imageable so they're trying to collect images and annotate them for words that are too abstract to illustrate for example here on the left we have the class of second rater or mediocrity and you know some people's images which is really questionable and problematic but because the entire imagenet you know its goal was to create a data set that encompasses the entire human world and all different types of human notions and words what they did was you know just try to collect images for every different type of notion in wordnet that was really problematic most recently they have released a new version of imagenet which is much more balanced in terms of representation and it also excludes all of these non-imageable notions plus you 
know different offensive images that shouldn't have been there in the first place another such large-scale data set which was pulled down was the tiny images data set it was actually one of the first ones with very tiny images just because the processing power at that moment was suitable only for very small images but again you know it contained a lot of offensive notions which were actually quite harmful and after 14 years they just had to pull it down it was not feasible to go through it and to try to correct it or to refine the data set just because it was at such a large scale but of course the models trained on this data set might be still available some replicas of this data set are still available so be careful what type of data you publish because the consequences of it even if you pull down the original source might still be there and might be echoing in the industry much later one alternative that has been proposed just recently just because imagenet is still being used very frequently for pre-training of different models before you actually start with your actual training the authors of this article have proposed that another type of data set should be used which has no humans in it and which is extracted in a completely ethical way they have extracted more than one million internet images with the right copyright license that did not contain humans or body parts you know how i mentioned you know the case with the hand holding an object so this is meant in order to make sure that the data set does not contain any humans and does not require any type of human consent in order to be able to use it in addition this data set is as varied as possible and it doesn't really have labels on it so it's only used for self-supervised pre-training so you can still use imagenet or another dataset in order to provide the actual labels that you want to detect but just for pre-training this is i 
think right now the most ethical alternative that is available so we discussed the publicly available data sets i would say not an ideal option try to stay away from them in terms of outsourcing the collection this is again a bit problematic it is the approach that a lot of companies are taking both for annotation and for collection and the most frequently used way is actually mechanical turk there is a really interesting article by kate crawford called anatomy of an ai system and it talks a little bit about mechanical turk among other things and i would highly recommend it it actually talks a bit about the history of mechanical turk and the fact that it actually has become an artificial artificial intelligence and as you can see in the last line she mentions that it's driven by a remote dispersed and poorly paid click worker workforce that helps the client achieve their business objectives so there is a lot of controversy around mechanical turk and even though crowdsourcing is the cheapest option in order to get some data and also to annotate it it's not completely ethical the payment standards are really low and the working conditions for click workers are not ideal at all on this website and there are also very harmful power dynamics between requesters and turkers and you can actually see here a graphic that was published in one newspaper article about the rise of the one-cent workforce it talks about imagenet and how it was created and the fact that more than 500 000 people on mechanical turk contributed to it but most of them were earning less than minimum wage the minimum payment for a single task is one cent on mechanical turk so they really had no protections and essentially you know they labeled more than 14 million images and still you know they were severely underpaid for the amount of effort that they did and for the amount of value that they also 
produced because this dataset has been used so much for so many different types of applications and training new models and of course it has produced a lot of cost savings for companies so all of that value that was produced by crowd workers was not reaped by them in salaries unfortunately so you need to be aware of the malpractices in the crowdsourcing industry if you ever decide to go for it there are a lot of success stories about using crowd workers and dealing with adversarial workers for example and making sure that you can actually reach a global audience which contributes to the diversity of your data set and these are really nice and it is possible to do it just bear in mind that for example you know if you combine multiple subjective opinions that does not necessarily mean that you're able to get an objective kind of consensus out of them perhaps they will just lean towards subjectivity so the different subjectivities of different people do not cancel each other out necessarily so there are a lot of pitfalls in data set collection especially crowdsourcing but just in general and i would also recommend this article by kate crawford again called excavating ai it actually focuses on the politics of image collection labeling different types of large-scale data sets how they were sourced you know these questions about where did these images even come from why were the people in these photos labeled in this particular way what sorts of politics are at work when pictures are paired with a certain label what are the implications of using these photos for ai systems so this is really you know a very insightful essay and i would definitely recommend it for any practitioner so i'm not sure how many answers we gave to both of these questions we mostly suggested even more questions and more uncertainties but i wanted to talk about one other thing which is how do we iteratively improve the 
data and maintain its quality? Assuming that we did all well with the high-quality data, thinking about what that means for our application and what the ideal dataset would look like, we have perhaps obtained a good dataset. Now how do we go about actually using this data-centric mindset to iteratively improve the data and maintain its quality? As we have seen, it's really important, as we're training and deploying the model, to also analyze the results, perhaps collect new data, annotate it again, retrain the model, and create this kind of loop.

Whether that's being done in practice is a whole other topic. From our experience on the market, I would say that the focus, at least among practitioners who are reaching out to us as a dataset annotation company, is mostly on the gold-standard data. So the present is definitely much more focused on annotation than anything else, and the demand is highest there: 77 percent of our clients request annotation services, perhaps because they have their own data, which has been acquired from different types of internal or external sources. Some of them, 12 percent, need collection, and only 10 percent actually require human-in-the-loop services, and these are usually the companies that have higher-risk AI systems.

However, the future, in my vision and the vision of our company, is that we're going to have much more focus on human operators who monitor the deployment of AI and handle alerts, because the regular types of annotation are going to be automated and it's going to become more a question of maintaining high quality. This can be seen in two different types of applications. One of them is handling edge cases in high-risk systems. These are, of course, the systems where, if your ML performs badly, it is going to create the worst consequences. So the operators can actually be used as a human
in the loop to flag low-certainty cases in real time, and they can prevent harmful decisions by AI systems. As these AI systems become commonplace, it's going to be much more relevant to have humans in the loop to continuously monitor such flagged cases.

The other way in which a human can be used for monitoring and supervising these models is to provide actionable insights: to perform error analysis, to provide qualitative data and insights on the failure modes, and to help mitigate data drift and concept drift. I'm going to be talking a bit about these as well, because I think they're crucial for us to think beyond the train, validation, and test mentality and to think in a more iterative way. So here I would say the human can take part in all of these different stages of the ML life cycle: dataset collection, annotation, model training and validation, then during deployment providing human supervision, and then providing performance and error insights that can be fed back into the process for collection and annotation, and so on. So in the future, this is our vision: a lot more humans are going to be needed, and they're going to be much more professional in what they're doing, not only performing the simple data annotation that is required now but really being incorporated into the entire life cycle.

I would say that there are several best practices that we recommend in order to make sure that your data is high quality, that your models are high quality, and in general that what you're producing guarantees quality and has a good impact on its users, stakeholders, and so on. One of them is social responsibility. One of my favorite examples is a company called understand.ai. They have a statement on their website which says that they look after their labeling partners: they want to ensure reasonable wages, they want to ensure a reasonable working environment, and guarantee
sustainability. So having such a statement, or even thinking about the impact of your company and what that says about your AI supply chain, matters: the suppliers that you're choosing, whether you're going for crowdsourced options, or perhaps a professional company like ours, or just using your own employees, freelancers, contractors, anything. It's important for you as a data practitioner to think about the social responsibility that you have. And of course there are a lot of impact alternatives: there are a lot of impact sourcing enterprises that provide dataset curation and collection services and human-in-the-loop services, like us. So there are options; the important thing is to have the consciousness and the awareness.

Our approach specifically focuses on a diverse human workforce for quality. We really focus on training our people to understand the impact of their work and why they're doing what they're doing. For example, we have specific trainings on bias in AI, and we make sure that whoever is working on a particular dataset knows what the dataset is going to be used for and how it was obtained, and can submit feedback to the clients about different failure modes that they're noticing. So we're really trying to elevate annotation and data workers, and to make sure that they come from a variety of different countries and that they provide the diverse human input that is needed in order to create high-quality AI.

I would say another best practice is to use the different tools that are available out there in order to visualize your dataset, detect different failures, and analyze your model. There is a really nice tool called Aquarium that I highly recommend. It allows you to visualize your dataset with both the ground-truth labels and your model predictions, and you can essentially create a point cloud with the different compositions of your dataset. You can single out different examples that were perhaps labeled incorrectly. So in
this case I'm seeing that the actual label is a bird but the one predicted by my model is a horse, so perhaps I need to search among my unlabeled data, find similar images, and then send them for labeling so that I can retrain my model on these. This type of model observability, and observability of the data, allows us to pinpoint specific failure modes and really make sure that we're noticing the different types of mistakes and analyzing why these mistakes are happening.

Another really nice tool that I recommend is Lightly. It has a similar philosophy and similar functions. Essentially, it allows you to explore the samples in your dataset, perhaps by their composition or sharpness or different types of attributes that have been annotated on them. You can again create these clusters of data, visualize the different images, and also curate your dataset, so perhaps create a smaller subset which is much more representative. You don't need to have all of your data. For example, in the case of self-driving cars, with hours and hours of footage, how much of that is actually going to be useful and helpful for your model? You can create samples of your dataset which are much more representative with less data. What I really like about them is that they actually tell you how much your dataset was reduced after performing this curation and what your savings are, both in terms of annotation cost and in terms of carbon emissions. So it's really great, and I would highly recommend it in order to have datasets which are more diverse and more representative with even less data.

And then for error analysis, I would recommend Voxel51. They again have a really nice panel for visualizing your images, visualizing the annotations, pinpointing images which have been annotated incorrectly, and then perhaps sending these over for re-annotation, or collecting more
images like these and analyzing the reason behind the error: was it because it's a really interesting edge case, so maybe you need to collect more images like that, or was it because it was not annotated correctly, and so on.

And then finally, before we wrap up, I want to recommend a very important best practice, which is data documentation. There are several different templates for this. We have Datasheets for Datasets; here I'm featuring the Data Nutrition Project and its data nutrition labels. They allow you to have something similar to a food nutrition label, with all of the important features of your dataset: its components, the expected use cases, and different badges or alerts that users of your dataset might need to take into account. So feel free to use this, especially if you're publishing a dataset that will be used by other practitioners, or just by other people in your company or in your team. It's really important in order to make sure that our data is documented in a detailed way and that we're taking note of potential biases, alerts, and so on.

So I would say that this is it for now. Of course there are plenty of other things that we can discuss, but I'm really interested in hearing whether you have any questions about data quality, how to obtain images with high quality, how to improve them iteratively, or anything else that you might have thought about during the presentation.

So I think we don't have any questions posted or from the audience. I'm good. Well, if anybody has a question that they think of after the presentation, or if you watch this afterwards and you want to comment on something or ask anything, get in touch with me. This is my email. You can follow us on social media as well, and I'd be happy to engage in the conversation and talk about quality data and quality AI.

Thanks so much, Eva, for the wonderful talk. It was a very interesting topic. And thank you to everyone who joined us today on Zoom and also on our live streams. If you
want to RSVP for upcoming webinars, we have a few scheduled. You can go to our website; I'll share the link in the chat, and you can RSVP for a couple of upcoming events. We will be posting the recording of this session soon on our YouTube channel, and you'll also see it on our events page. Make sure to follow us and subscribe on YouTube, and if you have any questions, reach out to Eva on our email address. Thanks so much, everyone, for joining. Thank you, Eva. Have a great day.

Thanks, everyone. Bye.
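
Editor's sketch: the review loop described in the talk (a human operator handling low-certainty cases, plus re-annotation of images where the label and the model disagree) can be pictured with a few lines of Python. This is an illustration only, not code from the talk or from any of the tools mentioned. The `Prediction` record, the `triage` function, and the 0.6 threshold are hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Prediction:
    """One model output. All field names here are illustrative assumptions."""
    image_id: str
    predicted_label: str
    confidence: float                   # model score for predicted_label, in [0, 1]
    ground_truth: Optional[str] = None  # None for unlabeled production data


def triage(
    predictions: List[Prediction],
    review_threshold: float = 0.6,      # assumed cut-off; tune per application
) -> Tuple[List[Prediction], List[Prediction]]:
    """Split predictions into two queues for human attention.

    Returns (needs_review, disagreements):
    - needs_review: low-confidence cases a human operator should look at,
      least confident first.
    - disagreements: labeled cases where the model contradicts the label,
      most confident first, since a confident contradiction may point to
      either a labeling mistake or an interesting edge case.
    """
    needs_review = [p for p in predictions if p.confidence < review_threshold]
    disagreements = [
        p for p in predictions
        if p.ground_truth is not None and p.ground_truth != p.predicted_label
    ]
    needs_review.sort(key=lambda p: p.confidence)
    disagreements.sort(key=lambda p: p.confidence, reverse=True)
    return needs_review, disagreements


if __name__ == "__main__":
    batch = [
        Prediction("img_001", "bird", 0.95, "bird"),
        Prediction("img_002", "horse", 0.90, "bird"),  # label says bird
        Prediction("img_003", "bird", 0.40),           # unlabeled, uncertain
    ]
    review_queue, relabel_queue = triage(batch)
    print("send to human operator:", [p.image_id for p in review_queue])
    print("send for re-annotation:", [p.image_id for p in relabel_queue])
```

A single fixed threshold is the simplest possible policy. In a high-risk system the cut-off would more plausibly be set per class or calibrated against the cost of a wrong decision.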

Original Description

Learn how to collect and annotate high-quality data for a computer vision model. Thought leaders in the Artificial Intelligence space such as Andrew Ng have been advocating for a shift from model-centric to data-centric AI. The idea behind this campaign is that AI models can be only marginally improved through tweaks in the algorithm but considerable change can only be achieved by using high-quality data. However, what does "high-quality data" mean and how do we go about ensuring the quality, diversity, and consistency of our dataset? In this talk, we will discuss the practice of collecting and annotating data for your computer vision models and making sure the dataset you are using is representative and free of harmful biases.

About the Presenter: Iva Gumnishka is the founder and CEO of Humans in the Loop, a professional data collection and annotation company focused on building high-quality datasets for computer vision applications. The company is a social enterprise and its mission is to provide dignified work opportunities to refugees and conflict-affected people through annotation projects. Iva holds a degree in Human Rights from Columbia University and she was named Forbes 30 under 30 in 2018.

For further tutorials on the fundamentals of machine learning, check out this exclusive playlist: https://youtube.com/playlist?list=PL8eNk_zTBST-RTog7CPYvRfs1pYRWkPHG

Table of Contents:
0:00 – Introduction
3:28 – Why should we care about high-quality data for computer vision?
8:13 – Algorithmic methods for bias mitigation
9:17 – Model centric to data centric AI
11:06 – Take of EU
28:41 – Canonical large-scale dataset
31:47 – Self-supervised pretraining
36:46 – Image collection and labeling
38:03 – Data quality
40:03 – The future
48:09 – Questions

-- At Data Science Dojo, we believe data science is for everyone. Our data science trainings have been attended by more than 10,000 employees from over 2,500 companies globally, including many leaders in tech like Micr

This video teaches the importance of high-quality data for computer vision models and provides practical tools and techniques for data collection, annotation, and visualization. It highlights the need for a data-centric approach and the potential biases in AI systems.

Key Takeaways

Collect and annotate high-quality data for computer vision models
Use tools like Aquarium, Lightly, and Voxel51 to visualize, curate, and debug datasets, and weigh the ethical costs of crowdsourced labeling on platforms such as Mechanical Turk (used to build ImageNet)
Visualize data for model training and evaluation
Document data for transparency and reproducibility
Identify and mitigate biases in AI systems

💡 High-quality data is crucial for accurate and fair AI decision-making, and a data-centric approach is necessary to ensure the reliability and transparency of AI systems.
