Building tools and frameworks for large-scale social media mining (by Dr. Juan M. Banda)

Elvis Saravia · Intermediate ·📄 Research Papers Explained ·5y ago


Key Takeaways

The video discusses building tools and frameworks for large-scale social media mining, with a focus on creating data infrastructure for COVID-19 research. Dr. Juan M. Banda presents a social media mining toolkit and a large dataset on COVID-19, highlighting the importance of simplicity, scalability, and reproducibility in data collection and processing.

Full Transcript

Well, good. Okay, perfect. Thank you. Just to put some stuff in context: you already introduced me, so I'm going to skip through that very quickly. I just want to point out that along this academic journey I also got probably around seven to ten years of experience in software development, software engineering, and architecture, so that comes in handy for the stuff we're going to talk about today.

As you pointed out, we released the Social Media Mining Toolkit, and we also released a very big dataset of COVID data. That's basically what you see. I like to call it, colloquially, the sausage: that's what you see, but you rarely see how the sausage is made, and that is what I'm going to talk about today.

Just to start: this came out a few years ago, but it's still very important. Up until now, data is basically the world's most valuable resource. It's the new oil. Getting access to data is one thing, and collecting your own data is a different thing, but being able to use that data is the most important part. Everybody is going to have a bunch of data, and a lot of people do, but generating insightful things out of it is where the challenge is. It's where you separate the people who know from the people who say they do.

So let's talk about this to start with. If you notice, I didn't use "big data" in my title; I used "large scale", because "big data" is kind of jargon to impress people. If I wanted to submit a grant, I'd probably write "big data" a bunch of times. If I want to talk about research, I'll mostly talk about "large scale", because there are very different interpretations of that. It could be ten gigabytes; for some people that's big
data; for other people it's a hundred gigabytes, terabytes, petabytes. It varies widely, because it's not the same thing for computer scientists, physicists, social scientists, or medical researchers. It depends on what field you're working in and who you're working with. I've been told that a hundred observations of something is a lot of data, and I've been on the side where you have petabytes of genomics information that is not really that much data, because it only covers a few thousand people. So having this in context is very important.

All the work that we do, and the work that I try to instill in people, is mostly based on this thing called the KISS principle. This is a principle that was noted by the US Navy in the 60s; that's where it came up. It means "keep it simple, stupid". I know it might sound harsh, but it's not. It's the notion that whatever you do, however you try to do things, try to always keep them simple. Overcomplicated things, extremely verbose and over-the-top things, are cool and everything, but if you want people to use things, if you want to reach a broader audience, you have to keep things simple. Otherwise you're going to narrow your scope or your focus too much.

So here I'm talking about best practices for how to build something. I'm going to talk from the perspective of how we go about building things, at least in my opinion, in my lab, and in my work, and then I'm going to put up the examples of the Social Media Mining Toolkit and the COVID dataset to contextualize the stuff I'm going to be talking about.

A lot of people want to start like this. You're assigned something by somebody, your boss, your professor, anybody who is supervising you, or you and your friends
are doing a class project, whatever. Everybody always wants to say, "Oh, let's use the most hyped tool, let's use whatever is cool now": I guess Hugging Face Transformers, or whatever is just coming out that everybody wants to use, GPT-3, anything, all the state of the art I can find. Is that really the way to go about things? Well, not always. I'm not saying that using the best and most complex tool might not yield the best solution, but it might not yield the most impactful or relevant solution.

So here I always recommend that you identify the scope of the problem. How much data are we talking about? You're not going to build this ridiculous cluster of 100 nodes and 10 petabytes of storage when your use case, the scope of the problem, is data on a hundred people, or a very limited set.

Who's going to use the data is very important. You're not going to build this highly complicated system that you access on a console or via a terminal when you're going to give it to a clinician or a social scientist. You're not going to build this crazy tool with two billion features that's really not going to be used.

Also, how are they going to use the data? It's very different when somebody wants data in Excel or SAS versus when somebody wants data frames or a SQL database. So make sure that all this is clear.

Do you need a cloud-based solution? Most people are using the cloud now, but it's not always needed. At least in my research lab, using the cloud would be prohibitively expensive because of all the extra costs that a lot of people don't think about, like just transferring data in and out. That multiplies the cost, and we actually ended up paying more for data transfer than for hosting and processing. So, obviously,
you want to know where you want to go.

Important stuff when you're building toolkits and frameworks: you want to avoid scope creep. What is this? It's also called requirement creep or kitchen-sink syndrome. We're awesome, right? You're this brand-new computer scientist, brand-new scientist, you've learned all these methods and all these technologies, and you want to put all of it into everything you do. Well, you really don't want to do that. There are a lot of tools out there that do 45 different things, but usually the tools that are more used and more impactful are the tools that do one thing very, very well. You don't want to over-engineer things.

Also very important: you want to know who your audience is. If you're setting up a framework or tools, as I was mentioning earlier, it's very different between fields. For example, the solar physicists I used to work with did all their programming in IDL, and we developed all our systems in Python. Well, that's bad. When you're working with social scientists, a lot of them want to do data analysis in SAS, SPSS, or even Excel, and that's perfectly fine. You don't have to be using pandas or Spark or anything when you don't have that much data, and you don't want to disrupt people's workflows. And with biologists, I find that a lot of them still do all this gnarly scripting in Perl. So you want to know your audience. If I release a tool for biologists, I want it to be able to interface with what they use, rather than what you think they use. This is a software engineering principle.

So the heavy lifting is done. Knowing those main things is what's going to make your tool, process, or framework successful. If you ignore them, then you're going to have
some other issues to deal with later.

Now that we have that out of the way, let's start with the fun stuff: finding the right tool for the job. Always ask. I know that everybody wants to use TensorFlow, or Transformers, or whatever is the latest and greatest, but always ask: do I really need the capabilities of this? Do I need the capabilities of Hadoop to handle, I don't know, a hundred thousand rows of data? Not quite. So always think about what you're trying to use. There's all this cool stuff, and trust me, I've been down a lot of rabbit holes trying all these cool things and trying to make them work with my problem just because I wanted to learn them. But when other people depend on what you're building, you might not want to force them to change their workflows.

Always define a scalable architecture. This is more than relevant now. Not everything needs a database. If it does need a database, do you want SQL, NoSQL, graph? Always have this stuff ironed out from the beginning. Changing things downstream, and I'll talk about this later, is a lot harder than doing it right from the beginning.

Do you need a data lake? This is very popular now: instead of a data warehouse or a bunch of databases, a data lake, where you can have flat files, databases, graphs, knowledge graphs. At least most of the stuff that I've done in the last few years is moving in this direction.

Also, do you want a search engine? If you have something that involves a lot of text data, you might want to have all this stuff indexed in something like Elasticsearch, or anything that allows you to get results quickly. But it's also use-case dependent.

And do you really need real-time pipelines to process stuff, like Pig, Hive, and all these things that companies like
Facebook, or companies that actually do have a lot of real-time data coming in, use? In a lot of scenarios you don't. We're made to teach a lot of these things in school, but a lot of people don't get the point that all these tools and all the stuff that you learn in school is usually a toy example. The focus should be different when you're deploying things or trying to produce something.

Iterative development is something that I always advocate for. You start with a plan, you do the requirements, you design, you implement, and you keep testing, evaluating, and refining things. Almost nothing is going to be static anymore; almost everything is going to change, especially now that you may want to build in new technologies or something that will improve your process. Don't build stuff into it just because it's new; build stuff into it because it's better.

Okay, so I've bored you enough with the more high-level, kind of philosophical principles. Let's talk about how this works in the context of actual tools and actual things that we built.

First I'm going to go back a little bit: why do we want to use social media data? I'm pretty sure every single person has used a social network. Even just by joining Meetup, you are in a social network. This is the way people are connecting now. So why do we want to use social media data?

Well, we want large amounts of data, and social media generates terabytes of data almost per second now, so the access is there. We want timely data: we want to know what's going on now, what's going on today, not real-time but maybe near real-time. Social media is the best for that. If you're waiting for academic publications, that's going to have a lag of several weeks to
months, and the same goes for other peer-reviewed things, if not for preprints anymore. Even the news has a lag of hours for certain things.

We want data that can't be found in traditional systems, and this is one of the most powerful things about social media. A lot of the stuff people talk about on social media is stuff that we don't talk about with our doctor, so it's not documented in an EHR system or in legal registries. You don't go to the DMV and tell them that you like this car because it's super cool and has all these features, but you do go to a forum or Facebook or Twitter to say that you bought this car because it's super cool. If you're looking at a person's records, you're not going to be able to find that information anywhere other than social media. So that's the gold that's there to be extracted. Obviously there's a lot of privacy stuff, and I'll talk about that a little bit later, but that's the reason you want social media data: it's recent, there's a lot of it, and it's data that you won't find anywhere else, even assuming you had access to all the data in the world. And all these social media companies got really smart about it, because now they mine absolutely every single thing you do. It's recorded, and that's basically their business model.

We also want data that is voluntarily provided by the user. I don't want to be snooping in your phone calls or in your conversations, even though Alexa might do some of that stuff. I want stuff that you reported, that you put out there publicly, so there's some sort of transparency, in the sense that I'm not going behind people's backs to get data. However, if you volunteered it and I mined it, well,
it's already in my dataset.

Now, cautions about social media data. Privacy: this is a very, very big one. A lot of people might not understand it, but that's changing; people are more cognizant of what they're sharing. People in Europe especially are better informed than, I would say, people in Latin America or Americans in general (that's a continent). People are starting to turn off GPS all the time, they're starting to turn off location sharing, and they've stopped putting their birthday on their Facebook account. So people are getting better at this.

Permanence: that's also a caution. On Twitter, if you get upset one day, say somebody trolled you too hard, you can delete your Twitter account and that data is gone. Twitter is very good about that. It's bad for researchers, but it's good for people: you delete your stuff and your tweets disappear. If I downloaded a dataset that included some of your tweets, but you deleted your account, I cannot retrieve them anymore. So if there is a list of your tweet IDs and you deleted your account or made it private, I can't go and get them anymore. The permanence of it is good and bad. Also, if you're writing in a forum that goes out of business or goes offline, then that data is gone. So be cautious about always thinking that the tap is going to stay open.

Veracity: this is the biggest one. I can just say on Twitter that I own 17 planes. Is that going to make it true? Probably not. So attributing this data, or whatever statements or things people are saying, is also hard. It's a caution, but it's a necessary evil. I can also go to my clinician and tell them that I've not been smoking, even though I smoked two
packs today. So the veracity is not there either. But at least on social media, people being anonymous, or people having multiple troll accounts, can write whatever crap they want, and it's up to the researcher, or whoever is doing work with that data, to decide if it's real or not. Doing this at scale is a huge problem.

So why Twitter? What about using Facebook data? Well, scraping is strictly against the terms and conditions, and Facebook, after the Cambridge Analytica stuff, got really picky and very paranoid about this, and they enforce it.

What about Reddit data? Well, depending on the subreddit, there's not enough. I do research on aging and elderly populations, and the subreddits about that are very small, whereas on Twitter people talk about it on a daily basis. There are other subreddits that have a lot of data coming in every day, but a lot of them don't.

What about forums? The problem with forums is that, yeah, they're all cool, but a lot of them are behind paywalls and a lot of them have different structures. If there were only one forum software that structured the data the same way throughout the world, that would be awesome, but there isn't. So if you're trying to mine data from them, you're going to have to set up a bunch of crawlers in different ways, the data is going to come formatted differently, and you start to get yourself into a mess. There are also a lot of privacy concerns. A lot of people go to forums, especially the self-help forums like the addiction forums, to post things that feel personal. They just want to put them there in a community that's sort of limited, and then you go and take all of that and make your own analysis out of it. That's a little bit iffy. Also, a lot of the forums explicitly say, "Do not mine this data"; however,
most people don't read that.

What about Weibo, Pinterest, and all these other social networks? Yeah, they're nice, they're cool, but Weibo, for example, is not a small subset of the population, but it surely is a subset; it's not very representative. Pinterest also has a sort of target audience, as do all these other more specialized networks. So that's why we picked Twitter. But I'm not saying that you should not pick these other ones, because your use case, your application, or your intention might be completely different.

On Twitter, we do a lot of health-related research, so we're trying to push this, and we've seen that the community itself is starting to use Twitter: the number of publications on PubMed that mention Twitter and health-related questions is growing every year.

Also, the benefits. As I mentioned, there's good population representation: the age groups are kind of nicely distributed, and it represents multiple countries. The anonymity that people get there allows us to get very unfiltered opinions, which is terrible if you're reading tweets, but it's good because you get honest opinions. The data is freely available; I'll talk a little bit more about that. There are hundreds of millions of tweets generated every day, so you have a constant stream of data coming in, and you can filter it somewhat easily with hashtags and mentions. I'll get a little bit more into that when I talk about how to actually set this up.

The disadvantages: the data is super messy. We do a lot of drug safety analysis on this, or at least try to extract drugs, and this one very popular drug that everybody knows about because of COVID, hydroxychloroquine, is misspelled at least 25 different ways. That's hard to handle. We also have a paper, a preprint of a paper,
that shows that if you ignore misspellings, you are leaving 15% or more of the data on the table. That's a big chunk, especially when you narrow down your research scope to something very small.

Attribution, like I mentioned. Also, the freely available data is only a 1% sample. Of course, Twitter gives you a little taste, and if you want a lot more, you have to pay. Collection is hard, and this is something that we address in my toolkit, which I'll talk about later, because you need to have this ongoing for days or weeks before you get considerable mass and are actually able to do anything with that data.

It has very unique challenges from the NLP and machine learning perspective. It's short-form text, so you can't really use a lot of those nice tools built on full text of thousands of pages. It's more colloquial, and it's very ambiguous and expressive. A lot of the NLP tools that you see, most of them at least up until the last couple of years, didn't work with emojis; that stuff was just parsed out. However, in a lot of the stuff on social media, emojis can actually change the polarity of a sentence, or change the meaning of what you're reading. So all this stuff needs to be addressed.

So how do we harness such data? Well, you can start by downloading an already created dataset. When somebody says, "Oh, I want to do this," the first thing I send them to do is go and look whether somebody else has already done it, in the sense of gathering the data. You'll be amazed how much other people are doing, and with enough Google ninja skills you'll be able to find it and use it. So you can just download something that's already created. Fine, but say there's nothing there, like, for example, our use case that
I'll talk about, around COVID. Right, cool. So how do we get our own data? I assume a lot of the people here are computer scientists, and for a computer scientist, using API calls, getting tokens, using GET requests, and these kinds of things are not a problem. However, most people doing actual social science research, health sciences research, or health care research on this have no freaking idea how to do it. There are some tools, but the majority of those tools are built by computer scientists, and we have a very peculiar way of thinking and of doing things that people in other domains don't.

So we thought, okay, fine, we have all these things that we want to do, and there are a lot of ways to do them. I'm not saying that there were no tools, or no set of different tools that you could put together to do what we wanted, but we said, well, maybe we do need a specific tool. So now I'm actually getting into talking about this tool that we released, the Social Media Mining Toolkit. There are some links there, and the slides will be uploaded somewhere, I'm sure. If not, just Google it; I think, luckily, it's the first thing that comes up.

So how did we stumble upon thinking that we need this thing? Let's start. I assigned a simple task to three of my students. Basically: here's a 60-gigabyte JSON file with several million tweets that a collaborator gave to me. I asked the students because I wanted to see how people do things and what things they do. I asked them: tell me how many tweets we have, just the number; tell me how many unique users we have in that file; and separate all the tweets by identifier, date, text, and user. So, very basic, very trivial tasks.
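
The counting task described here can be sketched in a few lines of Python. This is a minimal sketch, not the speaker's code or the toolkit's. It assumes the file holds one JSON object per line and uses Twitter API v1.1 field names (`id_str`, `created_at`, `text` or `full_text`, and `user.id_str`); the function name `summarize_tweets` and the output header are hypothetical.

```python
import csv
import json


def summarize_tweets(json_path, tsv_path):
    """Count tweets and unique users; write tweet_id, date, user, text as TSV.

    Assumption: one JSON object per line, Twitter API v1.1 field names.
    """
    seen_ids = set()  # tweet IDs already written, used to skip duplicates
    users = set()     # distinct user IDs

    with open(json_path, encoding="utf-8") as src, \
            open(tsv_path, "w", encoding="utf-8", newline="") as dst:
        writer = csv.writer(dst, delimiter="\t")
        writer.writerow(["tweet_id", "date", "user", "text"])
        for line in src:
            line = line.strip()
            if not line:
                continue  # blank lines are not tweets
            record = json.loads(line)
            if "id_str" not in record or "user" not in record:
                continue  # skip records that are not tweets
            tweet_id = record["id_str"]
            if tweet_id in seen_ids:
                continue  # count each tweet once
            seen_ids.add(tweet_id)
            user_id = record["user"]["id_str"]
            users.add(user_id)
            # Prefer the untruncated text when the record carries it.
            text = record.get("full_text") or record.get("text", "")
            # Flatten tabs and newlines so each tweet stays on one row.
            text = " ".join(text.split())
            writer.writerow([tweet_id, record.get("created_at", ""),
                             user_id, text])
    return len(seen_ids), len(users)
```

Even in a sketch this small, choices such as whether to drop duplicate tweet IDs, skip records that are not tweets, or flatten newlines inside the text change the counts and the row layout, so they are worth writing down.
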
But you might think, "Oh yeah, this is cake, anybody can do that." The only requirements that I gave people were that you have to use Python and you have to give me a TSV, or tab-delimited, file with the format tweet ID, date, user, and text. That's it. That seems like something trivial, right? Well, what do you guys think: did everybody return with the same answers? Shocker: no. Wait, isn't everybody using the same file? Yes. Not a single answer matched between the three people, and these three people had undergraduate degrees in computer science, at least. Weird. Interestingly enough, two people even used the same codebase they found on the Internet to do the task, and still their answers did not match. So obviously there's something wrong here, and obviously there's a need to standardize this process.

So we go back to what I talked about at the beginning. We identify the scope of the problem: we need to process the data in a standard way. This tool was originally internal to my lab, but we ended up releasing it because I shared it with other people and they liked it, and they shared it with other people who liked it. So we decided, why not get some citation credit for it and also help out other people who want to use it?

We need to process the data in a standard way. I should not be assigning a task to three people and getting three different answers. Why? Because if this is research work, think about reproducibility: you can't reproduce the results of any of them, or every time you try to reproduce something you will get a different answer. That's really bad. We also need to be able to get data. So I said, okay, let's put all the stuff that we need to do together and build a tool for it. Identify the scope of the problem: we want data in a standard way, we want to be able to get
data, and we want to be able to use the data for typical NLP tasks like annotation, or generating counts, bigrams, trigrams, whatever you want.

We want to avoid scope creep. We don't want to put everything and the kitchen sink into this tool. We do not want to build another machine learning package; there are already hundreds. We do not want to rebuild functions for Twitter calls, in the sense that there are low-level API calls and other software packages that handle them well. We don't want to reinvent the wheel. We do not want to build an NLTK- or spaCy-like tool. Why? Because NLTK and spaCy are pretty good. There's no point in rebuilding all this stuff that somebody else built just because you want to call your tool the one-stop shop that does everything. Those kinds of tools never really work.

This is where we spent a little time thinking. We want to know our audience: ideally CS people. However, social scientists and informaticians have these issues too. You can see it all over the internet: you can go to Stack Overflow and see all these people asking the same thing, and they come from different domains. You can see all these random snippets of code everywhere, provided by people from different areas. So we need to make this as easy as possible to use, which is famous last words for a lot of people. Most people expect that whatever they do is going to fall in this place, but it ends up not being that way.

After talking to many people in different domains, we noticed that a lot of people do not know how to use programs that encapsulate the details of the process. This is kind of counterintuitive. Why? Because you want to encapsulate as much as you can, to give just one call to do something, but a lot of people don't really get what's going
on, or how to tweak something, or what to do if there's an error. The more encapsulated something is, at least for scientists in different domains, not laypeople, the more you obscure things and make it harder for people to use. However, after I talked to a lot of people and made a little poll, we found that everybody does know how to run and change scripts. I guess this is kind of part of the practice now, where you go to Stack Overflow, pull a little piece of code, and try to change it. A lot of researchers have that mentality. I've done it, and many people will keep doing it. So we wanted to build a tool that had this baked in.

To address that other factor, finding the right tool for the job, we used Python. Python is free, it comes standard with most Linux distributions, and it comes standard on a Mac. Many people outside of CS have started to use it, and all these data science trends are moving toward Python.

One of the biggest things: we wanted to define a scalable architecture, and this is where you need to think a little bit big. We wanted to build a toolkit that grows. Right now it only works for Twitter, but you can basically add on Reddit and all this other stuff. We just have three different blocks that make up the tool. First, the data acquisition tools, which are a lot of different little scripts that allow you to do data acquisition for Twitter: hydration, scraping, and all those things. Second, the preprocessing tools, which allow you to parse JSON files and separate them. Again, these things seem very trivial, but when you move into this space, even as a CS person, it's hard to find all this in one place, and imagine somebody outside of CS, who is just completely lost. And third, we want data annotation and standardization
tools, where we can use terminologies and dictionaries, do named entity recognition, and standardize all the outputs for tools like brat or PubAnnotation, and all these other tools that are already there to use. If you notice, we're not scope creeping on anything. We just focus on functional, quick, and well separated.

So in the end, we released this tool. What did we learn from this? The tool was for internal usage, but we decided to release it publicly. One very important thing is that we don't have fancy wrappers for everything. Everybody's obsessed with making these very compact Python packages, which is fine, and I have nothing against them. But when you want to reach a broader set of people, and when you want people to grow your tool or build around it, a tight integration is a lot harder to decouple than something that's kind of loose. At least in our tool, you know where to find the things that you're looking for, and while seemingly primitive, it has been used by people in multiple domains with minimal hassle.

Whenever I go and shop this tool around, I sit down with people and tell them, "Okay, here's the repo; show me how you will use it," and I go through the motions. So all the stuff that I've seen block people over time, I've baked in and improved in the tool, and it's adaptable, in the sense that you can just change the code files to your liking. If you have a very compact package that you install via PyPI or whatever, taking those things apart gets increasingly difficult the more complex they become, versus just having a lot of files there, where each file does a specific thing and you can just say, "I'm going to take this piece from this file and paste it together with a piece from the other file," to build my own workflow or
my own pipeline. So that's what we wanted. We didn't want users to be scrambling around to patch a lot of different tools together; we just wanted them to use the code directly, and it reduces the learning curve to start using Twitter data. And now, with the COVID dataset that we released, a lot of people came out of the woodwork wanting to work with Twitter data and COVID. When we provide this tool, people actually get up to speed quicker, just pulling the data they need and doing the analysis, versus spending weeks or months trying to figure out how to get the data the right way.

Okay, fine. So now we have a tool that does a lot of the social media mining stuff, at large scale or not. How do we set up a framework? This is where I make the difference between a tool and a framework: a framework is basically a set of steps, or at least a set of procedures, that you want to follow to do something. So how do we set up a framework for collecting data?

I mentioned the Twitter data, and this is the good and the bad. They have the free firehose, or whatever they call it now, where a constant 1% of the Twitter feed comes out for free. However, it's as much as you can grab, for as long as you can, so you need a very good framework to be able to collect this data first and then mine it.

For the COVID pandemic, using our tools, we released this big dataset, which is now 513 million tweets. It's only COVID-related chatter, and it will have been downloaded over 23,000 times, I think, by the end of today. So it's a resource that people use. It has a bunch of languages, you have nice visualizations of it, you have different versions. The dataset has geolocations; it has, you
know, place locations enabled. For NLP users, we provide the top 100,000 terms, bigrams, and trigrams. So this is nice and cool. Why do we provide these things? Well, because, see, if you know Twitter data, we cannot share the full Twitter object, the text, or the complete tweet with you, per Twitter's developer terms and conditions, so you have to hydrate it. But in order to let people do quick things, we can share some aggregate statistics, so we share this stuff for people who just want to use bigrams, trigrams, that type of thing, or do other quick things. So that's nice.

Okay, so this dataset sounds cool, but how did we actually get there? How can you produce a dataset like that? Well, you need to define a framework for data collection. And this is the starting point: you need to know what data you want and how to get it. On Twitter, you can use hashtags or keywords to filter data, or you can just grab everything and filter it later. But if you do that, you're going to be getting data that is not relevant and missing data you wanted, because you were getting other data instead. Obviously you have network connectivity limits and limits on the Twitter API, so you're not going to be getting the right data. So you can filter by hashtags or keywords, by language, by location. There are a lot of details here that I only touch on at a high level, but you can do that. For this, we had the COVID hashtags, or coronavirus when it was starting. So you kind of funnel the firehose of water into the parts that you're interested in. You play around with this to maximize your data gathering, in the sense that you can go to Twitter and then type in
a hashtag, and if the hashtag has 10 tweets, then obviously that's not a very popular hashtag. But if you put in a hashtag and it has millions of tweets, with new tweets every second, then you probably want to use that to collect the most.

You need to have a vision of how long you will be doing this, and this is very important. Nobody foresaw, or at least a lot of people didn't foresee, that this pandemic was going to go on into the hundreds of days, and now we're at 120-some days since it was declared a pandemic, seven months since it started in China. So you need to have a basis for how long you want to collect. From the start we said, fine, we want to collect this up until at least one year after the pandemic ends. So we had some rough estimates and decided to architect something able to hold all of that.

A very important thing is that your early decisions will be very hard to fix later. If we decide to collect the data wrong, we can't fix that. If we decide to store the data incorrectly, you can kind of fix that: you can always add more hard drives, you can always compress things. But always keep it simple; don't try to overcomplicate or over-engineer things.

You want to check the infrastructure you have at your disposal and pick an architecture. For our collection we have a data lake, and I'll talk about what's under the hood in a couple of minutes. We want this because of the flexibility and scale it allows us. In a data lake we can just add another network-attached storage device and put more data in it, and it integrates with everything else. If we had a big, central, monolithic database, it would be a lot trickier to add disks, and then we'd need to shard the
data later, so it becomes complicated. Again, I like to keep things simple.

Optimize your processes iteratively. Obviously when you start, the data keeps piling in and you're just trying to get by at first. After a while you get a handle on it and you start optimizing. You want to aim to solve the need at hand, with consideration of how this will work when you have 10x or 100x the data. This is where it's hard to plan at the beginning, or when you're not used to working this way. Why? Because you always think, "Oh, I have 500 gigabytes of RAM, I can just load everything in RAM." Well, it will get to the point where the data no longer fits in 500 gigabytes of RAM. Then what? So you need to think about that from the beginning, or at least over time as you start optimizing your processes.

So, under the hood of how we collect this dataset and the infrastructure we use: we have this big research server with lots of threads, 768 gigs of RAM, and all these hard drives. So we have a beefy computer. It doesn't mean that you need a beefy computer. I also shifted this for a few weeks to a cloud VM under the free tier, and you can still do the same, right? I won't be able to do the processing at the same speed, but I'll be able to do the collection at least. So know what architecture you can play with.

Now we have around 520 million tweets, which are in daily files of JSON objects. If you do the math, we have around 2.5 terabytes of uncompressed data. As for the scripts, and this might shock some people, we only have Python and bash scripts and some visualization stuff in R. We're not using Dask, we're not using Spark or anything like that. We started with something simple, and we've been parallelizing it, we've been optimizing
it for it to still scale better without the need for such a tool. If you're starting this and you want to use Dask but you don't know how to use Dask, you don't want to be learning on data that you need to be relying on. So that's why we keep it simple; we use bare-bones things. Obviously we can tweak it, and we're looking into other pipelines, but for now, for the functional aspect of things, this just works. We can process the whole dataset in parallel in about 150 minutes, and this is going line by line, so that's pretty fast. Obviously, because we have a big computer, we can parallelize this, but still, that is something you architect for. If I can reparse everything that quickly, I don't have a need for some very convoluted piece of software to do it.

We have a data lake architecture. We have all the raw files compressed. We split the data into clean tweets and retweets. Retweets are around 60 to 70% of Twitter, and the retweets are usually the stuff from all the trolls, the bots, and people not saying much of anything. Everything is still broken down by day: we keep daily tracks of things like mentions, hashtags, emojis, languages, and locations for some visualization stuff, so we have those things stored as separate files, but always by date.

We have an intermediate master TSV file with only the fields we use. Twitter JSON objects have potentially up to around 160 different fields. Not all of them are populated, but we only need about 20 for the stuff that we do, so there's no point in processing or opening these huge files when you can just process or open the small ones. We can load all of that data in memory for fast processing. We only have one database, with the tweet details,
so whenever we pull a tweet that is useful for whatever analysis, we can just query the database and get the date, location, and all this extra stuff on it. We used to have an Elasticsearch full-text index on the whole textual data. We kind of deprecated that because the computer it ran on died and I haven't rebuilt it, but that allowed us to query and subset our tweets quickly.

Our bi-weekly GitHub updates are fully automated. We run two bash scripts: one for moving data and pre-processing, and one for putting copies of the data in the git folder and committing. We update the dataset three times a week, and the scripts that do that take 20 minutes to run. I still call them manually. I didn't want scheduled jobs; I'm kind of weird about this, I like to have control at that level. But I can just set up a cron job and get this done automatically.

Weekly Zenodo updates: we update the full dataset every week, and we're on version 19 as of this week. These are also mostly automated. We have the scripts that do the bi-weekly updates, we run those, and then we upload all of that to Zenodo. We manually kick off the R visualization updates, which is just a script that calls all the R scripts, which read the files that live on the filesystem, aggregate them, and do all the visualization automatically. This whole process of uploading the dataset takes around 40 minutes to run. Why? Because we still load the full dataset file in memory just to remove duplicates, which we're moving away from in our next version of the software. So it takes 40 minutes to run. It seems like a lot of work, and it was a lot of work at first. For the first couple of releases I spent several hours doing this, until you keep automating and you keep making your life easier. So what do
we learn? You want to automate things as quickly as you can. Don't be afraid of fixing bad processes, but try to avoid them. Obviously everybody makes mistakes, or you will just do stuff inefficiently when you're doing stuff quickly. One-time expensive computation costs are always good in the long run: having these flat files with everything in them, subset to just what we need, is a one-time expense, and loading that file is a lot faster than loading the full JSON objects. Not everything needs a big database. Not everything needs the newest, most complex tools; we're not using anything super fancy. Always share with others: publish your code, we have our code available; publish the data, we have our data available. Be nice to the community. If you post an issue, I try to respond to it and fix it. If you post a request, I try to address it. A lot of the dataset features that we've been adding over time have been requests from people. Be nice, right? Help the community grow.

So, yep, that was my spiel. Acknowledgments: my PhD student Ramya has done a lot of this work. Collaborations: these are mostly research collaborators that we have, people that provided data to us at the beginning, people that are constantly helping us with questions, and some of the funding we've received.
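
The flat-file step described in the talk (daily files of JSON tweets reduced to a TSV holding only the needed fields, with retweets split from clean tweets) might look roughly like the sketch below. This is not the speaker's code. The field list, the function names, and the use of a `retweeted_status` key to detect retweets are assumptions made for illustration; the talk says about 20 fields are kept but does not name them.

```python
import csv
import io
import json

# Hypothetical field subset; the talk mentions needing about 20 fields
# but does not list them.
FIELDS = ["id_str", "created_at", "lang", "text"]


def is_retweet(tweet):
    # Assumption: retweets carry a "retweeted_status" object,
    # as in Twitter API v1.1 payloads.
    return "retweeted_status" in tweet


def reduce_daily_file(lines, clean_out, retweet_out, fields=FIELDS):
    """Stream JSON-lines tweets into two TSV outputs, keeping only `fields`."""
    writers = {
        False: csv.writer(clean_out, delimiter="\t", lineterminator="\n"),
        True: csv.writer(retweet_out, delimiter="\t", lineterminator="\n"),
    }
    for writer in writers.values():
        writer.writerow(fields)
    counts = {"clean": 0, "retweets": 0, "skipped": 0}
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines
        try:
            tweet = json.loads(line)
        except json.JSONDecodeError:
            tweet = None
        if not isinstance(tweet, dict):
            counts["skipped"] += 1  # malformed or non-object line
            continue
        # Collapse tabs and newlines so each tweet stays on one TSV row.
        row = [" ".join(str(tweet.get(f, "")).split()) for f in fields]
        retweet = is_retweet(tweet)
        writers[retweet].writerow(row)
        counts["retweets" if retweet else "clean"] += 1
    return counts


if __name__ == "__main__":
    sample = [
        json.dumps({"id_str": "1", "lang": "en", "text": "stay safe"}),
        json.dumps({"id_str": "2", "retweeted_status": {}, "text": "RT stay safe"}),
    ]
    clean, retweets = io.StringIO(), io.StringIO()
    print(reduce_daily_file(sample, clean, retweets))
    print(clean.getvalue())
```

Because the function reads one line at a time and writes as it goes, memory use stays flat regardless of file size, and each daily file can be handed to a separate worker process if more speed is needed.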

Original Description

◾ Title: Building tools and frameworks for large-scale social media mining: Creating data infrastructure for COVID-19 research. Slides: https://www.dropbox.com/s/fblk0h56jqohjeq/Building_tools_and_frameworks_for_large-scale_social_media_mining_7-22-2020.pptx?dl=0
◾ Speaker: Dr. Juan M. Banda
◾ Twitter: @drjmbanda
◾ Talk Description: In this talk we will discuss the motivation and rationale behind our Social Media Mining Toolkit (SMMT) (https://github.com/thepanacealab/SMMT), and how to use it to define frameworks for large-scale social media data gathering for NLP and machine learning research projects. We will outline all the lessons learned, mistakes, and hard decisions made to produce and maintain a publicly available large-scale dataset of COVID-19 Twitter chatter data featuring over 424 Million Tweets in 60+ languages and from 60+ countries (https://zenodo.org/record/3911930).
◾ About Speaker: Dr. Juan M. Banda (http://www.jmbanda.com) at his GSU lab, Panacea Lab (http://www.panacealab.org/), works on building machine learning, computer vision, and NLP methods that help to generate insights from multi-modal large-scale data sources. With applications to precision medicine, medical informatics, astroinformatics as well as other domains. Dr. Banda has published over 50 peer reviewed conference and journal papers. Prior to being an assistant professor of Computer Science at Georgia State University, Dr. Banda was a postdoctoral scholar, then a research scientist at Stanford’s center of Biomedical Informatics. He is an active collaborator of the Observational Health Data Sciences and Informatics and his work has been funded by the Department of Veteran Affairs, National Institute of Aging as well as NASA, NSF and NIH.
◾ About dair.ai
Website: https://dair.ai/
GitHub: https://github.com/dair-ai
Twitter: https://twitter.com/dair_ai
Newsletter: https://dair.ai/newsletter/
Slack: https://join.slack.com/t/dairai/shared_invite/zt-dv2dwzj7-F9HT047jIGkunNKv88lQ~g

This video teaches viewers how to build tools and frameworks for large-scale social media mining, with a focus on creating data infrastructure for COVID-19 research. Viewers will learn about the importance of simplicity, scalability, and reproducibility in data collection and processing, and how to apply these principles to real-world problems.

Key Takeaways

Define a framework for data collection
Know what data you want and how to get it on Twitter
Filter data by keywords, hashtags, language, or locations
Share only tweet IDs and aggregate statistics; users hydrate the tweets themselves, per Twitter's developer terms and conditions
Release a dataset of collected data
Process 2.5 terabytes of data in 150 minutes
Optimize data processing for scalability
Use a data lake architecture
Store data in separate files by date
Query a single database for tweet details
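
Two of the takeaways above, filtering by keyword, hashtag, or language and storing data by date, can be illustrated with a small sketch. It is a guess at the general shape, not the toolkit's implementation. The tracked terms, the function names, and the `created_at` timestamp format (the Twitter API v1.1 style) are all assumptions. Only tweet IDs are kept per day, which matches the sharing constraint the talk describes.

```python
from collections import defaultdict
from datetime import datetime, timezone

# Hypothetical tracking terms, for illustration only.
TRACK_TERMS = {"covid19", "coronavirus"}

# Assumed timestamp layout, e.g. "Wed Jul 22 14:03:11 +0000 2020".
TWITTER_TIME_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def matches(tweet, terms=TRACK_TERMS, lang=None):
    """Return True if the tweet text contains any tracked term.

    A simple case-insensitive substring match; `lang` optionally
    restricts results to one language code.
    """
    if lang is not None and tweet.get("lang") != lang:
        return False
    text = tweet.get("text", "").lower()
    return any(term in text for term in terms)


def day_key(tweet):
    """Return the UTC calendar day of the tweet as YYYY-MM-DD."""
    stamp = datetime.strptime(tweet["created_at"], TWITTER_TIME_FORMAT)
    return stamp.astimezone(timezone.utc).strftime("%Y-%m-%d")


def bucket_ids_by_day(tweets, terms=TRACK_TERMS, lang=None):
    """Group the IDs of matching tweets by day, preserving input order."""
    buckets = defaultdict(list)
    for tweet in tweets:
        if matches(tweet, terms, lang):
            buckets[day_key(tweet)].append(tweet["id_str"])
    return dict(buckets)
```

Each key in the result could then be written to its own daily file, so that later reprocessing touches only the days it needs.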

💡 The KISS principle (Keep it Simple Stupid) is a guiding principle for building tools and frameworks for large-scale social media mining, as it allows for scalability, reproducibility, and ease of use.
