Data Engineering in E-Commerce | The Best Case Study
Key Takeaways
The video discusses a case study of an e-commerce company's data engineering journey, covering their technology stack, data migration, ETL pipelines, and data warehousing using tools like AngularJS, MySQL, MongoDB, Flask, Amazon RDS, and Apache Spark.
Full Transcript
[Music] In the upcoming videos, we'll discuss a case study of an e-commerce company. Let's say that this company started in the year 2014, when they first launched their website. Year on year, the company grew in terms of user base and sales, and new features were added to the website as and when required. With the increase in user base and features, the underlying technology stack also became more complex. In this case study, we will start from the very first year, when this company hosted their website, and go up to the current time, where they handle millions of transactions in a single day.
We've divided this case study into four parts: the years 2014, 2016, 2017, and 2021. For each of these years, we will set up the context for you: what is the situation of the company at that point in time, what are the requirements of the company, and what are the challenges in meeting them? All these challenges will be related to the technology required to provide a seamless shopping experience for the end user or customer. Finally, we'll look at how data engineering can help us resolve these issues and which tools will be required at each stage.
Let's start with the year 2014. This is where the company has decided to host the website. The very first question for the company is: what technology stack are we going to use initially to set up the website? As a data engineer, you don't have to decide the stack; that will be decided by solution architects. They will go through the pros and cons of multiple tools available in the market and come up with a final design.
Let's say that the team has decided that for the front end we are going to use the AngularJS framework, for the backend databases we are going to use MySQL and MongoDB, for some of the backend scripts we are going to use the Flask framework, which uses Python, and for version control we are going to use Git. Because in this course we are dealing with the data engineering role, we will look at these databases in a bit more detail here. For our website, we are going to use two different types of databases, MySQL and MongoDB. For the transactional data like user details, supplier details, product transactions, and refund requests, we are going to use a relational database like MySQL. And for the product catalog and the clickstream data, we are going to use a NoSQL database like MongoDB. Different categories of products have different features, so a document-based database will be a good choice. If you face any difficulty in understanding the difference between SQL and NoSQL databases, go through the link provided below this video.
All right. Now let's see what the work of a data engineer is in this situation. At this stage, first of all, a data engineer will be required to define the tables and the relationships between them. We need to choose the appropriate data types, keys, indexing, and partitioning. Next, a data engineer is required to write optimized queries and connect them to the front end so that we can serve the requested results to the customer. The company is at a very early stage now, and the data engineering work has not really begun. The use cases that you have seen, like writing queries and defining tables, can easily be handled by a software engineer. Any person who is capable of writing Python, JavaScript, or SQL queries and has some knowledge of Linux shell commands will be able to perform the required tasks. Right now we can assume that around 4,000 to 5,000 customers visit our website every single day and around 100 transactions are happening daily.
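To make the 2014 table-definition work a little more concrete, here is a minimal sketch of two related tables, a foreign key, an index, and one parameterized query. It is only an illustration: it uses Python's built-in `sqlite3` module as a stand-in for MySQL so that it runs anywhere, and the table names, column names, and the `transactions_for_user` helper are assumptions that do not come from the video.

```python
import sqlite3


def create_schema(conn: sqlite3.Connection) -> None:
    """Create two related tables and an index (all names are hypothetical)."""
    # SQLite only enforces foreign keys when this pragma is switched on.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(
        """
        CREATE TABLE IF NOT EXISTS users (
            user_id INTEGER PRIMARY KEY,
            email   TEXT NOT NULL UNIQUE
        );
        CREATE TABLE IF NOT EXISTS transactions (
            transaction_id INTEGER PRIMARY KEY,
            user_id        INTEGER NOT NULL REFERENCES users(user_id),
            amount_cents   INTEGER NOT NULL CHECK (amount_cents >= 0),
            created_at     TEXT NOT NULL
        );
        -- Index chosen to support "all transactions of one user, by time".
        CREATE INDEX IF NOT EXISTS idx_transactions_user
            ON transactions(user_id, created_at);
        """
    )


def transactions_for_user(conn: sqlite3.Connection, user_id: int) -> list:
    """Return (transaction_id, amount_cents) rows for one user, oldest first."""
    cursor = conn.execute(
        "SELECT transaction_id, amount_cents FROM transactions "
        "WHERE user_id = ? ORDER BY created_at",
        (user_id,),  # parameter binding instead of string formatting
    )
    return cursor.fetchall()


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    create_schema(connection)
    connection.execute("INSERT INTO users (user_id, email) VALUES (1, 'a@example.com')")
    connection.execute(
        "INSERT INTO transactions (user_id, amount_cents, created_at) "
        "VALUES (1, 500, '2014-01-02')"
    )
    print(transactions_for_user(connection, 1))
```

In a real MySQL setup the data types and the partitioning options would differ, but the idea is the same: keys describe the relationships, and the index is picked to match the queries the front end actually sends.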
Now after 2 years these numbers have increased, and we will have to make some important changes to the current architecture to handle the data.
In the previous video, we saw that in 2014 we were getting 4,000 to 5,000 customers daily on our website. The number obviously increases over time: in 2015, we got an average of around 50,000 to 60,000 customers daily, and this increased to 100,000 to 150,000 daily in 2016. That's an incredible rise. We also noticed that over the last few years we've received customers from multiple parts of the world. Right now the company has its on-premises setup in India, so customers in India get their requests served on time, whereas requests from other parts of the world might face some delay. As you can see here, around India there are green dots, which means the latency or delay time is low. On the other hand, requests from other parts of the world have a longer delay time, represented by the red dots that you see. So the solution architects of the company have identified that we should place a copy of our database in some other parts of the world as well. This will help us improve the customer experience in those parts of the world. But it's still a difficult task to set up and configure the machines in all those locations on your own, right? So the solution architects suggested that we should move our e-commerce application to the cloud, where we don't have to maintain all of the hardware and we can easily scale if required in the future. They've gone through the pros and cons of multiple cloud service providers and their pricing, and suppose they have selected AWS. Now on AWS we've selected Amazon RDS. Using this service we can get a hosted MySQL instance on AWS. And as the replacement for MongoDB, we are using Amazon DocumentDB on AWS. Don't worry in case you have not heard about RDS or DocumentDB; it's not very important at this point in time.
Just think of these as the replacements for the databases on the cloud. Now the job of the data engineer would be to help in the migration of the data from on-premises to the cloud databases. They need to make sure that the data is available in the correct format as soon as possible and follow the best migration practices. So now the updated architecture looks something like this. Data engineers need to test all the queries again, and if anything is not working, they need to modify the code accordingly.
All right. Now the company stakeholders also want to see daily reports of how the company is performing. So right now the data analyst team extracts the data from these databases every 24 hours, and then they do their analysis and generate reports. The problem arises when this team wants to extract the data from these databases multiple times. It might bring the database down for some time and degrade the user experience, which is what you want to avoid. To deal with this problem, data engineers will launch a few more database instances. These databases will not be involved in serving customer requests; they are only for the data analyst team. After every few hours, as required by the company (for example, every hour), new data will be extracted from the production databases and, using an ETL pipeline, added to the new database instances. If you're confused by the term ETL, it is just the process of extracting, transforming, and loading data. You can refer to the link below this video to know more about ETL. All right. So, coming back, the analyst team can run as many queries as they want on these new database instances, and customers will not be affected in any way. And finally, we can also connect these databases to business intelligence tools like Excel, Tableau, or Power BI.
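The hourly ETL step described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the company's pipeline: two `sqlite3` connections stand in for the production database and the analyst instance, and the `transactions` and `sales_report` tables, their columns, and the watermark approach (remembering the latest timestamp already copied) are all hypothetical choices made for the example.

```python
import sqlite3


def extract(source: sqlite3.Connection, since: str) -> list:
    """Extract: pull only rows created after the last successful run."""
    return source.execute(
        "SELECT transaction_id, user_id, amount_cents, created_at "
        "FROM transactions WHERE created_at > ? ORDER BY created_at",
        (since,),
    ).fetchall()


def transform(rows: list) -> list:
    """Transform: convert cents to currency units and keep only the date part."""
    return [
        (transaction_id, user_id, amount_cents / 100, created_at[:10])
        for transaction_id, user_id, amount_cents, created_at in rows
    ]


def load(target: sqlite3.Connection, rows: list) -> None:
    """Load: write the reshaped rows into the analyst-facing table."""
    target.execute(
        "CREATE TABLE IF NOT EXISTS sales_report ("
        "transaction_id INTEGER PRIMARY KEY, user_id INTEGER, "
        "amount REAL, sale_date TEXT)"
    )
    # INSERT OR REPLACE keeps the load idempotent if a batch is retried.
    target.executemany("INSERT OR REPLACE INTO sales_report VALUES (?, ?, ?, ?)", rows)
    target.commit()


def run_etl(source: sqlite3.Connection, target: sqlite3.Connection, since: str) -> str:
    """Run one ETL cycle and return the new watermark for the next cycle."""
    rows = extract(source, since)
    load(target, transform(rows))
    # Latest timestamp seen, or the old watermark when nothing was new.
    return rows[-1][3] if rows else since
```

Each run passes in the watermark returned by the previous run, so the analysts' copy only ever receives new rows and the production database is queried once per cycle instead of once per analyst.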
So here the data engineer's task will be to create these ETL pipelines, make sure that the data flow between multiple databases is seamless, and connect these databases with a business intelligence tool. At this point, the company needs a full-time data engineer who has knowledge of cloud computing and expertise in one of the cloud service providers like AWS, Azure, or GCP. He or she should be able to build ETL pipelines and should know basic data warehousing concepts and the fundamentals of BI tools like Tableau or Power BI. All of this is needed so that we can connect those databases with these tools as required by the stakeholders. So that's it for the year 2016. After a year, in 2017, the company faces a few more challenges.
We've seen that by the year 2016 we were getting around 100,000 to 150,000 customers per day on our website. In the next year, 2017, this number has grown to 250,000 to 300,000 per day. If you remember, up to the year 2016 this was the architecture of the company, where we added some databases for analysis purposes and connected them with the BI tools. We also created some ETL pipelines. Now let's focus on this part only. When you prepare any report, most of the time there are some complex metrics which involve joins across multiple tables, aggregations, and so on. Calculating all those metrics will not consume that much compute power if you're creating them in batches, for example every 6 hours or every 12 hours as required. But if you want to calculate some of these metrics in real time, that can be a tricky task. To solve this problem, data engineers have decided to use a real-time streaming tool like Kafka, and we will use that with Spark. Also, earlier we were using some extra database instances for the analyst team. Those databases were nothing but a data warehouse. Now we're going to use a scalable data warehouse.
For example, Amazon Redshift. We will store the summarized data in the data warehouse, and the data engineer will build the pipelines so that data will flow through Kafka to the data warehouse and also to some other resources. This is about real-time data, but alongside it we will also require some pipelines to keep running the batch processes for daily, weekly, or monthly reports. So data engineers will have to maintain those as well. So this is the main task of a data engineer: extract, transform, and load, or ETL. You might have heard or read this term a lot in job descriptions for data engineers. The main task is only to transfer the data from source to destination. But as the data grows, the tools to do that become more complex.
Now, we have just seen that we will have to create multiple pipelines. For example, we want to create real-time dashboards to track new users and transactions, we want to initiate the refund requests every 30 minutes, and we want to update the daily reports every 24 hours. We don't want someone from the team to run the scripts every single day or every few hours. You want to automate this process, and for that data engineers have different tools, which we call schedulers, like Airflow. We can use these tools for workflow management. Now, after integrating some new tools, our updated architecture looks something like this. Pretty cool, right? At this stage the company would require a team of data engineers who are experts in designing the data warehouse. They should have knowledge of the Hadoop ecosystem and Apache Spark, and they should be able to work with Airflow or any other scheduler and with streaming tools like Kafka. All right, so that's it for 2017. After 4 years, in 2021, the company faces a few more challenges.
In this video, we look at the challenges in the year 2021.
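As a side note on the 2017 scheduling requirement (refunds every 30 minutes, daily reports every 24 hours), the core idea of a scheduler can be sketched in plain Python. This is not Airflow's API; a real scheduler such as Airflow also handles task dependencies, retries, and monitoring. The job names and the `due_jobs` and `run_due_jobs` helpers are hypothetical; only the two intervals come from the transcript.

```python
from datetime import datetime, timedelta

# Hypothetical job names; the intervals are the ones mentioned in the transcript.
SCHEDULE = {
    "initiate_refunds": timedelta(minutes=30),
    "update_daily_report": timedelta(hours=24),
}


def due_jobs(schedule: dict, last_run: dict, now: datetime) -> list:
    """Return the names of jobs whose interval has elapsed since their last run."""
    due = []
    for name, interval in schedule.items():
        previous = last_run.get(name)
        # A job that has never run is always due.
        if previous is None or now - previous >= interval:
            due.append(name)
    return due


def run_due_jobs(schedule: dict, last_run: dict, now: datetime, runner) -> dict:
    """Run every due job through `runner` and record when it ran."""
    for name in due_jobs(schedule, last_run, now):
        runner(name)
        last_run[name] = now
    return last_run


if __name__ == "__main__":
    history = {}
    start = datetime(2017, 1, 1, 0, 0)
    # First tick: nothing has run yet, so both jobs fire.
    run_due_jobs(SCHEDULE, history, start, print)
    # Thirty minutes later only the refund job is due again.
    run_due_jobs(SCHEDULE, history, start + timedelta(minutes=30), print)
```

Calling `run_due_jobs` from a loop or a cron entry replaces the person who would otherwise run each script by hand, which is the problem the schedulers mentioned above are meant to solve.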
You've seen that by the year 2017 we were getting around 250,000 to 300,000 customers per day on our website. In the next four years, by 2021, this number has grown to 500,000 to 750,000 per day. Now that we have a good amount of traffic on a daily basis, the stakeholders have decided that they want to add a new feature to the website: a recommendation engine. For example, whenever a customer looks for a product or is about to complete a transaction, we will recommend some other products to the customer. As conversions increase, the company will earn more profit. Now the question is, who will be responsible for building this recommendation engine? Will the data engineer build it? Think about it for a second, and I encourage you to put your thoughts in the discussion forum. Let's get a healthy thought process going there. In this case, the company will hire a data scientist, who will build the recommendation engine.
So if you remember, this was the latest architecture that we had. The data scientist will get access to the data warehouse and explore which features of the data can be used to create the recommendation engine. But the data scientist will find that the data warehouse holds only the predefined metrics that are currently required by the company. In case he or she wants to experiment more with the data, the raw form of the data is required. For the raw data, data engineers will create a data lake. It's a place where raw data is stored. So the data engineer's task will be to create this data lake and keep the raw data stored there securely. Once we have the data in the data lake, the data scientist can perform as many experiments as needed and then give us the first version of the recommendation engine. Deploying this engine will be the work of a data engineer. But the data scientist will keep on improving the model every few weeks or months.
And a data engineer also has to define a series of steps to deliver each new version, from testing and validation to deployment, right? These kinds of pipelines are known as CI/CD pipelines, or continuous integration and continuous delivery. To read more about these pipelines, you can refer to the link we've provided below this video. At this stage, the data engineer should know about model deployment and these CI/CD pipelines as well. All right. So, this was all we wanted to cover in this case study. I hope you got an idea of what your work would be if you become a data engineer. I'll see you in the next video. Thank you. [Music]
Original Description
In this video, we have discussed a case study of an e-commerce company
Do subscribe to Analytics Vidhya channel & get regular updates on videos:
Stay on top of your industry by interacting with us on our social channels:
Follow us on Instagram: https://www.instagram.com/analytics_vidhya/
Like us on Facebook: https://www.facebook.com/AnalyticsVidhya/
Follow us on Twitter: https://twitter.com/AnalyticsVidhya
Follow us on LinkedIn: https://www.linkedin.com/company/analytics-vidhya