Data Engineering in E-Commerce | The Best Case Study

Analytics Vidhya · Intermediate ·📰 AI News & Updates ·3y ago

Key Takeaways

The video discusses a case study of an e-commerce company's data engineering journey, covering their technology stack, data migration, ETL pipelines, and data warehousing using tools like AngularJS, MySQL, MongoDB, Flask, Amazon RDS, and Apache Spark.

Full Transcript

[Music] In the upcoming videos, we'll discuss a case study of an e-commerce company. Let's say this company started in 2014, when it first launched its website. Year on year, the company grew in user base and sales, and new features were added to the website as and when required. With the increase in users and features, the underlying technology stack also became more complex. In this case study, we will start from the very first year, when the company hosted its website, and go up to the current time, when it handles millions of transactions in a single day.

We've divided this case study into four parts: the years 2014, 2016, 2017, and 2021. For each of these years, we will set up the context for you: what is the situation of the company at that point in time, what are its requirements, and what challenges stand in the way? All these challenges relate to the technology required to provide a seamless shopping experience for the end user or customer. Finally, we'll look at how data engineering can help resolve these issues and which tools are required at each stage.

Let's start with the year 2014. This is when the company decides to host its website. The very first question for the company is: what technology stack are we going to use initially to set up the website? As a data engineer, you don't have to decide the stack; that will be decided by the solution architects. They will go through the pros and cons of the tools available in the market and come up with a final design.
Let's say the team has decided that for the front end we are going to use the AngularJS framework, for the backend databases we are going to use MySQL and MongoDB, for some of the backend scripts we are going to use the Flask framework, which uses Python, and for version control we are going to use Git. Because this course deals with the data engineering role, we will look at the databases in a bit more detail here.

For our website, we are going to use two different types of databases, MySQL and MongoDB. For transactional data like user details, supplier details, product transactions, and refund requests, we are going to use the relational database, MySQL. For the product catalog and the clickstream data, we are going to use a NoSQL database, MongoDB. Different categories of products have different features, so a document-based database is a good choice. If you have any difficulty understanding the difference between SQL and NoSQL databases, go through the link provided below this video.

All right. Now let's see what the work of a data engineer is in this situation. At this stage, first of all, a data engineer will be required to define the tables and the relationships between them. We need to choose the appropriate data types, keys, indexing, and partitioning. Next, a data engineer is required to write optimized queries and connect them to the front end so that we can serve the requested results to the customer.

The company is at a very early stage now, and the data engineering work has not really begun. The use cases you have seen, like writing queries and defining tables, can easily be done by a software engineer. Anyone who can write Python or JavaScript and SQL queries and has some knowledge of Linux shell commands will be able to perform the required tasks. Right now we can assume that around 4,000 to 5,000 customers visit our website every day and around 100 transactions happen daily.
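
To make the 2014 tasks concrete (defining tables, keys, relationships, an index, and a query for the front end), here is a minimal sketch. It assumes an in-memory SQLite database as a stand-in for MySQL so it runs anywhere; the table names, columns, and the `order_history` helper are hypothetical and not taken from the video.

```python
import sqlite3

# In-memory SQLite stands in for the MySQL instance (an assumption for portability).
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled

# Hypothetical schema: one parent table, one child table, and an index.
conn.executescript("""
CREATE TABLE users (
    user_id    INTEGER PRIMARY KEY,
    email      TEXT NOT NULL UNIQUE,
    created_at TEXT NOT NULL
);
CREATE TABLE transactions (
    txn_id     INTEGER PRIMARY KEY,
    user_id    INTEGER NOT NULL REFERENCES users(user_id),
    amount     REAL NOT NULL CHECK (amount >= 0),
    created_at TEXT NOT NULL
);
-- Index the column the order-history query filters on.
CREATE INDEX idx_txn_user ON transactions(user_id);
""")

conn.execute("INSERT INTO users VALUES (1, 'a@example.com', '2014-01-05')")
conn.executemany(
    "INSERT INTO transactions (user_id, amount, created_at) VALUES (?, ?, ?)",
    [(1, 499.0, "2014-01-06"), (1, 120.5, "2014-01-09")],
)


def order_history(connection, user_id):
    """Return (txn_id, amount, created_at) rows for one user, newest first."""
    return connection.execute(
        "SELECT txn_id, amount, created_at FROM transactions "
        "WHERE user_id = ? ORDER BY created_at DESC",
        (user_id,),
    ).fetchall()


history = order_history(conn, 1)
```

The parameterized query (`?` placeholders) is the kind of function a Flask endpoint could call to serve results to the front end; the foreign key prevents transactions from pointing at users that do not exist.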
Now, after two years, these numbers have increased, and we will have to make some important changes to the current architecture to handle the data. In the previous video, we saw that in 2014 we were getting 4,000 to 5,000 customers daily on our website. The number obviously increases over time: in 2015 we got an average of around 50,000 to 60,000 customers daily, and this increased to 100,000 to 150,000 daily in 2016. That's an incredible rise.

We also noticed that over the last few years we've received customers from multiple parts of the world. Right now the company has its on-premises setup in India, so customers in India get their requests served on time, whereas requests from other parts of the world may face some delay. As you can see here, around India there are green dots, which means the latency, or delay time, is low. Requests from other parts of the world have a longer delay time, represented by the red dots.

So the solution architects of the company have identified that we should place a copy of our database in other parts of the world as well. This will help us improve the customer experience in those regions. But it's still a difficult task to set up and configure the machines in all those places on your own, right? So the solution architects suggested that we move our e-commerce application to the cloud, where we don't have to maintain all of the hardware and can easily scale if required in the future. They've gone through the pros and cons of multiple cloud service providers and their pricing, and suppose they have selected AWS.

On AWS we've selected Amazon RDS. Using this service, we can get a hosted MySQL instance on AWS. As the replacement for MongoDB, we are using Amazon DocumentDB. Don't worry if you have not heard about RDS or DocumentDB; it's not very important at this point in time.
Just think of these as the replacements for our databases on the cloud. Now the job of the data engineer is to help migrate the data from the on-premises databases to the cloud databases. They need to make sure that the data is available in the correct format as soon as possible and follow the best migration practices. So the updated architecture now looks something like this. Data engineers need to test all the queries again, and if anything is not working, they need to modify the code accordingly.

All right. Now the company stakeholders also want to see daily reports of how the company is performing. So right now the data analyst team extracts the data from these databases every 24 hours, does its analysis, and generates reports. The problem arises when this team wants to extract data from these databases multiple times. It might bring the database down for some time and degrade the user experience, which is what you want to avoid.

To deal with this problem, data engineers will launch a few more database instances. These instances will not be involved in serving customer requests; they are only for the data analyst team. Every few hours, as required by the company (for example, every hour), new data will be extracted from the production databases and, using an ETL pipeline, added to the new database instances. If you're confused by the term ETL, it is just the process of extracting, transforming, and loading data. You can refer to the link below this video to learn more about ETL.

All right. So, coming back: the analyst team can run as many queries as they want on these new database instances, and customers will not be affected in any way. Finally, we can also connect these databases to business intelligence tools like Excel, Tableau, or Power BI.
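
The hourly ETL job described above can be sketched as follows. This is a toy illustration, not the company's pipeline: two in-memory SQLite databases stand in for the production database and the analyst-only instance, and the table names, the `run_etl` function, and the sample rows are all hypothetical.

```python
import sqlite3


def make_source():
    """Build a toy production database (stand-in for the customer-facing instance)."""
    src = sqlite3.connect(":memory:")
    src.execute(
        "CREATE TABLE transactions (txn_id INTEGER PRIMARY KEY, amount REAL, created_at TEXT)"
    )
    src.executemany(
        "INSERT INTO transactions VALUES (?, ?, ?)",
        [
            (1, 100.0, "2016-03-01 09:15:00"),
            (2, 250.0, "2016-03-01 09:40:00"),
            (3, 75.0, "2016-03-01 10:05:00"),
        ],
    )
    return src


def make_target():
    """Build the empty analytics database used only by the analyst team."""
    tgt = sqlite3.connect(":memory:")
    tgt.execute(
        "CREATE TABLE hourly_sales (hour TEXT PRIMARY KEY, txn_count INTEGER, revenue REAL)"
    )
    return tgt


def run_etl(src, tgt, since, until):
    """Copy one time window [since, until) from source to target as hourly totals."""
    # Extract: read only the rows in the window, not the whole table.
    rows = src.execute(
        "SELECT amount, created_at FROM transactions "
        "WHERE created_at >= ? AND created_at < ?",
        (since, until),
    ).fetchall()

    # Transform: aggregate transactions into per-hour counts and revenue.
    summary = {}
    for amount, created_at in rows:
        hour = created_at[:13] + ":00"  # "YYYY-MM-DD HH:00"
        count, revenue = summary.get(hour, (0, 0.0))
        summary[hour] = (count + 1, revenue + amount)

    # Load: replace by key so re-running the same window does not duplicate rows.
    tgt.executemany(
        "INSERT OR REPLACE INTO hourly_sales VALUES (?, ?, ?)",
        [(hour, count, revenue) for hour, (count, revenue) in summary.items()],
    )
    tgt.commit()
    return len(rows)


source, target = make_source(), make_target()
extracted = run_etl(source, target, "2016-03-01 09:00:00", "2016-03-01 10:00:00")
```

Extracting by time window keeps the load on the production database small, and loading with replace-by-key makes the job safe to re-run after a failure.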
So here the data engineer's task will be to create these ETL pipelines, make sure that the data flow between the databases is seamless, and connect these databases with a business intelligence tool. At this point, the company needs a full-time data engineer who has knowledge of cloud computing and expertise in any of the cloud service providers like AWS, Azure, or GCP. He or she should be able to build ETL pipelines and should know basic data warehousing concepts and the fundamentals of BI tools like Tableau or Power BI. All of this is needed so that we can connect those databases with these tools as required by the stakeholders. So that's it for the year 2016.

A year later, in 2017, the company faces a few more challenges. We've seen that by 2016 we were getting around 100,000 to 150,000 customers per day on our website. In 2017, this number has grown to 250,000 to 300,000 per day. If you remember, up to 2016 this was the architecture of the company, where we added some databases for analysis purposes and connected them with the BI tools. We also created some ETL pipelines. Now let's focus on this part only.

When you prepare any report, most of the time there are some complex metrics that involve joins across multiple tables, aggregations, and so on. Calculating all those metrics will not consume that much compute power if you create them in batches, for example every 6 hours or every 12 hours as required. But if you want to calculate some of these metrics in real time, that can be tricky. To solve this problem, data engineers have decided to use a real-time streaming tool like Kafka, and we will use it with Spark.

Also, earlier we were using some extra database instances for the analyst team. Those databases were nothing but a data warehouse. Now we're going to use a scalable data warehouse.
For example, Amazon Redshift. We will store the summarized data in the data warehouse, and the data engineers will build the pipelines so that data flows through Kafka to the data warehouse and to some other resources. That covers real-time data, but alongside it we will also need pipelines that keep running the batch processes for daily, weekly, or monthly reports. So data engineers will have to maintain those as well.

So this is the main task of a data engineer: extract, transform, and load, or ETL. You might have come across this term a lot when reading job descriptions for data engineers. The main task is simply to move data from source to destination, but as the data grows, the tools to do that become more complex.

We have just seen that we will have to create multiple pipelines. For example, we want to create real-time dashboards to track new users and transactions, initiate refund requests every 30 minutes, and update the daily reports every 24 hours. We don't want someone from the team to run the scripts every single day or every few hours. You want to automate this process, and for that data engineers have tools called schedulers, like Airflow. We can use these tools for workflow management.

Now, after integrating some new tools, our updated architecture looks something like this. Pretty cool, right? At this stage the company would require a team of data engineers who are experts in designing the data warehouse. They should know the Hadoop ecosystem and Apache Spark, and they should be able to work with Airflow or another scheduler and with streaming tools like Kafka. All right, so that's it for 2017.

Four years later, in 2021, the company faces a few more challenges. In this video, we look at the challenges in the year 2021.
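
Looking back at the scheduling requirement from the 2017 stage (refund requests every 30 minutes, daily reports every 24 hours), the core idea of a scheduler can be sketched in plain Python. This is not Airflow's API; it is a standard-library stand-in that only shows interval-based triggering, and the job names and the `tick` function are hypothetical.

```python
from datetime import datetime, timedelta

# Hypothetical job names; the intervals come from the examples in the transcript.
JOBS = {
    "initiate_refund_requests": timedelta(minutes=30),
    "update_daily_reports": timedelta(hours=24),
}


def due_jobs(last_run, now):
    """Return the names of jobs whose interval has elapsed since their last run."""
    return sorted(
        name for name, interval in JOBS.items() if now - last_run[name] >= interval
    )


def tick(last_run, now, runner):
    """Run every due job once and record the new last-run time."""
    ran = due_jobs(last_run, now)
    for name in ran:
        runner(name)
        last_run[name] = now
    return ran


start = datetime(2017, 6, 1, 0, 0)
last_run = {name: start for name in JOBS}
log = []  # stands in for actually executing the pipeline scripts

first = tick(last_run, start + timedelta(minutes=30), log.append)
second = tick(last_run, start + timedelta(minutes=45), log.append)
third = tick(last_run, start + timedelta(hours=24), log.append)
```

A workflow manager such as Airflow adds what this sketch leaves out: dependencies between tasks, retries, backfills, and monitoring.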
You've seen that by 2017 we were getting around 250,000 to 300,000 customers per day on our website. Over the next four years, by 2021, this number has grown to 500,000 to 750,000 per day. Now that we have a good amount of daily traffic, the stakeholders have decided that they want to add a new feature to the website: a recommendation engine. For example, whenever a customer looks for a product or is about to complete a transaction, we will recommend some other products to the customer. As conversions increase, the company will make more profit.

Now the question is: who will be responsible for building this recommendation engine? Will the data engineer build it? Think about it for a second, and I encourage you to put your thoughts in the discussion forum. Let's get a healthy thought process going there. In this case, the company will hire a data scientist to build the recommendation engine.

If you remember, this was the latest architecture that we had. The data scientist will get access to the data warehouse and explore which features in the data can be used to create the recommendation engine. But the data scientist will find that the data warehouse holds only the predefined metrics currently required by the company. If he or she wants to experiment more with the data, the raw form of the data is required. For the raw data, data engineers will create a data lake, which is a place where raw data is stored. So the data engineer's task will be to create this data lake and keep the raw data stored there securely.

Once we have the data in the data lake, the data scientist can perform as many experiments as needed and then give us the first version of the recommendation engine. Deploying this engine will be the work of a data engineer. But the data scientist will keep improving the model every few weeks or months.
And a data engineer also has to define a series of steps to deliver each new version, from testing and validation to deployment. These kinds of pipelines are known as CI/CD pipelines, or continuous integration and continuous delivery. To read more about these pipelines, you can refer to the link we've provided below this video. At this stage, the data engineer should know about model deployment and these CI/CD pipelines as well.

All right. So, this was all we wanted to cover in this case study. I hope you got an idea of what your work would be if you become a data engineer. I'll see you in the next video. Thank you. [Music]
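
The delivery steps mentioned for the 2021 stage (testing, validation, then deployment of each new model version) can be sketched as a simple promotion gate. This is an illustrative assumption about how such a gate might look, not the company's pipeline; `ModelVersion`, `promote`, the version names, and the scores are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ModelVersion:
    name: str
    unit_tests_passed: bool   # result of the testing stage
    validation_score: float   # offline metric on held-out data (higher is better)


def promote(candidate, production, min_gain=0.0):
    """Decide whether a candidate model replaces the production model.

    The stages mirror the transcript: testing, validation, then deployment.
    Returns a (deployed_version, reason) pair.
    """
    if not candidate.unit_tests_passed:
        return production, "rejected: tests failed"
    if candidate.validation_score < production.validation_score + min_gain:
        return production, "rejected: no validation gain"
    return candidate, "deployed"


v1 = ModelVersion("recsys-v1", True, 0.61)
v2 = ModelVersion("recsys-v2", True, 0.64)
v3 = ModelVersion("recsys-v3", False, 0.70)

live, reason_v2 = promote(v2, v1)    # passes both gates, so v2 goes live
live, reason_v3 = promote(v3, live)  # fails testing, so v2 stays live
```

In a real CI/CD setup each gate would be a pipeline stage triggered by a new commit or a new model artifact; the point here is only that a version reaches deployment after it has passed testing and validation.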

Original Description

In this video, we have discussed a case study of an e-commerce company.

This video teaches how an e-commerce company can design and implement a data engineering system to handle large amounts of customer data, using various tools and technologies. The company's journey from using on-premises databases to migrating to cloud databases and implementing ETL pipelines is discussed. The video provides insights into the importance of data engineering in e-commerce and how it can help businesses make data-driven decisions.

Key Takeaways

Help in migration of data from on-premises to cloud databases
Make sure data is available in the correct format as soon as possible
Follow best migration practices
Test all queries again and modify code accordingly
Launch a few more database instances for the analyst team
Build ETL pipelines
Create real-time dashboards
Initiate refund requests
Update daily reports
Design data warehouse

💡 The video highlights the importance of data engineering in e-commerce and how it can help businesses make data-driven decisions by providing real-time insights into customer behavior and preferences.
