Database Indexes - You Might Be Using Them Wrong

LearnThatStack · Beginner ·🔧 Backend Engineering ·4mo ago

Key Takeaways

The video covers database indexes, B-tree structure, and query optimization, highlighting the importance of proper index design and maintenance in relational databases like Postgress, MySQL, and SQL Server.

Full Transcript

4.2 seconds. That's how long one query was taking in production. A developer added an index and the query dropped to 3 milliseconds. Problem solved, right? Except 6 months later, a different team can't figure out why their rights are crawling. The table has 11 indexes, each one added at the right time for a real reason. Most of them are doing nothing. A few are actively making things worse. We all know indexes make queries faster, but not everyone understands why, when, and how. Today, we're going to understand indexes, what they are under the hood, how B trees work, why the column order in a composite index matters, what a covering index does, and just as important, when adding an index makes your database slower. Let's get into it. Imagine you have 10 million rows. You run this query without an index. The database has exactly one option. Start at row one, check the email column, move to row two, check again, and keep going row by row. All 10 million of them. This is called a full table scan. And it's exactly as bad as it sounds. Assume the database found what it was looking for at row 47. It still has to read the remaining 9,999,953 because it has no way of knowing there isn't another match somewhere further down. A full table scan on a small table is fine. The database rips through a few thousand rows in microsconds, but tables grow. And when they do, that linear scan is why your API has a 2- second response time at 2 a.m. and a 12second one during peak traffic. An index solves this by building a separate data structure, a sorted, organized lookup that lets the database jump directly to the rows it needs without reading the entire table. But an index isn't free. It's a real data structure that lives on disk, competes for memory, and gets maintained every time you insert, update, or delete a row. Almost every relational database, Postgress, MySQL, SQL Server, Oracle uses the same foundational data structure for its indexes. It's called a B tree. Not a binary tree, a B tree. What does the B stand for? Nobody settled on an answer. The original creators, Bayer and McCrate, never committed to one. McCrate later floated Boeing, Balanced, and Bayer as candidates and noted the ambiguity was intentional. For a field that otherwise names things with punishing specificity, they chose to leave their own acronym open-ended. Okay, enough about names, but how does a B tree work? Imagine a sorted list of values. To find 42, you could scan from the beginning. That's the table scan approach. Or you could use something better. A B tree is a hierarchy. At the top, a single root node. That root points to a handful of child nodes. Those children point to more children until you reach the bottom layer, the leaf nodes. The leaves are where the actual indexed values live alongside pointers back to the corresponding row in the table. When you search for email equals dana@acample.com, the database starts at the root, compares your value to the keys stored there, picks the right branch, drops a level, and repeats. At each node, it's eliminating entire sub trees of the problem. A few hops later, it's at the leaf node with your value or certain that it doesn't exist. A B tree with 10 million entries might only be three or four levels deep. Finding any value requires three or four pages from disk, not 10 million rows. That's the difference between a 12-second query and a 3 millisecond one. Not a small improvement, a different category of operation. And there's a reason B trees are wide and shallow rather than tall and narrow. Each node holds many keys, often hundreds, which maps directly into how disc reads work. Databases read data in pages. Typically 8 kilobytes in Postgress in SQL Server, 16 in MySQL. A B tree node is designed to fit neatly inside one page. So each level of the tree costs exactly one page read from disk. The structure isn't arbitrary. It was built around the hardware it runs on. A binary tree holding 10 million values would be about 23 levels deep, 23 disc reads. A B tree holding the same 10 million values with a branching factor of 200 is roughly three levels deep, three disc reads. This is why every major database chose B trees. Another thing to point out, the leaf nodes are linked together in sorted order, a chain running left to right across the bottom of the tree. Technically, this makes it a B+ tree. In a B+ tree, all the actual data lives in the leaves and the leaves are chained together in sequence. Actually, every major database uses B+ trees but calls them B trees. Minor distinction, but worth pointing out. Why does the chained leaf structure matter? Range queries. When the database needs everything between date A and date B or all salaries above a threshold, it doesn't traverse the tree for each individual value. It finds the start point in the tree, then walks forward through the leaves. Sequential, efficient. This will become directly relevant when we get to composite indexes. A single column index is straightforward. Index the email column. Queries filtering on email get fast, but most real queries don't filter on one column. They filter on two, three, sometimes more. That's where composite indexes come in and where the mistakes start to accumulate. A composite index covers multiple columns and the column order matters a lot. Here's an index on the orders table. Customer ID, then status, then created at. Think of it like a phone book sorted by last name, then by first name within each last name group. If you want everyone named Smith, they're all grouped together. Easy. Smith. John. Go to the Smiths. Find Jon within that group, but find everyone named John regardless of last name. The phone book can't help you. The John's are scattered across every section because the book isn't sorted by first name. You'd have to read the whole thing. You've arrived back at a full scan. A composite index works exactly the same way. Sorted by the first column, then by the second within each matching group, then by the third. You can use it efficiently from the left. You can't skip a column and expect the structure to hold. So in what queries is this index actually helpful? Filter on just customer ID. The first column the data is sorted on it fast. Add status within the customer ID equals 5 group. Rows are sorted by status. The index narrows further. All three columns. The index handles the whole thing. This is what the index was designed for. customer ID and created at, but skipping status. The index gets you to customer five efficiently, but within that group, the rows are sorted by status, not created at. So, it finds all of customer 5's rows using the index, then filters the dates from that set the hard way. Partial use, not full use. Filter on just status with no customer ID. The index gives the query planner very little to work with. status is the second column. Without the first column to anchor the search, the sorted structure doesn't help. Some databases can attempt a skip scan to work around this. Instead of giving up, it probes each distinct value in the leading column separately, effectively running one mini lookup per value. Oracle has had it for decades. My SQL added it in 8.0. Postgress in version 18. When the leading column has very few distinct values, it can work reasonably well, but it's situational and a well-ordered index will consistently outperform it. The skip scan is a fallback, not a design strategy. This is called the leftmost prefix rule. In MySQL, it's strict. A composite index is usable only for queries that filter on the leading column or the first two or all three in order. Postgress is more flexible about using trailing columns in isolation but less efficient when it does. Either way, the practical guidance is the same. Design your index so the leading columns match what your queries actually filter on. Remove the foundation and the structure above it loses its footing. That's why column order matters. The same three columns in a different order produce a completely different index that serves completely different queries. Index A is built for give me orders for customer 5. Index B is built for give me all shipped orders from this month. Same three columns, entirely different utility. Column order isn't a detail, it's the design decision. Practical advice that can save you some iteration. Put equality conditions first in a composite index, range conditions last. An equality filter lets the B tree jump to a precisely bounded group. That's as efficient as lookup gets. A range filter, greater than, less than, or between, requires the database to scan forward through the leaf chain. Once that scan starts, no trailing columns in the index can narrow the scan itself. The database can still filter those columns while scanning. My SQL calls this index condition push down, but the scan range is determined by the range column. Equality first keeps that range as tight as possible. filter on status equals shipped and created at after 2024. The correct index is status then created at equality first range second. Reversing the order puts the range column in front which widens the scan unnecessarily. The difference shows up in the query plan and at scale in your response times. So far, we've talked about how an index finds the right rows. But there's a second step that's often overlooked. After the database locates your rows in the index, it usually has to go back to the actual table to retrieve the columns that aren't stored in the index. This is called a table lookup or a bookmark lookup or a heap fetch depending on which database and which docs you're reading. If your query selects 10 columns, but the index only contains the two you filtered on, the database finds the matching rows through the index, then for each one goes back to the table to fetch the remaining eight. Each lookup is a separate random read. When the matching row count is large, those random reads accumulate quickly, sometimes to the point where the query planner decides the index isn't worth using and scans the table sequentially instead. The index helped find the rows. The subsequent lookups cost more than a full scan would have. A covering index eliminates this. It's an index that contains every column the query needs. Not just the wear columns, but select, order by, and group by as well. Everything the query asks for is already in the index. When the index covers the query, the database never touches the main table. Everything it needs is in the leaf nodes. The query runs entirely from the index. In Postgress, you'll see indexonly scan in the query plan. In MySQL, using index appears in the extra column of explain. In SQL Server, the key lookup step disappears entirely. These are the signals that confirm the index is covering the query. Postgress and SQL server both offer an include clause that makes this more surgical. You can attach extra columns to the leaf nodes without adding them to the sort key. The B tree is still organized by customer ID and status. That's the structure the tree uses to navigate, but created at and total ride along in the leaves as payload. They're available when needed without changing how the tree itself is built. It lets you cover specific queries without redesigning the index. But wider indexes take more space on disk and in memory and cost more to maintain on writes. A covering index that includes eight columns might double the size of the index. It makes the most sense for queries that run at high frequency and where the read gain is worth the overhead, not as a general policy, but a specific decision about a specific query. Most of the time when developers think about indexes, they're thinking about reads. That's the natural frame. You have a slow query, you add an index, the query gets fast. The right side of the table gets much less attention. But every index is also a cost on writes. And this accumulates often quietly until it isn't quiet at all. When you insert a row, the database doesn't just write to the table. It writes a new entry into every index on that table, finding the right leaf node, inserting the value in sorted position, potentially splitting a node if it's full. Updates to indexed columns require removing the old entry and inserting the new one. Deletes require cleaning up entries from every index. One operation at the application level becomes several inside the database. A table with 12 indexes didn't get there through carelessness. It got there through 12 separate decisions. Each made for a real reason, none of them reconsidered together. The result is that every insert now costs 13 writes instead of one. Databases have some mitigations. Some can defer secondary index rights using the change buffer and updates to unindexed columns can skip index maintenance entirely, but the direction holds. More indexes mean more write overhead, especially on inserts, and the overhead is significant. There's also the memory cost. Your database keeps frequently accessed index pages in a buffer pool in RAM. 12 indexes competing for that same pool means each one gets a smaller share. When index pages don't fit in memory, they require discreads, which is exactly the problem you were trying to solve by adding indexes in the first place. There are also cases where an index exists and the database chooses not to use it. This may surprise some developers, but the query planner is doing math. If your query returns 30% of the table, doing 30% of 10 million random lookups from the index back to the table is more expensive than reading the table sequentially. The exact tipping point varies. It depends on the database, the row width, and the storage hardware. For wide rows on spinning discs, the crossover might come around 10 to 15%. For narrow rows on SSDs, the index stays useful further. The point is that the planner is making this calculation for you and sometimes the answer is to skip the index. Another common issue, indexing low cardality columns. A status column with three evenly distributed values means the B tree can get you to onethird of the table. Onethird of 10 million rows is still 3.3 million, each requiring a random lookup back to the table. A sequential scan is often cheaper. It's the even distribution that makes low cardality indexes useless, not the low cardality itself. If 98% of rows are active and 1% are pending, the database will use the index efficiently for queries targeting pending, Postgress even lets you create a partial index covering only that rare value. Far more efficient than indexing the entire column. So if a column rarely appears in wear clauses, join conditions or order by, it probably doesn't need an index. Low cardality columns on their own rarely justify one. and speculative indexes added on the theory that they might be needed one day accumulate quietly costing on every right long after the original reason for adding them has been forgotten. The good news is that most databases track which indexes are actually being used in Postgress. PG stat user indexes will surface indexes with an idx scan of zero, meaning they've never been used for a lookup. MySQL has cy.sema unused indexes. SQL Server has cis.dm.db index usage stats. Check periodically. Drop the indexes that aren't earning their keep. When you have a slow query, resist the urge to immediately reach for create index. The instinct is understandable. The habit is worth breaking. Start with the query plan. Run explain or explain analyze and look at what the database is actually doing. Is it scanning the entire table? using an index but bouncing back to the table thousands of times for lookups. Sorting on disk because nothing is providing the order it needs. The plan tells you where the time is going and why. From there, examine the query itself. Which columns appear in the wear clause in join conditions, order by or group by. Those are your candidates. The columns the query is actually asking the database to work with. Then think about the index design. equality columns first, range columns last. Consider whether you can cover the query by including the select columns, and be honest whether the wider index is worth it. Before shipping it, consider the right side. If this table gets hundreds of inserts a second, the new index has a real ongoing cost. Make sure the read improvement justifies it. Finally, come back and check on your indexes. Queries change, features get removed. That index you added for a report nobody runs anymore is still there still costing on every write long after the report is gone. Indexes aren't always the solution. They're B trees with trade-offs. They speed up reads. They slow down writes. And the details column order covering queries and knowing when to leave a column unindexed are worth understanding. Most of the time the difference between a fast database and a slow one isn't hardware. It's whether the people running it understood their indexes, not just added them. If this was useful, a like helps other developers find it. Thanks for watching and see you in the next

Original Description

Database indexes can speed up queries but can also silently tank your write performance. We all know indexes make queries faster. But few understand the B-tree structure underneath, why column order in composite indexes matters, or when adding an index makes things worse. This video covers the fundamentals every developer should know. What you'll learn: - How B-trees organize data and why they're so efficient - Why composite index column order changes everything - What covering indexes do and when to use INCLUDE - When indexes hurt performance (write overhead, buffer pool pressure, low cardinality) Timestamps: 0:00 Intro 0:50 What Problem Do Indexes Solve? 2:08 The B-Tree — How Indexes Actually Work 5:40 Composite Indexes and Why Order Is Everything 10:19 Covering Indexes 12:51 When Indexes Hurt 16:47 Practical Advice More Videos : Software Egineering Basics - https://www.youtube.com/playlist?list=PLWP-VtjCVpWyLNBm3zz_sGyC5mVwiAOvj Software Design - https://www.youtube.com/playlist?list=PLWP-VtjCVpWx7kPq30XRN6O6LjVQ4VL95 #database #postgresql #mysql #sqlserver #btree #programming
Watch on YouTube ↗ (saves to browser)
Sign in to unlock AI tutor explanation · ⚡30

The video teaches the importance of proper database index design and maintenance, covering topics like B-tree structure, composite indexes, and query optimization. Viewers will learn how to optimize database queries, design efficient indexes, and improve database performance.

Key Takeaways
  1. Run explain or explain analyze to look at the query plan
  2. Examine the query itself to identify which columns are being used
  3. Consider the index design, including equality columns first and range columns last
  4. Think about the ongoing cost of the index, especially for high-write tables
  5. Check on your indexes periodically to ensure they're still justified
  6. Drop unused indexes to save resources
💡 Proper index design and maintenance are crucial for optimal database performance, and regular monitoring of indexes can help prevent unnecessary write overhead and improve query efficiency.

Related AI Lessons

Chapters (7)

Intro
0:50 What Problem Do Indexes Solve?
2:08 The B-Tree — How Indexes Actually Work
5:40 Composite Indexes and Why Order Is Everything
10:19 Covering Indexes
12:51 When Indexes Hurt
16:47 Practical Advice
Up next
This Cop Was Held Accountable For His Brutality! #police #lawyer
Hampton Law
Watch →