What are indexes? Simply put, indexes in a SQL Server database are structures that speed up data access from tables and views. There are two types of indexes: clustered and nonclustered. Since any database table can only have a single primary key (therefore, a single clustered index), nonclustered indexes are what we as developers and analysts can optionally use to improve query performance.
Nonclustered indexes are beneficial whenever their underlying column data values are highly unique. For example, think of AdmissionDateTime – there are over 31.5M possible values within a single year (remember that datetime columns in Data Repository are specific to the second). If we have a datetime column that is commonly queried, it’s an excellent candidate for a nonclustered index. Conversely, the Gender field only has three possible values (M, F and U) regardless of how many rows are in the table. Since the data are NOT discrete, adding an index here, even if we only wanted to query for patients that identify as female, would provide very little benefit.
While the selective creation and use of nonclustered indexes improves query performance, there are also costs associated with those indexes that can result in performance trade-offs. Before we look at creating indexes for any table, we should understand where they provide the most benefit and understand the associated maintenance and system impact to maintain them.
(Microsoft has an excellent introduction to indexes in SQL Server).
Every time we create an index, we’re also modifying the table design. For every nonclustered index in a table, a small (or maybe not so small) additional amount of space is allocated. In addition, as table data are added and updated, indexes are accommodated by the SQL Server database engine – what this means in practice is that if a table has redundant or poorly-maintained indexes, data transfer performance can suffer. Since there are costs associated with indexes, there are two important things to ensure:
- Tables (and both clustered and nonclustered indexes) need to be maintained (defragmented) on a regular basis. Index maintenance is a separate topic but should always be a part of a database maintenance schedule. Even if you have a dedicated database administrator (DBA), make sure that index maintenance is occurring regularly.
- Before creating a nonclustered index, check to see what other indexes may already exist so you don’t create a redundant index. Other vendors or IT staff may also create indexes, so always check the table for existing nonclustered indexes before creating a new one.
When do I need a non-clustered index?
The short answer: when data retrieval is too slow. Since “slow” and “fast” are relative however, there are points to consider. For any report, scheduled job or user query:
- Will anyone notice? A user running a report from SQL Server Reporting Services and waiting a long time for the report to render is a problem. A scheduled report (or data compilation) running overnight when no one’s waiting for it to complete may not be. Keep in mind who the audience is for the data and if slower report run times are really a problem.
- How large is the dataset? Again, this is relative, but two minutes to return a simple census report is a long time, while two minutes to retrieve all last year’s BAR charge transactions isn’t. The size of the underlying table matters too: querying a 100k row table vs. a 10M row table is very different. Be realistic about your expectations.
- Analyze the query with Management Studio execution plan tools. This graphical tool built in to SSMS lets you see how the data for a query are being assembled. If you see lots of table scans (bad) and one table from your query doing most of the work, the execution plan is very helpful to identify tables and columns where nonclustered indexes will help.
- What about efficiency just for the sake of efficiency? In other words, why not create nonclustered indexes by default when I suspect they will be used by many reports? In principle, not a bad idea, but read on for more specific guidelines.
Non-clustered index recommendations
Given the considerations above, here are some general best practices for creating non-clustered indexes in Data Repository:
- The primary key columns are included in every non-clustered index. Therefore, it is recommended to not include them in a new nonclustered index or Include columns (see Advanced below for more on Include columns).
- Whenever you join on VisitID, where VisitID is not part of the primary key, create a nonclustered index on this column. Typical examples: AbstractData, BarVisits, LabSpecimens, MicSpecimens, PhaRx, OeOrders/OmOrd_Main, SchAppointments/CwsAppt_Main. Joins to these tables from whichever table you use for patient name and account number (AdmVisits, AbstractData, RegAcct_Main) will be served by adding an index on VisitID. Note that you don’t need to add an index on any child tables (e.g. LabSpecimenTests) since you would join using SpecimenID, not VisitID.
- As in the previous example, whenever you join on PatientID, where PatientID is not part of the primary key. This is less common, but you may do this when getting patient-based (versus visit-based) data.
- Whenever a datetime datatype column is used as an SP parameter or hard-coded in the WHERE clause. AdmissionDateTime, ServiceDateTime, DischargeDateTime, CollectionDateTime, AdministrationDateTime, etc. By creating a nonclustered index on the datetime column, it will potentially be used by an SP or query that asks for that date value.
- Like #3 above, columns other than datetime datatypes with highly discrete data values that are used in a WHERE clause are candidates for an index. For example, AdmVisits.Status may have up to 20 or more different values. If you wanted to look for “ADM IN” patients only, you’d have to test your code with and without a nonclustered index on the Status field to see if it makes the query faster.
Advanced Index Strategy
There are some cases where more than basic nonclustered indexes are needed. Keep in mind that every database is unique, based on several factors: hardware performance, physical database size, how well the system is (or isn’t) maintained, and of course the nature of the data being requested. Given these variables, the following index techniques are only recommended after working with Acmeware’s designated index experts, and after thoroughly evaluating the workload being produced by the specific stored procedure logic for your client and report requirements. Always test and validate your nonclustered indexes.
- Included columns - Nonclustered indexes can have optional non-key columns specified as “included” when the indexes are created. An index with included non-key columns can significantly improve query performance because the query optimizer can locate all the column values within the index; table or clustered index data is not accessed resulting in fewer disk I/O operations. This is called an Index Only Scan. Let’s use OeOrders or OmOrd_Main as an example. Our standard recommendation for an index on these tables would be on VisitID, to facilitate joins to AdmVisits or RegAcct_Main for patient visit demographics. However, if we’re looking at order data, we’ll almost always use data like OrderDateTime, OrderNumber, CategoryID/Category_OmCatID, etc. on our report as well, so we could add these data output columns as “included” columns in the index. If we access only these values in our query, SQL Server will do an index only scan.
- Filtered indexes - these are another technique we can use to further increase index performance in very specific circumstances. Filtered indexes are great performance boosts if you have a value that is used in a predicate frequently, but that value is only a small amount of the total values for that table. The resulting index will be smaller and cost less to maintain than a full table non-clustered index. A good example is AdmVisitEvents or RegEvent_Events. Typically, when we’re using these tables, we’re looking for specific events of interest – like a status change from observation to inpatient, for example. In this case, let’s say we’ve already got an index on the DateTime associated with the ADM or REG event, but by adding a filter on the Status column, where it only matches “OBSTOIN,” we’re limiting the scope of the index, making it more efficient. A filter condition for an index is a WHERE clause that limits the rows the index applies to. Whenever adding a filter to an index it is recommended to add the column being filtered in the include columns of your index.
Remember that there is no one size fits all solution for these advanced index examples. You should balance the tradeoffs between improved query performance and the overhead and maintenance indexes require. Your specific situation should always be tested before adding more indexes in support of a stored procedure. When you can demonstrate a dramatic decrease in query runtime (say from five minutes to 30 seconds), then adding a nonclustered index is a no-brainer. On the other hand, if you spend hours testing and analyzing an index that only reduces your runtime from ten seconds to five, it may not be worth the effort and overhead.