Tuesday, May 12, 2009

Tuning is a Team Sport

I was working with a Client who was having performance issues with their web site and asked for assistance. The Users were experiencing slow page displays from almost everywhere in the site. The DBA's thought they might have an issue with the web server or database and asked me to take a look. So we enabled Oracle 10046 tracing and collected data while executing a couple of web pages. Our analysis quickly revealed a particular SQL which was taking about 8 seconds to execute for each web page displayed:

SELECT TO_CHAR(MIN(access_event_date),'MM/DD/YY') as first_event,
       TO_CHAR(COUNT(*), '999,999,999,999') as page_hits
FROM abc_access_events
WHERE page_url = :this_url

Note: this_url was being passed '/employee/abc/' in my transaction

When I inquired into the purpose of this query, I was told it was a page counter used to display the total number of times the page was accessed. Hmm, something already seemed a bit off, but I continued to investigate.

This sql was FULL scanning the ABC_ACCESS_EVENTS table which was a fairly good size table of 50,004 blocks and almost 10 million rows. I first looked and noticed no index on PAGE_URL, so naturally my next thought was perhaps we just need an index on the PAGE_URL column to help.

I needed to look at the table and the data selectivity for the PAGE_URL column. When I did, I could see that my query, which was looking at '/employee/abc/', returned only 91,877 rows out of 9,739,865 rows in the table (0.94% row selectivity), but it had to return 35,057 blocks out of the 50,004 blocks in the table (70.11% block selectivity). So even though I was interested in less than 1% of the data, Oracle would need to read in over 70% of the total blocks to satisfy my query. This showed that Oracle was better off FULL scanning the whole table, since my rows were scattered throughout 70.11% of the blocks in the table. This is a great example on the need to focus on the block selectivity of data and not the row selectivity when looking into indexing and data access. (For further info on this subject see Hotsos White Paper When to Use an Index by Cary Millsap.)

SQL> @hds
Table Owner : ABC
Table Name : abc_access_events
Column List : page_url

Table Blocks Below HWM Table Rows
(B) (R)
---------------------- ----------------
50,004 9,739,865

Block Row
Selectivity Block Count Selectivity Row Count
PAGE_URL (pb = b/B) (b) (pr = r/R) (r)
------------------------ ----------- ----------- ----------- ---------
/register/ 98.64% 49,326 8.69% 845,997
/ 98.55% 49,279 24.25% 2,361,734
/contractor/ 95.27% 47,639 7.21% 702,055
/employee/ 94.44% 47,225 4.65% 453,173
/contractor/cc/ 89.31% 44,657 2.22% 215,939
/find-a-place 86.85% 43,430 2.32% 225,562
/contractor/lc/ 82.97% 41,488 1.96% 191,205
/abc/cc/bidding/ 80.04% 40,023 2.45% 238,755
/abc/cc/bidding/file 72.98% 36,491 2.13% 207,610
/employee/abc/ 70.11% 35,057 0.94% 91,877
/project-status/one-proj 68.74% 34,373 2.58% 251,356
/happenings/one 67.92% 33,961 2.98% 290,701
/employee/fbc/ 64.35% 32,177 0.79% 77,102
/news/item 58.29% 29,148 0.81% 79,113
/register/index 58.25% 29,127 0.83% 80,818
/employee/nf/ 58.19% 29,096 0.62% 59,996
/support/ 56.99% 28,499 0.64% 62,626
/employee/ef/ 55.73% 27,867 0.58% 56,649
/contractor/cc/chb/ 55.33% 27,665 0.69% 67,144
/fbc/firs/ 51.48% 25,743 0.69% 67,076
/contractor/sbop/ 49.37% 24,687 0.55% 53,913
/file-storage/ 48.86% 24,431 1.07% 103,837
/bond/ 47.48% 23,743 0.47% 45,420
/employee/abc/rfqs-rfps 47.23% 23,616 0.49% 47,765
/documents 46.68% 23,341 0.42% 41,126
/contractor/cc/contract 46.56% 23,283 1.92% 186,734
Therefore this was not going to be a quick index fix to get a performance improvement. Again, something just didn't feel right with this process. I wanted to understand more about this table and its use. After looking a bit more, I quickly realized that this table was not keeping track of the page counts, but really was used to record EVERY ACCESS to the website. An INSERT statement into this table is executed every time the user displays a page. Tracking records began getting inserted back in June of 2004 and over the years this table has grown to almost 10 million rows and it continues to grow each day. This also explains the "shotgun" type of data selectivity because as a user randomly navigates around the different pages of the website, rows get inserted and therefore PAGE_URL data is scattered throughout the blocks.

SQL> select min(ACCESS_EVENT_DATE) from abc_access_events;


Next I started to think about reorganizing the table with partitioning on the PAGE_URL, so data would be grouped together in the blocks and that would improve performance. I also didn't understand why the Client was keeping track of every page displayed and not just incrementing a counter using a row for each PAGE_URL in a table. This design would really work much faster because the application would just need to update and select a single row from the table. The table would only have a few rows (1 row per web page). I knew that each of these modifications I was pondering, would take a bit of time to implement because we would need to change the schema and possibly even the application code.

I called a meeting with the Systems people, the DBA's, the Developers and the Users to review the situation. I told them the reason for the slow page display was because this query was causing a roughly 8 second delay for each page displayed. It took about 8 seconds for the SQL to FULL scan the ABC_ACCESS_EVENTS table. I informed them that this SQL has probably been slowing down the website for years and it will continue to make it run slower each day. I also told them some of the things we might do to re-architect it and have it perform properly. After presenting this, the Systems people responded that they could probably help by removing disk io hotspots and get the physical io rate to improve to about 5 ms, but that it would take a bit of work to implement. The DBA's then responded that they could migrate the table to a partitioned structure but it would take some work and testing. They would prefer the application get re-architected since that was really a better solution. The Developers said they could re-architect the code but it would take some time and then need to be tested but they could do it if they had to. Finally the User spoke up and responded that they really don't even care about seeing the page count at the bottom of the web pages and that they would be fine if it was simply removed. They said it was never part of the requirements but the Developer just added it as an enhancement.

So as soon as possible, we eliminated this work (truncated the table first, then later removed the application code) and the Client was ecstatic that there website was performing now faster than ever.

The Users solution was of course, the best solution choice. It follows the old Hotsos saying that "The fastest way to do something is don't do it". Why bother going down the path of tuning the disk array, rebuilding table structures and re-architecting your code, when the best solution is just don't do the work. The User's solution also has the residual affect of potentially helping performance in other areas of the application where resource contention was occurring with this transaction. This is a great example of why I like to say that "Tuning is a Team Sport" and that optimization needs to be done with participation across the organization. Quite often each group can propose a solution to a problem, but one solution will likely provide the best return on investment (ROI). If you do not engage the "Whole Team", then you risk missing out on the best solution and will likely waste valuable time, resources and money implementing a sub-optimal solution.


  1. Nice! The one group that almost never gets asked is the users. After all, they're just the ones that cause the problems, not the ones that solve them. It drives me crazy when I ask a technical guy a question and he/she responds that it's a "business requirement". When that happens I usually start looking for someone else to ask.