It is once again approaching that time of year for the Hotsos Symposium. This will be the eighth year for this grand event which uniquely brings together many of the top Oracle performance experts for 3-4 full days of sessions and networking. For those who don't know, it is March 7 - 11, 2010 in Dallas, TX at the Omni Mandalay Hotel.
Personally this will be my seventh Hotsos Symposium, as I missed the first year when I was in the process of moving from Oracle to Hotsos. One of my big fears of leaving Oracle after 15 years, was that I might be losing my connections to Oracle experts. Fortunately that was not the case and after joining Hotsos and attending events like the Hotsos Symposium, I quickly realized just how many Oracle experts exist outside of Oracle the Company. This event has helped me to build a stronger network of Oracle performance friends than I had when I was working at Oracle. The Hotsos Symposium has even enabled me to build stronger relationships with some of the co-workers I left behind at Oracle.
When I am asked to describe the Symposium, I usually say it is a mini Oracle OpenWorld focused on Oracle performance. It is a once a year event where top Oracle performance experts from around the world come together to present, network, mingle and mix in a little fun as well (Disco Night this year). Oh, I can't leave out the great food and accommodations provided by the Omni Mandalay Hotel.
If you have been to one of the previous Symposiums, then you already know what a great event and opportunity it is to once again rub elbows with many of the top Oracle performance experts in the world. So tell your friends! If you have not previously attended, then this is your opportunity to hear, meet and mingle with people like: Tom Kyte (Keynote Speaker), Tanel Poder (Training Day Speaker) and many other greats on the speaker list.
So, I hope to see you at the Hotsos Symposium this year. Or as my oldest daughter calls it, "the time of year when you travel to Dallas and miss my birthday". Yes, her birthday is Mar 8th. and once again I will be away in Dallas on this day. Sorry Amanda, I promise we will celebrate your birthday when I get home (as usual) - Love Dad.
Friday, February 19, 2010
Tuesday, May 12, 2009
Tuning is a Team Sport
I was working with a Client who was having performance issues with their web site and asked for assistance. The Users were experiencing slow page displays from almost everywhere in the site. The DBA's thought they might have an issue with the web server or database and asked me to take a look. So we enabled Oracle 10046 tracing and collected data while executing a couple of web pages. Our analysis quickly revealed a particular SQL which was taking about 8 seconds to execute for each web page displayed:
SELECT TO_CHAR(MIN(access_event_date),'MM/DD/YY') as first_event,
TO_CHAR(COUNT(*), '999,999,999,999') as page_hits
FROM abc_access_events
WHERE page_url = :this_url
Note: this_url was being passed '/employee/abc/' in my transaction
When I inquired into the purpose of this query, I was told it was a page counter used to display the total number of times the page was accessed. Hmm, something already seemed a bit off, but I continued to investigate.
This sql was FULL scanning the ABC_ACCESS_EVENTS table which was a fairly good size table of 50,004 blocks and almost 10 million rows. I first looked and noticed no index on PAGE_URL, so naturally my next thought was perhaps we just need an index on the PAGE_URL column to help.
I needed to look at the table and the data selectivity for the PAGE_URL column. When I did, I could see that my query, which was looking at '/employee/abc/', returned only 91,877 rows out of 9,739,865 rows in the table (0.94% row selectivity), but it had to return 35,057 blocks out of the 50,004 blocks in the table (70.11% block selectivity). So even though I was interested in less than 1% of the data, Oracle would need to read in over 70% of the total blocks to satisfy my query. This showed that Oracle was better off FULL scanning the whole table, since my rows were scattered throughout 70.11% of the blocks in the table. This is a great example on the need to focus on the block selectivity of data and not the row selectivity when looking into indexing and data access. (For further info on this subject see Hotsos White Paper When to Use an Index by Cary Millsap.)
SQL> select min(ACCESS_EVENT_DATE) from abc_access_events;
MIN(ACCES
---------
28-JUN-04
Next I started to think about reorganizing the table with partitioning on the PAGE_URL, so data would be grouped together in the blocks and that would improve performance. I also didn't understand why the Client was keeping track of every page displayed and not just incrementing a counter using a row for each PAGE_URL in a table. This design would really work much faster because the application would just need to update and select a single row from the table. The table would only have a few rows (1 row per web page). I knew that each of these modifications I was pondering, would take a bit of time to implement because we would need to change the schema and possibly even the application code.
I called a meeting with the Systems people, the DBA's, the Developers and the Users to review the situation. I told them the reason for the slow page display was because this query was causing a roughly 8 second delay for each page displayed. It took about 8 seconds for the SQL to FULL scan the ABC_ACCESS_EVENTS table. I informed them that this SQL has probably been slowing down the website for years and it will continue to make it run slower each day. I also told them some of the things we might do to re-architect it and have it perform properly. After presenting this, the Systems people responded that they could probably help by removing disk io hotspots and get the physical io rate to improve to about 5 ms, but that it would take a bit of work to implement. The DBA's then responded that they could migrate the table to a partitioned structure but it would take some work and testing. They would prefer the application get re-architected since that was really a better solution. The Developers said they could re-architect the code but it would take some time and then need to be tested but they could do it if they had to. Finally the User spoke up and responded that they really don't even care about seeing the page count at the bottom of the web pages and that they would be fine if it was simply removed. They said it was never part of the requirements but the Developer just added it as an enhancement.
So as soon as possible, we eliminated this work (truncated the table first, then later removed the application code) and the Client was ecstatic that there website was performing now faster than ever.
The Users solution was of course, the best solution choice. It follows the old Hotsos saying that "The fastest way to do something is don't do it". Why bother going down the path of tuning the disk array, rebuilding table structures and re-architecting your code, when the best solution is just don't do the work. The User's solution also has the residual affect of potentially helping performance in other areas of the application where resource contention was occurring with this transaction. This is a great example of why I like to say that "Tuning is a Team Sport" and that optimization needs to be done with participation across the organization. Quite often each group can propose a solution to a problem, but one solution will likely provide the best return on investment (ROI). If you do not engage the "Whole Team", then you risk missing out on the best solution and will likely waste valuable time, resources and money implementing a sub-optimal solution.
SELECT TO_CHAR(MIN(access_event_date),'MM/DD/YY') as first_event,
TO_CHAR(COUNT(*), '999,999,999,999') as page_hits
FROM abc_access_events
WHERE page_url = :this_url
Note: this_url was being passed '/employee/abc/' in my transaction
When I inquired into the purpose of this query, I was told it was a page counter used to display the total number of times the page was accessed. Hmm, something already seemed a bit off, but I continued to investigate.
This sql was FULL scanning the ABC_ACCESS_EVENTS table which was a fairly good size table of 50,004 blocks and almost 10 million rows. I first looked and noticed no index on PAGE_URL, so naturally my next thought was perhaps we just need an index on the PAGE_URL column to help.
I needed to look at the table and the data selectivity for the PAGE_URL column. When I did, I could see that my query, which was looking at '/employee/abc/', returned only 91,877 rows out of 9,739,865 rows in the table (0.94% row selectivity), but it had to return 35,057 blocks out of the 50,004 blocks in the table (70.11% block selectivity). So even though I was interested in less than 1% of the data, Oracle would need to read in over 70% of the total blocks to satisfy my query. This showed that Oracle was better off FULL scanning the whole table, since my rows were scattered throughout 70.11% of the blocks in the table. This is a great example on the need to focus on the block selectivity of data and not the row selectivity when looking into indexing and data access. (For further info on this subject see Hotsos White Paper When to Use an Index by Cary Millsap.)
Therefore this was not going to be a quick index fix to get a performance improvement. Again, something just didn't feel right with this process. I wanted to understand more about this table and its use. After looking a bit more, I quickly realized that this table was not keeping track of the page counts, but really was used to record EVERY ACCESS to the website. An INSERT statement into this table is executed every time the user displays a page. Tracking records began getting inserted back in June of 2004 and over the years this table has grown to almost 10 million rows and it continues to grow each day. This also explains the "shotgun" type of data selectivity because as a user randomly navigates around the different pages of the website, rows get inserted and therefore PAGE_URL data is scattered throughout the blocks.
SQL> @hds
Table Owner : ABC
Table Name : abc_access_events
Column List : page_url
Table Blocks Below HWM Table Rows
(B) (R)
---------------------- ----------------
50,004 9,739,865
Block Row
Selectivity Block Count Selectivity Row Count
PAGE_URL (pb = b/B) (b) (pr = r/R) (r)
------------------------ ----------- ----------- ----------- ---------
/register/ 98.64% 49,326 8.69% 845,997
/ 98.55% 49,279 24.25% 2,361,734
/contractor/ 95.27% 47,639 7.21% 702,055
/employee/ 94.44% 47,225 4.65% 453,173
/contractor/cc/ 89.31% 44,657 2.22% 215,939
/find-a-place 86.85% 43,430 2.32% 225,562
/contractor/lc/ 82.97% 41,488 1.96% 191,205
/abc/cc/bidding/ 80.04% 40,023 2.45% 238,755
/abc/cc/bidding/file 72.98% 36,491 2.13% 207,610
/employee/abc/ 70.11% 35,057 0.94% 91,877
/project-status/one-proj 68.74% 34,373 2.58% 251,356
/happenings/one 67.92% 33,961 2.98% 290,701
/employee/fbc/ 64.35% 32,177 0.79% 77,102
/news/item 58.29% 29,148 0.81% 79,113
/register/index 58.25% 29,127 0.83% 80,818
/employee/nf/ 58.19% 29,096 0.62% 59,996
/support/ 56.99% 28,499 0.64% 62,626
/employee/ef/ 55.73% 27,867 0.58% 56,649
/contractor/cc/chb/ 55.33% 27,665 0.69% 67,144
/fbc/firs/ 51.48% 25,743 0.69% 67,076
/contractor/sbop/ 49.37% 24,687 0.55% 53,913
/file-storage/ 48.86% 24,431 1.07% 103,837
/bond/ 47.48% 23,743 0.47% 45,420
/employee/abc/rfqs-rfps 47.23% 23,616 0.49% 47,765
/documents 46.68% 23,341 0.42% 41,126
/contractor/cc/contract 46.56% 23,283 1.92% 186,734
...
SQL> select min(ACCESS_EVENT_DATE) from abc_access_events;
MIN(ACCES
---------
28-JUN-04
Next I started to think about reorganizing the table with partitioning on the PAGE_URL, so data would be grouped together in the blocks and that would improve performance. I also didn't understand why the Client was keeping track of every page displayed and not just incrementing a counter using a row for each PAGE_URL in a table. This design would really work much faster because the application would just need to update and select a single row from the table. The table would only have a few rows (1 row per web page). I knew that each of these modifications I was pondering, would take a bit of time to implement because we would need to change the schema and possibly even the application code.
I called a meeting with the Systems people, the DBA's, the Developers and the Users to review the situation. I told them the reason for the slow page display was because this query was causing a roughly 8 second delay for each page displayed. It took about 8 seconds for the SQL to FULL scan the ABC_ACCESS_EVENTS table. I informed them that this SQL has probably been slowing down the website for years and it will continue to make it run slower each day. I also told them some of the things we might do to re-architect it and have it perform properly. After presenting this, the Systems people responded that they could probably help by removing disk io hotspots and get the physical io rate to improve to about 5 ms, but that it would take a bit of work to implement. The DBA's then responded that they could migrate the table to a partitioned structure but it would take some work and testing. They would prefer the application get re-architected since that was really a better solution. The Developers said they could re-architect the code but it would take some time and then need to be tested but they could do it if they had to. Finally the User spoke up and responded that they really don't even care about seeing the page count at the bottom of the web pages and that they would be fine if it was simply removed. They said it was never part of the requirements but the Developer just added it as an enhancement.
So as soon as possible, we eliminated this work (truncated the table first, then later removed the application code) and the Client was ecstatic that there website was performing now faster than ever.
The Users solution was of course, the best solution choice. It follows the old Hotsos saying that "The fastest way to do something is don't do it". Why bother going down the path of tuning the disk array, rebuilding table structures and re-architecting your code, when the best solution is just don't do the work. The User's solution also has the residual affect of potentially helping performance in other areas of the application where resource contention was occurring with this transaction. This is a great example of why I like to say that "Tuning is a Team Sport" and that optimization needs to be done with participation across the organization. Quite often each group can propose a solution to a problem, but one solution will likely provide the best return on investment (ROI). If you do not engage the "Whole Team", then you risk missing out on the best solution and will likely waste valuable time, resources and money implementing a sub-optimal solution.
Wednesday, March 18, 2009
007 Secret Agent (a.k.a. 10046 Tracing)
Whenever I speak of Oracle 10046 tracing, I usually say "10 0 46", probably because that is the way others I work with like, Cary Millsap and all of the Hotsos folks, say it. However, during the 2009 Hotsos Symposium, I noticed that Jonathan Lewis pronounced it "1 Double 0 46". I guess being a James Bond fan, it made me instantly think of James Bond 007 (Double 0 7). I thought it was kind of cool to hear Jonathan say it that way, maybe even more so because of his British accent. I started thinking that maybe I was saying it wrong and actually he was saying it right.
After all, what 10046 does is it allows us to spy on the exact details of what our Oracle transactions are doing and decipher where our valuable resources are being spent. It is the secret agent we hire to investigate where our precious time and system resources are being used up. It has a license to kill, well at least help us to kill, our heavy resource consumers. So the more I thought about it, the more I got to thinking that 10046 actually is kind of like the James Bond 007 for Oracle performance.
Sure 10046 doesn't drive around in an awesome car with cool gadgets, but it is now driving around in Oracle 10g, with a much easier way of being invoked. In the past, it could only be enabled using the ALTER SESSION or ALTER SYSTEM command. Yes, there were other ways to start it like setting an init.ora parameter, using oradebug, and of course the DBMS_SYSTEM and DBMS_SUPPORT packages. But now in Oracle 10g, it has the cool DBMS_MONITOR package to get the bad guy, I mean extended trace data file.
It also has its' Quartermaster, the TRCSESS command, to help support it when needed, by tinkering with the trace. I guess that makes TRCSESS kind of like Q in the James Bond films.
So from now on when I mention 10046, I will try to pronounce it more appropriately as 1 Double 0 46. Thank you Jonathan for correcting my pronunciation. And thank you 10046 for getting the bad guys and saving our clients' performance once again. Not to mention, allowing us to get the girl (our paycheck).
All this writing about 10046 is making me kind of thirsty. Maybe I'll have a Vodka Martini - shaken not stirred.
Now if only I can get one of those cool Aston Martin's.
Oh Rick...
After all, what 10046 does is it allows us to spy on the exact details of what our Oracle transactions are doing and decipher where our valuable resources are being spent. It is the secret agent we hire to investigate where our precious time and system resources are being used up. It has a license to kill, well at least help us to kill, our heavy resource consumers. So the more I thought about it, the more I got to thinking that 10046 actually is kind of like the James Bond 007 for Oracle performance.
Sure 10046 doesn't drive around in an awesome car with cool gadgets, but it is now driving around in Oracle 10g, with a much easier way of being invoked. In the past, it could only be enabled using the ALTER SESSION or ALTER SYSTEM command. Yes, there were other ways to start it like setting an init.ora parameter, using oradebug, and of course the DBMS_SYSTEM and DBMS_SUPPORT packages. But now in Oracle 10g, it has the cool DBMS_MONITOR package to get the bad guy, I mean extended trace data file.
It also has its' Quartermaster, the TRCSESS command, to help support it when needed, by tinkering with the trace. I guess that makes TRCSESS kind of like Q in the James Bond films.
So from now on when I mention 10046, I will try to pronounce it more appropriately as 1 Double 0 46. Thank you Jonathan for correcting my pronunciation. And thank you 10046 for getting the bad guys and saving our clients' performance once again. Not to mention, allowing us to get the girl (our paycheck).
All this writing about 10046 is making me kind of thirsty. Maybe I'll have a Vodka Martini - shaken not stirred.
Now if only I can get one of those cool Aston Martin's.
Oh Rick...
Monday, March 16, 2009
Begin Blogging ...
I have been thinking for quite awhile about starting my own blog, but was unsure as to who would actually read it and what I would write about. I noticed how other bloggers, like my good friends Karen Morton and Cary Millsap, have really embraced the medium which honestly is kind of new to me. Let me say that writing has never been one of my favorite things to do. In the past, I have even used sayings like "I am more of a Do-er, than a Writer" and "I would pay good money for a Technical Writer to do the documentation for me." By that I mean, I really enjoy the Oracle performance optimization work I have been involved with for the past 25 years or so, but it is a real effort for me to write about it. However, here I am writing a blog, which I hope others will read, enjoy and get a little insight into my thoughts. Perhaps we can even learn a bit about Oracle performance optimization along the way. Well I have just completed my first entry, so please be kind. I hope I didn't just open up a Pandora's Box.
Subscribe to:
Posts (Atom)