Hide this madness

Quick Guide to Oracle Histograms

Statistics – when you’ve got ‘em, and when they’re reasonable, life is good. But even when they’re good, Oracle’s optimizer can make some strange decisions, generating some hard to explain query plans. Histograms can be part of an effective solution, but they can also make life worse.

Most software developers are aware of a number of factors that can affect the performance of their database queries. All have experienced the common situation of a query that has been thoroughly tested against a variety of data working well for a quite some time, and then later performing dismally, rather unexpectedly.

Often the culprit is “database statistics”. Unfortunately, for most developers that’s the end of it. The DBA fixes the situation, and all is well again. This article hopes to illuminate a small portion of what is typically a foggy area of understanding. We’ll use Oracle 10g for our base of reference and examples, but the concepts are similar across most modern relational database systems (Sybase, Informix, MySQL, MS SQL Server, etc.).

Table and index statistics are simply a set of numbers gathered by Oracle to describe your data – a way of describing or profiling the landscape of your information. These numbers have internal benefit to Oracle because there are many ways to retrieve data, and these methods can vary significantly in cost. By cost, we mean the resources used, most notably: CPU, memory, and disk i/o’s. As the volume of data increases, and the complexity of the SQL query increases, statistics become even more critical for Oracle to generate what is known as an optimal query plan – a series of steps and techniques that Oracle has at its disposal to acquire the data being requested.

In the past, the optimizer component of an RDBMS would develop a query plan based on a set of rules in a relatively straightforward manner: SQL query + rules = query plan. Starting with Oracle 9i, the trend is “cost-based” optimization rather than “rule-based” optimization. This is proving to be more flexible, dynamic, and effective. But it has its own set of perils.

Suboptimal query plans generated by Oracle’s cost based optimizer (CBO) are most often a result of a difference of opinion between your own knowledge of the data being queried, and CBO’s opinion of what the data looks like. The CBO’s opinion is base entirely on the statistics it has gathered about your data. So when these numbers fail to accurately describe reality, some interesting query plans can be generated.

Here are couple of very simple examples of important statistics:
1. The number of rows contained in a table
2. The number of distinct values that a column currently has.

These, and a number of other statistics are used as input for cost calculations. Oracle calculates a number of candidate query plans and their associated costs, then (usually) selects the plan with the least expensive cost.

Frequency Histograms

Histograms are type of statistic that can be an important ingredient in an accurate cost calculation. Oracle uses histograms to improve its selectivity and cardinality calculations for nonuniform data distributions. I’ll explain what this means with a real world example, from Jonah Group’s FARE1 data warehouse project.

Within some of the larger fact tables, we use numeric columns to describe the year & month of a date. Using an integer or numeric data type for partial date information – rather than a date type – is a common tactic to save storage space. It also can make the development of reporting criteria simpler.

The diagram below represents what is called a “frequency histogram”. In this example, an integer column called “PAID_MONTH” represents the month in which an insurance claim was paid. In SQL, such a column could be used to select claims paid between (and including) October 2009 and January 2010:


frequency histogram

Frequency histograms allow for every distinct value for the column to be categorized nicely into its own “bucket”. One limitation of Oracle’s frequency histograms is that it has a hard limit of 254 buckets. (In this situation, a frequency histogram works out rather well, since the number of months since the beginning of the FARE data warehouse, and today falls below 254.) By pre-constructing a frequency histogram for this column, Oracle can more accurately calculate the selectivity for this column. The number of qualifying rows would be: 61732 + 20152 + 20652 + 81636 = 184172.

Without a histogram, Oracle would make other assumptions concerning the distribution of data, and this would likely affect the outcome of the selectivity calculation. In this case, one assumption that Oracle uses would be that the number of integers between 200910 and 201001 is 91. But since we have inside knowledge of this data that Oracle doesn’t have, we know that the number of months in this range of month values is just 5. Not surprisingly, this discrepancy can result in some very different cost calculations and the choice of some suboptimal query plans, for some SQL queries.

Without a histogram, Oracle sees the distribution of data somewhat like this:

Height Balanced Histograms

Height balanced histograms are similar to frequency histograms in their design, but in this case the number of distinct column values exceeds the hard limit of the number of buckets available (254). Here, the number of column values are divided into ranges so that each range contains approximately the same number of rows. The number of distinct values represented by a bucket can be more than just one value.

We’ll use an example to illustrate. Let’s suppose there is a small medical clinic of providers that specialize in just 999 different surgical procedures, but together they service many thousands of patients each year. You can imagine that a medical procedure code representing each procedure, might be a useful column in a table that contains these many thousands of patient encounters. A height balanced histogram would be represented by the following diagram.

As with frequency histograms, height balanced histograms can also provide the same benefits to more accurate query plan cost calculations because it could provide Oracle with a more accurate view of the distribution, should it be heavily skewed.

Histograms… Or Not?

Oracle statistics gathering and histogram creation are specified through the use of the DBMS_STATS package. There are numerous online resources available that detail the use of this package, so we won’t get into this here. Unfortunately, the down-side of having Oracle generate histograms during the statistics gathering process is that they’re quite expensive to create, and they aren’t actually used very much by the CBO.

Nonetheless, it is important to understand when the use of histograms is a good idea, and when it’s a bad idea. The default position on this matter should be to avoid constructing histograms if you can. There are large amounts of misinformation on the web indicating otherwise. Typical recommendations are to create them for all indexed columns, and/or let Oracle decide when to create them by using its documented “auto” or “skewonly” column parameter modifiers. Sadly, both recommendations are usually bad ideas, and depending on the circumstances lead to greater troubles.

Guidelines For Histograms:

  • Use as a way to solve a specific query plan issue; and explicitly choose them.
  • And when column data is heavily skewed.
  • And when a column is likely to appear in a query’s “WHERE” clause.
    (Note: Histograms are not just for indexed columns.)
  • And when the time and resources required to create the histogram are not excessive.
    This will be very dependent on the data volumes involved.

Avoid Histograms:

  • When column data is evenly distributed
  • Or the column rarely appears in a “WHERE” clause
  • Never create them on every column of a table
  • Never on the Primary Key

The details of the cost calculations that CBO performs are very well explained in Jonathan Lewis’s book titled “Cost-Based Oracle Fundamentals”. Here you’ll find plenty of detail concerning Oracle histograms and their use.

  1. Facets Analytic Reporting Environment – a project executed by The Jonah Group for BCN of Michigan
  1. I recently came across this site and have loved the information. I look forward to future articles and will definitely link to this and tell the people I know. Thanks.

    Jan 20th, 2011
  2. Hi Scott, I need info like does histogram generate only on columns or it can generated on indexes also ? if index histograms generate which dictionary they can viewed ?
    Thanks in Advance ..

    Jul 10th, 2015

Add a comment

Comment feed
The better to greet you with
No one will ever see this
Your pride and joy
The reason this comment form exists

The crew behind ASOT

We're a team of interactive, software, and business intelligence experts skilled in the design, construction, and management of online enterprise systems.

Visit The Jonah Group site

Get in touch with us