Optimizer
statistics in Oracle are managed via a pl/sql package, dbms_stats. It provides
several pl/sql procedures to gather statistics for a table, schema, or a
database. For example, gather_table_statistics is used to gather statistics on
a table. This procedure has an estimate_percent parameter, which specifies the
sampling percentage of the statistics gathering. The users can specify any
number between 0 ~ 100 for this parameter. For example, suppose you have a
table BIGT, you can specify a 1% sampling percentage as follows:
exec dbms_stats.gather_table_stats(null, 'BIGT',
estimate_percent =>
1);
It is not
always easy for users to pick the appropriate sampling percentage. If the
specified sampling percentage is too high, it can take longer to gather
statistics. On the contray, if the data is skewed and the specified sampling
percentage is too low, the resulting statistics can be inaccurate. For this
reason, Oracle introduced the AUTO value for the estimate_percent parameter.
For example, you can gather statistics on BIGT as follows:
exec dbms_stats.gather_table_stats(null, 'BIGT',
estimate_percent => dbms_stats.auto_sample_size);
The
advantage of using AUTO sample size over a fixed number is twofolds. First,
when AUTO sample size is specified, the system automatically determines the
appropriate sampling percentage. Second, AUTO sample size is more flexible than
a fixed sampling percentage. A fixed sampling percentage size that was good at
some point in time may not be appropriate after the data distribution in the
table has changed. On the other hand when AUTO value is used Oracle will adjust
the sample size when the data distribution changes.
When AUTO
is used Oracle picks a sample size where the statistics quality is good enough.
However, it does not work very well under cases of extreme skew in the data. In
Oracle 11g, we improved the behavior when the AUTO value is used. First, AUTO
sampling now generates deterministic statistics. Second, and more importantly,
AUTO sampling generates statistics that are almost as accurate as 100% sampling
but takes much less time than 100% sampling. To illustrate these merits, we
compare the performance of using a fixed sampling percentage, AUTO sample size
in Oracle 10g and AUTO sample size in Oracle 11g. We used the standard TPCD
data generator to generate a Lineitem table. The Lineitem table is about 230G
and contains 1.8 million rows with 16 columns. The schema of the lineitem table
is as follows:
column
name

column
type

l_shipdate

date

l_orderkey

number

l_discount

number

l_extendedprice

number

l_suppkey

number

l_quantity

number

l_returnflag

varchar2

l_partkey

number

l_linestatus

varchar2

l_tax

number

l_commitdate

date

l_receiptdate

date

l_shipmode

varchar2

l_linenumber

number

l_shipinstruct

varchar2

l_comment

varchar2

Table 1 gives the elapsed time of gathering statistics on the Lineitem table by different sampling percentages.
Sampling
Percentage

Elapsed
Time (sec)

1%
sampling

797

100%
sampling (Compute)

18772

Auto
sampling in Oracle 10g

2935

Auto
sampling in Oracle 11g

1908

Table 1: Statistics gathering time on 230G TPCD Lineitem Table Using Different Estimate Percentages We also compare the quality of the statistics gathered using different estimate percentages. Among all the statistics of a column, number of distinct values (NDV) is the one whose accuracy used to be an issue. We define the accuracy rate of NDV of a column as follows:
accuracy rate = 1  (estimated NDV  actual
NDV)/actual NDV.
The
accuracy rate ranges from 0% to 100%. The higher the accuracy rate is, the more
accurate the gathered statistics are. Since 100% sampling always lead to an
accuracy rate of 100%, we do not report it. We focus on the columns which has
at least one statistics accuracy rate below 99.9% when using different estimate
percentages. Table 2 illustrates the accurate rates of the columns.
Column
Name

Actual
NDV

Auto
Sampling in Oracle 11g

1%
Sampling

orderkey

450,000,000

98.0%

50%

comment

181,122,127

98.60%

4.60%

partkey

60,000,000

99.20%

98.20%

suppkey

3,000,000

99.60%

99.90%

extendedprice

3,791,320

99.60%

94.30%

Table 2: Accuracy Rate of Gathering NDV LineItem Using Different Estimate Percentages In short, the elapsed time of ''AUTO sampling in Oracle 11g'' is 10 times faster than 100% sampling but yields the statistics of similar quality (accuracy rate close to 100%).
copy & paste from oracle.com
ReplyDeletehttps://blogs.oracle.com/optimizer/entry/improvement_of_auto_sampling_statistics_gathering_feature_in_oracle_database_11g