OEM 10gR3 is out!

emgrid 10.2.0.3 is not the second patchset of 10gR2, it is Oracle Enterprise Manager Grid Control Release 3.

Well, it is still called a patch set and must be applied to 10.2.0.1 or 10.2.0.2.

But it is a new release!

You can download the software and the documentation on OTN:
http://www.oracle.com/technology/software/products/oem/index.html

Among the new features, you have a Linux pack for your Unbreakable Linux, Management Packs for Siebel and Oracle Content Database, and also Linux Server Administration to manage your Linux servers (RHEL4 and SuSE9).

Monitoring the age of the last backup with OEM

My customer wants to receive alerts if a database has not been backed up for ages, either because the backup failed or because no backup ran at all.

As the customer has Oracle Enterprise Manager Grid Control 10gR2, I started using User-Defined Metrics (UDM) yesterday. Thanks to Troy for his comment at OEM Generic Service.

I will do two checks: backup of the datafiles and backup of the redo logs.

Here are the steps:
- Open a database target (LSC01)
- Click on User-Defined Metrics
- Create
- Metric Name = Age of datafile backup
- Type = Number
- Output = Single Value
- SQL Query : the age in hours since the oldest checkpoint time of the newest backup
select (sysdate-min(t))*24 from
(
  select max(b.CHECKPOINT_TIME) t
  from v$backup_datafile b, v$tablespace ts, v$datafile f
  where INCLUDED_IN_DATABASE_BACKUP='YES'
  and f.file#=b.file#
  and f.ts#=ts.ts#
  group by f.file#
)
- Credentials : dbsnmp/*****
- Threshold Operator > Warning 24 Critical 48
- Repeat every 1 hour
- OK

Same for the redo logs, with a name of Age of redolog backup and a query of
select (sysdate-max(NEXT_TIME))*24 from v$BACKUP_REDOLOG

I am not going to do this for each instance, so I will create a monitoring template
- Setup
- Monitoring Templates
- Create
- Select the target (LSC01)
- Name = Age of last backup
- Metric Threshold : remove all metrics except Age of datafile backup and Age of redolog backup
- Policies : remove all policies from template
- OK

Now apply your new template to your databases.
- Apply
- Add
- Select all your databases
- Continue
- dbsnmp/*** (if you have the same database password for dbsnmp on all databases, it is easier)

It is now possible to define alert notifications.
- Preferences
- Notification Rules
- Create
- Apply to specific targets : Add your production databases group
- Deselect Availability Down
- Metric : Add : Show all : check User Defined Metric : select Age of datafile backup and Age of redolog backup
- Severity : Critical and Clear
- Policy : None
- Method : Email

After a while, you can monitor the metric with a historical graph:
User Defined Metric graph in OEM

Critical Patch Update

The Critical Patch Update for January is released. However, not for every version.

Already with CPU Oct 2006, many versions were delayed. The release date of 10gR1 on Tru64, for example, was Nov 15, but the note has not been updated for months. 9.2.0.8 on Linux was released so late (29 Dec) that CPU2006OCT-9208 is considered to be CPU2007JAN (patch 5490859). For 10.2.0.3, there is no CPU2007JAN.

I applied both 9208-cpu2006oct and 10105-cpu2007jan to my notebook database.

Both complained about a non-existing XDB schema, but I ignored those errors (I have no XDB schema). Otherwise it went fine.

I apply the CPU as soon as it is released, for testing purposes. However, applying the CPU as soon as possible in PROD is not an option for most of my customers.

RAC workshop

As announced, I spoke to Credit Suisse employees last week. The objective was to give the database project and team leaders the key arguments for going or not going to RAC.

First: what is RAC?
Oracle Real Application Clusters.
You have shared storage and two instances accessing the same database.

The challenges:

    Consistency

In a single-instance model, there is read-write consistency. This must not change in RAC: if one session reads a table while, at the same time, another session writes to it, RAC must offer the same level of consistency.

    Crash recovery

In single instance, if one instance dies (shutdown abort / kill -9 / server crash), there is an instance recovery which reads the committed and uncommitted transactions from the redo logs. In RAC, each instance has an UNDO tablespace and a separate thread of redo. So if one instance crashes (server crash), the other instances will do the crash recovery. If all instances crash at the same time (ex: disaster), the first one that restarts will have to recover all transactions.
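
Each instance writes to its own thread of redo. To see the threads and their log groups, a simple query on the standard V$LOG view is enough (just a quick check, nothing RAC-specific assumed):

select thread#,group#,sequence#,status from v$log order by thread#,group#;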

    Performance

You have more operating systems and more network traffic, but you should achieve performance comparable to a multi-processor server. Some Oracle marketing slides show a 4-node cluster of 2-CPU servers performing better than an 8-CPU server. This is a challenge.

    Scalability

This is rather related to Linux. If you have Sparc/Solaris, you can scale on a single server to more than 100 CPUs and 1Tb of memory. But if you have Linux, RAC offers additional scalability. However, there is no guarantee that your cluster will scale beyond 4 nodes.

    High Availability (HA)

This is THE challenge. It is also the main reason to go to RAC. If you cannot afford the downtime of a failover (a few minutes, 1-30), you probably need RAC.

    Maintainability

It must not be more difficult to maintain RAC than a single-instance database. This is a very difficult question. On a failover cluster, there are lots of in-house scripts to do the failover. A big mess most of the time. In RAC, there is no failover. If one instance crashes, it just needs to be restarted one day, that’s it! However, the configuration of the cluster is twice as difficult, because you have not 1 but 2 instances to configure. In case of tuning / troubleshooting, you have 2 instances to tune, but also the interconnect and cache fusion; I talk about the latter below. If you have an SLA (service level agreement) per database, going to RAC is a pain. If you have an SLA per instance, you will have more instances to maintain, so you may increase your productivity.

You have (typically) two servers (or more). On each server you have memory (buffer cache / shared pool / PGA), CPU (load balancing / parallelism) and processes (pmon / dbwr / dispatchers / arch). If the servers have different sizes, you can use different settings. One instance can have a bigger cache, more CPUs used for parallel query, more dispatchers, more database writers.

The spfile has the capability of configuring multiple instances in a single shared parameter file:

alter system set shared_pool_size=400M sid='INST01';
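
To set the same value on all instances, use sid='*' (standard spfile syntax; the value here is only an example):

alter system set shared_pool_size=300M sid='*';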

The dictionary has views to let you monitor either a single instance (V$SESSION) or all instances globally (GV$SESSION).
select inst_id,sid,serial# from gv$session;
Note that if you want to kill a session, you must be connected as a privileged user on that instance, or wait for the next release of RAC.
There are also performance views about cache fusion and the interconnect.
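
As a rough sketch of how a cross-instance kill looks today (SCOTT and the sid/serial# values are only placeholders): locate the session in GV$SESSION, then connect to the instance identified by INST_ID and kill it from there.

select inst_id,sid,serial# from gv$session where username='SCOTT';
-- connect to that instance (for example via a TNS alias pointing to it only), then:
alter system kill session '123,4567';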

The cache fusion. In a single instance, you can read data from your cache safely. In RAC, if you have a block in the cache, another instance may update this block, so there must be a mechanism to ensure cache invalidation. This mechanism already existed in Oracle 6 Parallel Server (OPS) and later. Here is the principle: one instance is the lock master. The first one that started. Not configurable. Any time one instance reads or writes a block, the block can be cached. If one instance has the block in cache, another can read it from there.
Access time:
local cache << remote cache << disk
Let’s take an example:

INSTA: hello lock master, I want to read block 1008
LOCKM: nobody has this block, please read from disk
LOCKM updates his table, he knows INSTA has the block
INSTA reads the block
INSTB: hello lock master, I want to read block 1008
LOCKM: wait, INSTA has the block, I tell INSTA to send you that block
LOCKM knows both INSTA and INSTB have the block
INSTA sends the block to INSTB
INSTB: hello lock master, I want to update block 1008
LOCKM informs INSTA that its version of block 1008 is now invalid
LOCKM knows INSTB has the block 1008 in cache (valid)
LOCKM: please do your update

This is cache fusion. If the lock master crashes, a new global cache table is built on another instance, which is voted as the new lock master.

This mechanism has a performance impact. You need more CPU. If you have 100% CPU usage (ex: batch job), it will have a dramatic performance impact. So you need sufficient CPU (cost of license :twisted:). Having 100% CPU in RAC is not good.

Also, if you have a bigger buffer cache (OLTP), the probability of reading a block over the network will increase, so it will generate network traffic. Having network traffic is good, because the interconnect is faster than disk. But having a network bottleneck is terrible.
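
To get a rough idea of how the interconnect behaves, you can look at the global cache statistics. A minimal sketch, assuming the 10g statistic names 'gc cr blocks received' and 'gc cr block receive time' (the time is in centiseconds, hence the factor 10 to get milliseconds per block):

select r.inst_id, 10*t.value/nullif(r.value,0) avg_cr_receive_ms
from gv$sysstat r, gv$sysstat t
where r.name='gc cr blocks received'
and t.name='gc cr block receive time'
and r.inst_id=t.inst_id;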

If you have more nodes, you will also have more traffic. If you have some latch contention on a single instance, going to RAC will increase your contention. RAC will not solve your single-instance performance problems. If you have a packaged application, make sure it supports RAC and read the benchmarks for that particular application.

I had a lot of questions about planned downtime. Here are some facts:
If you have a standard OS upgrade/patch, you may achieve zero downtime with RAC. However, your application must support connection failover (a challenge 8) ).
If you have an interim patch, like a Critical Patch Update, RAC may offer zero downtime by applying the patch locally, in a rolling upgrade manner. Not for all versions. Not for all interim patches.
If you have a patchset (10.2.0.2 to 10.2.0.3) or a major version upgrade (10gR1 to 10gR2), you need a logical standby database to achieve a downtime smaller than one minute. RAC will not help. On the contrary, you have more servers and more Oracle homes to patch, so it may even increase the downtime in comparison to single instance.

Also, the cost of the infrastructure was of great interest. If you have RAC, you may need a different cluster infrastructure (ex: InfiniBand) and a more efficient protocol than Ethernet over IP (ex: Reliable Datagram Sockets). It is way more expensive than a single-instance failover cluster. And the more nodes you add, the faster your network must perform.

One student asked: “Why invest time and money on RAC when the application / web server is not redundant?”. This is a good question. You need to offer HA across the whole service, from the DBA to the client. If one server crashes, you need another server. If one network crashes, you need another network. If one application server crashes, you need another application server. If one operator is ill, you need another operator to do his job. Oracle Application Server offers HA with farms of J2EE services. If your application does not support TAF (transparent application failover), you need to change your application first!
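
A quick way to see whether your sessions are actually configured for TAF is to look at the failover columns of V$SESSION (standard columns, nothing assumed beyond that):

select username,failover_type,failover_method,failed_over from v$session where username is not null;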

Finally, about the skills of the RAC DBA. Lots of tasks are almost as easy or even easier than on a failover cluster: runinstaller / opatch / Database Configuration Assistant (dbca) / Enterprise Manager are cluster-aware. Installing an additional node with OEM or ORACLE_HOME/oui/bin/add_node.sh and adding an additional instance with dbca or OEM is simple. However, the DBA must have some cluster skills, and it is less easy, at least in Switzerland, to find a DBA with RAC experience than a “normal” DBA :mrgreen:

Fit for RAC

I will be speaking tomorrow in German and on Wednesday in English at Credit Suisse.

The workshop is about the differences between single instance and RAC.

Some of them: architecture, availability, scalability, manageability, skills required, price. I will also talk about the operational aspects: upgrade, dictionary, parameter file.

Paul Moen's article on MONTHS_BETWEEN

I have been shocked by Paul Moen's article on MONTHS_BETWEEN.

Here is my own case:

SQL> select months_between( date '2000-03-01', date '2000-02-28') * 31 from dual;
4

Incredible! I have always been very careful with months_between because of the documented end-of-month behaviour (there is one month between 30-APR and 31-MAY), but I did not know the fractional part of months_between was based on a 31-day month :evil:
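
For reference, here is the documented end-of-month behaviour mentioned above (both dates are the last day of their month, so the result is a whole number of months):

SQL> select months_between( date '2000-05-31', date '2000-04-30') from dual;
1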

Update: but it is documented.