RAC workshop

As announced, I spoke to Credit Suisse employees last week. The objective was to give the database project and team leaders the key arguments for going, or not going, to RAC.

First: what is RAC?
Oracle Real Application Clusters.
You have shared storage, and two (or more) instances accessing the same database.

The challenges:

    Consistency

In the single-instance model, there is read-write consistency. This must not change in RAC: if one session reads a table while, at the same time, another session writes to that table, RAC must offer the same level of consistency.

    Crash recovery

In single instance, if the instance dies (shutdown abort / kill -9 / server crash), there is an instance recovery which reads the committed and uncommitted transactions from the redo logs. In RAC, each instance has its own UNDO tablespace and a separate thread of redo. So if one instance crashes (server crash), the other instances will perform the crash recovery. If all instances crash at the same time (e.g. a disaster), the first one to restart will have to recover all transactions.
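
You can see one redo thread per instance in the dictionary. A quick check (just a sketch, using the standard V$LOG columns):
select thread#, group#, sequence#, status from v$log order by thread#, group#;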

    Performance

You have more operating systems and more network traffic, but you should achieve performance comparable to a single multi-processor server. Some Oracle marketing slides show a 4-node cluster of 2-CPU servers performing better than a single 8-CPU server. This is a challenge.

    Scalability

This is mostly related to Linux. If you have Sparc/Solaris, you can scale on a single server to more than 100 CPUs and 1TB of memory. But if you have Linux, RAC offers additional scalability. However, there is no guarantee that your cluster will scale beyond 4 nodes.

    High Availability (HA)

This is THE challenge. It is also the main reason to go RAC. If you cannot afford the downtime of a failover (a few minutes, say 1 to 30), you probably need RAC.

    Maintainability

It must not be more difficult to maintain RAC than a single-instance database. This is a very difficult question. On a failover cluster, there are lots of in-house scripts to do the failover, a big mess most of the time. In RAC, there is no failover: if one instance crashes, it just needs to be restarted some day, that’s it! However, the configuration of the cluster is twice as difficult, because you have not 1 but 2 instances to configure. For tuning / troubleshooting, you also have 2 instances to tune, plus the interconnect and the cache fusion; I talk about the latter below. If you have an SLA (service level agreement) per database, going to RAC is a pain. If you have an SLA per instance, you will have more instances to maintain, so you may increase your productivity.

You have (typically) two servers (or more). On each server you have memory (buffer cache / shared pool / PGA), CPU (load balancing / parallelism) and processes (pmon / dbwr / dispatchers / arch). If the servers have different sizes, you can use different settings: one instance can have a bigger cache, more CPUs used for parallel query, more dispatchers, more DB writers.

The spfile has the capability of configuring multiple instances in a single shared parameter file.

alter system set shared_pool_size=400M sid='INST01';
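
With sid='*' you set the default for all instances, so you can combine a cluster-wide default with a per-instance override (the sizes and the INST01 name are just examples):
alter system set db_cache_size=2G sid='*';
alter system set db_cache_size=4G sid='INST01';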

The dictionary has views that let you monitor either a single instance (V$SESSION) or all instances globally (GV$SESSION).
select inst_id,sid,serial# from gv$session;
Note that if you want to kill a session, you must be connected as a privileged user on that instance, or wait for the next release of RAC.
There are also performance views about cache fusion and the interconnect.
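
For example, to kill session (130,47), the sid and serial# being made-up values, you connect to the instance reported by GV$SESSION and issue:
alter system kill session '130,47';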

The cache fusion. In a single instance, you can read data from your cache safely. In RAC, if you have a block in the cache, another instance may update this block, so there must be a mechanism to ensure cache invalidation. This mechanism already existed back in Oracle Parallel Server (OPS), from Oracle 6 on. Here is the principle: one instance is the lock master (the first one that started; this is not configurable). Any time an instance reads or writes a block, the block can be cached. If one instance has the block in its cache, another instance can read it from there.
Access time:
local cache << remote cache << disk
Let’s take an example:

INSTA: hello lock master, I want to read block 1008
LOCKM: nobody has this block, please read from disk
LOCKM updates his table: he knows that INSTA has the block
INSTA reads the block
INSTB: hello lock master, I want to read block 1008
LOCKM: wait, INSTA has the block; I will tell INSTA to send you the block
LOCKM knows both INSTA and INSTB have the block
INSTA sends the block to INSTB
INSTB: hello lock master, I want to update block 1008
LOCKM informs INSTA that its version of block 1008 is now invalid
LOCKM knows INSTB has the block 1008 in cache (valid)
LOCKM: please do your update

This is cache fusion. If the lock master crashes, a new global cache table is built on another instance, which is voted the new lock master.

This mechanism has a performance impact. You need more CPU. If you are at 100% CPU usage (e.g. a batch job), it will have a dramatic performance impact, so you need sufficient CPU (cost of licenses :twisted:). Running at 100% CPU in RAC is not good.

Also, if you have a bigger buffer cache (OLTP), the probability of reading a block over the network increases, so it will generate network traffic. Network traffic in itself is good, because the interconnect is faster than disk. But a network bottleneck is terrible.
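
You can measure this traffic per instance with the global statistics, for example (a sketch using the 10g "gc" statistic names):
select inst_id, name, value from gv$sysstat where name in ('gc cr blocks received','gc current blocks received') order by inst_id, name;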

If you have more nodes, you will also have more traffic. If you have latch contention on a single instance, going to RAC will increase that contention. RAC will not solve your single-instance performance problems. If you have a packaged application, make sure it supports RAC and read the benchmarks for that particular application.

I had a lot of questions about planned downtime. Here are some facts:
If you have a standard OS upgrade/patch, you may achieve zero downtime with RAC. However, your application must support connection failover (a challenge 8) ).
If you have an interim patch, like a Critical Patch Update, RAC may offer zero downtime by applying the patch locally, in a rolling upgrade manner. Not all versions. Not all interim patches.
If you have a patchset (10.2.0.2 to 10.2.0.3) or a major version upgrade (10gR1 to 10gR2), you need a logical standby database to achieve a downtime of less than one minute; RAC will not help. On the contrary, you have more servers and more Oracle homes to patch, so it may even increase the downtime compared to a single instance.

The cost of the infrastructure was also of great interest. If you have RAC, you may need a different cluster infrastructure (e.g. InfiniBand) and a more efficient protocol than Ethernet over IP (e.g. Reliable Datagram Sockets). It is way more expensive than a single-instance failover cluster. And the more nodes you add, the faster your network must perform.

One student asked: “Why invest time and money in RAC when the application / web server is not redundant?” This is a good question. You need to offer HA across the whole service, from the DBA to the client. If one server crashes, you need another server. If one network crashes, you need another network. If one application server crashes, you need another application server. If one operator is ill, you need another operator to do his job. Oracle Application Server offers HA with farms of J2EE services. If your application does not support TAF (Transparent Application Failover), you need to change your application first!
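
As a reminder, TAF is configured on the client side. A minimal tnsnames.ora sketch (the host names, port and service name are made up):
RAC01 =
  (DESCRIPTION=
    (ADDRESS_LIST=
      (ADDRESS=(PROTOCOL=TCP)(HOST=node1-vip)(PORT=1521))
      (ADDRESS=(PROTOCOL=TCP)(HOST=node2-vip)(PORT=1521))
      (LOAD_BALANCE=yes))
    (CONNECT_DATA=
      (SERVICE_NAME=RAC01)
      (FAILOVER_MODE=(TYPE=SELECT)(METHOD=BASIC)(RETRIES=30)(DELAY=5))))
With TYPE=SELECT, even running queries resume on a surviving instance after a crash.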

Finally, about the skills of the RAC DBA. Lots of tasks are almost as easy as, or even easier than, on a failover cluster: runInstaller / opatch / Database Configuration Assistant (dbca) / Enterprise Manager are cluster-aware. Installing an additional node with OEM or ORACLE_HOME/oui/bin/addNode.sh and adding an additional instance with dbca or OEM is simple. However, the DBA must have some cluster skills, and it is less easy, at least in Switzerland, to find a DBA with RAC experience than a “normal” DBA :mrgreen:

13 thoughts on “RAC workshop”

  1. Kirtan Desai

    If all instances crash at the same time (e.g. a disaster), the first one to restart will have to recover all transactions.

    Can you please elaborate on this? What would happen when, after a disaster, two nodes are available? When does the node that is not performing the recovery become available to the public? Does it wait for the other node that is performing the recovery?

  2. Kirtan Desai

    So, in other words, the second node will wait for a signal from the database server, and the database server waits for the signal from the first node. Is that correct? Do they remain in suspend mode until then?

  3. laurentschneider Post author

    It is not very different from a single instance. When you issue your STARTUP after the crash, the database is not open until crash recovery is finished. I hope I can test it more accurately one day.

  4. jokach

    “But if you have Linux, RAC offers additional scalability. However, there is no guarantee that your cluster will scale beyond 4 nodes.”

    Is this written somewhere out there? I’ve heard this mentioned, and I see it in your article, but I can’t seem to find any qualitative information supporting it.

  5. Laurent Schneider Post author

    Jokach,
    Thanks for your question, and you are very right to ask; it is not my habit to put such comments without justification. I may be completely wrong for some application that will scale well on your six-node cluster.

    Also important is that you can offer different services ❗ You have a 6-node database, but application A has 4 preferred nodes, application B has 4 preferred nodes and application C has 4 preferred nodes. So you keep very high availability, but you do not “scale” one application over more than 4 nodes; you scale your whole cluster and add nodes as you offer more services/applications.
    This could be an approach too.

    The claim that 5 nodes scale worse than 4 is something I have read a few times, but I could not find a reference right now, sorry. The fact that I have read it does not mean it is true 🙁 But the overhead due to cache fusion and the high number of connections will probably make your network a bottleneck if you add nodes and want to scale your application without limit. More tests would be needed. And it may be OS/platform/hardware specific.

    Do not hesitate to ask more!

    😉

    Regards
    Laurent

  6. Irfan Khan

    We have a 2-node RAC cluster and a standby (Data Guard, single-node RAC). We have 6 application servers connecting to the 2-node RAC cluster (using WebLogic multipools). In our tests, when we crash the database server (halt -q), it takes about 25 minutes to recover.

    I wanted to know how long the Oracle crash recovery process takes and what is so time consuming during this process. Also, is Data Guard causing additional delays?

    Thank you

  7. Laurent Schneider Post author

    What transfer do you use? If you use archived log transfer, it will take a while…

    If you use log transfer (or online redo log hardware mirroring), it should never take that long, unless you use the DELAY parameter…

    I guess the instance recovery takes only a few seconds, doesn’t it? If not, you could set FAST_START_MTTR_TARGET, but I cannot believe this could be the root cause.
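
    For example (60 seconds being just an example value):
    alter system set fast_start_mttr_target=60;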

    Is the crash recovery slower because of RAC? Or is it the same even if you have only one instance running?

  8. Irfan Khan

    I am sorry, I did not explain the test clearly. We are crashing one node of the 2-node RAC cluster. We want it to fail over to the 2nd RAC node (not the standby). I mentioned the standby because I wanted to know if it would cause any delays.

    The transfer is archived logs to the standby node (Data Guard).

    I am a sysadmin (I do a little bit of DBA work). We are testing multipools as opposed to TAF (currently in production; TAF is going out of support by WebLogic).

    Hence we are testing the database server crash. We also tried a shutdown immediate on node A, and node B took over immediately. No loss of any users.

    But when we crash the database server, it takes about 20-25 minutes to recover, plus we lose a LOT of users.

  9. Laurent Schneider Post author

    You mean the WebLogic multipool took that time? In this case you should check with WebLogic support, because I cannot help you much there.

    But if you use TAF, it should not take too long, and if it does, you can use PRECONNECT sessions.

  10. Irfan Khan

    Everything in the WebLogic log looks good. It is just the response time of the DB that is very, very slow. And due to the slow response times, the users are being lost.

    I have a case open with weblogic, and we have multipools setup as they have recommended.

    I was going in the direction that there is a lot of RAC wait (or some crash recovery on RAC) that is taking the time.

    Is there a way to find out all the details of the recovery process and how long it takes?

  11. Laurent Schneider Post author

    Yes, you can use OEM or the views. Also the alert log. But I cannot imagine the instance recovery taking so long.

    You could start with the GV$MTTR_TARGET_ADVICE
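
    For example (just a starting point):
    select * from gv$mttr_target_advice;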

    Can you connect and create a dummy table with sqlplus during those 25 minutes?

  12. Neeraj Bhatia

    It is quite an impressive text on RAC. Very simple language and a wonderful explanation. Just bookmarked the blog…

    Thanks !!
