Irrecoverable part III : a fix

After part I (the problem) and part II (two reports), here is part III.

We backed up archivelogs during a full backup and deleted them before the full backup completed.


RMAN> list backup of archivelog all;

List of Backup Sets
===================

BS Size   Type Elapsed  Completion
-- ------ ---- -------- ----------
15 4.00K  DISK 00:00:00   13:31:08
    BP Key: 15   Status: AVAILABLE
    Piece Name: /FULL/0fq7gc0s_1_1

  List of Archived Logs in backup set 15
  Seq LowSCN  LowTime  NextSCN NextTime
  --- ------- -------- ------- --------
  15  355533  13:29:55 355777  13:31:08

BS Size   Type Elapsed  Completion
-- ------ ---- -------- ----------
25 4.00K  DISK 00:00:00   13:31:26
    BP Key: 25   Status: AVAILABLE
    Piece Name: /ARCH/0pq7gc1e_1_1

  List of Archived Logs in backup set 25
  Seq LowSCN  LowTime  NextSCN NextTime
  --- ------- -------- ------- --------
  15  355533  13:29:55 355777  13:31:08

BS Size   Type Elapsed  Completion
-- ------ ---- -------- ----------
26 3.00K  DISK 00:00:00 13:31:26
  BP Key: 26   Status: AVAILABLE
  Piece Name: /ARCH/0qq7gc1e_1_1

  List of Archived Logs in backup set 26
  Seq LowSCN  LowTime  NextSCN NextTime
  --- ------- -------- ------- --------
  16  355777  13:31:08 355827  13:31:22

BS Size   Type Elapsed  Completion
-- ------ ---- -------- ----------
28 2.50K  DISK 00:00:00 13:31:28
    BP Key: 28   Status: AVAILABLE
    Piece Name: /FULL/0sq7gc1g_1_1

  List of Archived Logs in backup set 28
  Seq LowSCN  LowTime  NextSCN NextTime
  --- ------- -------- ------- --------
  17  355827  13:31:22 355863  13:31:28

Sequence 16 is missing in /FULL/ but is available in /ARCH/.
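The same check can be done mechanically once the listing gets longer. Here is a minimal Python sketch, assuming the (directory, sequence) pairs have been transcribed by hand from the LIST BACKUP output above; the function names are mine, not RMAN's.

```python
from collections import defaultdict

# (directory, sequence) pairs transcribed from the LIST BACKUP output above
backed_up = [
    ("/FULL/", 15),
    ("/ARCH/", 15),
    ("/ARCH/", 16),
    ("/FULL/", 17),
]

def missing_sequences(pairs, directory):
    """Sequences absent from `directory` between its lowest and highest sequence."""
    by_dir = defaultdict(set)
    for d, seq in pairs:
        by_dir[d].add(seq)
    present = by_dir[directory]
    if not present:
        return []
    return [s for s in range(min(present), max(present) + 1) if s not in present]

def found_elsewhere(pairs, directory, seq):
    """Other directories that hold a backup of sequence `seq`."""
    return sorted({d for d, s in pairs if s == seq and d != directory})

missing = missing_sequences(backed_up, "/FULL/")
sources = {seq: found_elsewhere(backed_up, "/FULL/", seq) for seq in missing}
```

With the four backup sets above, `missing` holds the one sequence that /FULL/ lacks and `sources` tells where to copy it from.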

Ok, let’s copy the missing pieces from ARCH to FULL


$ cd /
$ cp -p ARCH/* FULL/

Now let’s try. I saved my directory to an offline location (tape/usb/cdrom) and dropped my database including backups.


cd /
tar cvf /tmp/FULL.tar FULL

Let’s drop (it’s just a test)


RMAN> startup force mount dba 
RMAN> drop database including backups;
database dropped

Ok let’s get this USB stick back from the moon to my datacenter and see if I can restore …


cd /
tar xvf /tmp/FULL.tar 

Let’s do this !


RMAN> startup force nomount
RMAN> restore controlfile from '/FULL/c-2414975447-20150521-01';
RMAN> alter database mount;

Now I want RMAN to look for the other files in /FULL/ and to mark as expired whatever is no longer there. CATALOG START WITH does the magic.

RMAN> catalog start with '/FULL/' noprompt;
RMAN> crosscheck backup;
RMAN> crosscheck copy;
RMAN> crosscheck archivelog all;
RMAN> restore database;
RMAN> recover database;
RMAN> alter database open resetlogs;

As @dbastorage said on twitter:
“if it was easy, we would not need DBAs! ”

Irrecoverable full backup part II : reporting

After my post Can you restore from a full online backup ?, I needed to come up with a report.

Assuming that each backup goes in a different directory, I just wrote two reports.

  1. Report gaps in v$backup_redolog (or rc_backup_redolog if you use the catalog)
    
    DIR     FIRST_CHANGE# NEXT_CHANGE#
    ------- ------------- ------------
    /bck01/        284891       285140
    /bck01/        285140       285178
    
    /bck02/        284891       285140
    === GAP ===
    /bck02/        285178       285245 
    
    /bck03/        285178       285245
    /bck03/        285245       286931
    /bck03/        286931       287803
    /bck03/        287803       288148
    

    This could be done with analytics, by checking, within a directory, where the previous next_change# is not the current first_change#.

    
    SELECT dir, 
      LAG missing_from_change#, 
      first_change# missing_to_change#
    FROM (
      SELECT REGEXP_REPLACE (handle, '[^/\]+$') dir,
        first_change#,
        next_change#,
        LAG(next_change#) OVER (
          PARTITION BY REGEXP_REPLACE (handle, '[^/\]+$')
          ORDER BY first_change#
        ) LAG
      FROM v$backup_piece p
      JOIN v$backup_redolog l 
        USING (set_stamp, set_count))
    WHERE LAG != first_change#;
    
    DIR     MISSING_FROM_CHANGE# MISSING_TO_CHANGE#
    ------- -------------------- ------------------
    /bck02/               285140             285178
    
  2. Reports directories where archivelogs don’t include changes (backup redolog) from the earliest to the latest checkpoint (backup datafile)
    
    SELECT 
      REGEXP_REPLACE (handle, '[^/\]+$') dir,
      MIN (checkpoint_change#),
      MAX (checkpoint_change#),
      MIN (first_change#),
      MAX (next_change#)
    FROM v$backup_piece p
      LEFT JOIN v$backup_datafile f 
        USING (set_stamp, set_count)
      LEFT JOIN v$backup_redolog l 
        USING (set_stamp, set_count)
    WHERE handle IS NOT NULL
    HAVING 
      MIN (checkpoint_change#) < MIN (first_change#)
      OR 
      MAX (checkpoint_change#) > MAX (next_change#)
    GROUP BY REGEXP_REPLACE (handle, '[^/\]+$');
    
    DIR     MINCHECKPT MAXCHECKPT   MINFIRST    MAXNEXT
    ------- ---------- ---------- ---------- ----------
    /bck04/     954292     954299     959487    1145473
    

    The archives for the changes from 954292 to 959487 are missing.
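For readers who prefer to see the logic outside SQL, here is a minimal Python sketch of both checks. It assumes the rows have already been fetched; the data is copied from the sample listings above and the function names are hypothetical.

```python
from itertools import groupby

# (dir, first_change#, next_change#) rows, copied from the first sample listing
redo_rows = [
    ("/bck01/", 284891, 285140),
    ("/bck01/", 285140, 285178),
    ("/bck02/", 284891, 285140),
    ("/bck02/", 285178, 285245),
    ("/bck03/", 285178, 285245),
    ("/bck03/", 285245, 286931),
    ("/bck03/", 286931, 287803),
    ("/bck03/", 287803, 288148),
]

def find_gaps(rows):
    """Report 1: within a directory, previous next_change# != current first_change#."""
    gaps = []
    for directory, group in groupby(sorted(rows), key=lambda r: r[0]):
        previous_next = None
        for _, first_change, next_change in group:
            if previous_next is not None and previous_next != first_change:
                gaps.append((directory, previous_next, first_change))
            previous_next = next_change
    return gaps

def uncovered(min_ckpt, max_ckpt, min_first, max_next):
    """Report 2: the redo range does not span the datafile checkpoints."""
    return min_ckpt < min_first or max_ckpt > max_next

gaps = find_gaps(redo_rows)
bck04_uncovered = uncovered(954292, 954299, 959487, 1145473)
```

`find_gaps` mirrors the LAG comparison and `uncovered` mirrors the HAVING clause, so the two should flag the same directories as the queries.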

If some archive backups are missing from one directory, it does not mean the database is irrecoverable: the archive backups could be in another directory. But it does mean that this directory alone would no longer let you restore or duplicate.

Another approach with RESTORE PREVIEW was provided by Franck in my previous post : List all RMAN backups that are needed to recover.

Usual disclaimer: there are plenty of other causes of irrecoverability, from hardware defects to backup “optimization”, that are beyond the scope of this post.

Can you restore from a full online backup ?

The question is not HOW TO DO IT but WHETHER YOU CAN DO IT !

A typical backup script would contain something like BACKUP DATABASE PLUS ARCHIVELOG:


backup database format
'/u99/backup/DB01/20150518/full_0_%d_s%s_p%p'
plus archivelog format
'/u99/backup/DB01/20150518/arc_%d_s%s_p%p';

Starting backup at 2015-05-18_18:27:55
current log archived
input archived log thread=1 sequence=469
...
piece handle=
/u99/backup/DB01/20150518/arc_DB01_s86_p1
Finished backup at 2015-05-18_18:27:58

Starting backup at 2015-05-18_18:27:58
input datafile file number=00002 name=
/u02/oradata/DB01/undots01.dbf
...
including current control file in backup set
including current SPFILE in backup set
piece handle=
/u99/backup/DB01/20150518/full_0_DB01_s88_p1
Finished backup at 2015-05-18_18:28:16

Starting backup at 2015-05-18_18:28:16
current log archived
input archived log thread=1 sequence=20
piece handle=
/u99/backup/DB01/20150518/arc_DB01_s89_p1
Finished backup at 2015-05-18_18:28:17

This sounds pretty safe, but what happens if you keep this full backup forever ? Does it contain everything you need to restore ?

It depends. Chances are that you can restore. To annoy the paranoiacs, here is a counterexample.

  1. you start your full backup at 6pm.
    backing up sequence 21,22,23,24
    backup datafiles…

  2. at 7pm a cron job issues backup archivelog all delete input;
    backing up and deleting sequence 21,22,23,24,25,26,27,28,29,30

  3. at 8pm your backup is about to finish
    backing up sequence 31

Well, where is my sequence 27 ?
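The timeline can be replayed with plain sets. This is only a sketch of the scenario above; that sequences 25 to 30 get archived between 6pm and 7pm is my assumption, the list only says the cron job found them.

```python
def simulate():
    on_disk = {21, 22, 23, 24}   # archived logs present at 6pm
    full_dir = set()             # sequences stored with the full backup
    arch_dir = set()             # sequences stored by the cron job

    # 6pm: BACKUP DATABASE PLUS ARCHIVELOG backs up what is on disk
    full_dir |= on_disk

    # assumption: logs 25..30 are archived while the datafiles are backed up
    on_disk |= set(range(25, 31))

    # 7pm: cron job runs BACKUP ARCHIVELOG ALL DELETE INPUT
    arch_dir |= on_disk
    on_disk.clear()

    # 8pm: the full backup archives the current log (31) and backs up what is on disk
    on_disk.add(31)
    full_dir |= on_disk
    return full_dir, arch_dir

full_dir, arch_dir = simulate()
needed = set(range(21, 32))
missing_from_full = sorted(needed - full_dir)
```

`missing_from_full` lists every sequence that only the cron job's backup holds, which is exactly what a restore from the full backup directory alone would be lacking.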

Let’s try


rm /tmp/arch_DB01*
startup force mount;
crosscheck backup;
restore database;


Starting restore at 2015-05-18_18:47:45
channel ORA_DISK_1: restore complete, elapsed time: 00:02:05
Finished restore at 2015-05-18_18:49:51

Ok, the restore was fine. Now what?


RMAN> recover database;

Starting recover at 2015-05-18_18:50:35
using channel ORA_DISK_1

starting media recovery

RMAN-00571: =================================================
RMAN-00569: ========== ERROR MESSAGE STACK FOLLOWS ==========
RMAN-00571: =================================================
RMAN-03002: failure of recover command at 05/18/2015 18:50:36
RMAN-06053: unable to perform media recovery because of missing log
RMAN-06025: no backup of archived log for thread 1 with sequence 30
RMAN-06025: no backup of archived log for thread 1 with sequence 29
RMAN-06025: no backup of archived log for thread 1 with sequence 27

RMAN> alter database open;

RMAN-00571: =================================================
RMAN-00569: ========== ERROR MESSAGE STACK FOLLOWS ==========
RMAN-00571: =================================================
RMAN-03002: failure of alter db command at 05/18/2015 18:51:29
ORA-01113: file 1 needs media recovery
ORA-01110: data file 1: '/u02/oradata/DB01/system01.dbf'

You have lost your database. Even if you did a full online backup plus archivelog, it did not contain all the archivelogs and RMAN did not tell you.

There is more than one way to reduce this annoyance :

1) don’t back up archivelogs during a full backup
2) don’t delete archivelog backups done during a full backup
3) report any archivelog backups that run during a full. Either by looking in the RMAN dictionary or in your in-house logfiles
4) take offline backups for long term archives
5) don’t trust RMAN

And as well :
Test your backups 😀
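Option 3 boils down to an interval overlap test. A minimal sketch, assuming you have extracted start and end times of each backup from your own logs; the timestamps below are made up for illustration.

```python
from datetime import datetime

def overlaps(full_start, full_end, arch_start, arch_end):
    """True if the archivelog backup ran at any moment while the full was running."""
    return arch_start < full_end and full_start < arch_end

# hypothetical timestamps, matching the 6pm / 7pm / 8pm scenario
full_start = datetime(2015, 5, 18, 18, 0)
full_end = datetime(2015, 5, 18, 20, 0)

cron_during = (datetime(2015, 5, 18, 19, 0), datetime(2015, 5, 18, 19, 5))
cron_after = (datetime(2015, 5, 18, 21, 0), datetime(2015, 5, 18, 21, 5))

during = overlaps(full_start, full_end, *cron_during)
after = overlaps(full_start, full_end, *cron_after)
```

Any archivelog backup for which `overlaps` returns True deserves a closer look, especially if it ran with DELETE INPUT.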

Rman backup compression

Did you know you can make your backup at least twice as fast with a single line ?

Demo :


RMAN> backup as compressed backupset database;
Starting backup at 2013-06-05_13:08:01
...
Finished backup at 2013-06-05_13:13:59

6 minutes for a compressed backup on a NAS with 24 Channels and 100Gb of raw data. Not bad. But look at this !


RMAN> configure compression algorithm 'low';
new RMAN configuration parameters are successfully stored
RMAN> backup as compressed backupset database;
Starting backup at 2013-06-05_14:06:09
...
Finished backup at 2013-06-05_14:08:29
RMAN> configure compression algorithm clear;
RMAN configuration parameters are successfully reset to default value

By configuring this magic parameter, it is now more than twice as fast ! This is incredible !
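To check the claim, the elapsed times can be computed from the Starting/Finished timestamps of the two runs above. A small Python sketch; only the four timestamps come from the logs.

```python
from datetime import datetime

FMT = "%Y-%m-%d_%H:%M:%S"

def elapsed_seconds(start, finish):
    """Seconds between two timestamps written in the RMAN log format used above."""
    return (datetime.strptime(finish, FMT) - datetime.strptime(start, FMT)).total_seconds()

default_run = elapsed_seconds("2013-06-05_13:08:01", "2013-06-05_13:13:59")
low_run = elapsed_seconds("2013-06-05_14:06:09", "2013-06-05_14:08:29")
speedup = default_run / low_run
```

That is 358 seconds against 140 seconds, a ratio of roughly 2.5.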

Go to your cashier and grab some coins to get this amazing advanced compression option !

Rman and DBGSQL message

I have not seen DBGSQL very often. But today again, a duplicate in RMAN was failing with, among other errors, sqlcode 911.


RMAN> duplicate target database to DB02
  until time "to_date('2013-01-29_00:00:00','YYYY-MM-DD_HH24:MI:SS')"
  nofilenamecheck ;

DBGSQL:     TARGET> select 2013-01-29_00:00:00 from sys.dual
DBGSQL:        sqlcode = 911
RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03002: failure of Duplicate Db command at 02/01/2013 17:58:23
RMAN-05501: aborting duplication of target database
ORA-01861: literal does not match format string

I have not found anything useful except Bug 9351175, which is fixed in 11.2 (and I have 11.2.0.2/aix), but I could work around the problem with


unset NLS_DATE_FORMAT

Because I like to have readable timestamps (with seconds) in my RMAN logs, I had set NLS_DATE_FORMAT in my spfile:


alter system set nls_date_format='YYYY-MM-DD_HH24:MI:SS' scope=spfile;
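ORA-01861 is the generic complaint that a date literal does not fit the format it is parsed with. The Python analogy below shows only that mechanism; it is not a reproduction of what RMAN does internally, and the second format is an arbitrary example of a different date format.

```python
from datetime import datetime

literal = "2013-01-29_00:00:00"
readable_format = "%Y-%m-%d_%H:%M:%S"   # Python spelling of YYYY-MM-DD_HH24:MI:SS
other_format = "%d-%b-%y"               # Python spelling of a DD-MON-YY style format

def matches(text, fmt):
    """True if `text` can be parsed with `fmt`, False on a format mismatch."""
    try:
        datetime.strptime(text, fmt)
        return True
    except ValueError:
        return False

fits_readable = matches(literal, readable_format)
fits_other = matches(literal, other_format)
```

The same text parses fine with one format and is rejected with the other, which is the kind of disagreement the error message points at.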

How big was my database last month

If you back up your database at least weekly, you could check the datafile size from your backup history.

Without catalog :


select to_char(d,'"W"IW-IYYY') compl, sum(b)/1024/1024/1024*8192 GB 
from 
(
  select max(DATAFILE_BLOCKS) b, trunc(completion_time,'IW') d 
  from v$backup_datafile 
  group by FILE# ,trunc(completion_time,'IW')
) 
group by d 
order by d;


COMPL        GB
-------- ------
W30-2012   3.73
W31-2012   4.84
W32-2012   5.00
W33-2012   5.05
W34-2012   5.35
W35-2012   5.80
W36-2012   6.12
W37-2012   6.39
W38-2012    .93
W39-2012   7.02
W40-2012   7.56
W41-2012   7.72
W42-2012   7.88
W43-2012   8.08
W44-2012   8.83
W45-2012   9.03
W46-2012   9.45
W47-2012   9.61
W48-2012  10.11
W49-2012  10.29
W50-2012  10.38

The history mostly depends on control_file_record_keep_time. If you do not use an RMAN catalog, set it to a high value like 93 (3 months) or 366 (1 year).
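The arithmetic of the query is: take the largest block count per datafile and per ISO week, sum over the datafiles, multiply by the block size. A minimal Python sketch of that, assuming 8192-byte blocks like the query does; the sample rows are invented so that the result is easy to verify.

```python
from collections import defaultdict
from datetime import date

BLOCK_SIZE = 8192  # bytes; same assumption as the query above

def weekly_size_gb(rows, block_size=BLOCK_SIZE):
    """rows: iterable of (file_no, completion_date, datafile_blocks)."""
    max_blocks = defaultdict(int)
    for file_no, completed, blocks in rows:
        iso = completed.isocalendar()
        key = ((iso[0], iso[1]), file_no)          # (ISO year, ISO week), file
        max_blocks[key] = max(max_blocks[key], blocks)
    totals = defaultdict(int)
    for (week, _file_no), blocks in max_blocks.items():
        totals[week] += blocks
    return {week: blocks * block_size / 1024**3 for week, blocks in sorted(totals.items())}

# hypothetical rows: file 1 backed up twice in week 30 of 2012, file 2 once
sample = [
    (1, date(2012, 7, 23), 100000),
    (1, date(2012, 7, 25), 131072),   # 131072 blocks of 8K = 1 GB
    (2, date(2012, 7, 25), 262144),   # 262144 blocks of 8K = 2 GB
]
sizes = weekly_size_gb(sample)
```

If your tablespaces use other block sizes, pass the right `block_size` per file instead of relying on the 8192 default.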

With the rman catalog, use the RC_ view


select DB_NAME,to_char(d,'"W"IW-IYYY') compl, sum(b)/1024/1024/1024*8192 GB 
from 
(
  select DB_NAME,max(DATAFILE_BLOCKS) b, trunc(completion_time,'IW') d 
  from rc_backup_datafile 
  group by DB_NAME,FILE# ,trunc(completion_time,'IW')
) 
group by DB_NAME,d 
order by DB_NAME,d;


DB_NAME  COMPL        GB
-------- -------- ------
DB01     W30-2012   3.73
DB01     W31-2012   4.83
DB01     W32-2012   5.00
DB01     W33-2012   5.05
DB01     W34-2012   5.34
DB01     W35-2012   5.79
DB01     W36-2012   6.11
DB01     W37-2012   6.39
DB01     W38-2012    .93
DB01     W39-2012   7.01
DB01     W40-2012   7.56
DB01     W41-2012   7.71
DB01     W42-2012   7.87
DB01     W43-2012   8.08
DB01     W44-2012   8.82
DB01     W45-2012   9.02
DB01     W46-2012   9.44
DB01     W47-2012   9.60
DB01     W48-2012  10.10
DB01     W49-2012  10.28
DB01     W50-2012  10.37

If you need to check which table grows the most, check How big was my table yesterday. But remember: RMAN backup is free to use, whereas AWR and the WRI$ tables require the Diagnostics Pack and the Enterprise Edition.

Troubleshoot ORA-10878

You will probably not hit this bug unless you perform some media recovery in 11.2.0.1/AIX.

Ok. In case you hit ORA-10878: parallel recovery slave died unexpectedly during a DUPLICATE or a RESTORE command, you can disable parallel media recovery with _log_parallelism_max=1.

The usual warning applies : do not use hidden parameters without guidance from Oracle Support. Open an SR if you hit this bug. Check for a patch on your platform. Read notes 9728806.8 and 315631.1.

Note: for a RECOVER, the option RECOVER NOPARALLEL should be safer. Unfortunately there is no such thing as DUPLICATE NOPARALLEL.

Update: This could also happen with a standby. If you have stopped your standby site for a while and after restart you get ORA-10878 and ORA-00448, possibly with core dumps or internal errors, then stop dataguard (set dg_broker_start to false) and start the recovery manually with the noparallel option, until all logs are applied. Once this is done, you can restart dataguard, which in normal operation will then apply only one log at a time.
Dug out from Helios’s Blog

RMAN duplicate does change your DB_NAME !

I had a very serious issue last Friday with errors as weird as ORA-00322: log name of thread num is not current copy. After a clone from Prod to Test, the prod crashed. Both databases are located on the same server (I am not a virtualization fanatic), and a clone from prod to test is something most of my dba readers have done.

What did change in 11g ?

Incredibly, in 11g, rman issues the following statement before restore

sql clone "alter system set  db_name = ''PROD'' ...
restore clone primary controlfile...

This is probably related to the capability of cloning a database without connecting to the target database.

At the end of the clone, rman sets the db_name back to TEST and recreates the TEST controlfile

sql statement: alter system set  db_name = ''TEST'' ...
sql statement: CREATE CONTROLFILE REUSE SET DATABASE "TEST" ...
...
LOGFILE
GROUP 1 ('/.../TEST/redo1.dbf')...

So what’s wrong with this? How could a clone from prod to test screw up the prod db???

Simple, the RMAN job did not complete

1) set new name, restore prod controlfile to test
2) restore issue, for instance ORA-19870: error while restoring backup piece archive1234
3) RMAN-03002: failure of Duplicate Db command

At this point, the complete restore was finished, so we restored the missing archivelog, recovered and opened with resetlogs.
What happened then???
Remember that at this point you still have the prod controlfile (and the prod db_name), so by doing an alter database open resetlogs, the production redologs get overwritten without notice !

This is a fairly important change that could really hurt if you are cloning two databases on the same server.

In case you are trying to save a failed database clone, make sure you check db_name and also v$logfile before doing an alter database open resetlogs!!!
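That check is easy to script. Here is a minimal Python sketch, assuming you have already queried db_name and the v$logfile members, and assuming each database keeps its redo logs under a directory named after it (as in '/.../TEST/redo1.dbf' above). The function name and the sample paths are hypothetical.

```python
def resetlogs_problems(db_name, logfile_members, expected_name):
    """Return the reasons why an OPEN RESETLOGS would be dangerous (empty list = none found)."""
    problems = []
    if db_name.upper() != expected_name.upper():
        problems.append(f"db_name is {db_name}, expected {expected_name}")
    # assumption: redo logs live under a directory named after the database
    marker = f"/{expected_name.upper()}/"
    for member in logfile_members:
        if marker not in member.upper():
            problems.append(f"redo log {member} is outside {marker}")
    return problems

# hypothetical state after a failed duplicate: still the prod controlfile
failed_clone = resetlogs_problems("PROD", ["/u02/oradata/PROD/redo1.dbf"], "TEST")
# hypothetical state after a completed duplicate
good_clone = resetlogs_problems("TEST", ["/u02/oradata/TEST/redo1.dbf"], "TEST")
```

Only proceed when the list is empty; a naming convention is no proof, so still read v$logfile with your own eyes.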

backup database keep forever logs

If you want to keep some backups forever, you may have tried

RMAN> backup database keep forever logs;

Starting backup at 13.04.2007 13:58:04
...
backup will never be obsolete
archived logs required to recover from this backup 
will expire when this backup expires
...
Finished backup at 13.04.2007 13:58:23

But the problem is that none of the archivelogs generated after this backup, which are needed to recover it to any later point in time, will EVER become obsolete.

You could well try the NOLOGS option, but this requires you to take the database in the MOUNT state.

RMAN> backup database keep forever nologs;

Starting backup at 13.04.2007 14:06:36
...
backup will never be obsolete
archived logs required to recover from this backup 
will not be kept
...
Finished backup at 13.04.2007 14:07:25

This is fine if you can stop your database. But you probably want an online backup. What’s next?

Ok, here is the way to go. You do your online backup, then you mark what you want to keep !

First I backup the old archivelogs, because I do not need to keep those yet.

RMAN> backup archivelog all;

Starting backup at 13.04.2007 14:10:00
...
Finished backup at 13.04.2007 14:10:08

Now I do a backup plus archivelog (with a tag for simplicity)

RMAN> backup database tag backuplsc 
plus archivelog tag backuplsc;

Starting backup at 13.04.2007 14:10:42
...
Finished backup at 13.04.2007 14:11:00

Now I can mark my backup as keep

RMAN> change backup tag backuplsc keep forever;

...
keep attributes for the backup are changed
backup will never be obsolete
backup set key=405 RECID=116 STAMP=619798257
keep attributes for the backup are changed
backup will never be obsolete
backup set key=406 RECID=117 STAMP=619798260

Now if I do a delete obsolete, it will never delete my backup.

RMAN> backup database plus archivelog;
Starting backup at 13.04.2007 14:16:46
...
Finished backup at 13.04.2007 14:17:10

RMAN> delete noprompt obsolete;

RMAN retention policy will be applied to the command
RMAN retention policy is set to redundancy 1
...
Deleting the following obsolete backups and copies:
...
Deleted 7 objects

RMAN> list backup summary;

List of Backups
===============
Key     TY LV S Device Type Completion Time    
------- -- -- - ----------- ------------------- 
 #Pieces #Copies Compressed Tag
------- ------- ---------- ---
...
405     B  F  A DISK        13.04.2007 14:10:57 
1       1       YES        BACKUPLSC
406     B  F  A DISK        13.04.2007 14:11:00 
1       1       YES        BACKUPLSC
...
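The effect of the KEEP attribute can be pictured with a deliberately simplified model: backups marked keep are exempt, and among the others only the newest ones (redundancy 1 here) survive. This Python sketch is my own toy model, not RMAN's actual retention algorithm; keys 401 and 410 are invented, 405 is the kept backup set from the listing.

```python
def obsolete(backups, redundancy=1):
    """Toy model: keys of non-keep backups beyond the newest `redundancy` ones.

    backups: list of dicts with 'key', 'completed' (sortable) and 'keep' (bool).
    """
    regular = sorted(
        (b for b in backups if not b["keep"]),
        key=lambda b: b["completed"],
        reverse=True,
    )
    return [b["key"] for b in regular[redundancy:]]

backups = [
    {"key": 401, "completed": 1, "keep": False},  # hypothetical older backup
    {"key": 405, "completed": 2, "keep": True},   # tagged BACKUPLSC, keep forever
    {"key": 410, "completed": 3, "keep": False},  # hypothetical latest backup
]
to_delete = obsolete(backups)
```

In this model the older regular backup is deleted, while the kept one stays no matter how many newer backups follow.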