Saturday, February 19, 2011

Single Client Access Name (SCAN)

Disclaimer: This document is not a replacement for any ORACLE document or whitepaper. This is merely meant for understanding certain fundamental aspects of Oracle 11g Release 2 Real Application Clusters (RAC) and its core components.




SINGLE CLIENT ACCESS NAME (SCAN)


Single Client Access Name (SCAN) is a new Oracle Real Application Clusters (RAC) 11g Release 2 feature that provides a single name for clients to access Oracle Databases running in a cluster.


The benefit is that the client's connect information does not need to change if you add or remove nodes in the cluster. Having a single name to access the cluster allows clients to use EZConnect and the simple JDBC thin URL to access any database running in the cluster, independent of which server(s) in the cluster the database is active on. SCAN provides load balancing and failover for client connections to the database, and works as a cluster alias for databases in the cluster.




NETWORK REQUIREMENTS FOR USING SCAN


The SCAN is configured during the installation of Oracle Grid Infrastructure, which is distributed with Oracle Database 11g Release 2. Oracle Grid Infrastructure is a single Oracle Home that contains Oracle Clusterware and Oracle Automatic Storage Management. You must install Oracle Grid Infrastructure first in order to use Oracle RAC 11g Release 2. During the interview phase of the Oracle Grid Infrastructure installation, you will be prompted to provide a SCAN name. There are two options for defining the SCAN:


1. Define the SCAN in your corporate DNS (Domain Name Service)
2. Use the Grid Naming Service (GNS)


USING OPTION 1 - DEFINE THE SCAN IN YOUR CORPORATE DNS


If you choose Option 1, you must ask your network administrator to create a single name that resolves to three IP addresses using a round-robin algorithm. Three IP addresses are recommended for load balancing and high availability, regardless of the number of servers in the cluster. The IP addresses must be on the same subnet as your public network in the cluster. The name must be 15 characters or less in length, not including the domain, and must be resolvable without the domain suffix (for example, “sanrac-scan” must be resolvable as opposed to “sanrac-scan.xyz.com”). The IPs must not be assigned to a network interface on the cluster, since Oracle Clusterware will take care of that.
You can check the SCAN configuration in DNS using “nslookup”. If your DNS is set up to provide round-robin access to the IPs resolved by the SCAN entry, then run the “nslookup” command at least twice to see the round-robin algorithm work. The result should be that each time, the “nslookup” would return a set of 3 IPs in a different order.
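For example, on a Linux client the check might look roughly like this (illustrative only; the name, domain and IP addresses are the sample values used in this post, and the exact output format varies by platform and resolver):

$ nslookup sanrac-scan
Name:    sanrac-scan.xyz.com
Address: 150.1.1.34
Name:    sanrac-scan.xyz.com
Address: 150.1.1.33
Name:    sanrac-scan.xyz.com
Address: 150.1.1.32

$ nslookup sanrac-scan
Name:    sanrac-scan.xyz.com
Address: 150.1.1.33
Name:    sanrac-scan.xyz.com
Address: 150.1.1.32
Name:    sanrac-scan.xyz.com
Address: 150.1.1.34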


EZConnect:      sqlplus system/manager@sanrac-scan:1521/oltp
JDBC thin URL:  jdbc:oracle:thin:@sanrac-scan:1521/oltp

DNS entry:
sanrac-scan.xyz.com    IN A 150.1.1.34
                       IN A 150.1.1.33
                       IN A 150.1.1.32


Note: If your DNS server does not return a set of three IPs, or does not round-robin, ask your network administrator to enable such a setup. DNS round-robin on its own does not ensure failover of connections; however, the Oracle client typically handles this, which is why the recommended minimum client version is the Oracle Database 11g Release 2 client.


USING OPTION 2 - THE GRID NAMING SERVICE (GNS)


If you choose Option 2, you only need to enter the SCAN during the interview. During cluster configuration, three IP addresses will be acquired from a DHCP service (using GNS assumes you have a DHCP service available on your public network) to create the SCAN, and name resolution for the SCAN will be provided by GNS.
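Whichever option you choose, once the Grid Infrastructure installation is complete you can verify the SCAN and SCAN listener setup with srvctl (a rough sketch; the output depends on your cluster configuration):

$ srvctl config scan
$ srvctl config scan_listener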





HOW CONNECTION LOAD BALANCING WORKS USING SCAN


For clients connecting using Oracle SQL*Net 11g Release 2, three IP addresses will be received by the client by resolving the SCAN name through DNS as discussed. The client will then go through the list it receives from the DNS and try connecting through one of the IPs received. If the client receives an error, it will try the other addresses before returning an error to the user or application. This is similar to how client connection failover works  in previous releases when an address list is provided in the client connection string.     
When a SCAN Listener receives a connection request, the SCAN Listener will check for the least loaded instance providing the requested service. It will then re-direct the connection request to the local listener on the node where the least loaded instance is running. Subsequently, the client will be given the address of the local listener. The local listener will finally create the connection to the database instance.  
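This redirection works because each database instance registers its services with the SCAN listeners through the REMOTE_LISTENER initialization parameter, which in 11g Release 2 normally points at the SCAN name and port. A quick way to confirm this from any instance (the value shown is just the sample SCAN used in this post):

SQL> SHOW PARAMETER remote_listener

NAME              TYPE     VALUE
----------------- -------- ------------------------
remote_listener   string   sanrac-scan.xyz.com:1521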











Sunday, February 6, 2011

Calculate Value for UNDO_RETENTION


Calculate UNDO_RETENTION for a given UNDO tablespace
You can choose to allocate a specific size for the UNDO tablespace and then set the UNDO_RETENTION parameter to an optimal value according to the UNDO size and the database activity. If your disk space is limited and you do not want to allocate more space than necessary to the UNDO tablespace, this is the way to proceed. The following queries will help you to optimize the UNDO_RETENTION parameter.
Because these queries use the V$UNDOSTAT statistics, run them only after the database has been running with UNDO for a significant time.






Actual Undo Size
SQL> SELECT SUM(a.bytes) "UNDO_SIZE"
  2    FROM v$datafile a,
  3         v$tablespace b,
  4         dba_tablespaces c
  5   WHERE c.contents = 'UNDO'
  6     AND c.status = 'ONLINE'
  7     AND b.name = c.tablespace_name
  8     AND a.ts# = b.ts#;

 UNDO_SIZE
----------
1.4098E+10

Undo Blocks per Second
SQL> SELECT MAX(undoblks/((end_time-begin_time)*3600*24))
  2        "UNDO_BLOCK_PER_SEC"
  3    FROM v$undostat;

UNDO_BLOCK_PER_SEC
------------------
              4.01

DB Block Size
SQL> SELECT TO_NUMBER(value) "DB_BLOCK_SIZE [Byte]"
  2   FROM v$parameter
  3  WHERE name = 'db_block_size';

DB_BLOCK_SIZE [Byte]
--------------------
                8192

Optimal Undo Retention
SELECT d.undo_size/(1024*1024)          "ACTUAL UNDO SIZE [MByte]",
       SUBSTR(e.value,1,25)             "UNDO RETENTION [Sec]",
       ROUND((d.undo_size / (TO_NUMBER(f.value) *
             g.undo_block_per_sec)))    "OPTIMAL UNDO RETENTION [Sec]"
  FROM (
        SELECT SUM(a.bytes) undo_size
          FROM v$datafile a,
               v$tablespace b,
               dba_tablespaces c
         WHERE c.contents = 'UNDO'
           AND c.status = 'ONLINE'
           AND b.name = c.tablespace_name
           AND a.ts# = b.ts#
       ) d,
       v$parameter e,
       v$parameter f,
       (
        SELECT MAX(undoblks/((end_time-begin_time)*3600*24))
               undo_block_per_sec
          FROM v$undostat
       ) g
 WHERE e.name = 'undo_retention'
   AND f.name = 'db_block_size'
/
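Plugging in the sample figures above, the optimal retention works out to UNDO_SIZE / (DB_BLOCK_SIZE * UNDO_BLOCK_PER_SEC) = 14,098,000,000 / (8192 * 4.01), or roughly 429,000 seconds (about 5 days). If you decide to adopt the value your own query returns, it can be set dynamically (a sketch only; assumes an spfile is in use and that 429000 is the value you calculated):

SQL> ALTER SYSTEM SET UNDO_RETENTION = 429000 SCOPE=BOTH;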

Saturday, February 5, 2011

Real Application Clusters 1 | Fundamentals

Disclaimer: This document is not a replacement for any ORACLE document or whitepaper. This is merely meant for understanding certain fundamental aspects of Real Application Clusters (RAC) and its core components.


RAC Architecture
Oracle Real Application Clusters allows multiple instances, running on multiple nodes, to access a single database. In a standard Oracle configuration a database can only be mounted by one instance, but in a RAC environment many instances can access a single database.



Oracle RAC is heavily dependent on an efficient, highly reliable, high-speed private network called the interconnect; when designing a RAC system, make sure you get the best interconnect you can afford.
The list below describes the differences between a standard Oracle database (single instance) and a RAC environment, component by component.
SGA
  Single instance: the instance has its own SGA
  RAC: each instance has its own SGA

Background processes
  Single instance: the instance has its own set of background processes
  RAC: each instance has its own set of background processes

Datafiles
  Single instance: accessed by only one instance
  RAC: shared by all instances (shared storage)

Control Files
  Single instance: accessed by only one instance
  RAC: shared by all instances (shared storage)

Online Redo Logfiles
  Single instance: dedicated for write/read by only one instance
  RAC: only one instance can write, but other instances can read during recovery and archiving; if an instance is shut down, log switches by other instances can force the idle instance's redo logs to be archived

Archived Redo Logfiles
  Single instance: dedicated to the instance
  RAC: private to the instance, but other instances will need access to all required archive logs during media recovery

Flash Recovery Area
  Single instance: accessed by only one instance
  RAC: shared by all instances (shared storage)

Alert Log and Trace Files
  Single instance: dedicated to the instance
  RAC: private to each instance; other instances never read or write to those files

ORACLE_HOME
  Single instance: multiple instances on the same server accessing different databases can use the same executable files
  RAC: same as single instance, plus it can be placed on a shared file system, allowing a common ORACLE_HOME for all instances in a RAC environment
RAC Components
The major components of an Oracle RAC system are:
  • Shared disk system
  • Oracle Clusterware
  • Cluster Interconnects
  • Oracle Kernel Components

Disk architecture
With today's SAN and NAS disk storage systems, sharing storage is fairly easy, and it is required for a RAC environment. You can use any of the storage setups below:
  • SAN (Storage Area Network) - generally using fibre to connect to the SAN
  • NAS (Network Attached Storage) - generally using a network to connect to the NAS using either NFS or iSCSI
  • JBOD - direct attached storage, the old traditional way and still used by many companies as a cheap option
All of the above solutions can offer multi-pathing to reduce SPOFs within the RAC environment, and there is no reason not to configure it: the cost of adding additional paths to the disk is low, because most of the expense is paid out when configuring the first path, so an additional controller card and network/fibre cables are all that is needed.
The last thing to think about is how to set up the underlying disk structure; this is known as a RAID level. There are about 12 different RAID levels that I know of; here are the most common ones:
raid 0 (Striping)
A number of disks are concatenated together to give the appearance of one very large disk.
Advantages
   Improved performance
   Can Create very large Volumes

Disadvantages
   Not highly available (if one disk fails, the volume fails)
raid 1 (Mirroring)
A single disk is mirrored by another disk, if one disk fails the system is unaffected as it can use its mirror.
Advantages
   Improved performance
   Highly Available (if one disk fails the mirror takes over)

Disadvantages
   Expensive (requires double the number of disks)
raid 5
RAID stands for Redundant Array of Inexpensive Disks. In RAID 5 the disks are striped with parity across 3 or more disks; the parity is used in the event that one of the disks fails, and the data on the failed disk is reconstructed using the parity.
Advantages
   Improved performance (read only)
   Not expensive

Disadvantages
   Slow write operations (caused by having to create the parity bit)
There are many other RAID levels that can be used with a particular hardware environment; for example, EMC storage uses RAID-S and HP storage uses AutoRAID, so check with the manufacturer for the solution that will give you the best performance and resilience.
Once you have your storage attached to the servers, you have three choices on how to set up the disks:
  • Raw Volumes - normally used for performance benefits, however they are hard to manage and back up
  • Cluster File System - used to hold all the Oracle datafiles; can be used by Windows and Linux, but it is not widely used
  • Automatic Storage Management (ASM) - Oracle's choice of storage management; it is a portable, dedicated and optimized cluster file system
Oracle Clusterware
Oracle Clusterware software is designed to run Oracle in cluster mode. It can support up to 64 nodes, and it can even be used with a vendor cluster like Sun Cluster.
The Clusterware software allows nodes to communicate with each other and forms the cluster that makes the nodes work as a single logical server. The software is run by the Cluster Ready Services (CRS) using the Oracle Cluster Registry (OCR), which records and maintains the cluster and node membership information, and the voting disk, which acts as a tiebreaker during communication failures. Consistent heartbeat information travels across the interconnect to the voting disk when the cluster is running.
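As a quick sanity check once Clusterware is running, you can ask CRS whether its daemons are healthy and list the cluster members (a rough sketch; run from the Clusterware home, and the exact output wording differs between releases):

$ crsctl check crs
$ olsnodes -n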
The CRS has four components
  • OPROCd - Process Monitor Daemon
  • CRSd - CRS daemon; the failure of this daemon results in a node being rebooted to avoid data corruption
  • OCSSd - Oracle Cluster Synchronization Service Daemon (updates the registry)
  • EVMd - Event Manager Daemon
The OPROCd daemon provides the I/O fencing for the Oracle cluster. It uses the hangcheck timer or watchdog timer for cluster integrity, it is locked into memory, and it runs as a realtime process; failure of this daemon results in the node being rebooted. Fencing is used to protect the data: if a node were to have problems, fencing presumes the worst and protects the data by restarting the node in question - it's better to be safe than sorry.
The CRSd process manages resources such as starting and stopping the services and failover of the application resources; it also spawns separate processes to manage application resources. CRS manages the OCR and stores the current known state of the cluster, and it requires a public, private and VIP interface in order to run. OCSSd provides synchronization services among nodes: it provides access to the node membership and enables basic cluster services, including cluster group services and locking. Failure of this daemon causes the node to be rebooted to avoid split-brain situations.
The below functions are covered by the OCSSd
  • CSS provides basic Group Services Support; it is a distributed group membership system that allows applications to coordinate activities to achieve a common result.
  • Group services use vendor clusterware group services when it is available.
  • Lock services provide the basic cluster-wide serialization locking functions, it uses the First In, First Out (FIFO) mechanism to manage locking
  • Node services use the OCR to store data and update the information during reconfiguration; they also manage the OCR data, which is otherwise static.
The last component is the Event Management Logger, which runs the EVMd process. The daemon spawns a process called evmlogger and generates events when things happen. The evmlogger spawns new child processes on demand and scans the callout directory to invoke callouts. Death of the EVMd daemon will not halt the instance; the daemon is simply restarted.
Quick recap

OPROCd - Process Monitor
  Functionality: provides basic cluster integrity services
  Failure of the process: node restart
  Runs as: root

EVMd - Event Management
  Functionality: spawns a child process event logger and generates callouts
  Failure of the process: daemon automatically restarted, no node restart
  Runs as: oracle

OCSSd - Cluster Synchronization Services
  Functionality: basic node membership, group services, basic locking
  Failure of the process: node restart
  Runs as: oracle

CRSd - Cluster Ready Services
  Functionality: resource monitoring, failover and node recovery
  Failure of the process: daemon restarted automatically, no node restart
  Runs as: root

The Cluster Ready Services (CRS) stack is a new component in 10g RAC; it is installed in a separate home directory called ORACLE_CRS_HOME. It is a mandatory component but can be used with a third-party cluster (Veritas, Sun Cluster); by default it manages the node membership functionality along with managing regular RAC-related resources and services.

RAC uses a membership scheme: any node wanting to join the cluster has to become a member. RAC can evict any member that it sees as a problem; its primary concern is protecting the data. You can add and remove nodes from the cluster and the membership increases or decreases accordingly. When network problems occur, membership becomes the deciding factor on which part stays as the cluster and which nodes get evicted; this is where the voting disk comes in, which I will talk about later.
The resource management framework manages the resources of the cluster (disks, volumes); you can only have one resource management framework per resource. Multiple frameworks are not supported, as they can lead to undesirable effects.
The Oracle Cluster Ready Services (CRS) uses the registry to keep the cluster configuration; it should reside on shared storage accessible to all nodes within the cluster. This shared storage is known as the Oracle Cluster Registry (OCR) and it is a major part of the cluster. It is automatically backed up by the daemons (every 4 hours), and you can also back it up manually. The OCSSd uses the OCR extensively and writes the changes to the registry.
The OCR keeps details of all resources and services; it stores name and value pairs of information, such as resources, that are used by the CRS stack to manage the resource equivalents. Resources within the CRS stack are components that are managed by CRS and have information on their good/bad state and the callout scripts. The OCR is also used to supply bootstrap information (ports, nodes, etc.); it is a binary file.
The OCR is loaded as a cache on each node; each node will update its cache, but only one node, called the master, is allowed to write the cache to the OCR file. The Enterprise Manager also uses the OCR cache. The OCR should be at least 100MB in size. The CRS daemon will update the OCR about the status of the nodes in the cluster during reconfigurations and failures.
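A few Clusterware utilities help here (a sketch only; ocrcheck reports the OCR location, size and integrity, ocrconfig -showbackup lists the automatic backups taken by the daemons, and ocrconfig -export writes a manual logical backup - the file name below is just an example):

$ ocrcheck
$ ocrconfig -showbackup
$ ocrconfig -export /tmp/ocr_backup.dmp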
The voting disk (or quorum disk) is shared by all nodes within the cluster; information about the cluster is constantly being written to the disk, which is known as the heartbeat. If for any reason a node cannot access the voting disk, it is immediately evicted from the cluster. This protects the cluster from split-brain situations (the Instance Membership Recovery (IMR) algorithm is used to detect and resolve split-brains), as the voting disk decides which part is really the cluster. The voting disk manages the cluster membership and arbitrates cluster ownership during communication failures between nodes. Voting is often confused with quorum; they are similar but distinct, and the definitions below detail what each means:
Voting: a vote is usually a formal expression of opinion or will in response to a proposed decision.
Quorum: the number, usually a majority, of the members of a body that, when assembled, is legally competent to transact business.
The only vote that counts is the quorum member vote; the quorum member vote defines the cluster. If a node or group of nodes cannot achieve a quorum, they should not start any services because they risk conflicting with an established quorum.
The voting disk has to reside on shared storage; it is a small file (about 20MB) that can be accessed by all nodes in the cluster. In Oracle 10g R1 you can have only one voting disk, but in R2 you can have up to 32 voting disks, allowing you to eliminate any SPOFs.
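You can list the voting disks that Clusterware is currently using with crsctl (a quick sketch; the output shows the path of each configured voting disk on your shared storage):

$ crsctl query css votedisk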
The original virtual IP mechanism in Oracle was Transparent Application Failover (TAF); this had limitations and has now been replaced with cluster VIPs. The cluster VIPs will fail over to working nodes if a node should fail; these public IPs are configured in DNS so that users can access them. The cluster VIPs are different from the cluster interconnect IP addresses and are only used to access the database.
The cluster interconnect is used to synchronize the resources of the RAC cluster, and also to transfer some data from one instance to another. This interconnect should be private, highly available and fast with low latency; ideally it should be at minimum a private 1Gb network. Whatever hardware you are using, the NICs should use multi-pathing. You can use crossover cables in a QA/DEV environment, but they are not supported in a production environment; crossover cables also limit you to a two-node cluster.
Oracle Kernel Components
The kernel components relate to the background processes, the buffer cache and the shared pool; managing these resources without conflicts and corruption requires special handling.
In RAC, as more than one instance accesses the same resources, the instances require better coordination at the resource management level. Each node has its own set of buffers but is able to request and receive data blocks currently held in another instance's cache. The management of data sharing and exchange is done by the Global Cache Services (GCS).
All the resources in the cluster group form a central repository called the Global Resource Directory (GRD), which is distributed. Each instance masters some set of resources, and together all instances form the GRD. The resources are equally distributed among the nodes based on their weight. The GRD is managed by two services, Global Cache Services (GCS) and Global Enqueue Services (GES); together they form and manage the GRD. When a node leaves the cluster, the GRD portion of that instance needs to be redistributed to the surviving nodes, and a similar action is performed when a new node joins.
RAC Background Processes
Each node has its own background processes and memory structures; there are additional processes beyond the usual ones to manage the shared resources, and these additional processes maintain cache coherency across the nodes.
Cache coherency is the technique of keeping multiple copies of a buffer consistent between different Oracle instances on different nodes. Global cache management ensures that access to a master copy of a data block in one buffer cache is coordinated with the copy of the block in another buffer cache.
The sequence of an operation would go as below:
  1. When instance A needs a block of data to modify, it reads the block from disk; before reading it must inform the GCS (DLM). The GCS keeps track of the lock status of the data block by keeping an exclusive lock on it on behalf of instance A.
  2. Now instance B wants to modify that same data block; it too must inform the GCS. The GCS will then request instance A to release the lock; thus the GCS ensures that instance B gets the latest version of the data block (including instance A's modifications) and then exclusively locks it on instance B's behalf.
  3. At any one point in time, only one instance has the current copy of the block, thus keeping the integrity of the block.
The GCS maintains data coherency and coordination by keeping track of the lock status of each block that can be read or written by any node in the RAC. The GCS is an in-memory database that contains information about current locks on blocks and about instances waiting to acquire locks. This is known as Parallel Cache Management (PCM). The Global Resource Manager (GRM) helps to coordinate and communicate the lock requests from Oracle processes between instances in the RAC. Each instance has a buffer cache in its SGA; to ensure that each RAC instance obtains the block that it needs to satisfy a query or transaction, RAC uses two services, the GCS and the GES, which maintain records of the lock status of each data file and each cached block using the GRD.
So what is a resource? It is an identifiable entity: it basically has a name or a reference, and it can be an area in memory, a disk file or an abstract entity. A resource can be owned or locked in various states (exclusive or shared). Any shared resource is lockable, and if it is not shared no access conflict will occur.
A global resource is a resource that is visible to all the nodes within the cluster. Data buffer cache blocks are the most obvious and most heavily used global resource; transaction enqueues and database data structures are other examples. The GCS handles data buffer cache blocks and the GES handles all the non-data-block resources.
All caches in the SGA are either global or local: the dictionary and buffer caches are global, while the large and Java pool buffer caches are local. Cache fusion is used to read the data buffer cache from another instance instead of getting the block from disk; thus cache fusion moves current copies of data blocks between instances (hence why you need a fast private network), and the GCS manages the block transfers between the instances.
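As a rough way to observe cache fusion activity, you can look at the global cache transfer statistics from any instance (a sketch only; these statistic names are the 10g/11g ones and may differ in other releases):

SQL> SELECT inst_id, name, value
  2    FROM gv$sysstat
  3   WHERE name IN ('gc cr blocks received', 'gc current blocks received')
  4   ORDER BY inst_id, name;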
Finally we get to the processes
Oracle RAC Daemons and Processes
LMSn
Lock Manager Server process - GCS
this is the cache fusion part and the most active process; it handles the consistent copies of blocks that are transferred between instances. It receives requests from LMD to perform lock requests, and it rolls back any uncommitted transactions. There can be up to ten LMS processes running, and they can be started dynamically if demand requires it.
they manage lock manager service requests for GCS resources and send them to a service queue to be handled by the LMSn process. It also handles global deadlock detection and monitors for lock conversion timeouts.
as a performance gain you can increase this process priority to make sure CPU starvation does not occur
you can see the statistics of this daemon by looking at the view X$KJMSDP
LMON
Lock Monitor Process - GES
this process manages the GES; it maintains consistency of the GCS memory structures in case of process death. It is also responsible for cluster reconfiguration and lock reconfiguration (node joining or leaving), and it checks for instance deaths and listens for local messaging.
A detailed log file is created that tracks any reconfigurations that have happened.
LMD
Lock Manager Daemon - GES
this manages the enqueue manager service requests for the GCS. It also handles deadlock detection and remote resource requests from other instances.
you can see the statistics of this daemon by looking at the view X$KJMDDP
LCK0
Lock Process - GES
manages instance resource requests and cross-instance call operations for shared resources. It builds a list of invalid lock elements and validates lock elements during recovery.
DIAG
Diagnostic Daemon
This is a lightweight process, it uses the DIAG framework to monitor the health of the cluster. It captures information for later diagnosis in the event of failures. It will perform any necessary recovery if an operational hang is detected.