Unless otherwise noted, articles © 2005-2008 Doug Spencer, SecurityBulletins.com. Linking to articles is welcomed. Articles on this site are general information and are NOT GUARANTEED to work for your specific needs. I offer paid professional consulting services and will be happy to develop custom solutions for your specific needs. View the consulting page for more information.


Veritas Cluster Server (VCS) Troubleshooting

From SecurityBulletins.com

Written by Doug Spencer

The following may be helpful for troubleshooting Veritas Cluster problems. If you need Veritas Cluster expertise, I offer consulting services.

Commands to check cluster status

hastatus -sum # Show a summary of resources

hastatus # Show the running status of resources. VCS keeps tracking frozen resource groups, so you can use this to verify that VCS correctly detects the state of a resource when you bring it online or offline manually.

hagrp -clear GROUP_NAME # Clear a faulted resource group
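
To find out which groups need clearing, the summary output can be filtered. The function below is a minimal sketch, not part of VCS: the name faulted_groups is made up, and it assumes that group lines in hastatus -sum start with "B" and list the group, the system and, in the last column, the state.

```shell
# List "group system" pairs whose state includes FAULTED.
# Reads `hastatus -sum` output on stdin. The column layout
# (B, group, system, ..., state) is an assumption.
faulted_groups() {
    awk '$1 == "B" && $NF ~ /FAULTED/ { print $2, $3 }'
}

# Example (on a cluster node):
#   hastatus -sum | faulted_groups
```

Each line printed names a group and the system on which it is faulted, which is what you need before deciding whether hagrp -clear is appropriate.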

Using gabconfig -a output to determine problems

gabconfig -a # Shows the GAB port memberships for the components VCS needs to implement clustering.

Each letter returned by gabconfig -a identifies a port, and the membership shown for that port lists the nodes on which the corresponding component is available:

   a    GAB driver
   b    I/O fencing (designed to guarantee data integrity)
   d    ODM (Oracle Disk Manager)
   f    CFS (Cluster File System)
   h    VCS (VERITAS Cluster Server: high availability daemon)
   o    VCSMM driver (kernel module needed for Oracle and VCS interface)
   q    QuickLog daemon
   v    CVM (Cluster Volume Manager)
   w    vxconfigd (module for cvm)
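
When reading gabconfig -a output often, it can help to print the port names next to the letters. The following is a rough sketch using the table above. The function name gab_port_names is hypothetical, and the input layout ("Port X gen ID membership NODES") is assumed rather than guaranteed for every release.

```shell
# Translate "Port X ... membership NODES" lines into port names.
# Reads `gabconfig -a` output on stdin. Names are taken from the table
# above; the six-column line layout is an assumption.
gab_port_names() {
    awk '
        BEGIN {
            name["a"] = "GAB driver"; name["b"] = "I/O fencing"
            name["d"] = "ODM";        name["f"] = "CFS"
            name["h"] = "VCS";        name["o"] = "VCSMM driver"
            name["q"] = "QuickLog";   name["v"] = "CVM"
            name["w"] = "vxconfigd"
        }
        $1 == "Port" && $5 == "membership" {
            label = ($2 in name) ? name[$2] : "unknown"
            print $2, label, "members=" $NF
        }'
}

# Example (on a cluster node):
#   gabconfig -a | gab_port_names
```

A port that is missing from the output is not running on the node where the command was run.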


GAB driver (Port a)

The /etc/gabtab file will contain the number of nodes defined in the cluster. During an initial build, the cluster won't fully start until all nodes are seen. The gabtab is in the following format:

 /sbin/gabconfig -c -n2

Here -n2 means that 2 nodes are required to "seed" the cluster. That number should match the actual number of nodes in the cluster. Once that many nodes are seen, the "Port a" membership is established. Running gabconfig -a | grep "Port a" shows the current generation number and member nodes for the Port a membership. This check exists to prevent split-brain conditions, and the data corruption that results if the cluster starts as two or more mini-clusters, each bringing up its own copy of the resources.

If you are certain that no split-brain condition exists, gabconfig -cx can be used to seed the cluster manually, bypassing the protection against pre-existing partitions.
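
To compare the configured seed count with what GAB currently sees, something like the following sketch may be useful. Both function names are made up for illustration. The second function assumes the membership field is a string in which each digit stands for one node.

```shell
# Print the -n seed count from a gabtab-style file (path given as $1).
gabtab_seed_count() {
    sed -n 's/.*gabconfig.*-n *\([0-9][0-9]*\).*/\1/p' "$1" | head -n 1
}

# Count member nodes on the "Port a" line of `gabconfig -a` output (stdin).
# Assumes every digit in the membership field stands for one node.
port_a_member_count() {
    awk '$1 == "Port" && $2 == "a" && $5 == "membership" {
        n = gsub(/[0-9]/, "", $6)
        print n
        exit
    }'
}

# Example (on a cluster node):
#   want=$(gabtab_seed_count /etc/gabtab)
#   have=$(gabconfig -a | port_a_member_count)
#   [ "$want" = "$have" ] || echo "seed count $want, members seen ${have:-0}"
```

If the second function prints nothing, no Port a membership exists yet, which usually means the cluster has not seeded.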

IOFencing driver (Port b)

Port b/IOFencing is started as a result of the /etc/rc2.d/S97vxfen start script. It performs the following actions:

  • reads /etc/vxfendg to determine name of the diskgroup (DG) that contains the coordinator disks
  • parses "vxdisk -o alldgs list" output for list of disks in that DG
  • performs a "vxdisk list diskname" for each to determine all available paths to each coordinator disk
  • uses all paths to each disk in the DG to build a current /etc/vxfentab

The point of these steps is simple: the IOFencing driver needs to find the same shared disks on every node to use as coordinator disks.
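
The first two steps of the start script can be imitated by hand when checking why a node cannot find its coordinator disks. This is only a sketch and not the script's actual code. The function name coordinator_disks is hypothetical, and it assumes the disk group is the fourth column of the vxdisk output, with deported groups shown in parentheses.

```shell
# Print the devices that belong to disk group $1.
# Reads `vxdisk -o alldgs list` output on stdin. Assumes the group name
# is the fourth column and that a deported group appears as "(name)".
coordinator_disks() {
    awk -v dg="$1" '$4 == dg || $4 == ("(" dg ")") { print $1 }'
}

# Example (on a cluster node):
#   vxdisk -o alldgs list | coordinator_disks "$(cat /etc/vxfendg)"
```

Running this on every node and comparing the results shows quickly whether each node sees the same set of coordinator disks.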

Oracle Disk Manager/ODM (Port d)

This port is started by the commands in /etc/rc2.d/S92odm.

Cluster File System/CFS (Port f)

CFS can be reloaded in several ways if required, but doing so means unloading much of VxFS, and it is usually unnecessary.

Veritas Cluster Server/VCS (Port h)

This is the cluster daemon itself.

CVM (ports v and w)

Cluster Volume Manager allows disk groups to be imported and shared by multiple nodes of the Veritas cluster. The IOFencing driver must be running before you can start CVM. You can check CVM status with the following commands:

  • gabconfig -a | egrep "Port v|Port w"
  • vxdctl -c mode
  • vxclustadm -v nodestate

For debugging purposes, you can start CVM manually with the following command on each node:

  vxclustadm -m vcs -t gab startnode
  vxclustadm: initialization completed

All disk groups whose disks carry the "shared" flag should now be imported automatically as shared. You can check their status with:

vxdg list

and look for "enabled,shared" in the result for each shared disk group.
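
With many disk groups, picking out the shared ones by eye gets tedious. A small filter such as the one below can do it. It is a sketch only: shared_diskgroups is a made-up name, and it assumes NAME and STATE are the first two columns of vxdg list.

```shell
# Print disk groups whose state includes both "enabled" and "shared".
# Reads `vxdg list` output on stdin. Assumes NAME and STATE are the
# first two columns.
shared_diskgroups() {
    awk '$2 ~ /enabled/ && $2 ~ /shared/ { print $1 }'
}

# Example (on a cluster node):
#   vxdg list | shared_diskgroups
```

Any disk group you expected to be shared but that is missing from the output deserves a closer look.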

To see if a disk has the shared flag, run:

 vxdisk -o alldgs list | grep shared

and

 vxdisk list DISKNAME 

QuickLog daemon (Port q)

To reload the QuickLog daemon:

 # ps -ef| grep qlog
     root  2099     1  0 13:04:44 ?        0:00 /opt/VRTSvxfs/sbin/qlogckd
 # kill -9 2099
 # modinfo | grep qlog
 195 7821e000  17fc7 208   1  qlog (VxQLOG 3.5_REV-MP1f QuickLog dr)
 # modunload -i 195
 # /opt/VRTSvxfs/sbin/qlogckd
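
The module ID in the modunload step changes from boot to boot, so it is worth looking it up each time instead of copying 195 from the listing above. A possible helper, with the hypothetical name qlog_module_id, assuming the module name is the sixth column of modinfo output as it is in that listing:

```shell
# Print the module ID (first column) of the qlog module.
# Reads Solaris `modinfo` output on stdin. Assumes the module name is
# the sixth column, as in the listing above.
qlog_module_id() {
    awk '$6 == "qlog" { print $1; exit }'
}

# Example (on the affected node, after stopping qlogckd):
#   id=$(modinfo | qlog_module_id)
#   [ -n "$id" ] && modunload -i "$id"
```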

VCSMM (Port o)

VCSMM is required for RAC communications. It is loaded by /etc/rc2.d/S98vcsmm.

Changing cluster status

hagrp -online RESOURCE_GROUP -sys SYSTEM # Bring a resource group online on a particular system

hagrp -switch RESOURCE_GROUP -to SYSTEM # Move a resource group to a particular system

hagrp -autoenable RESOURCE_GROUP # Enable a group that has been autodisabled.
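
Before running hagrp -autoenable, it helps to know which groups are actually autodisabled, and where. The sketch below reads the summary output. The name autodisabled_groups is made up, and the column order (B, group, system, probed, autodisabled, state) is an assumption about the hastatus -sum layout.

```shell
# List "group system" pairs that are flagged as autodisabled.
# Reads `hastatus -sum` output on stdin. Assumes group lines start with
# "B" and that AutoDisabled is the fifth column.
autodisabled_groups() {
    awk '$1 == "B" && $5 == "Y" { print $2, $3 }'
}

# Example (on a cluster node):
#   hastatus -sum | autodisabled_groups
```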

Editing cluster configuration

/etc/VRTSvcs/conf/config/main.cf # The main configuration file for VCS.

  I usually copy the config directory elsewhere and run hacf -verify . in
  the copy, then hacf -cftocmd . followed by hacf -cmdtocf . to rebuild the
  dependency mapping in main.cf. When the result looks good, put the new
  main.cf in place and activate it.
  
  Running hacf -verify alone misses some problems in main.cf and does not
  rebuild the dependency tree diagram in the file.
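
The copy-and-check routine described above could be wrapped up roughly as follows. This is a sketch of one way to do it, not an official procedure. The function name stage_vcs_config and both directory arguments are hypothetical, and nothing is copied back into the live configuration.

```shell
# Copy a VCS config directory to a scratch location and run the three
# hacf passes on the copy. Stops at the first failing step.
#   $1 = existing config directory
#   $2 = scratch directory (must not exist yet)
stage_vcs_config() {
    src=$1
    work=$2
    if [ -e "$work" ]; then
        echo "refusing to overwrite $work" >&2
        return 1
    fi
    cp -R "$src" "$work" || return 1
    (
        cd "$work" || exit 1
        hacf -verify . &&
        hacf -cftocmd . &&
        hacf -cmdtocf .
    )
}

# Example:
#   stage_vcs_config /etc/VRTSvcs/conf/config /var/tmp/vcs-config.check
```

A zero exit status means all three passes succeeded on the copy. Review the regenerated main.cf in the scratch directory before putting it in place.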

tail /var/VRTSvcs/log/engine_A.log # The VCS engine log file

vxdctl -c mode # Determine current node status when using CVM

lltstat # Prints output similar to the following, which helps diagnose the Low Latency Transport (LLT):

LLT statistics:
    15903      Snd data packets
    469        Snd retransmit data
    4384       Snd connect packets
    2999       Snd independent ACKs
    10355      Snd piggyback ACKs
    0          Snd independent NACKs
    0          Snd piggyback NACKs
    4138       Snd loopback packets
    15749      Rcv data packets
    586        Rcv out of window
    0          Rcv duplicates
    0          Rcv datagrams dropped
    0          Rcv multiblock data
    0          Rcv misaligned data
LLT errors:
    0          Rcv not connected
    0          Rcv unconfigured
    0          Rcv bad dest address
    0          Rcv bad source address
    0          Rcv bad generation
    0          Rcv no buffer
    0          Rcv malformed packet
    0          Rcv bad dest SAP
    0          Rcv bad STREAM primitive
    0          Rcv bad DLPI primitive
    0          Rcv DLPI error
    26         Snd not connected
    0          Snd no buffer
    0          Snd stream flow drops
    26         Snd no links up
    0          Rcv bad checksum
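
Most of the error counters are normally zero, so the interesting lines are the ones that are not. The following sketch prints only the non-zero counters from the "LLT errors:" section. The function name llt_nonzero_errors is made up, and it assumes output laid out like the sample above.

```shell
# Print the non-zero counters from the "LLT errors:" section.
# Reads `lltstat` output on stdin. Assumes the layout of the sample
# above: a count, whitespace, then the counter name.
llt_nonzero_errors() {
    awk '
        /^LLT errors:/ { in_errors = 1; next }
        /^LLT /        { in_errors = 0 }
        in_errors && $1 + 0 > 0 {
            count = $1
            sub(/^[ \t]*[0-9]+[ \t]+/, "")
            print $0 ": " count
        }'
}

# Example (on a cluster node):
#   lltstat | llt_nonzero_errors
```

For the sample output above this prints "Snd not connected: 26" and "Snd no links up: 26".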

Running lltstat -nvv shows a verbose status for each Low Latency Transport (LLT) interface. Use it to check that each interface is plugged into the right destination: it shows what the node thinks its own interface names are and what it thinks the remote interface names are. Running the command on all nodes gives a map of the overall LLT network.

Files

/etc/gabtab

/etc/llttab

Other common problems

SCSI reservations are sometimes a problem on a RAC or other cluster file system install: a reservation can remain held by a node that is no longer available.

Consulting

Put my experience to work to improve your Veritas Cluster Server infrastructure. I offer professional consulting services. E-mail sales@securitybulletins.com or click to have GrandCentral call to set up a service contract.
