Last week I released a blog article about gotchas when deploying a vIDM cluster with vRLCM 8.1. This week it's time to reveal Gotcha 2 and, if time allows, also Gotcha 3.
Gotcha 2 is all about powering the vIDM cluster off and on. The preferred way to power a vIDM cluster on or off is through the Day 2 Operations of the globalenvironment in the vRLCM GUI.
Go to Lifecycle Operations and navigate to Environments.
Next, go to "VIEW DETAILS" of your globalenvironment and click the 3 dots. This is where the Day 2 Operations for your environment are located. In the list of Day 2 Operations you will find Power On and Power Off.
When the Power On or Power Off Day 2 Operations are not used, there is a risk that the vIDM cluster will no longer start. This can happen, for example, when a vSphere HA event occurs or when the vIDM virtual machines are powered on or off directly with the vSphere Client.
If this happens, it is good to know some troubleshooting steps. VMware released the following KB article specifically on this topic: https://kb.vmware.com/s/article/75080
In my situation, most of the time when a vIDM cluster was not powered off via the vRLCM GUI, the DelegateIP was gone from the vIDM virtual appliance running as the primary postgres instance. In addition, one or both of the secondary postgres instances ended up in a 'down' state.
To find out which vIDM node is configured as the primary postgres instance, run the following command on one of the vIDM nodes in the cluster. (If you are prompted for a password, just press Enter.)
su postgres -c "echo -e 'password'|/opt/vmware/vpostgres/current/bin/psql -h localhost -p 9999 -U pgpool postgres -c \"show pool_nodes\""
In the above screenshot you can see that the vIDM node with IP address 10.1.0.31 is the primary postgres instance. You can also see that the vIDM node with IP address 10.1.0.40 is in a 'down' state.
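For readers who cannot see the screenshot: the "show pool_nodes" output looks roughly like the sample below. Treat it purely as an illustration; the hostnames other than 10.1.0.31 and 10.1.0.40 are made up, and the exact column layout and status values vary between pgpool versions. A short awk pipeline can then pull out any instance reporting 'down':

```shell
# Illustrative "show pool_nodes" output only - the real output comes from
# the psql command above, and columns differ between pgpool versions.
pool_nodes_output=' node_id | hostname  | port | status | lb_weight | role
---------+-----------+------+--------+-----------+---------
 0       | 10.1.0.31 | 5432 | up     | 0.333333  | primary
 1       | 10.1.0.39 | 5432 | up     | 0.333333  | standby
 2       | 10.1.0.40 | 5432 | down   | 0.333333  | standby'

# Print the node_id and hostname of every instance whose status is "down"
echo "$pool_nodes_output" | awk -F'|' 'NR>2 && $4 ~ /down/ {gsub(/ /,"",$1); gsub(/ /,"",$2); print $1, $2}'
```

The node_id printed here is exactly what the recovery command later in this article expects for its -n parameter.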
To validate whether we are hitting the "No DelegateIP assigned to the primary postgres instance" issue, run the following command on the vIDM node running as the primary postgres instance.
ifconfig eth0:0 | grep 'inet addr:' | cut -d: -f2
If the command returns the DelegateIP, as in the screenshot below, you are not hitting this specific issue. However, if the command returns nothing, you are.
Make sure the DelegateIP is not held by any of the non-primary instances by running the above ifconfig command on the other instances. If any of the non-primary instances still hold the DelegateIP, run the following command on them first to detach it.
ifconfig eth0:0 down
Run the below command on the primary instance to re-assign the DelegateIP.
ifconfig eth0:0 inet <DelegateIP> netmask <Netmask>
After you re-assign the DelegateIP, restart the horizon service on all vIDM nodes by running "service horizon-workspace restart".
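The DelegateIP check above can be sketched as a small script. This is a sketch only: the sample_output text stands in for a real "ifconfig eth0:0" call on the appliance, the 10.1.0.50 address is an invented example, and the classic net-tools output format is assumed.

```shell
#!/bin/sh
# Sketch: detect whether a DelegateIP is bound to eth0:0.
# sample_output stands in for a real "ifconfig eth0:0" call; the address
# and MAC are examples, and classic net-tools formatting is assumed.
sample_output='eth0:0    Link encap:Ethernet  HWaddr 00:50:56:aa:bb:cc
          inet addr:10.1.0.50  Bcast:10.1.0.255  Mask:255.255.255.0'

# Same extraction as the grep/cut pipeline above, plus a second cut to
# strip the trailing "Bcast" token so only the address remains
delegate_ip=$(printf '%s\n' "$sample_output" | grep 'inet addr:' | cut -d: -f2 | cut -d' ' -f1)

if [ -z "$delegate_ip" ]; then
  echo 'No DelegateIP bound - re-assign it on the primary instance:'
  echo '  ifconfig eth0:0 inet <DelegateIP> netmask <Netmask>'
  echo '  then restart horizon-workspace on all vIDM nodes'
else
  echo "DelegateIP present: $delegate_ip"
fi
```

Running the check on the real appliance is just a matter of replacing the sample text with the live ifconfig call.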
If you also hit the second issue, where one or more secondary vIDM postgres instances are in a 'down' state, you can use the following procedure to fix it.
First, shut down the postgres service on the impacted vIDM postgres instance(s) by running "service vpostgres stop".
Second, run the following command to recover the impacted vIDM postgres instance. (The default password for the pgpool user is "password".)
/usr/local/bin/pcp_recovery_node -h delegateIP -p 9898 -U pgpool -n <node_id>
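When more than one instance is down, the two steps above can be tied together in a short sketch that prints the recovery command for every instance still reporting 'down'. The pool_nodes sample and the 10.1.0.50 DelegateIP are placeholders; in practice you would feed in the real "show pool_nodes" output, use your own DelegateIP, and run "service vpostgres stop" on each impacted node before recovering it.

```shell
#!/bin/sh
# Sketch: print a pcp_recovery_node command for each "down" instance.
# DELEGATE_IP and the pool_nodes text are placeholders / illustrations.
DELEGATE_IP='10.1.0.50'
pool_nodes=' node_id | hostname  | port | status
---------+-----------+------+--------
 0       | 10.1.0.31 | 5432 | up
 1       | 10.1.0.39 | 5432 | up
 2       | 10.1.0.40 | 5432 | down'

# Extract the node_id of every "down" instance and build the command
printf '%s\n' "$pool_nodes" | awk -F'|' 'NR>2 && $4 ~ /down/ {gsub(/ /,"",$1); print $1}' |
while read -r node_id; do
  # Remember: run "service vpostgres stop" on the impacted node first
  echo "/usr/local/bin/pcp_recovery_node -h $DELEGATE_IP -p 9898 -U pgpool -n $node_id"
done
```

This only prints the commands rather than executing them, so you can review each one before running it against the cluster.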
Finally, validate that all of the vIDM postgres instances are up again.
su postgres -c "echo -e 'password'|/opt/vmware/vpostgres/current/bin/psql -h localhost -p 9999 -U pgpool postgres -c \"show pool_nodes\""
That’s it for now. Hopefully this info was useful for you.
In my next blog I will continue to reveal even more gotchas.