Working with a customer, we decided to update the 3 Nodes from 10.6 to 10.7 and, at the same time, apply Rolling Patches 1 and 2.
Two days after applying the update, we saw a very strange issue.
The customer uses an SSL Offload configuration, which uses port 80 between the NetScaler and XMS (the same port used by the Cluster Nodes).
For an unknown reason, port 80, and only this port, kept going Down and Up, while the other ports such as 443 and 8443, as well as port 22, kept working.
The only option when the issue appeared was to reboot the node.
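To spot this kind of symptom quickly, a small TCP probe can be run against each node. The following is only a minimal sketch, not a Citrix tool: the function names are my own, and the hostname shown in the usage comment is hypothetical. It checks whether a given port is closed while all the other probed ports still answer.

```python
import socket


def check_ports(host, ports, timeout=2.0):
    """Return {port: True/False}, True when a TCP connection succeeds."""
    results = {}
    for port in ports:
        try:
            with socket.create_connection((host, port), timeout=timeout):
                results[port] = True
        except OSError:
            # Covers refused connections, timeouts and name resolution errors.
            results[port] = False
    return results


def only_port_down(results, port):
    """True when `port` is closed while every other probed port is open."""
    others_up = all(ok for p, ok in results.items() if p != port)
    return (not results[port]) and others_up


# Example usage (hypothetical hostname, replace with a real XMS node):
#   status = check_ports("xms-node1.example.local", [22, 80, 443, 8443])
#   if only_port_down(status, 80):
#       print("Port 80 is down while 22, 443 and 8443 still answer")
```

Run from a scheduler against each node, a check like this only reports that the port is unreachable at that moment; it says nothing about the cause.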
A case was opened with Citrix, and the escalation team found that there was a Hazelcast Cluster Cache issue on XMS.
Citrix is still reviewing the logs to analyse and understand what happened exactly.
As this impacted nearly 10,000 enrolled Devices, we finally decided to Roll Back.
The problem is that even though Roll Back is supported by Citrix, there is no official documentation for it.
So here are the tasks we performed:
- Disable NetScaler vServer => 09:00am
- Download logs and support bundle for future analysis => 09:05am
- List all users enrolled since the update (to ask them to re-enroll) => 09:40am
- Shut down the XMS Nodes => 09:45am
- Back up the Database (10.7 version) => 09:50am
- Create a Snapshot of each Node => 09:55am
- Restore the Database backup taken before the migration (version 10.6) => 10:00am
- Restart the SQL Server, check the logs and verify that the Database is Up and running => 10:15am
- Restore backup of the Node to 10.6 (using Veeam Backup) => 10:20am
- Power up one Node, check system, cluster state, date, log, … => 10:25am
- Enable NetScaler vServer => 11:00am
- Verify enrollment, access to store, … => 11:10am
- Power down this Node and power up another one => 12:10pm
- Repeat the same verification steps => 12:15pm
- Validate configuration => 12:20pm
- Ready to GO in Prod => 04:00pm
- Communicate to users => 04:15pm
Total service downtime: 7h15.
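The total can be cross-checked from the timestamps in the list above. The short sketch below (helper names are my own, step labels are shortened) converts each timestamp to minutes, then computes the overall downtime and the longest gap between two consecutive steps.

```python
def to_minutes(stamp):
    """Convert a stamp such as '09:05am' or '04:15pm' to minutes since midnight."""
    clock, suffix = stamp[:-2], stamp[-2:].lower()
    hours, minutes = map(int, clock.split(":"))
    hours %= 12                      # 12:xx maps to 0 before the pm offset
    if suffix == "pm":
        hours += 12
    return hours * 60 + minutes


def format_duration(minutes):
    """Format a number of minutes as e.g. '7h15'."""
    return f"{minutes // 60}h{minutes % 60:02d}"


# Timestamps copied from the task list, in order.
timeline = [
    ("Disable NetScaler vServer", "09:00am"),
    ("Download logs and support bundle", "09:05am"),
    ("List users enrolled since update", "09:40am"),
    ("Shut down the XMS Nodes", "09:45am"),
    ("Back up the Database (10.7)", "09:50am"),
    ("Snapshot each Node", "09:55am"),
    ("Restore Database backup (10.6)", "10:00am"),
    ("Restart SQL Server and verify", "10:15am"),
    ("Restore Node backup to 10.6", "10:20am"),
    ("Power up one Node and check", "10:25am"),
    ("Enable NetScaler vServer", "11:00am"),
    ("Verify enrollment and store access", "11:10am"),
    ("Swap to another Node", "12:10pm"),
    ("Repeat verification", "12:15pm"),
    ("Validate configuration", "12:20pm"),
    ("Ready to GO in Prod", "04:00pm"),
    ("Communicate to users", "04:15pm"),
]

times = [to_minutes(stamp) for _, stamp in timeline]
total = times[-1] - times[0]

# Gap between each step and the next one, paired with the two step names.
gaps = [
    (times[i + 1] - times[i], timeline[i][0], timeline[i + 1][0])
    for i in range(len(times) - 1)
]
longest = max(gaps)

print("Total downtime:", format_duration(total))          # 7h15
print("Longest gap:", format_duration(longest[0]),
      "between", repr(longest[1]), "and", repr(longest[2]))
```

The longest gap, 3h40, falls between validating the configuration and being ready to go in production.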
Note: This is based on my own experience with one customer's issue.