Permanent Device Loss and All Paths Down conditions in Hyper-V 2012 R2

As you’d expect, PDL and APD scenarios in vSphere are covered at length in the VMware KBs and in hundreds of blogs. In contrast, there’s next to nothing on how Hyper-V 2012 R2 reacts to them. Odd, to me at least. Or perhaps my search terms need to be a bit different. Anyway, this blog post covers how Hyper-V reacts to a LUN going down. In a single line: the process is similar to how ESXi reacts. For the details, read on.

[Diagram: LUN loss timeline (lunloss_draw_io)]

Ignore the name of the disk; the quorum had already been migrated to another CSV on a different array, and I’m using this old quorum (aka witness) disk as the example.

Here’s the timeline:

LUN yanked: 11.24.30 Cluster resource ‘Quorum_Disk’ in clustered role ‘Available Storage’ has transitioned from state Online to state ProcessingFailure.

State change 1: 11.24.30 Cluster resource ‘Quorum_Disk’ in clustered role ‘Available Storage’ has transitioned from state ProcessingFailure to state WaitingToTerminate. Cluster resource ‘Quorum_Disk’ is waiting on the following resources: .

State change 2: 11.24.30 Cluster resource ‘Quorum_Disk’ in clustered role ‘Available Storage’ has transitioned from state WaitingToTerminate to state Terminating.

State change 3: 11.24.30 Cluster resource ‘Quorum_Disk’ in clustered role ‘Available Storage’ has transitioned from state Terminating to state DelayRestartingResource.

State change 4: 11.24.46 Physical Disk resource ‘7f839997-62b2-40be-8391-0aa61bebe78f’ has been disconnected from this node.

As you can see above, it took 16 seconds from the moment the LUN was pulled to the time the cluster disconnected it (All Paths Down).
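If you want to double-check that number, it falls straight out of the two timestamps above. Here’s a throwaway Python sketch (the timestamps are hard-coded from the events; nothing is read from the event log):

    from datetime import datetime

    # Timestamps copied from the two events above (HH.MM.SS).
    lun_pulled   = datetime.strptime("11.24.30", "%H.%M.%S")  # Online -> ProcessingFailure
    disconnected = datetime.strptime("11.24.46", "%H.%M.%S")  # Physical Disk resource disconnected

    print((disconnected - lun_pulled).seconds, "seconds")  # -> 16 seconds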

Next, the cluster tries to bring the disk back by issuing an online call.

State change 5: 11.24.46 Cluster resource ‘Quorum_Disk’ in clustered role ‘Available Storage’ has transitioned from state DelayRestartingResource to state OnlineCallIssued.

State change 6: 11.24.46 Cluster resource ‘Quorum_Disk’ in clustered role ‘Available Storage’ has transitioned from state OnlineCallIssued to state ProcessingFailure.

State change 7: 11.24.46 The Cluster service failed to bring clustered role ‘Available Storage’ completely online or offline. One or more resources may be in a failed or an offline state. This may impact the availability of the clustered role.

State change 8: 11.24.46 Clustered service or application ‘Available Storage’ has exceeded its failover threshold. It has exhausted the configured number of failover attempts within the failover period of time allotted to it and will be left in a failed state. No additional attempts will be made to bring the service or application online or fail it over to another node in the cluster. Please check the events associated with the failure. After the issues causing the failure are resolved the service or application can be brought online manually or the cluster may attempt to bring it online again after the restart delay period.

State change 9: 11.24.46 Cluster resource ‘Quorum_Disk’ in clustered role ‘Available Storage’ has transitioned from state ProcessingFailure to state WaitingToTerminate. Cluster resource ‘Quorum_Disk’ is waiting on the following resources: .

State change 10: 11.24.46 Cluster resource ‘Quorum_Disk’ in clustered role ‘Available Storage’ has transitioned from state WaitingToTerminate to state Terminating.

State change 11: 11.24.46 Cluster resource ‘Quorum_Disk’ in clustered role ‘Available Storage’ has transitioned from state Terminating to state CannotComeOnlineOnThisNode.

At this point, the LUN was deemed not to be coming back on this node (Permanent Device Loss).
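Boiled down, the chain of resource states from the events above looks like this. A minimal Python sketch that walks the observed transitions and confirms they end in CannotComeOnlineOnThisNode; it models what the logs show, not how the cluster service is actually implemented:

    # Resource state transitions as observed in the events above.
    OBSERVED_TRANSITIONS = [
        ("Online",                  "ProcessingFailure"),           # LUN yanked
        ("ProcessingFailure",       "WaitingToTerminate"),
        ("WaitingToTerminate",      "Terminating"),
        ("Terminating",             "DelayRestartingResource"),     # disk disconnected (APD)
        ("DelayRestartingResource", "OnlineCallIssued"),            # restart attempt
        ("OnlineCallIssued",        "ProcessingFailure"),           # attempt failed
        ("ProcessingFailure",       "WaitingToTerminate"),
        ("WaitingToTerminate",      "Terminating"),
        ("Terminating",             "CannotComeOnlineOnThisNode"),  # PDL
    ]

    state = OBSERVED_TRANSITIONS[0][0]
    for src, dst in OBSERVED_TRANSITIONS:
        assert src == state, f"gap in the chain at {src}"
        state = dst

    print("final state:", state)  # -> CannotComeOnlineOnThisNode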

State change 12 (this was me removing it for good from Failover Cluster Manager, aka FCM): 11.31.10 Cluster resource ‘Quorum_Disk’ was removed from the failover cluster.

Note:

  • The above timeline was from a host that did not have ownership of the LUN.
  • All hosts in the cluster logged the same events.
  • In FCM, only one host owns a LUN at any given time. The owner host logged a few extra events: it cycled the LUN across all the hosts as the cluster tried to bring it back online, which took an extra 20 seconds.
  • All up, it took 16 + 20 = 36 seconds to go from normal operation to PDL and the cluster ceasing to issue IO against the LUN.

I’m going to try to get our storage vendor to tell me what logs (if any) are generated on the storage controllers in this scenario. I’d imagine some kind of SCSI sense codes are sent down to the hosts to signal the device’s loss. If I get this information, I’ll update this post.
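For comparison, these are the sense codes VMware documents as PDL triggers on the ESXi side. I’d speculate the array returns something similar to the Windows hosts, but I haven’t confirmed that for Hyper-V, so treat this little lookup table as illustrative only:

    # (sense key, ASC, ASCQ) -> meaning, per standard SCSI assignments and
    # VMware's PDL documentation. Illustrative only; not confirmed against
    # what our array actually sends to the Hyper-V hosts.
    PDL_SENSE_CODES = {
        (0x5, 0x25, 0x00): "ILLEGAL REQUEST - LOGICAL UNIT NOT SUPPORTED",
        (0x4, 0x4C, 0x00): "HARDWARE ERROR - LOGICAL UNIT FAILED SELF-CONFIGURATION",
        (0x4, 0x3E, 0x01): "HARDWARE ERROR - LOGICAL UNIT FAILURE",
        (0x4, 0x3E, 0x03): "HARDWARE ERROR - LOGICAL UNIT FAILED SELF-TEST",
    }

    def describe(sense_key, asc, ascq):
        return PDL_SENSE_CODES.get((sense_key, asc, ascq), "not a documented PDL trigger")

    print(describe(0x5, 0x25, 0x00))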

So there you have it: not much different from how ESXi would react to a LUN disappearing.

5 Comments

  1. Great explanation. Thanks!

  2. Great article, I really enjoyed it. You show how Hyper-V reacts to a LUN going down, which cleared up most of my doubts, but I still have a question from the storage side. Can you please share the details of the storage controller, if you have them, so that we can understand better?

    • Tell me what you want to know about the storage controllers, Vikrant. Let me know exactly what you’re after and I’ll see if I can get the information.

      • Hi Manny, thank you for the quick reply. I want to know about the storage logs: were any storage logs generated or not?

        • Hi Vikrant, with the array we have, no logs were generated. I asked the storage guys about it and I was told the vendor could enable deep logging to get the SCSI codes. Next time I get a chance to do this, I’ll ask them.
