Description
A couple of weeks ago, we manually migrated OSTs from one OSS to another in our HA setup. One of the OSTs would not unmount, however, and syslog repeatedly reported "Mount still busy with 7 refs after XX secs." (see the log excerpt appended below). After more than 40 minutes, we gave up on the unmount attempt and power-cycled the OSS node.
This is an occasional problem we've seen over the years with a variety of Lustre versions and OSS (and MDS, too, I believe) hardware platforms, in both HA and non-HA setups. Interestingly, it is always 7 refs that keep the umount from completing.
I did not see a ticket in JIRA for this, so I thought I would open one. When this issue comes up, it becomes an outage visible to our users for the affected OSTs, even though we have an HA setup.
However, judging from the old bugzilla.lustre.org, this issue may have been fixed in Lustre versions later than 2.0 (bz21726 and apparently bz19550). It is not clear to me from those bugs, though, which patch actually fixes it. Is there a patch for this that can be ported to 1.8.x?
Thanks,
Craig Prescott
Nov 23 09:18:01 hpcoss2 kernel: Lustre: Mount still busy with 7 refs after 120 secs.
Nov 23 09:26:46 hpcoss2 kernel: Lustre: Mount still busy with 7 refs after 180 secs.
Nov 23 09:36:46 hpcoss2 kernel: Lustre: Mount still busy with 7 refs after 780 secs.
Nov 23 09:46:46 hpcoss2 kernel: Lustre: Mount still busy with 7 refs after 1380 secs.
Nov 23 09:56:46 hpcoss2 kernel: Lustre: Mount still busy with 7 refs after 1980 secs.
Nov 23 10:06:46 hpcoss2 kernel: Lustre: Mount still busy with 7 refs after 2580 secs.
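
For anyone who wants to check their own logs for the same pattern, here is a minimal Python sketch that pulls the reference count and elapsed time out of messages in the format shown above. The regular expression and the function name `parse_busy_lines` are assumptions made for illustration, based only on the lines in this excerpt; other Lustre versions or syslog configurations may format these messages differently.

```python
import re

# Pattern inferred from the syslog excerpt in this ticket (an assumption,
# not taken from the Lustre source).
BUSY_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+(?P<host>\S+)\s+kernel:\s+"
    r"Lustre: Mount still busy with (?P<refs>\d+) refs after (?P<secs>\d+) secs\."
)


def parse_busy_lines(lines):
    """Return (timestamp, host, refs, secs) tuples for matching lines."""
    records = []
    for line in lines:
        match = BUSY_RE.match(line.strip())
        if match:
            records.append(
                (
                    match.group("stamp"),
                    match.group("host"),
                    int(match.group("refs")),
                    int(match.group("secs")),
                )
            )
    return records


if __name__ == "__main__":
    sample = [
        "Nov 23 09:18:01 hpcoss2 kernel: Lustre: Mount still busy with 7 refs after 120 secs.",
        "Nov 23 10:06:46 hpcoss2 kernel: Lustre: Mount still busy with 7 refs after 2580 secs.",
    ]
    for stamp, host, refs, secs in parse_busy_lines(sample):
        print(f"{stamp} {host}: {refs} refs, {secs} s elapsed")
```

Collecting the distinct `refs` values across many incidents would be one way to confirm whether the count really is always 7.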