Details
Bug
Resolution: Fixed
Critical
Lustre 2.10.3
None
centos7.4, 3.10.0-693.17.1.el7.x86_64, zfs 0.7.6, lustre 2.10.3 + patch from https://review.whamcloud.com/#/c/31354/, OPA, skylake, raidz1 over hardware raid1 MDTs on SSD.
1
Description
Hi,
finds were hanging on the main filesystem from one client, and the processes looked to be unkillable. I rebooted the client running the finds and restarted the find sweep, but they hung again.
I then failed over all the MDTs to one MDS (we have 2), and that went ok. I then failed all the MDTs back to the other MDS and it LBUG'd.
kernel: LustreError: 49321:0:(lu_object.c:1177:lu_device_fini()) ASSERTION( atomic_read(&d->ld_ref) == 0 ) failed: Refcount is 1
since then 2 of the MDTs won't connect. they are stuck in WAITING state and never get to RECOVERING or COMPLETE.
[warble1]root: cat /proc/fs/lustre/mdt/dagg-MDT0001/recovery_status
status: WAITING
non-ready MDTs: 0000
recovery_start: 1523093864
time_waited: 388
[warble1]root: cat /proc/fs/lustre/mdt/dagg-MDT0002/recovery_status
status: WAITING
non-ready MDTs: 0000
recovery_start: 1523093864
time_waited: 391
the other MDT is ok.
[warble2]root: cat /proc/fs/lustre/mdt/dagg-MDT0000/recovery_status
status: COMPLETE
recovery_start: 1523093168
recovery_duration: 30
completed_clients: 122/122
replayed_requests: 0
last_transno: 214748364800
VBR: DISABLED
IR: DISABLED
I've tried umounting and remounting a few times, but time_waited just keeps incrementing. it gets to 900s, spits out a message, and then looks like it keeps going forever.
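for watching this across all MDTs on a server, a rough sketch like the one below could flag targets that are still WAITING past a threshold. it assumes each recovery_status file is plain "key: value" lines as in the output above; the function names are hypothetical, and the 900s default is only taken from the point where the message appears here, not from any documented Lustre timeout.

```python
import glob


def parse_recovery_status(text):
    """Parse 'key: value' lines from a recovery_status file into a dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # skip lines that have no colon
            fields[key.strip()] = value.strip()
    return fields


def stuck_targets(statuses, limit=900):
    """Return sorted target names still WAITING after `limit` seconds."""
    return sorted(
        name
        for name, f in statuses.items()
        if f.get("status") == "WAITING" and int(f.get("time_waited", 0)) >= limit
    )


def read_all(pattern="/proc/fs/lustre/mdt/*/recovery_status"):
    """Read every matching recovery_status file, keyed by target name."""
    statuses = {}
    for path in glob.glob(pattern):
        with open(path) as fh:
            # the target name (e.g. dagg-MDT0001) is the parent directory
            statuses[path.split("/")[-2]] = parse_recovery_status(fh.read())
    return statuses


if __name__ == "__main__":
    for name in stuck_targets(read_all()):
        print(name, "still WAITING")
```

run on the MDS itself; on a host with no mounted MDTs the glob matches nothing and the script prints nothing.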
any ideas?
cheers,
robin