Details
- Type: Bug
- Resolution: Fixed
- Priority: Major
- Affects Version: Lustre 2.0.0
Description
ASSERTION(oti && oti->oti_thread && oti->oti_thread->t_watchdog) failed in quota_chk_acq_common()
-------------------------------------------------------------------------------------------------
The following system crash was reported by a TGCC Bull customer after a controller problem occurred on a RAID array.
On April 2, at 11:00:35, the syslog records a "link down" lasting about 6 minutes, during which a lot of I/O errors and Lustre errors on the OSTs are logged.
Error examples:
end_request: I/O error, dev sdbb, sector 8074493953
Lustre: 22430:0:(filter_io_26.c:747:filter_commitrw_write()) scratch2-OST000b: slow direct_io 30s
LustreError: 16365:0:(obd.h:1581:obd_transno_commit_cb()) scratch2-OST000b: transno 14007590 commit error: 2
Lustre: 26055:0:(filter_io_26.c:684:filter_commitrw_write()) scratch2-OST000b: slow i_mutex 30s
LustreError: 26055:0:(fsfilt-ldiskfs.c:483:fsfilt_ldiskfs_brw_start()) can't get handle for 582 credits: rc = -30
LustreError: 22410:0:(filter_io_26.c:691:filter_commitrw_write()) error starting transaction: rc = -30
lustre-ioerror Check device : /dev/sdbb
When the link came back up, the 15 LUNs were no longer available; the system continued to record the same errors and finally crashed at 11:25:42 (more than 40000 I/O errors). For this first crash, we do not have a dump.
The system then restarted, and the syslog shows that one of the LUNs was not detected (only 14 LUNs available). Lustre recorded the following errors (19 = ENODEV) for the undetected LUN:
LustreError: 13493:0:(obd_mount.c:1343:server_kernel_mount()) premount /dev/sdb:0x0 ldiskfs failed: -19 Is the ldiskfs module available?
LustreError: 13493:0:(obd_mount.c:1665:server_fill_super()) Unable to mount device /dev/sdb: -19
LustreError: 13493:0:(obd_mount.c:2136:lustre_fill_super()) Unable to mount (-19)
Lustre then started recovery on all the OSTs (tgt_recov thread) and logged the following errors several times for the undetected LUN:
LustreError: 137-5: UUID 'scratch2-OST0000_UUID' is not available for connect (no target)
LustreError: 13946:0:(ldlm_lib.c:2123:target_send_reply_msg()) @@@ processing error (19) req@ffff880431930400 x1364628274475046/t0(0) o8><?>@<?>:0/0 lens 368/0 e 0 to 0 dl 1301737426 ref 1 fl Interpret:/ffffffff/ffffffff rc -19/-1
After approximately 1 minute the system crashed on the following assertion in quota_interface.c:451:quota_chk_acq_common():
ASSERTION(oti && oti->oti_thread && oti->oti_thread->t_watchdog)
This crash occurred 6 times, as long as one of the OSTs was not detected. We have 3 dumps available, which all show the same stack:
PID: 19634 TASK: ffff88045728d5a0 CPU: 2 COMMAND: "tgt_recov"
#0 [ffff8804493112d8] machine_kexec at ffffffff8102e67b
#1 [ffff880449311338] crash_kexec at ffffffff810a9af8
#2 [ffff8804493113b8] show_trace at ffffffff810102d5
#3 [ffff880449311408] panic at ffffffff81452147
#4 [ffff8804493114d8] libcfs_assertion_failed at ffffffffa01157d6
#5 [ffff880449311528] quota_chk_acq_common at ffffffffa071aa32
#6 [ffff8804493116a8] filter_commitrw_write at ffffffffa090a488
#7 [ffff880449311898] filter_commitrw at ffffffffa08fd535
#8 [ffff880449311958] obd_commitrw at ffffffffa08b4ffa
#9 [ffff8804493119d8] ost_brw_write at ffffffffa08bd644
#10 [ffff880449311bb8] ost_handle at ffffffffa08c237a
In these 3 dumps, "oti" and "oti->oti_thread" are valid pointers, but "oti->oti_thread->t_watchdog" is a null pointer.
The only place where this pointer is reset is in "ptlrpc_main()", when the service is stopped.
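For reference, the thread-exit path in ptlrpc_main() (lustre/ptlrpc/service.c) is roughly of the following shape; this is a paraphrase from memory, not a verbatim quote of the tree, and it shows why a request still being processed during recovery can observe a NULL watchdog pointer:

    /* Paraphrased sketch of the ptlrpc service thread exit path, not a
     * verbatim copy: once the thread stops, its watchdog is torn down
     * and the pointer is cleared, so any caller still dereferencing
     * oti->oti_thread->t_watchdog afterwards sees NULL. */
    lc_watchdog_delete(thread->t_watchdog);
    thread->t_watchdog = NULL;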
A possible fix could be to remove the LASSERT and modify the "if" statement, as is done 50 lines above, by replacing these lines:
    if (!qctxt->lqc_import && oti) {
            cfs_spin_unlock(&qctxt->lqc_lock);
            LASSERT(oti && oti->oti_thread &&
                    oti->oti_thread->t_watchdog);
with:
    if (!qctxt->lqc_import && oti && oti->oti_thread &&
        oti->oti_thread->t_watchdog) {
            cfs_spin_unlock(&qctxt->lqc_lock);
But I'm wondering if this is enough, or if an additional test must be added to exit the while loop.
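To make that second point concrete, here is a minimal sketch of the combined change, assuming the surrounding while loop of quota_chk_acq_common(); the wait/wake-up code in the first branch is elided because it is unchanged, and the extra exit test is a suggestion, not a tested patch:

    cfs_spin_lock(&qctxt->lqc_lock);
    if (!qctxt->lqc_import && oti) {
            cfs_spin_unlock(&qctxt->lqc_lock);

            if (oti->oti_thread == NULL ||
                oti->oti_thread->t_watchdog == NULL)
                    /* The service thread is stopping (ptlrpc_main() has
                     * cleared t_watchdog): give up and exit the while
                     * loop instead of sleeping for the quota master.
                     * This is the additional exit test discussed above. */
                    break;

            /* ... existing code: disable the watchdog, wait for the
             * quota master to come back, then re-arm the watchdog ... */
    } else {
            cfs_spin_unlock(&qctxt->lqc_lock);
    }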