Details
- Type: Bug
- Resolution: Fixed
- Priority: Major
- Affects Version: Lustre 2.0.0
Description
ASSERTION(oti && oti->oti_thread && oti->oti_thread->t_watchdog) failed in quota_chk_acq_common()
-------------------------------------------------------------------------------------------------
The following system crash was reported by a TGCC Bull customer after a controller problem occurred on a RAID array.
On April 2, at 11:00:35, the syslog records a "link down" lasting about 6 minutes, during which a lot of I/O errors and Lustre errors on the OSTs are logged.
Error examples:
end_request: I/O error, dev sdbb, sector 8074493953
Lustre: 22430:0:(filter_io_26.c:747:filter_commitrw_write()) scratch2-OST000b: slow direct_io 30s
LustreError: 16365:0:(obd.h:1581:obd_transno_commit_cb()) scratch2-OST000b: transno 14007590 commit error: 2
Lustre: 26055:0:(filter_io_26.c:684:filter_commitrw_write()) scratch2-OST000b: slow i_mutex 30s
LustreError: 26055:0:(fsfilt-ldiskfs.c:483:fsfilt_ldiskfs_brw_start()) can't get handle for 582 credits: rc = -30
LustreError: 22410:0:(filter_io_26.c:691:filter_commitrw_write()) error starting transaction: rc = -30
lustre-ioerror Check device : /dev/sdbb
When the link came back up, the 15 LUNs were no longer available; the system continued to record the same errors and finally crashed at 11:25:42 (more than 40000 I/O errors). For this first crash, we do not have a dump.
The system then restarted, and the syslog shows that one of the LUNs was not detected (only 14 LUNs available). Lustre recorded the following errors (19 = ENODEV) for the undetected LUN:
LustreError: 13493:0:(obd_mount.c:1343:server_kernel_mount()) premount /dev/sdb:0x0 ldiskfs failed: -19 Is the ldiskfs module available?
LustreError: 13493:0:(obd_mount.c:1665:server_fill_super()) Unable to mount device /dev/sdb: -19
LustreError: 13493:0:(obd_mount.c:2136:lustre_fill_super()) Unable to mount (-19)
Lustre then started recovery on all the OSTs (tgt_recov thread) and logged the following errors several times for the undetected LUN:
LustreError: 137-5: UUID 'scratch2-OST0000_UUID' is not available for connect (no target)
LustreError: 13946:0:(ldlm_lib.c:2123:target_send_reply_msg()) @@@ processing error (19) req@ffff880431930400 x1364628274475046/t0(0) o8><?>@<?>:0/0 lens 368/0 e 0 to 0 dl 1301737426 ref 1 fl Interpret:/ffffffff/ffffffff rc -19/-1
After approximately 1 minute the system crashed on the following assertion in quota_interface.c:451:quota_chk_acq_common():
ASSERTION(oti && oti->oti_thread && oti->oti_thread->t_watchdog)
This crash occurred 6 times, as long as one of the OSTs was not detected. We have 3 dumps available, which all show the same stack:
PID: 19634 TASK: ffff88045728d5a0 CPU: 2 COMMAND: "tgt_recov"
#0 [ffff8804493112d8] machine_kexec at ffffffff8102e67b
#1 [ffff880449311338] crash_kexec at ffffffff810a9af8
#2 [ffff8804493113b8] show_trace at ffffffff810102d5
#3 [ffff880449311408] panic at ffffffff81452147
#4 [ffff8804493114d8] libcfs_assertion_failed at ffffffffa01157d6
#5 [ffff880449311528] quota_chk_acq_common at ffffffffa071aa32
#6 [ffff8804493116a8] filter_commitrw_write at ffffffffa090a488
#7 [ffff880449311898] filter_commitrw at ffffffffa08fd535
#8 [ffff880449311958] obd_commitrw at ffffffffa08b4ffa
#9 [ffff8804493119d8] ost_brw_write at ffffffffa08bd644
#10 [ffff880449311bb8] ost_handle at ffffffffa08c237a
In these 3 dumps, "oti" and "oti->oti_thread" are valid pointers, but "oti->oti_thread->t_watchdog" is a null pointer.
The only place where this pointer is reset is in "ptlrpc_main()", when the service is stopped.
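For reference, the thread-exit path in ptlrpc_main() (lustre/ptlrpc/service.c) is roughly of the following shape; this is a paraphrase from memory, not a verbatim quote of the tree, and it shows why a request still being processed during recovery can observe a NULL watchdog pointer:

    /* Paraphrased sketch of the ptlrpc service thread exit path, not a
     * verbatim copy: once the thread stops, its watchdog is torn down
     * and the pointer is cleared, so any caller still dereferencing
     * oti->oti_thread->t_watchdog afterwards sees NULL. */
    lc_watchdog_delete(thread->t_watchdog);
    thread->t_watchdog = NULL;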
A possible fix could be to remove the LASSERT and modify the "if" statement, as is done 50 lines above, by replacing these lines:
    if (!qctxt->lqc_import && oti) {
            cfs_spin_unlock(&qctxt->lqc_lock);
            LASSERT(oti && oti->oti_thread &&
                    oti->oti_thread->t_watchdog);
with:
    if (!qctxt->lqc_import && oti && oti->oti_thread &&
        oti->oti_thread->t_watchdog) {
            cfs_spin_unlock(&qctxt->lqc_lock);
But I'm wondering if this is enough, or if an additional test must be added to exit the while loop.
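To make that second point concrete, here is a minimal sketch of the combined change, assuming the surrounding while loop of quota_chk_acq_common(); the wait/wake-up code in the first branch is elided because it is unchanged, and the extra exit test is a suggestion, not a tested patch:

    cfs_spin_lock(&qctxt->lqc_lock);
    if (!qctxt->lqc_import && oti) {
            cfs_spin_unlock(&qctxt->lqc_lock);

            if (oti->oti_thread == NULL ||
                oti->oti_thread->t_watchdog == NULL)
                    /* The service thread is stopping (ptlrpc_main() has
                     * cleared t_watchdog): give up and exit the while
                     * loop instead of sleeping for the quota master.
                     * This is the additional exit test discussed above. */
                    break;

            /* ... existing code: disable the watchdog, wait for the
             * quota master to come back, then re-arm the watchdog ... */
    } else {
            cfs_spin_unlock(&qctxt->lqc_lock);
    }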