Details
- Type: Bug
- Resolution: Duplicate
- Priority: Major
- Affects Version: Lustre 2.0.0
- Fix Version: None
- 3
- Bugzilla ID: 20394
- 10253
Description
While running recovery-mds-scale test with 'FAILURE_MODE=HARD' (power off & on) and 'FLAVOR=OSS' on
Burlington cluster, client loads failed after OSS failed over 3 times:
==== Checking the clients loads AFTER failover -- failure NOT OK
ost1 has failed over 3 times, and counting...
sleeping 543 seconds ...
tar: etc/rc.d/init.d/avahi-dnsconfd: Cannot write: No such file or directory
tar: etc/rc.d/init.d/avahi-dnsconfd: Cannot utime: No such file or directory
tar: Error exit delayed from previous errors
Found the END_RUN_FILE file: /testsuite/yujian/end_run_file
client7
Client load failed on node client7
client client7 load stdout and debug files :
/tmp/recovery-mds-scale.log_run_tar.sh-client7
/tmp/recovery-mds-scale.log_run_tar.sh-client7.debug
2009-08-07 05:25:56 Terminating clients loads ...
Duration: 86400
Server failover period: 600 seconds
Exited after: 1556 seconds
Number of failovers before exit:
ost1 failed over 3 times
Status: FAIL: rc=1
An LBUG occurred after oss1 failed over 2 times:
-------8<-------
Lustre: 10385:0:(ldlm_lib.c:565:target_handle_reconnect()) lustre-OST0000:
lustre-MDT0000-mdtlov_UUID reconnecting
Lustre: 10385:0:(ldlm_lib.c:801:target_handle_connect()) lustre-OST0000: refuse reconnection from
lustre-MDT0000-mdtlov_UUID@192.168.1.4@tcp to 0xffff8102f80a1800/1
LustreError: 10385:0:(ldlm_lib.c:2112:target_send_reply_msg()) @@@ processing error (-16)
req@ffff8103788d1c00 x1310336588446872/t0(0)
o8->lustre-MDT0000-mdtlov_UUID@NET_0x20000c0a80104_UUID:0/0 lens 368/264 e 0 to 0 dl 1249636224 ref
1 fl Interpret:/0/0 rc -16/0
LustreError: 10544:0:(ldlm_lib.c:1494:check_for_next_transno()) lustre-OST0000: waking for gap in
transno, VBR is OFF (skip: 367985, ql: 1, comp: 8, conn: 9, next: 367988, last_committed: 367987)
Lustre: 10295:0:(ldlm_lib.c:565:target_handle_reconnect()) lustre-OST0000:
lustre-MDT0000-mdtlov_UUID reconnecting
Lustre: 10295:0:(sec.c:1414:sptlrpc_import_sec_adapt()) import
lustre-OST0000->NET_0x20000c0a80104_UUID netid 20000: select flavor null
Lustre: 10295:0:(sec.c:1414:sptlrpc_import_sec_adapt()) Skipped 3 previous similar messages
Lustre: 10345:0:(ldlm_lib.c:2009:target_queue_recovery_request()) Next recovery transno: 367989,
current: 367988, replaying: 1
LustreError: 10345:0:(filter.c:151:filter_finish_transno()) LBUG
Pid: 10345, comm: ll_ost_79
Call Trace:
[<ffffffff887a35b1>] libcfs_debug_dumpstack+0x51/0x60 [libcfs]
[<ffffffff887a3aea>] lbug_with_loc+0x7a/0xd0 [libcfs]
[<ffffffff88ce2bf0>] filter_finish_transno+0x1b0/0x4f0 [obdfilter]
[<ffffffff88c2174c>] journal_callback_set+0x30/0x44 [jbd]
[<ffffffff88cf68d0>] filter_cancel_cookies_cb+0x0/0x5b0 [obdfilter]
[<ffffffff88cc158c>] fsfilt_ldiskfs_add_journal_cb+0x2dc/0x310 [fsfilt_ldiskfs]
[<ffffffff88ce660f>] filter_setattr_internal+0x15ef/0x1d70 [obdfilter]
[<ffffffff88cdbf82>] filter_fid2dentry+0x512/0x740 [obdfilter]
[<ffffffff88cddcdc>] __filter_oa2dentry+0x4c/0x250 [obdfilter]
[<ffffffff8000d762>] dput+0x23/0x10a
[<ffffffff88ce7224>] filter_setattr+0x494/0x670 [obdfilter]
[<ffffffff88947e1a>] lustre_pack_reply_v2+0x23a/0x2f0 [ptlrpc]
[<ffffffff88947fb4>] lustre_pack_reply_flags+0xe4/0x1e0 [ptlrpc]
[<ffffffff88ca6aba>] ost_handle+0x2a7a/0x5b3d [ost]
[<ffffffff887ae66d>] libcfs_debug_vmsg2+0x6fd/0x9d0 [libcfs]
[<ffffffff88bdbde7>] vvp_session_key_init+0x147/0x1d0 [lustre]
[<ffffffff88952750>] ptlrpc_server_handle_request+0x950/0xff0 [ptlrpc]
[<ffffffff8008881c>] __wake_up_common+0x3e/0x68
[<ffffffff88957af6>] ptlrpc_main+0x13a6/0x1530 [ptlrpc]
[<ffffffff8008a3f2>] default_wake_function+0x0/0xe
[<ffffffff800b48d6>] audit_syscall_exit+0x327/0x342
[<ffffffff8005dfb1>] child_rip+0xa/0x11
[<ffffffff88956750>] ptlrpc_main+0x0/0x1530 [ptlrpc]
[<ffffffff8005dfa7>] child_rip+0x0/0x11
Jay's analysis: https://bugzilla.lustre.org/show_bug.cgi?id=20394#c5
After discussing with tappro, who is an expert on recovery and kindly taught me a lot about it, I believe this assertion might be over-strict:

    if (last_rcvd <= le64_to_cpu(lcd->lcd_last_transno)) {
            spin_unlock(&filter->fo_translock);
            LBUG();
    }

last_rcvd likely equals lcd->lcd_last_transno here: the setattr transaction might already have been committed, but the server did not get a chance to send the reply to the client, so the client resends the request and it is handled again. In that case, at least the last_rcvd == lcd->lcd_last_transno part of the assertion might be (wrongly) hit.