Lustre / LU-134

recovery-mds-scale (FLAVOR=OSS): (filter.c:151:filter_finish_transno()) LBUG


Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Major
    • Fix Version/s: Lustre 2.1.0
    • Affects Version/s: Lustre 2.0.0
    • Labels: None
    • Severity: 3
    • Bugzilla ID: 20394
    • Rank (Obsolete): 10253

    Description

      While running the recovery-mds-scale test with 'FAILURE_MODE=HARD' (power off & on) and 'FLAVOR=OSS' on
      the Burlington cluster, the client loads failed after the OSS had failed over 3 times:

      ==== Checking the clients loads AFTER failover – failure NOT OK
      ost1 has failed over 3 times, and counting...
      sleeping 543 seconds ...
      tar: etc/rc.d/init.d/avahi-dnsconfd: Cannot write: No such file or directory
      tar: etc/rc.d/init.d/avahi-dnsconfd: Cannot utime: No such file or directory
      tar: Error exit delayed from previous errors
      Found the END_RUN_FILE file: /testsuite/yujian/end_run_file
      client7
      Client load failed on node client7

      client client7 load stdout and debug files :
      /tmp/recovery-mds-scale.log_run_tar.sh-client7
      /tmp/recovery-mds-scale.log_run_tar.sh-client7.debug
      2009-08-07 05:25:56 Terminating clients loads ...
      Duraion: 86400
      Server failover period: 600 seconds
      Exited after: 1556 seconds
      Number of failovers before exit:
      ost1 failed over 3 times
      Status: FAIL: rc=1

      An LBUG occurred after oss1 failed over 2 times:
      -------8<-------
      Lustre: 10385:0:(ldlm_lib.c:565:target_handle_reconnect()) lustre-OST0000:
      lustre-MDT0000-mdtlov_UUID reconnecting
      Lustre: 10385:0:(ldlm_lib.c:801:target_handle_connect()) lustre-OST0000: refuse reconnection from
      lustre-MDT0000-mdtlov_UUID@192.168.1.4@tcp to 0xffff8102f80a1800/1
      LustreError: 10385:0:(ldlm_lib.c:2112:target_send_reply_msg()) @@@ processing error (-16)
      req@ffff8103788d1c00 x1310336588446872/t0(0)
      o8->lustre-MDT0000-mdtlov_UUID@NET_0x20000c0a80104_UUID:0/0 lens 368/264 e 0 to 0 dl 1249636224 ref
      1 fl Interpret:/0/0 rc -16/0
      LustreError: 10544:0:(ldlm_lib.c:1494:check_for_next_transno()) lustre-OST0000: waking for gap in
      transno, VBR is OFF (skip: 367985, ql: 1, comp: 8, conn: 9, next: 367988, last_committed: 367987)
      Lustre: 10295:0:(ldlm_lib.c:565:target_handle_reconnect()) lustre-OST0000:
      lustre-MDT0000-mdtlov_UUID reconnecting
      Lustre: 10295:0:(sec.c:1414:sptlrpc_import_sec_adapt()) import
      lustre-OST0000->NET_0x20000c0a80104_UUID netid 20000: select flavor null
      Lustre: 10295:0:(sec.c:1414:sptlrpc_import_sec_adapt()) Skipped 3 previous similar messages
      Lustre: 10345:0:(ldlm_lib.c:2009:target_queue_recovery_request()) Next recovery transno: 367989,
      current: 367988, replaying: 1
      LustreError: 10345:0:(filter.c:151:filter_finish_transno()) LBUG
      Pid: 10345, comm: ll_ost_79

      Call Trace:
      [<ffffffff887a35b1>] libcfs_debug_dumpstack+0x51/0x60 [libcfs]
      [<ffffffff887a3aea>] lbug_with_loc+0x7a/0xd0 [libcfs]
      [<ffffffff88ce2bf0>] filter_finish_transno+0x1b0/0x4f0 [obdfilter]
      [<ffffffff88c2174c>] journal_callback_set+0x30/0x44 [jbd]
      [<ffffffff88cf68d0>] filter_cancel_cookies_cb+0x0/0x5b0 [obdfilter]
      [<ffffffff88cc158c>] fsfilt_ldiskfs_add_journal_cb+0x2dc/0x310 [fsfilt_ldiskfs]
      [<ffffffff88ce660f>] filter_setattr_internal+0x15ef/0x1d70 [obdfilter]
      [<ffffffff88cdbf82>] filter_fid2dentry+0x512/0x740 [obdfilter]
      [<ffffffff88cddcdc>] __filter_oa2dentry+0x4c/0x250 [obdfilter]
      [<ffffffff8000d762>] dput+0x23/0x10a
      [<ffffffff88ce7224>] filter_setattr+0x494/0x670 [obdfilter]
      [<ffffffff88947e1a>] lustre_pack_reply_v2+0x23a/0x2f0 [ptlrpc]
      [<ffffffff88947fb4>] lustre_pack_reply_flags+0xe4/0x1e0 [ptlrpc]
      [<ffffffff88ca6aba>] ost_handle+0x2a7a/0x5b3d [ost]
      [<ffffffff887ae66d>] libcfs_debug_vmsg2+0x6fd/0x9d0 [libcfs]
      [<ffffffff88bdbde7>] vvp_session_key_init+0x147/0x1d0 [lustre]
      [<ffffffff88952750>] ptlrpc_server_handle_request+0x950/0xff0 [ptlrpc]
      [<ffffffff8008881c>] __wake_up_common+0x3e/0x68
      [<ffffffff88957af6>] ptlrpc_main+0x13a6/0x1530 [ptlrpc]
      [<ffffffff8008a3f2>] default_wake_function+0x0/0xe
      [<ffffffff800b48d6>] audit_syscall_exit+0x327/0x342
      [<ffffffff8005dfb1>] child_rip+0xa/0x11
      [<ffffffff88956750>] ptlrpc_main+0x0/0x1530 [ptlrpc]
      [<ffffffff8005dfa7>] child_rip+0x0/0x11

      Jay's analysis: https://bugzilla.lustre.org/show_bug.cgi?id=20394#c5
      After discussing with tappro, the recovery expert, who kindly taught me a great deal about
      recovery, I believe this assertion might be over-strict:

      if (last_rcvd <= le64_to_cpu(lcd->lcd_last_transno)) {
              spin_unlock(&filter->fo_translock);
              LBUG();
      }

      last_rcvd likely equals lcd->lcd_last_transno here because the setattr transaction may already
      have been committed while the server never got a chance to send the reply to the client; the
      client then resends the request, and it is handled immediately on reconnect. In that case, at
      least the last_rcvd == lcd->lcd_last_transno case of the assertion might be (wrongly) hit.


          People

            Assignee: bobijam Zhenyu Xu
            Reporter: bobijam Zhenyu Xu
