Lustre / LU-5128

ASSERTION( atomic_read(&obd->obd_req_replay_clients) == 0 ) failed

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.7.0, Lustre 2.5.3
    • Affects Version/s: Lustre 2.4.3
    • Environment: Lustre-2.4.3
    • Severity: 3
    • Rank: 14153

    Description

      The MDS failed over, and once MDS recovery finished, many OSSes crashed with the following assertion failure.

      2014-05-30 17:39:07 Lustre: Skipped 3 previous similar messages
      2014-05-30 17:39:07 LustreError: 18967:0:(ldlm_lib.c:1851:target_next_replay_req()) ASSERTION( atomic_read(&obd->obd_req_replay_clients) == 0 ) failed: 
      2014-05-30 17:39:07 LustreError: 18967:0:(ldlm_lib.c:1851:target_next_replay_req()) LBUG
      2014-05-30 17:39:07 Pid: 18967, comm: tgt_recov
      2014-05-30 17:39:07 
      2014-05-30 17:39:07 Call Trace:
      2014-05-30 17:39:07  [<ffffffffa0353895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
      2014-05-30 17:39:07  [<ffffffffa0353e97>] lbug_with_loc+0x47/0xb0 [libcfs]
      2014-05-30 17:39:07  [<ffffffffa066f48c>] target_recovery_thread+0x14ac/0x1970 [ptlrpc]
      2014-05-30 17:39:07  [<ffffffffa066dfe0>] ? target_recovery_thread+0x0/0x1970 [ptlrpc]
      2014-05-30 17:39:07  [<ffffffff8100c0ca>] child_rip+0xa/0x20
      2014-05-30 17:39:07  [<ffffffffa066dfe0>] ? target_recovery_thread+0x0/0x1970 [ptlrpc]
      2014-05-30 17:39:07  [<ffffffffa066dfe0>] ? target_recovery_thread+0x0/0x1970 [ptlrpc]
      2014-05-30 17:39:07  [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
      2014-05-30 17:39:07 
      2014-05-30 17:39:07 Kernel panic - not syncing: LBUG
      2014-05-30 17:39:07 Pid: 18967, comm: tgt_recov Not tainted 2.6.32-358.18.1.el6_lustre.x86_64 #1
      2014-05-30 17:39:07 Call Trace:
      2014-05-30 17:39:07  [<ffffffff8150de58>] ? panic+0xa7/0x16f
      2014-05-30 17:39:07  [<ffffffffa0353eeb>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
      2014-05-30 17:39:07  [<ffffffffa066f48c>] ? target_recovery_thread+0x14ac/0x1970 [ptlrpc]
      2014-05-30 17:39:07  [<ffffffffa066dfe0>] ? target_recovery_thread+0x0/0x1970 [ptlrpc]
      2014-05-30 17:39:07  [<ffffffff8100c0ca>] ? child_rip+0xa/0x20
      2014-05-30 17:39:07  [<ffffffffa066dfe0>] ? target_recovery_thread+0x0/0x1970 [ptlrpc]
      2014-05-30 17:39:07  [<ffffffffa066dfe0>] ? target_recovery_thread+0x0/0x1970 [ptlrpc]
      2014-05-30 17:39:07  [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
      

      LU-1522 and LU-2397 reported a similar problem, but the patches for those tickets have already been merged into b2_4.

        Activity

          pjones Peter Jones added a comment -

          Landed for 2.7

          gnlwlb wu libin (Inactive) added a comment - Here is the patch for b2_5: http://review.whamcloud.com/#/c/11102/

          hongchao.zhang Hongchao Zhang added a comment -

          The patch against master is tracked at http://review.whamcloud.com/#/c/10849/
          hongchao.zhang Hongchao Zhang added a comment - edited

          The issue tracked at http://review.whamcloud.com/#/c/10628/ also exists on b2_5.

          ihara Shuichi Ihara (Inactive) added a comment -

          Does this only happen on the b2_4 branch, or might the same problem occur on b2_5 as well?

          hongchao.zhang Hongchao Zhang added a comment -

          There could be a race between "target_process_req_flags" and "class_export_recovery_cleanup". If the replay request carries the
          "MSG_REQ_REPLAY_DONE" flag, "target_process_req_flags" clears "exp->exp_req_replay_needed" and decrements
          "obd->obd_req_replay_clients" while holding "exp_lock". However, "class_export_recovery_cleanup" checks "exp_req_replay_needed"
          without holding "exp_lock", so it can see the flag still set and decrement "obd_req_replay_clients" a second time, which causes this issue.

          The patch against b2_4 is tracked at http://review.whamcloud.com/#/c/10628/
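
          The double decrement described above can be illustrated with a small user-space model. This is only a sketch under stated
          assumptions: every "toy_" name is hypothetical, the structures are simplified stand-ins rather than the real Lustre ones, and
          the locked helper shows one general way to close such a window, not necessarily what the patches linked in this ticket do.
          The bad interleaving is replayed step by step in a single thread so that the result is deterministic.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Toy stand-ins (hypothetical names), not the real Lustre structures. */
struct toy_obd {
    atomic_int req_replay_clients;  /* models obd_req_replay_clients */
};

struct toy_export {
    atomic_flag lock;               /* models exp_lock as a simple spinlock */
    bool req_replay_needed;         /* models exp_req_replay_needed */
    struct toy_obd *obd;
};

static void toy_lock(struct toy_export *exp)
{
    while (atomic_flag_test_and_set(&exp->lock))
        ;  /* spin until the flag was previously clear */
}

static void toy_unlock(struct toy_export *exp)
{
    atomic_flag_clear(&exp->lock);
}

/* Register one client that still has requests to replay. */
static void toy_export_init(struct toy_export *exp, struct toy_obd *obd)
{
    exp->req_replay_needed = true;
    exp->obd = obd;
    atomic_fetch_add(&obd->req_replay_clients, 1);
}

/* Check and decrement as one step under the lock; a second call is a no-op. */
static void toy_clear_replay_needed(struct toy_export *exp)
{
    toy_lock(exp);
    if (exp->req_replay_needed) {
        exp->req_replay_needed = false;
        atomic_fetch_sub(&exp->obd->req_replay_clients, 1);
    }
    toy_unlock(exp);
}

/* The bad interleaving: the cleanup path samples the flag without the lock. */
static int toy_racy_interleaving(void)
{
    struct toy_obd obd;
    struct toy_export exp = { .lock = ATOMIC_FLAG_INIT };

    atomic_init(&obd.req_replay_clients, 0);
    toy_export_init(&exp, &obd);           /* counter is now 1 */

    bool stale = exp.req_replay_needed;    /* cleanup peeks, sees "needed" */
    toy_clear_replay_needed(&exp);         /* replay-done path: counter 0 */
    if (stale)                             /* cleanup acts on the stale value */
        atomic_fetch_sub(&obd.req_replay_clients, 1);

    return atomic_load(&obd.req_replay_clients);
}

/* Both paths go through the locked helper, so only one decrement happens. */
static int toy_locked_sequence(void)
{
    struct toy_obd obd;
    struct toy_export exp = { .lock = ATOMIC_FLAG_INIT };

    atomic_init(&obd.req_replay_clients, 0);
    toy_export_init(&exp, &obd);

    toy_clear_replay_needed(&exp);         /* replay-done path */
    toy_clear_replay_needed(&exp);         /* cleanup path: flag already clear */

    /* Mirrors the check that failed in target_next_replay_req(). */
    assert(atomic_load(&obd.req_replay_clients) == 0);
    return atomic_load(&obd.req_replay_clients);
}
```

          In this model "toy_racy_interleaving()" leaves the counter at -1, which is the kind of non-zero value that trips the
          assertion, while "toy_locked_sequence()" leaves it at 0.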

          hongchao.zhang Hongchao Zhang added a comment -

          Hi,

          Could you please attach the complete logs for this issue?
          Also, did the OSS fail over along with the MDS?

          Thanks
          pjones Peter Jones added a comment -

          Hongchao

          Could you please advise on this one?

          Thanks

          Peter


          People

            hongchao.zhang Hongchao Zhang
            ihara Shuichi Ihara (Inactive)