
LU-128: Frequent OSS crashes in recovery due to LBUG: ASSERTION(last_rcvd >= le64_to_cpu(lcd->lcd_last_transno)) failed

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version: Lustre 2.1.0
    • Affects Version: Lustre 2.0.0
    • Components: None
    • Environment: x86_64, RHEL6
    • Severity: 3
    • Bugzilla ID: 24420

    Description

      As suggested by Peter Jones, we are opening a JIRA ticket for this issue in order to get the fix landed in 2.1.
      I simply copy the initial description from bugzilla 24420 here:

      We are hitting this bug when we reboot some OSSs. It is raised in the recovery phase and is
      causing a long Lustre service interruption.

      On each crash, the panicking thread's stack trace looked like the following:
      =========================================================================
      #0 [ffff881021fd1238] machine_kexec at ffffffff8102e66b
      #1 [ffff881021fd1298] crash_kexec at ffffffff810a9ae8
      #2 [ffff881021fd1368] panic at ffffffff8145210d
      #3 [ffff881021fd13e8] lbug_with_loc at ffffffffa0454eeb
      #4 [ffff881021fd1438] libcfs_assertion_failed at ffffffffa04607d6
      #5 [ffff881021fd1488] filter_finish_transno at ffffffffa096c825
      #6 [ffff881021fd1548] filter_do_bio at ffffffffa098e390
      #7 [ffff881021fd15e8] filter_commitrw_write at ffffffffa0990a78
      #8 [ffff881021fd17d8] filter_commitrw at ffffffffa09833d5
      #9 [ffff881021fd1898] obd_commitrw at ffffffffa093affa
      #10 [ffff881021fd1918] ost_brw_write at ffffffffa0943644
      #11 [ffff881021fd1af8] ost_handle at ffffffffa094837a
      #12 [ffff881021fd1ca8] ptlrpc_server_handle_request at ffffffffa060eb11
      #13 [ffff881021fd1de8] ptlrpc_main at ffffffffa060feea
      #14 [ffff881021fd1f48] kernel_thread at ffffffff8100d1aa
      =========================================================================

      In an analysis by our on-site support, we got the following values when the LBUG was
      raised in the filter_finish_transno() function:

      lcd_last_transno=0x4ddebb
      oti_transno=last_rcvd=0x4ddeba
      lsd_last_transno=0x4de0ee

      So the client has a bad transaction number (lcd_last_transno): the actual transaction
      number (last_rcvd) is lower than the client's, which, according to the ASSERT, must never happen.
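
      To make the failed check concrete, here is a minimal user-space sketch (my illustration, not the actual Lustre code) of the assertion that fires in filter_finish_transno(), plugged with the values above; le64_to_cpu() is stubbed as a no-op, as it would be on this little-endian x86_64 host:
      =========================================================================
      #include <stdint.h>
      #include <assert.h>

      /* stand-in for the kernel's le64_to_cpu(); a no-op on little-endian */
      #define le64_to_cpu(x) (x)

      int main(void)
      {
              uint64_t lcd_last_transno = 0x4ddebb; /* client's last transno */
              uint64_t last_rcvd        = 0x4ddeba; /* current transaction   */
              uint64_t lsd_last_transno = 0x4de0ee; /* server's last transno */

              (void)lsd_last_transno;
              /* 0x4ddeba >= 0x4ddebb is false, so this aborts, mirroring
               * the LBUG in the ticket title */
              assert(last_rcvd >= le64_to_cpu(lcd_last_transno));
              return 0;
      }
      =========================================================================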

      I can see there is a similar bug (bz23296), but I don't think that bug is related to this one:
      in bz23296 the problem comes from a bad initialization in obdecho/echo_client.c, which is used
      only for tests, not in production as in our case.

      Does this sound like a known bug to you? As a work-around, what would be the
      consequences of disabling this LBUG? I think we would lose some data on a client, but I
      don't know if there is any other important consequence.

      I also attach here the patch from bugzilla 24420 that has already landed in 1.8.6.

      Thanks,
      Sebastien.

      Attachments

        Activity

          niu Niu Yawei (Inactive) added a comment -

          The patch has been installed on our client's cluster, and they started to get MDS crashes
          with LBUG: ASSERTION(req_is_replay(req)) failed.

          The MDS panicking thread's stack trace looks like the following:
          =======================================================
          panic()
          lbug_with_loc()
          libcfs_assertion_failed()
          mdt_txn_stop_cb()
          dt_txn_hook_stop()
          osd_trans_stop()
          mdd_trans_stop()
          mdd_create()
          cml_create()
          mdt_reint_open()
          mdt_reint_rec()
          mdt_reint_internal()
          mdt_intent_reint()
          mdt_intent_policy()
          ldlm_lock_enqueue()
          ldlm_handle_enqueue0()
          mdt_enqueue()
          mdt_handle_common()
          mdt_regular_handle()
          ptlrpc_server_handle_request()
          ptlrpc_main()
          kernel_thread()
          =======================================================

          Looking at the stack trace, the failing ASSERTION(req_is_replay(req)) likely comes from the fix in lustre/mdt/mdt_recovery.c.

          This should be fixed in LU-617.
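
          For reference, req_is_replay() in Lustre of this vintage is essentially a flag test on the incoming request, so the LBUG means a regular, non-replay request reached a path in mdt_txn_stop_cb() that the fix expected only replayed requests to take. A minimal user-space model of the check (my sketch, not the Lustre source):
          =======================================================
          #include <assert.h>

          #define MSG_REPLAY 0x4  /* placeholder value, not Lustre's */

          struct req { unsigned int flags; };

          /* models ASSERTION(req_is_replay(req)) from the stack trace */
          static int req_is_replay(const struct req *req)
          {
                  return req->flags & MSG_REPLAY;
          }

          int main(void)
          {
                  struct req new_req = { .flags = 0 }; /* not a replay */
                  assert(req_is_replay(&new_req));     /* fires, like the LBUG */
                  return 0;
          }
          =======================================================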


          sebastien.buisson Sebastien Buisson (Inactive) added a comment -

          I agree.

          Thank you,
          Sebastien.

          pjones Peter Jones added a comment -

          OK then, I think we can close this ticket and reopen it if it transpires that we need to take any further action before CEA realigns on 2.1.


          sebastien.buisson Sebastien Buisson (Inactive) added a comment -

          Peter,

          As suggested by Mike, we cooked a patch for CEA with the assertions removed, so that it tolerates any error. These error messages are deactivated by default, but can be activated via a kernel module option. That way CEA will be able to collect debug messages when the issue reoccurs.

          Sebastien.
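
          For illustration, the gating Sebastien describes might look like the following kernel-module fragment; the parameter name transno_debug and the message text are hypothetical, not taken from the actual patch:
          =======================================================
          #include <linux/module.h>
          #include <linux/kernel.h>
          #include <linux/types.h>

          /* off by default; enable with transno_debug=1 at module load */
          static int transno_debug;
          module_param(transno_debug, int, 0644);
          MODULE_PARM_DESC(transno_debug,
                           "report transno inconsistencies instead of LBUG");

          /* stands where the assertion used to be: log the mismatch when
           * enabled, and let the caller recover (e.g. by evicting the
           * client) instead of panicking the whole OSS */
          static void report_bad_transno(u64 last_rcvd, u64 client_transno)
          {
                  if (transno_debug)
                          printk(KERN_ERR "transno mismatch: last_rcvd %llu"
                                 " < client %llu\n",
                                 (unsigned long long)last_rcvd,
                                 (unsigned long long)client_transno);
          }
          =======================================================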

          pjones Peter Jones added a comment -

          Mike

          Are you still expecting to be able to create a patch based on 2.0 for CEA?

          CEA,

          Would you deploy such a patch or is the window until you rebase on 2.1 small enough that it would not be worthwhile?

          Peter

          pjones Peter Jones added a comment -

          Ah, thanks for clarifying, Mike. This can certainly remain an important support issue for CEA and a priority for us without being considered a 2.1 blocker. I have adjusted the status accordingly.


          tappro Mikhail Pershin added a comment -

          Peter, that is not quite correct: the patch landed for 2.1, but the problems with it were seen with 2.0.0. I am afraid the differences between 2.1 and 2.0.0 may be the reason for this. So it is not a blocker for 2.1, at least we have seen no issues with it there so far, but we need a patch which works correctly with 2.0.0.

          I'd propose to cook a special patch for Bull with the assertions removed, so that it tolerates any error, and we will then be able to see debug messages if the issue occurs again.

          pjones Peter Jones added a comment -

          Adding as a 2.1 blocker on the advice of Bull, because this patch has landed for 2.1 and caused issues when deployed in production at CEA.


          tappro Mikhail Pershin added a comment -

          The reason can also be something other than the patch: the 2.0.0 code itself, some bug which causes this assertion. The assert can be removed from the patch; in that case there will only be client evictions, plus more debug info about why this is happening.


          pichong Gregoire Pichon added a comment -

          It is the same cluster as in the initial problem (Lustre 2.0.0, x86_64, RHEL6).

          Here is the information from the client:
          the 1st occurrence was during normal operations, and the next ones during restart+recovery.
          There were no messages about connection problems nor client evictions at that time...

          The patch has been removed from the cluster.


          People

            Assignee: niu Niu Yawei (Inactive)
            Reporter: sebastien.buisson Sebastien Buisson (Inactive)
            Votes: 0
            Watchers: 4
