[LU-128] OSSs frequent crashes due to LBUG/[ASSERTION(last_rcvd>=le64_to_cpu(lcd->lcd_last_transno)) failed] in recovery

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.1.0
    • Affects Version/s: Lustre 2.0.0
    • None
    • Environment: x86_64, RHEL6
    • 3
    • Bugzilla ID: 24420
    • 5040

    Description

      As suggested by Peter Jones, we are opening a Jira ticket for this issue in order to get the fix landed in 2.1.
      The initial description from bugzilla 24420 is copied below:

      We hit this bug when we reboot some OSSs. It is raised during the recovery phase and
      causes a long Lustre service interruption.

      On each crash, the stack trace of the panicking thread looked like the following:
      =========================================================================
      #0 [ffff881021fd1238] machine_kexec at ffffffff8102e66b
      #1 [ffff881021fd1298] crash_kexec at ffffffff810a9ae8
      #2 [ffff881021fd1368] panic at ffffffff8145210d
      #3 [ffff881021fd13e8] lbug_with_loc at ffffffffa0454eeb
      #4 [ffff881021fd1438] libcfs_assertion_failed at ffffffffa04607d6
      #5 [ffff881021fd1488] filter_finish_transno at ffffffffa096c825
      #6 [ffff881021fd1548] filter_do_bio at ffffffffa098e390
      #7 [ffff881021fd15e8] filter_commitrw_write at ffffffffa0990a78
      #8 [ffff881021fd17d8] filter_commitrw at ffffffffa09833d5
      #9 [ffff881021fd1898] obd_commitrw at ffffffffa093affa
      #10 [ffff881021fd1918] ost_brw_write at ffffffffa0943644
      #11 [ffff881021fd1af8] ost_handle at ffffffffa094837a
      #12 [ffff881021fd1ca8] ptlrpc_server_handle_request at ffffffffa060eb11
      #13 [ffff881021fd1de8] ptlrpc_main at ffffffffa060feea
      #14 [ffff881021fd1f48] kernel_thread at ffffffff8100d1aa
      =========================================================================

      In one analysis from our on-site support, we got the following values when the LBUG was
      raised in the "filter_finish_transno" function:

      lcd_last_transno=0x4ddebb
      oti_transno=last_rcvd=0x4ddeba
      lsd_last_transno=0x4de0ee

      So the client record (lcd_last_transno) holds a bad transaction number: the actual
      transaction number (last_rcvd) is lower than the client's one, which, according to the ASSERT, is invalid.
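
      To make the failing comparison concrete, here is a minimal standalone sketch in C that plugs the values above into the asserted condition. The helper name `transno_is_consistent` and the macro names are hypothetical, and the code only mirrors the comparison quoted in the assertion text; it is not the Lustre source.

```c
#include <stdbool.h>
#include <stdint.h>

/* Values reported by on-site support when the LBUG fired. */
#define LCD_LAST_TRANSNO 0x4ddebbULL /* lcd_last_transno: value recorded for the client */
#define LAST_RCVD        0x4ddebaULL /* oti_transno = last_rcvd of the request */
#define LSD_LAST_TRANSNO 0x4de0eeULL /* lsd_last_transno as reported */

/*
 * Hypothetical helper mirroring the asserted condition
 *   last_rcvd >= le64_to_cpu(lcd->lcd_last_transno)
 * Both arguments are assumed to be in CPU byte order already.
 */
static bool transno_is_consistent(uint64_t last_rcvd, uint64_t lcd_last_transno)
{
        return last_rcvd >= lcd_last_transno;
}
```

      Calling `transno_is_consistent(LAST_RCVD, LCD_LAST_TRANSNO)` with these values yields false, because 0x4ddeba is one less than 0x4ddebb. That is the condition under which the assertion fails.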

      I could see there is a similar bug (bz23296), but I don't think it is related to this one:
      in bz23296 the problem comes from a bad initialization in obdecho/echo_client.c, which is used
      only for tests, not in production as in our case.

      Does this sound like a known bug to you? To work around it, what would be the
      consequences of disabling this LBUG? I think we would lose some data on a client, but I
      don't know if there is any other important consequence.

      I also attach the patch from bugzilla 24420, which has already landed in 1.8.6.

      Thanks,
      Sebastien.

        Activity

          tappro Mikhail Pershin added a comment:

          The reason may be something else in the 2.0.0 code itself, some bug that causes this assertion. The assert can be removed from the patch; in that case there will only be client evictions, plus more debug info about why this is happening.
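
          The alternative suggested in the comment above, dropping the assert and treating the inconsistency as an error, could look roughly like the following sketch. All names (`struct client_slot`, `check_client_transno`) are hypothetical and the choice of `-EIO` is an assumption; this is neither the patch from bugzilla 24420 nor the change that landed in Lustre.

```c
#include <errno.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for the per-client record kept in last_rcvd. */
struct client_slot {
        uint64_t last_transno; /* last transno stored for this client, CPU byte order */
};

/*
 * Instead of asserting, log the inconsistency and return an error.
 * The caller is assumed to fail the request (and possibly evict the
 * client) rather than panic the whole server.
 * Returns 0 if the slot was updated, -EIO otherwise.
 */
static int check_client_transno(struct client_slot *slot, uint64_t last_rcvd)
{
        if (last_rcvd < slot->last_transno) {
                fprintf(stderr,
                        "transno mismatch: last_rcvd=%#llx < client last_transno=%#llx\n",
                        (unsigned long long)last_rcvd,
                        (unsigned long long)slot->last_transno);
                return -EIO;
        }
        slot->last_transno = last_rcvd;
        return 0;
}
```

          With the values from the description (stored 0x4ddebb, incoming 0x4ddeba) this sketch returns `-EIO` and leaves the stored value untouched, so a single inconsistent client no longer takes down the server.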

          pichong Gregoire Pichon added a comment:

          It is the same cluster as in the initial problem (Lustre 2.0.0, x86_64, RHEL6).

          Here is the information from the client:
          The first occurrence was during normal operations, and the next ones during restart+recovery.
          There were no messages about connection problems or client eviction at that time.

          The patch has been removed from the cluster.

          tappro Mikhail Pershin added a comment:

          Can you provide more information about the setup: what versions are on the clients and servers? When does the bug occur, immediately after start or occasionally during normal work?

          pichong Gregoire Pichon added a comment:

          The patch has been installed on our client's cluster, and they started to get MDS crashes
          with LBUG/ASSERTION(req_is_replay(req)) failed.

          The stack trace of the panicking MDS thread looks like the following:
          =======================================================
          panic()
          lbug_with_loc()
          libcfs_assertion_failed()
          mdt_txn_stop_cb()
          dt_txn_hook_stop()
          osd_trans_stop()
          mdd_trans_stop()
          mdd_create()
          cml_create()
          mdt_reint_open()
          mdt_reint_rec()
          mdt_reint_internal()
          mdt_intent_reint()
          mdt_intent_policy()
          ldlm_lock_enqueue()
          ldlm_handle_enqueue0()
          mdt_enqueue()
          mdt_handle_common()
          mdt_regular_handle()
          ptlrpc_server_handle_request()
          ptlrpc_main()
          kernel_thread()
          =======================================================

          Looking at the stack trace, the failing ASSERTION(req_is_replay(req)) is likely to come from the fix in /lustre/mdt/mdt_recovery.c.

          It appears some scenarios are still not covered by the patch.

          Could you have a look?
          Thanks,

          Grégoire.

          hudson Build Master (Inactive) added a comment:

          Integrated in lustre-master » i686,server,el6,inkernel #117
          LU-128 Avoid assertion on wire data in last_rcvd update

          Oleg Drokin : 2bb3a7f6b9889af696485267eb254db7980fe193
          Files :

          • lustre/obdfilter/filter.c
          • lustre/mdt/mdt_open.c
          • lustre/mdt/mdt_recovery.c

          hudson Build Master (Inactive) added a comment:

          Integrated in lustre-master » i686,client,el6,inkernel #117
          LU-128 Avoid assertion on wire data in last_rcvd update

          Oleg Drokin : 2bb3a7f6b9889af696485267eb254db7980fe193
          Files :

          • lustre/obdfilter/filter.c
          • lustre/mdt/mdt_open.c
          • lustre/mdt/mdt_recovery.c

          hudson Build Master (Inactive) added a comment:

          Integrated in lustre-master » x86_64,server,el6,inkernel #117
          LU-128 Avoid assertion on wire data in last_rcvd update

          Oleg Drokin : 2bb3a7f6b9889af696485267eb254db7980fe193
          Files :

          • lustre/obdfilter/filter.c
          • lustre/mdt/mdt_recovery.c
          • lustre/mdt/mdt_open.c

          hudson Build Master (Inactive) added a comment:

          Integrated in lustre-master » i686,server,el5,inkernel #117
          LU-128 Avoid assertion on wire data in last_rcvd update

          Oleg Drokin : 2bb3a7f6b9889af696485267eb254db7980fe193
          Files :

          • lustre/mdt/mdt_recovery.c
          • lustre/obdfilter/filter.c
          • lustre/mdt/mdt_open.c

          hudson Build Master (Inactive) added a comment:

          Integrated in lustre-master » i686,server,el5,ofa #117
          LU-128 Avoid assertion on wire data in last_rcvd update

          Oleg Drokin : 2bb3a7f6b9889af696485267eb254db7980fe193
          Files :

          • lustre/obdfilter/filter.c
          • lustre/mdt/mdt_recovery.c
          • lustre/mdt/mdt_open.c

          hudson Build Master (Inactive) added a comment:

          Integrated in lustre-master » x86_64,server,el5,ofa #117
          LU-128 Avoid assertion on wire data in last_rcvd update

          Oleg Drokin : 2bb3a7f6b9889af696485267eb254db7980fe193
          Files :

          • lustre/obdfilter/filter.c
          • lustre/mdt/mdt_open.c
          • lustre/mdt/mdt_recovery.c

          hudson Build Master (Inactive) added a comment:

          Integrated in lustre-master » x86_64,client,ubuntu1004,ofa #117
          LU-128 Avoid assertion on wire data in last_rcvd update

          Oleg Drokin : 2bb3a7f6b9889af696485267eb254db7980fe193
          Files :

          • lustre/mdt/mdt_recovery.c
          • lustre/obdfilter/filter.c
          • lustre/mdt/mdt_open.c

          People

            Assignee: niu Niu Yawei (Inactive)
            Reporter: sebastien.buisson Sebastien Buisson (Inactive)