  1. Lustre
  2. LU-128

OSSs frequent crashes due to LBUG/[ASSERTION(last_rcvd>=le64_to_cpu(lcd->lcd_last_transno)) failed] in recovery

Details

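    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.1.0
    • Affects Version/s: Lustre 2.0.0
    • Labels: None
    • Environment: x86_64, RHEL6
    • Severity: 3
    • Bugzilla ID: 24,420
    • Rank: 5040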

    Description

      As suggested by Peter Jones, we are opening a Jira ticket for this issue in order to get the fix
      landed in 2.1. The initial description from bugzilla 24420 is copied below:

      We hit this bug when we reboot some OSSs. It is raised during the recovery phase and it causes
      a long Lustre service interruption.

      On each crash, the panicking thread's stack trace looked like the following:
      =========================================================================
      #0 [ffff881021fd1238] machine_kexec at ffffffff8102e66b
      #1 [ffff881021fd1298] crash_kexec at ffffffff810a9ae8
      #2 [ffff881021fd1368] panic at ffffffff8145210d
      #3 [ffff881021fd13e8] lbug_with_loc at ffffffffa0454eeb
      #4 [ffff881021fd1438] libcfs_assertion_failed at ffffffffa04607d6
      #5 [ffff881021fd1488] filter_finish_transno at ffffffffa096c825
      #6 [ffff881021fd1548] filter_do_bio at ffffffffa098e390
      #7 [ffff881021fd15e8] filter_commitrw_write at ffffffffa0990a78
      #8 [ffff881021fd17d8] filter_commitrw at ffffffffa09833d5
      #9 [ffff881021fd1898] obd_commitrw at ffffffffa093affa
      #10 [ffff881021fd1918] ost_brw_write at ffffffffa0943644
      #11 [ffff881021fd1af8] ost_handle at ffffffffa094837a
      #12 [ffff881021fd1ca8] ptlrpc_server_handle_request at ffffffffa060eb11
      #13 [ffff881021fd1de8] ptlrpc_main at ffffffffa060feea
      #14 [ffff881021fd1f48] kernel_thread at ffffffff8100d1aa
      =========================================================================

      In one analysis by our on-site support, we got the following values when the LBUG is
      raised in the "filter_finish_transno" function:

      lcd_last_transno=0x4ddebb
      oti_transno=last_rcvd=0x4ddeba
      lsd_last_transno=0x4de0ee

      So the transaction number recorded for the client (lcd_last_transno) is higher than the current
      transaction number (last_rcvd), which, according to the ASSERT, is bad.
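
The comparison can be reproduced outside the kernel. The following is only a minimal sketch of the condition quoted in the LBUG message, evaluated with the values reported above. It is not the Lustre code: the struct and function names are hypothetical, and the le64_to_cpu() conversion is left out on the assumption that the value is already in host byte order.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-client data. Only the field named in
 * the report is modeled, and it is assumed to be in host byte order. */
struct client_data_sketch {
    uint64_t lcd_last_transno;
};

/* Returns true when the condition from the LBUG message holds:
 * last_rcvd >= lcd_last_transno. */
bool transno_check_ok(uint64_t last_rcvd, const struct client_data_sketch *lcd)
{
    return last_rcvd >= lcd->lcd_last_transno;
}

/* Mimics the fatal behaviour: abort the program when the check fails. */
void transno_check_or_abort(uint64_t last_rcvd,
                            const struct client_data_sketch *lcd)
{
    assert(transno_check_ok(last_rcvd, lcd));
}

/* Evaluates the check with the values from the report. */
bool reported_case_ok(void)
{
    const struct client_data_sketch lcd = { .lcd_last_transno = 0x4ddebb };
    const uint64_t last_rcvd = 0x4ddeba; /* oti_transno in the report */

    return transno_check_ok(last_rcvd, &lcd);
}
```

With the reported values, reported_case_ok() returns false because last_rcvd is one less than lcd_last_transno, so passing the same values to transno_check_or_abort() would abort.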

      I can see there is a similar bug (bz23296), but I don't think it is related to this one:
      in bz23296 the problem comes from a bad initialization in obdecho/echo_client.c, which is used
      only for tests, not in production as in our case.

      Does this sound like a known bug to you? As a workaround, what would be the
      consequences of disabling this LBUG? I think we would lose some data on a client, but I
      don't know if there is any other important consequence.

      I am also attaching the patch from bugzilla 24420 that has already landed in 1.8.6.

      Thanks,
      Sebastien.

          People

            Assignee: niu Niu Yawei (Inactive)
            Reporter: sebastien.buisson Sebastien Buisson (Inactive)