
[LU-128] Frequent OSS crashes in recovery due to LBUG: ASSERTION(last_rcvd >= le64_to_cpu(lcd->lcd_last_transno)) failed

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.1.0
    • Affects Version/s: Lustre 2.0.0
    • None
    • Environment: x86_64, RHEL6
    • Severity: 3
    • Bugzilla ID: 24420
    • 5040

    Description

      As suggested by Peter Jones, we are opening a Jira ticket for this issue in order to get the
      fix landed in 2.1. I simply copy the initial description from bugzilla 24420 here:

      We are hitting this bug when we reboot some OSSs. It is raised during the recovery phase and
      causes a long Lustre service interruption.

      Each time it crashed, the panicking thread's stack trace looked like the following:
      =========================================================================
      #0 [ffff881021fd1238] machine_kexec at ffffffff8102e66b
      #1 [ffff881021fd1298] crash_kexec at ffffffff810a9ae8
      #2 [ffff881021fd1368] panic at ffffffff8145210d
      #3 [ffff881021fd13e8] lbug_with_loc at ffffffffa0454eeb
      #4 [ffff881021fd1438] libcfs_assertion_failed at ffffffffa04607d6
      #5 [ffff881021fd1488] filter_finish_transno at ffffffffa096c825
      #6 [ffff881021fd1548] filter_do_bio at ffffffffa098e390
      #7 [ffff881021fd15e8] filter_commitrw_write at ffffffffa0990a78
      #8 [ffff881021fd17d8] filter_commitrw at ffffffffa09833d5
      #9 [ffff881021fd1898] obd_commitrw at ffffffffa093affa
      #10 [ffff881021fd1918] ost_brw_write at ffffffffa0943644
      #11 [ffff881021fd1af8] ost_handle at ffffffffa094837a
      #12 [ffff881021fd1ca8] ptlrpc_server_handle_request at ffffffffa060eb11
      #13 [ffff881021fd1de8] ptlrpc_main at ffffffffa060feea
      #14 [ffff881021fd1f48] kernel_thread at ffffffff8100d1aa
      =========================================================================

      In an analysis by our on-site support, we got the following values when the LBUG was raised
      in the filter_finish_transno() function:

      lcd_last_transno=0x4ddebb
      oti_transno=last_rcvd=0x4ddeba
      lsd_last_transno=0x4de0ee

      So the client's record (lcd_last_transno) holds a bad transaction number: the actual
      transaction number (last_rcvd) is lower than the client's, which, according to the ASSERT,
      must not happen.
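
      To make the failure concrete, here is a minimal standalone C sketch (editorial, not part of
      the ticket) that replays the failed comparison with the values above; le64_to_cpu() is
      modelled as a no-op, since the values are shown in native byte order:

      #include <stdint.h>
      #include <stdio.h>

      int main(void)
      {
              uint64_t lcd_last_transno = 0x4ddebb; /* per-client last transno */
              uint64_t last_rcvd        = 0x4ddeba; /* transno of the current request */
              uint64_t lsd_last_transno = 0x4de0ee; /* server-wide last transno */

              /* Equivalent of ASSERTION(last_rcvd >= le64_to_cpu(lcd->lcd_last_transno)) */
              if (last_rcvd < lcd_last_transno)
                      printf("LBUG: last_rcvd %#llx < lcd_last_transno %#llx "
                             "(lsd_last_transno %#llx)\n",
                             (unsigned long long)last_rcvd,
                             (unsigned long long)lcd_last_transno,
                             (unsigned long long)lsd_last_transno);
              return 0;
      }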

      I can see there is a similar bug (bz23296), but I don't think that bug is related to this
      one: in bz23296 the problem comes from a bad initialization in obdecho/echo_client.c, which
      is used only for tests, not in production, which is our case.

      Does this sound like a known bug to you? As a workaround, what would be the consequences of
      disabling this LBUG? I think we would lose some data on a client, but I don't know whether
      there would be any other important consequence.

      I also attach here the patch from bugzilla 24420 that has already landed in 1.8.6.

      Thanks,
      Sebastien.

      Attachments

        Activity

          pjones Peter Jones made changes -
          Resolution New: Fixed [ 1 ]
          Status Original: Open [ 1 ] New: Resolved [ 5 ]
          pjones Peter Jones made changes -
          Priority Original: Blocker [ 1 ] New: Major [ 3 ]
          pjones Peter Jones made changes -
          Priority Original: Major [ 3 ] New: Blocker [ 1 ]
          pjones Peter Jones made changes -
          Priority Original: Blocker [ 1 ] New: Major [ 3 ]
          niu Niu Yawei (Inactive) made changes -
          Comment [ Hi, Tappro

          The check of

          if (transno < obd->obd_next_recovery_transno) {
              /* Processing the queue right now, don't re-add. */
              LASSERT(cfs_list_empty(&req->rq_list));
              cfs_spin_unlock(&obd->obd_recovery_task_lock);
              RETURN(1);
          }

          in target_queue_recovery_request() not only allows open requests to pass, it also allows resent replay requests to be processed immediately, and such resent replay requests can trigger this LBUG.

          There are two kinds of resent replays:

          1) Replay request (transno A) timed out (reply lost): in this case, if A was not committed on the server, the client will reconnect and start replay from transno A; otherwise it reconnects and starts replay from the next transno (greater than A). No matter which transno the re-replay starts from, the timed-out request in the sending list has to be replayed again after all replays and lock replays are done, and at that point transno A is quite possibly less than the current lcd_last_transno, so the LBUG can be triggered.

          2) Replay request (transno A) got an error reply: in this case, the client will reconnect and start replay from transno A; if A was already committed on the server, then we get (last_rcvd == lcd_last_transno) in filter_finish_transno(). The sending list is not resent for this kind of resent replay, so I think it should not trigger the LBUG.

          If there is anything wrong, please correct me.
          ]
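
          As an editorial aside, scenario (1) in the comment above can be made concrete with a minimal standalone C sketch (hypothetical transno values, not from the ticket), showing how the early-processing path quoted above hands a stale resent replay straight to filter_finish_transno(), where the assertion then fails:

          #include <stdint.h>
          #include <stdio.h>

          int main(void)
          {
                  uint64_t next_recovery_transno = 0x4de0ef; /* recovery queue already past here */
                  uint64_t lcd_last_transno      = 0x4ddebb; /* client record after re-replay */
                  uint64_t transno_A             = 0x4ddeba; /* timed-out replay being resent */

                  /* Mirrors the quoted check: a transno below
                   * obd_next_recovery_transno is not re-queued but
                   * processed immediately. */
                  if (transno_A < next_recovery_transno) {
                          printf("resent replay %#llx processed immediately\n",
                                 (unsigned long long)transno_A);
                          /* Later, filter_finish_transno() requires
                           * last_rcvd >= lcd_last_transno, which fails here: */
                          if (transno_A < lcd_last_transno)
                                  printf("LBUG: %#llx < %#llx\n",
                                         (unsigned long long)transno_A,
                                         (unsigned long long)lcd_last_transno);
                  }
                  return 0;
          }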
          pjones Peter Jones made changes -
          Fix Version/s New: Lustre 2.1.0 [ 10021 ]
          pjones Peter Jones made changes -
          Priority Original: Critical [ 2 ] New: Blocker [ 1 ]
          pjones Peter Jones made changes -
          Assignee Original: Robert Read [ rread ] New: Niu Yawei [ niu ]
          sebastien.buisson Sebastien Buisson (Inactive) created issue -

          People

            Assignee: niu Niu Yawei (Inactive)
            Reporter: sebastien.buisson Sebastien Buisson (Inactive)
            Votes: 0
            Watchers: 4

            Dates

              Created:
              Updated:
              Resolved: