Lustre / LU-5266

LBUG on failover: ldlm_process_extent_lock() ASSERTION( lock->l_granted_mode != lock->l_req_mode )

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • Fix Version/s: Lustre 2.6.0, Lustre 2.5.4
    • Affects Version/s: Lustre 2.6.0
    • Environment: Hyperion - 2.5.60 build 2538
    • 3
    • 14694

    Description

      After a hard failover of devices to server iws19, the server wedged, then hit an LBUG.
      Services never complete recovery, and appear to restart the recovery timer:

      Jun 27 10:08:53 iws19 kernel: Lustre: lustre-OST000c: Will be in recovery for at least 2:30, or until 316 clients reconnect
      Jun 27 10:11:53 iws19 kernel: Lustre: lustre-OST000c: recovery is timed out, evict stale exports
      Jun 27 10:18:30 iws19 kernel: Lustre: lustre-OST000c: Client c52d4856-d1df-b87b-911c-f1bfbc23a24d (at 192.168.124.182@o2ib) reconnecting, waiting for 316 clients in recovery for 2:27
      

      The server reports being CPU-bound prior to the failure.
      Console log attached - unfortunately the crash dump after the LBUG failed.

      Attachments

        1. iws19.crash.txt
          24 kB
        2. iws23.dmesg
          29 kB
        3. iws29.lustre-log.1405540543.8118.txt
          0.3 kB
        4. iws29.lustre-log.1405540562.8181.txt
          316 kB
        5. iws29.messages.txt
          37 kB
        6. lustre-log.1404916388.10827.txt
          0.2 kB
        7. lustre-log.1404916402.10701.txt
          145 kB
        8. lustre-log.1404916421.10764.txt
          210 kB

        Issue Links

          Activity

            [LU-5266] LBUG on failover: ldlm_process_extent_lock() ASSERTION( lock->l_granted_mode != lock->l_req_mode )

            hongchao.zhang Hongchao Zhang added a comment - Yes, the issue could be triggered by the resent lock request.
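            The resend scenario above can be sketched as follows. This is a minimal, hypothetical C model of the invariant, not actual Lustre code: the lock struct, mode values, and handler are simplified stand-ins, showing why a resent enqueue that reaches the grant path for an already-granted lock trips the assertion, and how short-circuiting the resend avoids it.

            ```c
            #include <assert.h>
            #include <stdio.h>

            /* Illustrative sketch only -- NOT actual Lustre code. Models the
             * invariant checked in ldlm_process_extent_lock(): a lock entering
             * the grant path must not already be granted, i.e.
             * l_granted_mode != l_req_mode must hold. */

            enum ldlm_mode { LCK_NONE = 0, LCK_PW = 1 };

            struct lock {
                enum ldlm_mode l_req_mode;      /* mode the client requested */
                enum ldlm_mode l_granted_mode;  /* mode granted so far */
            };

            /* Returns 1 if the lock may enter the grant path, 0 if doing so
             * would violate the invariant (the condition that LBUGs). */
            static int may_process(const struct lock *lk)
            {
                return lk->l_granted_mode != lk->l_req_mode;
            }

            /* Hypothetical enqueue handler: a resent request that finds the
             * lock already granted from the first attempt must reply with the
             * existing grant instead of re-running the grant path. */
            static int handle_enqueue(struct lock *lk, int is_resend)
            {
                if (is_resend && lk->l_granted_mode == lk->l_req_mode)
                    return 0;   /* already granted: short-circuit the reply */
                if (!may_process(lk))
                    return -1;  /* would have tripped the ASSERTION */
                lk->l_granted_mode = lk->l_req_mode;  /* grant the lock */
                return 0;
            }

            int main(void)
            {
                struct lock lk = { .l_req_mode = LCK_PW,
                                   .l_granted_mode = LCK_NONE };

                assert(handle_enqueue(&lk, 0) == 0);  /* first enqueue: granted */
                assert(lk.l_granted_mode == LCK_PW);
                assert(may_process(&lk) == 0);        /* reprocessing would LBUG */
                assert(handle_enqueue(&lk, 1) == 0);  /* resend handled safely */
                printf("ok\n");
                return 0;
            }
            ```

            In this model the crash corresponds to a resent enqueue reaching the grant path without the short-circuit check, so may_process() fails on a lock that was already granted before the resend arrived.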

            vitaly_fertman Vitaly Fertman added a comment - Not sure if this failure is the same as the fixed one, but since it is caught by the same assertion, rather than create another ticket I put it here: http://review.whamcloud.com/10903

            jlevi Jodi Levi (Inactive) added a comment - Hongchao, could you please look into this one? Thank you!

            cliffw Cliff White (Inactive) added a comment - Yes, we had multiple failures.

            adilger Andreas Dilger added a comment - Cliff, did you try to reboot the server again and/or try more failovers after this one?
            green Oleg Drokin added a comment - We need the backtrace for the crash, please.

            People

              Assignee: hongchao.zhang Hongchao Zhang
              Reporter: cliffw Cliff White (Inactive)
              Votes: 0
              Watchers: 8

              Dates

                Created:
                Updated:
                Resolved: