LustreError: 9120:0:(ost_handler.c:1673:ost_prolong_lock_one()) LBUG

Details

    • Bug
    • Resolution: Cannot Reproduce
    • Major
    • None
    • Lustre 2.4.0, Lustre 2.1.2
    • Dell R710 servers running TOSS-2.0-2 and DDN 10k storage.
    • 4
    • 5290

    Description

      Last night we had two OSSs panic at virtually the same time with an LBUG error being thrown. We updated our servers and clients from the 2.1.2-3chaos release to 2.1.2-4chaos within the past 2 days and had not experienced this issue with the previous release. Below is a sample of the console log from one of the servers. I have also captured all the console messages up until the system panicked and am attaching them.

      LustreError: 9044:0:(ost_handler.c:1673:ost_prolong_lock_one()) ASSERTION(lock->l_export == opd->opd_exp) failed
      LustreError: 9120:0:(ost_handler.c:1673:ost_prolong_lock_one()) ASSERTION(lock->l_export == opd->opd_exp) failed
      LustreError: 9120:0:(ost_handler.c:1673:ost_prolong_lock_one()) LBUG
      Pid: 9120, comm: ll_ost_io_341

      Call Trace:
      LustreError: 9083:0:(ost_handler.c:1673:ost_prolong_lock_one()) ASSERTION(lock->l_export == opd->opd_exp) failed
      LustreError: 9083:0:(ost_handler.c:1673:ost_prolong_lock_one()) LBUG
      [<ffffffffa0440895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
      Pid: 9083, comm: ll_ost_io_304

          Activity

            [LU-2232] LustreError: 9120:0:(ost_handler.c:1673:ost_prolong_lock_one()) LBUG
            pjones Peter Jones added a comment -

            Thanks Marc!

            marc@llnl.gov D. Marc Stearman (Inactive) added a comment - We have not seen any more occurrences of this error since we rolled out our 2.5.4-4chaos version into production.

            morrone Christopher Morrone (Inactive) added a comment - I have pulled change 14950, Patch Set 1, into LLNL's local tree. It is in the queue to go into the next TOSS release and eventually roll out into production.
            laisiyao Lai Siyao added a comment -

            The updated debug patch will print both lock and request details, which can tell us whether the request is new or a resent/replayed request.

            gerrit Gerrit Updater added a comment -

            Lai Siyao (lai.siyao@intel.com) uploaded a new patch: http://review.whamcloud.com/14950
            Subject: LU-2232 debug: print debug for prolonged lock
            Project: fs/lustre-release
            Branch: b2_5
            Current Patch Set: 1
            Commit: 1d21596de85b81bb3f98277f8f0b425368ccb187

            laisiyao Lai Siyao added a comment -

            The log shows that after the client was evicted (and quite possibly reconnected), a stale lock handle was packed into a client read/write request.

            I added some code to try to simulate this, but failed. According to the code, after eviction the client's in-flight RPCs will be aborted and its locks cleaned up. If a full debug log covering this recovery (at least on the client) can be obtained when this error message is seen, it would help move this forward.
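
For readers following the analysis above, here is a minimal sketch of the condition behind the failed assertion, `lock->l_export == opd->opd_exp`. It is not the Lustre source: the `sketch_*` types and functions and the `conn_generation` field are hypothetical stand-ins, and the sketch assumes that a reconnect after eviction gives the client a different export object than the one its old lock was granted under. It reports the mismatch to the caller instead of triggering an LBUG.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the real Lustre structures; only the
 * fields needed to illustrate the check are modeled. */
struct sketch_export {
    int conn_generation;   /* assumed: differs after each (re)connect */
};

struct sketch_lock {
    const struct sketch_export *l_export;  /* export the lock was granted to */
};

/* Returns true if the lock belongs to the export the request arrived on.
 * The console log shows the server asserting this equality; this sketch
 * returns the result so a caller could skip the lock instead. */
static bool sketch_lock_matches_request(const struct sketch_lock *lock,
                                        const struct sketch_export *req_exp)
{
    assert(lock != NULL && req_exp != NULL);
    return lock->l_export == req_exp;
}

/* Models the suspected scenario: a lock granted under the old export,
 * then a request arriving on the export created after eviction and
 * reconnect while still carrying the old lock handle. */
static bool sketch_stale_handle_detected(void)
{
    struct sketch_export before_eviction = { .conn_generation = 1 };
    struct sketch_export after_reconnect = { .conn_generation = 2 };
    struct sketch_lock lock = { .l_export = &before_eviction };

    return !sketch_lock_matches_request(&lock, &after_reconnect);
}
```

Comparing export pointers mirrors the shape of the assertion in the log; whether the server should skip such a lock, reject the request, or do something else is a design question this sketch does not try to answer.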

            People

              laisiyao Lai Siyao
              jamervi Joe Mervini