[LU-2232] LustreError: 9120:0:(ost_handler.c:1673:ost_prolong_lock_one()) LBUG

Details

    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Major
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.4.0, Lustre 2.1.2
    • Environment: Dell R710 servers running TOSS-2.0-2 and DDN 10k storage
    • Severity: 4
    • 5290

Description

Last night we had two OSSs panic at virtually the same time with an LBUG error being thrown. We just updated our servers and clients from the 2.1.2-3chaos to the 2.1.2-4chaos release within the past 2 days and had not experienced this issue with the previous release. Below is a sample of the console log from one of the servers. I have also captured all the console messages up until the system panicked and am attaching them.

      LustreError: 9044:0:(ost_handler.c:1673:ost_prolong_lock_one()) ASSERTION(lock->l_export == opd->opd_exp) failed
      LustreError: 9120:0:(ost_handler.c:1673:ost_prolong_lock_one()) ASSERTION(lock->l_export == opd->opd_exp) failed
      LustreError: 9120:0:(ost_handler.c:1673:ost_prolong_lock_one()) LBUG
      Pid: 9120, comm: ll_ost_io_341

      Call Trace:
      LustreError: 9083:0:(ost_handler.c:1673:ost_prolong_lock_one()) ASSERTION(lock->l_export == opd->opd_exp) failed
      LustreError: 9083:0:(ost_handler.c:1673:ost_prolong_lock_one()) LBUG
      [<ffffffffa0440895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
      Pid: 9083, comm: ll_ost_io_304
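
For context, the failed assertion means the OSS matched a granted LDLM lock whose owning export (client connection) is not the export the I/O request arrived on; when an LASSERT fails, LBUG() panics the node, which matches both OSSs going down at once. Below is a minimal, self-contained sketch of the shape of that check, using simplified stand-in types rather than the actual ost_handler.c source:

      /* Sketch of the invariant behind the LBUG, with simplified stand-in
       * types. In real Lustre, LASSERT()/LBUG() come from libcfs and panic
       * the node; abort() stands in here so the sketch is self-contained. */
      #include <stdio.h>
      #include <stdlib.h>

      #define LASSERT(cond)                                              \
              do {                                                       \
                      if (!(cond)) {                                     \
                              fprintf(stderr,                            \
                                      "ASSERTION(%s) failed\n", #cond);  \
                              abort();  /* stands in for LBUG() */       \
                      }                                                  \
              } while (0)

      struct obd_export { int conn_id; };  /* one client connection */
      struct ldlm_lock  { struct obd_export *l_export; };

      struct ost_prolong_data {            /* per-request context */
              struct obd_export *opd_exp;  /* export the RPC came in on */
      };

      /* Called for each granted lock covering the I/O extent, to extend
       * the lock's callback timeout while the I/O is in flight. The
       * invariant: a matched lock must belong to the same export the
       * request arrived on. */
      static void ost_prolong_lock_one(struct ost_prolong_data *opd,
                                       struct ldlm_lock *lock)
      {
              LASSERT(lock->l_export == opd->opd_exp);
              /* ... prolong the lock's timeout here ... */
      }

      int main(void)
      {
              struct obd_export old_exp = { 1 };  /* pre-eviction */
              struct obd_export new_exp = { 2 };  /* after reconnect */
              struct ldlm_lock lock = { &old_exp };
              struct ost_prolong_data opd = { &new_exp };

              /* Mismatched exports: prints the ASSERTION failure, then aborts. */
              ost_prolong_lock_one(&opd, &lock);
              return 0;
      }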

Attachments

Issue Links

Activity
            pjones Peter Jones added a comment -

            Thanks Marc!


            marc@llnl.gov D. Marc Stearman (Inactive) added a comment -

            We have not seen any more occurrences of this error since we rolled out our 2.5.4-4chaos version into production.

            morrone Christopher Morrone (Inactive) added a comment -

            I have pulled change 14950, Patch Set 1, into LLNL's local tree. It is in the queue to go into the next TOSS release and eventually roll out into production.
            laisiyao Lai Siyao added a comment -

            The updated debug patch will print both lock and request details, which will tell us whether the request is new or a resent/replayed one.

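            For reference, a server can tell resent and replayed RPCs apart from new ones by flags carried in the request message header. The toy below illustrates that classification; the flag names mirror Lustre's MSG_RESENT/MSG_REPLAY, but the struct and flag values are simplified stand-ins, not the real lustre_msg layout:

                /* Toy classification of an incoming RPC as new, resent, or
                 * replayed, based on message-header flags. Names mirror
                 * Lustre's MSG_RESENT/MSG_REPLAY; values are illustrative. */
                #include <stdio.h>

                #define MSG_RESENT 0x0002  /* resent after a timeout */
                #define MSG_REPLAY 0x0004  /* replayed during recovery */

                struct toy_msg { unsigned int flags; };

                static const char *req_kind(const struct toy_msg *msg)
                {
                        if (msg->flags & MSG_REPLAY)
                                return "replay";
                        if (msg->flags & MSG_RESENT)
                                return "resent";
                        return "new";
                }

                int main(void)
                {
                        struct toy_msg reqs[] = { { 0 }, { MSG_RESENT }, { MSG_REPLAY } };
                        unsigned int i;

                        for (i = 0; i < sizeof(reqs) / sizeof(reqs[0]); i++)
                                printf("request %u: %s\n", i, req_kind(&reqs[i]));
                        return 0;
                }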

            gerrit Gerrit Updater added a comment -

            Lai Siyao (lai.siyao@intel.com) uploaded a new patch: http://review.whamcloud.com/14950
            Subject: LU-2232 debug: print debug for prolonged lock
            Project: fs/lustre-release
            Branch: b2_5
            Current Patch Set: 1
            Commit: 1d21596de85b81bb3f98277f8f0b425368ccb187
            laisiyao Lai Siyao added a comment -

            The log shows that after client eviction (and quite possibly reconnection), a stale lock handle is packed into the client's read/write request.

            I added some code to try to simulate this, but failed. According to the code, after eviction a client's in-flight RPCs are aborted and its locks are cleaned up. If a full debug log covering this recovery (at least on the client) can be captured when this error message is seen, it would help move this forward.

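            To illustrate the scenario, the debug patch's approach amounts to replacing the fatal assertion with a log-and-skip path when a matched lock belongs to a different export, e.g. a stale handle left over from an evicted connection. Here is a self-contained sketch with simplified stand-in types (not the actual patch); compare the "this lock is obsolete!" message in the log below:

                /* Sketch: instead of LBUG()ing on a lock owned by a different
                 * export, log it and skip the lock. Types and names are
                 * simplified stand-ins, not the actual Lustre source. */
                #include <stdio.h>

                struct obd_export { int conn_id; };
                struct ldlm_lock  { struct obd_export *l_export; };
                struct ost_prolong_data { struct obd_export *opd_exp; };

                static int ost_prolong_lock_one(struct ost_prolong_data *opd,
                                                struct ldlm_lock *lock)
                {
                        if (lock->l_export != opd->opd_exp) {
                                /* Mirrors the "this lock is obsolete!" line. */
                                fprintf(stderr,
                                        "LU-2232: lock export %p != req export %p, "
                                        "this lock is obsolete!\n",
                                        (void *)lock->l_export,
                                        (void *)opd->opd_exp);
                                return -1;  /* skip: don't prolong a stale lock */
                        }
                        /* ... prolong the lock's callback timeout ... */
                        return 0;
                }

                int main(void)
                {
                        struct obd_export evicted = { 1 }, current = { 2 };
                        struct ldlm_lock stale = { &evicted };
                        struct ost_prolong_data opd = { &current };

                        ost_prolong_lock_one(&opd, &stale);  /* logs and skips */
                        return 0;
                }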

            morrone Christopher Morrone (Inactive) added a comment -

            The debug message hit on an MDS. Here it is:

            2015-04-24 09:00:42 Lustre: lsd-OST0000: haven't heard from client 6048f9e4-4557-a727-5b7d-dd9f7cb9c8b1 (at 192.168.116.115@o2ib5) in 227 seconds. I think it's dead, and I am evicting it. exp ffff8805cbcd5000, cur 1429891242 expire 1429891092 last 1429891015
            2015-04-24 09:14:05 LustreError: 0:0:(ldlm_lockd.c:355:waiting_locks_callback()) ### lock callback timer expired after 100s: evicting client at 172.19.1.85@o2ib100  ns: filter-lsd-OST0000_UUID lock: ffff880bf976a880/0x7e1b5086f6ebf35e lrc: 3/0,0 mode: PR/PR res: [0x2678a59:0x0:0x0].0 rrc: 266 type: EXT [0->18446744073709551615] (req 468705280->468717567) flags: 0x60000000000020 nid: 172.19.1.85@o2ib100 remote: 0xa9a53ce40d684d49 expref: 54562 pid: 6158 timeout: 5072055798 lvb_type: 0
            2015-04-24 09:14:05 LustreError: 22433:0:(ldlm_lockd.c:2342:ldlm_cancel_handler()) ldlm_cancel from 172.19.1.85@o2ib100 arrived at 1429892045 with bad export cookie 9086945213692644542
            2015-04-24 09:14:05 LustreError: 6684:0:(ldlm_lock.c:2448:ldlm_lock_dump_handle()) ### ### ns: filter-lsd-OST0000_UUID lock: ffff880d7abe3680/0x7e1b5086f3c622c3 lrc: 3/0,0 mode: PR/PR res: [0x24bf940:0x0:0x0].0 rrc: 3 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x40000000000000 nid: 172.19.1.85@o2ib100 remote: 0xa9a53ce40617fec7 expref: 54547 pid: 6183 timeout: 0 lvb_type: 1
            2015-04-24 09:14:05 LustreError: 22433:0:(ldlm_lockd.c:2342:ldlm_cancel_handler()) Skipped 1 previous similar message
            2015-04-24 09:18:04 LustreError: 6784:0:(ost_handler.c:1956:ost_prolong_locks()) ### LU-2232: lock export ffff880e2d590400 != req export ffff880d8a2eb400, this lock is obsolete!
            2015-04-24 09:18:04  ns: filter-lsd-OST0000_UUID lock: ffff880bf976a880/0x7e1b5086f6ebf35e lrc: 4/0,0 mode: PR/PR res: [0x2678a59:0x0:0x0].0 rrc: 135 type: EXT [0->18446744073709551615] (req 468705280->468717567) flags: 0x60000000000020 nid: 172.19.1.85@o2ib100 remote: 0xa9a53ce40d684d49 expref: 35636 pid: 6158 timeout: 5072055798 lvb_type: 0
            

            No assertion, though. Lustre 2.5.3-5chaos.


            paf Patrick Farrell (Inactive) added a comment -

            Lai - Thanks for the answer. We'll go with that and I'll let you know if Cray sees any issues. (We're just moving into internal deployment of 2.6.)
            laisiyao Lai Siyao added a comment -

            Per Oleg's comment, this may be a duplicate of LU-5116. I tried to verify that by adding some code and testing, but have not had enough time to finish. I will keep an eye on it; for the moment we can mark this resolved.


            paf Patrick Farrell (Inactive) added a comment -

            Lai - Given that the code in this area was changed a fair bit in 2.6, do you think this bug is likely to still be present?

People

    Assignee:
    laisiyao Lai Siyao
    Reporter:
    jamervi Joe Mervini
    Votes:
    0
    Watchers:
    18
