LU-2232: LustreError: 9120:0:(ost_handler.c:1673:ost_prolong_lock_one()) LBUG

Details

    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Major
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.4.0, Lustre 2.1.2
    • Environment: Dell R710 servers running TOSS-2.0-2 and DDN 10k storage.
    • 4
    • 5290

    Description

      Last night we had two OSSs panic at virtually the same time with an LBUG error being thrown. We updated our servers and clients from the 2.1.2-3chaos release to 2.1.2-4chaos within the past 2 days and had not experienced this issue with the previous release. Below is a sample of the console log from one of the servers. I have also captured all the console messages up until the system panicked and am attaching them.

      LustreError: 9044:0:(ost_handler.c:1673:ost_prolong_lock_one()) ASSERTION(lock->l_export == opd->opd_exp) failed
      LustreError: 9120:0:(ost_handler.c:1673:ost_prolong_lock_one()) ASSERTION(lock->l_export == opd->opd_exp) failed
      LustreError: 9120:0:(ost_handler.c:1673:ost_prolong_lock_one()) LBUG
      Pid: 9120, comm: ll_ost_io_341

      Call Trace:
      LustreError: 9083:0:(ost_handler.c:1673:ost_prolong_lock_one()) ASSERTION(lock->l_export == opd->opd_exp) failed
      LustreError: 9083:0:(ost_handler.c:1673:ost_prolong_lock_one()) LBUG
      [<ffffffffa0440895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
      Pid: 9083, comm: ll_ost_io_304
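
      For reference, the failing check has roughly the shape sketched below. This is only a minimal sketch based on the assertion text in the log above, with simplified struct definitions standing in for the real ones; it is not the actual ost_handler.c source.

      /*
       * Minimal sketch of the failing check, based only on the assertion text
       * in the log above.  Struct layouts are simplified assumptions, not the
       * real Lustre definitions.
       */
      #ifndef LASSERT
      #include <assert.h>
      #define LASSERT(cond) assert(cond)      /* stand-in so the sketch builds standalone */
      #endif

      struct obd_export;                      /* per-client export state on the server */

      struct ldlm_lock {
              struct obd_export *l_export;    /* export of the client that owns the lock */
      };

      struct ost_prolong_data {
              struct obd_export *opd_exp;     /* export the current IO request arrived on */
      };

      static void ost_prolong_lock_one(struct ldlm_lock *lock,
                                       struct ost_prolong_data *opd)
      {
              /* The LBUG fires here: the lock looked up from the request's lock
               * handle belongs to a different export than the request itself. */
              LASSERT(lock->l_export == opd->opd_exp);
              /* ... extend the lock's callback timeout ... */
      }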

    Activity

            pjones Peter Jones added a comment -

            thanks Patrick. Wow. That is odd.


            paf Patrick Farrell (Inactive) added a comment -

            Peter,

            That's correct. It was on b2_5.

            Patrick
            pjones Peter Jones added a comment -

            Patrick

            To be clear, Cray could reliably reproduce this issue, then applied the diagnostic patch and could not? What code line was this on?

            Thanks

            Peter


            paf Patrick Farrell (Inactive) added a comment -

            Lai - Sorry we didn't update this earlier. Cray tried to reproduce this bug with your debug patch, and were unable to do so. After that, we pulled patch set 1 of your change into our Lustre version and haven't seen this bug since.
            pjones Peter Jones added a comment -

            It would be a sensible approach to try both the fix from LU-5116 and the diagnostic patch from LU-2232. That way, the issue may well be fixed by LU-5116, but if there is still a residual problem we will have fuller information to go forward on.

            green Oleg Drokin added a comment -

            I think LU-5116 might be related here, to explain the resend across eviction.

            laisiyao Lai Siyao added a comment -

            I did some tests but couldn't reproduce this failure, so I updated the patch for 2.4/2.5 to a debug patch that will print the lock and request export addresses and crash as before, so that we can dump these two exports to help the analysis.

            Could you apply the patch and try to reproduce it? If it crashes, please upload the crash dump.
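
            The diagnostic described above might look roughly like the following. This is a hypothetical sketch building on the simplified struct sketch in the Description, not the actual patch on review.whamcloud.com; the "_debug" function name is only to distinguish the sketch, and stand-in definitions of CERROR() and LBUG() are included so it reads standalone.

            /*
             * Hypothetical sketch of the debug change described above: print both
             * export pointers so they can be located in the crash dump, then
             * crash as before.  Builds on the simplified struct sketch in the
             * Description; not the actual review.whamcloud.com patch.
             */
            #ifndef CERROR
            #include <stdio.h>
            #include <stdlib.h>
            #define CERROR(fmt, ...) fprintf(stderr, fmt, ##__VA_ARGS__)    /* stand-in */
            #define LBUG()           abort()                                /* stand-in */
            #endif

            static void ost_prolong_lock_one_debug(struct ldlm_lock *lock,
                                                   struct ost_prolong_data *opd)
            {
                    if (lock->l_export != opd->opd_exp) {
                            CERROR("lock export %p != request export %p\n",
                                   lock->l_export, opd->opd_exp);
                            LBUG();     /* still crash, so both exports can be dumped */
                    }
                    /* ... extend the lock's callback timeout ... */
            }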

            laisiyao Lai Siyao added a comment -

            I'm afraid some requests were not aborted upon eviction on the client, so they were resent through the new connection. I'll run some tests to find more proof.

            green Oleg Drokin added a comment -

            Question - if the client was evicted, the resend of the request should have failed because the server would reject it.
            So I think there is something else happening that your explanation does not quite cover.

            laisiyao Lai Siyao added a comment -

            Patches are ready:
            2.4: http://review.whamcloud.com/#/c/9925/
            2.5: http://review.whamcloud.com/#/c/9926/
            master: http://review.whamcloud.com/#/c/9927/
            laisiyao Lai Siyao added a comment -

            The backtrace shows the LBUG is hit in the first ost_prolong_lock_one() call in ost_prolong_locks(). IMO what happened is this:
            1. The client did IO with the lock handle in the request.
            2. The bulk IO failed on the server, so no reply was sent to the client to let it resend.
            3. Lock cancellation timed out on the server, and the client was evicted.
            4. The client reconnected and resent the previous IO request; however, the lock handle in it was obsolete, so the LASSERT was triggered. (This lock should have been replayed, but the request was simply resent, and there is no way to update the lock handle in a resent request.)

            I'll provide a patch to check lock->l_export against opd->opd_exp rather than assert on it for the first ost_prolong_lock_one().
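
            A minimal sketch of the check-instead-of-assert approach described here, again building on the simplified struct sketch in the Description (the actual changes are the Gerrit patches linked in the comment above); the function name and the -ESTALE return value are illustrative assumptions.

            /*
             * Sketch of checking rather than asserting: a resent request that
             * crossed an eviction/reconnect can carry a stale lock handle, so a
             * lock owned by a different export is skipped instead of LBUGing.
             * Simplified, assumed names; not the actual Gerrit patches.
             */
            #include <errno.h>

            static int ost_prolong_lock_one_checked(struct ldlm_lock *lock,
                                                    struct ost_prolong_data *opd)
            {
                    if (lock->l_export != opd->opd_exp)
                            return -ESTALE; /* stale handle: nothing to prolong here */

                    /* ... extend the lock's callback timeout ... */
                    return 0;
            }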


            People

              Assignee: Lai Siyao (laisiyao)
              Reporter: Joe Mervini (jamervi)
              Votes: 0
              Watchers: 18
