
osp_sync_interpret() ASSERTION( rc || req->rq_transno ) failed

Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Critical
    • None
    • Affects Version/s: Lustre 2.6.0, Lustre 2.4.2, Lustre 2.5.3
    • Environment: Lustre 2.4.2-14chaos (see github.com/chaos/lustre)
    • Severity: 3
    • 15744

    Description

      One of our MDS nodes crashed today with the following assertion:

      client.c:304:ptlrpc_at_adj_net_latency()) Reported service time 548 > total measured time 165
      osp_sync.c:355:osp_sync_interpret())  ASSERTION( rc || req->rq_transno ) failed
      

      Note that the two messages above were printed in the same second (as reported by syslog) and by the same kernel thread. I don't know if the ptlrpc_at_adj_net_latency() message is actually related to the assertion or not, but the proximity makes it worth noting.
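
      For reference, the failing LASSERT sits in the OSP sync RPC interpret callback. A rough sketch of that site follows (paraphrased from the 2.4.x osp_sync.c, not the exact 2.4.2-14chaos source):

      /* Completion callback for an OSP sync RPC sent to an OST.
       * Sketch only: the signature and surrounding code are paraphrased. */
      static int osp_sync_interpret(const struct lu_env *env,
                                    struct ptlrpc_request *req,
                                    void *aux, int rc)
      {
              /* A reply that reports success (rc == 0) is expected to carry
               * a transaction number; hitting this assertion means a reply
               * arrived with rc == 0 and rq_transno == 0. */
              LASSERT(rc || req->rq_transno);
              ...
      }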

      There were a few OSTs to which the MDS had lost and reestablished a connection a couple of minutes earlier in the log.

      The backtrace was:

      panic
      lbug_with_loc
      osp_sync_interpret
      ptlrpc_check_set
      ptlrpcd_check
      ptlrpcd
      kernel_thread
      

      It was running Lustre version 2.4.2-14chaos (see github.com/chaos/lustre).

      We cannot provide logs or crash dumps for this machine.
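
      As the backtrace shows, the assertion fires in a ptlrpcd thread while it processes completed requests: ptlrpc_check_set() hands each finished request to its rq_interpret_reply callback, which for OSP sync RPCs is osp_sync_interpret(). Roughly (a paraphrase of the dispatch helper in lustre/ptlrpc/client.c, not an exact copy of the 2.4.x source):

      static int ptlrpc_req_interpret(const struct lu_env *env,
                                      struct ptlrpc_request *req, int rc)
      {
              if (req->rq_interpret_reply != NULL) {
                      /* e.g. osp_sync_interpret() for OSP sync requests */
                      req->rq_status = req->rq_interpret_reply(env, req,
                                                               &req->rq_async_args,
                                                               rc);
                      return req->rq_status;
              }
              return rc;
      }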

      Attachments

        1. lbugmay2.zip
          53.38 MB
        2. LU-5629-syslog.bz2
          174 kB

        Issue Links

          Activity


            ruth.klundt@gmail.com Ruth Klundt (Inactive) added a comment -

            Attached server syslogs. The stack dumps were unfortunately not captured, but the location for that collection is mounted now in case it happens again.

            There had been network issues earlier in the day, reportedly resolved by 4pm.

            FYI, the number of clients on the fs is currently 6395. The exact version of the software is:
            lustre: 2.5.5
            kernel: patchless_client
            build: -6chaos-CHANGED-2.6.32-573.26.1.1chaos.ch5.4.x86_64

            bzzz Alex Zhuravlev added a comment -

            any logs/dumps?

            ruth.klundt@gmail.com Ruth Klundt (Inactive) added a comment -

            Add one more at SNL last night, also running 2.5.5.
            weems2 Lance Weems added a comment -

            Wanted to report we hit this over the weekend on our 2.5.5 production file system here at LLNL.


            simmonsja James A Simmons added a comment -

            Looks like we just hit this on our 2.5.3+ production file system.

            morrone Christopher Morrone (Inactive) added a comment - edited

            I am reopening this ticket because it does not appear that the issue was resolved as previously believed. We are still seeing the same assertion with Lustre 2.5.3, which contains the patch from LU-3892 (commit 7f4a635, which landed well before 2.5.0).
            pjones Peter Jones added a comment -

            Duplicate of LU-3892.

            dmiter Dmitry Eremin (Inactive) added a comment -

            Commit e12b89a9e7d8409c2b624162760c2e7e3481d7be with the fix landed in 2.4.2.
            pjones Peter Jones added a comment -

            Dmitry is looking into this one


            liang Liang Zhen (Inactive) added a comment -

            FYI, I think a possible cause of the ptlrpc_at_adj_net_latency() warning is that an early reply was lost, so the RPC expired and the client (here, the OSP on the MDS) resent the request. Because the reply to the original request (on the server) still uses the same rq_xid as the reply match bits, it can land in the reposted reply buffer.

            If this happened, the service time returned by the original reply can be longer than the execution time of the resent RPC. I'm not sure whether this is relevant to the assertion, but we should at least demote this warning to a debug message.
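
            For illustration, demoting that message would look roughly like the following in ptlrpc_at_adj_net_latency() (a sketch against the 2.4.x/2.5.x lustre/ptlrpc/client.c; the exact condition and surrounding code are paraphrased, and this is not an actual patch from this ticket):

            /* Sketch: a stale reply to a resent request can legitimately
             * report a service time longer than the locally measured round
             * trip, so print at debug level rather than as a console warning. */
            if (service_time > now - req->rq_sent)
                    CDEBUG(D_ADAPTTO,
                           "Reported service time %u > total measured time "
                           CFS_DURATION_T"\n", service_time,
                           cfs_time_sub(now, req->rq_sent));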

            morrone Christopher Morrone (Inactive) added a comment -

            In LU-5193 Cray reports hitting the same assertion under Lustre 2.6.

            People

              Assignee: dmiter Dmitry Eremin (Inactive)
              Reporter: morrone Christopher Morrone (Inactive)
              Votes: 0
              Watchers: 15

              Dates

                Created:
                Updated:
                Resolved: