
osp_sync_interpret() ASSERTION( rc || req->rq_transno ) failed

Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Critical
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.6.0, Lustre 2.4.2, Lustre 2.5.3
    • Environment: Lustre 2.4.2-14chaos (see github.com/chaos/lustre)
    • Severity: 3

    Description

      One of our MDS nodes crashed today with the following assertion:

      client.c:304:ptlrpc_at_adj_net_latency()) Reported service time 548 > total measured time 165
      osp_sync.c:355:osp_sync_interpret())  ASSERTION( rc || req->rq_transno ) failed
      

      Note that the two messages above were printed in the same second (as reported by syslog) and by the same kernel thread. I don't know whether the ptlrpc_at_adj_net_latency() message is actually related to the assertion, but the proximity makes it worth noting.
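      Not the actual Lustre source, but a minimal, self-contained model of the invariant that fires may help illustrate the crash. The struct and values below are hypothetical; only the rc/transno relationship mirrors the LASSERT(rc || req->rq_transno) at osp_sync.c:355.

      #include <assert.h>
      #include <stdio.h>

      /* Hypothetical, simplified model of the reply state checked in
       * osp_sync_interpret(): when a sync RPC to an OST completes, either
       * it failed (rc != 0) or the OST executed it and assigned it a
       * transaction number (transno != 0).  A reply with rc == 0 and
       * transno == 0 is the "impossible" state that trips the assertion. */
      struct reply {
              int rc;                /* completion status of the RPC */
              unsigned long transno; /* transno assigned by the OST; 0 if none */
      };

      static void interpret(const struct reply *r)
      {
              /* Models LASSERT(rc || req->rq_transno) from osp_sync.c. */
              assert(r->rc || r->transno);
              printf("rc=%d transno=%lu: ok\n", r->rc, r->transno);
      }

      int main(void)
      {
              struct reply good  = { .rc = 0,  .transno = 42 }; /* normal commit */
              struct reply error = { .rc = -5, .transno = 0  }; /* failed RPC */
              struct reply crash = { .rc = 0,  .transno = 0  }; /* asserts */

              interpret(&good);
              interpret(&error);
              interpret(&crash); /* aborts here, analogous to the LBUG on the MDS */
              return 0;
      }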

      The log also shows a few OSTs to which the MDS lost and reestablished connections a couple of minutes earlier.

      The backtrace was:

      panic
      lbug_with_loc
      osp_sync_interpret
      ptlrpc_check_set
      ptlrpcd_check
      ptlrpcd
      kernel_thread
      

      The node was running Lustre version 2.4.2-14chaos (see github.com/chaos/lustre).

      We cannot provide logs or crash dumps for this machine.

      Attachments

        1. lbugmay2.zip
          53.38 MB
        2. LU-5629-syslog.bz2
          174 kB

        Issue Links

          • duplicates LU-9135

          Activity

            pjones Peter Jones added a comment -

            Closing as a duplicate of LU-9135


            dmiter Dmitry Eremin (Inactive) added a comment -

            Probably the patch https://review.whamcloud.com/30129/ should resolve this.

            skirvan Scott Kirvan (Inactive) added a comment -

            Exact same issue @ LANL, Lustre 2.5.5.
            charr Cameron Harr added a comment -

            Saw the same crash on 8/03/16.
            lustre-2.5.5-6chaos_2.6.32_573.26.1.1chaos.ch5.4.x86_64.x86_64

            We have a 16GB dump available if need be.

            Not sure how related it is, but an OSS node suffered major hardware problems (MCEs) throughout the 30 minutes before the LBUG on the MDS. The MDS console log messages directly (~2 min) before the assertion were an evict/reconnect for an OST on that node.


            ruth.klundt@gmail.com Ruth Klundt (Inactive) added a comment -

            Attached server syslogs. Unfortunately the stack dumps were not captured, but the location for that collection is mounted now in case it happens again.

            There had been network issues earlier in the day, reportedly resolved by 4pm.

            FYI, the number of clients on the filesystem is currently 6395. The exact version of the software is:
            lustre: 2.5.5
            kernel: patchless_client
            build: -6chaos-CHANGED-2.6.32-573.26.1.1chaos.ch5.4.x86_64

            bzzz Alex Zhuravlev added a comment -

            Any logs/dumps?

            ruth.klundt@gmail.com Ruth Klundt (Inactive) added a comment -

            Adding one more at SNL last night, also running 2.5.5.
            weems2 Lance Weems added a comment -

            Wanted to report we hit this over the weekend on our 2.5.5 production file system here at LLNL.


            simmonsja James A Simmons added a comment -

            Looks like we just hit this on our 2.5.3+ production file system.

            morrone Christopher Morrone (Inactive) added a comment - edited

            I am reopening this ticket because it does not appear that the issue was resolved as previously believed. We are still seeing the same assertion with Lustre 2.5.3, which contains the patch from LU-3892 (commit 7f4a635, which landed well before 2.5.0).

            People

              Assignee: dmiter Dmitry Eremin (Inactive)
              Reporter: morrone Christopher Morrone (Inactive)
              Votes: 0
              Watchers: 15
