Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-6819

LBUG ASSERTION( tdtd->tdtd_last_update_transno <= transno ) failed

Details

    • Bug
    • Resolution: Fixed
    • Blocker
    • Lustre 2.8.0
    • Lustre 2.8.0
    • None
    • Lola - 2.7.56 tag
    • 3
    • 9223372036854775807

    Description

      After several MDT failovers (ldiskfs format) the MDT is failing to mount. Mount produces this LBUG:

      Jul  8 15:55:53 lola-11 kernel: LustreError: 38479:0:(ldlm_lib.c:2313:replay_request_or_update()) ASSERTION( tdtd->tdtd_last_update_transno <= transno ) failed:
      Jul  8 15:55:53 lola-11 kernel: LustreError: 38479:0:(ldlm_lib.c:2313:replay_request_or_update()) LBUG
      

      Unfortunately panic_on_lbug was set. Will attempt to reproduce and get stack
      So far we have

      Call Trace:^M
       [<ffffffffa0741875>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]^M
       [<ffffffffa0741e77>] lbug_with_loc+0x47/0xb0 [libcfs]^M
       [<ffffffffa0a909e7>] target_recovery_thread+0x2147/0x2720 [ptlrpc]^M
       [<ffffffff8152a2be>] ? thread_return+0x4e/0x7d0^M
       [<ffffffffa0a8e8a0>] ? target_recovery_thread+0x0/0x2720 [ptlrpc]^M
       [<ffffffff8109e78e>] kthread+0x9e/0xc0^M
       [<ffffffff8100c28a>] child_rip+0xa/0x20^M
       [<ffffffff8109e6f0>] ? kthread+0x0/0xc0^M
       [<ffffffff8100c280>] ? child_rip+0x0/0x20^M
      

      Attachments

        Issue Links

          Activity

            [LU-6819] LBUG ASSERTION( tdtd->tdtd_last_update_transno <= transno ) failed

            Landed for 2.8.

            jgmitter Joseph Gmitter (Inactive) added a comment - Landed for 2.8.

            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/15594/
            Subject: LU-6819 update: update transno of replay properly
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: af13cfff297d4882de35fb7c11bf5261293b8287

            gerrit Gerrit Updater added a comment - Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/15594/ Subject: LU-6819 update: update transno of replay properly Project: fs/lustre-release Branch: master Current Patch Set: Commit: af13cfff297d4882de35fb7c11bf5261293b8287

            wangdi (di.wang@intel.com) uploaded a new patch: http://review.whamcloud.com/15594
            Subject: LU-6819 update: update transno of replay properly
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 2e2b244de4b7094934eb3c9dd07d3d4af7f87fbd

            gerrit Gerrit Updater added a comment - wangdi (di.wang@intel.com) uploaded a new patch: http://review.whamcloud.com/15594 Subject: LU-6819 update: update transno of replay properly Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: 2e2b244de4b7094934eb3c9dd07d3d4af7f87fbd
            di.wang Di Wang added a comment -

            Cliff: I add some debug information in https://build.hpdd.intel.com/job/lustre-reviews/33316/ . Could you please try this rpm and get me the debug log? Thanks

            di.wang Di Wang added a comment - Cliff: I add some debug information in https://build.hpdd.intel.com/job/lustre-reviews/33316/ . Could you please try this rpm and get me the debug log? Thanks
            di.wang Di Wang added a comment -

            Andreas: Even the llog (on disk) has wrapped around. But after it reconstruct the llog record into update replay request and insert to the replay list, it will make sure the list is sorted and each item will be unique. I will investigate this further. Thanks.

            di.wang Di Wang added a comment - Andreas: Even the llog (on disk) has wrapped around. But after it reconstruct the llog record into update replay request and insert to the replay list, it will make sure the list is sorted and each item will be unique. I will investigate this further. Thanks.

            Di, it looks like maybe the llog has wrapped around? What is the last_committed transno, and should those older ones be dropped?

            adilger Andreas Dilger added a comment - Di, it looks like maybe the llog has wrapped around? What is the last_committed transno, and should those older ones be dropped?
            di.wang Di Wang added a comment -

            According to the debug log, it seems update records in the replay list is out of order. But I can not any issue in the code.

            00000020:00080000:10.0:1436398138.407158:0:19026:0:(update_records.c:72:update_records_dump()) master transno = 8591034069 batchid = 4307872635 flags = 0 ops = 29 params = 21
            00000020:00080000:10.0:1436398138.407160:0:19026:0:(update_records.c:72:update_records_dump()) master transno = 8591034071 batchid = 4307872636 flags = 0 ops = 10 params = 3
            00000020:00080000:10.0:1436398138.407163:0:19026:0:(update_records.c:72:update_records_dump()) master transno = 8591034071 batchid = 4307872637 flags = 0 ops = 37 params = 9
            00000020:00080000:10.0:1436398138.407165:0:19026:0:(update_records.c:72:update_records_dump()) master transno = 8591252290 batchid = 4307879249 flags = 0 ops = 67 params = 46
            00000020:00080000:10.0:1436398138.407167:0:19026:0:(update_records.c:72:update_records_dump()) master transno = 8591252303 batchid = 4307879250 flags = 0 ops = 9 params = 5
            00000020:00080000:10.0:1436398138.407169:0:19026:0:(update_records.c:72:update_records_dump()) master transno = 8590494596 batchid = 4307862210 flags = 0 ops = 4 params = 2
            00000020:00080000:10.0:1436398138.407171:0:19026:0:(update_records.c:72:update_records_dump()) master transno = 8590494600 batchid = 4307862211 flags = 0 ops = 5 params = 8
            00000020:00080000:10.0:1436398138.407173:0:19026:0:(update_records.c:72:update_records_dump()) master transno = 8590494602 batchid = 4307862212 flags = 0 ops = 4 params = 2
            00000020:00080000:10.0:1436398138.407175:0:19026:0:(update_records.c:72:update_records_dump()) master transno = 8590494606 batchid = 4307862213 flags = 0 ops = 5 params = 8
            00000020:00080000:10.0:1436398138.407177:0:19026:0:(update_records.c:72:update_records_dump()) master transno = 8590494608 batchid = 4307862214 flags = 0 ops = 4 params = 2
            

            The list is supposed to be sorted by master transno, but seems not since 8590494596.

            di.wang Di Wang added a comment - According to the debug log, it seems update records in the replay list is out of order. But I can not any issue in the code. 00000020:00080000:10.0:1436398138.407158:0:19026:0:(update_records.c:72:update_records_dump()) master transno = 8591034069 batchid = 4307872635 flags = 0 ops = 29 params = 21 00000020:00080000:10.0:1436398138.407160:0:19026:0:(update_records.c:72:update_records_dump()) master transno = 8591034071 batchid = 4307872636 flags = 0 ops = 10 params = 3 00000020:00080000:10.0:1436398138.407163:0:19026:0:(update_records.c:72:update_records_dump()) master transno = 8591034071 batchid = 4307872637 flags = 0 ops = 37 params = 9 00000020:00080000:10.0:1436398138.407165:0:19026:0:(update_records.c:72:update_records_dump()) master transno = 8591252290 batchid = 4307879249 flags = 0 ops = 67 params = 46 00000020:00080000:10.0:1436398138.407167:0:19026:0:(update_records.c:72:update_records_dump()) master transno = 8591252303 batchid = 4307879250 flags = 0 ops = 9 params = 5 00000020:00080000:10.0:1436398138.407169:0:19026:0:(update_records.c:72:update_records_dump()) master transno = 8590494596 batchid = 4307862210 flags = 0 ops = 4 params = 2 00000020:00080000:10.0:1436398138.407171:0:19026:0:(update_records.c:72:update_records_dump()) master transno = 8590494600 batchid = 4307862211 flags = 0 ops = 5 params = 8 00000020:00080000:10.0:1436398138.407173:0:19026:0:(update_records.c:72:update_records_dump()) master transno = 8590494602 batchid = 4307862212 flags = 0 ops = 4 params = 2 00000020:00080000:10.0:1436398138.407175:0:19026:0:(update_records.c:72:update_records_dump()) master transno = 8590494606 batchid = 4307862213 flags = 0 ops = 5 params = 8 00000020:00080000:10.0:1436398138.407177:0:19026:0:(update_records.c:72:update_records_dump()) master transno = 8590494608 batchid = 4307862214 flags = 0 ops = 4 params = 2 The list is supposed to be sorted by master transno, but seems not since 8590494596.

            Lustre log from repeat of LBUG:
            stack:

            Jul  8 16:28:59 lola-11 kernel: LustreError: 19026:0:(ldlm_lib.c:2313:replay_request_or_update()) ASSERTION( tdtd->tdtd_last_update_transno <= transno ) failed:
            Jul  8 16:28:59 lola-11 kernel: LustreError: 19026:0:(ldlm_lib.c:2313:replay_request_or_update()) LBUG
            Jul  8 16:28:59 lola-11 kernel: Pid: 19026, comm: tgt_recov
            Jul  8 16:28:59 lola-11 kernel:
            Jul  8 16:28:59 lola-11 kernel: Call Trace:
            Jul  8 16:28:59 lola-11 kernel: [<ffffffffa0741875>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
            Jul  8 16:28:59 lola-11 kernel: [<ffffffffa0741e77>] lbug_with_loc+0x47/0xb0 [libcfs]
            Jul  8 16:28:59 lola-11 kernel: [<ffffffffa0a909e7>] target_recovery_thread+0x2147/0x2720 [ptlrpc]
            Jul  8 16:28:59 lola-11 kernel: [<ffffffff8152a2be>] ? thread_return+0x4e/0x7d0
            Jul  8 16:28:59 lola-11 kernel: [<ffffffffa0a8e8a0>] ? target_recovery_thread+0x0/0x2720 [ptlrpc]
            Jul  8 16:28:59 lola-11 kernel: [<ffffffff8109e78e>] kthread+0x9e/0xc0
            Jul  8 16:28:59 lola-11 kernel: [<ffffffff8100c28a>] child_rip+0xa/0x20
            Jul  8 16:28:59 lola-11 kernel: [<ffffffff8109e6f0>] ? kthread+0x0/0xc0
            Jul  8 16:28:59 lola-11 kernel: [<ffffffff8100c280>] ? child_rip+0x0/0x20
            Jul  8 16:28:59 lola-11 kernel:
            
            cliffw Cliff White (Inactive) added a comment - Lustre log from repeat of LBUG: stack: Jul 8 16:28:59 lola-11 kernel: LustreError: 19026:0:(ldlm_lib.c:2313:replay_request_or_update()) ASSERTION( tdtd->tdtd_last_update_transno <= transno ) failed: Jul 8 16:28:59 lola-11 kernel: LustreError: 19026:0:(ldlm_lib.c:2313:replay_request_or_update()) LBUG Jul 8 16:28:59 lola-11 kernel: Pid: 19026, comm: tgt_recov Jul 8 16:28:59 lola-11 kernel: Jul 8 16:28:59 lola-11 kernel: Call Trace: Jul 8 16:28:59 lola-11 kernel: [<ffffffffa0741875>] libcfs_debug_dumpstack+0x55/0x80 [libcfs] Jul 8 16:28:59 lola-11 kernel: [<ffffffffa0741e77>] lbug_with_loc+0x47/0xb0 [libcfs] Jul 8 16:28:59 lola-11 kernel: [<ffffffffa0a909e7>] target_recovery_thread+0x2147/0x2720 [ptlrpc] Jul 8 16:28:59 lola-11 kernel: [<ffffffff8152a2be>] ? thread_return+0x4e/0x7d0 Jul 8 16:28:59 lola-11 kernel: [<ffffffffa0a8e8a0>] ? target_recovery_thread+0x0/0x2720 [ptlrpc] Jul 8 16:28:59 lola-11 kernel: [<ffffffff8109e78e>] kthread+0x9e/0xc0 Jul 8 16:28:59 lola-11 kernel: [<ffffffff8100c28a>] child_rip+0xa/0x20 Jul 8 16:28:59 lola-11 kernel: [<ffffffff8109e6f0>] ? kthread+0x0/0xc0 Jul 8 16:28:59 lola-11 kernel: [<ffffffff8100c280>] ? child_rip+0x0/0x20 Jul 8 16:28:59 lola-11 kernel:

            People

              di.wang Di Wang
              cliffw Cliff White (Inactive)
              Votes:
              0 Vote for this issue
              Watchers:
              6 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: