[LU-6819] LBUG ASSERTION( tdtd->tdtd_last_update_transno <= transno ) failed Created: 08/Jul/15 Updated: 26/Aug/15 Resolved: 26/Aug/15 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.8.0 |
| Fix Version/s: | Lustre 2.8.0 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Cliff White (Inactive) | Assignee: | Di Wang |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Environment: |
Lola - 2.7.56 tag |
||
| Attachments: |
|
||||||||
| Issue Links: |
|
||||||||
| Severity: | 3 | ||||||||
| Rank (Obsolete): | 9223372036854775807 | ||||||||
| Description |
|
After several MDT failovers (ldiskfs format) the MDT is failing to mount. Mount produces this LBUG: Jul 8 15:55:53 lola-11 kernel: LustreError: 38479:0:(ldlm_lib.c:2313:replay_request_or_update()) ASSERTION( tdtd->tdtd_last_update_transno <= transno ) failed: Jul 8 15:55:53 lola-11 kernel: LustreError: 38479:0:(ldlm_lib.c:2313:replay_request_or_update()) LBUG Unfortunately panic_on_lbug was set. Will attempt to reproduce and get stack Call Trace:^M [<ffffffffa0741875>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]^M [<ffffffffa0741e77>] lbug_with_loc+0x47/0xb0 [libcfs]^M [<ffffffffa0a909e7>] target_recovery_thread+0x2147/0x2720 [ptlrpc]^M [<ffffffff8152a2be>] ? thread_return+0x4e/0x7d0^M [<ffffffffa0a8e8a0>] ? target_recovery_thread+0x0/0x2720 [ptlrpc]^M [<ffffffff8109e78e>] kthread+0x9e/0xc0^M [<ffffffff8100c28a>] child_rip+0xa/0x20^M [<ffffffff8109e6f0>] ? kthread+0x0/0xc0^M [<ffffffff8100c280>] ? child_rip+0x0/0x20^M |
| Comments |
| Comment by Cliff White (Inactive) [ 08/Jul/15 ] |
|
Lustre log from repeat of LBUG: Jul 8 16:28:59 lola-11 kernel: LustreError: 19026:0:(ldlm_lib.c:2313:replay_request_or_update()) ASSERTION( tdtd->tdtd_last_update_transno <= transno ) failed: Jul 8 16:28:59 lola-11 kernel: LustreError: 19026:0:(ldlm_lib.c:2313:replay_request_or_update()) LBUG Jul 8 16:28:59 lola-11 kernel: Pid: 19026, comm: tgt_recov Jul 8 16:28:59 lola-11 kernel: Jul 8 16:28:59 lola-11 kernel: Call Trace: Jul 8 16:28:59 lola-11 kernel: [<ffffffffa0741875>] libcfs_debug_dumpstack+0x55/0x80 [libcfs] Jul 8 16:28:59 lola-11 kernel: [<ffffffffa0741e77>] lbug_with_loc+0x47/0xb0 [libcfs] Jul 8 16:28:59 lola-11 kernel: [<ffffffffa0a909e7>] target_recovery_thread+0x2147/0x2720 [ptlrpc] Jul 8 16:28:59 lola-11 kernel: [<ffffffff8152a2be>] ? thread_return+0x4e/0x7d0 Jul 8 16:28:59 lola-11 kernel: [<ffffffffa0a8e8a0>] ? target_recovery_thread+0x0/0x2720 [ptlrpc] Jul 8 16:28:59 lola-11 kernel: [<ffffffff8109e78e>] kthread+0x9e/0xc0 Jul 8 16:28:59 lola-11 kernel: [<ffffffff8100c28a>] child_rip+0xa/0x20 Jul 8 16:28:59 lola-11 kernel: [<ffffffff8109e6f0>] ? kthread+0x0/0xc0 Jul 8 16:28:59 lola-11 kernel: [<ffffffff8100c280>] ? child_rip+0x0/0x20 Jul 8 16:28:59 lola-11 kernel: |
| Comment by Di Wang [ 09/Jul/15 ] |
|
According to the debug log, it seems update records in the replay list is out of order. But I can not any issue in the code. 00000020:00080000:10.0:1436398138.407158:0:19026:0:(update_records.c:72:update_records_dump()) master transno = 8591034069 batchid = 4307872635 flags = 0 ops = 29 params = 21 00000020:00080000:10.0:1436398138.407160:0:19026:0:(update_records.c:72:update_records_dump()) master transno = 8591034071 batchid = 4307872636 flags = 0 ops = 10 params = 3 00000020:00080000:10.0:1436398138.407163:0:19026:0:(update_records.c:72:update_records_dump()) master transno = 8591034071 batchid = 4307872637 flags = 0 ops = 37 params = 9 00000020:00080000:10.0:1436398138.407165:0:19026:0:(update_records.c:72:update_records_dump()) master transno = 8591252290 batchid = 4307879249 flags = 0 ops = 67 params = 46 00000020:00080000:10.0:1436398138.407167:0:19026:0:(update_records.c:72:update_records_dump()) master transno = 8591252303 batchid = 4307879250 flags = 0 ops = 9 params = 5 00000020:00080000:10.0:1436398138.407169:0:19026:0:(update_records.c:72:update_records_dump()) master transno = 8590494596 batchid = 4307862210 flags = 0 ops = 4 params = 2 00000020:00080000:10.0:1436398138.407171:0:19026:0:(update_records.c:72:update_records_dump()) master transno = 8590494600 batchid = 4307862211 flags = 0 ops = 5 params = 8 00000020:00080000:10.0:1436398138.407173:0:19026:0:(update_records.c:72:update_records_dump()) master transno = 8590494602 batchid = 4307862212 flags = 0 ops = 4 params = 2 00000020:00080000:10.0:1436398138.407175:0:19026:0:(update_records.c:72:update_records_dump()) master transno = 8590494606 batchid = 4307862213 flags = 0 ops = 5 params = 8 00000020:00080000:10.0:1436398138.407177:0:19026:0:(update_records.c:72:update_records_dump()) master transno = 8590494608 batchid = 4307862214 flags = 0 ops = 4 params = 2 The list is supposed to be sorted by master transno, but seems not since 8590494596. |
| Comment by Andreas Dilger [ 09/Jul/15 ] |
|
Di, it looks like maybe the llog has wrapped around? What is the last_committed transno, and should those older ones be dropped? |
| Comment by Di Wang [ 11/Jul/15 ] |
|
Andreas: Even the llog (on disk) has wrapped around. But after it reconstruct the llog record into update replay request and insert to the replay list, it will make sure the list is sorted and each item will be unique. I will investigate this further. Thanks. |
| Comment by Di Wang [ 13/Jul/15 ] |
|
Cliff: I add some debug information in https://build.hpdd.intel.com/job/lustre-reviews/33316/ . Could you please try this rpm and get me the debug log? Thanks |
| Comment by Gerrit Updater [ 14/Jul/15 ] |
|
wangdi (di.wang@intel.com) uploaded a new patch: http://review.whamcloud.com/15594 |
| Comment by Gerrit Updater [ 26/Aug/15 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/15594/ |
| Comment by Joseph Gmitter (Inactive) [ 26/Aug/15 ] |
|
Landed for 2.8. |