[LU-6819] LBUG ASSERTION( tdtd->tdtd_last_update_transno <= transno ) failed Created: 08/Jul/15  Updated: 26/Aug/15  Resolved: 26/Aug/15

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.8.0
Fix Version/s: Lustre 2.8.0

Type: Bug Priority: Blocker
Reporter: Cliff White (Inactive) Assignee: Di Wang
Resolution: Fixed Votes: 0
Labels: None
Environment:

Lola - 2.7.56 tag


Attachments: File lola-11.log.gz    
Issue Links:
Related
is related to LU-6831 The ticket for tracking all DNE2 bugs Reopened
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

After several MDT failovers (ldiskfs format) the MDT is failing to mount. Mount produces this LBUG:

Jul  8 15:55:53 lola-11 kernel: LustreError: 38479:0:(ldlm_lib.c:2313:replay_request_or_update()) ASSERTION( tdtd->tdtd_last_update_transno <= transno ) failed:
Jul  8 15:55:53 lola-11 kernel: LustreError: 38479:0:(ldlm_lib.c:2313:replay_request_or_update()) LBUG

Unfortunately panic_on_lbug was set. Will attempt to reproduce and get stack
So far we have

Call Trace:^M
 [<ffffffffa0741875>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]^M
 [<ffffffffa0741e77>] lbug_with_loc+0x47/0xb0 [libcfs]^M
 [<ffffffffa0a909e7>] target_recovery_thread+0x2147/0x2720 [ptlrpc]^M
 [<ffffffff8152a2be>] ? thread_return+0x4e/0x7d0^M
 [<ffffffffa0a8e8a0>] ? target_recovery_thread+0x0/0x2720 [ptlrpc]^M
 [<ffffffff8109e78e>] kthread+0x9e/0xc0^M
 [<ffffffff8100c28a>] child_rip+0xa/0x20^M
 [<ffffffff8109e6f0>] ? kthread+0x0/0xc0^M
 [<ffffffff8100c280>] ? child_rip+0x0/0x20^M


 Comments   
Comment by Cliff White (Inactive) [ 08/Jul/15 ]

Lustre log from repeat of LBUG:
stack:

Jul  8 16:28:59 lola-11 kernel: LustreError: 19026:0:(ldlm_lib.c:2313:replay_request_or_update()) ASSERTION( tdtd->tdtd_last_update_transno <= transno ) failed:
Jul  8 16:28:59 lola-11 kernel: LustreError: 19026:0:(ldlm_lib.c:2313:replay_request_or_update()) LBUG
Jul  8 16:28:59 lola-11 kernel: Pid: 19026, comm: tgt_recov
Jul  8 16:28:59 lola-11 kernel:
Jul  8 16:28:59 lola-11 kernel: Call Trace:
Jul  8 16:28:59 lola-11 kernel: [<ffffffffa0741875>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
Jul  8 16:28:59 lola-11 kernel: [<ffffffffa0741e77>] lbug_with_loc+0x47/0xb0 [libcfs]
Jul  8 16:28:59 lola-11 kernel: [<ffffffffa0a909e7>] target_recovery_thread+0x2147/0x2720 [ptlrpc]
Jul  8 16:28:59 lola-11 kernel: [<ffffffff8152a2be>] ? thread_return+0x4e/0x7d0
Jul  8 16:28:59 lola-11 kernel: [<ffffffffa0a8e8a0>] ? target_recovery_thread+0x0/0x2720 [ptlrpc]
Jul  8 16:28:59 lola-11 kernel: [<ffffffff8109e78e>] kthread+0x9e/0xc0
Jul  8 16:28:59 lola-11 kernel: [<ffffffff8100c28a>] child_rip+0xa/0x20
Jul  8 16:28:59 lola-11 kernel: [<ffffffff8109e6f0>] ? kthread+0x0/0xc0
Jul  8 16:28:59 lola-11 kernel: [<ffffffff8100c280>] ? child_rip+0x0/0x20
Jul  8 16:28:59 lola-11 kernel:
Comment by Di Wang [ 09/Jul/15 ]

According to the debug log, it seems update records in the replay list is out of order. But I can not any issue in the code.

00000020:00080000:10.0:1436398138.407158:0:19026:0:(update_records.c:72:update_records_dump()) master transno = 8591034069 batchid = 4307872635 flags = 0 ops = 29 params = 21
00000020:00080000:10.0:1436398138.407160:0:19026:0:(update_records.c:72:update_records_dump()) master transno = 8591034071 batchid = 4307872636 flags = 0 ops = 10 params = 3
00000020:00080000:10.0:1436398138.407163:0:19026:0:(update_records.c:72:update_records_dump()) master transno = 8591034071 batchid = 4307872637 flags = 0 ops = 37 params = 9
00000020:00080000:10.0:1436398138.407165:0:19026:0:(update_records.c:72:update_records_dump()) master transno = 8591252290 batchid = 4307879249 flags = 0 ops = 67 params = 46
00000020:00080000:10.0:1436398138.407167:0:19026:0:(update_records.c:72:update_records_dump()) master transno = 8591252303 batchid = 4307879250 flags = 0 ops = 9 params = 5
00000020:00080000:10.0:1436398138.407169:0:19026:0:(update_records.c:72:update_records_dump()) master transno = 8590494596 batchid = 4307862210 flags = 0 ops = 4 params = 2
00000020:00080000:10.0:1436398138.407171:0:19026:0:(update_records.c:72:update_records_dump()) master transno = 8590494600 batchid = 4307862211 flags = 0 ops = 5 params = 8
00000020:00080000:10.0:1436398138.407173:0:19026:0:(update_records.c:72:update_records_dump()) master transno = 8590494602 batchid = 4307862212 flags = 0 ops = 4 params = 2
00000020:00080000:10.0:1436398138.407175:0:19026:0:(update_records.c:72:update_records_dump()) master transno = 8590494606 batchid = 4307862213 flags = 0 ops = 5 params = 8
00000020:00080000:10.0:1436398138.407177:0:19026:0:(update_records.c:72:update_records_dump()) master transno = 8590494608 batchid = 4307862214 flags = 0 ops = 4 params = 2

The list is supposed to be sorted by master transno, but seems not since 8590494596.

Comment by Andreas Dilger [ 09/Jul/15 ]

Di, it looks like maybe the llog has wrapped around? What is the last_committed transno, and should those older ones be dropped?

Comment by Di Wang [ 11/Jul/15 ]

Andreas: Even the llog (on disk) has wrapped around. But after it reconstruct the llog record into update replay request and insert to the replay list, it will make sure the list is sorted and each item will be unique. I will investigate this further. Thanks.

Comment by Di Wang [ 13/Jul/15 ]

Cliff: I add some debug information in https://build.hpdd.intel.com/job/lustre-reviews/33316/ . Could you please try this rpm and get me the debug log? Thanks

Comment by Gerrit Updater [ 14/Jul/15 ]

wangdi (di.wang@intel.com) uploaded a new patch: http://review.whamcloud.com/15594
Subject: LU-6819 update: update transno of replay properly
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 2e2b244de4b7094934eb3c9dd07d3d4af7f87fbd

Comment by Gerrit Updater [ 26/Aug/15 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/15594/
Subject: LU-6819 update: update transno of replay properly
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: af13cfff297d4882de35fb7c11bf5261293b8287

Comment by Joseph Gmitter (Inactive) [ 26/Aug/15 ]

Landed for 2.8.

Generated at Sat Feb 10 02:03:31 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.