[LU-13408] tgt_cancel_slc_locks()) ASSERTION( lock->l_client_cookie != 0 ) Created: 02/Apr/20 Updated: 16/Jun/20 Resolved: 16/Jun/20 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.14.0 |
| Fix Version/s: | Lustre 2.14.0 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Lai Siyao | Assignee: | Lai Siyao |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
An assertion is triggered; it means that request->rq_transno is 0.
[25892.430672] LustreError: 137-5: fs0a92-OST0006_UUID: not available for connect from 172.16.0.32@o2ib (no target). If you are running an HA pair check that the target is mounted on the other server.
[25892.435324] LustreError: Skipped 74 previous similar messages
[27887.087931] LustreError: 27219:0:(tgt_main.c:354:tgt_cancel_slc_locks()) ASSERTION( lock->l_client_cookie != 0 ) failed:
[27887.090610] LustreError: 27219:0:(tgt_main.c:354:tgt_cancel_slc_locks()) LBUG |
| Comments |
| Comment by Oleg Drokin [ 02/Apr/20 ] |
|
This same failure made a noticeable appearance in current master-next, but there's nothing I am able to attribute it to yet. It always happened in sanity 300a:
[30778.381687] Lustre: DEBUG MARKER: == sanity test 300a: basic striped dir sanity test =================================================== 13:43:36 (1585763016)
[30779.451744] LustreError: 23319:0:(tgt_main.c:357:tgt_cancel_slc_locks()) ASSERTION( lock->l_client_cookie != 0 ) failed:
[30779.474964] LustreError: 23319:0:(tgt_main.c:357:tgt_cancel_slc_locks()) LBUG
[30779.477837] Pid: 23319, comm: jbd2/dm-0-8 3.10.0-7.7-debug #1 SMP Wed Oct 30 08:47:36 EDT 2019
[30779.482233] Call Trace:
[30779.483993] [<ffffffffa03b3ddc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
[30779.486127] [<ffffffffa03b3e8c>] lbug_with_loc+0x4c/0xa0 [libcfs]
[30779.488094] [<ffffffffa07eb08f>] tgt_cancel_slc_locks+0x1cf/0x1e0 [ptlrpc]
[30779.491227] [<ffffffffa07ecdb6>] tgt_cb_last_committed+0x116/0x390 [ptlrpc]
[30779.496651] [<ffffffffa0ce44db>] osd_trans_commit_cb+0xcb/0x2c0 [osd_ldiskfs]
[30779.500512] [<ffffffffa0c86fa4>] ldiskfs_journal_commit_callback+0x84/0xc0 [ldiskfs]
[30779.504687] [<ffffffffa0b14e9b>] jbd2_journal_commit_transaction+0x186b/0x1ca0 [jbd2]
[30779.514340] [<ffffffffa0b1a87d>] kjournald2+0xcd/0x280 [jbd2]
[30779.516415] [<ffffffff810b8254>] kthread+0xe4/0xf0
[30779.518263] [<ffffffff817e0ddd>] ret_from_fork_nospec_begin+0x7/0x21
[30779.522814] [<ffffffffffffffff>] 0xffffffffffffffff
[30779.528997] Kernel panic - not syncing: LBUG
I got about 5 crashes (I have crashdumps) on the first run, then it subsided. After restarting testing I got another crash, and it has gone silent again for now. |
| Comment by Gerrit Updater [ 08/Apr/20 ] |
|
Alex Zhuravlev (bzzz@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/38177 |
| Comment by Alex Zhuravlev [ 08/Apr/20 ] |
|
The patch above is a workaround. The root cause is that the MDT got mounted without the localrecov option, and then the client's request had no transno. |
| Comment by Andreas Dilger [ 19/Apr/20 ] |
|
Alex, wouldn't it be better to handle this by ignoring the "localrecov" behavior for MDT and MGT mounts, rather than changing the transaction callback? |
| Comment by Andreas Dilger [ 23/Apr/20 ] |
|
I think this issue is fixed by patch https://review.whamcloud.com/38138 " |
| Comment by Alex Zhuravlev [ 24/Apr/20 ] |
|
Andreas, no, these are different problems. This issue happens because the client (running on the MDS) is excluded from recovery and thus doesn't generate a transno, which is used to track commit status (in turn used to cancel locks). |
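The dependency described in this comment (lock cancellation driven by commit tracking, which in turn needs a non-zero transno) can be illustrated with a small standalone model. This is only a sketch under stated assumptions, not the Lustre implementation: `struct model_lock`, `enum model_status`, and `model_cancel_committed()` are hypothetical names invented for illustration, and the model returns an error code where the real code in the traces above hits the ASSERTION and LBUGs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for a server-side lock.
 * This is NOT the real struct ldlm_lock. */
struct model_lock {
	uint64_t transno;   /* transno of the request that took the lock; 0 = none assigned */
	int      cancelled; /* set to 1 once the lock has been cancelled */
};

enum model_status {
	MODEL_OK = 0,
	MODEL_UNTRACKED = -1, /* a lock carries transno 0, so its commit cannot be tracked */
};

/*
 * Model of a commit callback: everything up to and including
 * last_committed is assumed to be on stable storage, so every lock whose
 * transno is covered can be cancelled.
 *
 * A lock with transno 0 can never be matched against a commit, which is
 * the situation the assertion in this ticket guards against. The model
 * stops at the first such lock and reports MODEL_UNTRACKED; locks
 * visited before it may already have been cancelled.
 */
enum model_status
model_cancel_committed(struct model_lock *locks, size_t n,
		       uint64_t last_committed, size_t *ncancelled)
{
	size_t i;

	assert(locks != NULL && ncancelled != NULL);
	*ncancelled = 0;

	for (i = 0; i < n; i++) {
		if (locks[i].cancelled)
			continue;
		if (locks[i].transno == 0)
			return MODEL_UNTRACKED;
		if (locks[i].transno <= last_committed) {
			locks[i].cancelled = 1;
			(*ncancelled)++;
		}
	}
	return MODEL_OK;
}
```

In this model, locks taken under transnos 5, 9 and 12 with a last committed transno of 9 leave only the third lock outstanding, while a single lock with transno 0 makes the call report `MODEL_UNTRACKED` regardless of how far the commit has advanced.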
| Comment by Andreas Dilger [ 01/May/20 ] |
|
+1 on master running sanity test_103a in my VM. Seems like this is easy for me to reproduce if there is something that you think will fix this properly. |
| Comment by Lai Siyao [ 05/Jun/20 ] |
|
The LDLM lock is handled in the MDT layer, while the transaction is in the MDD layer; mixing them together is a layer violation. IMO if the "req_transno" of an operation is not 0, Commit-on-Sharing does not need to be enforced for that operation. |
| Comment by Gerrit Updater [ 07/Jun/20 ] |
|
Lai Siyao (lai.siyao@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/38855 |
| Comment by Gerrit Updater [ 16/Jun/20 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/38855/ |
| Comment by Peter Jones [ 16/Jun/20 ] |
|
Landed for 2.14 |