[LU-13408] tgt_cancel_slc_locks()) ASSERTION( lock->l_client_cookie != 0 ) Created: 02/Apr/20  Updated: 16/Jun/20  Resolved: 16/Jun/20

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.14.0
Fix Version/s: Lustre 2.14.0

Type: Bug Priority: Critical
Reporter: Lai Siyao Assignee: Lai Siyao
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Duplicate
duplicates LU-13402 sanity test_252: Invalid number of md... Resolved
Related
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

An assertion is triggered, which means request->rq_transno is 0.

[25892.430672] LustreError: 137-5: fs0a92-OST0006_UUID: not available for connect from 172.16.0.32@o2ib (no target). If you are running an HA pair check that the target is mounted on the other server.
[25892.435324] LustreError: Skipped 74 previous similar messages
[27887.087931] LustreError: 27219:0:(tgt_main.c:354:tgt_cancel_slc_locks()) ASSERTION( lock->l_client_cookie != 0 ) failed: 
[27887.090610] LustreError: 27219:0:(tgt_main.c:354:tgt_cancel_slc_locks()) LBUG
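The invariant being violated can be modeled in a few lines. This is a toy sketch, not the actual Lustre code: the struct fields and helper name mirror the assertion site only loosely, and the point is that a saved server-local-cancel (SLC) lock must carry a non-zero client cookie, which a request with rq_transno == 0 can fail to provide.

```c
#include <assert.h>

/* Toy model (not actual Lustre code): a lock saved for
 * cancel-on-commit, as handled by tgt_cancel_slc_locks(). */
struct slc_lock {
	unsigned long long l_client_cookie; /* must be non-zero when saved */
	unsigned long long transno;         /* transaction that pins the lock */
};

/* Mimics the invariant asserted in tgt_cancel_slc_locks(): every lock
 * on the commit list must carry a non-zero client cookie.  A request
 * that produced no transno can leave a lock that violates this. */
int slc_lock_valid(const struct slc_lock *lock)
{
	return lock->l_client_cookie != 0;
}
```

In the real code the check is an LASSERT, so a violating lock brings the whole node down with an LBUG, as seen in the log above.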


 Comments   
Comment by Oleg Drokin [ 02/Apr/20 ]

This same failure made a noticeable appearance in current master-next, but there's nothing I can attribute it to yet.

It always happened in sanity test_300a:

[30778.381687] Lustre: DEBUG MARKER: == sanity test 300a: basic striped dir sanity test =================================================== 13:43:36 (1585763016)
[30779.451744] LustreError: 23319:0:(tgt_main.c:357:tgt_cancel_slc_locks()) ASSERTION( lock->l_client_cookie != 0 ) failed: 
[30779.474964] LustreError: 23319:0:(tgt_main.c:357:tgt_cancel_slc_locks()) LBUG
[30779.477837] Pid: 23319, comm: jbd2/dm-0-8 3.10.0-7.7-debug #1 SMP Wed Oct 30 08:47:36 EDT 2019
[30779.482233] Call Trace:
[30779.483993]  [<ffffffffa03b3ddc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
[30779.486127]  [<ffffffffa03b3e8c>] lbug_with_loc+0x4c/0xa0 [libcfs]
[30779.488094]  [<ffffffffa07eb08f>] tgt_cancel_slc_locks+0x1cf/0x1e0 [ptlrpc]
[30779.491227]  [<ffffffffa07ecdb6>] tgt_cb_last_committed+0x116/0x390 [ptlrpc]
[30779.496651]  [<ffffffffa0ce44db>] osd_trans_commit_cb+0xcb/0x2c0 [osd_ldiskfs]
[30779.500512]  [<ffffffffa0c86fa4>] ldiskfs_journal_commit_callback+0x84/0xc0 [ldiskfs]
[30779.504687]  [<ffffffffa0b14e9b>] jbd2_journal_commit_transaction+0x186b/0x1ca0 [jbd2]
[30779.514340]  [<ffffffffa0b1a87d>] kjournald2+0xcd/0x280 [jbd2]
[30779.516415]  [<ffffffff810b8254>] kthread+0xe4/0xf0
[30779.518263]  [<ffffffff817e0ddd>] ret_from_fork_nospec_begin+0x7/0x21
[30779.522814]  [<ffffffffffffffff>] 0xffffffffffffffff
[30779.528997] Kernel panic - not syncing: LBUG

I got about 5 crashes (I have crashdumps) on the first run, then it subsided; on restarting the testing I got another crash, and then it went silent again for now.

Comment by Gerrit Updater [ 08/Apr/20 ]

Alex Zhuravlev (bzzz@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/38177
Subject: LU-13408 tests: pass localrecov to MGS
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: c89337791dc9f026dfbbf2ac695dac7254a9800e

Comment by Alex Zhuravlev [ 08/Apr/20 ]

The patch above is a workaround. The root cause is that the MDT got mounted without the localrecov option, so the client's request had no transno, which is used to track the commit status of cross-MDT operations.
I guess this could be solved properly by a per-transaction callback instead of relying on the transno.
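The per-transaction callback idea could look roughly like the toy sketch below. All names here are illustrative assumptions, not real Lustre APIs: the commit hook is attached to the transaction itself, so lock cancellation no longer depends on the client having been assigned a transno.

```c
#include <stddef.h>

/* Toy sketch of a per-transaction commit callback (illustrative
 * names, not real Lustre APIs). */
typedef void (*commit_cb_t)(void *data);

struct tx {
	commit_cb_t cb;
	void *cb_data;
};

/* Attach work to run when this transaction commits, instead of
 * deriving commit state from a client-assigned transno. */
void tx_set_commit_cb(struct tx *tx, commit_cb_t cb, void *data)
{
	tx->cb = cb;
	tx->cb_data = data;
}

/* Invoked from the journal commit path (cf. tgt_cb_last_committed
 * in the stack trace above). */
void tx_committed(struct tx *tx)
{
	if (tx->cb != NULL)
		tx->cb(tx->cb_data);
}

/* Example callback: record which transaction's locks were cancelled. */
static int g_cancelled;
static void cancel_slc_locks_cb(void *data)
{
	g_cancelled = *(int *)data;
}
```

With this shape, a request that never received a transno still gets its saved locks cancelled when its own transaction commits.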

Comment by Andreas Dilger [ 19/Apr/20 ]

Alex, wouldn't it be better to handle this by ignoring the "localrecov" behavior for MDT and MGT mounts, rather than changing the transaction callback?

Comment by Andreas Dilger [ 23/Apr/20 ]

I think this issue is fixed by patch https://review.whamcloud.com/38138 "LU-13402 target: never exclude MDT/OST from last_rcvd" but I'd like Alex to confirm before this issue is closed.

Comment by Alex Zhuravlev [ 24/Apr/20 ]

Andreas, no, these are different problems. This issue happens because the client (running on the MDS) is excluded from recovery and thus doesn't generate a transno, which is used to track commit status (in turn used to cancel locks).

Comment by Andreas Dilger [ 01/May/20 ]

+1 on master running sanity test_103a in my VM.
+1 on master running sanity test_103b in my VM.

It seems this is easy for me to reproduce, so let me know if there is something you think will fix this properly.

Comment by Lai Siyao [ 05/Jun/20 ]

The LDLM lock is handled in the MDT layer, while the transaction is in the MDD layer; it's a layering violation to mix them together. IMO if the "req_transno" of an operation is not 0, there is no need to enforce Commit-on-Sharing for such an operation.
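The logic of the eventual fix ("don't save remote lock if req_transno is 0") can be sketched as a single guard. This is a simplified illustration of the patch's intent, not the actual patch; the function name is hypothetical.

```c
#include <assert.h>

/* Illustrative sketch (hypothetical name): only save a remote lock
 * for cancel-on-commit when the request carries a real transno. */
int should_save_remote_lock(unsigned long long req_transno)
{
	/* A zero transno means there is no transaction to wait on, so
	 * saving the lock would leave an entry on the commit list that
	 * later trips the l_client_cookie != 0 assertion. */
	return req_transno != 0;
}
```

The guard sidesteps the layering problem: requests without a transno simply never enter the cancel-on-commit path.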

Comment by Gerrit Updater [ 07/Jun/20 ]

Lai Siyao (lai.siyao@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/38855
Subject: LU-13408 mdt: don't save remote lock if req_transno is 0
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: af33439d1e383d10b226870fd533f33ffac7f078

Comment by Gerrit Updater [ 16/Jun/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/38855/
Subject: LU-13408 target: update in-memory per client data
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 300858ccfcd00b52663de45e0bb472012242f342

Comment by Peter Jones [ 16/Jun/20 ]

Landed for 2.14

Generated at Sat Feb 10 03:01:01 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.