[LU-13927] MDS crash when increasing max_rpcs_in_flight to 256 Created: 26/Aug/20  Updated: 24/Oct/20  Resolved: 24/Oct/20

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.5
Fix Version/s: None

Type: Bug Priority: Critical
Reporter: James A Simmons Assignee: Mikhail Pershin
Resolution: Duplicate Votes: 0
Labels: ORNL
Environment:

RHEL7 server nodes running 2.12.5 LTS.


Issue Links:
Related
is related to LU-11276 racer: mdc_dev.c:1346:mdc_req_attr_se... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

When setting max_rpcs_in_flight to 256, the MDS crashed with the following backtrace.

[3072807.665012] LustreError: 106301:0:(ldlm_lockd.c:1543:ldlm_handle_convert0()) Skipped 6 previous similar messages

[3072920.767949] LustreError: 107784:0:(ldlm_lockd.c:1543:ldlm_handle_convert0()) ### convert on canceled lock! ns: mdt-storm-MDT0000_UUID lock: ffff8fbfd69a2400/0x8f43eb98e65eb06e lrc: 3/0,0 mode: PR/PR res: [0x20000560c:0x9f09:0x0].0x0 bits 0x58/0x0 rrc: 4 type: IBT flags: 0x54a01400010020 nid: 10.134.129.9@tcp55 remote: 0xc1b65128fa6df589 expref: 31059 pid: 154261 timeout: 3080537 lvb_type: 0

[3072920.805945] LustreError: 107784:0:(ldlm_lockd.c:1543:ldlm_handle_convert0()) Skipped 4 previous similar messages

[3072929.398817] LustreError: 106301:0:(ldlm_lock.c:1106:ldlm_grant_lock_with_skiplist()) ASSERTION( ldlm_is_granted(lock) ) failed:

[3072929.412226] LustreError: 106301:0:(ldlm_lock.c:1106:ldlm_grant_lock_with_skiplist()) LBUG

[3072929.421404] Pid: 106301, comm: ldlm_cn00_002 3.10.0-1127.13.1.el7.x86_64 #1 SMP Fri Jun 12 14:34:17 EDT 2020

[3072929.432225] Call Trace:

[3072929.435691]  [<ffffffffc282a7cc>] libcfs_call_trace+0x8c/0xc0 [libcfs]

[3072929.443252]  [<ffffffffc282a87c>] lbug_with_loc+0x4c/0xa0 [libcfs]

[3072929.450458]  [<ffffffffc164fa87>] ldlm_grant_lock_with_skiplist+0x607/0x750 [ptlrpc]

[3072929.459259]  [<ffffffffc1682d0a>] ldlm_inodebits_drop+0xaa/0x170 [ptlrpc]

[3072929.467092]  [<ffffffffc167b3fb>] ldlm_handle_convert0+0x2db/0x460 [ptlrpc]

[3072929.475080]  [<ffffffffc167bacb>] ldlm_cancel_handler+0x29b/0x590 [ptlrpc]

[3072929.482957]  [<ffffffffc16ae48b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]

[3072929.491613]  [<ffffffffc16b1df4>] ptlrpc_main+0xb34/0x1470 [ptlrpc]

[3072929.498873]  [<ffffffff930c6691>] kthread+0xd1/0xe0

[3072929.504710]  [<ffffffff93792d1d>] ret_from_fork_nospec_begin+0x7/0x21

[3072929.512100]  [<ffffffffffffffff>] 0xffffffffffffffff

[3072929.518025] Kernel panic - not syncing: LBUG

[3072929.523194] CPU: 1 PID: 106301 Comm: ldlm_cn00_002 Kdump: loaded Tainted: P           OE  ------------ T 3.10.0-1127.13.1.el7.x86_64 #1

[3072929.536964] Hardware name: Dell Inc. PowerEdge R640/0RGP26, BIOS 2.3.10 08/15/2019

[3072929.545412] Call Trace:

[3072929.548751]  [<ffffffff9377ffa5>] dump_stack+0x19/0x1b

[3072929.554758]  [<ffffffff93779541>] panic+0xe8/0x21f

[3072929.560410]  [<ffffffffc282a8cb>] lbug_with_loc+0x9b/0xa0 [libcfs]

[3072929.567463]  [<ffffffffc164fa87>] ldlm_grant_lock_with_skiplist+0x607/0x750 [ptlrpc]

[3072929.576066]  [<ffffffffc1682d0a>] ldlm_inodebits_drop+0xaa/0x170 [ptlrpc]

[3072929.583705]  [<ffffffffc167b3fb>] ldlm_handle_convert0+0x2db/0x460 [ptlrpc]

[3072929.591502]  [<ffffffffc167bacb>] ldlm_cancel_handler+0x29b/0x590 [ptlrpc]

[3072929.599199]  [<ffffffffc16ae48b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]

[3072929.607671]  [<ffffffffc16ab2a5>] ? ptlrpc_wait_event+0xa5/0x360 [ptlrpc]

[3072929.615245]  [<ffffffff930d3dc3>] ? __wake_up+0x13/0x20

[3072929.621272]  [<ffffffffc16b1df4>] ptlrpc_main+0xb34/0x1470 [ptlrpc]

[3072929.628307]  [<ffffffff93785942>] ? __schedule+0x402/0x840
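
When comparing a crash like the one above against other tickets (for example the related LU-11276), it can help to reduce the console log to a short signature: the failed assertion plus the reliable stack frames. The following is a minimal sketch of that idea, not part of Lustre or its tooling. The function name crash_signature and both regular expressions are assumptions made for illustration, written against the log format shown in this ticket only.

```python
import re

# Hypothetical helper, not a Lustre tool. Matches kernel stack-frame lines like:
#   [3072929.450458]  [<ffffffffc164fa87>] ldlm_grant_lock_with_skiplist+0x607/0x750 [ptlrpc]
# Group 1 captures the optional '?' marker, group 2 the symbol name.
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\]\s+(\?\s+)?([A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+"
)

# Matches the LustreError assertion line that precedes the LBUG.
# Group 1 is the asserting function, group 2 the asserted expression.
ASSERT_RE = re.compile(r"\(\w+\.c:\d+:(\w+)\(\)\) ASSERTION\( (.+?) \) failed")


def crash_signature(log_text):
    """Return ((function, expression) or None, [frame symbols]) for a console log."""
    assertion = None
    frames = []
    for line in log_text.splitlines():
        a = ASSERT_RE.search(line)
        if a and assertion is None:
            assertion = (a.group(1), a.group(2))
        f = FRAME_RE.search(line)
        # Frames prefixed with '?' are unreliable guesses by the unwinder; skip them.
        if f and f.group(1) is None:
            frames.append(f.group(2))
    return assertion, frames


# Sample lines copied from the trace in this ticket.
SAMPLE = """\
[3072929.398817] LustreError: 106301:0:(ldlm_lock.c:1106:ldlm_grant_lock_with_skiplist()) ASSERTION( ldlm_is_granted(lock) ) failed:
[3072929.450458]  [<ffffffffc164fa87>] ldlm_grant_lock_with_skiplist+0x607/0x750 [ptlrpc]
[3072929.459259]  [<ffffffffc1682d0a>] ldlm_inodebits_drop+0xaa/0x170 [ptlrpc]
[3072929.467092]  [<ffffffffc167b3fb>] ldlm_handle_convert0+0x2db/0x460 [ptlrpc]
[3072929.607671]  [<ffffffffc16ab2a5>] ? ptlrpc_wait_event+0xa5/0x360 [ptlrpc]
[3072929.512100]  [<ffffffffffffffff>] 0xffffffffffffffff
"""

assertion, frames = crash_signature(SAMPLE)
print(assertion)
print(" <- ".join(frames))
```

On the sample lines, the sketch reports the ldlm_is_granted(lock) assertion in ldlm_grant_lock_with_skiplist and the call chain ldlm_handle_convert0, ldlm_inodebits_drop, ldlm_grant_lock_with_skiplist, which is the path the lock convert request took before the LBUG.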



 Comments   
Comment by Peter Jones [ 27/Aug/20 ]

Mike

 Could you please advise?

Thanks

Peter

Comment by Oleg Drokin [ 27/Aug/20 ]

I think potentially this might be fixed by https://review.whamcloud.com/#/c/36466/11

Comment by James A Simmons [ 28/Aug/20 ]

I noticed that patch 36466 doesn't apply cleanly to 2.12 LTS. Is an earlier patch needed to make it apply, or should the patch just be modified?

Comment by Mikhail Pershin [ 30/Aug/20 ]

James, I am checking that

Comment by Mikhail Pershin [ 09/Sep/20 ]

Here is b2_12 patch: https://review.whamcloud.com/39854/

 

Comment by James A Simmons [ 23/Oct/20 ]

Just to let you know, the patch worked.

Comment by Peter Jones [ 24/Oct/20 ]

Good news. I'll close the ticket then, because the fix has also landed for 2.12.6.
