[LU-13927] MDS crash when increasing max_rpcs_in_flight to 256 Created: 26/Aug/20 Updated: 24/Oct/20 Resolved: 24/Oct/20 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.12.5 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Critical |
| Reporter: | James A Simmons | Assignee: | Mikhail Pershin |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | ORNL | ||
| Environment: |
RHEL7 server nodes running 2.12.5 LTS. |
||
| Issue Links: |
|
||||||||
| Severity: | 3 | ||||||||
| Rank (Obsolete): | 9223372036854775807 | ||||||||
| Description |
|
When setting max_rpcs_in_flight to 256, the MDS crashed with the following backtrace:
[3072807.665012] LustreError: 106301:0:(ldlm_lockd.c:1543:ldlm_handle_convert0()) Skipped 6 previous similar messages
[3072920.767949] LustreError: 107784:0:(ldlm_lockd.c:1543:ldlm_handle_convert0()) ### convert on canceled lock! ns: mdt-storm-MDT0000_UUID lock: ffff8fbfd69a2400/0x8f43eb98e65eb06e lrc: 3/0,0 mode: PR/PR res: [0x20000560c:0x9f09:0x0].0x0 bits 0x58/0x0 rrc: 4 type: IBT flags: 0x54a01400010020 nid: 10.134.129.9@tcp55 remote: 0xc1b65128fa6df589 expref: 31059 pid: 154261 timeout: 3080537 lvb_type: 0
[3072920.805945] LustreError: 107784:0:(ldlm_lockd.c:1543:ldlm_handle_convert0()) Skipped 4 previous similar messages
[3072929.398817] LustreError: 106301:0:(ldlm_lock.c:1106:ldlm_grant_lock_with_skiplist()) ASSERTION( ldlm_is_granted(lock) ) failed:
[3072929.412226] LustreError: 106301:0:(ldlm_lock.c:1106:ldlm_grant_lock_with_skiplist()) LBUG
[3072929.421404] Pid: 106301, comm: ldlm_cn00_002 3.10.0-1127.13.1.el7.x86_64 #1 SMP Fri Jun 12 14:34:17 EDT 2020
[3072929.432225] Call Trace:
[3072929.435691] [<ffffffffc282a7cc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
[3072929.443252] [<ffffffffc282a87c>] lbug_with_loc+0x4c/0xa0 [libcfs]
[3072929.450458] [<ffffffffc164fa87>] ldlm_grant_lock_with_skiplist+0x607/0x750 [ptlrpc]
[3072929.459259] [<ffffffffc1682d0a>] ldlm_inodebits_drop+0xaa/0x170 [ptlrpc]
[3072929.467092] [<ffffffffc167b3fb>] ldlm_handle_convert0+0x2db/0x460 [ptlrpc]
[3072929.475080] [<ffffffffc167bacb>] ldlm_cancel_handler+0x29b/0x590 [ptlrpc]
[3072929.482957] [<ffffffffc16ae48b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
[3072929.491613] [<ffffffffc16b1df4>] ptlrpc_main+0xb34/0x1470 [ptlrpc]
[3072929.498873] [<ffffffff930c6691>] kthread+0xd1/0xe0
[3072929.504710] [<ffffffff93792d1d>] ret_from_fork_nospec_begin+0x7/0x21
[3072929.512100] [<ffffffffffffffff>] 0xffffffffffffffff
[3072929.518025] Kernel panic - not syncing: LBUG
[3072929.523194] CPU: 1 PID: 106301 Comm: ldlm_cn00_002 Kdump: loaded Tainted: P OE ------------ T 3.10.0-1127.13.1.el7.x86_64 #1
[3072929.536964] Hardware name: Dell Inc. PowerEdge R640/0RGP26, BIOS 2.3.10 08/15/2019
[3072929.545412] Call Trace:
[3072929.548751] [<ffffffff9377ffa5>] dump_stack+0x19/0x1b
[3072929.554758] [<ffffffff93779541>] panic+0xe8/0x21f
[3072929.560410] [<ffffffffc282a8cb>] lbug_with_loc+0x9b/0xa0 [libcfs]
[3072929.567463] [<ffffffffc164fa87>] ldlm_grant_lock_with_skiplist+0x607/0x750 [ptlrpc]
[3072929.576066] [<ffffffffc1682d0a>] ldlm_inodebits_drop+0xaa/0x170 [ptlrpc]
[3072929.583705] [<ffffffffc167b3fb>] ldlm_handle_convert0+0x2db/0x460 [ptlrpc]
[3072929.591502] [<ffffffffc167bacb>] ldlm_cancel_handler+0x29b/0x590 [ptlrpc]
[3072929.599199] [<ffffffffc16ae48b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
[3072929.607671] [<ffffffffc16ab2a5>] ? ptlrpc_wait_event+0xa5/0x360 [ptlrpc]
[3072929.615245] [<ffffffff930d3dc3>] ? __wake_up+0x13/0x20
[3072929.621272] [<ffffffffc16b1df4>] ptlrpc_main+0xb34/0x1470 [ptlrpc]
[3072929.628307] [<ffffffff93785942>] ? __schedule+0x402/0x840 |
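For triage, the call chain in the backtrace above can be pulled out mechanically. The following is a minimal Python sketch, not part of the original report: `FRAME_RE`, `parse_frames`, and `SAMPLE` are hypothetical names introduced here, and `SAMPLE` is only an excerpt of the frames quoted in this ticket. It assumes the usual kernel frame layout `[<address>] function+0xoffset/0xsize [module]`, where a leading `?` marks a frame the unwinder could not verify.

```python
import re

# Matches one stack frame, for example:
#   [<ffffffffc167b3fb>] ldlm_handle_convert0+0x2db/0x460 [ptlrpc]
# The optional "?" marks a frame the kernel unwinder treats as unreliable.
# The optional trailing "[module]" is absent for core-kernel symbols.
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\]\s+(?P<unreliable>\?\s+)?"
    r"(?P<func>[A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?:\s+\[(?P<module>\w+)\])?"
)


def parse_frames(text):
    """Return (function, module, reliable) tuples in order of appearance.

    Works on both one-entry-per-line logs and logs flattened onto a
    single line, because it scans the whole string for frame patterns.
    """
    return [
        (m.group("func"), m.group("module"), m.group("unreliable") is None)
        for m in FRAME_RE.finditer(text)
    ]


# Excerpt of frames copied from the backtrace in this ticket (illustrative subset).
SAMPLE = (
    "[3072929.435691] [<ffffffffc282a7cc>] libcfs_call_trace+0x8c/0xc0 [libcfs] "
    "[3072929.443252] [<ffffffffc282a87c>] lbug_with_loc+0x4c/0xa0 [libcfs] "
    "[3072929.450458] [<ffffffffc164fa87>] ldlm_grant_lock_with_skiplist+0x607/0x750 [ptlrpc] "
    "[3072929.459259] [<ffffffffc1682d0a>] ldlm_inodebits_drop+0xaa/0x170 [ptlrpc] "
    "[3072929.467092] [<ffffffffc167b3fb>] ldlm_handle_convert0+0x2db/0x460 [ptlrpc] "
    "[3072929.475080] [<ffffffffc167bacb>] ldlm_cancel_handler+0x29b/0x590 [ptlrpc] "
    "[3072929.498873] [<ffffffff930c6691>] kthread+0xd1/0xe0 "
    "[3072929.607671] [<ffffffffc16ab2a5>] ? ptlrpc_wait_event+0xa5/0x360 [ptlrpc]"
)

if __name__ == "__main__":
    for func, module, reliable in parse_frames(SAMPLE):
        marker = "" if reliable else " (unreliable)"
        print(f"{func} [{module or 'kernel'}]{marker}")
```

Read from the bottom up, the reliable ptlrpc frames give the path that reached the assertion: `ldlm_cancel_handler` calls `ldlm_handle_convert0`, which calls `ldlm_inodebits_drop`, which calls `ldlm_grant_lock_with_skiplist`.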
| Comments |
| Comment by Peter Jones [ 27/Aug/20 ] |
|
Mike, could you please advise? Thanks, Peter |
| Comment by Oleg Drokin [ 27/Aug/20 ] |
|
I think potentially this might be fixed by https://review.whamcloud.com/#/c/36466/11 |
| Comment by James A Simmons [ 28/Aug/20 ] |
|
I noticed that patch 36466 doesn't apply cleanly to 2.12 LTS. Does it need an earlier patch in order to apply, or should it just be modified? |
| Comment by Mikhail Pershin [ 30/Aug/20 ] |
|
James, I am checking that. |
| Comment by Mikhail Pershin [ 09/Sep/20 ] |
|
Here is the b2_12 patch: https://review.whamcloud.com/39854/
|
| Comment by James A Simmons [ 23/Oct/20 ] |
|
Just to let you know, the patch worked. |
| Comment by Peter Jones [ 24/Oct/20 ] |
|
Good news - I'll close the ticket then, because the fix has also landed for 2.12.6. |