[LU-13692] MDS slow/hung threads at mdt_object_local_lock Created: 17/Jun/20 Updated: 18/May/21 Resolved: 29/Oct/20 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.12.4 |
| Fix Version/s: | Lustre 2.14.0, Lustre 2.12.6 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Mahmoud Hanafi | Assignee: | Oleg Drokin |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Environment: |
Clients running Lustre 2.12.3 |
| Attachments: | |
| Issue Links: | |
| Severity: | 2 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
MDS threads hang/slow getting stack trace dumps Jun 16 14:20:20 nbp8-mds1 kernel: [1660354.377038] Lustre: MGS: Connection restored to 09397a7f-3fe2-2dc8-d25a-74d157cb2008 (at 10.151.50.72@o2ib) Jun 16 14:20:20 nbp8-mds1 kernel: [1660354.377044] Lustre: Skipped 277 previous similar messages Jun 16 14:22:20 nbp8-mds1 kernel: [1660474.852981] Lustre: 8579:0:(service.c:1372:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (30/30), not sending early reply Jun 16 14:22:20 nbp8-mds1 kernel: [1660474.852981] req@ffff8979f9e16300 x1668974534442592/t1531807577764(0) o36->67852282-a3ed-5acb-a9e2-3cae43fe0406@10.151.0.201@o2ib:5/0 lens 488/3152 e 0 to 0 dl 1592342570 ref 2 fl Interpret:/0/0 rc 0/0 Jun 16 14:22:20 nbp8-mds1 kernel: [1660474.952277] Lustre: 8579:0:(service.c:1372:ptlrpc_at_send_early_reply()) Skipped 134 previous similar messages Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.333617] LNet: Service thread pid 14118 was inactive for 551.86s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes: Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.389992] LNet: Skipped 3 previous similar messages Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.407185] Pid: 14118, comm: mdt00_115 3.10.0-1062.12.1.el7_lustre2124.x86_64 #1 SMP Tue Mar 17 13:32:19 PDT 2020 Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.407190] Call Trace: Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.407202] [<ffffffffc0f17c90>] ldlm_completion_ast+0x430/0x860 [ptlrpc] Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.412473] [<ffffffffc0f187b1>] ldlm_cli_enqueue_local+0x231/0x830 [ptlrpc] Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.412496] [<ffffffffc16205cb>] mdt_object_local_lock+0x50b/0xb20 [mdt] Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.412506] [<ffffffffc1620c50>] mdt_object_lock_internal+0x70/0x360 [mdt] Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.412516] [<ffffffffc1620f9c>] mdt_reint_object_lock+0x2c/0x60 [mdt] Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.412530] [<ffffffffc1639efc>] mdt_reint_striped_lock+0x8c/0x510 [mdt] Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.412541] [<ffffffffc163d866>] mdt_reint_setattr+0x676/0x1290 [mdt] Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.412552] [<ffffffffc163f963>] mdt_reint_rec+0x83/0x210 [mdt] Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.412562] [<ffffffffc161c273>] mdt_reint_internal+0x6e3/0xaf0 [mdt] Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.412571] [<ffffffffc16276e7>] mdt_reint+0x67/0x140 [mdt] Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.412623] [<ffffffffc0fb73ca>] tgt_request_handle+0xada/0x1570 [ptlrpc] Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.412659] [<ffffffffc0f5b47b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc] Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.412691] [<ffffffffc0f5ede4>] ptlrpc_main+0xb34/0x1470 [ptlrpc] Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.412696] [<ffffffff858c61f1>] kthread+0xd1/0xe0 Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.412699] [<ffffffff85f8dd37>] ret_from_fork_nospec_end+0x0/0x39 Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.412722] [<ffffffffffffffff>] 0xffffffffffffffff Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.412725] LustreError: dumping log to /tmp/lustre-log.1592342774.14118 Jun 16 14:26:14 nbp8-mds1 sec[2849]: SEC_EVENT |msg lustre hung thread Jun 16 14:26:15 nbp8-mds1 kernel: [1660709.178343] LNet: Service thread pid 8575 was inactive for 552.75s. The thread might be hung, or it might only be slow and will resume later. 
Dumping the stack trace for debugging purposes: Jun 16 14:26:15 nbp8-mds1 kernel: [1660709.234458] Pid: 8575, comm: mdt00_043 3.10.0-1062.12.1.el7_lustre2124.x86_64 #1 SMP Tue Mar 17 13:32:19 PDT 2020 Jun 16 14:26:15 nbp8-mds1 kernel: [1660709.234459] Call Trace: Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.234472] [<ffffffffc0f17c90>] ldlm_completion_ast+0x430/0x860 [ptlrpc] Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.239700] [<ffffffffc0f187b1>] ldlm_cli_enqueue_local+0x231/0x830 [ptlrpc] Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.239749] [<ffffffffc16205cb>] mdt_object_local_lock+0x50b/0xb20 [mdt] Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.239763] [<ffffffffc1620c50>] mdt_object_lock_internal+0x70/0x360 [mdt] Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.239773] [<ffffffffc1620f9c>] mdt_reint_object_lock+0x2c/0x60 [mdt] Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.239785] [<ffffffffc1639efc>] mdt_reint_striped_lock+0x8c/0x510 [mdt] Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.239797] [<ffffffffc163d866>] mdt_reint_setattr+0x676/0x1290 [mdt] Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.239809] [<ffffffffc163f963>] mdt_reint_rec+0x83/0x210 [mdt] Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.239819] [<ffffffffc161c273>] mdt_reint_internal+0x6e3/0xaf0 [mdt] Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.239829] [<ffffffffc16276e7>] mdt_reint+0x67/0x140 [mdt] Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.239874] [<ffffffffc0fb73ca>] tgt_request_handle+0xada/0x1570 [ptlrpc] Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.239909] [<ffffffffc0f5b47b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc] Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.239945] [<ffffffffc0f5ede4>] ptlrpc_main+0xb34/0x1470 [ptlrpc] Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.239952] [<ffffffff858c61f1>] kthread+0xd1/0xe0 Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.239958] [<ffffffff85f8dd37>] ret_from_fork_nospec_end+0x0/0x39 Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.239982] [<ffffffffffffffff>] 0xffffffffffffffff Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.239988] Pid: 14083, comm: mdt00_088 3.10.0-1062.12.1.el7_lustre2124.x86_64 #1 SMP Tue Mar 17 13:32:19 PDT 2020 Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.239988] Call Trace: Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240020] [<ffffffffc0f17c90>] ldlm_completion_ast+0x430/0x860 [ptlrpc] Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240050] [<ffffffffc0f187b1>] ldlm_cli_enqueue_local+0x231/0x830 [ptlrpc] Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240060] [<ffffffffc16205cb>] mdt_object_local_lock+0x50b/0xb20 [mdt] Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240071] [<ffffffffc1620c50>] mdt_object_lock_internal+0x70/0x360 [mdt] Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240082] [<ffffffffc1620f9c>] mdt_reint_object_lock+0x2c/0x60 [mdt] Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240093] [<ffffffffc1639efc>] mdt_reint_striped_lock+0x8c/0x510 [mdt] Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240104] [<ffffffffc163d866>] mdt_reint_setattr+0x676/0x1290 [mdt] Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240116] [<ffffffffc163f963>] mdt_reint_rec+0x83/0x210 [mdt] Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240127] [<ffffffffc161c273>] mdt_reint_internal+0x6e3/0xaf0 [mdt] Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240137] [<ffffffffc16276e7>] mdt_reint+0x67/0x140 [mdt] Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240174] [<ffffffffc0fb73ca>] tgt_request_handle+0xada/0x1570 [ptlrpc] Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240207] [<ffffffffc0f5b47b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc] Jun 16 14:26:16 
nbp8-mds1 kernel: [1660709.240240] [<ffffffffc0f5ede4>] ptlrpc_main+0xb34/0x1470 [ptlrpc] Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240244] [<ffffffff858c61f1>] kthread+0xd1/0xe0 Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240247] [<ffffffff85f8dd37>] ret_from_fork_nospec_end+0x0/0x39 Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240251] [<ffffffffffffffff>] 0xffffffffffffffff Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240254] Pid: 10527, comm: mdt00_069 3.10.0-1062.12.1.el7_lustre2124.x86_64 #1 SMP Tue Mar 17 13:32:19 PDT 2020 Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240254] Call Trace: Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240285] [<ffffffffc0f17c90>] ldlm_completion_ast+0x430/0x860 [ptlrpc] Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240313] [<ffffffffc0f187b1>] ldlm_cli_enqueue_local+0x231/0x830 [ptlrpc] Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240325] [<ffffffffc16205cb>] mdt_object_local_lock+0x50b/0xb20 [mdt] Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240335] [<ffffffffc1620c50>] mdt_object_lock_internal+0x70/0x360 [mdt] Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240344] [<ffffffffc1620f9c>] mdt_reint_object_lock+0x2c/0x60 [mdt] Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240357] [<ffffffffc1639efc>] mdt_reint_striped_lock+0x8c/0x510 [mdt] Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240369] [<ffffffffc163d866>] mdt_reint_setattr+0x676/0x1290 [mdt] Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240380] [<ffffffffc163f963>] mdt_reint_rec+0x83/0x210 [mdt] Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240389] [<ffffffffc161c273>] mdt_reint_internal+0x6e3/0xaf0 [mdt] Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240400] [<ffffffffc16276e7>] mdt_reint+0x67/0x140 [mdt] Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240439] [<ffffffffc0fb73ca>] tgt_request_handle+0xada/0x1570 [ptlrpc] Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240478] [<ffffffffc0f5b47b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc] Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240510] [<ffffffffc0f5ede4>] ptlrpc_main+0xb34/0x1470 [ptlrpc] Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240514] [<ffffffff858c61f1>] kthread+0xd1/0xe0 Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240516] [<ffffffff85f8dd37>] ret_from_fork_nospec_end+0x0/0x39 Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240520] [<ffffffffffffffff>] 0xffffffffffffffff Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240523] Pid: 14084, comm: mdt00_089 3.10.0-1062.12.1.el7_lustre2124.x86_64 #1 SMP Tue Mar 17 13:32:19 PDT 2020 Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240523] Call Trace: Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240554] [<ffffffffc0f17c90>] ldlm_completion_ast+0x430/0x860 [ptlrpc] Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240585] [<ffffffffc0f187b1>] ldlm_cli_enqueue_local+0x231/0x830 [ptlrpc] Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240598] [<ffffffffc16205cb>] mdt_object_local_lock+0x50b/0xb20 [mdt] Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240610] [<ffffffffc1620c50>] mdt_object_lock_internal+0x70/0x360 [mdt] Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240620] [<ffffffffc1620f9c>] mdt_reint_object_lock+0x2c/0x60 [mdt] Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240631] [<ffffffffc1639efc>] mdt_reint_striped_lock+0x8c/0x510 [mdt] Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240647] [<ffffffffc163d866>] mdt_reint_setattr+0x676/0x1290 [mdt] Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240657] [<ffffffffc163f963>] mdt_reint_rec+0x83/0x210 [mdt] Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240666] [<ffffffffc161c273>] mdt_reint_internal+0x6e3/0xaf0 [mdt] Jun 16 14:26:16 
nbp8-mds1 kernel: [1660709.240677] [<ffffffffc16276e7>] mdt_reint+0x67/0x140 [mdt] Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240716] [<ffffffffc0fb73ca>] tgt_request_handle+0xada/0x1570 [ptlrpc] Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240749] [<ffffffffc0f5b47b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc] Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240781] [<ffffffffc0f5ede4>] ptlrpc_main+0xb34/0x1470 [ptlrpc] Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240784] [<ffffffff858c61f1>] kthread+0xd1/0xe0 Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240788] [<ffffffff85f8dd37>] ret_from_fork_nospec_end+0x0/0x39 Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240792] [<ffffffffffffffff>] 0xffffffffffffffff Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240796] LNet: Service thread pid 8568 was inactive for 552.81s. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one. |
| Comments |
| Comment by Mahmoud Hanafi [ 17/Jun/20 ] |
|
This filesystem is not using DOM. |
| Comment by Andreas Dilger [ 18/Jun/20 ] |
|
Mahmoud, could you please attach a larger chunk of the /var/log/messages file from before these stacks were dumped? It looks like there were client connection issues before the stacks were dumped. Do you still have the /tmp/lustre-log.* files that were saved on the MDS from the first stack dumps? It would be useful to attach at least the first ones (lowest timestamps). Did this issue resolve over time, or is it still ongoing? |
| Comment by Mahmoud Hanafi [ 18/Jun/20 ] |
|
Attaching log files: vmcore-dmesg.txt |
| Comment by Peter Jones [ 19/Jun/20 ] |
|
Amir, what do you recommend here? Peter |
| Comment by Mahmoud Hanafi [ 24/Jun/20 ] |
|
Any updates? |
| Comment by Amir Shehata (Inactive) [ 25/Jun/20 ] |
|
The messages that could be relevant to the slowdown are:
(o2iblnd_cb.c:3351:kiblnd_check_txs_locked()) Timed out tx: active_txs, 1 seconds
This means that transmits have been posted but have not completed yet, which indicates that OFED on the passive side is unable to process these transmits in a timely manner. What do you have lnet_transaction_timeout set to? How many CPTs are configured? Looking at top, do you see only a subset of the ko2iblnd scheduler threads working, or is the work spread across all the threads on all the CPTs? I'm just trying to identify whether there is a bottleneck somewhere. |
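(For reference, a minimal sketch of how those values could be checked on the MDS. The sysfs paths assume the stock lnet/libcfs/ko2iblnd module parameter names, and the thread filter assumes the usual kiblnd_sd_* scheduler thread naming; adjust as needed.)
# LNet transaction timeout, in seconds
cat /sys/module/lnet/parameters/lnet_transaction_timeout
# CPT configuration used by libcfs (partition count and explicit pattern, if any)
cat /sys/module/libcfs/parameters/cpu_npartitions
cat /sys/module/libcfs/parameters/cpu_pattern
# Per-CPU activity of the ko2iblnd scheduler threads (kiblnd_sd_<cpt>_<idx>)
ps -eLo pid,psr,pcpu,comm | grep kiblnd_sd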
| Comment by Mahmoud Hanafi [ 07/Jul/20 ] |
|
Past 2 days we been hitting this issue several times. We have identified a user's code that is causing this issue. I don't see any network errors or timeouts. This may not be a networking issue. More like a ldlm deadlock... The issue start like this... Client Logs Mon Jul 6 16:38:30 2020 C r135i2n18 [1594078710.191034] Lustre: 43689:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1594078087/real 1594078087] req@ffff8c2eda903240 x1669114153850208/t0(0) o101->nbp13-MDT0000-mdc-ffff8c16da19e000@10.151.26.183@o2ib:12/10 lens 376/1392 e 0 to 1 dl 1594078710 ref 2 fl Rpc:X/0/ffffffff rc 0/-1 Mon Jul 6 16:38:30 2020 C r135i2n18 [1594078710.219034] Lustre: nbp13-MDT0000-mdc-ffff8c16da19e000: Connection to nbp13-MDT0000 (at 10.151.26.183@o2ib) was lost; in progress operations using this service will wait for recovery to complete Mon Jul 6 16:38:30 2020 C r135i4n11 [1594078710.803698] Lustre: 44292:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1594078087/real 1594078087] req@ffff8aa99b843680 x1669114204037872/t0(0) o101->nbp13-MDT0000-mdc-ffff8a92f11a6800@10.151.26.183@o2ib:12/10 lens 376/1392 e 0 to 1 dl 1594078710 ref 2 fl Rpc:X/0/ffffffff rc 0/-1 Mon Jul 6 16:38:30 2020 C r135i4n11 [1594078710.831698] Lustre: nbp13-MDT0000-mdc-ffff8a92f11a6800: Connection to nbp13-MDT0000 (at 10.151.26.183@o2ib) was lost; in progress operations using this service will wait for recovery to complete Mon Jul 6 16:38:30 2020 C r147i0n28 [1594078710.161451] Lustre: 55178:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1594078087/real 1594078087] req@ffff9ecf51918000 x1668975755824896/t0(0) o101->nbp13-MDT0000-mdc-ffff9ed0d75e9800@10.151.26.183@o2ib:12/10 lens 376/1392 e 0 to 1 dl 1594078710 ref 2 fl Rpc:X/0/ffffffff rc 0/-1 Mon Jul 6 16:38:30 2020 C r147i0n28 [1594078710.189451] Lustre: nbp13-MDT0000-mdc-ffff9ed0d75e9800: Connection to nbp13-MDT0000 (at 10.151.26.183@o2ib) was lost; in progress operations using this service will wait for recovery to complete Mon Jul 6 16:38:30 2020 C r147i0n28 [1594078710.205451] Lustre: 55178:0:(client.c:2133:ptlrpc_expire_one_request()) Skipped 15 previous similar messages Mon Jul 6 16:38:30 2020 C r147i0n30 [1594078710.924656] Lustre: 1843:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1594078087/real 1594078087] req@ffff92d658d45b00 x1668975460885376/t0(0) o101->nbp13-MDT0000-mdc-ffff92c2161d7800@10.151.26.183@o2ib:12/10 lens 376/1392 e 0 to 1 dl 1594078710 ref 2 fl Rpc:X/0/ffffffff rc 0/-1 Mon Jul 6 16:38:30 2020 C r147i0n30 [1594078710.952656] Lustre: nbp13-MDT0000-mdc-ffff92c2161d7800: Connection to nbp13-MDT0000 (at 10.151.26.183@o2ib) was lost; in progress operations using this service will wait for recovery to complete Mon Jul 6 16:38:30 2020 C r147i0n30 [1594078710.968656] Lustre: 1843:0:(client.c:2133:ptlrpc_expire_one_request()) Skipped 15 previous similar messages Mon Jul 6 16:38:30 2020 C r147i3n11 [1594078710.345202] Lustre: 99945:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1594078087/real 1594078087] req@ffff8cea213b2480 x1668975668763616/t0(0) o101->nbp13-MDT0000-mdc-ffff8cfdc2e07800@10.151.26.183@o2ib:12/10 lens 376/1392 e 0 to 1 dl 1594078710 ref 2 fl Rpc:X/0/ffffffff rc 0/-1 Mon Jul 6 16:38:30 2020 C r147i3n11 [1594078710.373202] Lustre: nbp13-MDT0000-mdc-ffff8cfdc2e07800: Connection to nbp13-MDT0000 (at 
10.151.26.183@o2ib) was lost; in progress operations using this service will wait for recovery to complete Mon Jul 6 16:38:30 2020 C r147i3n11 [1594078710.389202] Lustre: 99945:0:(client.c:2133:ptlrpc_expire_one_request()) Skipped 3 previous similar messages Server Logs Jul 6 15:39:18 nbp13-srv1 kernel: [14768.604982] Lustre: 34157:0:(service.c:1372:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (30/30), not sending early reply Jul 6 15:39:18 nbp13-srv1 kernel: [14768.604982] req@ffff9f6a31f58900 x1668977731129264/t0(0) o101->10e00de0-1f20-c94c-9a9b-c9d0c3e9a774@10.149.10.9@o2ib313:653/0 lens 376/1392 e 0 to 0 dl 1594075188 ref 2 fl Interpret:/0/0 rc 0/0 Jul 6 15:39:18 nbp13-srv1 kernel: [14768.693110] Lustre: 34157:0:(service.c:1372:ptlrpc_at_send_early_reply()) Skipped 193 previous similar messages Jul 6 15:39:52 nbp13-srv1 kernel: [14802.511304] Lustre: 32519:0:(service.c:2165:ptlrpc_server_handle_request()) @@@ Request took longer than estimated (348:4s); client may timeout. req@ffff9f7c4bd93f00 x1668975755407840/t0(0) o101->8dc79bfc-5f46-cac5-08bb-3869eb835eca@10.149.9.136@o2ib313:653/0 lens 376/832 e 0 to 0 dl 1594075188 ref 1 fl Complete:/0/0 rc 0/0 Jul 6 15:39:52 nbp13-srv1 kernel: [14802.597869] Lustre: 32519:0:(service.c:2165:ptlrpc_server_handle_request()) Skipped 91 previous similar messages Jul 6 15:42:05 nbp13-srv1 kernel: [14935.249217] LustreError: 137-5: nbp13-OST000c_UUID: not available for connect from 10.149.3.193@o2ib313 (no target). If you are running an HA pair check that the target is mounted on the other server. Jul 6 15:44:07 nbp13-srv1 kernel: [15057.419615] LustreError: 14961:0:(tgt_grant.c:758:tgt_grant_check()) nbp13-OST0007: cli 1dd3e76b-132a-6719-83dc-8060654c4e5c claims 40960 GRANT, real grant 0 Jul 6 15:44:07 nbp13-srv1 kernel: [15057.461996] LustreError: 14961:0:(tgt_grant.c:758:tgt_grant_check()) Skipped 110 previous similar messages Jul 6 15:45:00 nbp13-srv1 kernel: [15109.616188] Lustre: nbp13-OST000a: Connection restored to 13dd3081-fbf9-b7c0-0398-194130227ad4 (at 10.151.56.237@o2ib) Jul 6 15:45:00 nbp13-srv1 kernel: [15109.616191] Lustre: Skipped 2324 previous similar messages Jul 6 15:54:09 nbp13-srv1 kernel: [15657.879212] LustreError: 15055:0:(tgt_grant.c:758:tgt_grant_check()) nbp13-OST0005: cli 60265261-ea48-de16-a49c-beb94df9c24a claims 28672 GRANT, real grant 0 Jul 6 15:54:09 nbp13-srv1 kernel: [15657.921581] LustreError: 15055:0:(tgt_grant.c:758:tgt_grant_check()) Skipped 92 previous similar messages Jul 6 15:55:00 nbp13-srv1 kernel: [15708.471963] Lustre: nbp13-OST000a: Connection restored to d6946db5-2ac8-b89e-14d0-4b161627dac1 (at 10.151.56.222@o2ib) Jul 6 15:55:00 nbp13-srv1 kernel: [15708.471966] Lustre: Skipped 2289 previous similar messages Jul 6 15:59:24 nbp13-srv1 kernel: [15972.045956] Lustre: 16461:0:(service.c:1372:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (30/30), not sending early reply Jul 6 15:59:24 nbp13-srv1 kernel: [15972.045956] req@ffff9f80ae28da00 x1668975743536160/t0(0) o101->3f04c178-1d91-8836-3e9e-0c4282050daf@10.149.9.137@o2ib313:349/0 lens 376/1392 e 0 to 0 dl 1594076394 ref 2 fl Interpret:/0/0 rc 0/0 Jul 6 15:59:24 nbp13-srv1 kernel: [15972.134340] Lustre: 16461:0:(service.c:1372:ptlrpc_at_send_early_reply()) Skipped 234 previous similar messages Jul 6 16:03:18 nbp13-srv1 kernel: [16204.969982] LNet: Service thread pid 34212 was inactive for 551.64s. The thread might be hung, or it might only be slow and will resume later. 
Dumping the stack trace for debugging purposes: Jul 6 16:03:18 nbp13-srv1 kernel: [16205.020985] Pid: 34212, comm: mdt02_064 3.10.0-1127.10.1.el7_lustre2125.x86_64 #1 SMP Wed Jun 17 08:01:50 PDT 2020 Jul 6 16:03:18 nbp13-srv1 kernel: [16205.020986] Call Trace: Jul 6 16:03:18 nbp13-srv1 kernel: [16205.020997] [<ffffffffc1341c90>] ldlm_completion_ast+0x430/0x860 [ptlrpc] Jul 6 16:03:18 nbp13-srv1 kernel: [16205.041743] [<ffffffffc13427b1>] ldlm_cli_enqueue_local+0x231/0x830 [ptlrpc] Jul 6 16:03:18 nbp13-srv1 kernel: [16205.041759] [<ffffffffc18715cb>] mdt_object_local_lock+0x50b/0xb20 [mdt] Jul 6 16:03:18 nbp13-srv1 kernel: [16205.041765] [<ffffffffc1871c50>] mdt_object_lock_internal+0x70/0x360 [mdt] Jul 6 16:03:18 nbp13-srv1 kernel: [16205.041774] [<ffffffffc1871f9c>] mdt_reint_object_lock+0x2c/0x60 [mdt] Jul 6 16:03:18 nbp13-srv1 kernel: [16205.041782] [<ffffffffc187228b>] mdt_layout_change+0x20b/0x480 [mdt] Jul 6 16:03:18 nbp13-srv1 kernel: [16205.041790] [<ffffffffc187a7d0>] mdt_intent_layout+0x8a0/0xe00 [mdt] Jul 6 16:03:18 nbp13-srv1 kernel: [16205.041798] [<ffffffffc1877d35>] mdt_intent_policy+0x435/0xd80 [mdt] Jul 6 16:03:18 nbp13-srv1 kernel: [16205.041812] [<ffffffffc1328e06>] ldlm_lock_enqueue+0x356/0xa20 [ptlrpc] Jul 6 16:03:18 nbp13-srv1 kernel: [16205.041829] [<ffffffffc1351516>] ldlm_handle_enqueue0+0xa56/0x15f0 [ptlrpc] Jul 6 16:03:18 nbp13-srv1 kernel: [16205.041860] [<ffffffffc13da5d2>] tgt_enqueue+0x62/0x210 [ptlrpc] Jul 6 16:03:18 nbp13-srv1 kernel: [16205.041884] [<ffffffffc13e13ea>] tgt_request_handle+0xada/0x1570 [ptlrpc] Jul 6 16:03:18 nbp13-srv1 kernel: [16205.041904] [<ffffffffc138548b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc] Jul 6 16:03:18 nbp13-srv1 kernel: [16205.041922] [<ffffffffc1388df4>] ptlrpc_main+0xb34/0x1470 [ptlrpc] Jul 6 16:03:18 nbp13-srv1 kernel: [16205.041926] [<ffffffffbdac6691>] kthread+0xd1/0xe0 Jul 6 16:03:18 nbp13-srv1 kernel: [16205.041929] [<ffffffffbe192d1d>] ret_from_fork_nospec_begin+0x7/0x21 Jul 6 16:03:18 nbp13-srv1 kernel: [16205.041944] [<ffffffffffffffff>] 0xffffffffffffffff Jul 6 16:03:18 nbp13-srv1 kernel: [16205.041946] LustreError: dumping log to /tmp/lustre-log.1594076598.34212
I looked at the debug dump. The debug buffer was to small, it only has about 1min of logs. RPC history show RPC in 'Interpret' state. 6846513251564060672:10.151.26.183@o2ib:12345-10.149.9.229@o2ib313:x1670506690195456:376:Interpret:1594078087.153533:-1594078088.846466s(-1594080927.0s) opc 101 6846513251571204096:10.151.26.183@o2ib:12345-10.149.10.50@o2ib313:x1669038774702880:376:Interpret:1594078087.155269:-1594078088.844730s(-1594081550.0s) opc 101 6846513251575857152:10.151.26.183@o2ib:12345-10.149.10.19@o2ib313:x1668983977176208:376:Interpret:1594078087.156408:-1594078088.843591s(-1594080927.0s) opc 101 6846513251675078656:10.151.26.183@o2ib:12345-10.149.10.52@o2ib313:x1669038696609136:376:Interpret:1594078087.180633:-1594078088.819366s(-1594081551.0s) opc 101 6846513251688841216:10.151.26.183@o2ib:12345-10.149.9.232@o2ib313:x1668975691189088:376:Interpret:1594078087.183999:-1594078088.816000s(-1594081550.0s) opc 101 6846515834582925312:10.151.26.183@o2ib:12345-10.149.9.225@o2ib313:x1668975407998848:376:Interpret:1594078688.579198:-1594078689.420801s(-1594084920.0s) opc 101 6846515834807320576:10.151.26.183@o2ib:12345-10.149.10.15@o2ib313:x1668984073580016:376:Interpret:1594078688.633968:-1594078689.366031s(-1594080991.0s) opc 101 6846515834814791680:10.151.26.183@o2ib:12345-10.149.10.43@o2ib313:x1671120375878864:376:Interpret:1594078688.635799:-1594078689.364200s(-1594088664.0s) opc 101 6846515834852278272:10.151.26.183@o2ib:12345-10.149.10.9@o2ib313:x1668977731546976:376:Interpret:1594078688.644950:-1594078689.355049s(-1594081802.0s) opc 101 6846515834871939072:10.151.26.183@o2ib:12345-10.149.10.51@o2ib313:x1669038774709456:376:Interpret:1594078688.649757:-1594078689.350242s(-1594085129.0s) opc 101 6846515834965852160:10.151.26.183@o2ib:12345-10.149.10.45@o2ib313:x1669038855886144:376:Interpret:1594078688.672678:-1594078689.327321s(-1594080998.0s) opc 101 6846515835105443840:10.151.26.183@o2ib:12345-10.149.9.228@o2ib313:x1671120242121136:376:Interpret:1594078688.706757:-1594078689.293242s(-1594080997.0s) opc 101 6846515835166261248:10.151.26.183@o2ib:12345-10.149.9.135@o2ib313:x1668975409880352:376:Interpret:1594078688.721603:-1594078689.278396s(-1594080998.0s) opc 101 6846515835180023808:10.151.26.183@o2ib:12345-10.149.10.10@o2ib313:x1668977813158432:376:Interpret:1594078688.724960:-1594078689.275039s(-1594084003.0s) opc 101 ..... Evicting the first client on the list breaks the deadlock and the MDT returns to healthy state. |
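(For context, a request-history listing like the one above can typically be pulled from the MDT service via lctl. This is only a sketch; the exact parameter path may differ between Lustre versions.)
# Dump the MDT request history (xid, peer NID, phase such as Interpret, opcode, timings)
lctl get_param mds.MDS.mdt.req_history
# Optionally enlarge the history ring so slow or stuck RPCs are not rotated out
lctl set_param mds.MDS.mdt.req_buffer_history_max=10240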
| Comment by Oleg Drokin [ 08/Jul/20 ] |
|
Unfortunately it's not enough to just look at the server-side syslogs to see what's going on and why clients don't release the locks. At first I thought this might be similar to another ticket. To better understand this we'll need a solid reproducer so we can get detailed logs on our systems, or you will need to collect detailed Lustre debug logs from both the MDS server and all clients (especially the one that, when evicted, breaks the deadlock). You said you isolated the job; any idea what it does? Ideally the debug log would be at -1 with no gaps on both clients and servers, but that might not be feasible depending on activity levels, so it might be necessary to drop it a few notches (we definitely need dlmtrace and rpctrace at the very least). Also, I see your clients are at 2.12.3; there were a number of fixes in .4 and .5, so if possible, upgrading the clients to 2.12.5 first might be a good idea. |
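(A rough sketch of the debug settings being asked for here; the mask values are standard lctl debug flags, and the buffer size is just an example.)
# Full debugging, if the log volume is manageable
lctl set_param debug=-1
# Otherwise a reduced mask that still carries the essentials
lctl set_param debug="+dlmtrace +rpctrace"
# Enlarge the in-memory debug buffer (in MB) so entries are not overwritten
lctl set_param debug_mb=2048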
| Comment by Mahmoud Hanafi [ 08/Jul/20 ] |
|
The code is Fortran MPI code. It writes HDF5 and binary restart files. I am trying to reproduce the issue on our test filesystem and gather debug info.
|
| Comment by Mahmoud Hanafi [ 09/Jul/20 ] |
|
I was able reproduce using IOR. This was using lustre-2.12.5 server and lustre-2.12.3 clients. This is using HPE-mpt 2.17r13 and intel 2016.2.181 $ mpiexec /u/mhanafi/bin/IOR.MPIIO -a MPIIO -w -t 4k -b 10m -i 100 IOR-2.10.3: MPI Coordinated Test of Parallel I/ORun began: Thu Jul 9 11:15:25 2020 Command line used: /u/mhanafi/bin/IOR.MPIIO -a MPIIO -w -t 4k -b 10m -i 100 Machine: Linux r901i3n10Summary: api = MPIIO (version=3, subversion=1) test filename = testFile access = single-shared-file ordering in a file = sequential offsets ordering inter file= no tasks offsets clients = 1800 (30 per node) repetitions = 100 xfersize = 4096 bytes blocksize = 10 MiB aggregate filesize = 17.58 GiB Lustre stripe size = Use default stripe count = Use default I got debug logs from server and client I will have to upload them to the ftp site. It will takes sometime to get setup for testing using 2.12.5 clients. Here is the service side logs Jul 9 11:20:44 nbptest3-srv1 kernel: [ 7545.162610] Lustre: 16185:0:(service.c:1372:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (30/30), not sending early reply Jul 9 11:20:44 nbptest3-srv1 kernel: [ 7545.162610] req@ffff9cfc80aa2400 x1668974508088176/t0(0) o101->0e53cdce-c2ce-279c-1bda-abae6ca8fc35@10.141.6.196@o2ib417:474/0 lens 376/0 e 1 to 0 dl 1594318874 ref 2 fl New:/0/ffffffff rc 0/-1 Jul 9 11:20:44 nbptest3-srv1 kernel: [ 7545.259022] Lustre: 16185:0:(service.c:1372:ptlrpc_at_send_early_reply()) Skipped 424 previous similar messages Jul 9 11:21:20 nbptest3-srv1 kernel: [ 7581.233493] Lustre: 16192:0:(service.c:1372:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (30/30), not sending early reply Jul 9 11:21:20 nbptest3-srv1 kernel: [ 7581.233493] req@ffff9d0e01b73600 x1667404859964912/t0(0) o101->93fc8911-9a6c-9428-f682-a88016025469@10.151.27.22@o2ib:510/0 lens 576/0 e 1 to 0 dl 1594318910 ref 2 fl New:/0/ffffffff rc 0/-1 Jul 9 11:22:17 nbptest3-srv1 kernel: [ 7638.354601] LustreError: 6611:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 412s: evicting client at 10.141.6.236@o2ib417 ns: mdt-nbptest3-MDT0000_UUID lock: ffff9cfc82e99200/0xf0e4f75150665c0f lrc: 3/0,0 mode: CR/CR res: [0x200004a51:0x2:0x0].0x0 bits 0x8/0x0 rrc: 490 type: IBT flags: 0x60200400000020 nid: 10.141.6.236@o2ib417 remote: 0x9bea31f0e36eb3a0 expref: 19 pid: 16200 timeout: 7638 lvb_type: 0 Jul 9 11:22:17 nbptest3-srv1 kernel: [ 7638.482253] Lustre: 16103:0:(service.c:2165:ptlrpc_server_handle_request()) @@@ Request took longer than estimated (349:63s); client may timeout. req@ffff9d0e02b05580 x1668974021539200/t0(0) o101->b6231da5-843d-7974-af9e-038000d0baef@10.141.6.192@o2ib417:474/0 lens 376/1000 e 1 to 0 dl 1594318874 ref 1 fl Complete:/0/0 rc 0/0 Jul 9 11:22:18 nbptest3-srv1 kernel: [ 7638.577573] LustreError: 16103:0:(service.c:2128:ptlrpc_server_handle_request()) @@@ Dropping timed-out request from 12345-10.141.6.213@o2ib417: deadline 349:63s ago Jul 9 11:22:18 nbptest3-srv1 kernel: [ 7638.577573] req@ffff9d0e02b36c00 x1668974504934624/t0(0) o101->9b7650f7-c3aa-0860-7a72-aa3fe5ce22a6@10.141.6.213@o2ib417:474/0 lens 376/0 e 1 to 0 dl 1594318874 ref 1 fl Interpret:/0/ffffffff rc 0/-1 Jul 9 11:24:37 nbptest3-srv1 kernel: [ 7777.590810] LNet: Service thread pid 16057 was inactive for 551.09s. The thread might be hung, or it might only be slow and will resume later. 
Dumping the stack trace for debugging purposes: Jul 9 11:24:37 nbptest3-srv1 kernel: [ 7777.646622] Pid: 16057, comm: mdt00_053 3.10.0-1127.10.1.el7_lustre2125.x86_64 #1 SMP Wed Jun 17 08:01:50 PDT 2020 Jul 9 11:24:37 nbptest3-srv1 kernel: [ 7777.646623] Call Trace: Jul 9 11:24:37 nbptest3-srv1 kernel: [ 7777.646636] [<ffffffffc0ce1c90>] ldlm_completion_ast+0x430/0x860 [ptlrpc] Jul 9 11:24:37 nbptest3-srv1 kernel: [ 7777.669376] [<ffffffffc0ce27b1>] ldlm_cli_enqueue_local+0x231/0x830 [ptlrpc] Jul 9 11:24:37 nbptest3-srv1 kernel: [ 7777.669396] [<ffffffffc11ef5cb>] mdt_object_local_lock+0x50b/0xb20 [mdt] Jul 9 11:24:37 nbptest3-srv1 kernel: [ 7777.669406] [<ffffffffc11efc50>] mdt_object_lock_internal+0x70/0x360 [mdt] Jul 9 11:24:37 nbptest3-srv1 kernel: [ 7777.669416] [<ffffffffc11eff9c>] mdt_reint_object_lock+0x2c/0x60 [mdt] Jul 9 11:24:37 nbptest3-srv1 kernel: [ 7777.669425] [<ffffffffc11f028b>] mdt_layout_change+0x20b/0x480 [mdt] Jul 9 11:24:37 nbptest3-srv1 kernel: [ 7777.669435] [<ffffffffc11f87d0>] mdt_intent_layout+0x8a0/0xe00 [mdt] Jul 9 11:24:37 nbptest3-srv1 kernel: [ 7777.669444] [<ffffffffc11f5d35>] mdt_intent_policy+0x435/0xd80 [mdt] Jul 9 11:24:37 nbptest3-srv1 kernel: [ 7777.669469] [<ffffffffc0cc8e06>] ldlm_lock_enqueue+0x356/0xa20 [ptlrpc] Jul 9 11:24:37 nbptest3-srv1 kernel: [ 7777.669497] [<ffffffffc0cf1516>] ldlm_handle_enqueue0+0xa56/0x15f0 [ptlrpc] Jul 9 11:24:37 nbptest3-srv1 kernel: [ 7777.669539] [<ffffffffc0d7a5d2>] tgt_enqueue+0x62/0x210 [ptlrpc] Jul 9 11:24:37 nbptest3-srv1 kernel: [ 7777.669576] [<ffffffffc0d813ea>] tgt_request_handle+0xada/0x1570 [ptlrpc] Jul 9 11:24:37 nbptest3-srv1 kernel: [ 7777.669607] [<ffffffffc0d2548b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc] Jul 9 11:24:37 nbptest3-srv1 kernel: [ 7777.669637] [<ffffffffc0d28df4>] ptlrpc_main+0xb34/0x1470 [ptlrpc] Jul 9 11:24:37 nbptest3-srv1 kernel: [ 7777.669644] [<ffffffff880c6691>] kthread+0xd1/0xe0 Jul 9 11:24:37 nbptest3-srv1 kernel: [ 7777.669649] [<ffffffff88792d37>] ret_from_fork_nospec_end+0x0/0x39 Jul 9 11:24:37 nbptest3-srv1 kernel: [ 7777.669669] [<ffffffffffffffff>] 0xffffffffffffffff |
| Comment by Mahmoud Hanafi [ 09/Jul/20 ] |
|
Our PFL settings:
lfs setstripe -E 10M -c 1 -E 16G -c 4 -E 64G -c 8 -E -1 -c 16 /nobackuptest |
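(For what it's worth, the resulting composite layout can be sanity-checked with lfs getstripe; the file name below is just a placeholder.)
# Default (inherited) PFL layout on the test filesystem root
lfs getstripe -d /nobackuptest
# Instantiated components of a file written by the reproducer
lfs getstripe /nobackuptest/testFile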
| Comment by Oleg Drokin [ 10/Jul/20 ] |
|
So looking through the logs, the picture is really strange. We can see the lock that's being evicted, with remote handle 0x9bea31f0e36eb3a0, being requested by the client; the server grants it, immediately contends it, and then returns it; the client receives the blocking AST, and that's it - no further mention of 0x9bea31f0e36eb3a0. There is not even an indication that the reply was received, and it's unclear whether that is because some log messages were dropped or because the message was lost. Can you please reproduce again with debug_daemon in place, to ensure nothing is dropped or, if something is dropped, that we get a very visible message. Also please include client and server dmesg/syslog so we can see the eviction messages, and only stop the Lustre debug logging once there's an eviction on the server (or even better, on both the server and the client). Also, as a follow-up from the call, hopefully you can provide two sets of logs, one with PFL and one without (from that other system). |
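(A minimal sketch of the debug_daemon capture requested above, with a hypothetical output path and size; the daemon writes a binary trace that is decoded afterwards with lctl debug_file.)
# Start continuous capture to disk (size in MB) so no debug entries are dropped
lctl debug_daemon start /tmp/lustre-debug.dk 40960
# ... run the reproducer, stop once the eviction is seen ...
lctl debug_daemon stop
# Convert the binary dump to text for upload
lctl debug_file /tmp/lustre-debug.dk /tmp/lustre-debug.txt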
| Comment by Mahmoud Hanafi [ 13/Jul/20 ] |
|
Uploaded new logs to the ftp site: ftp:/uploads/ This run was with PFL. I'll try to get debug logs for non-PFL.
|
| Comment by Mahmoud Hanafi [ 16/Jul/20 ] |
|
I was able to reproduce the issue using 2.12.5 clients with 2.12.5 servers. I couldn't get a reproducer for non-PFL.
|
| Comment by Oleg Drokin [ 23/Jul/20 ] |
|
So after reviewing the logs it looks like we are dealing with a missing RPC. We see the response being sent by the server after it finishes processing:
00010000:00010000:6.0:1594668516.184157:0:45180:0:(ldlm_lockd.c:1401:ldlm_handle_enqueue0()) ### server-side enqueue handler, sending reply(err=0, rc=0) ns: mdt-nbptest3-MDT0000_UUID lock: ffff9c746ca00000/0xba2cc1762941d1ca lrc: 4/0,0 mode: CR/CR res: [0x2000061c1:0x2:0x0].0x0 bits 0x8/0x0 rrc: 88 type: IBT flags: 0x60200400000020 nid: 10.141.6.235@o2ib417 remote: 0x8dae759881f1de9c expref: 90 pid: 45180 timeout: 260195 lvb_type: 0
But it is never seen on the client, and the wait is interrupted because of the eviction:
00000100:00100000:17.0:1594668927.514597:0:4814:0:(client.c:3145:ptlrpc_abort_inflight()) @@@ inflight req@ffff93bbc978adc0 x1672126583040528/t0(0) o101->nbptest3-MDT0000-mdc-ffff93c15a74d800@10.151.27.53@o2ib:12/10 lens 376/1192 e 0 to 0 dl 1594669139 ref 2 fl Rpc:/2/ffffffff rc 0/-1
I'm not sure what this was supposed to signify, but it's about 40 seconds from the moment the reply was lost:
[259827.263246] Lustre: DEBUG MARKER: hung rpcs
so I assume you started to see stuff stop working as it waited for the lock to clear? Now the big question is how that RPC reply was lost. How good is your network? |
| Comment by Mahmoud Hanafi [ 23/Jul/20 ] |
|
I was watching the MDT RPC history; at the 'hung rpcs' marker I started seeing RPCs that never get cleared off the history. These nodes are across routers. Does the server get an ack for the RPC?
I'll try to reproduce using nettrace. |
| Comment by Oleg Drokin [ 23/Jul/20 ] |
|
Well, it's a "difficult reply" for enqueue RPCs, so I believe it should get an ack, but there is no resending of replies, so if the ack did not come we'll just keep the slot occupied for longer, waiting for the client to resend their request so we can reconstruct from the slot. I was hoping it would be a -1 log, and that you would only drop to reduced logs if a -1 log could not be obtained due to size. |
| Comment by Mahmoud Hanafi [ 24/Jul/20 ] |
|
I tried to reproduce with "-1" but I couldn't reproduce the issue. I tried with just +nettrace and still couldn't reproduce it. As soon as I turned off nettrace it hung. I will try again with +nettrace. We don't see any errors on the IB network or routers. I think the RPC never leaves the server. |
| Comment by Mahmoud Hanafi [ 28/Jul/20 ] |
|
I was able to get debug with +net. But it is 27GB of logs. Would you like all of it? |
| Comment by Peter Jones [ 28/Jul/20 ] |
|
mhanafi Oleg is out of the office this week, so he will be slower to respond. My suggestion is to upload everything to the ftp site so that it is available, if that is the most useful option. |
| Comment by Mahmoud Hanafi [ 29/Jul/20 ] |
|
I uploaded logs to ftp site. /uploads/ Please ignore /uploads/
|
| Comment by Oleg Drokin [ 05/Aug/20 ] |
|
Wow, 500G of logs; no wonder decompression took forever. The dmesg does not include the eviction, so tracing which lock got "hung" is going to be hard, I suspect. Did you wait until the eviction by any chance, and can you provide me with the lock handle? |
| Comment by Mahmoud Hanafi [ 05/Aug/20 ] |
|
There weren't any evictions. The writes that caused the threads to hang started at 20:01:22; that should help narrow down the timing. Also, it looks like the RPCs that never complete are sent at least twice and the server marks them as duplicates. |
| Comment by Oleg Drokin [ 05/Aug/20 ] |
|
Well, the client-side resend logic is at work here with the duplicate RPCs, but the issue, at least with the previous set of logs, is that the RPC reply sent by the server never arrived at the client. It was relatively easy to find that because eventually the lock was evicted for nonresponsiveness, so we could trace the whole lifecycle on both ends by the lock handle. I'll try to see if I can pick one of the "hung" RPCs and see what it is waiting for. |
| Comment by Oleg Drokin [ 06/Aug/20 ] |
|
I sifted through logs for all the timed out threads and they all are stuck on "[0x20000813c:0xf36:0x0]" resource, there are no granted locks on it So the problem you hit this time appears to be different than what this all started with. from the log we can see: 00010000:00010000:14.0:1595991684.497678:0:18783:0:(ldlm_request.c:204:ldlm_completion_tail()) ### client-side enqueue: granted ns: mdt-nbptest3-MDT0000_UUID lock: ffff8f483e40ee40/0x28e6d526063f9c9 lrc: 3/0,1 mode: EX/EX res: [0x20000813c:0xf36:0x0].0x0 bits 0x8/0x0 rrc: 531 type: IBT flags: 0x40210400000020 nid: local remote: 0x0 expref: -99 pid: 18783 timeout: 0 lvb_type: 0 00010000:00010000:14.0:1595991684.497683:0:18783:0:(ldlm_request.c:516:ldlm_cli_enqueue_local()) ### client-side local enqueue handler, new lock created ns: mdt-nbptest3-MDT0000_UUID lock: ffff8f483e40ee40/0x28e6d526063f9c9 lrc: 3/0,1 mode: EX/EX res: [0x20000813c:0xf36:0x0].0x0 bits 0x8/0x0 rrc: 531 type: IBT flags: 0x40210400000020 nid: local remote: 0x0 expref: -99 pid: 18783 timeout: 0 lvb_type: 0 00010000:00010000:14.0:1595991684.497705:0:18783:0:(ldlm_lock.c:1073:ldlm_granted_list_add_lock()) ### About to add lock: ns: mdt-nbptest3-MDT0000_UUID lock: ffff8f483e40ee40/0x28e6d526063f9c9 lrc: 3/0,0 mode: CR/CR res: [0x20000813c:0xf36:0x0].0x0 bits 0x8/0x0 rrc: 534 type: IBT flags: 0x50200400000000 nid: 10.141.6.181@o2ib417 remote: 0xbcf4e69b81010fa6 expref: 939 pid: 18783 timeout: 0 lvb_type: 0 00010000:00010000:14.0:1595991684.497763:0:18783:0:(ldlm_inodebits.c:95:ldlm_reprocess_inodebits_queue()) --- Reprocess resource [0x20000813c:0xf36:0x0].0x0 (ffff8f41e16095c0) 00010000:00010000:14.0:1595991684.497764:0:18783:0:(ldlm_inodebits.c:112:ldlm_reprocess_inodebits_queue()) ### Reprocessing lock from queue 3 ns: mdt-nbptest3-MDT0000_UUID lock: ffff8f4877c15c40/0x28e6d526063f9e5 lrc: 3/0,1 mode: --/EX res: [0x20000813c:0xf36:0x0].0x0 bits 0x8/0x0 rrc: 533 type: IBT flags: 0x40210400000020 nid: local remote: 0x0 expref: -99 pid: 18943 timeout: 0 lvb_type: 0 00010000:00010000:14.0:1595991684.497773:0:18783:0:(ldlm_lock.c:210:ldlm_lock_put()) ### final lock_put on destroyed lock, freeing it. ns: mdt-nbptest3-MDT0000_UUID lock: ffff8f36d003e0c0/0x28e6d526063f9c2 lrc: 0/0,0 mode: --/EX res: [0x20000813c:0xf36:0x0].0x0 bits 0x8/0x0 rrc: 534 type: IBT flags: 0x44000000000000 nid: 10.141.6.181@o2ib417 remote: 0xbcf4e69b81010fa6 expref: 939 pid: 18783 timeout: 0 lvb_type: 3 00010000:00010000:14.0:1595991684.497781:0:18783:0:(ldlm_lockd.c:465:ldlm_add_waiting_lock()) ### adding to wait list(timeout: 412, AT: on) ns: mdt-nbptest3-MDT0000_UUID lock: ffff8f483e40ee40/0x28e6d526063f9c9 lrc: 5/0,0 mode: CR/CR res: [0x20000813c:0xf36:0x0].0x0 bits 0x8/0x0 rrc: 532 type: IBT flags: 0x70200400000020 nid: 10.141.6.181@o2ib417 remote: 0xbcf4e69b81010fa6 expref: 938 pid: 18783 timeout: 11704 lvb_type: 0 00010000:00010000:14.0:1595991684.497787:0:18783:0:(ldlm_lockd.c:1401:ldlm_handle_enqueue0()) ### server-side enqueue handler, sending reply(err=0, rc=0) ns: mdt-nbptest3-MDT0000_UUID lock: ffff8f483e40ee40/0x28e6d526063f9c9 lrc: 5/0,0 mode: CR/CR res: [0x20000813c:0xf36:0x0].0x0 bits 0x8/0x0 rrc: 533 type: IBT flags: 0x70200400000020 nid: 10.141.6.181@o2ib417 remote: 0xbcf4e69b81010fa6 expref: 938 pid: 18783 timeout: 11704 lvb_type: 0 So the lock 0x28e6d526063f9c9 was just granted and converter from EX to CR to be sent back to the client causing 0x28e6d526063f9e5 to be reprocessed (in turn sending blocking AST to the just granted lock). 
In parallel there's another conflicting request for this lock that arrived and instantly queued blocking ast for this newly granted lock: 00010000:00010000:11.0:1595991684.497758:0:18888:0:(ldlm_resource.c:1633:ldlm_resource_add_lock()) ### About to add this lock ns: mdt-nbptest3-MDT0000_UUID lock: ffff8f3540318480/0x28e6d5260640c1b lrc: 5/0,1 mode: --/EX res: [0x20000813c:0xf36:0x0].0x0 bits 0x8/0x0 rrc: 533 type: IBT flags: 0x50210000000000 nid: local remote: 0x0 expref: -99 pid: 18888 timeout: 0 lvb_type: 0 00010000:00010000:11.0:1595991684.497787:0:18888:0:(ldlm_lockd.c:903:ldlm_server_blocking_ast()) ### server preparing blocking AST ns: mdt-nbptest3-MDT0000_UUID lock: ffff8f483e40ee40/0x28e6d526063f9c9 lrc: 5/0,0 mode: CR/CR res: [0x20000813c:0xf36:0x0].0x0 bits 0x8/0x0 rrc: 533 type: IBT flags: 0x70200400000020 nid: 10.141.6.181@o2ib417 remote: 0xbcf4e69b81010fa6 expref: 938 pid: 18783 timeout: 11704 lvb_type: 0 00010000:00010000:11.0:1595991684.497793:0:18888:0:(ldlm_lockd.c:465:ldlm_add_waiting_lock()) ### not re-adding to wait list(timeout: 412, AT: on) ns: mdt-nbptest3-MDT0000_UUID lock: ffff8f483e40ee40/0x28e6d526063f9c9 lrc: 5/0,0 mode: CR/CR res: [0x20000813c:0xf36:0x0].0x0 bits 0x8/0x0 rrc: 532 type: IBT flags: 0x70200400000020 nid: 10.141.6.181@o2ib417 remote: 0xbcf4e69b81010fa6 expref: 938 pid: 18783 timeout: 11704 lvb_type: 0 Now as the second process is sending the BL ast the client has already returned the lock (as it came with the BL AST bit set on grant): 00010000:00010000:11.0:1595991684.498535:0:18888:0:(ldlm_lockd.c:695:ldlm_handle_ast_error()) ### client (nid 10.141.6.181@o2ib417) returned -22 from blocking AST (req@ffff8f40cc10da00 x1673509146453120) - normal race ns: mdt-nbptest3-MDT0000_UUID lock: ffff8f483e40ee40/0x28e6d526063f9c9 lrc: 4/0,0 mode: CR/CR res: [0x20000813c:0xf36:0x0].0x0 bits 0x8/0x0 rrc: 532 type: IBT flags: 0x60200400000020 nid: 10.141.6.181@o2ib417 remote: 0xbcf4e69b81010fa6 expref: 938 pid: 18783 timeout: 11704 lvb_type: 0 (we can see in client log how the lock was being canceled right away so when the bl ast came - it replied with the error) and then we see how both the error in ast causes the lock to be destroyed while the incoming cancel that the client has sent decided that "we got a cancel for the lock that does not exist": 00010000:00010000:11.0:1595991684.498549:0:18888:0:(ldlm_lockd.c:531:ldlm_del_waiting_lock()) ### removed ns: mdt-nbptest3-MDT0000_UUID lock: ffff8f483e40ee40/0x28e6d526063f9c9 lrc: 4/0,0 mode: CR/CR res: [0x20000813c:0xf36:0x0].0x0 bits 0x8/0x0 rrc: 533 type: IBT flags: 0x50200400000020 nid: 10.141.6.181@o2ib417 remote: 0xbcf4e69b81010fa6 expref: 939 pid: 18783 timeout: 11704 lvb_type: 0 00010000:00010000:0.0:1595991684.498549:0:17908:0:(ldlm_lockd.c:2436:ldlm_cancel_hpreq_check()) ### hpreq cancel/convert lock ns: mdt-nbptest3-MDT0000_UUID lock: ffff8f483e40ee40/0x28e6d526063f9c9 lrc: 4/0,0 mode: CR/CR res: [0x20000813c:0xf36:0x0].0x0 bits 0x8/0x0 rrc: 533 type: IBT flags: 0x50200400000020 nid: 10.141.6.181@o2ib417 remote: 0xbcf4e69b81010fa6 expref: 939 pid: 18783 timeout: 11704 lvb_type: 0 00010000:00010000:11.0:1595991684.498557:0:18888:0:(ldlm_lock.c:210:ldlm_lock_put()) ### final lock_put on destroyed lock, freeing it. 
ns: mdt-nbptest3-MDT0000_UUID lock: ffff8f483e40ee40/0x28e6d526063f9c9 lrc: 0/0,0 mode: --/CR res: [0x20000813c:0xf36:0x0].0x0 bits 0x8/0x0 rrc: 532 type: IBT flags: 0x44a01400000020 nid: 10.141.6.181@o2ib417 remote: 0xbcf4e69b81010fa6 expref: 939 pid: 18783 timeout: 11704 lvb_type: 0 00000100:00000200:0.0:1595991684.498557:0:17908:0:(nrs.c:896:ptlrpc_nrs_hpreq_add_nolock()) @@@ high priority req req@ffff8f3573015580 x1673025973253184/t0(0) o103->6357b1f7-e6ee-7152-9b79-4bef4e870e84@10.141.6.181@o2ib417:552/0 lens 328/0 e 0 to 0 dl 1595992032 ref 1 fl New:H/0/ffffffff rc 0/-1 00000100:00100000:11.0:1595991684.498563:0:18888:0:(client.c:2060:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc mdt01_098:nbptest3-MDT0000_UUID:18888:1673509146453120:10.141.6.181@o2ib417:104 00000100:00100000:0.0:1595991684.498566:0:17908:0:(nrs_fifo.c:179:nrs_fifo_req_get()) NRS start fifo request from 12345-10.141.6.181@o2ib417, seq: 2884733 00000100:00100000:0.0:1595991684.498569:0:17908:0:(service.c:2140:ptlrpc_server_handle_request()) Handling RPC pname:cluuid+ref:pid:xid:nid:opc ldlm_cn00_004:6357b1f7-e6ee-7152-9b79-4bef4e870e84+938:3865:x1673025973253184:12345-10.141.6.181@o2ib417:103 00000100:00000200:0.0:1595991684.498571:0:17908:0:(service.c:2145:ptlrpc_server_handle_request()) got req 1673025973253184 00010000:00010000:0.0:1595991684.498573:0:17908:0:(ldlm_lockd.c:1626:ldlm_request_cancel()) ### server-side cancel handler START: 1 locks, starting at 0 00010000:00010000:0.0:1595991684.498576:0:17908:0:(ldlm_lockd.c:1633:ldlm_request_cancel()) ### server-side cancel handler stale lock (cookie 184204835340679625) 00010000:00010000:0.0:1595991684.498577:0:17908:0:(ldlm_lockd.c:1677:ldlm_request_cancel()) ### server-side cancel handler END and this is it, no more activity on this resource even though what one would expect is when the lock is canceled it would wake up whatever was waiting on it which should have been 0x28e6d526063f9e5 So it looks like the reprocess was somehow missed, normally it would recover because some other lock request for this resource would come, BUT coincidentally there were no more lock requests for this resource too so we got stuck. I quickly inspected the code and I don't immediately see any path where handle_ast_error/ldlm_run_ast_work would ignore the ERESTART that should call reprocess at all times. |
| Comment by Oleg Drokin [ 06/Aug/20 ] |
|
Just to confirm, this is a vanilla 2.12.4 on the server just like it seems in the logs, there are no patches applied? |
| Comment by Mahmoud Hanafi [ 06/Aug/20 ] |
|
Here is our branch. |
| Comment by Jay Lan (Inactive) [ 06/Aug/20 ] |
|
I do not know whether 2.12.4-1nas or 2.12.4-2nas was used when the problem was reported. These patches were cherry-picked for 2.12.4-1nas (i.e., on top of the 2.12.4 release). These patches were cherry-picked between 2.12.4-1nas and 2.12.4-2nas: |
| Comment by Gerrit Updater [ 07/Aug/20 ] |
|
Oleg Drokin (green@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/39598 |
| Comment by Oleg Drokin [ 07/Aug/20 ] |
|
I found a gap in the lock processing logic that would lead to the problem exposed by the latest set of logs. This patch should also apply to the 2.12.x branch; it is a server-only patch. Now, this is a different problem from what was originally reported with the lost RPC, so I hope you can adopt that patch and run the reproducer again with the elevated log levels to hopefully catch the missing RPC trace this time. Thank you. |
| Comment by Oleg Drokin [ 07/Aug/20 ] |
|
Ok I guess I spoke too soon and the patch might need a bit more work so don't apply it just yet. |
| Comment by Oleg Drokin [ 07/Aug/20 ] |
|
The current iteration of patch 39598 (set 3) is what I'd like you to try please. |
| Comment by Mahmoud Hanafi [ 11/Aug/20 ] |
|
Progress... The server doesn't get into a complete deadlock like before, when it would require a reboot. But I was able to reproduce a similar issue: the RPCs hang until they time out and the client gets evicted. At the eviction the application crashes with an I/O error.
[1046354.720641] LustreError: 40059:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 412s: evicting client at 10.141.6.218@o2ib417 ns: mdt-nbptest3-MDT0000_UUID lock: ffff8e4f95824140/0x4015736665bf730e lrc: 3/0,0 mode: EX/EX res: [0x20000b7b2:0x86:0x0].0x0 bits 0x8/0x0 rrc: 134 type: IBT flags: 0x60200000000020 nid: 10.141.6.218@o2ib417 remote: 0xe8b7391db938d471 expref: 130 pid: 44433 timeout: 1046339 lvb_type: 0
[1046354.851994] LustreError: 44288:0:(ldlm_lockd.c:1348:ldlm_handle_enqueue0()) ### lock on destroyed export ffff8e4f6bef9800 ns: mdt-nbptest3-MDT0000_UUID lock: ffff8e57506a69c0/0x4015736665bf755a lrc: 1/0,0 mode: --/CR res: [0x20000b7b2:0x86:0x0].0x0 bits 0x8/0x0 rrc: 131 type: IBT flags: 0x54a01400000000 nid: 10.141.6.218@o2ib417 remote: 0xe8b7391db938d478 expref: 15 pid: 44288 timeout: 0 lvb_type: 0
[1046354.969040] LustreError: 44288:0:(ldlm_lockd.c:1348:ldlm_handle_enqueue0()) Skipped 1 previous similar message
Uploaded debug logs to: ftp.whamcloud.com:/uploads/ |
| Comment by Mahmoud Hanafi [ 19/Aug/20 ] |
|
Any updates? |
| Comment by Oleg Drokin [ 20/Aug/20 ] |
|
From the affected client log:
00010000:00010000:63.0:1597179887.674436:0:15383:0:(ldlm_lockd.c:1844:ldlm_handle_cp_callback()) ### Double grant race happened ns: nbptest3-MDT0000-mdc-ffff8c1b58e0c800 lock: ffff8c1aa526ad80/0xe8b7391db938d471 lrc: 2/0,0 mode: EX/EX res: [0x20000b7b2:0x86:0x0].0x0 bits 0x8/0x0 rrc: 10 type: IBT flags: 0x0 nid: local remote: 0x4015736665bf730e expref: -99 pid: 17795 timeout: 0 lvb_type: 3
This is certainly a message I have not seen before, and it seems to be the source of the bug (apparently yet another bug compared to the initial report!). The relevant code in ldlm_handle_cp_callback() is:
if (ldlm_is_destroyed(lock) ||
ldlm_is_granted(lock)) {
/* b=11300: the lock has already been granted */
unlock_res_and_lock(lock);
LDLM_DEBUG(lock, "Double grant race happened");
GOTO(out, rc = 0);
}
This completely skips over later check for AST_SENT/CBPENDING flags leading the client to miss server's desire for the client to drop the lock. Here's server side story: 00010000:00010000:11.0:1597179887.630195:0:44433:0:(ldlm_lockd.c:1401:ldlm_handle_enqueue0()) ### server-side enqueue handler, sending reply(err=0, rc=0) ns: mdt-nbptest3-MDT0000_UUID lock: ffff8e4f95824140/0x4015736665bf730e lrc: 3/0,0 mode: --/EX res: [0x20000b7b2:0x86:0x0].0x0 bits 0x8/0x0 rrc: 111 type: IBT flags: 0x40200000000000 nid: 10.141.6.218@o2ib417 remote: 0xe8b7391db938d471 expref: 162 pid: 44433 timeout: 0 lvb_type: 0 this is the initial rpc reply no AST_SENT flag. actually what's weird is it's not even granted yet and would not be for some time. 00010000:00010000:13.0:1597179887.630275:0:44299:0:(ldlm_lockd.c:891:ldlm_server _blocking_ast()) ### lock not granted, not sending blocking AST ns: mdt-nbptest3 -MDT0000_UUID lock: ffff8e4f95824140/0x4015736665bf730e lrc: 3/0,0 mode: --/EX r es: [0x20000b7b2:0x86:0x0].0x0 bits 0x8/0x0 rrc: 114 type: IBT flags: 0x60200000 000020 nid: 10.141.6.218@o2ib417 remote: 0xe8b7391db938d471 expref: 161 pid: 444 33 timeout: 0 lvb_type: 0 here it was blocked shortly after and eventually once granted, the cp ast was sent: 00010000:00010000:7.0:1597179887.674021:0:45141:0:(ldlm_lock.c:1073:ldlm_granted_list_add_lock()) ### About to add lock: ns: mdt-nbptest3-MDT0000_UUID lock: ffff8e4f95824140/0x4015736665bf730e lrc: 3/0,0 mode: EX/EX res: [0x20000b7b2:0x86:0 x0].0x0 bits 0x8/0x0 rrc: 135 type: IBT flags: 0x60280000000020 nid: 10.141.6.218@o2ib417 remote: 0xe8b7391db938d471 expref: 165 pid: 44433 timeout: 0 lvb_type: 0 00010000:00010000:7.0:1597179887.674036:0:45141:0:(ldlm_lockd.c:1014:ldlm_server _completion_ast()) ### server preparing completion AST ns: mdt-nbptest3-MDT0000_ UUID lock: ffff8e4f95824140/0x4015736665bf730e lrc: 3/0,0 mode: EX/EX res: [0x20 000b7b2:0x86:0x0].0x0 bits 0x8/0x0 rrc: 136 type: IBT flags: 0x60200000000020 ni d: 10.141.6.218@o2ib417 remote: 0xe8b7391db938d471 expref: 165 pid: 44433 timeou t: 0 lvb_type: 0 00010000:00010000:7.0:1597179887.674042:0:45141:0:(ldlm_lockd.c:465:ldlm_add_waiting_lock()) ### adding to wait list(timeout: 412, AT: on) ns: mdt-nbptest3-MDT0000_UUID lock: ffff8e4f95824140/0x4015736665bf730e lrc: 4/0,0 mode: EX/EX res: [0x20000b7b2:0x86:0x0].0x0 bits 0x8/0x0 rrc: 136 type: IBT flags: 0x70200000000020 nid: 10.141.6.218@o2ib417 remote: 0xe8b7391db938d471 expref: 165 pid: 44433 timeout: 1046339 lvb_type: 0 Hm, so this actually exposes the "we tell the client it got a granted lock" when it did not somehow. I can confirm from the log the path triggered specifically that was patched in my patch and that somehow apparently converted into intent lock request into a regular EX by not yet obvious means. So can you please try patchset 2 for the same patch a try (patchset3 is what you tried and got this result)? the difference is patchset2 only adds a reprocess back and patchset 3 also removes some visibly unneeded logic in addition. |
| Comment by Mahmoud Hanafi [ 21/Aug/20 ] |
|
Uploaded debug for patch set 2 - ftp:/uploads/ I was able to reproduce the issue. I stopped debugging after the first client eviction. Maybe it was too soon... Please take a look and let me know if you want me to re-run it. |
| Comment by Oleg Drokin [ 21/Aug/20 ] |
|
So this last log 100% matches the first log: the server receives request xid 1675670653565120, processes it, and sends a reply to the client. The client never receives the reply, so when the server sends the blocking AST, the client, not aware of the lock being granted, does not release the lock, which leads to the eviction. This log does not include nettrace/neterror, so I cannot see what actually happened to the reply, i.e. whether it was lost on the wire or got stuck in some outgoing queue. |
| Comment by Andreas Dilger [ 22/Aug/20 ] |
|
Oleg, if the client gets a blocking callback for a lock that it doesn't think is granted, should it "grant" the lock at that time with the blocking flag and reply to the server AST, so that the lock is cancelled immediately after use but doesn't cause the client to stall? Alternatively, the client could drop the lock, but then it doesn't make any forward progress. |
| Comment by Mahmoud Hanafi [ 22/Aug/20 ] |
|
It is highly unlikely that we would have a drop on the net. I will re-run with nettrace. |
| Comment by Oleg Drokin [ 22/Aug/20 ] |
|
Andreas, while we can grant the lock on receipt of the BL callback, I am not sure that alone would unstick the enqueue thread that's waiting for the RPC to return first. If we force-resume it, then I imagine all sorts of strange behavior will crop up in normal races. Also, just granting the lock will not make it go away, because the stuck thread is holding a reference on it. |
| Comment by Mahmoud Hanafi [ 24/Aug/20 ] |
|
Uploaded new debug logs with nettrace. Server-side threads remain stuck until a server reboot. |
| Comment by Mahmoud Hanafi [ 27/Aug/20 ] |
|
Any updates? |
| Comment by Oleg Drokin [ 28/Aug/20 ] |
|
I don't see anything different in this log vs. the previous one:
00010000:00010000:4.0:1598299915.176480:0:14307:0:(ldlm_lockd.c:1401:ldlm_handle_enqueue0()) ### server-side enqueue handler, sending reply(err=0, rc=0) ns: mdt-nbptest3-MDT0000_UUID lock: ffff9c95908f1200/0xcb77a9c20dcbc88e lrc: 4/0,0 mode: CR/CR res: [0x20000d6f1:0x1:0x0].0x0 bits 0x8/0x0 rrc: 488 type: IBT flags: 0x60200400000020 nid: 10.141.6.214@o2ib417 remote: 0xdd261a7834f00346 expref: 98 pid: 14307 timeout: 6658 lvb_type: 0
00010000:00010000:4.0:1598299915.176495:0:14307:0:(ldlm_lockd.c:1479:ldlm_handle_enqueue0()) ### server-side enqueue handler END (lock ffff9c95908f1200, rc 0)
00000100:00100000:4.0:1598299915.176508:0:14307:0:(service.c:2190:ptlrpc_server_handle_request()) Handled RPC pname:cluuid+ref:pid:xid:nid:opc mdt00_005:221c8e76-3a4a-d7bb-7d3c-6ea3526db183+98:4765:x1675936175078016:12345-10.141.6.214@o2ib417:101 Request processed in 8860us (8880us total) trans 107374186758 rc 0/0
00000100:00100000:4.0:1598299915.176512:0:14307:0:(nrs_fifo.c:241:nrs_fifo_req_stop()) NRS stop fifo request from 12345-10.141.6.214@o2ib417, seq: 2640
00000100:00100000:4.0:1598299915.176518:0:14307:0:(nrs_fifo.c:179:nrs_fifo_req_get()) NRS start fifo request from 12345-10.141.3.114@o2ib417, seq: 2761
00000100:00100000:4.0:1598299915.176521:0:14307:0:(service.c:2140:ptlrpc_server_handle_request()) Handling RPC pname:cluuid+ref:pid:xid:nid:opc mdt00_005:44cfaec0-6c8c-6d1b-0b71-5fc557b7218e+97:4797:x1675936177175232:12345-10.141.3.114@o2ib417:101
The request was handled and the reply supposedly sent (no messages from LNet indicating otherwise, at least), yet the client did not receive it. |
| Comment by Andreas Dilger [ 28/Aug/20 ] |
|
Oleg, in addition to the BL AST "granting" the lock with LDLM_FL_CBPENDING, could it also wake up the thread waiting for the lock reply to allow it to finish processing? There would need to be some way to trigger this so that the client thinks the RPC was received successfully.
Probably a better idea is for the RPC wait in that thread to have a periodic timeout that checks whether the lock is already granted, and exits the wait itself. Then the BL AST can mark the lock granted with CBPENDING, and when the waiting thread next wakes up it will see the lock is granted, clean itself up, and continue processing the syscall (see the sketch below). |
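To make the shape of this proposal concrete, here is a minimal userspace sketch of the pattern being discussed (grant-on-BL-AST plus a periodically re-checking wait). This is not Lustre code: the types and function names (demo_lock, demo_wait_for_reply, demo_bl_ast) are invented for illustration, and the real change would live in the client-side LDLM/ptlrpc wait path.

/*
 * Sketch only: the thread waiting for the enqueue reply wakes up on a
 * periodic timeout and re-checks whether the lock has already been
 * marked granted, e.g. by a BL AST handler that also set a
 * "cancel after use" flag.  Build with: cc demo.c -lpthread
 */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

struct demo_lock {
	pthread_mutex_t mtx;
	pthread_cond_t  cond;
	bool reply_received;	/* normal path: enqueue reply arrived   */
	bool granted;		/* out-of-band path: BL AST granted it  */
	bool cbpending;		/* cancel the lock once it is used      */
};

/* Waiting side: returns once either the reply shows up or the lock was
 * granted out of band; re-checks at most once per second. */
static void demo_wait_for_reply(struct demo_lock *lk)
{
	pthread_mutex_lock(&lk->mtx);
	while (!lk->reply_received && !lk->granted) {
		struct timespec ts;

		clock_gettime(CLOCK_REALTIME, &ts);
		ts.tv_sec += 1;			/* periodic re-check */
		pthread_cond_timedwait(&lk->cond, &lk->mtx, &ts);
	}
	pthread_mutex_unlock(&lk->mtx);
}

/* BL AST side: mark the lock granted and cancel-pending, then nudge the
 * waiter so it does not sit out the full timeout. */
static void *demo_bl_ast(void *arg)
{
	struct demo_lock *lk = arg;

	sleep(2);	/* pretend the BL AST arrives a bit later */
	pthread_mutex_lock(&lk->mtx);
	lk->granted = true;
	lk->cbpending = true;
	pthread_cond_broadcast(&lk->cond);
	pthread_mutex_unlock(&lk->mtx);
	return NULL;
}

int main(void)
{
	struct demo_lock lk = {
		.mtx  = PTHREAD_MUTEX_INITIALIZER,
		.cond = PTHREAD_COND_INITIALIZER,
	};
	pthread_t t;

	pthread_create(&t, NULL, demo_bl_ast, &lk);
	demo_wait_for_reply(&lk);	/* the reply is "lost"; the BL AST unblocks us */
	pthread_join(t, NULL);
	printf("waiter resumed: granted=%d cbpending=%d\n",
	       lk.granted, lk.cbpending);
	return 0;
}

In this shape the lost reply no longer matters: the waiter either gets the reply or notices the out-of-band grant, and the CBPENDING flag ensures the lock is cancelled after use as Andreas suggested above.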
| Comment by Mahmoud Hanafi [ 28/Aug/20 ] |
|
In the last set of logs I had "nettrace" and "neterror". We didn't see any network errors. I can run with +net, if you would like. |
| Comment by Oleg Drokin [ 28/Aug/20 ] |
|
I discussed this with Amir and he thinks your situation resembles a case where there are not enough peer credits, so messages are dropped. I checked the logs and I do see this:
00000800:00000200:5.0:1595883687.321977:0:15687:0:(o2iblnd_cb.c:900:kiblnd_post_tx_locked()) 10.151.26.238@o2ib: no credits
but the date is strange - it's from July. There are no newer messages, but this is only output at trace level "net", which you have not been including later (I guess you were not removing older server.dk files?). He says you need to check the peer_credits_hiw and peer_credits values; peer_credits_hiw should be set to 1/2 of peer_credits to ensure proper reclaim of credits for certain types of messages (and perhaps growing peer_credits might not be a bad idea, depending on what your settings are now). "lnetctl net show -v 4" and "lnetctl peer show -v" should show some related stats and min credits. |
| Comment by Oleg Drokin [ 28/Aug/20 ] |
|
Andreas, the thread is sleeping in ptlrpc_set_wait in ptlrpc, so waking it up from ldlm is going to be very nontrivial. We can find the thread easily, as it's recorded in the lock, but then we need to determine the ptlrpc set, which is privately allocated and referenced only on that thread's stack (see the sketch below). |
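For illustration only, a tiny non-Lustre sketch of the ownership problem described here: the lock can record which thread is waiting, but the set that thread sleeps on is a local variable of the waiting function, so nothing that only holds the lock can reach it to wake the waiter selectively. All names below are hypothetical.

#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical stand-ins, not Lustre types. */
struct demo_rpc_set { int pending; };

struct demo_lock {
	pid_t waiter_pid;	/* easy to find: recorded in the lock     */
	/* note: no pointer to the waiter's rpc set - it exists only as
	 * a local variable on the waiting thread's stack, see below    */
};

static void demo_enqueue(struct demo_lock *lk)
{
	struct demo_rpc_set set = { .pending = 1 };	/* stack-private */

	lk->waiter_pid = getpid();
	/* ...the thread would now sleep on `set` until the reply
	 * arrives; another thread that only holds `lk` has no way to
	 * reach `set` and complete or wake it selectively...           */
	(void)set;
}

int main(void)
{
	struct demo_lock lk = { 0 };

	demo_enqueue(&lk);
	printf("waiter pid recorded in lock: %d\n", (int)lk.waiter_pid);
	return 0;
}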
| Comment by Mahmoud Hanafi [ 01/Sep/20 ] |
|
I think we have two issues here: first the LDLM deadlock, and second the credit issue. |
| Comment by Oleg Drokin [ 02/Sep/20 ] |
|
Patchset 4, if you can, please. I agree we have at least two issues at hand. |
| Comment by Mahmoud Hanafi [ 03/Sep/20 ] |
|
Uploaded logs running with patch#4. debug="ioctl neterror net warning dlmtrace error emerg ha rpctrace config console lfsck" |
| Comment by Oleg Drokin [ 07/Sep/20 ] |
|
I checked for the first eviction, and that seems to have been removed from the client log by the time it was dumped, so it's not possible to see what was going on on the client at the time. I see there are still a bunch of stuck threads after that; I need to see if I can find out why - probably another lost AST, but it's not a given that that client log (whichever client it was) survived either, I guess. |
| Comment by Mahmoud Hanafi [ 08/Sep/20 ] |
|
With patch#4 we see hung threads at the start of the first iteration of the MPI run. Without any patches it would take a number of iterations before we would see hung threads. I re-ran things today with peer_credits=128 to make sure we don't have any credit issues. As with the last set of logs, I am setting the debug_daemon size very large so as not to miss anything - I am not sure why we missed the first eviction. This set of logs shouldn't be missing anything. |
| Comment by Oleg Drokin [ 10/Sep/20 ] |
|
Thank you, this is a good trace. Now we see the lost RPC, and we see it being sent off from the server. The thing is - it's not sent to the client, but to an intermediate router (not in the hostlist either):
00000400:00000200:1.0:1599585761.690684:0:15434:0:(lib-move.c:1752:lnet_handle_send()) TRACE: 10.151.27.53@o2ib(10.151.27.53@o2ib:10.151.27.53@o2ib) -> 10.141.6.236@o2ib417(10.141.6.236@o2ib417:10.151.26.246@o2ib) : PUT try# 0
00000800:00000200:1.0:1599585761.690687:0:15434:0:(o2iblnd_cb.c:1663:kiblnd_send()) sending 1000 bytes in 1 frags to 12345-10.151.26.246@o2ib
So this router (10.151.26.246) is the node that ate the message, then, I imagine? Any useful messages there? |
| Comment by Mahmoud Hanafi [ 10/Sep/20 ] |
|
There is nothing in the logs on that router. I guess I need to run debug trace on the routers, clients, and server; the router debug logs may be really big... I don't understand why these two clients send RPCs with the same xid:
00000100:00100000:0.0:1599585761.680712:0:15434:0:(service.c:2140:ptlrpc_server_handle_request()) Handling RPC pname:cluuid+ref:pid:xid:nid:opc mdt00_071:64f13ca0-a798-8be7-76ba-6d1f388635bd+88:6761:x1676790192962432:12345-10.141.6.236@o2ib417:101
00000100:00000200:0.0:1599585761.680714:0:15434:0:(service.c:2145:ptlrpc_server_handle_request()) got req 1676790192962432
00000100:00100000:8.0:1599585761.683587:0:15309:0:(service.c:1989:ptlrpc_server_handle_req_in()) got req x1676790192962432
00000100:00100000:8.0:1599585761.683601:0:15309:0:(service.c:2140:ptlrpc_server_handle_request()) Handling RPC pname:cluuid+ref:pid:xid:nid:opc mdt01_039:673a32db-c752-20f9-dd67-f80a34f5fd8d+90:6813:x1676790192962432:12345-10.141.6.235@o2ib417:101 |
| Comment by Amir Shehata (Inactive) [ 11/Sep/20 ] |
|
I was talking with Oleg; there are a few things we can collect/do to see if there are issues on the LNet/networking side. First of all, let's grab the LNet statistics from the nodes with the problem (server/routers/clients) to see if there are drops:
lnetctl net show -v 4
lnetctl stats show
Oleg was saying that he sees the message depart the server but never arrive on the client, and we have no router logs. If we get the above stats from the routers, we can see if drops are happening there. We can also monitor the stats on some interval and see if the errors/drops increase in the same time frame as the problem; we can do the same type of monitoring on the servers and clients. I think if we're able to correlate LNet errors with the time frame of the problem, we'll have a lead to go on - at the least we can look at the types of errors happening during that time and see how we handle them. The other thing I'm thinking is to try to reduce the complexity of LNet. Can we try disabling health, if we haven't already:
lnetctl set health_sensitivity 0
This avoids any resends at the LNet layer. |
| Comment by Mahmoud Hanafi [ 11/Sep/20 ] |
|
We do run with 'health_sensitivity=0'. I saved the "net show" and "stats show" output (these were not cleared beforehand) to ftp uploads/. I'll try to re-run and clear the stats before the test. I was able to re-test today with router logs; see ftp:/uploads/. What I saw: the server sends the RPC.
--- SERVER ---
00000100:00100000:7.0:1599839141.481820:0:8199:0:(nrs_fifo.c:179:nrs_fifo_req_get()) NRS start fifo request from 12345-10.141.6.236@o2ib417, seq: 2361
00000100:00100000:7.0:1599839141.481824:0:8199:0:(service.c:2140:ptlrpc_server_handle_request()) Handling RPC pname:cluuid+ref:pid:xid:nid:opc mdt00_012:9308d1b4-ec54-b743-7579-5daa9b33d4cd+88:4780:x1677516826795584:12345-10.141.6.236@o2ib417:101
00010000:00010000:7.0:1599839141.507240:0:8199:0:(ldlm_lock.c:1073:ldlm_granted_list_add_lock()) ### About to add lock: ns: mdt-nbptest3-MDT0000_UUID lock: ffff99b56d30a1c0/0x8e13058be0aade00 lrc: 3/0,0 mode: CR/CR res: [0x200019271:0x2:0x0].0x0 bits 0x8/0x0 rrc: 489 type: IBT flags: 0x50200400000000 nid: 10.141.6.236@o2ib417 remote: 0xd532d2c8ba1b0d1 expref: 97 pid: 8199 timeout: 0 lvb_type: 0
00010000:00010000:7.0:1599839141.507253:0:8199:0:(ldlm_lock.c:681:ldlm_add_bl_work_item()) ### lock incompatible; sending blocking AST. ns: mdt-nbptest3-MDT0000_UUID lock: ffff99b56d30a1c0/0x8e13058be0aade00 lrc: 3/0,0 mode: CR/CR res: [0x200019271:0x2:0x0].0x0 bits 0x8/0x0 rrc: 489 type: IBT flags: 0x40200400000000 nid: 10.141.6.236@o2ib417 remote: 0xd532d2c8ba1b0d1 expref: 97 pid: 8199 timeout: 0 lvb_type: 0
00010000:00010000:7.0:1599839141.507263:0:8199:0:(ldlm_lockd.c:903:ldlm_server_blocking_ast()) ### server preparing blocking AST ns: mdt-nbptest3-MDT0000_UUID lock: ffff99b56d30a1c0/0x8e13058be0aade00 lrc: 4/0,0 mode: CR/CR res: [0x200019271:0x2:0x0].0x0 bits 0x8/0x0 rrc: 489 type: IBT flags: 0x50200400000020 nid: 10.141.6.236@o2ib417 remote: 0xd532d2c8ba1b0d1 expref: 97 pid: 8199 timeout: 0 lvb_type: 0
00010000:00010000:7.0:1599839141.507269:0:8199:0:(ldlm_lockd.c:465:ldlm_add_waiting_lock()) ### adding to wait list(timeout: 412, AT: on) ns: mdt-nbptest3-MDT0000_UUID lock: ffff99b56d30a1c0/0x8e13058be0aade00 lrc: 5/0,0 mode: CR/CR res: [0x200019271:0x2:0x0].0x0 bits 0x8/0x0 rrc: 489 type: IBT flags: 0x70200400000020 nid: 10.141.6.236@o2ib417 remote: 0xd532d2c8ba1b0d1 expref: 97 pid: 8199 timeout: 1350 lvb_type: 0
00000100:00100000:7.0:1599839141.507276:0:8199:0:(client.c:1630:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc mdt00_012:nbptest3-MDT0000_UUID:8199:1677552166027200:10.141.6.236@o2ib417:104
00000400:00000200:7.0:1599839141.507285:0:8199:0:(lib-move.c:4684:LNetPut()) LNetPut -> 12345-10.141.6.236@o2ib417
00000400:00000200:7.0:1599839141.507288:0:8199:0:(lib-move.c:2478:lnet_handle_send_case_locked()) Source ANY to NMR: 10.141.6.236@o2ib417 routed destination
00000400:00000200:7.0:1599839141.507296:0:8199:0:(lib-move.c:1752:lnet_handle_send()) TRACE: 10.151.27.53@o2ib(10.151.27.53@o2ib:<?>) -> 10.141.6.236@o2ib417(10.141.6.236@o2ib417:10.151.26.234@o2ib) : PUT try# 0
00000100:00000200:7.0:1599839141.507889:0:8199:0:(events.c:93:reply_in_callback()) @@@ type 6, status 0 req@ffff99b5661a0900 x1677552166027200/t0(0) o104->nbptest3-MDT0000@10.141.6.236@o2ib417:15/16 lens 296/224 e 0 to 0 dl 1599839764 ref 1 fl Rpc:R/0/ffffffff rc 0/-1
00000100:00000200:7.0:1599839141.507895:0:8199:0:(events.c:114:reply_in_callback()) @@@ unlink req@ffff99b5661a0900 x1677552166027200/t0(0) o104->nbptest3-MDT0000@10.141.6.236@o2ib417:15/16 lens 296/224 e 0 to 0 dl 1599839764 ref 1 fl Rpc:R/0/ffffffff rc 0/-1
00000100:00100000:7.0:1599839141.507904:0:8199:0:(client.c:2060:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc mdt00_012:nbptest3-MDT0000_UUID:8199:1677552166027200:10.141.6.236@o2ib417:104
00010000:00010000:7.0:1599839141.507912:0:8199:0:(ldlm_lock.c:210:ldlm_lock_put()) ### final lock_put on destroyed lock, freeing it. ns: mdt-nbptest3-MDT0000_UUID lock: ffff99b56d30f740/0x8e13058be0aaddf2 lrc: 0/0,0 mode: --/EX res: [0x200019271:0x2:0x0].0x0 bits 0x8/0x0 rrc: 489 type: IBT flags: 0x44000000000000 nid: 10.141.6.236@o2ib417 remote: 0xd532d2c8ba1b0d1 expref: 97 pid: 8199 timeout: 0 lvb_type: 3
00010000:00010000:7.0:1599839141.507918:0:8199:0:(ldlm_lockd.c:465:ldlm_add_waiting_lock()) ### not re-adding to wait list(timeout: 412, AT: on) ns: mdt-nbptest3-MDT0000_UUID lock: ffff99b56d30a1c0/0x8e13058be0aade00 lrc: 4/0,0 mode: CR/CR res: [0x200019271:0x2:0x0].0x0 bits 0x8/0x0 rrc: 488 type: IBT flags: 0x70200400000020 nid: 10.141.6.236@o2ib417 remote: 0xd532d2c8ba1b0d1 expref: 96 pid: 8199 timeout: 1350 lvb_type: 0
00010000:00010000:7.0:1599839141.507924:0:8199:0:(ldlm_lockd.c:1401:ldlm_handle_enqueue0()) ### server-side enqueue handler, sending reply(err=0, rc=0) ns: mdt-nbptest3-MDT0000_UUID lock: ffff99b56d30a1c0/0x8e13058be0aade00 lrc: 4/0,0 mode: CR/CR res: [0x200019271:0x2:0x0].0x0 bits 0x8/0x0 rrc: 488 type: IBT flags: 0x60200400000020 nid: 10.141.6.236@o2ib417 remote: 0xd532d2c8ba1b0d1 expref: 96 pid: 8199 timeout: 1350 lvb_type: 0
00010000:00000200:7.0:1599839141.507935:0:8199:0:(ldlm_lib.c:2967:target_send_reply_msg()) @@@ sending reply req@ffff99b5702d2d00 x1677516826795584/t201863467188(0) o101->9308d1b4-ec54-b743-7579-5daa9b33d4cd@10.141.6.236@o2ib417:131/0 lens 376/1032 e 0 to 0 dl 1599839846 ref 1 fl Interpret:/0/0 rc 0/0
00000400:00000200:7.0:1599839141.507944:0:8199:0:(lib-move.c:4684:LNetPut()) LNetPut -> 12345-10.141.6.236@o2ib417
00000400:00000200:7.0:1599839141.507946:0:8199:0:(lib-move.c:2478:lnet_handle_send_case_locked()) Source Specified: 10.151.27.53@o2ib to NMR: 10.141.6.236@o2ib417 routed destination
00000400:00000200:7.0:1599839141.507955:0:8199:0:(lib-move.c:1752:lnet_handle_send()) TRACE: 10.151.27.53@o2ib(10.151.27.53@o2ib:10.151.27.53@o2ib) -> 10.141.6.236@o2ib417(10.141.6.236@o2ib417:10.151.26.236@o2ib) : PUT try# 0
00000100:00100000:7.0:1599839141.507968:0:8199:0:(service.c:2190:ptlrpc_server_handle_request()) Handled RPC pname:cluuid+ref:pid:xid:nid:opc mdt00_012:9308d1b4-ec54-b743-7579-5daa9b33d4cd+96:4780:x1677516826795584:12345-10.141.6.236@o2ib417:101 Request processed in 26145us (26195us total) trans 201863467188 rc 0/0
00000100:00100000:7.0:1599839141.507973:0:8199:0:(nrs_fifo.c:241:nrs_fifo_req_stop()) NRS stop fifo request from 12345-10.141.6.236@o2ib417, seq: 2361
The router gets the request and (tries to) send it out. I wasn't sure if it actually makes it out of the router:
00000800:00000200:71.0:1599839141.508041:0:25250:0:(o2iblnd_cb.c:343:kiblnd_handle_rx()) Received d1[0] from 10.151.27.53@o2ib
00000400:00000200:71.0:1599839141.508045:0:25250:0:(lib-move.c:4190:lnet_parse()) TRACE: 10.141.6.236@o2ib417(10.151.26.236@o2ib) <- 10.151.27.53@o2ib : PUT - routed
00000400:00000200:71.0:1599839141.508050:0:25250:0:(lib-msg.c:918:lnet_is_health_check()) health check = 0, status = 0, hstatus = 0
00000400:00000200:71.0:1599839141.508053:0:25250:0:(lib-move.c:2478:lnet_handle_send_case_locked()) Source ANY to NMR: 10.141.6.236@o2ib417 local destination
00000400:00000200:71.0:1599839141.508061:0:25250:0:(lib-move.c:1752:lnet_handle_send()) TRACE: 10.151.27.53@o2ib(10.141.26.236@o2ib417:<?>) -> 10.141.6.236@o2ib417(10.141.6.236@o2ib417:10.141.6.236@o2ib417) : PUT try# 0
00000800:00000200:71.0:1599839141.508065:0:25250:0:(o2iblnd_cb.c:1663:kiblnd_send()) sending 1000 bytes in 1 frags to 12345-10.141.6.236@o2ib417
00000800:00000200:71.0:1599839141.508071:0:25250:0:(o2iblnd.c:405:kiblnd_find_peer_locked()) got peer_ni [ffff9c5a323c9d80] -> 10.141.6.236@o2ib417 (2) version: 12
00000800:00000200:71.0:1599839141.508072:0:25250:0:(o2iblnd_cb.c:1543:kiblnd_launch_tx()) conn[ffff9c4700f2ea00] (68)++
00000800:00000200:71.0:1599839141.508074:0:25250:0:(o2iblnd_cb.c:1316:kiblnd_queue_tx_locked()) conn[ffff9c4700f2ea00] (69)++
00000800:00000200:71.0:1599839141.508077:0:25250:0:(o2iblnd_cb.c:1549:kiblnd_launch_tx()) conn[ffff9c4700f2ea00] (70)--
00000800:00000200:71.0:1599839141.508078:0:25250:0:(o2iblnd_cb.c:205:kiblnd_post_rx()) conn[ffff9c5b223f4200] (69)++
00000800:00000200:71.0:1599839141.508080:0:25250:0:(o2iblnd_cb.c:239:kiblnd_post_rx()) conn[ffff9c5b223f4200] (70)--
00000800:00000200:71.0:1599839141.508081:0:25250:0:(o2iblnd_cb.c:3875:kiblnd_scheduler()) conn[ffff9c5b223f4200] (69)--
00000800:00000200:71.0:1599839141.508457:0:25250:0:(o2iblnd_cb.c:3859:kiblnd_scheduler()) conn[ffff9c5ad91c0200] (70)++
00000800:00000200:71.0:1599839141.508461:0:25250:0:(o2iblnd_cb.c:3875:kiblnd_scheduler()) conn[ffff9c5ad91c0200] (71)--
00000800:00000200:71.0:1599839141.508462:0:25250:0:(o2iblnd_cb.c:3859:kiblnd_scheduler()) conn[ffff9c5ad91c0200] (70)++
There is no completion for conn[ffff9c5ad91c0200]. |
| Comment by Amir Shehata (Inactive) [ 11/Sep/20 ] |
|
I think it would be helpful to have a script that monitors the statistics on the routers and proves that drop counters or other errors increment around the time of the issue (see the sketch after this comment). Looking at the stats, it appears that some of the routers have a very high number of drops, >12K. Do you think that would indicate a problem? The stats can't be cleared, so these numbers are from the time the routers came up, but a script can be written to compute the delta and see if the counters get incremented in bursts. I'm also seeing a lot of remote_dropped_count in the stats show output. There are only 2 places where that could get incremented.
For 2, that should be easy to verify: in the router logs, do you see that message? I think that could explain why some of the messages are not making it to the clients - they could be dropped by the router. |
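A minimal sketch (not from the ticket) of the delta-monitoring idea: poll `lnetctl stats show` on an interval and report how much drop_count grew between samples. It assumes lnetctl is in PATH and that its YAML output contains a "drop_count:" line; the parsing and the 10-second interval are illustrative choices, not requirements.

/*
 * Poll `lnetctl stats show` and print the growth of drop_count
 * between consecutive samples.  Sketch only.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Run `lnetctl stats show` and return the value of the first
 * "drop_count:" line, or -1 if it could not be read. */
static long read_drop_count(void)
{
	FILE *p = popen("lnetctl stats show", "r");
	char line[256];
	long val = -1;

	if (p == NULL)
		return -1;
	while (fgets(line, sizeof(line), p) != NULL) {
		char *f = strstr(line, "drop_count:");
		if (f != NULL) {
			val = strtol(f + strlen("drop_count:"), NULL, 10);
			break;
		}
	}
	pclose(p);
	return val;
}

int main(void)
{
	long prev = read_drop_count();

	for (;;) {
		long cur;

		sleep(10);			/* sampling interval */
		cur = read_drop_count();
		if (prev >= 0 && cur >= 0 && cur != prev)
			printf("drop_count grew by %ld in the last 10s\n",
			       cur - prev);
		prev = cur;
	}
	return 0;
}

Running this on each router (and on the server and clients) and comparing the timestamps of the reported bursts against the eviction times would give the correlation Amir is asking for.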
| Comment by Mahmoud Hanafi [ 15/Sep/20 ] |
|
Amir, I re-ran the test, saving the router stats before and after. None of the drop counters were incremented. Please look at the router debug logs from the last set uploaded to see if you can find anything unusual. |
| Comment by Mahmoud Hanafi [ 23/Sep/20 ] |
|
I have been trying to re-test but have been unable to reproduce the issue. I will keep trying. |
| Comment by Mahmoud Hanafi [ 06/Oct/20 ] |
|
Patch#4 appears stable - I haven't seen any more hung threads. |
| Comment by Gerrit Updater [ 27/Oct/20 ] |
|
Oleg Drokin (green@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/40412 |
| Comment by Gerrit Updater [ 29/Oct/20 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/39598/ |
| Comment by Peter Jones [ 29/Oct/20 ] |
|
Landed for 2.14 |
| Comment by Gerrit Updater [ 03/Nov/20 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/40412/ |