Details
- Type: Bug
- Resolution: Fixed
- Priority: Minor
- Affects Version/s: Lustre 2.10.5, Lustre 2.14.0
- Labels: None
- Severity: 3
Description
When a target sends a lock callback RPC to a dead client, the service thread handling the request that triggered the callback keeps a CPU busy until the client is eventually evicted. The dead client is supposed to be evicted by the lock callback timeout, but depending on server activity this may not happen and the client is evicted by the ping_evictor instead, keeping the service thread busy about 50% longer. It can also trigger messages like "Already past deadline, not sending early reply" or "Request took longer than estimated, client may timeout".
The more callback RPCs like this are triggered, the more service threads are kept busy.
This issue was detected with Lustre 2.10 but was reproduced with the 'master' branch. Due to the changes related to MR (Multi-Rail), some error message patterns are slightly different, but the CPU usage is still there. Here are the error logs using master:
[Wed Sep 23 12:41:18 2020] Lustre: 12549:0:(client.c:2261:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1600864873/real 1600864873] req@ffff9db643cb3180 x1678624839367552/t0(0) o106->lustre-OST0002@172.31.46.180@tcp:15/16 lens 328/280 e 0 to 1 dl 1600864880 ref 1 fl Rpc:XQr/0/ffffffff rc 0/-1 job:''
[Wed Sep 23 12:41:18 2020] Lustre: 12549:0:(client.c:2261:ptlrpc_expire_one_request()) Skipped 15303010 previous similar messages
[Wed Sep 23 12:41:39 2020] Lustre: 12549:0:(client.c:2261:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1600864894/real 1600864887] req@ffff9db643cb3180 x1678624839367552/t0(0) o106->lustre-OST0002@172.31.46.180@tcp:15/16 lens 328/280 e 0 to 1 dl 1600864901 ref 2 fl Rpc:Xr/2/ffffffff rc 0/-1 job:''
[Wed Sep 23 12:41:39 2020] Lustre: 12549:0:(client.c:2261:ptlrpc_expire_one_request()) Skipped 2 previous similar messages
[Wed Sep 23 12:42:11 2020] Lustre: 12549:0:(client.c:2261:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1600864933/real 1600864933] req@ffff9db643cb3180 x1678624839367552/t0(0) o106->lustre-OST0002@172.31.46.180@tcp:15/16 lens 328/280 e 0 to 1 dl 1600864940 ref 1 fl Rpc:eXQr/2/ffffffff rc 0/-1 job:''
[Wed Sep 23 12:42:11 2020] Lustre: 12549:0:(client.c:2261:ptlrpc_expire_one_request()) Skipped 973802 previous similar messages
[Wed Sep 23 12:42:23 2020] LustreError: 13285:0:(sec.c:2515:sptlrpc_svc_unwrap_bulk()) @@@ truncated bulk GET 0(4096) req@ffff9db646aa6c00 x1678628476684464/t0(0) o4->890b8b6c-6958-af76-fd7e-75d275c07381@172.31.46.180@tcp:703/0 lens 608/448 e 3 to 0 dl 1600864953 ref 1 fl Interpret:/0/0 rc 0/0 job:'dd.0'
[Wed Sep 23 12:42:23 2020] Lustre: lustre-OST0002: Bulk IO write error with 890b8b6c-6958-af76-fd7e-75d275c07381 (at 172.31.46.180@tcp), client will retry: rc = -110
[Wed Sep 23 12:43:15 2020] Lustre: 12549:0:(client.c:2261:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1600864997/real 1600864997] req@ffff9db643cb3180 x1678624839367552/t0(0) o106->lustre-OST0002@172.31.46.180@tcp:15/16 lens 328/280 e 0 to 1 dl 1600865004 ref 1 fl Rpc:eXQr/2/ffffffff rc 0/-1 job:''
[Wed Sep 23 12:43:15 2020] Lustre: 12549:0:(client.c:2261:ptlrpc_expire_one_request()) Skipped 8141422 previous similar messages
[Wed Sep 23 12:44:31 2020] Lustre: ll_ost00_004: service thread pid 12549 was inactive for 200.396 seconds. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
[Wed Sep 23 12:44:31 2020] Pid: 12549, comm: ll_ost00_004 3.10.0-957.1.3.el7.x86_64 #1 SMP Thu Nov 15 17:36:42 UTC 2018
[Wed Sep 23 12:44:31 2020] Call Trace:
[Wed Sep 23 12:44:31 2020] [<ffffffffffffffff>] 0xffffffffffffffff
[Wed Sep 23 12:44:51 2020] Lustre: lustre-OST0000: haven't heard from client 890b8b6c-6958-af76-fd7e-75d275c07381 (at 172.31.46.180@tcp) in 229 seconds. I think it's dead, and I am evicting it. exp ffff9db9bc393400, cur 1600865094 expire 1600864944 last 1600864865
[Wed Sep 23 12:44:51 2020] Lustre: Skipped 1 previous similar message
[Wed Sep 23 12:44:56 2020] Lustre: lustre-OST0001: haven't heard from client 890b8b6c-6958-af76-fd7e-75d275c07381 (at 172.31.46.180@tcp) in 233 seconds. I think it's dead, and I am evicting it. exp ffff9db9bc393000, cur 1600865098 expire 1600864948 last 1600864865
[Wed Sep 23 12:44:56 2020] LustreError: 12549:0:(client.c:1266:ptlrpc_import_delay_req()) @@@ IMP_CLOSED req@ffff9db643cb3180 x1678624839367552/t0(0) o106->lustre-OST0002@172.31.46.180@tcp:15/16 lens 328/280 e 0 to 1 dl 1600865106 ref 1 fl Rpc:EeXQU/2/ffffffff rc -5/-1 job:''
[Wed Sep 23 12:44:57 2020] Lustre: lustre-MDT0000: haven't heard from client 890b8b6c-6958-af76-fd7e-75d275c07381 (at 172.31.46.180@tcp) in 229 seconds. I think it's dead, and I am evicting it. exp ffff9db97ce11000, cur 1600865100 expire 1600864950 last 1600864871
[Wed Sep 23 12:44:57 2020] Lustre: Skipped 1 previous similar message
[Wed Sep 23 12:45:01 2020] Lustre: MGS: haven't heard from client 1c674e35-afaf-9ebd-a75e-6b29d3669249 (at 172.31.46.180@tcp) in 239 seconds. I think it's dead, and I am evicting it. exp ffff9db97ce10000, cur 1600865104 expire 1600864954 last 1600864865
[Wed Sep 23 12:50:18 2020] LNetError: 10741:0:(lib-move.c:3744:lnet_handle_recovery_reply()) peer NI (172.31.46.180@tcp) recovery failed with -5
[Wed Sep 23 12:50:18 2020] LNetError: 10741:0:(lib-move.c:3744:lnet_handle_recovery_reply()) Skipped 540 previous similar messages
Note the high value of: `Skipped 8141422 previous similar messages`
Steps to reproduce
This issue can be reproduced using the test framework to start a filesystem and two clients.
- On client A, start writing to a file, then reboot the host without any cleanup to simulate a host crash.
- On client B, try to read the file just after client A has crashed. The client B process will hang, waiting for lock completion.
- On the OST where the file is located, you should start seeing the request timeout error messages and the load increasing after a few tens of seconds (tested with ksocklnd).
clientA $ (sudo dd if=/dev/zero of=/mnt/lustre/callback bs=4k count=1M &); sleep 1; sudo reboot -nff
clientB $ dd if=/mnt/lustre/callback count=1 >/dev/null   (hangs)
oss1    $ # see load increasing after ~100 sec and a service thread at 100%
Analysis
The thread is looping with the following stack trace:
ptlrpc_set_wait
ldlm_run_ast_work
ldlm_glimpse_locks
ofd_intent_policy
...
ptlrpc_set_wait() calls ptlrpc_check_set() -> ptlrpc_expire_one_request().
ptlrpc_check_set() keeps trying to resend the callback, which at first happens every few seconds, as the socket still exists. It retries after each request deadline, until the socket timeout is eventually reached and the socket is closed. At that point, re-sending the RPC fails right away (-ENOTCONN). l_wait_event() in ptlrpc_set_wait() keeps calling ptlrpc_check_set(), which keeps failing to send the RPC. The thread churns through that code path until the client is eventually evicted.
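For readability, here is a rough sketch of that loop. The helpers all_requests_done(), deadline_passed() and try_resend() are placeholders invented for illustration, not real Lustre symbols; the real logic lives in ptlrpc_set_wait(), ptlrpc_check_set() and ptlrpc_expire_one_request() in ptlrpc/client.c:

/* Rough illustration only -- placeholder helpers, not the actual Lustre code. */
while (!all_requests_done(set)) {                 /* l_wait_event() loop in ptlrpc_set_wait()  */
        list_for_each_entry(req, &set->set_requests, rq_set_chain) {
                if (deadline_passed(req))         /* ptlrpc_expire_one_request()               */
                        req->rq_resend = 1;
                if (req->rq_resend)
                        try_resend(req);          /* fails instantly with -ENOTCONN once the   */
        }                                         /* socket is closed, so nothing ever blocks  */
}                                                 /* and the thread spins until the eviction   */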
I'm proposing a patch which avoids re-sending an RPC if it was sent less than a second ago and failed due to a network error.
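A minimal sketch of that idea is below. The exact condition and its placement inside ptlrpc_check_set() are assumptions for illustration, not the submitted patch (rq_net_err and rq_sent are existing fields of struct ptlrpc_request, but the real change may look different):

/* Assumed placement: in ptlrpc_check_set(), before deciding to resend a request. */
/* Approximation of the proposed throttle, not the actual patch.                  */
if (req->rq_net_err &&
    req->rq_sent + 1 > ktime_get_real_seconds()) {
        /* The previous send already failed on the network less than a second
         * ago; skip this request on this pass instead of busy-looping on
         * immediate send failures. */
        continue;
}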
Attachments
Issue Links
- is related to: LU-5368 errors in/from ldlm_run_ast_work() ignored (Open)