Details
-
Bug
-
Resolution: Unresolved
-
Minor
-
None
-
Lustre 2.14.0, Lustre 2.12.7, Lustre 2.15.0
-
3
-
9223372036854775807
Description
sanityn test_31a hangs on the client. This hang was first seen on 21 AUG 2020 while testing the patch https://review.whamcloud.com/39598 (LU-13692 ldlm: Ensure we reprocess the resource on ast error) on Lustre 2.13.55.46 with logs at https://testing.whamcloud.com/test_sets/6b73874c-2358-4c75-ac94-018ba68b604b. The first time it was seen for non-review testing was on 01 NOV 2020 for an interop test session between 2.12.5 clients and 2.13.56.63 servers; https://testing.whamcloud.com/test_sets/551af6c3-ac5d-4d68-9f70-9ed93187afda.
Looking at a recent hang between 2.13.0 clients and 2.14.50.203 servers at https://testing.whamcloud.com/test_sets/12bfd95b-3a68-4228-82b9-0a83f065233e, we see the following trace on client1 (vm1):
[ 8288.619007] Lustre: DEBUG MARKER: == sanityn test 31a: voluntary cancel / blocking ast race============================================= 00:52:53 (1616460773) [ 8288.941328] Lustre: *** cfs_fail_loc=314, val=0*** [ 8329.079615] Lustre: ldlm_cb00_000: service thread pid 15494 was inactive for 40.125 seconds. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes: [ 8329.083243] Pid: 15494, comm: ldlm_cb00_000 3.10.0-1062.1.1.el7.x86_64 #1 SMP Fri Sep 13 22:55:44 UTC 2019 [ 8329.084917] Call Trace: [ 8329.085424] [<ffffffffc0bbd039>] ldlm_handle_cp_callback+0x109/0xb20 [ptlrpc] [ 8329.087092] [<ffffffffc0bc0d4e>] ldlm_callback_handler.part.11+0x153e/0x1dd0 [ptlrpc] [ 8329.088550] [<ffffffffc0bc1617>] ldlm_callback_handler+0x37/0xd0 [ptlrpc] [ 8329.089896] [<ffffffffc0bee856>] ptlrpc_server_handle_request+0x256/0xb10 [ptlrpc] [ 8329.091347] [<ffffffffc0bf238c>] ptlrpc_main+0xbac/0x1540 [ptlrpc] [ 8329.092652] [<ffffffffba8c50d1>] kthread+0xd1/0xe0 [ 8329.093600] [<ffffffffbaf8cd37>] ret_from_fork_nospec_end+0x0/0x39 [ 8329.094818] [<ffffffffffffffff>] 0xffffffffffffffff [ 8883.934431] Lustre: 21838:0:(service.c:1442:ptlrpc_at_send_early_reply()) @@@ Could not add any time (5/5), not sending early reply req@ffff96ac93018000 x1694973578214336/t0(0) o105->LOV_OSC_UUID@10.9.5.214@tcp:334/0 lens 392/224 e 24 to 0 dl 1616461374 ref 2 fl Interpret:/0/0 rc 0/0 job:'' [11946.899331] SysRq : Changing Loglevel
We’ve seen sanityn test 31a hang in this way 124 times since August 2020 and many of the hangs are for interop testing. Logs for recent failures are at:
2.10.8 clients/2.12.6.43 servers - https://testing.whamcloud.com/test_sets/82787b59-0ba3-40f5-a0f2-8ca4cf357994
2.11.0 clients/2.12.6.43 servers - https://testing.whamcloud.com/test_sets/4a602563-ea81-4301-8b1a-05cb08b5c984
2.13.0 clients/2.14.50.190 servers - https://testing.whamcloud.com/test_sets/7d866672-6aeb-4340-ba74-38be6fb836b5