[LU-14548] sanityn test 31a hangs in client lock with 'Could not add any time (5/5), not sending early reply' Created: 23/Mar/21 Updated: 20/Dec/21 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.14.0, Lustre 2.12.7, Lustre 2.15.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | James Nunez (Inactive) | Assignee: | WC Triage |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | interop |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
sanityn test_31a hangs on the client. This hang was first seen on 21 AUG 2020 while testing the patch https://review.whamcloud.com/39598 (

Looking at a recent hang between 2.13.0 clients and 2.14.50.203 servers at https://testing.whamcloud.com/test_sets/12bfd95b-3a68-4228-82b9-0a83f065233e, we see the following trace on client1 (vm1):

[ 8288.619007] Lustre: DEBUG MARKER: == sanityn test 31a: voluntary cancel / blocking ast race============================================= 00:52:53 (1616460773)
[ 8288.941328] Lustre: *** cfs_fail_loc=314, val=0***
[ 8329.079615] Lustre: ldlm_cb00_000: service thread pid 15494 was inactive for 40.125 seconds. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
[ 8329.083243] Pid: 15494, comm: ldlm_cb00_000 3.10.0-1062.1.1.el7.x86_64 #1 SMP Fri Sep 13 22:55:44 UTC 2019
[ 8329.084917] Call Trace:
[ 8329.085424] [<ffffffffc0bbd039>] ldlm_handle_cp_callback+0x109/0xb20 [ptlrpc]
[ 8329.087092] [<ffffffffc0bc0d4e>] ldlm_callback_handler.part.11+0x153e/0x1dd0 [ptlrpc]
[ 8329.088550] [<ffffffffc0bc1617>] ldlm_callback_handler+0x37/0xd0 [ptlrpc]
[ 8329.089896] [<ffffffffc0bee856>] ptlrpc_server_handle_request+0x256/0xb10 [ptlrpc]
[ 8329.091347] [<ffffffffc0bf238c>] ptlrpc_main+0xbac/0x1540 [ptlrpc]
[ 8329.092652] [<ffffffffba8c50d1>] kthread+0xd1/0xe0
[ 8329.093600] [<ffffffffbaf8cd37>] ret_from_fork_nospec_end+0x0/0x39
[ 8329.094818] [<ffffffffffffffff>] 0xffffffffffffffff
[ 8883.934431] Lustre: 21838:0:(service.c:1442:ptlrpc_at_send_early_reply()) @@@ Could not add any time (5/5), not sending early reply req@ffff96ac93018000 x1694973578214336/t0(0) o105->LOV_OSC_UUID@10.9.5.214@tcp:334/0 lens 392/224 e 24 to 0 dl 1616461374 ref 2 fl Interpret:/0/0 rc 0/0 job:''
[11946.899331] SysRq : Changing Loglevel

We’ve seen sanityn test 31a hang in this way 124 times since August 2020 and many of the hangs are for interop testing. Logs for recent failures are at: |