Details
Bug
Resolution: Unresolved
Minor
None
Lustre 2.15.0
None
3
9223372036854775807
Description
This issue was created by maloo for Cliff White <cwhite@whamcloud.com>
This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/04d311dc-2cdd-41d7-b466-153348f0b7ce
The system appears to become unhealthy some time before the test that actually fails. The logs show client evictions just as sanity-quota starts:
[29069.300962] Lustre: DEBUG MARKER: -----============= acceptance-small: sanity-quota ============----- Mon Apr 4 01:48:40 UTC 2022
[29071.296226] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n version 2>/dev/null
[29072.105819] Lustre: DEBUG MARKER: /usr/sbin/lctl mark excepting tests: 55
[29072.142356] LustreError: 1038024:0:(ldlm_lockd.c:719:ldlm_handle_ast_error()) ### client (nid 10.240.42.19@tcp) returned error from blocking AST (req@000000002456a2c9 x1729110315565056 status -107 rc -107), evict it ns: filter-lustre-OST0005_UUID lock: 00000000f3771d86/0xceaec102ff073de4 lrc: 4/0,0 mode: PW/PW res: [0x2e4b0:0x0:0x0].0x0 rrc: 3 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) gid 0 flags: 0x60000400030020 nid: 10.240.42.19@tcp remote: 0x4d8ebe6f641e09a9 expref: 61 pid: 1038030 timeout: 29173 lvb_type: 0
[29072.142945] LustreError: 138-a: lustre-OST0003: A client on nid 10.240.42.19@tcp was evicted due to a lock blocking callback time out: rc -107
[29072.150901] LustreError: 1038024:0:(ldlm_lockd.c:719:ldlm_handle_ast_error()) Skipped 1 previous similar message
[29072.155542] LustreError: 945061:0:(ldlm_lockd.c:259:expired_lock_main()) ### lock callback timer expired after 0s: evicting client at 10.240.42.19@tcp ns: filter-lustre-OST0004_UUID lock: 0000000060efdb4c/0xceaec102ff073b91 lrc: 3/0,0 mode: PW/PW res: [0x2e58c:0x0:0x0].0x0 rrc: 3 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) gid 0 flags: 0x60000400030020 nid: 10.240.42.19@tcp remote: 0x4d8ebe6f641dd3b6 expref: 62 pid: 1038016 timeout: 0 lvb_type: 0
The tests that follow fail, with many dropped connections:
[ 1246.301471] Lustre: 34254:0:(client.c:2282:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1649008457/real 1649008457] req@0000000026f25ab6 x1729110135806080/t0(0) o400->MGC10.240.42.19@tcp@10.240.42.19@tcp:26/25 lens 224/224 e 0 to 1 dl 1649008464 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:4.0'
[ 1246.306932] LustreError: 166-1: MGC10.240.42.19@tcp: Connection to MGS (at 10.240.42.19@tcp) was lost; in progress operations using this service will fail
[ 1263.645767] Lustre: Evicted from MGS (at 10.240.42.19@tcp) after server handle changed from 0x4d8ebe6f5a182d6e to 0x4d8ebe6f5a18f5a4
[ 1263.648365] Lustre: MGC10.240.42.19@tcp: Connection restored to 10.240.42.19@tcp (at 10.240.42.19@tcp)
The last test failure was in sanity-quota:
[31214.520365] LustreError: 110057:0:(lcommon_cl.c:197:cl_file_inode_init()) lustre: failed to initialize cl_object [0x20000a811:0x2496:0x0]: rc = -22
[31214.522974] LustreError: 110057:0:(llite_lib.c:2837:ll_prep_inode()) new_inode -fatal: rc -22
[31216.139993] Lustre: DEBUG MARKER: /usr/sbin/lctl mark sanity-quota test_3a: @@@@@@ FAIL: write success, but expect EDQUOT
[31216.590783] Lustre: DEBUG MARKER: sanity-quota test_3a: @@@@@@ FAIL: write success, but expect EDQUOT
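For triage, console output like the excerpts in this report can be split back into one entry per kernel timestamp and filtered for evictions. The following is only a rough sketch and not part of the test framework. The helper names split_entries and eviction_entries are hypothetical, and the patterns assume the "[seconds.microseconds]" timestamp and "nid <addr>@<net>" / "client at <addr>@<net>" forms seen in the messages quoted here.

```python
import re

# Kernel log timestamp such as "[29072.142356]" or "[ 1246.301471]".
# Bracketed resource IDs like "[0x2e4b0:0x0:0x0]" do not match this pattern.
TS = re.compile(r"\[\s*(\d+\.\d+)\]")

# NID following "nid " or "client at ", e.g. "10.240.42.19@tcp".
# Assumption: only the two phrasings seen in this report are covered.
NID = re.compile(r"(?:nid|client at) (\S+@\w+)")


def split_entries(blob):
    """Split fused console output into (timestamp, message) pairs."""
    parts = TS.split(blob)
    # parts is [prefix, ts1, msg1, ts2, msg2, ...]
    return [(float(ts), msg.strip())
            for ts, msg in zip(parts[1::2], parts[2::2])]


def eviction_entries(entries):
    """Return (timestamp, nid) for every entry that mentions an eviction."""
    found = []
    for ts, msg in entries:
        if "evict" in msg:
            match = NID.search(msg)
            if match:
                found.append((ts, match.group(1)))
    return found
```

Running eviction_entries(split_entries(text)) over a server console log gives the time of each eviction, which can then be compared with the DEBUG MARKER line for the test that was starting.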