We see this failure occasionally. Behind the error message
Input/output error
sanityn test_32a: @@@@@@ FAIL: cached truncate - wrong file size
is a failed lstat("/mnt/lustre2/f32a.sanityn"). However, when inspected after the test failure, the file's size, stat output and layout all look normal.
From the debug logs on mds0 (the test driver), the ldlm_enqueue request to oss0 fails with rc = -107 (ENOTCONN, i.e. not connected):
00000100:02020000:11.0:1642177889.168635:0:9731:0:(client.c:1371:ptlrpc_check_status()) 11-0: lustre-OST0000-osc-ffff895ad1180800: operation ldlm_enqueue to node 10.6.4.23@tcp failed: rc = -107
Slightly earlier by timestamp, the oss0 debug log has the following (op 101 is ldlm_enqueue):
00000020:00080000:10.0:1642177889.168118:0:14028:0:(tgt_handler.c:770:tgt_request_handle()) operation 101 on unconnected OST from 12345-10.6.4.19@tcp
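To make the ordering of the two debug-log lines above explicit, here is a minimal Python sketch that parses them and computes the gap between their timestamps. The regular expression and the field names (subsys, mask, cpu, pid, and so on) are my assumption about the debug-log line layout, inferred only from the two lines quoted here, and the comparison is only meaningful if the clocks on mds0 and oss0 are synchronized.

```python
import re

# Assumed layout of a Lustre debug-log line; the field names are illustrative
# and inferred from the two lines quoted in this comment.
LINE_RE = re.compile(
    r"^(?P<subsys>[0-9a-f]{8}):(?P<mask>[0-9a-f]{8}):"
    r"(?P<cpu>\d+)\.\d+:(?P<sec>\d+)\.(?P<usec>\d{6}):"
    r"\d+:(?P<pid>\d+):\d+:"
    r"\((?P<file>[^:]+):(?P<line>\d+):(?P<func>\w+)\(\)\)\s*(?P<msg>.*)$"
)


def parse(line):
    """Split one debug-log line into fields and add an integer timestamp in microseconds."""
    m = LINE_RE.match(line)
    if m is None:
        raise ValueError("unrecognized debug-log line: %r" % line)
    fields = m.groupdict()
    # Integer microseconds avoid floating-point rounding on epoch timestamps.
    fields["t_usec"] = int(fields["sec"]) * 1_000_000 + int(fields["usec"])
    return fields


mds = parse(
    "00000100:02020000:11.0:1642177889.168635:0:9731:0:"
    "(client.c:1371:ptlrpc_check_status()) 11-0: "
    "lustre-OST0000-osc-ffff895ad1180800: operation ldlm_enqueue "
    "to node 10.6.4.23@tcp failed: rc = -107"
)
oss = parse(
    "00000020:00080000:10.0:1642177889.168118:0:14028:0:"
    "(tgt_handler.c:770:tgt_request_handle()) operation 101 on "
    "unconnected OST from 12345-10.6.4.19@tcp"
)

# Positive value: the oss0 line was logged before the mds0 line.
delta_usec = mds["t_usec"] - oss["t_usec"]
print("oss0 -> mds0 gap: %d usec" % delta_usec)
```

With the two lines above the gap is 517 microseconds, which is consistent with oss0 rejecting the enqueue first and mds0 then logging the -107 reply.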
Also from the dmesg on mds0:
[ 1346.778759] LustreError: 11-0: lustre-OST0000-osc-ffff895ad1180800: operation ldlm_enqueue to node 10.6.4.23@tcp failed: rc = -107
[ 1346.785968] Lustre: lustre-OST0000-osc-ffff895ad1180800: Connection to lustre-OST0000 (at 10.6.4.23@tcp) was lost; in progress operations using this service will wait for recovery to complete
[ 1346.787088] LustreError: 167-0: lustre-OST0000-osc-ffff895ad1180800: This client was evicted by lustre-OST0000; in progress operations using this service will fail.
[ 1346.796872] Lustre: lustre-OST0000-osc-ffff895ad1180800: Connection restored to 10.6.4.23@tcp (at 10.6.4.23@tcp)
[ 1346.951394] Lustre: DEBUG MARKER: sanityn test_32a: @@@@@@ FAIL: cached truncate - wrong file size
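The dmesg excerpt above can be reduced to a short timeline to see how quickly the events follow each other. This is only a sketch: the event labels and the substrings used to pick them are hypothetical helpers chosen for these five lines, not anything defined by Lustre.

```python
import re

# Matches the "[ seconds.microseconds] message" prefix printed by dmesg.
DMESG_RE = re.compile(r"^\[\s*(\d+)\.(\d{6})\]\s*(.*)$")

# Hypothetical labels, keyed on substrings of the messages quoted above.
LABELS = [
    ("failed: rc = -107", "enqueue failed"),
    ("was lost", "connection lost"),
    ("was evicted", "evicted"),
    ("Connection restored", "connection restored"),
    ("DEBUG MARKER", "test FAIL marker"),
]


def timeline(lines):
    """Return (offset_usec, label) pairs relative to the first parsed line."""
    events = []
    for line in lines:
        m = DMESG_RE.match(line)
        if m is None:
            continue  # skip anything that is not a timestamped dmesg line
        t_usec = int(m.group(1)) * 1_000_000 + int(m.group(2))
        label = next((name for key, name in LABELS if key in m.group(3)), "other")
        events.append((t_usec, label))
    if not events:
        return []
    t0 = events[0][0]
    return [(t - t0, label) for t, label in events]


DMESG = [
    "[ 1346.778759] LustreError: 11-0: lustre-OST0000-osc-ffff895ad1180800: operation ldlm_enqueue to node 10.6.4.23@tcp failed: rc = -107",
    "[ 1346.785968] Lustre: lustre-OST0000-osc-ffff895ad1180800: Connection to lustre-OST0000 (at 10.6.4.23@tcp) was lost; in progress operations using this service will wait for recovery to complete",
    "[ 1346.787088] LustreError: 167-0: lustre-OST0000-osc-ffff895ad1180800: This client was evicted by lustre-OST0000; in progress operations using this service will fail.",
    "[ 1346.796872] Lustre: lustre-OST0000-osc-ffff895ad1180800: Connection restored to 10.6.4.23@tcp (at 10.6.4.23@tcp)",
    "[ 1346.951394] Lustre: DEBUG MARKER: sanityn test_32a: @@@@@@ FAIL: cached truncate - wrong file size",
]

events = timeline(DMESG)
for offset_usec, label in events:
    print("+%7d usec  %s" % (offset_usec, label))
```

For these lines the offsets are 0, 7209, 8329, 18113 and 172635 microseconds: the connection is lost, the client is evicted and the connection is restored within about 18 ms, and the test's FAIL marker is logged roughly 173 ms after the first error.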
It seems that mds0 and oss0 have a temporary connection error. It is unlikely to be a random network issue, because the other tests pass when test_32a fails, as we have observed many times.
Uploaded the following files. The debug logs are denoted "xxxxx".
sanityn.test_32a.debug_log.mds0.32a_only sanityn.test_32a.dmesg.mds0
sanityn.test_32a.debug_log.oss0.32a_only sanityn.test_32a.test_log.mds0
Test removed in LU-14838