Details
-
Bug
-
Resolution: Fixed
-
Medium
-
Lustre 2.17.0
-
3
-
9223372036854775807
Description
It appears recovery-small.sh test_131 is failing regularly in Janitor testing:
https://testing.whamcloud.com/gerrit-janitor/61262/results.html reports 50 failures in the past month, for example:
https://testing.whamcloud.com/gerrit-janitor/61262/testresults/recovery-small-ldiskfs-DNE-rocky8.10_x86_64-rocky8.10_x86_64-retry1/
It appears that what is causing the problem is that the "dd sleep 4" fail_loc=0x80000414 on the client is waking up before the server fail_loc=0x8000031e is set, so the client completes the "dd" successfully before it is evicted.
It is a bit hard to tell, because the timestamps in the client and server dmesg logs are relative to boot and not absolute times shown by dmesg -T.
The client logs show the dd thread waking up at 6113.67, and using the DEBUG MARKER line for reference this works out to @1773674221.67 +/- 1s:
[ 6108.477238] Lustre: DEBUG MARKER: == recovery-small test 131: IO vs evict results to IO under staled lock === 11:16:56 (1773674216) [ 6109.579798] LustreError: 94095:0:(osc_request.c:2983:osc_build_rpc()) cfs_fail_timeout id 414 sleeping for 4000ms [ 6113.672214] LustreError: 94095:0:(osc_request.c:2983:osc_build_rpc()) cfs_fail_timeout id 414 awake [ 6114.881162] LustreError: lustre-OST0000-osc-ffff98dc07f29800: operation ost_statfs to node 192.168.201.103@tcp failed: rc = -107 $ echo $((1773674216 - 6108 + 6113)) 1773674221
Checking the server logs shows the eviction is triggered at 6116.64, and using the DEBUG MARKER timestamp this works out to @1773674222.64 +/-1s.
[ 6110.836160] Lustre: DEBUG MARKER: == recovery-small test 131: IO vs evict results to IO under staled lock === 11:16:56 (1773674216) [ 6116.644026] Lustre: 85956:0:(genops.c:1793:obd_export_evict_by_uuid()) lustre-OST0000: evicting 50b6f219-0167-4a18-8ed7-c655d95343ea at adminstrative request [ 6116.677331] LustreError: 13152:0:(ldlm_lockd.c:2933:ldlm_bl_thread_exports()) cfs_fail_timeout id 31e sleeping for 4000ms [ 6120.377949] Lustre: DEBUG MARKER: recovery-small test_131: @@@@@@ FAIL: dd succeeded [ 6120.746865] LustreError: 13152:0:(ldlm_lockd.c:2933:ldlm_bl_thread_exports()) cfs_fail_timeout id 31e awake $ echo $((1773674216 - 6110 + 6116)) 1773674222