[LU-16271] replay-single test_89 FAIL: 20480 blocks leaked Created: 27/Oct/22 Updated: 03/Jan/23 Resolved: 03/Jan/23 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.16.0 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Maloo | Assignee: | Alexander Boyko |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
This issue was created by maloo for Lai Siyao <lai.siyao@whamcloud.com>

This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/d1100f6d-6ef7-4e77-9e31-f6de3a6b208e

== replay-single test 89: no disk space leak on late ost connection ========================================================== 19:47:35
...
Waiting for orphan cleanup...
CMD: onyx-60vm4 /usr/sbin/lctl list_param osp.*osc*.old_sync_processed 2> /dev/null
osp.lustre-OST0000-osc-MDT0000.old_sync_processed
osp.lustre-OST0001-osc-MDT0000.old_sync_processed
wait 40 secs maximumly for onyx-60vm4 mds-ost sync done.
CMD: onyx-60vm4 /usr/sbin/lctl get_param -n osp.*osc*.old_sync_processed
sleep 5 for ZFS zfs
Waiting for MDT destroys to complete
CMD: onyx-60vm4 /usr/sbin/lctl get_param -n osp.*.destroys_in_flight
 replay-single test_89: @@@@@@ FAIL: 20480 blocks leaked
 Trace dump:
 = /usr/lib64/lustre/tests/test-framework.sh:6524:error()
 = /usr/lib64/lustre/tests/replay-single.sh:3329:test_89()
 = /usr/lib64/lustre/tests/test-framework.sh:6860:run_one()
 = /usr/lib64/lustre/tests/test-framework.sh:6910:run_one_logged()
 = /usr/lib64/lustre/tests/test-framework.sh:6747:run_test()
 = /usr/lib64/lustre/tests/replay-single.sh:3331:main() |
| Comments |
| Comment by Andreas Dilger [ 22/Nov/22 ] |
|
I thought the first failure like this was 2022-09-02 (2.15.51.86) and then not another failure until 2022-10-15 (2.15.52.63), but I now found LU-14201, which looks like a duplicate and has been around much longer. |
| Comment by Andreas Dilger [ 22/Nov/22 ] |
|
Duplicate, but seems to be hitting more regularly. Only 5 failures in 2022-07 and 2022-08, but 16 in the past 4 weeks. |
| Comment by Andreas Dilger [ 22/Nov/22 ] |
|
It looks like this recent wave of failures was caused by the landing of patch https://review.whamcloud.com/47928 "LU-16002 ptlrpc: reduce pinger eviction time". That patch accounted for all 5 failures of this subtest in testing before it landed, and its landing on 2022-10-14 coincides with the increase in test failures. |
| Comment by Andreas Dilger [ 22/Nov/22 ] |
|
Alexander, it looks like the landing of your patch caused this test to start failing; please see the following test report: |
| Comment by Alexander Boyko [ 28/Nov/22 ] |
|
I've checked the logs and it looks a bit strange: MDT0 was evicted right after recovery:

00010000:00080000:0.0:1669022027.575599:0:182954:0:(ldlm_lib.c:2820:target_recovery_thread()) @@@ processing final ping from 10.240.26.199@tcp: req@000000002adb22e8 x1750088378241600/t0(0) o400->lustre-MDT0002-mdtlov_UUID@10.240.26.199@tcp:161/0 lens 224/0 e 6 to 0 dl 1669022036 ref 1 fl Complete:H/c0/ffffffff rc 0/-1 job:'ldlm_lock_repla.0'
00010000:02000000:0.0:1669022027.575670:0:182954:0:(ldlm_lib.c:1766:target_finish_recovery()) lustre-OST0000: Recovery over after 1:10, of 6 clients 5 recovered and 1 was evicted.
00000100:00080000:1.0:1669022027.578807:0:18353:0:(pinger.c:523:ping_evictor_main()) Last request was at 1669022027
00000020:00080000:1.0:1669022027.578808:0:18353:0:(genops.c:1604:class_fail_export()) disconnecting export 00000000391424d3/lustre-MDT0000-mdtlov_UUID
00000020:00000080:1.0:1669022027.578812:0:18353:0:(genops.c:1445:class_disconnect()) disconnect: cookie 0x8074ece8ff0c069e
00000100:00080000:1.0:1669022027.578814:0:18353:0:(import.c:86:import_set_state_nolock()) 00000000ec9d3565 8�s�����: changing import state from FULL to CLOSED

On the MDT0 side, it connected, went through recovery, and sent OST_CREATE/DESTROY but got -107 (operation on unconnected OST):

00000100:00080000:0.0:1669021958.927892:0:348514:0:(import.c:86:import_set_state_nolock()) 000000000d82a1de lustre-OST0000_UUID: changing import state from DISCONN to CONNECTING
00000100:00080000:0.0:1669021958.927894:0:348514:0:(import.c:534:import_select_connection()) lustre-OST0000-osc-MDT0000: connect to NID 10.240.26.198@tcp last attempt 7822
00000100:00080000:0.0:1669021958.927895:0:348514:0:(import.c:581:import_select_connection()) lustre-OST0000-osc-MDT0000: tried all connections, increasing latency to 11s
00000100:00080000:0.0:1669021958.927897:0:348514:0:(import.c:615:import_select_connection()) lustre-OST0000-osc-MDT0000: import 000000000d82a1de using connection 10.240.26.198@tcp/10.240.26.198@tcp
00000100:00100000:1.0:1669021958.929148:0:11061:0:(client.c:2912:ptlrpc_free_committed()) lustre-OST0000-osc-MDT0000: committing for last_committed 55834574850 gen 2
00000100:00080000:1.0:1669021958.929150:0:11061:0:(import.c:1074:ptlrpc_connect_interpret()) lustre-OST0000-osc-MDT0000: connect to target with instance 15
00000100:00080000:1.0:1669021958.929151:0:11061:0:(import.c:934:ptlrpc_connect_set_flags()) lustre-OST0000-osc-MDT0000: Resetting ns_connect_flags to server flags: 0x2040401443000066
00000100:00080000:1.0:1669021958.929153:0:11061:0:(import.c:1159:ptlrpc_connect_interpret()) connected to replayable target: lustre-OST0000_UUID
00000100:00080000:1.0:1669021958.929153:0:11061:0:(import.c:1177:ptlrpc_connect_interpret()) connect to lustre-OST0000_UUID during recovery
00000100:00080000:1.0:1669021958.929154:0:11061:0:(import.c:86:import_set_state_nolock()) 000000000d82a1de lustre-OST0000_UUID: changing import state from CONNECTING to REPLAY_LOCKS
00000100:00080000:1.0:1669021958.929155:0:11061:0:(import.c:86:import_set_state_nolock()) 000000000d82a1de lustre-OST0000_UUID: changing import state from REPLAY_LOCKS to REPLAY_WAIT
00010000:00080000:1.0:1669022027.578997:0:11061:0:(ldlm_request.c:1449:ldlm_cli_update_pool()) @@@ Zero SLV or limit found (SLV=0, limit=16625) req@000000001bd81261 x1750088378241600/t0(0) o400->lustre-OST0000-osc-MDT0002@10.240.26.198@tcp:28/4 lens 224/224 e 6 to 0 dl 1669022037 ref 1 fl Rpc:RQU/c0/0 rc 0/0 job:'ldlm_lock_repla.0'
00000100:00080000:1.0:1669022027.579029:0:11061:0:(import.c:86:import_set_state_nolock()) 000000000d82a1de lustre-OST0000_UUID: changing import state from REPLAY_WAIT to RECOVER
00000100:00080000:1.0:1669022027.579030:0:11061:0:(client.c:3040:ptlrpc_resend_req()) @@@ going to resend req@00000000c8b765cc x1750088378241536/t0(0) o400->lustre-OST0000-osc-MDT0000@10.240.26.198@tcp:28/4 lens 224/224 e 6 to 0 dl 1669022041 ref 1 fl Interpret:RQU/c0/0 rc 0/0 job:'ptlrpcd_rcv.0'
00000100:00080000:1.0:1669022027.579033:0:11061:0:(client.c:3049:ptlrpc_resend_req()) @@@ it has reply, so skip it req@00000000c8b765cc x1750088378241536/t0(0) o400->lustre-OST0000-osc-MDT0000@10.240.26.198@tcp:28/4 lens 224/224 e 6 to 0 dl 1669022041 ref 1 fl Interpret:RQU/c0/0 rc 0/0 job:'ptlrpcd_rcv.0'
00000100:00080000:1.0:1669022027.579039:0:11061:0:(import.c:86:import_set_state_nolock()) 000000000d82a1de lustre-OST0000_UUID: changing import state from RECOVER to FULL
00000004:00080000:1.0:1669022027.579044:0:11061:0:(osp_dev.c:1618:osp_import_event()) got connected
00000100:02000000:1.0:1669022027.579044:0:11061:0:(import.c:1648:ptlrpc_import_recovery_state_machine()) lustre-OST0000-osc-MDT0000: Connection restored to (at 10.240.26.198@tcp)
00000100:00100000:0.0:1669022027.579228:0:11062:0:(client.c:1743:ptlrpc_send_new_req()) Sending RPC req@000000005e38fb65 pname:cluuid:pid:xid:nid:opc:job ptlrpcd_00_00:lustre-MDT0000-mdtlov_UUID:11062:1750088378275456:10.240.26.198@tcp:6:osp-syn-0-0.0
00000100:02020000:1.0:1669022027.581000:0:441603:0:(client.c:1361:ptlrpc_check_status()) 11-0: lustre-OST0000-osc-MDT0000: operation ost_create to node 10.240.26.198@tcp failed: rc = -107

It looks like the last ping needed to finish recovery was sitting on the OST side. After stale client eviction, the OST processed the final pings and moved the imports to FULL, and right after that the ping evictor found that the timeout had expired (no request from MDT0 between the final ping being sent and it being processed) and evicted it. |
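To make the ordering easier to follow, below is a minimal, self-contained C sketch of the race described in the comment above. It is not the Lustre implementation: the struct, field names, and the 38-second eviction window are assumptions chosen purely to model "final ping parked for a ~70-second recovery, evictor runs right after recovery finishes and sees a stale last-request time", using the recovery duration from the logs above.

/*
 * Simplified model of the race; NOT the actual Lustre code.
 */
#include <stdio.h>
#include <stdbool.h>
#include <time.h>

#define FAKE_EVICT_TIMEOUT 38		/* assumed eviction window, seconds */

struct fake_export {
	time_t	last_request_time;	/* bumped only when a request is handled */
	bool	evicted;
};

/* Evictor check: evict an export whose last handled request is too old. */
static void fake_ping_evictor_check(struct fake_export *exp, time_t now)
{
	if (now - exp->last_request_time > FAKE_EVICT_TIMEOUT) {
		exp->evicted = true;
		printf("evicting export, last request %lld seconds ago\n",
		       (long long)(now - exp->last_request_time));
	}
}

int main(void)
{
	/* Relative timeline taken from the logs: MDT0's final ping arrives
	 * at t=0 but is only queued; recovery takes ~70 seconds ("Recovery
	 * over after 1:10"), so nothing bumps last_request_time until then. */
	struct fake_export mdt0_export = { .last_request_time = 0, .evicted = false };
	time_t recovery_done = 70;

	/* The evictor wakes up right after recovery finishes. From its point
	 * of view the export has been idle for the whole recovery window,
	 * longer than the (reduced) eviction timeout, so MDT0 gets evicted
	 * and its subsequent OST_CREATE/DESTROY RPCs fail with -107, leaving
	 * the blocks that the test then reports as leaked. */
	fake_ping_evictor_check(&mdt0_export, recovery_done);

	return mdt0_export.evicted ? 1 : 0;
}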
| Comment by Gerrit Updater [ 28/Nov/22 ] |
|
"Alexander <alexander.boyko@hpe.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/49257 |
| Comment by Andreas Dilger [ 29/Nov/22 ] |
|
Thanks for the quick patch. Always annoying when there are autotest failures, and I've been trying to get the most common ones fixed so that other patch landings are not slowed down. |
| Comment by Gerrit Updater [ 03/Jan/23 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/49257/ |
| Comment by Peter Jones [ 03/Jan/23 ] |
|
Landed for 2.16 |