LU-16271: replay-single test_89 FAIL: 20480 blocks leaked

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.16.0
    • Severity: 3

    Description

      This issue was created by maloo for Lai Siyao <lai.siyao@whamcloud.com>

      This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/d1100f6d-6ef7-4e77-9e31-f6de3a6b208e

      == replay-single test 89: no disk space leak on late ost connection ========================================================== 19:47:35 
      ...
      Waiting for orphan cleanup...
      CMD: onyx-60vm4 /usr/sbin/lctl list_param osp.*osc*.old_sync_processed 2> /dev/null
      osp.lustre-OST0000-osc-MDT0000.old_sync_processed
      osp.lustre-OST0001-osc-MDT0000.old_sync_processed
      wait 40 secs maximumly for onyx-60vm4 mds-ost sync done.
      CMD: onyx-60vm4 /usr/sbin/lctl get_param -n osp.*osc*.old_sync_processed
      sleep 5 for ZFS zfs
      Waiting for MDT destroys to complete
      CMD: onyx-60vm4 /usr/sbin/lctl get_param -n osp.*.destroys_in_flight
       replay-single test_89: @@@@@@ FAIL: 20480 blocks leaked 
        Trace dump:
        = /usr/lib64/lustre/tests/test-framework.sh:6524:error()
        = /usr/lib64/lustre/tests/replay-single.sh:3329:test_89()
        = /usr/lib64/lustre/tests/test-framework.sh:6860:run_one()
        = /usr/lib64/lustre/tests/test-framework.sh:6910:run_one_logged()
        = /usr/lib64/lustre/tests/test-framework.sh:6747:run_test()
        = /usr/lib64/lustre/tests/replay-single.sh:3331:main()
      

          Activity

            pjones Peter Jones added a comment -

            Landed for 2.16


            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/49257/
            Subject: LU-16271 ptlrpc: fix eviction right after recovery
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 2a6fda5b369c59a320941848968597244b8f3583

            gerrit Gerrit Updater added a comment - "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/49257/ Subject: LU-16271 ptlrpc: fix eviction right after recovery Project: fs/lustre-release Branch: master Current Patch Set: Commit: 2a6fda5b369c59a320941848968597244b8f3583

            adilger Andreas Dilger added a comment -

            Thanks for the quick patch. It's always annoying when there are autotest failures, and I've been trying to get the most common ones fixed so that other patch landings are not slowed down.


            "Alexander <alexander.boyko@hpe.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/49257
            Subject: LU-16271 ptlrpc: fix eviction right after recovery
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: a3e80d7bcf8d21421a4f18995c245ac84105aa80

            gerrit Gerrit Updater added a comment - "Alexander <alexander.boyko@hpe.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/49257 Subject: LU-16271 ptlrpc: fix eviction right after recovery Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: a3e80d7bcf8d21421a4f18995c245ac84105aa80

            aboyko Alexander Boyko added a comment -

            I've checked the logs and it looks a bit strange: MDT0 was evicted right after recovery.

            00010000:00080000:0.0:1669022027.575599:0:182954:0:(ldlm_lib.c:2820:target_recovery_thread()) @@@ processing final ping from 10.240.26.199@tcp:  req@000000002adb22e8 x1750088378241600/t0(0) o400->lustre-MDT0002-mdtlov_UUID@10.240.26.199@tcp:161/0 lens 224/0 e 6 to 0 dl 1669022036 ref 1 fl Complete:H/c0/ffffffff rc 0/-1 job:'ldlm_lock_repla.0'
            00010000:02000000:0.0:1669022027.575670:0:182954:0:(ldlm_lib.c:1766:target_finish_recovery()) lustre-OST0000: Recovery over after 1:10, of 6 clients 5 recovered and 1 was evicted.
            00000100:00080000:1.0:1669022027.578807:0:18353:0:(pinger.c:523:ping_evictor_main()) Last request was at 1669022027
            00000020:00080000:1.0:1669022027.578808:0:18353:0:(genops.c:1604:class_fail_export()) disconnecting export 00000000391424d3/lustre-MDT0000-mdtlov_UUID
            00000020:00000080:1.0:1669022027.578812:0:18353:0:(genops.c:1445:class_disconnect()) disconnect: cookie 0x8074ece8ff0c069e
            00000100:00080000:1.0:1669022027.578814:0:18353:0:(import.c:86:import_set_state_nolock()) 00000000ec9d3565 8�s�����: changing import state from FULL to CLOSED
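
            For context, this is roughly the decision the ping evictor makes here (a sketch only, assuming the usual Lustre server-side definitions such as struct obd_export; the real loop is in ptlrpc/pinger.c:ping_evictor_main() and differs in detail): an export whose exp_last_request_time is older than the eviction window gets failed, even when the "silence" was really the server-side recovery window.

            /* Sketch, not the actual pinger.c code: how an export ends up
             * considered stale.  `evict_window` is a placeholder for the
             * ping-eviction timeout. */
            static bool export_looks_stale(struct obd_export *exp,
                                           time64_t evict_window)
            {
                    time64_t deadline = exp->exp_last_request_time + evict_window;

                    /* No request seen from this client before the deadline, so
                     * the evictor treats it as dead and class_fail_export() is
                     * called - the disconnect shown in the log above. */
                    return ktime_get_real_seconds() > deadline;
            }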
            

            On the MDT0 side, it connected, went through recovery, and sent OST_CREATE/DESTROY, but got -107 (operation on an unconnected OST):

            00000100:00080000:0.0:1669021958.927892:0:348514:0:(import.c:86:import_set_state_nolock()) 000000000d82a1de lustre-OST0000_UUID: changing import state from DISCONN to CONNECTING
            00000100:00080000:0.0:1669021958.927894:0:348514:0:(import.c:534:import_select_connection()) lustre-OST0000-osc-MDT0000: connect to NID 10.240.26.198@tcp last attempt 7822
            00000100:00080000:0.0:1669021958.927895:0:348514:0:(import.c:581:import_select_connection()) lustre-OST0000-osc-MDT0000: tried all connections, increasing latency to 11s
            00000100:00080000:0.0:1669021958.927897:0:348514:0:(import.c:615:import_select_connection()) lustre-OST0000-osc-MDT0000: import 000000000d82a1de using connection 10.240.26.198@tcp/10.240.26.198@tcp
            00000100:00100000:1.0:1669021958.929148:0:11061:0:(client.c:2912:ptlrpc_free_committed()) lustre-OST0000-osc-MDT0000: committing for last_committed 55834574850 gen 2
            00000100:00080000:1.0:1669021958.929150:0:11061:0:(import.c:1074:ptlrpc_connect_interpret()) lustre-OST0000-osc-MDT0000: connect to target with instance 15
            00000100:00080000:1.0:1669021958.929151:0:11061:0:(import.c:934:ptlrpc_connect_set_flags()) lustre-OST0000-osc-MDT0000: Resetting ns_connect_flags to server flags: 0x2040401443000066
            00000100:00080000:1.0:1669021958.929153:0:11061:0:(import.c:1159:ptlrpc_connect_interpret()) connected to replayable target: lustre-OST0000_UUID
            00000100:00080000:1.0:1669021958.929153:0:11061:0:(import.c:1177:ptlrpc_connect_interpret()) connect to lustre-OST0000_UUID during recovery
            00000100:00080000:1.0:1669021958.929154:0:11061:0:(import.c:86:import_set_state_nolock()) 000000000d82a1de lustre-OST0000_UUID: changing import state from CONNECTING to REPLAY_LOCKS
            00000100:00080000:1.0:1669021958.929155:0:11061:0:(import.c:86:import_set_state_nolock()) 000000000d82a1de lustre-OST0000_UUID: changing import state from REPLAY_LOCKS to REPLAY_WAIT
            
            00010000:00080000:1.0:1669022027.578997:0:11061:0:(ldlm_request.c:1449:ldlm_cli_update_pool()) @@@ Zero SLV or limit found (SLV=0, limit=16625)  req@000000001bd81261 x1750088378241600/t0(0) o400->lustre-OST0000-osc-MDT0002@10.240.26.198@tcp:28/4 lens 224/224 e 6 to 0 dl 1669022037 ref 1 fl Rpc:RQU/c0/0 rc 0/0 job:'ldlm_lock_repla.0'
            00000100:00080000:1.0:1669022027.579029:0:11061:0:(import.c:86:import_set_state_nolock()) 000000000d82a1de lustre-OST0000_UUID: changing import state from REPLAY_WAIT to RECOVER
            00000100:00080000:1.0:1669022027.579030:0:11061:0:(client.c:3040:ptlrpc_resend_req()) @@@ going to resend  req@00000000c8b765cc x1750088378241536/t0(0) o400->lustre-OST0000-osc-MDT0000@10.240.26.198@tcp:28/4 lens 224/224 e 6 to 0 dl 1669022041 ref 1 fl Interpret:RQU/c0/0 rc 0/0 job:'ptlrpcd_rcv.0'
            00000100:00080000:1.0:1669022027.579033:0:11061:0:(client.c:3049:ptlrpc_resend_req()) @@@ it has reply, so skip it  req@00000000c8b765cc x1750088378241536/t0(0) o400->lustre-OST0000-osc-MDT0000@10.240.26.198@tcp:28/4 lens 224/224 e 6 to 0 dl 1669022041 ref 1 fl Interpret:RQU/c0/0 rc 0/0 job:'ptlrpcd_rcv.0'
            00000100:00080000:1.0:1669022027.579039:0:11061:0:(import.c:86:import_set_state_nolock()) 000000000d82a1de lustre-OST0000_UUID: changing import state from RECOVER to FULL
            00000004:00080000:1.0:1669022027.579044:0:11061:0:(osp_dev.c:1618:osp_import_event()) got connected
            00000100:02000000:1.0:1669022027.579044:0:11061:0:(import.c:1648:ptlrpc_import_recovery_state_machine()) lustre-OST0000-osc-MDT0000: Connection restored to  (at 10.240.26.198@tcp)
            
            00000100:00100000:0.0:1669022027.579228:0:11062:0:(client.c:1743:ptlrpc_send_new_req()) Sending RPC req@000000005e38fb65 pname:cluuid:pid:xid:nid:opc:job ptlrpcd_00_00:lustre-MDT0000-mdtlov_UUID:11062:1750088378275456:10.240.26.198@tcp:6:osp-syn-0-0.0
            00000100:02020000:1.0:1669022027.581000:0:441603:0:(client.c:1361:ptlrpc_check_status()) 11-0: lustre-OST0000-osc-MDT0000: operation ost_create to node 10.240.26.198@tcp failed: rc = -107
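
            As a side note on the rc above: -107 is -ENOTCONN, i.e. the OST had already torn down the export when the osp-syn OST_CREATE/OST_DESTROY arrived, which would explain why the destroys never completed and the test then counted the blocks as leaked. A trivial check of that mapping (illustrative, not code from the tree):

            #include <errno.h>
            #include <stdio.h>

            /* Illustration only: the rc = -107 in the log above is -ENOTCONN,
             * i.e. the OST no longer had a connected export for MDT0 when the
             * osp-syn request arrived. */
            int main(void)
            {
                    int rc = -107;

                    if (rc == -ENOTCONN)
                            printf("rc %d is -ENOTCONN (unconnected target)\n", rc);
                    return 0;
            }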
            

            It looks like the final ping that concluded recovery was still sitting on the OST side. After the stale client eviction, the OST processed the final pings and moved the imports to FULL, and after that the ping evictor found that the timeout had expired (no request had arrived from MDT0 between the final ping and its processing) and evicted it.
            I guess recovery should update exp_last_request_time to the recovery finish time for all clients.
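
            A minimal sketch of that suggestion (illustrative only, assuming the usual obd_export/obd_device field names; the patch that actually landed for this ticket may take a different approach):

            /* Hypothetical helper, not the landed change: when the target
             * finishes recovery, refresh every export's last-request time so
             * the ping evictor does not count the recovery window as client
             * idle time. */
            static void target_touch_exports_after_recovery(struct obd_device *obd)
            {
                    struct obd_export *exp;
                    time64_t now = ktime_get_real_seconds();

                    spin_lock(&obd->obd_dev_lock);
                    list_for_each_entry(exp, &obd->obd_exports, exp_obd_chain)
                            exp->exp_last_request_time = now;
                    spin_unlock(&obd->obd_dev_lock);
            }

            Called from target_finish_recovery() (or an equivalent point), this would keep just-recovered clients out of the evictor's stale window until their next ping.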

            adilger Andreas Dilger added a comment -

            Alexander, it looks like the landing of your patch caused this test to start failing, please see the following test report: https://testing.whamcloud.com/search?status%5B%5D=FAIL&test_set_script_id=f6a12204-32c3-11e0-a61c-52540025f9ae&sub_test_script_id=fd0b0e02-32c3-11e0-a61c-52540025f9ae&start_date=2022-07-01&end_date=2022-11-21&source=sub_tests#redirect

            People

              Assignee: aboyko Alexander Boyko
              Reporter: maloo Maloo
              Votes: 0
              Watchers: 5

              Dates

                Created:
                Updated:
                Resolved: