LU-16271: replay-single test_89 FAIL: 20480 blocks leaked

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.16.0
    • Severity: 3

    Description

      This issue was created by maloo for Lai Siyao <lai.siyao@whamcloud.com>

      This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/d1100f6d-6ef7-4e77-9e31-f6de3a6b208e

      == replay-single test 89: no disk space leak on late ost connection ========================================================== 19:47:35 
      ...
      Waiting for orphan cleanup...
      CMD: onyx-60vm4 /usr/sbin/lctl list_param osp.*osc*.old_sync_processed 2> /dev/null
      osp.lustre-OST0000-osc-MDT0000.old_sync_processed
      osp.lustre-OST0001-osc-MDT0000.old_sync_processed
      wait 40 secs maximumly for onyx-60vm4 mds-ost sync done.
      CMD: onyx-60vm4 /usr/sbin/lctl get_param -n osp.*osc*.old_sync_processed
      sleep 5 for ZFS zfs
      Waiting for MDT destroys to complete
      CMD: onyx-60vm4 /usr/sbin/lctl get_param -n osp.*.destroys_in_flight
       replay-single test_89: @@@@@@ FAIL: 20480 blocks leaked 
        Trace dump:
        = /usr/lib64/lustre/tests/test-framework.sh:6524:error()
        = /usr/lib64/lustre/tests/replay-single.sh:3329:test_89()
        = /usr/lib64/lustre/tests/test-framework.sh:6860:run_one()
        = /usr/lib64/lustre/tests/test-framework.sh:6910:run_one_logged()
        = /usr/lib64/lustre/tests/test-framework.sh:6747:run_test()
        = /usr/lib64/lustre/tests/replay-single.sh:3331:main()
      

          Activity

            pjones Peter Jones added a comment -

            Landed for 2.16


            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/49257/
            Subject: LU-16271 ptlrpc: fix eviction right after recovery
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 2a6fda5b369c59a320941848968597244b8f3583

            gerrit Gerrit Updater added a comment - "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/49257/ Subject: LU-16271 ptlrpc: fix eviction right after recovery Project: fs/lustre-release Branch: master Current Patch Set: Commit: 2a6fda5b369c59a320941848968597244b8f3583

            adilger Andreas Dilger added a comment -

            Thanks for the quick patch. It's always annoying when there are autotest failures, and I've been trying to get the most common ones fixed so that other patch landings are not slowed down.


            "Alexander <alexander.boyko@hpe.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/49257
            Subject: LU-16271 ptlrpc: fix eviction right after recovery
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: a3e80d7bcf8d21421a4f18995c245ac84105aa80

            gerrit Gerrit Updater added a comment - "Alexander <alexander.boyko@hpe.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/49257 Subject: LU-16271 ptlrpc: fix eviction right after recovery Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: a3e80d7bcf8d21421a4f18995c245ac84105aa80

            aboyko Alexander Boyko added a comment -

            I've checked the logs and it looks a bit strange: MDT0 was evicted right after recovery.

            00010000:00080000:0.0:1669022027.575599:0:182954:0:(ldlm_lib.c:2820:target_recovery_thread()) @@@ processing final ping from 10.240.26.199@tcp:  req@000000002adb22e8 x1750088378241600/t0(0) o400->lustre-MDT0002-mdtlov_UUID@10.240.26.199@tcp:161/0 lens 224/0 e 6 to 0 dl 1669022036 ref 1 fl Complete:H/c0/ffffffff rc 0/-1 job:'ldlm_lock_repla.0'
            00010000:02000000:0.0:1669022027.575670:0:182954:0:(ldlm_lib.c:1766:target_finish_recovery()) lustre-OST0000: Recovery over after 1:10, of 6 clients 5 recovered and 1 was evicted.
            00000100:00080000:1.0:1669022027.578807:0:18353:0:(pinger.c:523:ping_evictor_main()) Last request was at 1669022027
            00000020:00080000:1.0:1669022027.578808:0:18353:0:(genops.c:1604:class_fail_export()) disconnecting export 00000000391424d3/lustre-MDT0000-mdtlov_UUID
            00000020:00000080:1.0:1669022027.578812:0:18353:0:(genops.c:1445:class_disconnect()) disconnect: cookie 0x8074ece8ff0c069e
            00000100:00080000:1.0:1669022027.578814:0:18353:0:(import.c:86:import_set_state_nolock()) 00000000ec9d3565 8�s�����: changing import state from FULL to CLOSED
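
            For context, this is roughly the decision the ping evictor makes here (a sketch only, assuming the usual Lustre server-side definitions such as struct obd_export; the real loop is in ptlrpc/pinger.c:ping_evictor_main() and differs in detail): an export whose exp_last_request_time is older than the eviction window gets failed, even when the "silence" was really the server-side recovery window.

            /* Sketch, not the actual pinger.c code: how an export ends up
             * considered stale.  `evict_window` is a placeholder for the
             * ping-eviction timeout. */
            static bool export_looks_stale(struct obd_export *exp,
                                           time64_t evict_window)
            {
                    time64_t deadline = exp->exp_last_request_time + evict_window;

                    /* No request seen from this client before the deadline, so
                     * the evictor treats it as dead and class_fail_export() is
                     * called - the disconnect shown in the log above. */
                    return ktime_get_real_seconds() > deadline;
            }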
            

            On the MDT0 side, it connected, went through recovery, and sent OST_CREATE/DESTROY, but got -107 (operation on an unconnected OST):

            00000100:00080000:0.0:1669021958.927892:0:348514:0:(import.c:86:import_set_state_nolock()) 000000000d82a1de lustre-OST0000_UUID: changing import state from DISCONN to CONNECTING
            00000100:00080000:0.0:1669021958.927894:0:348514:0:(import.c:534:import_select_connection()) lustre-OST0000-osc-MDT0000: connect to NID 10.240.26.198@tcp last attempt 7822
            00000100:00080000:0.0:1669021958.927895:0:348514:0:(import.c:581:import_select_connection()) lustre-OST0000-osc-MDT0000: tried all connections, increasing latency to 11s
            00000100:00080000:0.0:1669021958.927897:0:348514:0:(import.c:615:import_select_connection()) lustre-OST0000-osc-MDT0000: import 000000000d82a1de using connection 10.240.26.198@tcp/10.240.26.198@tcp
            00000100:00100000:1.0:1669021958.929148:0:11061:0:(client.c:2912:ptlrpc_free_committed()) lustre-OST0000-osc-MDT0000: committing for last_committed 55834574850 gen 2
            00000100:00080000:1.0:1669021958.929150:0:11061:0:(import.c:1074:ptlrpc_connect_interpret()) lustre-OST0000-osc-MDT0000: connect to target with instance 15
            00000100:00080000:1.0:1669021958.929151:0:11061:0:(import.c:934:ptlrpc_connect_set_flags()) lustre-OST0000-osc-MDT0000: Resetting ns_connect_flags to server flags: 0x2040401443000066
            00000100:00080000:1.0:1669021958.929153:0:11061:0:(import.c:1159:ptlrpc_connect_interpret()) connected to replayable target: lustre-OST0000_UUID
            00000100:00080000:1.0:1669021958.929153:0:11061:0:(import.c:1177:ptlrpc_connect_interpret()) connect to lustre-OST0000_UUID during recovery
            00000100:00080000:1.0:1669021958.929154:0:11061:0:(import.c:86:import_set_state_nolock()) 000000000d82a1de lustre-OST0000_UUID: changing import state from CONNECTING to REPLAY_LOCKS
            00000100:00080000:1.0:1669021958.929155:0:11061:0:(import.c:86:import_set_state_nolock()) 000000000d82a1de lustre-OST0000_UUID: changing import state from REPLAY_LOCKS to REPLAY_WAIT
            
            00010000:00080000:1.0:1669022027.578997:0:11061:0:(ldlm_request.c:1449:ldlm_cli_update_pool()) @@@ Zero SLV or limit found (SLV=0, limit=16625)  req@000000001bd81261 x1750088378241600/t0(0) o400->lustre-OST0000-osc-MDT0002@10.240.26.198@tcp:28/4 lens 224/224 e 6 to 0 dl 1669022037 ref 1 fl Rpc:RQU/c0/0 rc 0/0 job:'ldlm_lock_repla.0'
            00000100:00080000:1.0:1669022027.579029:0:11061:0:(import.c:86:import_set_state_nolock()) 000000000d82a1de lustre-OST0000_UUID: changing import state from REPLAY_WAIT to RECOVER
            00000100:00080000:1.0:1669022027.579030:0:11061:0:(client.c:3040:ptlrpc_resend_req()) @@@ going to resend  req@00000000c8b765cc x1750088378241536/t0(0) o400->lustre-OST0000-osc-MDT0000@10.240.26.198@tcp:28/4 lens 224/224 e 6 to 0 dl 1669022041 ref 1 fl Interpret:RQU/c0/0 rc 0/0 job:'ptlrpcd_rcv.0'
            00000100:00080000:1.0:1669022027.579033:0:11061:0:(client.c:3049:ptlrpc_resend_req()) @@@ it has reply, so skip it  req@00000000c8b765cc x1750088378241536/t0(0) o400->lustre-OST0000-osc-MDT0000@10.240.26.198@tcp:28/4 lens 224/224 e 6 to 0 dl 1669022041 ref 1 fl Interpret:RQU/c0/0 rc 0/0 job:'ptlrpcd_rcv.0'
            00000100:00080000:1.0:1669022027.579039:0:11061:0:(import.c:86:import_set_state_nolock()) 000000000d82a1de lustre-OST0000_UUID: changing import state from RECOVER to FULL
            00000004:00080000:1.0:1669022027.579044:0:11061:0:(osp_dev.c:1618:osp_import_event()) got connected
            00000100:02000000:1.0:1669022027.579044:0:11061:0:(import.c:1648:ptlrpc_import_recovery_state_machine()) lustre-OST0000-osc-MDT0000: Connection restored to  (at 10.240.26.198@tcp)
            
            00000100:00100000:0.0:1669022027.579228:0:11062:0:(client.c:1743:ptlrpc_send_new_req()) Sending RPC req@000000005e38fb65 pname:cluuid:pid:xid:nid:opc:job ptlrpcd_00_00:lustre-MDT0000-mdtlov_UUID:11062:1750088378275456:10.240.26.198@tcp:6:osp-syn-0-0.0
            00000100:02020000:1.0:1669022027.581000:0:441603:0:(client.c:1361:ptlrpc_check_status()) 11-0: lustre-OST0000-osc-MDT0000: operation ost_create to node 10.240.26.198@tcp failed: rc = -107
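
            As a side note on the rc above: -107 is -ENOTCONN, i.e. the OST had already torn down the export when the osp-syn OST_CREATE/OST_DESTROY arrived, which would explain why the destroys never completed and the test then counted the blocks as leaked. A trivial check of that mapping (illustrative, not code from the tree):

            #include <errno.h>
            #include <stdio.h>

            /* Illustration only: the rc = -107 in the log above is -ENOTCONN,
             * i.e. the OST no longer had a connected export for MDT0 when the
             * osp-syn request arrived. */
            int main(void)
            {
                    int rc = -107;

                    if (rc == -ENOTCONN)
                            printf("rc %d is -ENOTCONN (unconnected target)\n", rc);
                    return 0;
            }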
            

            It looks like the final ping that concluded recovery was still sitting on the OST side. After the stale client eviction, the OST processed the final pings and moved the imports to FULL, and after that the ping evictor found that the timeout had expired (no request had arrived from MDT0 between the final ping and its processing) and evicted it.
            I guess recovery should update exp_last_request_time to the recovery finish time for all clients.
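
            A minimal sketch of that suggestion (illustrative only, assuming the usual obd_export/obd_device field names; the patch that actually landed for this ticket may take a different approach):

            /* Hypothetical helper, not the landed change: when the target
             * finishes recovery, refresh every export's last-request time so
             * the ping evictor does not count the recovery window as client
             * idle time. */
            static void target_touch_exports_after_recovery(struct obd_device *obd)
            {
                    struct obd_export *exp;
                    time64_t now = ktime_get_real_seconds();

                    spin_lock(&obd->obd_dev_lock);
                    list_for_each_entry(exp, &obd->obd_exports, exp_obd_chain)
                            exp->exp_last_request_time = now;
                    spin_unlock(&obd->obd_dev_lock);
            }

            Called from target_finish_recovery() (or an equivalent point), this would keep just-recovered clients out of the evictor's stale window until their next ping.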

            adilger Andreas Dilger added a comment -

            Alexander, it looks like the landing of your patch caused this test to start failing, please see the following test report: https://testing.whamcloud.com/search?status%5B%5D=FAIL&test_set_script_id=f6a12204-32c3-11e0-a61c-52540025f9ae&sub_test_script_id=fd0b0e02-32c3-11e0-a61c-52540025f9ae&start_date=2022-07-01&end_date=2022-11-21&source=sub_tests#redirect

            People

              Assignee: aboyko Alexander Boyko
              Reporter: maloo Maloo
              Votes: 0
              Watchers: 5

              Dates

                Created:
                Updated:
                Resolved: