conf-sanity test_32a: failed with replace_nids operation already in progress

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.17.0
    • Affects Version/s: Lustre 2.16.0, Lustre 2.15.6
    • Severity: 3
    • Rank (Obsolete): 9223372036854775807

    Description

      This issue was created by maloo for Andreas Dilger <adilger@whamcloud.com>

      This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/4c9dfb66-c4ee-4ceb-9fd7-436f4fc46eb8

      test_32a failed with the following error:

      CMD: trevis-128vm8 mount -t lustre -o nosvc t32fs-mdt1/mdt1 /tmp/t32/mnt/mdt
      CMD: trevis-128vm8 /usr/sbin/lctl replace_nids t32fs-OST0000 10.240.45.26@tcp
      trevis-128vm8: error: replace_nids: Operation now in progress
      pdsh@trevis-128vm1: trevis-128vm8: ssh exited with exit code 115
      CMD: trevis-128vm8 /usr/sbin/lctl dl
        0 UP osd-zfs t32fs-MDT0000-osd t32fs-MDT0000-osd_UUID 5
        1 UP mgs MGS MGS 7
        2 UP mgc MGC10.240.45.31@tcp 524e7669-b108-4f05-8270-6dd5e88a654d 5
       conf-sanity test_32a: @@@@@@ FAIL: replace_nids t32fs-OST0000 10.240.45.26@tcp failed 
      

      Test session details:
      clients: https://build.whamcloud.com/job/lustre-reviews/100649 - 4.18.0-477.27.1.el8_8.x86_64
      servers: https://build.whamcloud.com/job/lustre-reviews/100649 - 4.18.0-477.27.1.el8_lustre.x86_64

      VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
      conf-sanity test_32a - replace_nids t32fs-OST0000 10.240.45.26@tcp failed

          Activity

            pjones Peter Jones added a comment -

            Merged for 2.17

            gerrit Gerrit Updater added a comment -

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/56709/
            Subject: LU-17962 mgc: free nidlist correctly
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 6ddf46420826cc66263599ba430c5144eabf766e

            yujian Jian Yu added a comment - Lustre 2.16.0 RC5 client with 2.15.5 server: https://testing.whamcloud.com/test_sets/5bf51531-c7ba-462d-aecb-d01083f98aba
            emoly.liu Emoly Liu added a comment -

            "Emoly Liu <emoly@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/56709
            Subject: LU-17962 mgc: free nidlist correctly
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 3
            Commit: 79515b97a31537505b914871b811a4e3cfc1ec1e

            emoly.liu Emoly Liu added a comment -

            The leak_finder.pl found the following leak:

            *** Leak: 20 bytes allocated at 00000000718f9558 (mgc_request.c:mgc_apply_recover_logs:1285:(nidlist), debug file line 7005)
            

            I will fix it soon.
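
When several leak reports need to be compared, the allocation site can be pulled out of each line mechanically. The following is a minimal sketch, assuming the report keeps the `*** Leak: N bytes allocated at ADDR (file:function:line:(variable), ...)` shape shown above; `leak_sites` is a hypothetical helper and is not part of leak_finder.pl.

```shell
# Hypothetical helper: summarize leak report lines read from stdin.
# Prints "file function line variable bytes" for every line that matches
# the "*** Leak:" format quoted in this ticket; other lines are ignored.
leak_sites() {
	sed -n 's/.*\*\*\* Leak: \([0-9][0-9]*\) bytes allocated at [0-9a-f]* (\([^:]*\):\([^:]*\):\([0-9]*\):(\([^)]*\)).*/\2 \3 \4 \5 \1/p'
}
```

Feeding it the report above would point at the `nidlist` allocation in `mgc_apply_recover_logs`, which is the allocation named in the patch subject.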

            gerrit Gerrit Updater added a comment -

            "Emoly Liu <emoly@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/56709
            Subject: LU-17962 tests: debug conf-sanity.sh test_29 failure
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 1cc5e26a883b4561d639e4bbf3ae6703f802f304

            yujian Jian Yu added a comment - - edited

            The memory leak failure occurred consistently in the following 2.16.0 clients with 2.15.5 servers interop test sessions:
            lustre-reviews_el8.10-x86_64_full-dne-part-3
            lustre-reviews_el8.10-x86_64_el9.4-x86_64_full-dne-part-3
            lustre-reviews_el8.10-x86_64_sles15sp6-x86_64_full-dne-part-3
            lustre-reviews_el8.10-x86_64_ubuntu2404-x86_64_full-dne-part-3

            conf-sanity test 29, 46a, 50h, 51, 70e, and 93 failed with this issue.

            yujian Jian Yu added a comment -

            Test session details:
            clients: https://build.whamcloud.com/job/lustre-master/4581 - 4.18.0-553.16.1.el8_10.x86_64
            servers: https://build.whamcloud.com/job/lustre-b2_15/94 - 4.18.0-553.5.1.el8_lustre.x86_64
            https://testing.whamcloud.com/test_sets/31e964e5-1404-4a3c-b868-30ad5dd3fcc6
            conf-sanity test 29, 50h, 51, 70e, and 93 failed with this issue:

            [21909.029507] LustreError: 385894:0:(class_obd.c:895:obdclass_exit()) obd_memory max: 6606559, leaked: 20
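
To check many console logs for this symptom, a small filter can extract the leaked byte count. This is a minimal sketch, assuming the message keeps the `obd_memory max: N, leaked: M` shape shown above; `leaked_bytes` is a hypothetical helper, not part of the Lustre test framework.

```shell
# Hypothetical helper: read console log lines on stdin and print the
# number that follows "leaked: " on each obd_memory line.
# Lines without an obd_memory message produce no output.
leaked_bytes() {
	sed -n 's/.*obd_memory max: [0-9]*, leaked: \([0-9][0-9]*\).*/\1/p'
}
```

Any output from this filter on a console log marks a module unload that reported a leak.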
            

             


            adilger Andreas Dilger added a comment -

            Hi Emoly,
            there are a number of test cases that are failing due to memory leaks:

            • conf-sanity test_32a
            • conf-sanity test_29 (100% failure with 2.16 client + old server interop)
            • mmp interop testing after sanity-flr (full-part-1)

            The test_32a failure is also causing later conf-sanity test_32b... to fail with "Mounting the MDT" errors.

            It is likely that the memory leaks actually happen in an earlier subtest and are only reported the next time the modules are unloaded. For debugging issues like this it would be best to add a check at the end of run_one_logged() (e.g. CLEANUP_SUBTEST=y or similar) to optionally call cleanupall()/setupall() after every subtest to allow isolating this issue, and then run these test sessions to see which subtest actually causes the memory leak.

            Once the leak is isolated to a single subtest, it should be possible to run the subtest with debug=+malloc tracing and use leak_finder.pl to identify which allocation is being leaked.
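
The per-subtest cleanup suggested above could look roughly like the following sketch. `CLEANUP_SUBTEST` is the name proposed in the comment; `maybe_cleanup_subtest` is a hypothetical helper, and `cleanupall`/`setupall` are stubbed here so the snippet runs on its own (in the real framework they would come from test-framework.sh). This is an illustration only, not the code that was merged.

```shell
# Stubs so the sketch runs standalone. In the real test framework these
# functions already exist and must not be redefined.
cleanupall() { echo "cleanupall (stub)"; }
setupall() { echo "setupall (stub)"; }

# Hypothetical helper, intended to be called at the end of run_one_logged().
# When CLEANUP_SUBTEST=y, tear the filesystem down and set it up again so
# that a leak is reported right after the subtest that caused it.
maybe_cleanup_subtest() {
	local testnum=$1

	# Opt-in only, because a full remount slows down every subtest.
	[ "${CLEANUP_SUBTEST:-n}" = "y" ] || return 0

	echo "test_$testnum: remounting to surface leaks"
	cleanupall || return 1
	setupall || return 1
}
```

A call such as `maybe_cleanup_subtest "$testnum"` after each subtest would tie any reported leak to the subtest that just ran, at the cost of one remount per subtest.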

            adilger Andreas Dilger added a comment -

            I've also seen this with "383229505, leaked: 32784", so exactly 1/2 of the amount of leakage.

            People

              Assignee: emoly.liu Emoly Liu
              Reporter: maloo Maloo
              Votes: 0
              Watchers: 7
