Lustre / LU-17962

conf-sanity test_32a: failed with replace_nids operation already in progress

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version: Lustre 2.17.0
    • Affects Versions: Lustre 2.16.0, Lustre 2.15.6
    • Severity: 3

    Description

      This issue was created by maloo for Andreas Dilger <adilger@whamcloud.com>

      This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/4c9dfb66-c4ee-4ceb-9fd7-436f4fc46eb8

      test_32a failed with the following error:

      CMD: trevis-128vm8 mount -t lustre -o nosvc t32fs-mdt1/mdt1 /tmp/t32/mnt/mdt
      CMD: trevis-128vm8 /usr/sbin/lctl replace_nids t32fs-OST0000 10.240.45.26@tcp
      trevis-128vm8: error: replace_nids: Operation now in progress
      pdsh@trevis-128vm1: trevis-128vm8: ssh exited with exit code 115
      CMD: trevis-128vm8 /usr/sbin/lctl dl
        0 UP osd-zfs t32fs-MDT0000-osd t32fs-MDT0000-osd_UUID 5
        1 UP mgs MGS MGS 7
        2 UP mgc MGC10.240.45.31@tcp 524e7669-b108-4f05-8270-6dd5e88a654d 5
       conf-sanity test_32a: @@@@@@ FAIL: replace_nids t32fs-OST0000 10.240.45.26@tcp failed 
      

      Test session details:
      clients: https://build.whamcloud.com/job/lustre-reviews/100649 - 4.18.0-477.27.1.el8_8.x86_64
      servers: https://build.whamcloud.com/job/lustre-reviews/100649 - 4.18.0-477.27.1.el8_lustre.x86_64


      VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
      conf-sanity test_32a - replace_nids t32fs-OST0000 10.240.45.26@tcp failed

Activity

            yujian Jian Yu added a comment - edited

            The memory leak failure occurred consistently in the following interop test sessions with 2.16.0 clients and 2.15.5 servers:
            lustre-reviews_el8.10-x86_64_full-dne-part-3
            lustre-reviews_el8.10-x86_64_el9.4-x86_64_full-dne-part-3
            lustre-reviews_el8.10-x86_64_sles15sp6-x86_64_full-dne-part-3
            lustre-reviews_el8.10-x86_64_ubuntu2404-x86_64_full-dne-part-3

            conf-sanity tests 29, 46a, 50h, 51, 70e, and 93 failed with this issue.

            yujian Jian Yu added a comment -

            Test session details:
            clients: https://build.whamcloud.com/job/lustre-master/4581 - 4.18.0-553.16.1.el8_10.x86_64
            servers: https://build.whamcloud.com/job/lustre-b2_15/94 - 4.18.0-553.5.1.el8_lustre.x86_64
            https://testing.whamcloud.com/test_sets/31e964e5-1404-4a3c-b868-30ad5dd3fcc6
            conf-sanity tests 29, 50h, 51, 70e, and 93 failed with this issue:

            [21909.029507] LustreError: 385894:0:(class_obd.c:895:obdclass_exit()) obd_memory max: 6606559, leaked: 20
            

             

            adilger Andreas Dilger added a comment -

            Hi Emoly,
            there are a number of test cases that are failing due to memory leaks:

            • conf-sanity test_32a
            • conf-sanity test_29 (100% failure with 2.16 client + old server interop)
            • mmp interop testing after sanity-flr (full-part-1)

            The test_32a failure is also causing later conf-sanity test_32b... to fail with "Mounting the MDT" errors.

            It is likely that the memory leaks actually happen in an earlier subtest and are only reported the next time the modules are unloaded. For debugging issues like this it would be best to add a check at the end of run_one_logged() (e.g. CLEANUP_SUBTEST=y or similar) to optionally call cleanupall()/setupall() after every subtest, and then run these test sessions again to see which subtest actually causes the memory leak.
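            (A minimal sketch of what such a hook might look like, assuming it is placed at the end of run_one_logged() in lustre/tests/test-framework.sh; CLEANUP_SUBTEST is a hypothetical name, not an existing option:)

                # Hypothetical sketch only: CLEANUP_SUBTEST does not exist today and the
                # exact hook point at the end of run_one_logged() should be verified.
                # Tearing the filesystem down and setting it back up between subtests
                # makes any leak warning appear right after the subtest that caused it.
                if [ "${CLEANUP_SUBTEST:-n}" = "y" ]; then
                        cleanupall
                        setupall
                fi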

            Once the leak is isolated to a single subtest, then it should be possible to run the subtest with debug=+malloc tracing and use leak_finder.pl to identify which allocation is being leaked.
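            (A possible workflow for that second step, sketched from the description above; the exact leak_finder.pl invocation and the timing of the debug-log dump should be double-checked against the tree:)

                # Illustrative only: enable malloc tracing, run the suspect subtest,
                # dump the debug log, then look for allocations with no matching free.
                lctl set_param debug=+malloc
                lctl clear
                ONLY=32a bash lustre/tests/conf-sanity.sh
                lctl dk > /tmp/debug.conf-sanity.32a
                perl lustre/tests/leak_finder.pl /tmp/debug.conf-sanity.32a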

            adilger Andreas Dilger added a comment -

            I've also seen this with "383229505, leaked: 32784", so exactly 1/2 of the amount of leakage.

            adilger Andreas Dilger added a comment -

            I looked at the most recent failure of conf-sanity test_32a and it showed:
            https://testing.whamcloud.com/test_sets/766e6cac-fc67-40be-966a-06ab22763f00

             [15651.917942] LustreError: 22128:0:(class_obd.c:841:obdclass_exit()) obd_memory max: 381116069, leaked: 65568
            

            and then the next image failed with:

            trevis-24vm6: rm: cannot remove ‘/tmp/t32/mnt/mdt’: Device or resource busy
            

            so it looks like this failure has changed since it was first reported. There are some timeouts of 32a as well, but those are in interop with b2_14.

            It probably makes sense to run conf-sanity.sh with unmounts between subtests to see where the memory is being leaked.
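            (For example, assuming the hypothetical CLEANUP_SUBTEST hook sketched above were added, something like the following could narrow down the leaking subtest:)

                # ONLY= restricts which subtests run; CLEANUP_SUBTEST is hypothetical.
                cd lustre/tests
                CLEANUP_SUBTEST=y ONLY="29 32a 46a 50h 51 70e 93" bash conf-sanity.sh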

            adilger Andreas Dilger added a comment -

            This is also causing test_32b and test_32c to fail with "Mounting the MDT" or "Mounting the OST1" or similar.


            People

              Assignee: emoly.liu Emoly Liu
              Reporter: maloo Maloo
              Votes: 0
              Watchers: 7
