conf-sanity test_32c, test_32d: could not find any free loop device


    Description

      This issue was created by maloo for sarah <sarah@whamcloud.com>

      This issue relates to the following test suite run: http://maloo.whamcloud.com/test_sets/751ee23a-4106-11e3-a1e8-52540035b04c.

      The sub-test test_32c failed with the following error:

      test_32c failed with 1

      Info required for matching: conf-sanity 32c

          Activity

            adilger Andreas Dilger added a comment -

            False alarm - it is actually LU-4358 that is being hit, but is misdiagnosed as LU-4223.

            I'm closing this bug since it looks like it is not being hit in recent test runs.

            adilger Andreas Dilger added a comment -

            This patch was landed on 2013-12-09 but conf-sanity is still reporting this bug for failures:

              https://maloo.whamcloud.com/test_sets/7d7a27aa-66ae-11e3-93e2-52540035b04c
              https://maloo.whamcloud.com/test_sets/202cbd9e-66c5-11e3-93e2-52540035b04c
              https://maloo.whamcloud.com/test_sets/868c4f72-66fd-11e3-a234-52540035b04c

            and others.

            di.wang Di Wang added a comment -

            John, yes, this makes sense, I updated the patch, and please have a look.

            jhammond John Hammond added a comment -

            Hi Di,

            1. There are a few other places in lustre/utils/ where popen() is followed by fclose(), including one in is_e2fsprogs_feature_supp(). We should fix those too.
            2. This may not be enough. On my RHEL 6.4 kernel (2.6.32-358.18.1.el6.lustre.x86_64), if there are IOs still in flight to a loop device then it cannot be detached:
              # dd if=/dev/zero of=/tmp/0 bs=1M count=100
              100+0 records in
              100+0 records out
              104857600 bytes (105 MB) copied, 0.195863 s, 535 MB/s
              # losetup -f /tmp/0
              # losetup -a
              /dev/loop0: [fc01]:1055063 (/tmp/0)
              # dd if=/dev/zero of=/dev/loop0 bs=1M count=100; losetup -d /dev/loop0
              100+0 records in
              100+0 records out
              104857600 bytes (105 MB) copied, 0.15608 s, 672 MB/s
              loop: can't delete device /dev/loop0: Device or resource busy
              # losetup -a
              /dev/loop0: [fc01]:1055063 (/tmp/0)
              # losetup -d /dev/loop0
              #
              # losetup /dev/loop0 /tmp/0; dd if=/dev/zero of=/dev/loop0 oflag=sync bs=1M count=100; losetup -d /dev/loop0
              100+0 records in
              100+0 records out
              104857600 bytes (105 MB) copied, 3.72073 s, 28.2 MB/s
              loop: can't delete device /dev/loop0: Device or resource busy
              

              As you see above, asking for sync IO to the loop device is not enough. So we probably need to add some wait retry logic to loop_cleanup().
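
            The wait-and-retry idea above can be sketched roughly as follows. This is only an illustration, not the code from the patch: the signature of loop_cleanup() is not shown in this ticket, so the helper retry_while_busy(), the detach_fn callback type, and the fake_dev stand-in are all hypothetical names. A real callback would wrap whatever detach call loop_cleanup() performs and return -EBUSY while the device is still in use.

```c
#include <errno.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical callback type: returns 0 on success or a negative errno. */
typedef int (*detach_fn)(void *arg);

/*
 * Call fn(arg) up to max_tries times, sleeping delay_ms milliseconds
 * between attempts, for as long as it keeps returning -EBUSY.
 * Any other result (success or a different error) is returned at once.
 */
static int retry_while_busy(detach_fn fn, void *arg, int max_tries,
			    long delay_ms)
{
	struct timespec ts = { delay_ms / 1000, (delay_ms % 1000) * 1000000L };
	int rc = -EINVAL;	/* returned if max_tries <= 0 */
	int i;

	for (i = 0; i < max_tries; i++) {
		rc = fn(arg);
		if (rc != -EBUSY)
			return rc;
		if (i + 1 < max_tries)
			nanosleep(&ts, NULL);
	}
	return rc;
}

/*
 * Stand-in for a real detach operation so the sketch can run without
 * root or an actual loop device: reports busy busy_left times, then
 * succeeds.
 */
struct fake_dev {
	int busy_left;
	int calls;
};

static int fake_detach(void *arg)
{
	struct fake_dev *dev = arg;

	dev->calls++;
	if (dev->busy_left > 0) {
		dev->busy_left--;
		return -EBUSY;
	}
	return 0;
}
```

            With a fake_dev that is busy twice, retry_while_busy(fake_detach, &dev, 5, 100) would succeed on the third attempt. Bounding the number of attempts keeps a device that is genuinely stuck from hanging the cleanup path forever.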

            di.wang Di Wang added a comment - http://review.whamcloud.com/8409
            di.wang Di Wang added a comment -

            John: I think you are right, and it should use pclose, instead of fclose. Good catch!

            jhammond John Hammond added a comment -

            In mount_utils_ldiskfs.c:is_feature_enabled(), popen() is used to invoke debugfs but fclose() is used to close the FILE * returned from popen(). Hence wait() is not called, and debugfs may still be running (and holding the loop device open). This prevents losetup -d from detaching it.

            yujian Jian Yu added a comment -

            The patch was cherry-picked to Lustre b2_4 branch.


            adilger Andreas Dilger added a comment -

            This patch should be landed to b2_4 and b2_5.

            adilger Andreas Dilger added a comment -

            Di, please use More Actions->Link to link duplicate bugs. Just closing a bug as a duplicate does not link it to the original bug (unlike with Bugzilla).

            adilger Andreas Dilger added a comment -

            This appears to be the most common failure preventing review-dne from passing all of its tests (7 of 11 failures in the past three days if one adds LU-4261 failures). Fixing this would probably allow review-dne to pass consistently enough to be enforced.

            People

              di.wang Di Wang
              maloo Maloo