[LU-4223] conf-sanity test_32c, test_32d: could not find any free loop device
Created: 07/Nov/13  Updated: 16/Mar/16  Resolved: 18/Dec/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.6.0, Lustre 2.4.2, Lustre 2.5.1, Lustre 2.8.0

Type: Bug
Priority: Critical
Reporter: Maloo
Assignee: Di Wang
Resolution: Fixed
Votes: 0
Labels: dne, mn4, sdsc

Issue Links:
Duplicate
is duplicated by LU-4261 Test failure on test suite sanity-quo... Closed
Severity: 3
Rank (Obsolete): 11491

 Description   

This issue was created by maloo for sarah <sarah@whamcloud.com>

This issue relates to the following test suite run: http://maloo.whamcloud.com/test_sets/751ee23a-4106-11e3-a1e8-52540035b04c.

The sub-test test_32c failed with the following error:

test_32c failed with 1

Info required for matching: conf-sanity 32c



 Comments   
Comment by Sarah Liu [ 07/Nov/13 ]

test log:

CMD: wtm-32vm3 tunefs.lustre --quota /tmp/t32/ost
wtm-32vm3: tunefs.lustre: out of loop devices!
wtm-32vm3: 
wtm-32vm3: tunefs.lustre FATAL: Loop device setup for /tmp/t32/ost failed: Too many open files
wtm-32vm3: tunefs.lustre: exiting with 24 (Too many open files)
checking for existing Lustre data: found
Comment by Andreas Dilger [ 13/Nov/13 ]

Oleg reports that he runs out of loop devices when running conf-sanity repeatedly on the same node. It seems there is a leak in the configuration of the loop devices, either by mount.lustre/unmount, or something else in conf-sanity. It is likely hitting DNE testing more than regular testing just due to DNE configurations using more MDT and OST devices.

Comment by Andreas Dilger [ 14/Nov/13 ]

In the test output:

Upgrading from disk2_3-ldiskfs.tar.bz2, created with:
  Commit: 2.3.0
  Kernel: 2.6.32-279.5.1.el6_lustre.gb16fe80.x86_64
    Arch: x86_64
CMD: wtm-32vm3 tunefs.lustre --dryrun /tmp/t32/mdt
   :
   :
   loop: can't delete device /dev/loop3: Device or resource busy

So it looks like the device is being referenced. I've noticed in recent DNE testing that the MDT device did not unmount cleanly, so this should be tested as the possible root cause of the problem.

Comment by Di Wang [ 15/Nov/13 ]

Hmm, it seems a few tests did not use "umount -d" when unmounting, so the loop device was not detached. http://review.whamcloud.com/8296

Comment by Andreas Dilger [ 15/Nov/13 ]

This appears to be the most common failure preventing review-dne from passing all of its tests (7 of 11 failures in the past three days if one adds LU-4261 failures). Fixing this would probably allow review-dne to pass consistently enough to be enforced.

Comment by Andreas Dilger [ 15/Nov/13 ]

Di, please use More Actions->Link to link duplicate bugs. Just closing a bug as a duplicate does not link it to the original bug (unlike with Bugzilla).

Comment by Andreas Dilger [ 19/Nov/13 ]

This patch should be landed to b2_4 and b2_5.

Comment by Jian Yu [ 22/Nov/13 ]

The patch was cherry-picked to Lustre b2_4 branch.

Comment by John Hammond [ 26/Nov/13 ]

In mount_utils_ldiskfs.c:is_feature_enabled(), popen() is used to invoke debugfs, but fclose() is used to close the FILE * returned from popen(). Hence wait() is not called, and debugfs may still be running (and holding the loop device open). This prevents losetup -d from detaching it.
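
To illustrate the popen()/pclose() pairing described above, here is a minimal sketch. It is not the Lustre source: cmd_output_contains() and its return convention are hypothetical, chosen only to show a popen() stream being drained and then closed with pclose(), which POSIX specifies as the call that waits for the child and returns its exit status.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/wait.h>

/*
 * Hypothetical helper (not taken from mount_utils_ldiskfs.c): run `cmd`
 * through popen() and report whether any output line contains `needle`.
 * Returns 1 if found, 0 if not found, -1 on error or non-zero exit status.
 * A match that straddles the 256-byte read buffer would be missed; that is
 * acceptable for a sketch.
 */
static int cmd_output_contains(const char *cmd, const char *needle)
{
	char line[256];
	int found = 0;
	int status;
	FILE *fp;

	assert(cmd != NULL && needle != NULL);

	fp = popen(cmd, "r");
	if (fp == NULL)
		return -1;

	/* Drain all output so the child can run to completion. */
	while (fgets(line, sizeof(line), fp) != NULL) {
		if (strstr(line, needle) != NULL)
			found = 1;
	}

	/*
	 * pclose() is the POSIX-specified way to close a popen() stream:
	 * it waits for the child process and returns its wait status.
	 */
	status = pclose(fp);
	if (status == -1 || !WIFEXITED(status) || WEXITSTATUS(status) != 0)
		return -1;

	return found;
}
```

Because the helper only returns after pclose(), the child (debugfs in the case discussed here) has exited and released its file descriptors by the time the caller tries to detach the loop device.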

Comment by Di Wang [ 27/Nov/13 ]

John: I think you are right, and it should use pclose, instead of fclose. Good catch!

Comment by Di Wang [ 27/Nov/13 ]

http://review.whamcloud.com/8409

Comment by John Hammond [ 27/Nov/13 ]

Hi Di,

  1. There are a few other places in lustre/utils/ where popen() is followed by fclose(), including one in is_e2fsprogs_feature_supp(). We should fix those too.
  2. This may not be enough. On my RHEL 6.4 kernel (2.6.32-358.18.1.el6.lustre.x86_64) if there are IOs still in flight to a loop device then it cannot be detached:
    # dd if=/dev/zero of=/tmp/0 bs=1M count=100
    100+0 records in
    100+0 records out
    104857600 bytes (105 MB) copied, 0.195863 s, 535 MB/s
    # losetup -f /tmp/0
    # losetup -a
    /dev/loop0: [fc01]:1055063 (/tmp/0)
    # dd if=/dev/zero of=/dev/loop0 bs=1M count=100; losetup -d /dev/loop0
    100+0 records in
    100+0 records out
    104857600 bytes (105 MB) copied, 0.15608 s, 672 MB/s
    loop: can't delete device /dev/loop0: Device or resource busy
    # losetup -a
    /dev/loop0: [fc01]:1055063 (/tmp/0)
    # losetup -d /dev/loop0
    #
    # losetup /dev/loop0 /tmp/0; dd if=/dev/zero of=/dev/loop0 oflag=sync bs=1M count=100; losetup -d /dev/loop0
    100+0 records in
    100+0 records out
    104857600 bytes (105 MB) copied, 3.72073 s, 28.2 MB/s
    loop: can't delete device /dev/loop0: Device or resource busy
    

    As you see above, asking for sync IO to the loop device is not enough. So we probably need to add some wait-and-retry logic to loop_cleanup().
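
The wait-and-retry idea could be sketched roughly as follows. This is an assumption-laden illustration, not the patch that landed: loop_detach_once() and retry_while_busy() are hypothetical names, and the retry count and delay are arbitrary. The only kernel interface used is the LOOP_CLR_FD ioctl from <linux/loop.h>.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <linux/loop.h>
#include <sys/ioctl.h>
#include <time.h>
#include <unistd.h>

/*
 * Hypothetical helper: make one attempt to detach the loop device whose
 * path is passed in `arg`. Returns 0 on success, -errno on failure.
 */
static int loop_detach_once(void *arg)
{
	const char *dev = arg;
	int rc = 0;
	int fd;

	fd = open(dev, O_RDONLY);
	if (fd < 0)
		return -errno;
	if (ioctl(fd, LOOP_CLR_FD, 0) < 0)
		rc = -errno;
	close(fd);
	return rc;
}

/*
 * Hypothetical helper: call op(arg) up to max_tries times, sleeping
 * delay_ms between attempts, for as long as it returns -EBUSY.
 * Returns the result of the last attempt, or -EINVAL if max_tries <= 0.
 */
static int retry_while_busy(int (*op)(void *), void *arg,
			    int max_tries, long delay_ms)
{
	struct timespec ts = { delay_ms / 1000, (delay_ms % 1000) * 1000000L };
	int rc = -EINVAL;
	int i;

	assert(op != NULL);

	for (i = 0; i < max_tries; i++) {
		rc = op(arg);
		if (rc != -EBUSY)
			break;
		/* Give in-flight IO time to drain before trying again. */
		if (i + 1 < max_tries)
			nanosleep(&ts, NULL);
	}
	return rc;
}
```

A caller might use it as retry_while_busy(loop_detach_once, "/dev/loop0", 10, 100), which gives up after roughly one second. Errors other than EBUSY are returned immediately, so a missing device is not retried.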

Comment by Di Wang [ 27/Nov/13 ]

John, yes, this makes sense. I updated the patch; please have a look.

Comment by Andreas Dilger [ 18/Dec/13 ]

This patch was landed on 2013-12-09 but conf-sanity is still reporting this bug for failures:

https://maloo.whamcloud.com/test_sets/7d7a27aa-66ae-11e3-93e2-52540035b04c
https://maloo.whamcloud.com/test_sets/202cbd9e-66c5-11e3-93e2-52540035b04c
https://maloo.whamcloud.com/test_sets/868c4f72-66fd-11e3-a234-52540035b04c

and others.

Comment by Andreas Dilger [ 18/Dec/13 ]

False alarm - it is actually LU-4358 that is being hit, but is misdiagnosed as LU-4223.

I'm closing this bug since it looks like it is not being hit in recent test runs.

Comment by Jian Yu [ 04/Jan/14 ]

http://review.whamcloud.com/8409

The above patch was not cherry-picked to Lustre b2_5 branch.

The same failure occurred on Lustre b2_5 build #5:
https://maloo.whamcloud.com/test_sets/16cd0f30-7497-11e3-8b21-52540035b04c

Here is the back-ported patch on Lustre b2_5 branch: http://review.whamcloud.com/8723

Comment by Jian Yu [ 11/Jan/14 ]

Landed for Lustre 2.5.1.

Comment by Gerrit Updater [ 07/Jan/15 ]

Andreas Dilger (andreas.dilger@intel.com) uploaded a new patch: http://review.whamcloud.com/13265
Subject: LU-4223 tests: fix conf-sanity test_32 typo
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 05a59163d8427f0c77d3aa6b31486b2219fb49d2

Comment by Gerrit Updater [ 03/Mar/15 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/13265/
Subject: LU-4223 tests: fix conf-sanity test_32 typo
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 5da53bcb1d8c38157325505f6619d6b6c3d4db6a
