
[LU-4638] racer test hung: /mnt/lustre is still busy, wait one second

Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Critical
    • None
    • Lustre 2.5.1
    • None

    • Environment:
      Lustre Build: http://build.whamcloud.com/job/lustre-b2_5/26/
      Distro/Arch: RHEL6.5/x86_64 (kernel version: 2.6.32-431.3.1.el6)
    • 3
    • 12688

    Description

      While running the racer test, it hung as follows:

      Stopping client client-24vm1.lab.whamcloud.com /mnt/lustre opts:-f
      Stopping client client-24vm2.lab.whamcloud.com /mnt/lustre opts:-f
      COMMAND   PID USER   FD   TYPE      DEVICE  SIZE/OFF               NODE NAME
      cat     12523 root    1w   REG 1273,181606   3146752 144115205306064232 /mnt/lustre/racer/5
      cat     12523 root    3r   REG 1273,181606    482309 144115205306064282 /mnt/lustre/racer/6/6/6
      dd      16254 root    1w   REG 1273,181606 209190912 144115205255734078 /mnt/lustre/racer/15
      dd      19402 root    1w   REG 1273,181606 240656384 144115205289279515 /mnt/lustre2/racer/3
      dd      19523 root    1w   REG 1273,181606 151068672 144115205255734554 /mnt/lustre/racer/12 (deleted)
      dd      19757 root    1w   REG 1273,181606 142558208 144115205306065539 /mnt/lustre/racer/4
      dd      31064 root    1w   REG 1273,181606 188440576 144115205272508837 /mnt/lustre2/racer/7/14
      COMMAND   PID USER   FD   TYPE      DEVICE  SIZE/OFF               NODE NAME
      dd       3063 root    1w   REG 1273,181606   3146752 144115205306064232 /mnt/lustre/racer/14
      cat      3485 root    1w   REG 1273,181606    482309 144115205306064282 /mnt/lustre/racer/6/6/6 (deleted)
      cat      3485 root    3r   REG 1273,181606   3149824 144115205306064232 /mnt/lustre/racer/5
      dd      14671 root    1w   REG 1273,181606 240656384 144115205289279515 /mnt/lustre/racer/3
      dd      16043 root    1w   REG 1273,181606  93643776 144115205289289486 /mnt/lustre2/racer/2
      dd      17535 root    1w   REG 1273,181606 105392128 144115205272513127 /mnt/lustre/racer/11
      dd      31434 root    1w   REG 1273,181606 263391232 144115205272504946 /mnt/lustre2/racer/14
      /mnt/lustre is still busy, wait one second
      /mnt/lustre is still busy, wait one second
      /mnt/lustre is still busy, wait one second
      /mnt/lustre is still busy, wait one second
      

      Maloo report: https://maloo.whamcloud.com/test_sets/18534f84-977b-11e3-b941-52540035b04c
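
      For reference, a minimal shell sketch of this failure mode, assuming only the
      standard umount/lsof tools and the /mnt/lustre mount point shown in the log
      above; the loop only illustrates the "still busy" retry that the test
      framework prints and is not the framework code itself:

      MNT=/mnt/lustre
      # Retry a forced unmount while processes still hold files open on the
      # filesystem; lsof, given a mount point, lists exactly those holders
      # (including unlinked "(deleted)" files like the ones in the output above).
      while ! umount -f "$MNT" 2>/dev/null; do
          echo "$MNT is still busy, wait one second"
          lsof "$MNT"
          sleep 1
      done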

    Attachments

    Issue Links

    Activity

            pjones Peter Jones added a comment -

            This is fixed by the latest RHEL6.5 update - LU-5025


            bogl Bob Glossman (Inactive) added a comment -

            So far it looks like running racer on an unpatched, newer el6 kernel does in fact work; we are not seeing the racer hang reported in this bug. I will run it a few more times to accumulate more evidence, but I think the kernel bug fix really does resolve the problem as we expected.

            bogl Bob Glossman (Inactive) added a comment -

            The brand-new kernel update in LU-5025 includes the upstream kernel fix we have been waiting for. It should fix this problem once and for all on both clients and servers, with no kernel patching required.

            bogl Bob Glossman (Inactive) added a comment -

            As mentioned in previous comments, this is entirely due to a Linux kernel bug. It will continue to happen with Lustre client builds until the known fix lands in an upstream Linux release and we start building against it.
            yujian Jian Yu added a comment -

            Lustre Build: http://build.whamcloud.com/job/lustre-b2_5/39/ (2.5.1 RC1)
            Distro/Arch: RHEL6.5/x86_64 (kernel version: 2.6.32-431.5.1.el6)

            The same failure occurred:
            https://maloo.whamcloud.com/test_sets/28e2acee-a4a2-11e3-8fba-52540035b04c


            bogl Bob Glossman (Inactive) added a comment -

            Pretty sure this is the known bug in the upstream kernel discussed in the comments on LU-4287; details are discussed in http://comments.gmane.org/gmane.linux.kernel/1638580.

            We have fixed the problem for server builds with an extra kernel patch. The problem still exists, and will continue to exist, for client builds until the fix is present in the upstream kernel we build clients against.

            A workaround is to use server builds even on client nodes.
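
            A minimal sketch of checking what a client node is actually running before
            applying that workaround; the "_lustre" release suffix mentioned below is
            the usual convention for patched server kernels and is an assumption here,
            not something stated in this ticket:

            # An unpatched distro client kernel reports something like
            # 2.6.32-431.5.1.el6.x86_64, while a patched Lustre server kernel
            # typically carries a "_lustre"-tagged release string.
            uname -r
            # List installed kernel packages to see whether the patched server
            # kernel is available to boot on this node.
            rpm -qa 'kernel*' | sort
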
            yujian Jian Yu added a comment -

            Lustre Build: http://build.whamcloud.com/job/lustre-b2_5/26/
            Distro/Arch: RHEL6.5/x86_64 (kernel version: 2.6.32-431.3.1.el6)
            MDSCOUNT=2

            The same failure occurred:
            https://maloo.whamcloud.com/test_sets/0ed1c2a0-97b8-11e3-acb5-52540035b04c

            So, the failure occurred consistently on Lustre b2_5 build #26.

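
            A rough sketch of re-running that configuration with the in-tree test
            driver; the path and options below are the usual defaults and may differ
            on a given test setup:

            # Run only the racer suite against a DNE setup with two MDTs, matching
            # the MDSCOUNT=2 report above; auster drives the standard lustre/tests
            # suites from the installed test directory.
            cd /usr/lib64/lustre/tests
            MDSCOUNT=2 ./auster -v racer
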
            yujian Jian Yu added a comment -

            This is a regression introduced by Lustre b2_5 build #25 (not tested) or #26, because the racer test kept passing in previous builds.

            Build #26 Changes:

            LU-4154 lfsck: skip old lfsck test in DNE mode 
            LU-4287 kernel: kernel update RHEL6.5 [2.6.32-431.3.1.el6] 
            LU-4429 llite: fix open lock matching in ll_md_blocking_ast() 
            LU-4208 osd-zfs: hold pool config lock to register property 
            

            Build #25 Changes:

            LU-4253 osc: Don't flush active extents. 
            LU-4336 quota: improper assert in osc_quota_chkdq() 
            LU-3834 mdt: handle swap_layouts failures during restore 
            LU-4152 mdt: Don't enqueue two locks on the same resource 
            LU-3601 Do not create layout in lease-open 
            LU-4454 libcfs: warn if all HTs in a core are gone 
            LU-3618 ptlrpc: rq_commit_cb is called for twice 
            LU-2818 mdt: Properly handle ENOMEM 
            LU-3857 osd: cleanup procfs after osd_shutdown 
            LU-3772 ptlrpc: fix nrs cleanup 
            LU-4260 lod: free striping if striping initialization fails 
            LU-4430 mdt: check for MDS_FMODE_EXEC in mdt_mfd_open() 
            LU-3528 mdt: check object exists for remote directory 
            LU-946 lprocfs: List open files in filesystem 
            
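
            As a hedged illustration of how such a regression window can be narrowed
            from a source checkout: the two revisions below are hypothetical
            placeholders, and the real ones come from the Jenkins build pages linked
            above.

            GOOD=<rev-of-build-24>   # last build where racer passed (placeholder)
            BAD=<rev-of-build-26>    # first build seen with the hang (placeholder)
            # Show every change that entered b2_5 between the two builds; the other
            # comments in this ticket point to the LU-4287 kernel update as the
            # relevant change.
            git log --oneline "$GOOD".."$BAD"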

            People

              Assignee: bogl Bob Glossman (Inactive)
              Reporter: yujian Jian Yu
              Votes: 0
              Watchers: 4

              Dates

                Created:
                Updated:
                Resolved: