[LU-4638] racer test hung: /mnt/lustre is still busy, wait one second Created: 17/Feb/14  Updated: 22/Nov/18  Resolved: 16/May/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.5.1
Fix Version/s: None

Type: Bug
Priority: Critical
Reporter: Jian Yu
Assignee: Bob Glossman (Inactive)
Resolution: Duplicate
Votes: 0
Labels: None
Environment:

Lustre Build: http://build.whamcloud.com/job/lustre-b2_5/26/
Distro/Arch: RHEL6.5/x86_64 (kernel version: 2.6.32-431.3.1.el6)


Issue Links:
Duplicate
Related
is related to LU-5025 Kernel update [RHEL6.5 2.6.32-431.17.... Resolved
Severity: 3
Rank (Obsolete): 12688

 Description   

While running the racer test, it hung while stopping the clients, with /mnt/lustre reported as still busy:

Stopping client client-24vm1.lab.whamcloud.com /mnt/lustre opts:-f
Stopping client client-24vm2.lab.whamcloud.com /mnt/lustre opts:-f
COMMAND   PID USER   FD   TYPE      DEVICE  SIZE/OFF               NODE NAME
cat     12523 root    1w   REG 1273,181606   3146752 144115205306064232 /mnt/lustre/racer/5
cat     12523 root    3r   REG 1273,181606    482309 144115205306064282 /mnt/lustre/racer/6/6/6
dd      16254 root    1w   REG 1273,181606 209190912 144115205255734078 /mnt/lustre/racer/15
dd      19402 root    1w   REG 1273,181606 240656384 144115205289279515 /mnt/lustre2/racer/3
dd      19523 root    1w   REG 1273,181606 151068672 144115205255734554 /mnt/lustre/racer/12 (deleted)
dd      19757 root    1w   REG 1273,181606 142558208 144115205306065539 /mnt/lustre/racer/4
dd      31064 root    1w   REG 1273,181606 188440576 144115205272508837 /mnt/lustre2/racer/7/14
COMMAND   PID USER   FD   TYPE      DEVICE  SIZE/OFF               NODE NAME
dd       3063 root    1w   REG 1273,181606   3146752 144115205306064232 /mnt/lustre/racer/14
cat      3485 root    1w   REG 1273,181606    482309 144115205306064282 /mnt/lustre/racer/6/6/6 (deleted)
cat      3485 root    3r   REG 1273,181606   3149824 144115205306064232 /mnt/lustre/racer/5
dd      14671 root    1w   REG 1273,181606 240656384 144115205289279515 /mnt/lustre/racer/3
dd      16043 root    1w   REG 1273,181606  93643776 144115205289289486 /mnt/lustre2/racer/2
dd      17535 root    1w   REG 1273,181606 105392128 144115205272513127 /mnt/lustre/racer/11
dd      31434 root    1w   REG 1273,181606 263391232 144115205272504946 /mnt/lustre2/racer/14
/mnt/lustre is still busy, wait one second
/mnt/lustre is still busy, wait one second
/mnt/lustre is still busy, wait one second
/mnt/lustre is still busy, wait one second

Maloo report: https://maloo.whamcloud.com/test_sets/18534f84-977b-11e3-b941-52540035b04c
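
The listing above appears to be lsof output (the COMMAND/PID/USER/FD/TYPE/DEVICE/SIZE/OFF/NODE/NAME header matches lsof's default columns). As a rough aid for triaging similar hangs, the sketch below groups the open files under a given mount point by process. The function name `busy_files` and the `SAMPLE` constant are hypothetical helpers introduced for illustration, not part of the Lustre test framework, and the parser assumes whitespace-separated columns with no spaces in paths.

```python
from collections import defaultdict

# A few rows copied from the first listing in this report.
SAMPLE = """\
COMMAND   PID USER   FD   TYPE      DEVICE  SIZE/OFF               NODE NAME
cat     12523 root    1w   REG 1273,181606   3146752 144115205306064232 /mnt/lustre/racer/5
cat     12523 root    3r   REG 1273,181606    482309 144115205306064282 /mnt/lustre/racer/6/6/6
dd      16254 root    1w   REG 1273,181606 209190912 144115205255734078 /mnt/lustre/racer/15
dd      19402 root    1w   REG 1273,181606 240656384 144115205289279515 /mnt/lustre2/racer/3
dd      19523 root    1w   REG 1273,181606 151068672 144115205255734554 /mnt/lustre/racer/12 (deleted)
dd      19757 root    1w   REG 1273,181606 142558208 144115205306065539 /mnt/lustre/racer/4
dd      31064 root    1w   REG 1273,181606 188440576 144115205272508837 /mnt/lustre2/racer/7/14
"""


def busy_files(lsof_text, mount_point):
    """Group open files under mount_point by (command, pid).

    Assumes the nine default lsof columns and paths without spaces.
    Returns {(command, pid): [(path, deleted_flag), ...]}.
    """
    holders = defaultdict(list)
    # A trailing slash keeps /mnt/lustre from also matching /mnt/lustre2.
    prefix = mount_point.rstrip("/") + "/"
    for line in lsof_text.splitlines():
        fields = line.split()
        # Skip blank lines, short lines, and repeated header rows.
        if len(fields) < 9 or fields[0] == "COMMAND":
            continue
        command, pid, name = fields[0], int(fields[1]), fields[8]
        deleted = len(fields) > 9 and fields[9] == "(deleted)"
        if name.startswith(prefix):
            holders[(command, pid)].append((name, deleted))
    return dict(holders)


if __name__ == "__main__":
    for (command, pid), files in sorted(busy_files(SAMPLE, "/mnt/lustre").items()):
        for path, deleted in files:
            suffix = " (deleted)" if deleted else ""
            print(f"{command} pid={pid} holds {path}{suffix}")
```

Matching on the mount point plus a trailing slash separates the holders of /mnt/lustre from those of /mnt/lustre2, which the raw listing interleaves.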



 Comments   
Comment by Jian Yu [ 17/Feb/14 ]

This is a regression introduced in Lustre b2_5 build #25 (not tested) or #26, because the racer test kept passing in previous builds.

Build #26 Changes:

LU-4154 lfsck: skip old lfsck test in DNE mode 
LU-4287 kernel: kernel update RHEL6.5 [2.6.32-431.3.1.el6] 
LU-4429 llite: fix open lock matching in ll_md_blocking_ast() 
LU-4208 osd-zfs: hold pool config lock to register property 

Build #25 Changes:

LU-4253 osc: Don't flush active extents. 
LU-4336 quota: improper assert in osc_quota_chkdq() 
LU-3834 mdt: handle swap_layouts failures during restore 
LU-4152 mdt: Don't enqueue two locks on the same resource 
LU-3601 Do not create layout in lease-open 
LU-4454 libcfs: warn if all HTs in a core are gone 
LU-3618 ptlrpc: rq_commit_cb is called for twice 
LU-2818 mdt: Properly handle ENOMEM 
LU-3857 osd: cleanup procfs after osd_shutdown 
LU-3772 ptlrpc: fix nrs cleanup 
LU-4260 lod: free striping if striping initialization fails 
LU-4430 mdt: check for MDS_FMODE_EXEC in mdt_mfd_open() 
LU-3528 mdt: check object exists for remote directory 
LU-946 lprocfs: List open files in filesystem 
Comment by Jian Yu [ 17/Feb/14 ]

Lustre Build: http://build.whamcloud.com/job/lustre-b2_5/26/
Distro/Arch: RHEL6.5/x86_64 (kernel version: 2.6.32-431.3.1.el6)
MDSCOUNT=2

The same failure occurred:
https://maloo.whamcloud.com/test_sets/0ed1c2a0-97b8-11e3-acb5-52540035b04c

So, the failure occurred consistently on Lustre b2_5 build #26.

Comment by Bob Glossman (Inactive) [ 17/Feb/14 ]

Pretty sure this is the known bug in the upstream kernel discussed in comments in LU-4287.
Details discussed in http://comments.gmane.org/gmane.linux.kernel/1638580.

We have fixed the problem for server builds with an extra kernel patch. The problem still exists and will continue to exist for client builds until the fix is present and available in the upstream kernel we build clients against.

A workaround is to use server builds even on client nodes.

Comment by Jian Yu [ 06/Mar/14 ]

Lustre Build: http://build.whamcloud.com/job/lustre-b2_5/39/ (2.5.1 RC1)
Distro/Arch: RHEL6.5/x86_64 (kernel version: 2.6.32-431.5.1.el6)

The same failure occurred:
https://maloo.whamcloud.com/test_sets/28e2acee-a4a2-11e3-8fba-52540035b04c

Comment by Bob Glossman (Inactive) [ 06/Mar/14 ]

As mentioned in previous comments, this is entirely due to a Linux kernel bug. It will continue to happen with Lustre client builds until the known fix is in an upstream Linux release and we start building against it.

Comment by Bob Glossman (Inactive) [ 07/May/14 ]

The brand new kernel update in LU-5025 includes the upstream kernel fix we've been waiting for. It should fix this problem once and for all in both clients and servers. No kernel patching required.

Comment by Bob Glossman (Inactive) [ 09/May/14 ]

So far it looks like running racer on the new, unpatched el6 kernel does in fact work. I am not seeing the racer hang reported in this bug. I will run it a few more times to accumulate more evidence, but I think the bug fix in the kernel really does fix the problem as we expected.

Comment by Peter Jones [ 16/May/14 ]

This is fixed by the latest RHEL6.5 update - LU-5025
