[LU-4638] racer test hung: /mnt/lustre is still busy, wait one second Created: 17/Feb/14 Updated: 22/Nov/18 Resolved: 16/May/14 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.5.1 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Critical |
| Reporter: | Jian Yu | Assignee: | Bob Glossman (Inactive) |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None | ||
| Environment: |
Lustre Build: http://build.whamcloud.com/job/lustre-b2_5/26/ |
||
| Issue Links: |
|
||||||||||||
| Severity: | 3 | ||||||||||||
| Rank (Obsolete): | 12688 | ||||||||||||
| Description |
|
While running racer test, it hung as follows:

Stopping client client-24vm1.lab.whamcloud.com /mnt/lustre opts:-f
Stopping client client-24vm2.lab.whamcloud.com /mnt/lustre opts:-f
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
cat 12523 root 1w REG 1273,181606 3146752 144115205306064232 /mnt/lustre/racer/5
cat 12523 root 3r REG 1273,181606 482309 144115205306064282 /mnt/lustre/racer/6/6/6
dd 16254 root 1w REG 1273,181606 209190912 144115205255734078 /mnt/lustre/racer/15
dd 19402 root 1w REG 1273,181606 240656384 144115205289279515 /mnt/lustre2/racer/3
dd 19523 root 1w REG 1273,181606 151068672 144115205255734554 /mnt/lustre/racer/12 (deleted)
dd 19757 root 1w REG 1273,181606 142558208 144115205306065539 /mnt/lustre/racer/4
dd 31064 root 1w REG 1273,181606 188440576 144115205272508837 /mnt/lustre2/racer/7/14
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
dd 3063 root 1w REG 1273,181606 3146752 144115205306064232 /mnt/lustre/racer/14
cat 3485 root 1w REG 1273,181606 482309 144115205306064282 /mnt/lustre/racer/6/6/6 (deleted)
cat 3485 root 3r REG 1273,181606 3149824 144115205306064232 /mnt/lustre/racer/5
dd 14671 root 1w REG 1273,181606 240656384 144115205289279515 /mnt/lustre/racer/3
dd 16043 root 1w REG 1273,181606 93643776 144115205289289486 /mnt/lustre2/racer/2
dd 17535 root 1w REG 1273,181606 105392128 144115205272513127 /mnt/lustre/racer/11
dd 31434 root 1w REG 1273,181606 263391232 144115205272504946 /mnt/lustre2/racer/14
/mnt/lustre is still busy, wait one second
/mnt/lustre is still busy, wait one second
/mnt/lustre is still busy, wait one second
/mnt/lustre is still busy, wait one second

Maloo report: https://maloo.whamcloud.com/test_sets/18534f84-977b-11e3-b941-52540035b04c |
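When triaging a hang like this, it can help to see which processes still hold files open under the mount point that refuses to unmount. The following is a minimal sketch, not part of the Lustre test framework: it assumes the nine-column lsof layout shown in the description, and the function name `busy_processes` and the `SAMPLE` constant are hypothetical names introduced only for illustration. The sample rows are copied from the first client's listing.

```python
from collections import defaultdict

# Rows copied from the first client's listing in the description.
SAMPLE = """\
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
cat 12523 root 1w REG 1273,181606 3146752 144115205306064232 /mnt/lustre/racer/5
cat 12523 root 3r REG 1273,181606 482309 144115205306064282 /mnt/lustre/racer/6/6/6
dd 16254 root 1w REG 1273,181606 209190912 144115205255734078 /mnt/lustre/racer/15
dd 19402 root 1w REG 1273,181606 240656384 144115205289279515 /mnt/lustre2/racer/3
dd 19523 root 1w REG 1273,181606 151068672 144115205255734554 /mnt/lustre/racer/12 (deleted)
"""

DELETED_SUFFIX = " (deleted)"


def busy_processes(lsof_text, mount_point):
    """Group open paths under mount_point by (command, pid).

    Hypothetical helper: assumes nine whitespace-separated columns
    with the file name last, as in the listing above.
    """
    # Compare against "<mount>/" so /mnt/lustre does not match /mnt/lustre2.
    prefix = mount_point.rstrip("/") + "/"
    holders = defaultdict(list)
    for line in lsof_text.splitlines():
        fields = line.split(None, 8)
        if len(fields) < 9 or fields[0] == "COMMAND":
            continue  # skip header rows and non-lsof lines
        command, pid, name = fields[0], fields[1], fields[8]
        # Unlinked files carry a trailing " (deleted)" marker in the listing.
        if name.endswith(DELETED_SUFFIX):
            name = name[: -len(DELETED_SUFFIX)]
        if name.startswith(prefix):
            holders[(command, int(pid))].append(name)
    return dict(holders)


if __name__ == "__main__":
    for (command, pid), paths in sorted(busy_processes(SAMPLE, "/mnt/lustre").items()):
        print(command, pid, paths)
```

Matching on the mount point plus a trailing slash matters here because the test uses two mounts, /mnt/lustre and /mnt/lustre2, and a plain prefix match would attribute files on the second mount to the first.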
| Comments |
| Comment by Jian Yu [ 17/Feb/14 ] |
|
This is a regression introduced by Lustre b2_5 build #25 (not tested) or #26, because racer test kept passing in previous builds.

Build #26 Changes:
LU-4154 lfsck: skip old lfsck test in DNE mode
LU-4287 kernel: kernel update RHEL6.5 [2.6.32-431.3.1.el6]
LU-4429 llite: fix open lock matching in ll_md_blocking_ast()
LU-4208 osd-zfs: hold pool config lock to register property

Build #25 Changes:
LU-4253 osc: Don't flush active extents.
LU-4336 quota: improper assert in osc_quota_chkdq()
LU-3834 mdt: handle swap_layouts failures during restore
LU-4152 mdt: Don't enqueue two locks on the same resource
LU-3601 Do not create layout in lease-open
LU-4454 libcfs: warn if all HTs in a core are gone
LU-3618 ptlrpc: rq_commit_cb is called for twice
LU-2818 mdt: Properly handle ENOMEM
LU-3857 osd: cleanup procfs after osd_shutdown
LU-3772 ptlrpc: fix nrs cleanup
LU-4260 lod: free striping if striping initialization fails
LU-4430 mdt: check for MDS_FMODE_EXEC in mdt_mfd_open()
LU-3528 mdt: check object exists for remote directory
LU-946 lprocfs: List open files in filesystem |
| Comment by Jian Yu [ 17/Feb/14 ] |
|
Lustre Build: http://build.whamcloud.com/job/lustre-b2_5/26/
The same failure occurred:
So, the failure occurred consistently on Lustre b2_5 build #26. |
| Comment by Bob Glossman (Inactive) [ 17/Feb/14 ] |
|
Pretty sure this is the known bug in the upstream kernel discussed in comments in We have fixed the problem for server builds with an extra kernel patch. The problem still exists and will continue to exist for client builds until the fix is present and available in the upstream kernel we build clients against. A workaround is to use server builds even on client nodes. |
| Comment by Jian Yu [ 06/Mar/14 ] |
|
Lustre Build: http://build.whamcloud.com/job/lustre-b2_5/39/ (2.5.1 RC1)
The same failure occurred: |
| Comment by Bob Glossman (Inactive) [ 06/Mar/14 ] |
|
As mentioned in previous comments, this is entirely due to a Linux bug. It will continue to happen with Lustre client builds until the known fix is in an upstream Linux release and we start building against it. |
| Comment by Bob Glossman (Inactive) [ 07/May/14 ] |
|
The brand new kernel update in |
| Comment by Bob Glossman (Inactive) [ 09/May/14 ] |
|
So far it looks like running racer on an unpatched new version el6 kernel does in fact work. Not seeing the racer hang reported in this bug. Will run it a few more times to accumulate more evidence, but I think the bug fix in the kernel really does fix the problem as we expected it to do. |
| Comment by Peter Jones [ 16/May/14 ] |
|
This is fixed by the latest RHEL6.5 update - |