[LU-3896] racer test hung: /mnt/lustre2 is still busy, wait one second Created: 06/Sep/13  Updated: 10/Oct/21  Resolved: 10/Oct/21

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.1
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Jian Yu Assignee: WC Triage
Resolution: Cannot Reproduce Votes: 0
Labels: dne, yuc2, zfs
Environment:

Lustre build: http://build.whamcloud.com/job/lustre-b2_4/44/ (2.4.1 RC1)
Distro/Arch: RHEL6.4/x86_64
FSTYPE=zfs
MDSCOUNT=4


Severity: 3
Rank (Obsolete): 10186

Description

The racer test hung while stopping the clients after the test run completed:

== racer test complete, duration 1006 sec == 00:42:55 (1378194175)
CMD: wtm-29vm3 /usr/sbin/lctl get_param -n version
CMD: wtm-29vm3 /usr/sbin/lctl conf_param lustre.quota.mdt=ug
CMD: wtm-29vm3 /usr/sbin/lctl conf_param lustre.quota.ost=ug
Stopping clients: wtm-29vm5,wtm-29vm6.rosso.whamcloud.com /mnt/lustre2 (opts:)
CMD: wtm-29vm5,wtm-29vm6.rosso.whamcloud.com running=\$(grep -c /mnt/lustre2' ' /proc/mounts);
if [ \$running -ne 0 ] ; then
echo Stopping client \$(hostname) /mnt/lustre2 opts:;
lsof /mnt/lustre2 || need_kill=no;
if [ x != x -a x\$need_kill != xno ]; then
    pids=\$(lsof -t /mnt/lustre2 | sort -u);
    if [ -n \"\$pids\" ]; then
             kill -9 \$pids;
    fi
fi;
while umount  /mnt/lustre2 2>&1 | grep -q busy; do
    echo /mnt/lustre2 is still busy, wait one second && sleep 1;
done;
fi
Stopping client wtm-29vm6.rosso.whamcloud.com /mnt/lustre2 opts:
Stopping client wtm-29vm5.rosso.whamcloud.com /mnt/lustre2 opts:
COMMAND   PID USER   FD   TYPE      DEVICE SIZE/OFF               NODE NAME
dd       3713 root    1w   REG 1273,181606  7769088 450359979933698183 /mnt/lustre/racer1/6 (deleted)
dd      10024 root    1w   REG 1273,181606 99767296 450359979950473219 /mnt/lustre2/racer1/2 (deleted)
dd      10031 root    1w   REG 1273,181606  9411584 450359979967250450 /mnt/lustre2/racer1/12
dd      10059 root    1w   REG 1273,181606 85909504 432345581440991292 /mnt/lustre2/racer/6
dd      10062 root    1w   REG 1273,181606 56402944 432345581440991267 /mnt/lustre/racer/8
dd      10074 root    1w   REG 1273,181606 47879168 432345581457768455 /mnt/lustre/racer/9
dd      10076 root    1w   REG 1273,181606 45958144 432345581407436811 /mnt/lustre2/racer/13
dd      10084 root    1w   REG 1273,181606 47879168 432345581457768455 /mnt/lustre/racer/9
dd      10130 root    1w   REG 1273,181606 18804736 468374378443178004 /mnt/lustre/racer2/13
dd      10173 root    1w   REG 1273,181606 55358464 486388776969437193 /mnt/lustre/racer3/6
dd      10421 root    1w   REG 1273,181606  2441216 486388776952660004 /mnt/lustre/racer3/10
dd      10450 root    1w   REG 1273,181606 50086912 468374378476732457 /mnt/lustre/racer2/0
dd      10456 root    1w   REG 1273,181606 31876096 468374378443178023 /mnt/lustre/racer2/13 (deleted)
dd      10457 root    1w   REG 1273,181606 64463872 486388776952660005 /mnt/lustre/racer3/19
dd      10527 root    1w   REG 1273,181606  2441216 486388776952660004 /mnt/lustre2/racer3/10
dd      10581 root    1w   REG 1273,181606  7365632 468374378459955230 /mnt/lustre2/racer2/11
dd      11360 root    1w   REG 1273,181606 57447424 450359979967250475 /mnt/lustre2/racer1/10
cat     11906 root    1w   REG 1273,181606 10503168 486388776935882790 /mnt/lustre2/racer3/2 (deleted)
cat     11906 root    3r   REG 1273,181606 39769088 486388776986214418 /mnt/lustre2/racer3/5
dd      11947 root    1w   REG 1273,181606 10503168 486388776935882790 /mnt/lustre2/racer3/2 (deleted)
cat     12675 root    1w   REG 1273,181606  8477696 450359979933696110 /mnt/lustre/racer1/1 (deleted)
cat     12675 root    3r   REG 1273,181606  8478720 450359979933696048 /mnt/lustre/racer1/5
dd      15141 root    1w   REG 1273,181606 58491904 468374378476732558 /mnt/lustre2/racer2/3 (deleted)
dd      21547 root    1w   REG 1273,181606 32379904 468374378476734932 /mnt/lustre2/racer2/5
dd      29365 root    1w   REG 1273,181606 61625344 450359979916919106 /mnt/lustre/racer1/0
/mnt/lustre2 is still busy, wait one second
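
Note: in the echoed stop-client command above, the force-kill guard "if [ x != x -a x\$need_kill != xno ]" can never be true (the first comparison is always false, apparently because an empty value was substituted), so the lsof output is only logged and the dd/cat processes listed above are never killed; the umount loop just keeps reporting busy. Below is a minimal sketch of the manual cleanup the skipped branch would otherwise attempt (mount point taken from the log; an illustration only, not the framework's configured behavior):

# Kill whatever still holds files open under the mount point, then retry umount.
# Assumes root on the client; tasks stuck in uninterruptible sleep (D state,
# like the cat trace below) will not respond to SIGKILL.
MNT=/mnt/lustre2
pids=$(lsof -t "$MNT" | sort -u)
if [ -n "$pids" ]; then
    kill -9 $pids
fi
umount "$MNT" || umount -f "$MNT"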

The console log on client wtm-29vm6 showed a cat process blocked in uninterruptible sleep on a read semaphore:

00:42:04:cat           D 0000000000000000     0  4282   2714 0x00000080
00:42:04: ffff880062aefc90 0000000000000082 0000000000000246 000000100000000d
00:42:04: ffffffff00000050 ffffffffa06b3390 ffff88006c91e1e8 ffff88006c91ca28
00:42:04: ffff880065a2f058 ffff880062aeffd8 000000000000fb88 ffff880065a2f058
00:42:04:Call Trace:
00:42:04: [<ffffffffa06b3390>] ? ldlm_completion_ast+0x0/0x960 [ptlrpc]
00:42:04: [<ffffffffa09a9fa0>] ? lov_io_init_raid0+0x3e0/0x800 [lov]
00:42:04: [<ffffffff81510695>] rwsem_down_failed_common+0x95/0x1d0
00:42:04: [<ffffffff81510826>] rwsem_down_read_failed+0x26/0x30
00:42:04: [<ffffffff81283844>] call_rwsem_down_read_failed+0x14/0x30
00:42:04: [<ffffffff8150fd24>] ? down_read+0x24/0x30
00:42:04: [<ffffffffa0a7182d>] ll_file_io_generic+0x29d/0x600 [lustre]
00:42:04: [<ffffffffa0a71ccf>] ll_file_aio_read+0x13f/0x2c0 [lustre]
00:42:04: [<ffffffffa0a7256c>] ll_file_read+0x16c/0x2a0 [lustre]
00:42:04: [<ffffffff81181a95>] vfs_read+0xb5/0x1a0
00:42:04: [<ffffffff81181bd1>] sys_read+0x51/0x90
00:42:04: [<ffffffff810dc685>] ? __audit_syscall_exit+0x265/0x290
00:42:04: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b

The stack trace on client wtm-29vm5 showed a dd process waiting in cl_lock_state_wait on the write path:

dd            S 0000000000000000     0  3713      1 0x00000080
 ffff880042549bd8 0000000000000086 ffffffffa0581e5c ffff8800679bf5b8
 0000000000000002 ffff880067ac4040 ffff880042549be8 ffffffffa055d35c
 ffff880067ac45f8 ffff880042549fd8 000000000000fb88 ffff880067ac45f8
Call Trace:
 [<ffffffffa055d35c>] ? cl_lock_trace0+0x11c/0x130 [obdclass]
 [<ffffffffa0554825>] ? cl_env_info+0x15/0x20 [obdclass]
 [<ffffffffa03d86fe>] cfs_waitq_wait+0xe/0x10 [libcfs]
 [<ffffffffa056171a>] cl_lock_state_wait+0x1aa/0x320 [obdclass]
 [<ffffffff81063410>] ? default_wake_function+0x0/0x20
 [<ffffffffa0561f0b>] cl_enqueue_locked+0x15b/0x1f0 [obdclass]
 [<ffffffffa0562a8e>] cl_lock_request+0x7e/0x270 [obdclass]
 [<ffffffffa0567d2c>] cl_io_lock+0x3cc/0x560 [obdclass]
 [<ffffffffa0567f62>] cl_io_loop+0xa2/0x1b0 [obdclass]
 [<ffffffffa09df9e0>] ll_file_io_generic+0x450/0x600 [lustre]
 [<ffffffffa09e0922>] ll_file_aio_write+0x142/0x2c0 [lustre]
 [<ffffffffa09e0c0c>] ll_file_write+0x16c/0x2a0 [lustre]
 [<ffffffff81181368>] vfs_write+0xb8/0x1a0
 [<ffffffff81181c61>] sys_write+0x51/0x90
 [<ffffffff810dc685>] ? __audit_syscall_exit+0x265/0x290
 [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
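
For reference, blocked-task traces like the two above can be gathered on a hung client on demand rather than waiting for the console dump. A minimal sketch, assuming root access, sysrq enabled, and a kernel with /proc/<pid>/stack support:

# Dump all blocked (D state) tasks to the kernel log, then read it back,
# or grab the kernel stack of one specific stuck process.
echo w > /proc/sysrq-trigger
dmesg | tail -n 100
cat /proc/3713/stack    # PID of the stuck dd from the lsof listing above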

Maloo report: https://maloo.whamcloud.com/test_sets/66e0f40c-1649-11e3-aa2a-52540035b04c



Comments
Comment by Oleg Drokin [ 06/Sep/13 ]

This might be more fallout from LU-3027

Comment by Jian Yu [ 26/Nov/13 ]

Lustre Build: http://build.whamcloud.com/job/lustre-b2_4/58/
Distro/Arch: RHEL6.4/x86_64

FSTYPE=zfs
MDSCOUNT=1
MDSSIZE=2097152
OSTCOUNT=2
OSTSIZE=2097152

The racer test hit the same failure again:
https://maloo.whamcloud.com/test_sets/d73fa562-5605-11e3-8e94-52540035b04c

The same test passed in a manual run:
https://maloo.whamcloud.com/test_sessions/2a310d0a-567a-11e3-82a8-52540035b04c
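
For a local manual run with the same configuration, something like the following should work from a Lustre source tree (a sketch only; the scripts are the standard ones in lustre/tests and the variables are those listed above):

# Format and mount a local test filesystem, run racer against it, then clean up.
cd lustre/tests
export FSTYPE=zfs MDSCOUNT=1 MDSSIZE=2097152 OSTCOUNT=2 OSTSIZE=2097152
sh llmount.sh
sh racer.sh
sh llmountcleanup.sh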

Comment by Saurabh Tandan (Inactive) [ 29/Sep/15 ]

The same issue was also observed in the suite_log of replay-dual during interop testing of the 2.7.60 tag:
https://testing.hpdd.intel.com/test_sets/f5135c3c-62dd-11e5-b25a-5254006e85c2

while umount  /mnt/lustre2 2>&1 | grep -q busy; do
    echo /mnt/lustre2 is still busy, wait one second && sleep 1;
done;
fi
Stopping client onyx-38vm6.onyx.hpdd.intel.com /mnt/lustre2 opts:
Stopping client onyx-38vm5.onyx.hpdd.intel.com /mnt/lustre2 opts: