Details
Type: Bug
Resolution: Fixed
Priority: Minor
Affects Version/s: Lustre 2.3.0, Lustre 2.4.0
Labels: None
Severity: 3
Rank (Obsolete): 5219
Description
This issue was created by maloo for yujian <yujian@whamcloud.com>
This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/140674a6-16b2-11e2-962d-52540035b04c.
Lustre Tag: v2_3_0_RC3
Lustre Build: http://build.whamcloud.com/job/lustre-b2_3/36
Distro/Arch: RHEL6.3/x86_64 (server), FC15/x86_64 (client)
Network: TCP
ENABLE_QUOTA=yes
The sub-test test_23a hung at unmounting the client:
== conf-sanity test 23a: interrupt client during recovery mount delay ================================ 02:41:31 (1350294091)
start mds service on fat-amd-2
Starting mds1: /dev/sdc5 /mnt/mds1
Started lustre-MDT0000
start ost1 service on fat-amd-3
Starting ost1: /dev/sdc5 /mnt/ost1
Started lustre-OST0000
mount lustre on /mnt/lustre.....
Starting client: client-5: -o user_xattr,flock fat-amd-2@tcp:/lustre /mnt/lustre
Stopping /mnt/mds1 (opts:) on fat-amd-2
Stopping client /mnt/lustre (opts: -f)
Stack trace on client:
[ 5526.947537] umount S ffff880316bb3170 0 7395 7009 0x00000080
[ 5526.954596] ffff8803136e57c8 0000000000000082 00000001004fdeea ffff88030af44560
[ 5526.962037] ffff8803136e5fd8 ffff8803136e5fd8 0000000000013840 0000000000013840
[ 5526.969479] ffff880323191720 ffff88030af44560 0000000000000000 0000000000000286
[ 5526.976921] Call Trace:
[ 5526.979396] [<ffffffffa054a570>] ? ptlrpc_interrupted_set+0x0/0x120 [ptlrpc]
[ 5526.986517] [<ffffffff8147461a>] schedule_timeout+0xa7/0xde
[ 5526.992168] [<ffffffff81060b58>] ? process_timeout+0x0/0x10
[ 5526.997829] [<ffffffffa02ae761>] cfs_waitq_timedwait+0x11/0x20 [libcfs]
[ 5527.004550] [<ffffffffa0555a9c>] ptlrpc_set_wait+0x2ec/0x8c0 [ptlrpc]
[ 5527.011066] [<ffffffff8104df76>] ? default_wake_function+0x0/0x14
[ 5527.017270] [<ffffffffa05560e8>] ptlrpc_queue_wait+0x78/0x230 [ptlrpc]
[ 5527.023900] [<ffffffffa05386c5>] ldlm_cli_enqueue+0x2f5/0x7b0 [ptlrpc]
[ 5527.030528] [<ffffffffa0536d90>] ? ldlm_completion_ast+0x0/0x6f0 [ptlrpc]
[ 5527.037408] [<ffffffffa0905cc0>] ? ll_md_blocking_ast+0x0/0x710 [lustre]
[ 5527.044186] [<ffffffffa0744e55>] mdc_enqueue+0x505/0x1590 [mdc]
[ 5527.050196] [<ffffffffa02b9578>] ? libcfs_log_return+0x28/0x40 [libcfs]
[ 5527.056885] [<ffffffffa074609e>] ? mdc_revalidate_lock+0x1be/0x1d0 [mdc]
[ 5527.063661] [<ffffffffa0746270>] mdc_intent_lock+0x1c0/0x5c0 [mdc]
[ 5527.069932] [<ffffffffa0905cc0>] ? ll_md_blocking_ast+0x0/0x710 [lustre]
[ 5527.076734] [<ffffffffa0536d90>] ? ldlm_completion_ast+0x0/0x6f0 [ptlrpc]
[ 5527.083601] [<ffffffffa09eed8b>] lmv_intent_lookup+0x3bb/0x11c0 [lmv]
[ 5527.090136] [<ffffffffa0905cc0>] ? ll_md_blocking_ast+0x0/0x710 [lustre]
[ 5527.096913] [<ffffffffa09f12f0>] lmv_intent_lock+0x310/0x370 [lmv]
[ 5527.103190] [<ffffffffa0905cc0>] ? ll_md_blocking_ast+0x0/0x710 [lustre]
[ 5527.109982] [<ffffffffa08e0944>] __ll_inode_revalidate_it+0x214/0xd90 [lustre]
[ 5527.117295] [<ffffffffa0905cc0>] ? ll_md_blocking_ast+0x0/0x710 [lustre]
[ 5527.124084] [<ffffffffa08e1764>] ll_inode_revalidate_it+0x44/0x1a0 [lustre]
[ 5527.131136] [<ffffffffa08e1903>] ll_getattr_it+0x43/0x170 [lustre]
[ 5527.137408] [<ffffffffa08e1a64>] ll_getattr+0x34/0x40 [lustre]
[ 5527.143317] [<ffffffff81125113>] vfs_getattr+0x45/0x63
[ 5527.148535] [<ffffffff8112517e>] vfs_fstatat+0x4d/0x63
[ 5527.153751] [<ffffffff811251cf>] vfs_stat+0x1b/0x1d
[ 5527.158709] [<ffffffff811252ce>] sys_newstat+0x1a/0x33
[ 5527.163927] [<ffffffff81129f89>] ? path_put+0x1f/0x23
[ 5527.169059] [<ffffffff8109fa08>] ? audit_syscall_entry+0x145/0x171
[ 5527.175315] [<ffffffff81009bc2>] system_call_fastpath+0x16/0x1b
Info required for matching: conf-sanity 23a
I have seen this repeatedly on SLES11 SP2 clients too, so it is not just the FC15 client.
I'm very suspicious that this may be due to version skew in the umount command. On el6, which does not fail, umount comes from util-linux-ng 2.17.2; on SLES11 SP2 it comes from util-linux 2.19.1. I don't know which version FC15 ships.
Running strace on 'umount -f' with the MDS down or unreachable, on el6 I see that the first significant syscall is umount():
....
getuid() = 0
geteuid() = 0
readlink("/mnt", 0x7fff08de1ad0, 4096) = -1 EINVAL (Invalid argument)
readlink("/mnt/lustre", 0x7fff08de1ad0, 4096) = -1 EINVAL (Invalid argument)
umask(077) = 022
open("/etc/mtab", O_RDONLY) = 3
umask(022) = 077
fstat(3, {st_mode=S_IFREG|0644, st_size=480, ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f4c55c00000
read(3, "/dev/mapper/vg_centos1-lv_root /"..., 4096) = 480
read(3, "", 4096) = 0
close(3) = 0
munmap(0x7f4c55c00000, 4096) = 0
stat("/sbin/umount.lustre", 0x7fff08de2900) = -1 ENOENT (No such file or directory)
rt_sigprocmask(SIG_BLOCK, ~[TRAP SEGV RTMIN RT_1], NULL, 8) = 0
umount("/mnt/lustre", MNT_FORCE) = 0
....
On SLES11 SP2 the first significant syscall is a stat() on the mount point:
....
getuid() = 0
geteuid() = 0
readlink("/mnt", 0x7fff0e6455d0, 4096) = -1 EINVAL (Invalid argument)
readlink("/mnt/lustre", 0x7fff0e6455d0, 4096) = -1 EINVAL (Invalid argument)
stat("/mnt/lustre",
< hangs here >
It appears to be this stat() call on the mount point that gets into the permanent client loop YS has described, so the umount command never reaches the umount() syscall. In earlier versions of the umount command, as seen on el6 clients, there is no stat() call, and the umount() call succeeds even with the MDS down.
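To make the distinction reproducible without juggling util-linux versions, here is a minimal C sketch. It is a hypothetical reproducer, not the util-linux source: the /mnt/lustre path and the --stat flag are invented for illustration. Without arguments it imitates the older umount and goes straight to a forced umount2(); with --stat it imitates the newer umount by stat()ing the mount point first, which should block in the same ll_getattr path shown in the stack trace above while the MDS is down:

/* force_umount.c: hedged sketch of the two umount code paths
 * described above; NOT the util-linux source. Build with
 * "cc -o force_umount force_umount.c" and run as root.
 */
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/mount.h>   /* umount2(), MNT_FORCE */

int main(int argc, char **argv)
{
    const char *mnt = "/mnt/lustre";   /* adjust to the test mount point */
    struct stat st;

    if (argc > 1 && strcmp(argv[1], "--stat") == 0) {
        /* Newer util-linux behavior: stat() the mount point first.
         * With the MDS down this is expected to block in the
         * ll_getattr -> mdc_enqueue -> ptlrpc_set_wait path. */
        if (stat(mnt, &st) < 0)
            perror("stat");
    }

    /* Older util-linux-ng behavior: call the forced unmount
     * directly, with no stat() on the mount point beforehand. */
    if (umount2(mnt, MNT_FORCE) < 0) {
        perror("umount2");
        return 1;
    }
    printf("unmounted %s\n", mnt);
    return 0;
}

If the analysis above is right, the no-argument run should return immediately even with the MDS unreachable, like the el6 umount, while the --stat run should hang exactly like the SLES11 SP2 and FC15 umount.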