Details
Type: Bug
Resolution: Fixed
Priority: Minor
Affects Version/s: Lustre 2.3.0, Lustre 2.4.0
Labels: None
Severity: 3
Rank (Obsolete): 5219
Description
This issue was created by maloo for yujian <yujian@whamcloud.com>
This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/140674a6-16b2-11e2-962d-52540035b04c.
Lustre Tag: v2_3_0_RC3
Lustre Build: http://build.whamcloud.com/job/lustre-b2_3/36
Distro/Arch: RHEL6.3/x86_64 (server), FC15/x86_64 (client)
Network: TCP
ENABLE_QUOTA=yes
The sub-test test_23a hung at unmounting the client:
== conf-sanity test 23a: interrupt client during recovery mount delay ================================ 02:41:31 (1350294091)
start mds service on fat-amd-2
Starting mds1: /dev/sdc5 /mnt/mds1
Started lustre-MDT0000
start ost1 service on fat-amd-3
Starting ost1: /dev/sdc5 /mnt/ost1
Started lustre-OST0000
mount lustre on /mnt/lustre.....
Starting client: client-5: -o user_xattr,flock fat-amd-2@tcp:/lustre /mnt/lustre
Stopping /mnt/mds1 (opts:) on fat-amd-2
Stopping client /mnt/lustre (opts: -f)
Stack trace on client:
[ 5526.947537] umount S ffff880316bb3170 0 7395 7009 0x00000080
[ 5526.954596] ffff8803136e57c8 0000000000000082 00000001004fdeea ffff88030af44560
[ 5526.962037] ffff8803136e5fd8 ffff8803136e5fd8 0000000000013840 0000000000013840
[ 5526.969479] ffff880323191720 ffff88030af44560 0000000000000000 0000000000000286
[ 5526.976921] Call Trace:
[ 5526.979396] [<ffffffffa054a570>] ? ptlrpc_interrupted_set+0x0/0x120 [ptlrpc]
[ 5526.986517] [<ffffffff8147461a>] schedule_timeout+0xa7/0xde
[ 5526.992168] [<ffffffff81060b58>] ? process_timeout+0x0/0x10
[ 5526.997829] [<ffffffffa02ae761>] cfs_waitq_timedwait+0x11/0x20 [libcfs]
[ 5527.004550] [<ffffffffa0555a9c>] ptlrpc_set_wait+0x2ec/0x8c0 [ptlrpc]
[ 5527.011066] [<ffffffff8104df76>] ? default_wake_function+0x0/0x14
[ 5527.017270] [<ffffffffa05560e8>] ptlrpc_queue_wait+0x78/0x230 [ptlrpc]
[ 5527.023900] [<ffffffffa05386c5>] ldlm_cli_enqueue+0x2f5/0x7b0 [ptlrpc]
[ 5527.030528] [<ffffffffa0536d90>] ? ldlm_completion_ast+0x0/0x6f0 [ptlrpc]
[ 5527.037408] [<ffffffffa0905cc0>] ? ll_md_blocking_ast+0x0/0x710 [lustre]
[ 5527.044186] [<ffffffffa0744e55>] mdc_enqueue+0x505/0x1590 [mdc]
[ 5527.050196] [<ffffffffa02b9578>] ? libcfs_log_return+0x28/0x40 [libcfs]
[ 5527.056885] [<ffffffffa074609e>] ? mdc_revalidate_lock+0x1be/0x1d0 [mdc]
[ 5527.063661] [<ffffffffa0746270>] mdc_intent_lock+0x1c0/0x5c0 [mdc]
[ 5527.069932] [<ffffffffa0905cc0>] ? ll_md_blocking_ast+0x0/0x710 [lustre]
[ 5527.076734] [<ffffffffa0536d90>] ? ldlm_completion_ast+0x0/0x6f0 [ptlrpc]
[ 5527.083601] [<ffffffffa09eed8b>] lmv_intent_lookup+0x3bb/0x11c0 [lmv]
[ 5527.090136] [<ffffffffa0905cc0>] ? ll_md_blocking_ast+0x0/0x710 [lustre]
[ 5527.096913] [<ffffffffa09f12f0>] lmv_intent_lock+0x310/0x370 [lmv]
[ 5527.103190] [<ffffffffa0905cc0>] ? ll_md_blocking_ast+0x0/0x710 [lustre]
[ 5527.109982] [<ffffffffa08e0944>] __ll_inode_revalidate_it+0x214/0xd90 [lustre]
[ 5527.117295] [<ffffffffa0905cc0>] ? ll_md_blocking_ast+0x0/0x710 [lustre]
[ 5527.124084] [<ffffffffa08e1764>] ll_inode_revalidate_it+0x44/0x1a0 [lustre]
[ 5527.131136] [<ffffffffa08e1903>] ll_getattr_it+0x43/0x170 [lustre]
[ 5527.137408] [<ffffffffa08e1a64>] ll_getattr+0x34/0x40 [lustre]
[ 5527.143317] [<ffffffff81125113>] vfs_getattr+0x45/0x63
[ 5527.148535] [<ffffffff8112517e>] vfs_fstatat+0x4d/0x63
[ 5527.153751] [<ffffffff811251cf>] vfs_stat+0x1b/0x1d
[ 5527.158709] [<ffffffff811252ce>] sys_newstat+0x1a/0x33
[ 5527.163927] [<ffffffff81129f89>] ? path_put+0x1f/0x23
[ 5527.169059] [<ffffffff8109fa08>] ? audit_syscall_entry+0x145/0x171
[ 5527.175315] [<ffffffff81009bc2>] system_call_fastpath+0x16/0x1b
Info required for matching: conf-sanity 23a
I have seen this repeatedly on SLES11 SP2 clients too, so it is not just the FC15 client.
I'm very suspicious that this may be due to version skew in the umount command. On el6, which does not fail, umount comes from util-linux-ng 2.17.2; on SLES11 SP2 it comes from util-linux 2.19.1. I don't know which version FC15 ships.
Running strace on 'umount -f' with the MDS down or unreachable, on el6 I see that the first significant syscall is umount():
....
getuid() = 0
geteuid() = 0
readlink("/mnt", 0x7fff08de1ad0, 4096) = -1 EINVAL (Invalid argument)
readlink("/mnt/lustre", 0x7fff08de1ad0, 4096) = -1 EINVAL (Invalid argument)
umask(077) = 022
open("/etc/mtab", O_RDONLY) = 3
umask(022) = 077
fstat(3, {st_mode=S_IFREG|0644, st_size=480, ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f4c55c00000
read(3, "/dev/mapper/vg_centos1-lv_root /"..., 4096) = 480
read(3, "", 4096) = 0
close(3) = 0
munmap(0x7f4c55c00000, 4096) = 0
stat("/sbin/umount.lustre", 0x7fff08de2900) = -1 ENOENT (No such file or directory)
rt_sigprocmask(SIG_BLOCK, ~[TRAP SEGV RTMIN RT_1], NULL, 8) = 0
umount("/mnt/lustre", MNT_FORCE) = 0
....
On SLES11 SP2 the first significant syscall is a stat() on the mount point:
....
getuid() = 0
geteuid() = 0
readlink("/mnt", 0x7fff0e6455d0, 4096) = -1 EINVAL (Invalid argument)
readlink("/mnt/lustre", 0x7fff0e6455d0, 4096) = -1 EINVAL (Invalid argument)
stat("/mnt/lustre",
< hangs here >
It appears to be this stat() call on the mount point that gets into the permanent client loop YS has described, so the umount command never reaches the umount() syscall. In earlier versions of the umount command, as seen on el6 clients, there is no stat() call, and the umount() call succeeds even with the MDS down.
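To make the distinction reproducible without juggling util-linux versions, here is a minimal C sketch. It is a hypothetical reproducer, not the util-linux source: the /mnt/lustre path and the --stat flag are invented for illustration. Without arguments it imitates the older umount and goes straight to a forced umount2(); with --stat it imitates the newer umount by stat()ing the mount point first, which should block in the same ll_getattr path shown in the stack trace above while the MDS is down:

/* force_umount.c: hedged sketch of the two umount code paths
 * described above; NOT the util-linux source. Build with
 * "cc -o force_umount force_umount.c" and run as root.
 */
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/mount.h>   /* umount2(), MNT_FORCE */

int main(int argc, char **argv)
{
    const char *mnt = "/mnt/lustre";   /* adjust to the test mount point */
    struct stat st;

    if (argc > 1 && strcmp(argv[1], "--stat") == 0) {
        /* Newer util-linux behavior: stat() the mount point first.
         * With the MDS down this is expected to block in the
         * ll_getattr -> mdc_enqueue -> ptlrpc_set_wait path. */
        if (stat(mnt, &st) < 0)
            perror("stat");
    }

    /* Older util-linux-ng behavior: call the forced unmount
     * directly, with no stat() on the mount point beforehand. */
    if (umount2(mnt, MNT_FORCE) < 0) {
        perror("umount2");
        return 1;
    }
    printf("unmounted %s\n", mnt);
    return 0;
}

If the analysis above is right, the no-argument run should return immediately even with the MDS unreachable, like the el6 umount, while the --stat run should hang exactly like the SLES11 SP2 and FC15 umount.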