Details
- Type: Bug
- Resolution: Cannot Reproduce
- Priority: Critical
- Fix Version/s: None
- Affects Version/s: Lustre 2.4.2
- 3
- 14564
Description
I'm seeing an issue on one of our production Lustre filesystems where the stat system call never returns, even though the MDS of the filesystem in question appears to be nearly idle.
$ mount
...
pilsner-mds1-lnet0@o2ib100:/lsd on /p/lscratchd type lustre (rw,nosuid,noauto,_netdev,lazystatfs,flock)

$ strace -f -tt /bin/ls /p/lscratchd/surya1
...
09:39:52.854819 stat("/p/lscratchd/surya1",
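Since the call simply never returns (rather than failing), the simplest way I have to check whether the mount is responsive is to run the stat in a child process with a timeout. A minimal sketch of that idea (illustrative only, not something from the incident; the default path is just the one from the strace above):

#!/usr/bin/env python
# Minimal sketch: detect a hung stat() by running it in a child process
# with a timeout. Illustrative only; adjust the path and timeout as needed.
import multiprocessing
import os
import sys

def do_stat(path):
    os.stat(path)              # blocks indefinitely if the MDS never replies

def stat_with_timeout(path, timeout=30):
    p = multiprocessing.Process(target=do_stat, args=(path,))
    p.start()
    p.join(timeout)
    if p.is_alive():           # stat() did not return within the timeout
        p.terminate()          # note: won't reap a task stuck in D state
        return False
    return p.exitcode == 0

if __name__ == "__main__":
    path = sys.argv[1] if len(sys.argv) > 1 else "/p/lscratchd/surya1"
    print("responsive" if stat_with_timeout(path) else "hung or failed")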
Looking at the CPU utilization on the MDS, there isn't a whole lot going on:
# perf record -a sleep 5
# perf report --stdio --symfs=/ 2>/dev/null | head -n 40 | tail -n 20
#
# Overhead          Command      Shared Object  Symbol
# ........  ...............  .................  ..........................................
#
    25.06%             init  [kernel.kallsyms]  [k] intel_idle
     1.84%          swapper  [kernel.kallsyms]  [k] intel_idle
     1.31%             init  [kernel.kallsyms]  [k] getnstimeofday
     1.18%             init  [kernel.kallsyms]  [k] schedule
     1.17%             init  [kernel.kallsyms]  [k] menu_select
     1.06%             init  [kernel.kallsyms]  [k] tick_nohz_stop_sched_tick
     0.81%            snmpd  [kernel.kallsyms]  [k] diskstats_show
     0.80%             init  [kernel.kallsyms]  [k] select_nohz_load_balancer
     0.78%             init  [kernel.kallsyms]  [k] fn_hash_lookup
     0.77%             init  [kernel.kallsyms]  [k] cpuidle_idle_call
     0.61%             init  [ib_ipoib]         [k] ipoib_ib_handle_rx_wc
     0.59%  kiblnd_sd_02_00  [kernel.kallsyms]  [k] find_busiest_group
     0.59%             init  [kernel.kallsyms]  [k] __remove_hrtimer
     0.52%             init  [kernel.kallsyms]  [k] rb_next
     0.51%             init  [kernel.kallsyms]  [k] try_to_wake_up
     0.51%  kiblnd_sd_02_01  [kernel.kallsyms]  [k] find_busiest_group
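To double-check that the Lustre service threads themselves are idle (and not just hidden in the system-wide perf profile), something like the following can sample utime+stime of the kernel service threads from /proc. This is an illustrative sketch, not something I ran during the incident; the thread-name prefixes are just guesses to match the threads shown below:

#!/usr/bin/env python
# Sketch: sample CPU ticks of kernel service threads twice and print the
# per-thread CPU usage over the interval. Matches thread names by prefix.
import os
import time

PREFIXES = ("mdt", "kiblnd", "osp-syn")   # adjust as needed
HZ = os.sysconf("SC_CLK_TCK")
INTERVAL = 5

def cpu_ticks():
    ticks = {}
    for pid in filter(str.isdigit, os.listdir("/proc")):
        try:
            with open("/proc/%s/stat" % pid) as f:
                data = f.read()
        except IOError:
            continue                       # thread exited between listdir() and open()
        comm = data[data.index("(") + 1:data.rindex(")")]
        fields = data[data.rindex(")") + 1:].split()
        if comm.startswith(PREFIXES):
            # fields[11] and fields[12] are utime and stime (after the comm field)
            ticks[(pid, comm)] = int(fields[11]) + int(fields[12])
    return ticks

before = cpu_ticks()
time.sleep(INTERVAL)
after = cpu_ticks()
for key in sorted(after):
    delta = after[key] - before.get(key, after[key])
    print("%-16s pid %-6s %6.2f%% cpu" % (key[1], key[0],
                                          100.0 * delta / HZ / INTERVAL))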
There's not a whole lot of ARC activity or disk activity on the MDS either:
# timeout 10 arcstat.py 1
    time  read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  arcsz     c
09:55:28     0     0      0     0    0     0    0     0    0   3.9G  1.1G
09:55:29     0     0      0     0    0     0    0     0    0   3.9G  1.1G
09:55:30     6     6    100     6  100     0    0     5  100   3.9G  1.1G
09:55:31     0     0      0     0    0     0    0     0    0   3.9G  1.1G
09:55:32     0     0      0     0    0     0    0     0    0   3.9G  1.1G
09:55:33     0     0      0     0    0     0    0     0    0   3.9G  1.1G
09:55:34     0     0      0     0    0     0    0     0    0   3.9G  1.1G
09:55:35     0     0      0     0    0     0    0     0    0   3.9G  1.1G
09:55:36     0     0      0     0    0     0    0     0    0   3.9G  1.1G
09:55:37     0     0      0     0    0     0    0     0    0   3.9G  1.1G

# timeout 5 iostat -xyz 1
Linux 2.6.32-431.17.2.1chaos.ch5.2.x86_64 (pilsner-mds1)  06/18/14  _x86_64_  (16 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.00    0.00    0.13    0.00    0.00   99.87

Device:  rrqm/s  wrqm/s   r/s   w/s  rsec/s  wsec/s avgrq-sz avgqu-sz  await  svctm  %util
sde        0.00    0.00  1.00  0.00    4.00    0.00     4.00     0.00   0.00   0.00   0.00
sdh        0.00    0.00  1.00  0.00    4.00    0.00     4.00     0.00   0.00   0.00   0.00
sdl        0.00    0.00  1.00  0.00    4.00    0.00     4.00     0.00   0.00   0.00   0.00
sdan       0.00    0.00  1.00  0.00    4.00    0.00     4.00     0.00   0.00   0.00   0.00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.00    0.00    0.25    0.00    0.00   99.75

Device:  rrqm/s  wrqm/s   r/s   w/s  rsec/s  wsec/s avgrq-sz avgqu-sz  await  svctm  %util
sdk        0.00    0.00  4.00  0.00  262.00    0.00    65.50     0.00   0.75   0.75   0.30
sdc        0.00    0.00  1.00  0.00    4.00    0.00     4.00     0.00   0.00   0.00   0.00
sdm        0.00    0.00  1.00  0.00    4.00    0.00     4.00     0.00   0.00   0.00   0.00
sdo        0.00    0.00  1.00  0.00    4.00    0.00     4.00     0.00   0.00   0.00   0.00
sdaf       0.00    0.00  5.00  0.00  266.00    0.00    53.20     0.00   0.20   0.20   0.10
dm-0       0.00    0.00  9.00  0.00  528.00    0.00    58.67     0.00   0.44   0.44   0.40

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.00    0.00    0.31    0.00    0.00   99.69

Device:  rrqm/s  wrqm/s   r/s   w/s  rsec/s  wsec/s avgrq-sz avgqu-sz  await  svctm  %util
sds        0.00    0.00  0.99  0.00    3.96    0.00     4.00     0.00   0.00   0.00   0.00
sdt        0.00    0.00  0.99  0.00    3.96    0.00     4.00     0.00   0.00   0.00   0.00
sdu        0.00    0.00  0.99  0.00    3.96    0.00     4.00     0.00   0.00   0.00   0.00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.00    0.00    0.19    0.00    0.00   99.81

Device:  rrqm/s  wrqm/s   r/s   w/s  rsec/s  wsec/s avgrq-sz avgqu-sz  await  svctm  %util
sdb        0.00    0.00  1.00  0.00    4.00    0.00     4.00     0.00   0.00   0.00   0.00
sdp        0.00    0.00  1.00  0.00    4.00    0.00     4.00     0.00   0.00   0.00   0.00
sdv        0.00    0.00  1.00  0.00    4.00    0.00     4.00     0.00   1.00   1.00   0.10
sdah       0.00    0.00  1.00  0.00    4.00    0.00     4.00     0.00   0.00   0.00   0.00
sdaj       0.00    0.00  1.00  0.00    4.00    0.00     4.00     0.00   0.00   0.00   0.00
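If it helps to rule out iostat itself, the same picture can be cross-checked straight from /proc/diskstats. This is just an illustrative sketch of that cross-check (not something from the incident); it samples the counters twice and prints per-device read/write rates:

#!/usr/bin/env python
# Sketch: cross-check iostat by sampling /proc/diskstats over an interval
# and printing per-device IOPS and sectors/s for devices with any activity.
import time

INTERVAL = 5

def snapshot():
    stats = {}
    with open("/proc/diskstats") as f:
        for line in f:
            fields = line.split()
            if len(fields) < 10:
                continue
            # fields[2]=name, [3]=reads completed, [5]=sectors read,
            # [7]=writes completed, [9]=sectors written
            stats[fields[2]] = (int(fields[3]), int(fields[5]),
                                int(fields[7]), int(fields[9]))
    return stats

before = snapshot()
time.sleep(INTERVAL)
after = snapshot()
for dev in sorted(after):
    r, rs, w, ws = (a - b for a, b in zip(after[dev],
                                          before.get(dev, after[dev])))
    if r or w:
        print("%-8s %6.1f r/s %8.1f rsec/s %6.1f w/s %8.1f wsec/s"
              % (dev, r / float(INTERVAL), rs / float(INTERVAL),
                 w / float(INTERVAL), ws / float(INTERVAL)))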
I dumped the stacks on the MDS using sysrq-t and see many of the "mdt" threads (maybe all of them; I haven't done an exhaustive check) blocked in stacks like the following:
2014-06-18 08:26:47 mdt02_090     S 000000000000000a     0 13594      2 0x00000000
2014-06-18 08:26:47  ffff881f325d7a00 0000000000000046 0000000000000000 ffff881f325d79c4
2014-06-18 08:26:47  0000000000000000 ffff88103fe83200 ffff880060636840 00000000000003ff
2014-06-18 08:26:47  ffff881f32811058 ffff881f325d7fd8 000000000000fbc8 ffff881f32811058
2014-06-18 08:26:47 Call Trace:
2014-06-18 08:26:47  [<ffffffffa060277e>] cfs_waitq_wait+0xe/0x10 [libcfs]
2014-06-18 08:26:47  [<ffffffffa08d844a>] ldlm_completion_ast+0x57a/0x960 [ptlrpc]
2014-06-18 08:26:47  [<ffffffffa08d3b50>] ? ldlm_expired_completion_wait+0x0/0x390 [ptlrpc]
2014-06-18 08:26:47  [<ffffffff81061f20>] ? default_wake_function+0x0/0x20
2014-06-18 08:26:47  [<ffffffffa08d7af8>] ldlm_cli_enqueue_local+0x1f8/0x5d0 [ptlrpc]
2014-06-18 08:26:47  [<ffffffffa08d7ed0>] ? ldlm_completion_ast+0x0/0x960 [ptlrpc]
2014-06-18 08:26:47  [<ffffffffa0e88ad0>] ? mdt_blocking_ast+0x0/0x2a0 [mdt]
2014-06-18 08:26:47  [<ffffffffa0e92fef>] mdt_object_lock0+0x33f/0xaf0 [mdt]
2014-06-18 08:26:47  [<ffffffffa0e88ad0>] ? mdt_blocking_ast+0x0/0x2a0 [mdt]
2014-06-18 08:26:47  [<ffffffffa08d7ed0>] ? ldlm_completion_ast+0x0/0x960 [ptlrpc]
2014-06-18 08:26:47  [<ffffffffa0e93864>] mdt_object_lock+0x14/0x20 [mdt]
2014-06-18 08:26:47  [<ffffffffa0eaeabf>] mdt_attr_set+0x8f/0x560 [mdt]
2014-06-18 08:26:47  [<ffffffffa0eaf54d>] mdt_reint_setattr+0x5bd/0xcf0 [mdt]
2014-06-18 08:26:47  [<ffffffffa0901e7e>] ? lustre_pack_reply_flags+0xae/0x1f0 [ptlrpc]
2014-06-18 08:26:47  [<ffffffffa0ea8e71>] mdt_reint_rec+0x41/0xe0 [mdt]
2014-06-18 08:26:47  [<ffffffffa0e8db13>] mdt_reint_internal+0x4c3/0x780 [mdt]
2014-06-18 08:26:47  [<ffffffffa0e8de14>] mdt_reint+0x44/0xe0 [mdt]
2014-06-18 08:26:47  [<ffffffffa0e91827>] mdt_handle_common+0x647/0x16d0 [mdt]
2014-06-18 08:26:47  [<ffffffffa0eccba5>] mds_regular_handle+0x15/0x20 [mdt]
2014-06-18 08:26:47  [<ffffffffa09120f8>] ptlrpc_server_handle_request+0x398/0xc60 [ptlrpc]
2014-06-18 08:26:47  [<ffffffffa060263e>] ? cfs_timer_arm+0xe/0x10 [libcfs]
2014-06-18 08:26:47  [<ffffffffa0613f6f>] ? lc_watchdog_touch+0x6f/0x170 [libcfs]
2014-06-18 08:26:47  [<ffffffffa0909459>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc]
2014-06-18 08:26:47  [<ffffffffa091348e>] ptlrpc_main+0xace/0x1700 [ptlrpc]
2014-06-18 08:26:47  [<ffffffffa09129c0>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
2014-06-18 08:26:47  [<ffffffff8100c24a>] child_rip+0xa/0x20
2014-06-18 08:26:47  [<ffffffffa09129c0>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
2014-06-18 08:26:47  [<ffffffffa09129c0>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
2014-06-18 08:26:47  [<ffffffff8100c240>] ? child_rip+0x0/0x20

2014-06-18 08:26:19 mdt03_068     S 000000000000000f     0 13280      2 0x00000000
2014-06-18 08:26:19  ffff881f4938b8f0 0000000000000046 0000000000000000 ffff881f4938b940
2014-06-18 08:26:19  0000000000000001 ffff881f49d42040 ffff881f4938b8c0 ffffffffa06124a1
2014-06-18 08:26:19  ffff881f49d425f8 ffff881f4938bfd8 000000000000fbc8 ffff881f49d425f8
2014-06-18 08:26:19 Call Trace:
2014-06-18 08:26:19  [<ffffffffa06124a1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
2014-06-18 08:26:19  [<ffffffffa060277e>] cfs_waitq_wait+0xe/0x10 [libcfs]
2014-06-18 08:26:19  [<ffffffffa08d844a>] ldlm_completion_ast+0x57a/0x960 [ptlrpc]
2014-06-18 08:26:19  [<ffffffffa08d3b50>] ? ldlm_expired_completion_wait+0x0/0x390 [ptlrpc]
2014-06-18 08:26:19  [<ffffffff81061f20>] ? default_wake_function+0x0/0x20
2014-06-18 08:26:19  [<ffffffffa08d7af8>] ldlm_cli_enqueue_local+0x1f8/0x5d0 [ptlrpc]
2014-06-18 08:26:19  [<ffffffffa08d7ed0>] ? ldlm_completion_ast+0x0/0x960 [ptlrpc]
2014-06-18 08:26:19  [<ffffffffa0e88ad0>] ? mdt_blocking_ast+0x0/0x2a0 [mdt]
2014-06-18 08:26:19  [<ffffffffa0e92fef>] mdt_object_lock0+0x33f/0xaf0 [mdt]
2014-06-18 08:26:19  [<ffffffffa0e88ad0>] ? mdt_blocking_ast+0x0/0x2a0 [mdt]
2014-06-18 08:26:19  [<ffffffffa08d7ed0>] ? ldlm_completion_ast+0x0/0x960 [ptlrpc]
2014-06-18 08:26:19  [<ffffffffa0e93864>] mdt_object_lock+0x14/0x20 [mdt]
2014-06-18 08:26:19  [<ffffffffa0e9ddc9>] mdt_getattr_name_lock+0xe29/0x1990 [mdt]
2014-06-18 08:26:19  [<ffffffffa0618f62>] ? cfs_hash_lookup+0x82/0xa0 [libcfs]
2014-06-18 08:26:19  [<ffffffffa090243c>] ? lustre_msg_clear_flags+0x6c/0xc0 [ptlrpc]
2014-06-18 08:26:19  [<ffffffffa0e9ebcd>] mdt_intent_getattr+0x29d/0x490 [mdt]
2014-06-18 08:26:19  [<ffffffffa0e8bf4e>] mdt_intent_policy+0x39e/0x720 [mdt]
2014-06-18 08:26:19  [<ffffffffa08b8911>] ldlm_lock_enqueue+0x361/0x8d0 [ptlrpc]
2014-06-18 08:26:19  [<ffffffffa08e1a4f>] ldlm_handle_enqueue0+0x4ef/0x10a0 [ptlrpc]
2014-06-18 08:26:19  [<ffffffffa0e8c3d6>] mdt_enqueue+0x46/0xe0 [mdt]
2014-06-18 08:26:19  [<ffffffffa0e91827>] mdt_handle_common+0x647/0x16d0 [mdt]
2014-06-18 08:26:19  [<ffffffffa0eccba5>] mds_regular_handle+0x15/0x20 [mdt]
2014-06-18 08:26:19  [<ffffffffa09120f8>] ptlrpc_server_handle_request+0x398/0xc60 [ptlrpc]
2014-06-18 08:26:19  [<ffffffffa060263e>] ? cfs_timer_arm+0xe/0x10 [libcfs]
2014-06-18 08:26:19  [<ffffffffa0613f6f>] ? lc_watchdog_touch+0x6f/0x170 [libcfs]
2014-06-18 08:26:19  [<ffffffffa0909459>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc]
2014-06-18 08:26:19  [<ffffffffa091348e>] ptlrpc_main+0xace/0x1700 [ptlrpc]
2014-06-18 08:26:19  [<ffffffffa09129c0>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
2014-06-18 08:26:19  [<ffffffff8100c24a>] child_rip+0xa/0x20
2014-06-18 08:26:19  [<ffffffffa09129c0>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
2014-06-18 08:26:19  [<ffffffffa09129c0>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
2014-06-18 08:26:19  [<ffffffff8100c240>] ? child_rip+0x0/0x20
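To quantify how many threads are parked like this (rather than eyeballing the dump), something along these lines can summarize the sysrq-t output by the first meaningful function each thread is sleeping under. This is an illustrative sketch that assumes the console-log format of the excerpts above (timestamp prefix, thread header, then bracketed frames); the regexes would need adjusting for a different format:

#!/usr/bin/env python
# Sketch: count service threads per "top" waiting function in a sysrq-t dump,
# e.g. how many threads are blocked under ldlm_completion_ast. Reads the
# console log on stdin.
import collections
import re
import sys

THREAD_RE = re.compile(r'^\S+ \S+ (\S+)\s+[RSD] ')             # e.g. "mdt02_090 S ..."
FRAME_RE = re.compile(r'\[<[0-9a-f]+>\]\s+(?!\?)([\w.]+)\+')    # skip "? frame" entries
SKIP = ("cfs_waitq_wait", "schedule")                           # generic sleep frames

counts = collections.Counter()
in_thread = False
for line in sys.stdin:
    if THREAD_RE.match(line):
        in_thread = True
        continue
    if in_thread:
        f = FRAME_RE.search(line)
        if f and f.group(1) not in SKIP:
            counts[f.group(1)] += 1
            in_thread = False          # only count the first meaningful frame per thread

for func, n in counts.most_common():
    print("%5d  %s" % (n, func))

Run as, for example, "python stack_summary.py < console.log"; for the two stacks above it would report both mdt threads under ldlm_completion_ast.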
I also see many "osp-syn" threads like this:
2014-06-18 08:25:19 osp-syn-36    S 000000000000000a     0 12526      2 0x00000000
2014-06-18 08:25:19  ffff880fc3325a20 0000000000000046 0000000000000000 ffff880fc3c0e640
2014-06-18 08:25:19  ffff880fc33259b8 0000000000000001 ffff880fd8156800 ffff880fc33259f0
2014-06-18 08:25:19  ffff88100f10f058 ffff880fc3325fd8 000000000000fbc8 ffff88100f10f058
2014-06-18 08:25:19 Call Trace:
2014-06-18 08:25:19  [<ffffffffa060277e>] cfs_waitq_wait+0xe/0x10 [libcfs]
2014-06-18 08:25:19  [<ffffffffa0fab0e1>] osp_sync_process_queues+0x251/0x15e0 [osp]
2014-06-18 08:25:19  [<ffffffff81061f20>] ? default_wake_function+0x0/0x20
2014-06-18 08:25:19  [<ffffffffa07240fb>] llog_process_thread+0x8fb/0xe00 [obdclass]
2014-06-18 08:25:19  [<ffffffffa0faae90>] ? osp_sync_process_queues+0x0/0x15e0 [osp]
2014-06-18 08:25:19  [<ffffffffa0725e0d>] llog_process_or_fork+0x12d/0x660 [obdclass]
2014-06-18 08:25:19  [<ffffffffa07285fa>] llog_cat_process_cb+0x56a/0x620 [obdclass]
2014-06-18 08:25:19  [<ffffffffa07240fb>] llog_process_thread+0x8fb/0xe00 [obdclass]
2014-06-18 08:25:19  [<ffffffffa0728090>] ? llog_cat_process_cb+0x0/0x620 [obdclass]
2014-06-18 08:25:19  [<ffffffffa0725e0d>] llog_process_or_fork+0x12d/0x660 [obdclass]
2014-06-18 08:25:19  [<ffffffffa0726ee9>] llog_cat_process_or_fork+0x89/0x350 [obdclass]
2014-06-18 08:25:19  [<ffffffff810547c9>] ? __wake_up_common+0x59/0x90
2014-06-18 08:25:19  [<ffffffffa0faae90>] ? osp_sync_process_queues+0x0/0x15e0 [osp]
2014-06-18 08:25:19  [<ffffffffa07271c9>] llog_cat_process+0x19/0x20 [obdclass]
2014-06-18 08:25:19  [<ffffffffa06027da>] ? cfs_waitq_signal+0x1a/0x20 [libcfs]
2014-06-18 08:25:19  [<ffffffffa0faca80>] osp_sync_thread+0x240/0x7e0 [osp]
2014-06-18 08:25:19  [<ffffffffa0fac840>] ? osp_sync_thread+0x0/0x7e0 [osp]
2014-06-18 08:25:19  [<ffffffff8100c24a>] child_rip+0xa/0x20
2014-06-18 08:25:19  [<ffffffffa0fac840>] ? osp_sync_thread+0x0/0x7e0 [osp]
2014-06-18 08:25:19  [<ffffffffa0fac840>] ? osp_sync_thread+0x0/0x7e0 [osp]
2014-06-18 08:25:19  [<ffffffff8100c240>] ? child_rip+0x0/0x20
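In case it is useful for whoever looks at this: one way to tell whether these osp-syn threads are making any progress (as opposed to being wedged) would be to sample the osp sync counters over an interval. The sketch below is illustrative only and assumes osp.*.sync_changes / sync_in_flight / sync_in_progress proc files are exposed on this Lustre version; adjust the glob if they live elsewhere:

#!/usr/bin/env python
# Sketch: sample the osp sync counters twice and print the deltas, to see
# whether the osp-syn threads are draining their llog backlog. Assumes the
# /proc/fs/lustre/osp/*/sync_* files exist on this version.
import glob
import time

PATTERN = "/proc/fs/lustre/osp/*/sync_*"
INTERVAL = 10

def sample():
    values = {}
    for path in glob.glob(PATTERN):
        try:
            with open(path) as f:
                values[path] = int(f.read().strip())
        except (IOError, ValueError):
            pass                               # skip non-numeric or vanished files
    return values

before = sample()
time.sleep(INTERVAL)
after = sample()
for path in sorted(after):
    delta = after[path] - before.get(path, after[path])
    print("%-70s %10d (delta %+d)" % (path, after[path], delta))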
I think this might be similar to, if not the same as, what was reported in LU-4335, which is still unresolved.
There might be some ongoing network/router issues with one of the client clusters mounting this filesystem. In the console logs of the MDS and the OSSs I see reconnection messages all starting around the same time and all coming from the same client cluster:
2014-06-17 21:26:33 Lustre: MGS: Client ded91cb5-244e-1441-a4eb-3f24b0b31de4 (at 192.168.115.81@o2ib5) reconnecting
2014-06-17 21:26:33 Lustre: Skipped 3 previous similar messages
2014-06-17 21:26:37 Lustre: MGS: Client 556f6ca6-d372-c801-a2f7-9c58a2f9d6f2 (at 192.168.112.183@o2ib5) reconnecting
2014-06-17 21:26:37 Lustre: Skipped 1 previous similar message
2014-06-17 21:26:40 Lustre: lsd-MDT0000: Client bf288f47-9ee8-e9d8-6b45-440e971a9cbd (at 192.168.114.211@o2ib5) reconnecting
2014-06-17 21:26:52 Lustre: lsd-MDT0000: Client 2a1ffedf-635d-0d05-fc41-7f70e5389d13 (at 192.168.112.180@o2ib5) reconnecting
2014-06-17 21:26:56 Lustre: MGS: Client c36d44f9-4964-583a-1c78-24a2e5c1ff22 (at 192.168.115.90@o2ib5) reconnecting
2014-06-17 21:26:56 Lustre: Skipped 5 previous similar messages
2014-06-17 21:27:04 Lustre: lsd-MDT0000: Client 1e97ad41-dd1e-0abd-5ad0-95ba5e774908 (at 192.168.115.193@o2ib5) reconnecting
2014-06-17 21:27:04 Lustre: Skipped 14 previous similar messages
2014-06-17 21:27:22 Lustre: lsd-MDT0000: Client 8270f083-0c43-751e-be7a-a3b66b8a6a32 (at 192.168.113.53@o2ib5) reconnecting
2014-06-17 21:27:22 Lustre: Skipped 17 previous similar messages
2014-06-17 21:27:54 Lustre: MGS: Client c8ef8e8a-30a1-ad69-6d1c-377755d6f220 (at 192.168.119.130@o2ib5) reconnecting
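To back up the claim that the reconnects are all coming from one client cluster, the console logs can be tallied by LNet network and subnet. An illustrative sketch (not from the incident) based on the message format shown above:

#!/usr/bin/env python
# Sketch: tally "Client ... (at NID) reconnecting" console messages by LNet
# network and /24 subnet, to see whether the reconnects are confined to one
# client cluster. Reads the console log on stdin.
import collections
import re
import sys

RECONNECT_RE = re.compile(r'Client \S+ \(at (\d+\.\d+\.\d+)\.\d+@(\S+)\) reconnecting')

counts = collections.Counter()
for line in sys.stdin:
    m = RECONNECT_RE.search(line)
    if m:
        subnet, net = m.groups()
        counts[(net, subnet + ".0/24")] += 1

for (net, subnet), n in counts.most_common():
    print("%5d reconnects from %s on %s" % (n, subnet, net))

For the excerpt above this would show all of the reconnects coming from 192.168.112-119.x on o2ib5.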
From this point on, there are a lot of messages in the MDS console log (e.g. LNET service thread stacks, lock timer expirations, LDLM AST messages, evictions, reconnections, etc.), so it's hard to make sense of it all. The OSS nodes are noisy after the initial reconnection messages as well, but much less so than the MDS. The messages seen on the OSSs after the initial event are mainly evictions, reconnections, and IO errors/timeouts.
Just to be clear, the hung stat system call I posted above was not from a client in the cluster making these initial reconnection requests; it was from another client cluster mounting this same filesystem.
So far, my best guess is that the reconnection requests from the "problem" client cluster have rendered this MDS "useless". I have a vague recollection of this happening to us before, which, if I recall correctly, is why we decided to no longer mount all filesystems on all client clusters.
If you want to mark it "won't fix" or "cannot reproduce" since it is so old, that is fine with me.