MDS idle - 'stat' system call hangs - possible router issues on different client cluster

Details

    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Critical
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.4.2
    • Severity: 3
    • 14564

    Description

      I'm seeing an issue on one of our production Lustre filesystems where the stat system call is not returning, but the MDS of the filesystem in question appears to be nearly idle.

      $ mount
      ...
      pilsner-mds1-lnet0@o2ib100:/lsd on /p/lscratchd type lustre (rw,nosuid,noauto,_netdev,lazystatfs,flock)
      $ strace -f -tt /bin/ls /p/lscratchd/surya1
      ...
      09:39:52.854819 stat("/p/lscratchd/surya1", 
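
      To see where the hung process is blocked on the client side, its kernel stack can be inspected directly. A minimal sketch, assuming /proc/<pid>/stack is available on this client kernel:

      # Find the PID of the hung ls and dump its kernel stack
      pid=$(pgrep -f '/bin/ls /p/lscratchd/surya1')
      cat /proc/$pid/wchan; echo      # symbol the task is sleeping in
      cat /proc/$pid/stack            # full kernel stack of the blocked task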
      

      Looking at the CPU utilization on the MDS, there isn't a whole lot going on:

      # perf record -a sleep 5
      # perf report --stdio --symfs=/ 2>/dev/null | head -n 40 | tail -n 20
      #
      # Overhead          Command      Shared Object                                      Symbol
      # ........  ...............  .................  ..........................................
      #
          25.06%             init  [kernel.kallsyms]  [k] intel_idle                            
           1.84%          swapper  [kernel.kallsyms]  [k] intel_idle                            
           1.31%             init  [kernel.kallsyms]  [k] getnstimeofday                        
           1.18%             init  [kernel.kallsyms]  [k] schedule                              
           1.17%             init  [kernel.kallsyms]  [k] menu_select                           
           1.06%             init  [kernel.kallsyms]  [k] tick_nohz_stop_sched_tick             
           0.81%            snmpd  [kernel.kallsyms]  [k] diskstats_show                        
           0.80%             init  [kernel.kallsyms]  [k] select_nohz_load_balancer             
           0.78%             init  [kernel.kallsyms]  [k] fn_hash_lookup                        
           0.77%             init  [kernel.kallsyms]  [k] cpuidle_idle_call                     
           0.61%             init  [ib_ipoib]         [k] ipoib_ib_handle_rx_wc                 
           0.59%  kiblnd_sd_02_00  [kernel.kallsyms]  [k] find_busiest_group                    
           0.59%             init  [kernel.kallsyms]  [k] __remove_hrtimer                      
           0.52%             init  [kernel.kallsyms]  [k] rb_next                               
           0.51%             init  [kernel.kallsyms]  [k] try_to_wake_up                        
           0.51%  kiblnd_sd_02_01  [kernel.kallsyms]  [k] find_busiest_group
      

      There's not a whole lot of ARC activity or disk activity on the MDS either:

      # timeout 10 arcstat.py 1
          time  read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  arcsz     c  
      09:55:28     0     0      0     0    0     0    0     0    0   3.9G  1.1G  
      09:55:29     0     0      0     0    0     0    0     0    0   3.9G  1.1G  
      09:55:30     6     6    100     6  100     0    0     5  100   3.9G  1.1G  
      09:55:31     0     0      0     0    0     0    0     0    0   3.9G  1.1G  
      09:55:32     0     0      0     0    0     0    0     0    0   3.9G  1.1G  
      09:55:33     0     0      0     0    0     0    0     0    0   3.9G  1.1G  
      09:55:34     0     0      0     0    0     0    0     0    0   3.9G  1.1G  
      09:55:35     0     0      0     0    0     0    0     0    0   3.9G  1.1G  
      09:55:36     0     0      0     0    0     0    0     0    0   3.9G  1.1G  
      09:55:37     0     0      0     0    0     0    0     0    0   3.9G  1.1G
      
      # timeout 5 iostat -xyz 1
      Linux 2.6.32-431.17.2.1chaos.ch5.2.x86_64 (pilsner-mds1)        06/18/14        _x86_64_        (16 CPU)
      
      avg-cpu:  %user   %nice %system %iowait  %steal   %idle
                 0.00    0.00    0.13    0.00    0.00   99.87
      
      Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
      sde               0.00     0.00    1.00    0.00     4.00     0.00     4.00     0.00    0.00   0.00   0.00
      sdh               0.00     0.00    1.00    0.00     4.00     0.00     4.00     0.00    0.00   0.00   0.00
      sdl               0.00     0.00    1.00    0.00     4.00     0.00     4.00     0.00    0.00   0.00   0.00
      sdan              0.00     0.00    1.00    0.00     4.00     0.00     4.00     0.00    0.00   0.00   0.00
      
      avg-cpu:  %user   %nice %system %iowait  %steal   %idle
                 0.00    0.00    0.25    0.00    0.00   99.75
      
      Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
      sdk               0.00     0.00    4.00    0.00   262.00     0.00    65.50     0.00    0.75   0.75   0.30
      sdc               0.00     0.00    1.00    0.00     4.00     0.00     4.00     0.00    0.00   0.00   0.00
      sdm               0.00     0.00    1.00    0.00     4.00     0.00     4.00     0.00    0.00   0.00   0.00
      sdo               0.00     0.00    1.00    0.00     4.00     0.00     4.00     0.00    0.00   0.00   0.00
      sdaf              0.00     0.00    5.00    0.00   266.00     0.00    53.20     0.00    0.20   0.20   0.10
      dm-0              0.00     0.00    9.00    0.00   528.00     0.00    58.67     0.00    0.44   0.44   0.40
      
      avg-cpu:  %user   %nice %system %iowait  %steal   %idle
                 0.00    0.00    0.31    0.00    0.00   99.69
      
      Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
      sds               0.00     0.00    0.99    0.00     3.96     0.00     4.00     0.00    0.00   0.00   0.00
      sdt               0.00     0.00    0.99    0.00     3.96     0.00     4.00     0.00    0.00   0.00   0.00
      sdu               0.00     0.00    0.99    0.00     3.96     0.00     4.00     0.00    0.00   0.00   0.00
      
      avg-cpu:  %user   %nice %system %iowait  %steal   %idle
                 0.00    0.00    0.19    0.00    0.00   99.81
      
      Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
      sdb               0.00     0.00    1.00    0.00     4.00     0.00     4.00     0.00    0.00   0.00   0.00
      sdp               0.00     0.00    1.00    0.00     4.00     0.00     4.00     0.00    0.00   0.00   0.00
      sdv               0.00     0.00    1.00    0.00     4.00     0.00     4.00     0.00    1.00   1.00   0.10
      sdah              0.00     0.00    1.00    0.00     4.00     0.00     4.00     0.00    0.00   0.00   0.00
      sdaj              0.00     0.00    1.00    0.00     4.00     0.00     4.00     0.00    0.00   0.00   0.00
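
      Lustre's own RPC counters on the MDS could corroborate this idleness as well. A rough sketch; the parameter name below is an assumption and varies between Lustre versions:

      # Snapshot the MDT service stats twice; if the request counters barely
      # move between snapshots, the MDS really is idle (not just quiet on CPU/disk)
      lctl get_param mds.MDS.mdt.stats > /tmp/mdt.stats.1
      sleep 30
      lctl get_param mds.MDS.mdt.stats > /tmp/mdt.stats.2
      diff /tmp/mdt.stats.1 /tmp/mdt.stats.2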
      

      I dumped the stacks on the MDS using sysrq-t and see that many of the "mdt" threads (maybe all; I haven't done an exhaustive check) look like the following:

      2014-06-18 08:26:47 mdt02_090     S 000000000000000a     0 13594      2 0x00000000
      2014-06-18 08:26:47  ffff881f325d7a00 0000000000000046 0000000000000000 ffff881f325d79c4
      2014-06-18 08:26:47  0000000000000000 ffff88103fe83200 ffff880060636840 00000000000003ff
      2014-06-18 08:26:47  ffff881f32811058 ffff881f325d7fd8 000000000000fbc8 ffff881f32811058
      2014-06-18 08:26:47 Call Trace:
      2014-06-18 08:26:47  [<ffffffffa060277e>] cfs_waitq_wait+0xe/0x10 [libcfs]
      2014-06-18 08:26:47  [<ffffffffa08d844a>] ldlm_completion_ast+0x57a/0x960 [ptlrpc]
      2014-06-18 08:26:47  [<ffffffffa08d3b50>] ? ldlm_expired_completion_wait+0x0/0x390 [ptlrpc]
      2014-06-18 08:26:47  [<ffffffff81061f20>] ? default_wake_function+0x0/0x20
      2014-06-18 08:26:47  [<ffffffffa08d7af8>] ldlm_cli_enqueue_local+0x1f8/0x5d0 [ptlrpc]
      2014-06-18 08:26:47  [<ffffffffa08d7ed0>] ? ldlm_completion_ast+0x0/0x960 [ptlrpc]
      2014-06-18 08:26:47  [<ffffffffa0e88ad0>] ? mdt_blocking_ast+0x0/0x2a0 [mdt]
      2014-06-18 08:26:47  [<ffffffffa0e92fef>] mdt_object_lock0+0x33f/0xaf0 [mdt]
      2014-06-18 08:26:47  [<ffffffffa0e88ad0>] ? mdt_blocking_ast+0x0/0x2a0 [mdt]
      2014-06-18 08:26:47  [<ffffffffa08d7ed0>] ? ldlm_completion_ast+0x0/0x960 [ptlrpc]
      2014-06-18 08:26:47  [<ffffffffa0e93864>] mdt_object_lock+0x14/0x20 [mdt]
      2014-06-18 08:26:47  [<ffffffffa0eaeabf>] mdt_attr_set+0x8f/0x560 [mdt]
      2014-06-18 08:26:47  [<ffffffffa0eaf54d>] mdt_reint_setattr+0x5bd/0xcf0 [mdt]
      2014-06-18 08:26:47  [<ffffffffa0901e7e>] ? lustre_pack_reply_flags+0xae/0x1f0 [ptlrpc]
      2014-06-18 08:26:47  [<ffffffffa0ea8e71>] mdt_reint_rec+0x41/0xe0 [mdt]
      2014-06-18 08:26:47  [<ffffffffa0e8db13>] mdt_reint_internal+0x4c3/0x780 [mdt]
      2014-06-18 08:26:47  [<ffffffffa0e8de14>] mdt_reint+0x44/0xe0 [mdt]
      2014-06-18 08:26:47  [<ffffffffa0e91827>] mdt_handle_common+0x647/0x16d0 [mdt]
      2014-06-18 08:26:47  [<ffffffffa0eccba5>] mds_regular_handle+0x15/0x20 [mdt]
      2014-06-18 08:26:47  [<ffffffffa09120f8>] ptlrpc_server_handle_request+0x398/0xc60 [ptlrpc]
      2014-06-18 08:26:47  [<ffffffffa060263e>] ? cfs_timer_arm+0xe/0x10 [libcfs]
      2014-06-18 08:26:47  [<ffffffffa0613f6f>] ? lc_watchdog_touch+0x6f/0x170 [libcfs]
      2014-06-18 08:26:47  [<ffffffffa0909459>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc]
      2014-06-18 08:26:47  [<ffffffffa091348e>] ptlrpc_main+0xace/0x1700 [ptlrpc]
      2014-06-18 08:26:47  [<ffffffffa09129c0>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
      2014-06-18 08:26:47  [<ffffffff8100c24a>] child_rip+0xa/0x20
      2014-06-18 08:26:47  [<ffffffffa09129c0>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
      2014-06-18 08:26:47  [<ffffffffa09129c0>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
      2014-06-18 08:26:47  [<ffffffff8100c240>] ? child_rip+0x0/0x20
      
      
      2014-06-18 08:26:19 mdt03_068     S 000000000000000f     0 13280      2 0x00000000
      2014-06-18 08:26:19  ffff881f4938b8f0 0000000000000046 0000000000000000 ffff881f4938b940
      2014-06-18 08:26:19  0000000000000001 ffff881f49d42040 ffff881f4938b8c0 ffffffffa06124a1
      2014-06-18 08:26:19  ffff881f49d425f8 ffff881f4938bfd8 000000000000fbc8 ffff881f49d425f8
      2014-06-18 08:26:19 Call Trace:
      2014-06-18 08:26:19  [<ffffffffa06124a1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
      2014-06-18 08:26:19  [<ffffffffa060277e>] cfs_waitq_wait+0xe/0x10 [libcfs]
      2014-06-18 08:26:19  [<ffffffffa08d844a>] ldlm_completion_ast+0x57a/0x960 [ptlrpc]
      2014-06-18 08:26:19  [<ffffffffa08d3b50>] ? ldlm_expired_completion_wait+0x0/0x390 [ptlrpc]
      2014-06-18 08:26:19  [<ffffffff81061f20>] ? default_wake_function+0x0/0x20
      2014-06-18 08:26:19  [<ffffffffa08d7af8>] ldlm_cli_enqueue_local+0x1f8/0x5d0 [ptlrpc]
      2014-06-18 08:26:19  [<ffffffffa08d7ed0>] ? ldlm_completion_ast+0x0/0x960 [ptlrpc]
      2014-06-18 08:26:19  [<ffffffffa0e88ad0>] ? mdt_blocking_ast+0x0/0x2a0 [mdt]
      2014-06-18 08:26:19  [<ffffffffa0e92fef>] mdt_object_lock0+0x33f/0xaf0 [mdt]
      2014-06-18 08:26:19  [<ffffffffa0e88ad0>] ? mdt_blocking_ast+0x0/0x2a0 [mdt]
      2014-06-18 08:26:19  [<ffffffffa08d7ed0>] ? ldlm_completion_ast+0x0/0x960 [ptlrpc]
      2014-06-18 08:26:19  [<ffffffffa0e93864>] mdt_object_lock+0x14/0x20 [mdt]
      2014-06-18 08:26:19  [<ffffffffa0e9ddc9>] mdt_getattr_name_lock+0xe29/0x1990 [mdt]
      2014-06-18 08:26:19  [<ffffffffa0618f62>] ? cfs_hash_lookup+0x82/0xa0 [libcfs]
      2014-06-18 08:26:19  [<ffffffffa090243c>] ? lustre_msg_clear_flags+0x6c/0xc0 [ptlrpc]
      2014-06-18 08:26:19  [<ffffffffa0e9ebcd>] mdt_intent_getattr+0x29d/0x490 [mdt]
      2014-06-18 08:26:19  [<ffffffffa0e8bf4e>] mdt_intent_policy+0x39e/0x720 [mdt]
      2014-06-18 08:26:19  [<ffffffffa08b8911>] ldlm_lock_enqueue+0x361/0x8d0 [ptlrpc]
      2014-06-18 08:26:19  [<ffffffffa08e1a4f>] ldlm_handle_enqueue0+0x4ef/0x10a0 [ptlrpc]
      2014-06-18 08:26:19  [<ffffffffa0e8c3d6>] mdt_enqueue+0x46/0xe0 [mdt]
      2014-06-18 08:26:19  [<ffffffffa0e91827>] mdt_handle_common+0x647/0x16d0 [mdt]
      2014-06-18 08:26:19  [<ffffffffa0eccba5>] mds_regular_handle+0x15/0x20 [mdt]
      2014-06-18 08:26:19  [<ffffffffa09120f8>] ptlrpc_server_handle_request+0x398/0xc60 [ptlrpc]
      2014-06-18 08:26:19  [<ffffffffa060263e>] ? cfs_timer_arm+0xe/0x10 [libcfs]
      2014-06-18 08:26:19  [<ffffffffa0613f6f>] ? lc_watchdog_touch+0x6f/0x170 [libcfs]
      2014-06-18 08:26:19  [<ffffffffa0909459>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc]
      2014-06-18 08:26:19  [<ffffffffa091348e>] ptlrpc_main+0xace/0x1700 [ptlrpc]
      2014-06-18 08:26:19  [<ffffffffa09129c0>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
      2014-06-18 08:26:19  [<ffffffff8100c24a>] child_rip+0xa/0x20
      2014-06-18 08:26:19  [<ffffffffa09129c0>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
      2014-06-18 08:26:19  [<ffffffffa09129c0>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
      2014-06-18 08:26:19  [<ffffffff8100c240>] ? child_rip+0x0/0x20
      

      I also see many "osp-syn" threads like this:

      2014-06-18 08:25:19 osp-syn-36    S 000000000000000a     0 12526      2 0x00000000
      2014-06-18 08:25:19  ffff880fc3325a20 0000000000000046 0000000000000000 ffff880fc3c0e640
      2014-06-18 08:25:19  ffff880fc33259b8 0000000000000001 ffff880fd8156800 ffff880fc33259f0
      2014-06-18 08:25:19  ffff88100f10f058 ffff880fc3325fd8 000000000000fbc8 ffff88100f10f058
      2014-06-18 08:25:19 Call Trace:
      2014-06-18 08:25:19  [<ffffffffa060277e>] cfs_waitq_wait+0xe/0x10 [libcfs]
      2014-06-18 08:25:19  [<ffffffffa0fab0e1>] osp_sync_process_queues+0x251/0x15e0 [osp]
      2014-06-18 08:25:19  [<ffffffff81061f20>] ? default_wake_function+0x0/0x20
      2014-06-18 08:25:19  [<ffffffffa07240fb>] llog_process_thread+0x8fb/0xe00 [obdclass]
      2014-06-18 08:25:19  [<ffffffffa0faae90>] ? osp_sync_process_queues+0x0/0x15e0 [osp]
      2014-06-18 08:25:19  [<ffffffffa0725e0d>] llog_process_or_fork+0x12d/0x660 [obdclass]
      2014-06-18 08:25:19  [<ffffffffa07285fa>] llog_cat_process_cb+0x56a/0x620 [obdclass]
      2014-06-18 08:25:19  [<ffffffffa07240fb>] llog_process_thread+0x8fb/0xe00 [obdclass]
      2014-06-18 08:25:19  [<ffffffffa0728090>] ? llog_cat_process_cb+0x0/0x620 [obdclass]
      2014-06-18 08:25:19  [<ffffffffa0725e0d>] llog_process_or_fork+0x12d/0x660 [obdclass]
      2014-06-18 08:25:19  [<ffffffffa0726ee9>] llog_cat_process_or_fork+0x89/0x350 [obdclass]
      2014-06-18 08:25:19  [<ffffffff810547c9>] ? __wake_up_common+0x59/0x90
      2014-06-18 08:25:19  [<ffffffffa0faae90>] ? osp_sync_process_queues+0x0/0x15e0 [osp]
      2014-06-18 08:25:19  [<ffffffffa07271c9>] llog_cat_process+0x19/0x20 [obdclass]
      2014-06-18 08:25:19  [<ffffffffa06027da>] ? cfs_waitq_signal+0x1a/0x20 [libcfs]
      2014-06-18 08:25:19  [<ffffffffa0faca80>] osp_sync_thread+0x240/0x7e0 [osp]
      2014-06-18 08:25:19  [<ffffffffa0fac840>] ? osp_sync_thread+0x0/0x7e0 [osp]
      2014-06-18 08:25:19  [<ffffffff8100c24a>] child_rip+0xa/0x20
      2014-06-18 08:25:19  [<ffffffffa0fac840>] ? osp_sync_thread+0x0/0x7e0 [osp]
      2014-06-18 08:25:19  [<ffffffffa0fac840>] ? osp_sync_thread+0x0/0x7e0 [osp]
      2014-06-18 08:25:19  [<ffffffff8100c240>] ? child_rip+0x0/0x20
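
      For reference, a minimal sketch of how stack dumps like the above can be gathered; this assumes the kernel log is captured (on this MDS the dumps also go to the serial console, which is where the timestamped copies came from):

      # Ask the kernel to dump the stacks of all tasks into the kernel ring buffer
      echo t > /proc/sysrq-trigger
      # Save a copy of the ring buffer for later inspection
      dmesg > /tmp/sysrq-t.$(date +%s)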
      

      I think this might be similar to, if not the same as, what was reported in LU-4335, which is still unresolved.

      There might be some ongoing network/router issues with one of the client clusters mounting this filesystem. In the console logs of the MDS and the OSSes, I see reconnection messages all starting around the same time and coming from the same client cluster:

                                                                            
      2014-06-17 21:26:33 Lustre: MGS: Client ded91cb5-244e-1441-a4eb-3f24b0b31de4 (at 192.168.115.81@o2ib5) reconnecting
      2014-06-17 21:26:33 Lustre: Skipped 3 previous similar messages                 
      2014-06-17 21:26:37 Lustre: MGS: Client 556f6ca6-d372-c801-a2f7-9c58a2f9d6f2 (at 192.168.112.183@o2ib5) reconnecting
      2014-06-17 21:26:37 Lustre: Skipped 1 previous similar message                  
      2014-06-17 21:26:40 Lustre: lsd-MDT0000: Client bf288f47-9ee8-e9d8-6b45-440e971a9cbd (at 192.168.114.211@o2ib5) reconnecting
      2014-06-17 21:26:52 Lustre: lsd-MDT0000: Client 2a1ffedf-635d-0d05-fc41-7f70e5389d13 (at 192.168.112.180@o2ib5) reconnecting
      2014-06-17 21:26:56 Lustre: MGS: Client c36d44f9-4964-583a-1c78-24a2e5c1ff22 (at 192.168.115.90@o2ib5) reconnecting
      2014-06-17 21:26:56 Lustre: Skipped 5 previous similar messages                 
      2014-06-17 21:27:04 Lustre: lsd-MDT0000: Client 1e97ad41-dd1e-0abd-5ad0-95ba5e774908 (at 192.168.115.193@o2ib5) reconnecting
      2014-06-17 21:27:04 Lustre: Skipped 14 previous similar messages                
      2014-06-17 21:27:22 Lustre: lsd-MDT0000: Client 8270f083-0c43-751e-be7a-a3b66b8a6a32 (at 192.168.113.53@o2ib5) reconnecting
      2014-06-17 21:27:22 Lustre: Skipped 17 previous similar messages                
      2014-06-17 21:27:54 Lustre: MGS: Client c8ef8e8a-30a1-ad69-6d1c-377755d6f220 (at 192.168.119.130@o2ib5) reconnecting
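
      A quick way to confirm that the reconnections are all coming from the same client cluster is to tally the reconnect messages by LNet network. A rough sketch; the console log path is hypothetical:

      # Count "reconnecting" messages per client NID network (e.g. @o2ib5)
      grep ' reconnecting' /var/log/console/pilsner-mds1.log \
          | grep -o '@o2ib[0-9]*' \
          | sort | uniq -c | sort -rn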
      

      From this point on, there are a lot of messages in the console log of the MDS (e.g. LNET service thread stacks, lock timer expirations, LDLM AST messages, evictions, reconnections, etc.), so it's hard to make sense of it all. The OSS nodes are noisy after the initial reconnection messages as well, but much less so than the MDS. The messages seen on the OSSes after the initial event are mainly evictions, reconnections, and I/O errors/timeouts.

      Just to make it clear, the hung stat system call I posted above was not from a client in the cluster making these initial reconnection requests. It was from another client cluster mounting this same filesystem.

      So far, my best guess is that the reconnection requests from the "problem" client cluster have rendered this MDS "useless". I have a vague recollection of this happening to us before, which, if I recall correctly, is why we decided to no longer mount all filesystems on all client clusters.

      Attachments

        Activity

          [LU-5226] MDS idle - 'stat' system call hangs - possible router issues on different client cluster
          prakash Prakash Surya (Inactive) added a comment - - edited

          Was the router fixed long before this?

          I believe so, yes. The routers were "fixed" prior to restarting the MDS, which happened around 11:40 on 6-18 according to the console log (5 hours prior to the message you reference).

          If so, there could be other potential network problems in the system.

          Possibly; I can't really say for certain what the cause might have been, though.

          With that said, after rebooting the MDS, the admins did bring up the problem routers for a brief moment but quickly shut them down again. During that brief uptime, many reconnection requests started to occur (similar to the original incident), which made the admins decide to turn the routers off again and leave them down. I don't know if that's significant, but I figured I would make a note of it here.


          niu Niu Yawei (Inactive) added a comment -

          The last network error messages in the console log are:

          2014-06-18 16:12:20 LustreError: 12501:0:(ldlm_lib.c:2706:target_bulk_io()) @@@ timeout on bulk PUT after 1085+0s  req@ffff880f528f9850 x1471189200184516/t0(0) o256->86d22738-c9b5-d251-8413-6ea54b135fc3@192.168.112.103@o2ib5:0/0 lens 304/240 e 1 to 0 dl 1403133140 ref 1 fl Interpret:/0/0 rc 0/0
          2014-06-18 16:12:20 LustreError: 12501:0:(ldlm_lib.c:2706:target_bulk_io()) Skipped 17 previous similar messages
          

          Was the router fixed long before this? If so, there could be other potential network problems in the system. It looks like the system was back to normal after these error messages (no watchdogs triggered anymore).


          prakash Prakash Surya (Inactive) added a comment -

          Thanks Peter. I've uploaded a tarball to "/uploads/LU-5226/LU-5226.tar.bz". It contains the console log of the MDS in question during the incident, starting around 21:00 on 6/17. I didn't filter out anything unrelated, so it also contains console logs from earlier days, which can be ignored.

          pjones Peter Jones added a comment -

          I'll send you the details privately


          prakash Prakash Surya (Inactive) added a comment -

          How do I upload a file to the FTP server? What's the FTP server's address?


          niu Niu Yawei (Inactive) added a comment -

          I didn't do much digging around after we shut down the routers, so I don't have much information. The best I can probably do, since the issue has passed, is upload the console logs for the MDS for the 24 hours or so surrounding the incident. I did a sysrq-t before we took the routers down, but did not issue one afterwards, so I don't have stacks for the threads after the routers were brought down. Would you like me to upload the MDS console logs?

          Yes, please. Hope we can find something in the console log. Thanks.

          prakash Prakash Surya (Inactive) added a comment - - edited

          From the trace it looks like the MDT threads are waiting on lock enqueues, and those enqueues need to revoke locks from the clients connecting through the "problem" routers. So the enqueue time will be much longer than a normal enqueue, since it has to wait for the clients to be evicted. Because lots of MDT service threads are held up by such lock enqueues, there are few threads (or no threads) available to respond to other incoming requests, and the MDT becomes unresponsive.

          Yes, as far as I can tell, that is what happened.

          But I assume the situation would be alleviated once the clients are evicted; I'm not sure why the MDT is still not responsive even after the 'problem' routers have been shut down for an hour.

          My hunch is that the adaptive timeouts are not being honored everywhere. So, while it should have fixed itself within 10 minutes of shutting down the routers (we set "at_max" to 600 seconds), that doesn't happen in practice. If we had waited "long enough" it probably would have eventually resolved itself, but we didn't want to wait an undefined amount of time on a production system; a reboot was the faster course of action to get the MDS operational again.
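
          For reference, the adaptive-timeout settings in question can be inspected and adjusted with lctl on the servers. A minimal sketch; these are the usual parameter names, but they may differ slightly by version:

          # Inspect the adaptive timeout bounds and the base RPC timeout
          lctl get_param at_max at_min at_history timeout
          # We cap the adaptive timeout at 600 seconds, i.e.:
          lctl set_param at_max=600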

          Did you observe any error messages on MDT after shutting down routers? And what's the stack trace look like after shutting down routers?

          I didn't do much digging around after we shut down the routers, so I don't have much information. The best I can probably do, since the issue has passed, is upload the console logs for the MDS for the 24 hours or so surrounding the incident. I did a sysrq-t before we took the routers down, but did not issue one afterwards, so I don't have stacks for the threads after the routers were brought down. Would you like me to upload the MDS console logs?


          niu Niu Yawei (Inactive) added a comment -

          So, just to make it clear, 3 malfunctioning router nodes brought down the whole filesystem; that is a problem. And the FS wasn't down just for the client cluster connected to the "problem" routers; the FS was unresponsive to all client clusters tested (and more likely all client clusters, period). The fact that this is possible and clearly happens leads me to think the design here is very flawed.

          From the trace it looks like the MDT threads are waiting on lock enqueues, and those enqueues need to revoke locks from the clients connecting through the "problem" routers. So the enqueue time will be much longer than a normal enqueue, since it has to wait for the clients to be evicted. Because lots of MDT service threads are held up by such lock enqueues, there are few threads (or no threads) available to respond to other incoming requests, and the MDT becomes unresponsive.

          But I assume the situation would be alleviated once the clients are evicted; I'm not sure why the MDT is still not responsive even after the 'problem' routers have been shut down for an hour. Did you observe any error messages on the MDT after shutting down the routers? And what does the stack trace look like after shutting down the routers? Thank you.

          pjones Peter Jones added a comment -

          Niu is looking into this one

          prakash Prakash Surya (Inactive) added a comment - - edited

          OK, the filesystem is back in an operational state; here's the synopsis:

          • Around 21:00 yesterday, reconnection messages appear in the server logs
          • Around 23:00 yesterday users report "issues"
          • Sometime this morning I diagnose this as "network" related
          • We find 3 "problem" router nodes connected to a particular client cluster and shut those down
          • An hour or so after shutting down the routers, the MDS hasn't "recovered" yet
          • We reboot the MDS, it comes back up, starts lustre, and things seem to be back to normal ("problem" routers still down)

          So, just to make it clear, 3 malfunctioning router nodes brought down the whole filesystem; that is a problem. And the FS wasn't down just for the client cluster connected to the "problem" routers; the FS was unresponsive to all client clusters tested (and more likely all client clusters, period). The fact that this is possible and clearly happens leads me to think the design here is very flawed.

          From what I can gather, this appears to be due to all of the threads on the MDS waiting for network traffic they're not going to receive (because of the malfunctioning routers), which I believe is a known issue. I'd mark this a blocker, but I honestly don't think that would help resolve the issue. It seems like a design flaw (or lack of design) that might take quite some time to address properly.
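
          To quantify how many MDT service threads are tied up waiting like this, something along the following lines could be run on the MDS. A rough sketch; the parameter and thread names are assumptions based on the stacks above:

          # Configured vs. started MDT service threads
          lctl get_param mds.MDS.mdt.threads_max mds.MDS.mdt.threads_started
          # Top frame of every mdt thread's kernel stack; a count dominated by
          # cfs_waitq_wait/ldlm_completion_ast means most handlers are stuck in
          # lock enqueue
          for pid in $(pgrep '^mdt[0-9]'); do
              head -n1 /proc/$pid/stack
          done | sort | uniq -c | sort -rn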


          prakash Prakash Surya (Inactive) added a comment -

          And here's a better view of the "mdt" stack traces, using crash:

          PID: 13017  TASK: ffff881fc4e7c040  CPU: 9   COMMAND: "mdt02_019"
           #0 [ffff881fc0dbb830] schedule at ffffffff8152a940
           #1 [ffff881fc0dbb8f8] cfs_waitq_wait at ffffffffa060277e [libcfs]
           #2 [ffff881fc0dbb908] ldlm_completion_ast at ffffffffa08d844a [ptlrpc]
           #3 [ffff881fc0dbb9b8] ldlm_cli_enqueue_local at ffffffffa08d7af8 [ptlrpc]
           #4 [ffff881fc0dbba48] mdt_object_lock0 at ffffffffa0e92fef [mdt]
           #5 [ffff881fc0dbbaf8] mdt_object_lock at ffffffffa0e93864 [mdt]
           #6 [ffff881fc0dbbb08] mdt_getattr_name_lock at ffffffffa0e9ddc9 [mdt]
           #7 [ffff881fc0dbbbc8] mdt_intent_getattr at ffffffffa0e9ebcd [mdt]
           #8 [ffff881fc0dbbc28] mdt_intent_policy at ffffffffa0e8bf4e [mdt]
           #9 [ffff881fc0dbbc68] ldlm_lock_enqueue at ffffffffa08b8911 [ptlrpc]
          #10 [ffff881fc0dbbcc8] ldlm_handle_enqueue0 at ffffffffa08e1a4f [ptlrpc]
          #11 [ffff881fc0dbbd38] mdt_enqueue at ffffffffa0e8c3d6 [mdt]
          #12 [ffff881fc0dbbd58] mdt_handle_common at ffffffffa0e91827 [mdt]
          #13 [ffff881fc0dbbda8] mds_regular_handle at ffffffffa0eccba5 [mdt]
          #14 [ffff881fc0dbbdb8] ptlrpc_server_handle_request at ffffffffa09120f8 [ptlrpc]
          #15 [ffff881fc0dbbeb8] ptlrpc_main at ffffffffa091348e [ptlrpc]
          #16 [ffff881fc0dbbf48] kernel_thread at ffffffff8100c24a
          
          PID: 13018  TASK: ffff882025ed9500  CPU: 10  COMMAND: "mdt02_020"
           #0 [ffff881fc32fb940] schedule at ffffffff8152a940
           #1 [ffff881fc32fba08] cfs_waitq_wait at ffffffffa060277e [libcfs]
           #2 [ffff881fc32fba18] ldlm_completion_ast at ffffffffa08d844a [ptlrpc]
           #3 [ffff881fc32fbac8] ldlm_cli_enqueue_local at ffffffffa08d7af8 [ptlrpc]
           #4 [ffff881fc32fbb58] mdt_object_lock0 at ffffffffa0e92fef [mdt]
           #5 [ffff881fc32fbc08] mdt_object_lock at ffffffffa0e93864 [mdt]
           #6 [ffff881fc32fbc18] mdt_attr_set at ffffffffa0eaeabf [mdt]
           #7 [ffff881fc32fbc68] mdt_reint_setattr at ffffffffa0eaf54d [mdt]
           #8 [ffff881fc32fbcd8] mdt_reint_rec at ffffffffa0ea8e71 [mdt]
           #9 [ffff881fc32fbcf8] mdt_reint_internal at ffffffffa0e8db13 [mdt]
          #10 [ffff881fc32fbd38] mdt_reint at ffffffffa0e8de14 [mdt]
          #11 [ffff881fc32fbd58] mdt_handle_common at ffffffffa0e91827 [mdt]
          #12 [ffff881fc32fbda8] mds_regular_handle at ffffffffa0eccba5 [mdt]
          #13 [ffff881fc32fbdb8] ptlrpc_server_handle_request at ffffffffa09120f8 [ptlrpc]
          #14 [ffff881fc32fbeb8] ptlrpc_main at ffffffffa091348e [ptlrpc]
          #15 [ffff881fc32fbf48] kernel_thread at ffffffff8100c24a
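
          For completeness, a sketch of how traces like these can be pulled from a live system with crash; the vmlinux path is illustrative and the debuginfo must match the running kernel:

          # Run crash against the live kernel
          crash /usr/lib/debug/lib/modules/$(uname -r)/vmlinux /proc/kcore
          # Inside crash: back-trace every task and save the output for filtering
          crash> foreach bt > /tmp/all-stacks.txt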
          

          People

            Assignee: Niu Yawei (Inactive)
            Reporter: Prakash Surya (Inactive)
            Votes: 0
            Watchers: 10
