  Lustre / LU-5226

MDS idle - 'stat' system call hangs - possible router issues on different client cluster

Details

    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Critical
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.4.2
    • 3
    • 14564

    Description

      I'm seeing an issue on one of our production Lustre filesystems where the stat system call is not returning, but the MDS of the filesystem in question appears to be nearly idle.

      $ mount
      ...
      pilsner-mds1-lnet0@o2ib100:/lsd on /p/lscratchd type lustre (rw,nosuid,noauto,_netdev,lazystatfs,flock)
      $ strace -f -tt /bin/ls /p/lscratchd/surya1
      ...
      09:39:52.854819 stat("/p/lscratchd/surya1", 
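
      (For a bounded version of this check that doesn't wedge the probing shell along with the hung call, something like the following works; just a sketch using coreutils' timeout against the same path:)

      # Probe the suspect path with an upper bound; GNU timeout exits with
      # status 124 if stat() has not returned within 30 seconds.
      timeout 30 stat /p/lscratchd/surya1 > /dev/null
      [ $? -eq 124 ] && echo "stat() on /p/lscratchd/surya1 still hung after 30s"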
      

      Looking at the CPU utilization on the MDS, there isn't a whole lot going on:

      # perf record -a sleep 5
      # perf report --stdio --symfs=/ 2>/dev/null | head -n 40 | tail -n 20
      #
      # Overhead          Command      Shared Object                                      Symbol
      # ........  ...............  .................  ..........................................
      #
          25.06%             init  [kernel.kallsyms]  [k] intel_idle                            
           1.84%          swapper  [kernel.kallsyms]  [k] intel_idle                            
           1.31%             init  [kernel.kallsyms]  [k] getnstimeofday                        
           1.18%             init  [kernel.kallsyms]  [k] schedule                              
           1.17%             init  [kernel.kallsyms]  [k] menu_select                           
           1.06%             init  [kernel.kallsyms]  [k] tick_nohz_stop_sched_tick             
           0.81%            snmpd  [kernel.kallsyms]  [k] diskstats_show                        
           0.80%             init  [kernel.kallsyms]  [k] select_nohz_load_balancer             
           0.78%             init  [kernel.kallsyms]  [k] fn_hash_lookup                        
           0.77%             init  [kernel.kallsyms]  [k] cpuidle_idle_call                     
           0.61%             init  [ib_ipoib]         [k] ipoib_ib_handle_rx_wc                 
           0.59%  kiblnd_sd_02_00  [kernel.kallsyms]  [k] find_busiest_group                    
           0.59%             init  [kernel.kallsyms]  [k] __remove_hrtimer                      
           0.52%             init  [kernel.kallsyms]  [k] rb_next                               
           0.51%             init  [kernel.kallsyms]  [k] try_to_wake_up                        
           0.51%  kiblnd_sd_02_01  [kernel.kallsyms]  [k] find_busiest_group
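
      (A quick way to see how much of that CPU time, if any, is going to Lustre code is to filter the same report for the Lustre modules; a sketch:)

      # Restrict the report to samples landing in Lustre-related modules
      perf report --stdio --symfs=/ 2>/dev/null | grep -E 'mdt|ptlrpc|ldlm|libcfs|osd'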
      

      There's not a whole lot of ARC activity or disk activity on the MDS either:

      # timeout 10 arcstat.py 1
          time  read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  arcsz     c  
      09:55:28     0     0      0     0    0     0    0     0    0   3.9G  1.1G  
      09:55:29     0     0      0     0    0     0    0     0    0   3.9G  1.1G  
      09:55:30     6     6    100     6  100     0    0     5  100   3.9G  1.1G  
      09:55:31     0     0      0     0    0     0    0     0    0   3.9G  1.1G  
      09:55:32     0     0      0     0    0     0    0     0    0   3.9G  1.1G  
      09:55:33     0     0      0     0    0     0    0     0    0   3.9G  1.1G  
      09:55:34     0     0      0     0    0     0    0     0    0   3.9G  1.1G  
      09:55:35     0     0      0     0    0     0    0     0    0   3.9G  1.1G  
      09:55:36     0     0      0     0    0     0    0     0    0   3.9G  1.1G  
      09:55:37     0     0      0     0    0     0    0     0    0   3.9G  1.1G
      
      # timeout 5 iostat -xyz 1
      Linux 2.6.32-431.17.2.1chaos.ch5.2.x86_64 (pilsner-mds1)        06/18/14        _x86_64_        (16 CPU)
      
      avg-cpu:  %user   %nice %system %iowait  %steal   %idle
                 0.00    0.00    0.13    0.00    0.00   99.87
      
      Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
      sde               0.00     0.00    1.00    0.00     4.00     0.00     4.00     0.00    0.00   0.00   0.00
      sdh               0.00     0.00    1.00    0.00     4.00     0.00     4.00     0.00    0.00   0.00   0.00
      sdl               0.00     0.00    1.00    0.00     4.00     0.00     4.00     0.00    0.00   0.00   0.00
      sdan              0.00     0.00    1.00    0.00     4.00     0.00     4.00     0.00    0.00   0.00   0.00
      
      avg-cpu:  %user   %nice %system %iowait  %steal   %idle
                 0.00    0.00    0.25    0.00    0.00   99.75
      
      Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
      sdk               0.00     0.00    4.00    0.00   262.00     0.00    65.50     0.00    0.75   0.75   0.30
      sdc               0.00     0.00    1.00    0.00     4.00     0.00     4.00     0.00    0.00   0.00   0.00
      sdm               0.00     0.00    1.00    0.00     4.00     0.00     4.00     0.00    0.00   0.00   0.00
      sdo               0.00     0.00    1.00    0.00     4.00     0.00     4.00     0.00    0.00   0.00   0.00
      sdaf              0.00     0.00    5.00    0.00   266.00     0.00    53.20     0.00    0.20   0.20   0.10
      dm-0              0.00     0.00    9.00    0.00   528.00     0.00    58.67     0.00    0.44   0.44   0.40
      
      avg-cpu:  %user   %nice %system %iowait  %steal   %idle
                 0.00    0.00    0.31    0.00    0.00   99.69
      
      Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
      sds               0.00     0.00    0.99    0.00     3.96     0.00     4.00     0.00    0.00   0.00   0.00
      sdt               0.00     0.00    0.99    0.00     3.96     0.00     4.00     0.00    0.00   0.00   0.00
      sdu               0.00     0.00    0.99    0.00     3.96     0.00     4.00     0.00    0.00   0.00   0.00
      
      avg-cpu:  %user   %nice %system %iowait  %steal   %idle
                 0.00    0.00    0.19    0.00    0.00   99.81
      
      Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
      sdb               0.00     0.00    1.00    0.00     4.00     0.00     4.00     0.00    0.00   0.00   0.00
      sdp               0.00     0.00    1.00    0.00     4.00     0.00     4.00     0.00    0.00   0.00   0.00
      sdv               0.00     0.00    1.00    0.00     4.00     0.00     4.00     0.00    1.00   1.00   0.10
      sdah              0.00     0.00    1.00    0.00     4.00     0.00     4.00     0.00    0.00   0.00   0.00
      sdaj              0.00     0.00    1.00    0.00     4.00     0.00     4.00     0.00    0.00   0.00   0.00
      

      I dumped the stacks on the MDS using sysrq-t and see many of the "mdt" threads (maybe all, I haven't done an exhaustive check) like the following:

      2014-06-18 08:26:47 mdt02_090     S 000000000000000a     0 13594      2 0x00000000
      2014-06-18 08:26:47  ffff881f325d7a00 0000000000000046 0000000000000000 ffff881f325d79c4
      2014-06-18 08:26:47  0000000000000000 ffff88103fe83200 ffff880060636840 00000000000003ff
      2014-06-18 08:26:47  ffff881f32811058 ffff881f325d7fd8 000000000000fbc8 ffff881f32811058
      2014-06-18 08:26:47 Call Trace:
      2014-06-18 08:26:47  [<ffffffffa060277e>] cfs_waitq_wait+0xe/0x10 [libcfs]
      2014-06-18 08:26:47  [<ffffffffa08d844a>] ldlm_completion_ast+0x57a/0x960 [ptlrpc]
      2014-06-18 08:26:47  [<ffffffffa08d3b50>] ? ldlm_expired_completion_wait+0x0/0x390 [ptlrpc]
      2014-06-18 08:26:47  [<ffffffff81061f20>] ? default_wake_function+0x0/0x20
      2014-06-18 08:26:47  [<ffffffffa08d7af8>] ldlm_cli_enqueue_local+0x1f8/0x5d0 [ptlrpc]
      2014-06-18 08:26:47  [<ffffffffa08d7ed0>] ? ldlm_completion_ast+0x0/0x960 [ptlrpc]
      2014-06-18 08:26:47  [<ffffffffa0e88ad0>] ? mdt_blocking_ast+0x0/0x2a0 [mdt]
      2014-06-18 08:26:47  [<ffffffffa0e92fef>] mdt_object_lock0+0x33f/0xaf0 [mdt]
      2014-06-18 08:26:47  [<ffffffffa0e88ad0>] ? mdt_blocking_ast+0x0/0x2a0 [mdt]
      2014-06-18 08:26:47  [<ffffffffa08d7ed0>] ? ldlm_completion_ast+0x0/0x960 [ptlrpc]
      2014-06-18 08:26:47  [<ffffffffa0e93864>] mdt_object_lock+0x14/0x20 [mdt]
      2014-06-18 08:26:47  [<ffffffffa0eaeabf>] mdt_attr_set+0x8f/0x560 [mdt]
      2014-06-18 08:26:47  [<ffffffffa0eaf54d>] mdt_reint_setattr+0x5bd/0xcf0 [mdt]
      2014-06-18 08:26:47  [<ffffffffa0901e7e>] ? lustre_pack_reply_flags+0xae/0x1f0 [ptlrpc]
      2014-06-18 08:26:47  [<ffffffffa0ea8e71>] mdt_reint_rec+0x41/0xe0 [mdt]
      2014-06-18 08:26:47  [<ffffffffa0e8db13>] mdt_reint_internal+0x4c3/0x780 [mdt]
      2014-06-18 08:26:47  [<ffffffffa0e8de14>] mdt_reint+0x44/0xe0 [mdt]
      2014-06-18 08:26:47  [<ffffffffa0e91827>] mdt_handle_common+0x647/0x16d0 [mdt]
      2014-06-18 08:26:47  [<ffffffffa0eccba5>] mds_regular_handle+0x15/0x20 [mdt]
      2014-06-18 08:26:47  [<ffffffffa09120f8>] ptlrpc_server_handle_request+0x398/0xc60 [ptlrpc]
      2014-06-18 08:26:47  [<ffffffffa060263e>] ? cfs_timer_arm+0xe/0x10 [libcfs]
      2014-06-18 08:26:47  [<ffffffffa0613f6f>] ? lc_watchdog_touch+0x6f/0x170 [libcfs]
      2014-06-18 08:26:47  [<ffffffffa0909459>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc]
      2014-06-18 08:26:47  [<ffffffffa091348e>] ptlrpc_main+0xace/0x1700 [ptlrpc]
      2014-06-18 08:26:47  [<ffffffffa09129c0>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
      2014-06-18 08:26:47  [<ffffffff8100c24a>] child_rip+0xa/0x20
      2014-06-18 08:26:47  [<ffffffffa09129c0>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
      2014-06-18 08:26:47  [<ffffffffa09129c0>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
      2014-06-18 08:26:47  [<ffffffff8100c240>] ? child_rip+0x0/0x20
      
      
      2014-06-18 08:26:19 mdt03_068     S 000000000000000f     0 13280      2 0x00000000
      2014-06-18 08:26:19  ffff881f4938b8f0 0000000000000046 0000000000000000 ffff881f4938b940
      2014-06-18 08:26:19  0000000000000001 ffff881f49d42040 ffff881f4938b8c0 ffffffffa06124a1
      2014-06-18 08:26:19  ffff881f49d425f8 ffff881f4938bfd8 000000000000fbc8 ffff881f49d425f8
      2014-06-18 08:26:19 Call Trace:
      2014-06-18 08:26:19  [<ffffffffa06124a1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
      2014-06-18 08:26:19  [<ffffffffa060277e>] cfs_waitq_wait+0xe/0x10 [libcfs]
      2014-06-18 08:26:19  [<ffffffffa08d844a>] ldlm_completion_ast+0x57a/0x960 [ptlrpc]
      2014-06-18 08:26:19  [<ffffffffa08d3b50>] ? ldlm_expired_completion_wait+0x0/0x390 [ptlrpc]
      2014-06-18 08:26:19  [<ffffffff81061f20>] ? default_wake_function+0x0/0x20
      2014-06-18 08:26:19  [<ffffffffa08d7af8>] ldlm_cli_enqueue_local+0x1f8/0x5d0 [ptlrpc]
      2014-06-18 08:26:19  [<ffffffffa08d7ed0>] ? ldlm_completion_ast+0x0/0x960 [ptlrpc]
      2014-06-18 08:26:19  [<ffffffffa0e88ad0>] ? mdt_blocking_ast+0x0/0x2a0 [mdt]
      2014-06-18 08:26:19  [<ffffffffa0e92fef>] mdt_object_lock0+0x33f/0xaf0 [mdt]
      2014-06-18 08:26:19  [<ffffffffa0e88ad0>] ? mdt_blocking_ast+0x0/0x2a0 [mdt]
      2014-06-18 08:26:19  [<ffffffffa08d7ed0>] ? ldlm_completion_ast+0x0/0x960 [ptlrpc]
      2014-06-18 08:26:19  [<ffffffffa0e93864>] mdt_object_lock+0x14/0x20 [mdt]
      2014-06-18 08:26:19  [<ffffffffa0e9ddc9>] mdt_getattr_name_lock+0xe29/0x1990 [mdt]
      2014-06-18 08:26:19  [<ffffffffa0618f62>] ? cfs_hash_lookup+0x82/0xa0 [libcfs]
      2014-06-18 08:26:19  [<ffffffffa090243c>] ? lustre_msg_clear_flags+0x6c/0xc0 [ptlrpc]
      2014-06-18 08:26:19  [<ffffffffa0e9ebcd>] mdt_intent_getattr+0x29d/0x490 [mdt]
      2014-06-18 08:26:19  [<ffffffffa0e8bf4e>] mdt_intent_policy+0x39e/0x720 [mdt]
      2014-06-18 08:26:19  [<ffffffffa08b8911>] ldlm_lock_enqueue+0x361/0x8d0 [ptlrpc]
      2014-06-18 08:26:19  [<ffffffffa08e1a4f>] ldlm_handle_enqueue0+0x4ef/0x10a0 [ptlrpc]
      2014-06-18 08:26:19  [<ffffffffa0e8c3d6>] mdt_enqueue+0x46/0xe0 [mdt]
      2014-06-18 08:26:19  [<ffffffffa0e91827>] mdt_handle_common+0x647/0x16d0 [mdt]
      2014-06-18 08:26:19  [<ffffffffa0eccba5>] mds_regular_handle+0x15/0x20 [mdt]
      2014-06-18 08:26:19  [<ffffffffa09120f8>] ptlrpc_server_handle_request+0x398/0xc60 [ptlrpc]
      2014-06-18 08:26:19  [<ffffffffa060263e>] ? cfs_timer_arm+0xe/0x10 [libcfs]
      2014-06-18 08:26:19  [<ffffffffa0613f6f>] ? lc_watchdog_touch+0x6f/0x170 [libcfs]
      2014-06-18 08:26:19  [<ffffffffa0909459>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc]
      2014-06-18 08:26:19  [<ffffffffa091348e>] ptlrpc_main+0xace/0x1700 [ptlrpc]
      2014-06-18 08:26:19  [<ffffffffa09129c0>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
      2014-06-18 08:26:19  [<ffffffff8100c24a>] child_rip+0xa/0x20
      2014-06-18 08:26:19  [<ffffffffa09129c0>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
      2014-06-18 08:26:19  [<ffffffffa09129c0>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
      2014-06-18 08:26:19  [<ffffffff8100c240>] ? child_rip+0x0/0x20
      

      I also see many "osp-syn" threads like this:

      2014-06-18 08:25:19 osp-syn-36    S 000000000000000a     0 12526      2 0x00000000
      2014-06-18 08:25:19  ffff880fc3325a20 0000000000000046 0000000000000000 ffff880fc3c0e640
      2014-06-18 08:25:19  ffff880fc33259b8 0000000000000001 ffff880fd8156800 ffff880fc33259f0
      2014-06-18 08:25:19  ffff88100f10f058 ffff880fc3325fd8 000000000000fbc8 ffff88100f10f058
      2014-06-18 08:25:19 Call Trace:
      2014-06-18 08:25:19  [<ffffffffa060277e>] cfs_waitq_wait+0xe/0x10 [libcfs]
      2014-06-18 08:25:19  [<ffffffffa0fab0e1>] osp_sync_process_queues+0x251/0x15e0 [osp]
      2014-06-18 08:25:19  [<ffffffff81061f20>] ? default_wake_function+0x0/0x20
      2014-06-18 08:25:19  [<ffffffffa07240fb>] llog_process_thread+0x8fb/0xe00 [obdclass]
      2014-06-18 08:25:19  [<ffffffffa0faae90>] ? osp_sync_process_queues+0x0/0x15e0 [osp]
      2014-06-18 08:25:19  [<ffffffffa0725e0d>] llog_process_or_fork+0x12d/0x660 [obdclass]
      2014-06-18 08:25:19  [<ffffffffa07285fa>] llog_cat_process_cb+0x56a/0x620 [obdclass]
      2014-06-18 08:25:19  [<ffffffffa07240fb>] llog_process_thread+0x8fb/0xe00 [obdclass]
      2014-06-18 08:25:19  [<ffffffffa0728090>] ? llog_cat_process_cb+0x0/0x620 [obdclass]
      2014-06-18 08:25:19  [<ffffffffa0725e0d>] llog_process_or_fork+0x12d/0x660 [obdclass]
      2014-06-18 08:25:19  [<ffffffffa0726ee9>] llog_cat_process_or_fork+0x89/0x350 [obdclass]
      2014-06-18 08:25:19  [<ffffffff810547c9>] ? __wake_up_common+0x59/0x90
      2014-06-18 08:25:19  [<ffffffffa0faae90>] ? osp_sync_process_queues+0x0/0x15e0 [osp]
      2014-06-18 08:25:19  [<ffffffffa07271c9>] llog_cat_process+0x19/0x20 [obdclass]
      2014-06-18 08:25:19  [<ffffffffa06027da>] ? cfs_waitq_signal+0x1a/0x20 [libcfs]
      2014-06-18 08:25:19  [<ffffffffa0faca80>] osp_sync_thread+0x240/0x7e0 [osp]
      2014-06-18 08:25:19  [<ffffffffa0fac840>] ? osp_sync_thread+0x0/0x7e0 [osp]
      2014-06-18 08:25:19  [<ffffffff8100c24a>] child_rip+0xa/0x20
      2014-06-18 08:25:19  [<ffffffffa0fac840>] ? osp_sync_thread+0x0/0x7e0 [osp]
      2014-06-18 08:25:19  [<ffffffffa0fac840>] ? osp_sync_thread+0x0/0x7e0 [osp]
      2014-06-18 08:25:19  [<ffffffff8100c240>] ? child_rip+0x0/0x20
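
      (For reference, the stacks above were collected by triggering sysrq-t on the MDS; a rough sketch of that procedure, assuming magic SysRq is built in and enabled:)

      # 't' dumps the state and stack of every task to the kernel log
      echo 1 > /proc/sys/kernel/sysrq     # enable all SysRq functions, if not already enabled
      echo t > /proc/sysrq-trigger
      dmesg | grep -A 35 -E 'mdt[0-9]+_[0-9]+|osp-syn'   # pull out the mdt/osp-syn service threads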
      

      I think this might be similar to, if not the same as, what was reported in LU-4335, which is still unresolved.

      There might be some ongoing network/router issues with one of the client clusters mounting this filesystem. In the console logs of the MDS and the OSSes I see reconnection messages, all starting around the same time and coming from the same client cluster:

                                                                            
      2014-06-17 21:26:33 Lustre: MGS: Client ded91cb5-244e-1441-a4eb-3f24b0b31de4 (at 192.168.115.81@o2ib5) reconnecting
      2014-06-17 21:26:33 Lustre: Skipped 3 previous similar messages                 
      2014-06-17 21:26:37 Lustre: MGS: Client 556f6ca6-d372-c801-a2f7-9c58a2f9d6f2 (at 192.168.112.183@o2ib5) reconnecting
      2014-06-17 21:26:37 Lustre: Skipped 1 previous similar message                  
      2014-06-17 21:26:40 Lustre: lsd-MDT0000: Client bf288f47-9ee8-e9d8-6b45-440e971a9cbd (at 192.168.114.211@o2ib5) reconnecting
      2014-06-17 21:26:52 Lustre: lsd-MDT0000: Client 2a1ffedf-635d-0d05-fc41-7f70e5389d13 (at 192.168.112.180@o2ib5) reconnecting
      2014-06-17 21:26:56 Lustre: MGS: Client c36d44f9-4964-583a-1c78-24a2e5c1ff22 (at 192.168.115.90@o2ib5) reconnecting
      2014-06-17 21:26:56 Lustre: Skipped 5 previous similar messages                 
      2014-06-17 21:27:04 Lustre: lsd-MDT0000: Client 1e97ad41-dd1e-0abd-5ad0-95ba5e774908 (at 192.168.115.193@o2ib5) reconnecting
      2014-06-17 21:27:04 Lustre: Skipped 14 previous similar messages                
      2014-06-17 21:27:22 Lustre: lsd-MDT0000: Client 8270f083-0c43-751e-be7a-a3b66b8a6a32 (at 192.168.113.53@o2ib5) reconnecting
      2014-06-17 21:27:22 Lustre: Skipped 17 previous similar messages                
      2014-06-17 21:27:54 Lustre: MGS: Client c8ef8e8a-30a1-ad69-6d1c-377755d6f220 (at 192.168.119.130@o2ib5) reconnecting
      

      From this point on, there are a lot of messages in the console log of the MDS (e.g. LNET service thread stacks, lock timer expirations, LDLM AST messages, evictions, reconnections, etc.), so it's hard to make sense of it all. The OSS nodes are noisy after the initial reconnection messages as well, but much less so than the MDS. The messages seen on the OSSes after the initial event are mainly evictions, reconnections, and IO errors/timeouts.
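
      (To get a rough breakdown of that noise, the console messages can be bucketed by reconnecting NID and by the file:line tag Lustre prints; a sketch, with mds-console.log standing in for wherever the console log is captured:)

      # Which client NIDs are reconnecting, and how often
      grep 'reconnecting' mds-console.log | grep -oE '[0-9.]+@o2ib[0-9]+' | sort | uniq -c | sort -rn | head
      # Rough histogram of LustreError/LNetError sources in the same log
      grep -oE '\([a-z0-9_]+\.c:[0-9]+:[a-z0-9_]+\(\)\)' mds-console.log | sort | uniq -c | sort -rn | head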

      Just to make it clear, the hung stat system call I posted above was not from a client in the cluster making these initial reconnection requests. It was from another client cluster mounting this same filesystem.

      So far, my best guess is that the reconnection requests from the "problem" client cluster have rendered this MDS "useless". I have a vague recollection of this happening to us before, which, if I recall correctly, is why we decided to no longer mount all filesystems on all client clusters.

      Attachments

        Activity


          morrone Christopher Morrone (Inactive) added a comment -

          If you want to mark it "won't fix" or "cannot reproduce" since it is so old, that is fine with me.

          adilger Andreas Dilger added a comment -

          The referenced bug LU-4335 was fixed with the upgrade to 2.4.3. Is this still an issue or can it be closed?

          niu Niu Yawei (Inactive) added a comment -

          It looks like the network wasn't stable even after the routers were shut down; I see lots of error messages from 11:40 ~ 16:12 like:

          2014-06-18 14:15:35 LNetError: 7946:0:(o2iblnd_cb.c:2635:kiblnd_rejected()) 172.19.1.22@o2ib100 rejected: o2iblnd fatal error
          2014-06-18 14:15:36 LNetError: 7946:0:(o2iblnd_cb.c:2635:kiblnd_rejected()) 172.19.1.22@o2ib100 rejected: o2iblnd fatal error
          2014-06-18 14:15:36 LNetError: 7946:0:(o2iblnd_cb.c:2635:kiblnd_rejected()) Skipped 7 previous similar messages
          2014-06-18 14:15:52 LustreError: 7948:0:(events.c:450:server_bulk_callback()) event type 5, status -113, desc ffff881f6b6da240
          2014-06-18 14:15:52 LustreError: 7948:0:(events.c:450:server_bulk_callback()) event type 5, status -113, desc ffff881ea7b36500
          2014-06-18 14:15:58 LustreError: 7948:0:(events.c:450:server_bulk_callback()) event type 5, status -113, desc ffff880d9283ec80
          

          I presume the admins performed some other actions to finally fix the network problems?

          prakash Prakash Surya (Inactive) added a comment - edited

          Was the router fixed long before this?

          I believe so, yes. The routers were "fixed" prior to restarting the MDS, which happened around 11:40 on 6-18 according to the console log (5 hours prior to the message you reference).

          If so, there could be other potential network problems in the system.

          Possibly; I can't really say for certain what the cause might have been, though.

          With that said, after rebooting the MDS, the admins did bring up the problem routers for a brief moment but quickly shut them down again. During that brief window of uptime, many reconnection requests started to occur (similar to the original incident), which made the admins decide to turn the routers off again and leave them down. I don't know if that's significant, but I figured I would make a note of it here.


          niu Niu Yawei (Inactive) added a comment -

          The last network error messages in the console log are:

          2014-06-18 16:12:20 LustreError: 12501:0:(ldlm_lib.c:2706:target_bulk_io()) @@@ timeout on bulk PUT after 1085+0s  req@ffff880f528f9850 x1471189200184516/t0(0) o256->86d22738-c9b5-d251-8413-6ea54b135fc3@192.168.112.103@o2ib5:0/0 lens 304/240 e 1 to 0 dl 1403133140 ref 1 fl Interpret:/0/0 rc 0/0
          2014-06-18 16:12:20 LustreError: 12501:0:(ldlm_lib.c:2706:target_bulk_io()) Skipped 17 previous similar messages
          

          Was the router fixed long before this? If so, there could be other potential network problems in the system. It looks like the system was back to normal after these error messages (no more watchdog triggers).


          prakash Prakash Surya (Inactive) added a comment -

          Thanks Peter. I've uploaded a tarball to "/uploads/LU-5226/LU-5226.tar.bz". It contains the console log of the MDS in question during the incident starting around 21:00 on 6/17. I didn't filter out anything unrelated, so it also contains the console log for older days that can be ignored.

          pjones Peter Jones added a comment -

          I'll send you the details privately


          prakash Prakash Surya (Inactive) added a comment -

          How do I upload a file to the FTP server? What's the FTP server's address?


          niu Niu Yawei (Inactive) added a comment -

          I didn't do much digging around after we shut down the routers, so I don't have much information. The best I can probably do, since the issue has passed, is upload the console logs for the MDS for the 24 hours or so surrounding the incident. I did a sysrq-t before we took the routers down, but did not issue one afterwards, so I don't have stacks for the threads after the routers were brought down. Would you like me to upload the MDS console logs?

          Yes, please. Hope we can find something in the console log. Thanks.

          prakash Prakash Surya (Inactive) added a comment - edited

          From the trace it looks like the MDT threads are waiting on lock enqueues, and the enqueues need to revoke locks from the clients connecting through the "problem" routers. So the enqueue time will be much longer than a normal enqueue, since it has to wait for the clients to be evicted. Because lots of MDT service threads are held by such lock enqueues, there are only a few threads (or none) available to respond to other incoming requests, and the MDT becomes unresponsive.

          Yes, as far as I can tell, that is what happened.

          But I assume the situation would be alleviated once the clients are evicted; I'm not sure why the MDT was still not responsive even after the 'problem' routers had been shut down for an hour.

          My hunch is that the adaptive timeouts are not being honored everywhere. So, while it should have fixed itself within 10 minutes of shutting down the routers (we set "at_max" to 600 seconds), that doesn't happen in practice. If we had waited "long enough" it probably would have eventually resolved itself, but we didn't want to wait an undefined amount of time on a production system; a reboot was the faster course of action to get the MDS operational again.
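
          (For what it's worth, the value in effect can be double-checked on the servers; a rough sketch using lctl, assuming the standard adaptive-timeout tunable names:)

          # Show the adaptive-timeout tunables currently in effect on this node
          lctl get_param at_max at_min at_history timeout
          # Pin at_max to 600 seconds (the value we run in production)
          lctl set_param at_max=600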

          Did you observe any error messages on the MDT after shutting down the routers? And what do the stack traces look like after shutting down the routers?

          I didn't do much digging around after we shut down the routers, so I don't have much information. The best I can probably do, since the issue has passed, is upload the console logs for the MDS for the 24 hours or so surrounding the incident. I did a sysrq-t before we took the routers down, but did not issue one afterwards, so I don't have stacks for the threads after the routers were brought down. Would you like me to upload the MDS console logs?


          People

            Assignee: niu Niu Yawei (Inactive)
            Reporter: prakash Prakash Surya (Inactive)
            Votes: 0
            Watchers: 10
