LU-11936

High ldlm load, slow/unusable filesystem

Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Blocker
    • None
    • Affects Version/s: Lustre 2.12.0
    • None
    • Environment: CentOS 7.6, Lustre 2.12.0, 3.10.0-957.1.3.el7_lustre.x86_64
    • Severity: 3

    Description

      We upgraded our cluster Sherlock to Lustre 2.12 and put Fir (our Lustre 2.12 servers) into production yesterday, but this morning the filesystem is unusable due to a very high load of ldlm threads on the MDS servers.

      I can see plenty of these on the MDS:

      [Wed Feb  6 09:19:25 2019][1695265.110496] LNet: Service thread pid 35530 was inactive for 350.85s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      [Wed Feb  6 09:19:25 2019][1695265.127609] Pid: 35530, comm: mdt02_024 3.10.0-957.1.3.el7_lustre.x86_64 #1 SMP Fri Dec 7 14:50:35 PST 2018
      [Wed Feb  6 09:19:25 2019][1695265.137522] Call Trace:
      [Wed Feb  6 09:19:25 2019][1695265.140184]  [<ffffffffc0e3a0bd>] ldlm_completion_ast+0x63d/0x920 [ptlrpc]
      [Wed Feb  6 09:19:25 2019][1695265.147339]  [<ffffffffc0e3adcc>] ldlm_cli_enqueue_local+0x23c/0x870 [ptlrpc]
      [Wed Feb  6 09:19:25 2019][1695265.154723]  [<ffffffffc15164ab>] mdt_object_local_lock+0x50b/0xb20 [mdt]
      [Wed Feb  6 09:19:25 2019][1695265.161731]  [<ffffffffc1516b30>] mdt_object_lock_internal+0x70/0x3e0 [mdt]
      [Wed Feb  6 09:19:25 2019][1695265.168906]  [<ffffffffc1517d1a>] mdt_getattr_name_lock+0x90a/0x1c30 [mdt]
      [Wed Feb  6 09:19:25 2019][1695265.176002]  [<ffffffffc151fbb5>] mdt_intent_getattr+0x2b5/0x480 [mdt]
      [Wed Feb  6 09:19:25 2019][1695265.182750]  [<ffffffffc151ca18>] mdt_intent_policy+0x2e8/0xd00 [mdt]
      [Wed Feb  6 09:19:25 2019][1695265.189403]  [<ffffffffc0e20ec6>] ldlm_lock_enqueue+0x366/0xa60 [ptlrpc]
      [Wed Feb  6 09:19:25 2019][1695265.196349]  [<ffffffffc0e498a7>] ldlm_handle_enqueue0+0xa47/0x15a0 [ptlrpc]
      [Wed Feb  6 09:19:25 2019][1695265.203643]  [<ffffffffc0ed0302>] tgt_enqueue+0x62/0x210 [ptlrpc]
      [Wed Feb  6 09:19:25 2019][1695265.210001]  [<ffffffffc0ed735a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
      [Wed Feb  6 09:19:25 2019][1695265.217133]  [<ffffffffc0e7b92b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
      [Wed Feb  6 09:19:25 2019][1695265.225045]  [<ffffffffc0e7f25c>] ptlrpc_main+0xafc/0x1fc0 [ptlrpc]
      [Wed Feb  6 09:19:25 2019][1695265.231563]  [<ffffffffadcc1c31>] kthread+0xd1/0xe0
      [Wed Feb  6 09:19:25 2019][1695265.236657]  [<ffffffffae374c24>] ret_from_fork_nospec_begin+0xe/0x21
      [Wed Feb  6 09:19:25 2019][1695265.243306]  [<ffffffffffffffff>] 0xffffffffffffffff
      [Wed Feb  6 09:19:25 2019][1695265.248518] LustreError: dumping log to /tmp/lustre-log.1549473603.35530
      

      On Fir, we have two Lustre 2.12 MDS servers, fir-md1-s1 and fir-md1-s2, each with 2 MDTs. I dumped the current tasks to the console using sysrq and I'm attaching the full console log for both MDS servers. The servers don't crash, but clients become unusable or blocked; from time to time we can still access the filesystem. Any help would be appreciated. Thanks!
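
      A quick way to watch lock counts and hung service-thread warnings on the MDS while this is happening (only a sketch; the lctl parameter names are assumed to match what 2.12 exposes):

      # LDLM lock count per namespace on the MDS, plus any recent
      # "Service thread ... was inactive" warnings from ptlrpc.
      lctl get_param ldlm.namespaces.*.lock_count
      dmesg | grep 'Service thread pid.*was inactive' | tail -20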

      Stephane

      Attachments

        1. fir-md1-s1_20191022.log
          2.94 MB
        2. fir-md1-s1-console.log
          4.22 MB
        3. fir-md1-s2-console.log
          2.85 MB
        4. lustre-log.1549473603.35530.gz
          1.08 MB

        Issue Links

          Activity

            [LU-11936] High ldlm load, slow/unusable filesystem

            sthiell Stephane Thiell added a comment -

            This is the task that was in an endless loop:

            [Tue Oct 22 06:46:06 2019][502582.891745] ldlm_bl_06      R  running task        0 40268      2 0x00000080
            [Tue Oct 22 06:46:06 2019][502582.898941] Call Trace:
            [Tue Oct 22 06:46:07 2019][502582.901568]  [<ffffffffc10018d9>] ? ptlrpc_expire_one_request+0xf9/0x520 [ptlrpc]
            [Tue Oct 22 06:46:07 2019][502582.909199]  [<ffffffffc1003da8>] ? ptlrpc_check_set.part.23+0x378/0x1df0 [ptlrpc]
            [Tue Oct 22 06:46:07 2019][502582.916913]  [<ffffffffc100587b>] ? ptlrpc_check_set+0x5b/0xe0 [ptlrpc]
            [Tue Oct 22 06:46:07 2019][502582.923673]  [<ffffffffc1005dea>] ? ptlrpc_set_wait+0x4ea/0x790 [ptlrpc]
            [Tue Oct 22 06:46:07 2019][502582.930470]  [<ffffffffaf6d7c40>] ? wake_up_state+0x20/0x20
            [Tue Oct 22 06:46:07 2019][502582.936184]  [<ffffffffc0fc3055>] ? ldlm_run_ast_work+0xd5/0x3a0 [ptlrpc]
            [Tue Oct 22 06:46:07 2019][502582.943118]  [<ffffffffc0fc452f>] ? __ldlm_reprocess_all+0x11f/0x360 [ptlrpc]
            [Tue Oct 22 06:46:07 2019][502582.950395]  [<ffffffffc0fc4783>] ? ldlm_reprocess_all+0x13/0x20 [ptlrpc]
            [Tue Oct 22 06:46:07 2019][502582.957331]  [<ffffffffc0fdc01e>] ? ldlm_cli_cancel_local+0x29e/0x3f0 [ptlrpc]
            [Tue Oct 22 06:46:07 2019][502582.964708]  [<ffffffffc0fe1b67>] ? ldlm_cli_cancel+0x157/0x620 [ptlrpc]
            [Tue Oct 22 06:46:07 2019][502582.971541]  [<ffffffffaf81bd89>] ? ___slab_alloc+0x209/0x4f0
            [Tue Oct 22 06:46:07 2019][502582.977440]  [<ffffffffc0fe20d4>] ? ldlm_blocking_ast_nocheck+0xa4/0x310 [ptlrpc]
            [Tue Oct 22 06:46:07 2019][502582.985094]  [<ffffffffc0fe247a>] ? ldlm_blocking_ast+0x13a/0x170 [ptlrpc]
            [Tue Oct 22 06:46:07 2019][502582.992129]  [<ffffffffc0fbc02c>] ? lock_res_and_lock+0x2c/0x50 [ptlrpc]
            [Tue Oct 22 06:46:07 2019][502582.999006]  [<ffffffffc0fedcd8>] ? ldlm_handle_bl_callback+0xf8/0x4f0 [ptlrpc]
            [Tue Oct 22 06:46:07 2019][502583.006463]  [<ffffffffc0fc00e0>] ? ldlm_lock_decref_internal+0x1a0/0xa30 [ptlrpc]
            [Tue Oct 22 06:46:07 2019][502583.014202]  [<ffffffffc0d693b9>] ? class_handle2object+0xb9/0x1c0 [obdclass]
            [Tue Oct 22 06:46:07 2019][502583.021483]  [<ffffffffc0fc0a70>] ? ldlm_lock_decref_and_cancel+0x80/0x150 [ptlrpc]
            [Tue Oct 22 06:46:07 2019][502583.029249]  [<ffffffffc1447835>] ? mgs_completion_ast_generic+0x125/0x200 [mgs]
            [Tue Oct 22 06:46:07 2019][502583.036765]  [<ffffffffc1447930>] ? mgs_completion_ast_barrier+0x20/0x20 [mgs]
            [Tue Oct 22 06:46:07 2019][502583.044126]  [<ffffffffc1447943>] ? mgs_completion_ast_ir+0x13/0x20 [mgs]
            [Tue Oct 22 06:46:07 2019][502583.051106]  [<ffffffffc0fbdc58>] ? ldlm_work_cp_ast_lock+0xa8/0x1d0 [ptlrpc]
            [Tue Oct 22 06:46:07 2019][502583.058392]  [<ffffffffc1005972>] ? ptlrpc_set_wait+0x72/0x790 [ptlrpc]
            [Tue Oct 22 06:46:07 2019][502583.065122]  [<ffffffffaf81e41d>] ? kmem_cache_alloc_node_trace+0x11d/0x210
            [Tue Oct 22 06:46:07 2019][502583.072240]  [<ffffffffc0d68a69>] ? lprocfs_counter_add+0xf9/0x160 [obdclass]
            [Tue Oct 22 06:46:07 2019][502583.079528]  [<ffffffffc0fbdbb0>] ? ldlm_work_gl_ast_lock+0x3a0/0x3a0 [ptlrpc]
            [Tue Oct 22 06:46:07 2019][502583.086915]  [<ffffffffc0ffc202>] ? ptlrpc_prep_set+0xd2/0x280 [ptlrpc]
            [Tue Oct 22 06:46:07 2019][502583.093672]  [<ffffffffc0fc3055>] ? ldlm_run_ast_work+0xd5/0x3a0 [ptlrpc]
            [Tue Oct 22 06:46:07 2019][502583.100649]  [<ffffffffc0fc452f>] ? __ldlm_reprocess_all+0x11f/0x360 [ptlrpc]
            [Tue Oct 22 06:46:07 2019][502583.107927]  [<ffffffffc0fc52a5>] ? ldlm_cancel_lock_for_export.isra.27+0x195/0x360 [ptlrpc]
            [Tue Oct 22 06:46:07 2019][502583.116509]  [<ffffffffc0fc54ac>] ? ldlm_cancel_locks_for_export_cb+0x3c/0x50 [ptlrpc]
            [Tue Oct 22 06:46:07 2019][502583.124535]  [<ffffffffc0a8efb0>] ? cfs_hash_for_each_relax+0x250/0x450 [libcfs]
            [Tue Oct 22 06:46:07 2019][502583.132112]  [<ffffffffc0fc5470>] ? ldlm_cancel_lock_for_export.isra.27+0x360/0x360 [ptlrpc]
            [Tue Oct 22 06:46:07 2019][502583.140690]  [<ffffffffc0fc5470>] ? ldlm_cancel_lock_for_export.isra.27+0x360/0x360 [ptlrpc]
            [Tue Oct 22 06:46:07 2019][502583.149260]  [<ffffffffc0a92510>] ? cfs_hash_for_each_empty+0x80/0x1d0 [libcfs]
            [Tue Oct 22 06:46:07 2019][502583.156714]  [<ffffffffc0fc57ba>] ? ldlm_export_cancel_locks+0xaa/0x180 [ptlrpc]
            [Tue Oct 22 06:46:07 2019][502583.164257]  [<ffffffffc0fee888>] ? ldlm_bl_thread_main+0x7b8/0xa40 [ptlrpc]
            [Tue Oct 22 06:46:07 2019][502583.171408]  [<ffffffffaf6d7c40>] ? wake_up_state+0x20/0x20
            [Tue Oct 22 06:46:07 2019][502583.177127]  [<ffffffffc0fee0d0>] ? ldlm_handle_bl_callback+0x4f0/0x4f0 [ptlrpc]
            [Tue Oct 22 06:46:07 2019][502583.184618]  [<ffffffffaf6c2e81>] ? kthread+0xd1/0xe0
            [Tue Oct 22 06:46:07 2019][502583.189804]  [<ffffffffaf6c2db0>] ? insert_kthread_work+0x40/0x40
            [Tue Oct 22 06:46:07 2019][502583.195988]  [<ffffffffafd77c24>] ? ret_from_fork_nospec_begin+0xe/0x21
            [Tue Oct 22 06:46:07 2019][502583.202696]  [<ffffffffaf6c2db0>] ? insert_kthread_work+0x40/0x40
            

            I'm attaching the full task dump when this was happening as fir-md1-s1_20191022.log

            We have restarted the MGS. Unmounting it made the server unresponsive.
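
            Before a restart, a quick look at the MGS lock namespace might show the backlog. This is only a sketch; it assumes the namespace appears under ldlm.namespaces as "MGS", which is not verified here:

            # List the LDLM namespaces, then dump lock counts for the MGS one
            # (namespace name assumed; adjust to whatever list_param shows).
            lctl list_param ldlm.namespaces.*
            lctl get_param ldlm.namespaces.MGS*.lock_count ldlm.namespaces.MGS*.pool.granted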

            sthiell Stephane Thiell added a comment - - edited

            This issue should probably be reopened because we have just seen it again with 2.12.3 RC1. After a client announced itself with a @tcp NID, it was very hard to remove it from Lustre: we had to evict the bogus tcp NID and use lnetctl peer del --prim_nid.

            And it didn't even work the first time:

            [root@fir-md1-s1 fir-MDT0000]# lnetctl peer show --nid 10.10.117.42@tcp
            peer:
                - primary nid: 10.10.117.42@tcp
                  Multi-Rail: True
                  peer ni:
                    - nid: 10.9.117.42@o2ib4
                      state: NA
                    - nid: 10.10.117.42@tcp
                      state: NA
            
            [root@fir-md1-s1 fir-MDT0000]# lnetctl peer del --prim_nid 10.10.117.42@tcp
            
            [root@fir-md1-s1 fir-MDT0000]# lnetctl peer show --nid 10.10.117.42@tcp
            peer:
                - primary nid: 10.10.117.42@tcp
                  Multi-Rail: False
                  peer ni:
                    - nid: 10.10.117.42@tcp
                      state: NA
            

            See, it removed the good one (o2ib4)!
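
            To get the good o2ib4 entry back after it was dropped, re-running discovery against that NID should recreate the peer (a sketch using the same NIDs as above):

            # Re-discover the client over its o2ib4 NID and confirm the peer
            # entry is recreated without the bogus @tcp NID attached to it.
            lnetctl discover 10.9.117.42@o2ib4
            lnetctl peer show --nid 10.9.117.42@o2ib4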

             

            With a combination of eviction + peer deletion, it worked:

            [root@fir-md1-s1 fir-MDT0000]# cat exports/10.10.117.42@tcp/uuid
            2f0c7362-5db5-a6c6-08eb-fa16109cc0f9
            
            [root@fir-md1-s1 fir-MDT0000]# echo 2f0c7362-5db5-a6c6-08eb-fa16109cc0f9 > evict_client 
            
            
            [root@fir-md1-s1 fir-MDT0000]# lnetctl peer del --prim_nid 10.10.117.42@tcp
            
            
            [root@fir-md1-s1 fir-MDT0000]# lnetctl peer show --nid 10.9.117.42@o2ib4
            peer:
                - primary nid: 10.9.117.42@o2ib4
                  Multi-Rail: True
                  peer ni:
                    - nid: 10.9.117.42@o2ib4
                      state: NA
            [root@fir-md1-s1 fir-MDT0000]# ls exports| grep tcp
            [root@fir-md1-s1 fir-MDT0000]# 
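
            For future occurrences, the same eviction + peer-deletion sequence can be scripted. This is only a sketch built from the exact files used above (run from the MDT's parameter directory, with BAD_NID as the example NID):

            # Evict the bogus @tcp export from this MDT, then drop its LNet peer.
            BAD_NID=10.10.117.42@tcp
            if [ -d "exports/$BAD_NID" ]; then
                uuid=$(cat "exports/$BAD_NID/uuid")   # export UUID of the bogus client
                echo "$uuid" > evict_client           # same as the manual step above
            fi
            lnetctl peer del --prim_nid "$BAD_NID"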
            

            This is a problem: we have IB-only routers and no tcp network/route defined on the servers, so I don't know why Lustre accepts it at all. The ldlm_bl thread stuck at 100% is also still a problem:

            $ top
               PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
             40268 root      20   0       0      0      0 S 100.0  0.0 114:45.76 ldlm_bl_06
             16505 root      20   0  164284   4424   1548 R  22.2  0.0   0:00.04 top
             35331 root      20   0       0      0      0 S   5.6  0.0 100:37.82 kiblnd_sd_03_03
             35333 root      20   0       0      0      0 S   5.6  0.0   0:48.41 lnet_discovery
            

            I'm hesitant to dump the tasks right now, as the filesystem is heavily used.
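
            Rather than a full sysrq task dump on a busy filesystem, sampling just the spinning thread's kernel stack should be enough to see where it loops (assuming /proc/<pid>/stack is enabled on this kernel, as it normally is on 3.10.0-957):

            # Sample only the spinning ldlm_bl thread (PID 40268 from the top
            # output above) a few times instead of dumping every task.
            for i in 1 2 3; do
                date
                cat /proc/40268/stack
                sleep 2
            done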

            pjones Peter Jones added a comment -

            ok - thanks Stephane


            sthiell Stephane Thiell added a comment -

            Yes, I confirm that once we got rid of all the bogus clients and restarted all Lustre servers, everything went back to normal. The servers don't recover by themselves, so a full restart is apparently required. Other Lustre servers running 2.8 or 2.10 are not affected. Please feel free to mark this one as a duplicate of LU-11888. I also opened LU-11937 to find out why, in some cases, a NID with tcp0 is loaded on the clients.
            Thanks!
            Stephane

            pjones Peter Jones added a comment -

            Seems related to a ticket you already have


            sthiell Stephane Thiell added a comment -

            Ah! Just after submitting this ticket and attaching the files, I noticed that a possible cause is that we again have one client with a NID using @tcp, which leads to unusable servers.

            For example, from lustre-log.1549473603.35530:

            @@@ Request sent has failed due to network error: [sent 1549473603/real 1549473603]  req@ffff912521032d00 x1623682053684704/t0(0) o106->fir-MDT0002@10.10.114.10@tcp:15/16 lens 296/280 e 0 to 1 dl 1549473614 ref 1 fl Rpc:eX/2/ffffffff rc -11/-1
            (lib-move.c:lnet_handle_find_routed_path()) no route to 10.10.114.10@tcp from <?>
            

            All our Lustre servers are exclusively IB. If this is the cause of the issue, this could be a duplicate of my previous ticket LU-11888.
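
            To double-check that no tcp LNet network or route is configured on the servers, the local NIDs and LNet configuration can be listed (standard lctl/lnetctl commands):

            # None of these should mention tcp on an IB-only server.
            lctl list_nids
            lnetctl net show
            lnetctl route show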
            I'll investigate, get rid of this bad node, and then confirm one way or the other.
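
            As a quick check for any remaining bogus clients, the same ls as above can be run against each MDT (a sketch; assumes the same exports layout on every target):

            # From each MDT's parameter directory: list client exports that
            # registered with a @tcp NID (there should be none on IB-only servers).
            ls exports | grep '@tcp'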
            Best,
            Stephane


            People

              Assignee: ssmirnov Serguei Smirnov
              Reporter: sthiell Stephane Thiell
              Votes: 0
              Watchers: 9
