  Lustre / LU-11936

High ldlm load, slow/unusable filesystem

Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Blocker
    • Affects Version/s: Lustre 2.12.0
    • Environment: CentOS 7.6, Lustre 2.12.0, 3.10.0-957.1.3.el7_lustre.x86_64

    Description

      We upgraded our cluster Sherlock to Lustre 2.12 and put Fir (our Lustre 2.12 servers) into production yesterday, but this morning the filesystem is unusable due to a very high load of ldlm threads on the MDS servers.

      I can see plenty of these on the MDS:

      [Wed Feb  6 09:19:25 2019][1695265.110496] LNet: Service thread pid 35530 was inactive for 350.85s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      [Wed Feb  6 09:19:25 2019][1695265.127609] Pid: 35530, comm: mdt02_024 3.10.0-957.1.3.el7_lustre.x86_64 #1 SMP Fri Dec 7 14:50:35 PST 2018
      [Wed Feb  6 09:19:25 2019][1695265.137522] Call Trace:
      [Wed Feb  6 09:19:25 2019][1695265.140184]  [<ffffffffc0e3a0bd>] ldlm_completion_ast+0x63d/0x920 [ptlrpc]
      [Wed Feb  6 09:19:25 2019][1695265.147339]  [<ffffffffc0e3adcc>] ldlm_cli_enqueue_local+0x23c/0x870 [ptlrpc]
      [Wed Feb  6 09:19:25 2019][1695265.154723]  [<ffffffffc15164ab>] mdt_object_local_lock+0x50b/0xb20 [mdt]
      [Wed Feb  6 09:19:25 2019][1695265.161731]  [<ffffffffc1516b30>] mdt_object_lock_internal+0x70/0x3e0 [mdt]
      [Wed Feb  6 09:19:25 2019][1695265.168906]  [<ffffffffc1517d1a>] mdt_getattr_name_lock+0x90a/0x1c30 [mdt]
      [Wed Feb  6 09:19:25 2019][1695265.176002]  [<ffffffffc151fbb5>] mdt_intent_getattr+0x2b5/0x480 [mdt]
      [Wed Feb  6 09:19:25 2019][1695265.182750]  [<ffffffffc151ca18>] mdt_intent_policy+0x2e8/0xd00 [mdt]
      [Wed Feb  6 09:19:25 2019][1695265.189403]  [<ffffffffc0e20ec6>] ldlm_lock_enqueue+0x366/0xa60 [ptlrpc]
      [Wed Feb  6 09:19:25 2019][1695265.196349]  [<ffffffffc0e498a7>] ldlm_handle_enqueue0+0xa47/0x15a0 [ptlrpc]
      [Wed Feb  6 09:19:25 2019][1695265.203643]  [<ffffffffc0ed0302>] tgt_enqueue+0x62/0x210 [ptlrpc]
      [Wed Feb  6 09:19:25 2019][1695265.210001]  [<ffffffffc0ed735a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
      [Wed Feb  6 09:19:25 2019][1695265.217133]  [<ffffffffc0e7b92b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
      [Wed Feb  6 09:19:25 2019][1695265.225045]  [<ffffffffc0e7f25c>] ptlrpc_main+0xafc/0x1fc0 [ptlrpc]
      [Wed Feb  6 09:19:25 2019][1695265.231563]  [<ffffffffadcc1c31>] kthread+0xd1/0xe0
      [Wed Feb  6 09:19:25 2019][1695265.236657]  [<ffffffffae374c24>] ret_from_fork_nospec_begin+0xe/0x21
      [Wed Feb  6 09:19:25 2019][1695265.243306]  [<ffffffffffffffff>] 0xffffffffffffffff
      [Wed Feb  6 09:19:25 2019][1695265.248518] LustreError: dumping log to /tmp/lustre-log.1549473603.35530
      

      On Fir, we have two Lustre 2.12 MDS servers, fir-md1-s1 and fir-md1-s2, each with 2 MDTs. I dumped the current tasks using sysrq to the console and I'm attaching the full console log for both MDS servers. The servers don't crash, but they leave clients unusable/blocked; from time to time we can still access the filesystem. Any help would be appreciated. Thanks!
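
      For reference, a minimal sketch of how the task dump and the dumped debug log mentioned above can be collected and decoded (the log file name is the one from the message above; adjust paths as needed):

      # dump all task stack traces to the console/dmesg (the sysrq dump referred to above)
      echo t > /proc/sysrq-trigger

      # convert the binary Lustre debug log dumped by the MDS into readable text
      lctl debug_file /tmp/lustre-log.1549473603.35530 /tmp/lustre-log.1549473603.35530.txt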

      Stephane

      Attachments

        1. fir-md1-s1_20191022.log
          2.94 MB
        2. fir-md1-s1-console.log
          4.22 MB
        3. fir-md1-s2-console.log
          2.85 MB
        4. lustre-log.1549473603.35530.gz
          1.08 MB


          Activity

            [LU-11936] High ldlm load, slow/unusable filesystem
            jstroik Jesse Stroik added a comment -

            Hi Serguei,

            We're seeing a similar situation, and I can reproduce it and provide you with details. I added a comment to LU-11989 which includes a relevant snippet from our MDS logs: https://jira.whamcloud.com/browse/LU-11989?focusedCommentId=258916&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-258916

            In our example, we have four cluster nodes that initially connected to one of our 2.12 file systems using the wrong NID - one behind a NAT. The NIDs trapped behind the NAT are 10.23.x.x@tcp. 
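
            As a hedged illustration (exact parameter paths may vary by Lustre version), one way to spot clients that registered with an unexpected @tcp NID is to list the per-client export entries on the MDS:

            # list MDT export entries and look for clients that connected with a tcp NID
            lctl get_param "mdt.*.exports.*.uuid" | grep '@tcp'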

            Let me know if I can provide any other information.

            Jesse Stroik


            ssmirnov Serguei Smirnov added a comment -

            Hi Stephane,

            If the @tcp NID seen on the server is the result of a misconfiguration on any of the clients, then the reported behaviour when removing the primary NID without evicting beforehand may be expected, as the server will keep trying to rebuild its representation of the client.

            Would it be possible to get configuration details from the server/router/client (via lnetctl export)?

            It would make it easier to understand what is causing the issue and how ldlm is getting affected.
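
            For example, a minimal sketch of gathering that configuration on each node (the output file name is only illustrative):

            # dump the current LNet configuration (networks, peers, routes) as YAML
            lnetctl export > /tmp/lnet-config-$(hostname -s).yaml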

            Thanks,

            Serguei.


            sthiell Stephane Thiell added a comment -

            This is the task that was in an endless loop:

            [Tue Oct 22 06:46:06 2019][502582.891745] ldlm_bl_06      R  running task        0 40268      2 0x00000080
            [Tue Oct 22 06:46:06 2019][502582.898941] Call Trace:
            [Tue Oct 22 06:46:07 2019][502582.901568]  [<ffffffffc10018d9>] ? ptlrpc_expire_one_request+0xf9/0x520 [ptlrpc]
            [Tue Oct 22 06:46:07 2019][502582.909199]  [<ffffffffc1003da8>] ? ptlrpc_check_set.part.23+0x378/0x1df0 [ptlrpc]
            [Tue Oct 22 06:46:07 2019][502582.916913]  [<ffffffffc100587b>] ? ptlrpc_check_set+0x5b/0xe0 [ptlrpc]
            [Tue Oct 22 06:46:07 2019][502582.923673]  [<ffffffffc1005dea>] ? ptlrpc_set_wait+0x4ea/0x790 [ptlrpc]
            [Tue Oct 22 06:46:07 2019][502582.930470]  [<ffffffffaf6d7c40>] ? wake_up_state+0x20/0x20
            [Tue Oct 22 06:46:07 2019][502582.936184]  [<ffffffffc0fc3055>] ? ldlm_run_ast_work+0xd5/0x3a0 [ptlrpc]
            [Tue Oct 22 06:46:07 2019][502582.943118]  [<ffffffffc0fc452f>] ? __ldlm_reprocess_all+0x11f/0x360 [ptlrpc]
            [Tue Oct 22 06:46:07 2019][502582.950395]  [<ffffffffc0fc4783>] ? ldlm_reprocess_all+0x13/0x20 [ptlrpc]
            [Tue Oct 22 06:46:07 2019][502582.957331]  [<ffffffffc0fdc01e>] ? ldlm_cli_cancel_local+0x29e/0x3f0 [ptlrpc]
            [Tue Oct 22 06:46:07 2019][502582.964708]  [<ffffffffc0fe1b67>] ? ldlm_cli_cancel+0x157/0x620 [ptlrpc]
            [Tue Oct 22 06:46:07 2019][502582.971541]  [<ffffffffaf81bd89>] ? ___slab_alloc+0x209/0x4f0
            [Tue Oct 22 06:46:07 2019][502582.977440]  [<ffffffffc0fe20d4>] ? ldlm_blocking_ast_nocheck+0xa4/0x310 [ptlrpc]
            [Tue Oct 22 06:46:07 2019][502582.985094]  [<ffffffffc0fe247a>] ? ldlm_blocking_ast+0x13a/0x170 [ptlrpc]
            [Tue Oct 22 06:46:07 2019][502582.992129]  [<ffffffffc0fbc02c>] ? lock_res_and_lock+0x2c/0x50 [ptlrpc]
            [Tue Oct 22 06:46:07 2019][502582.999006]  [<ffffffffc0fedcd8>] ? ldlm_handle_bl_callback+0xf8/0x4f0 [ptlrpc]
            [Tue Oct 22 06:46:07 2019][502583.006463]  [<ffffffffc0fc00e0>] ? ldlm_lock_decref_internal+0x1a0/0xa30 [ptlrpc]
            [Tue Oct 22 06:46:07 2019][502583.014202]  [<ffffffffc0d693b9>] ? class_handle2object+0xb9/0x1c0 [obdclass]
            [Tue Oct 22 06:46:07 2019][502583.021483]  [<ffffffffc0fc0a70>] ? ldlm_lock_decref_and_cancel+0x80/0x150 [ptlrpc]
            [Tue Oct 22 06:46:07 2019][502583.029249]  [<ffffffffc1447835>] ? mgs_completion_ast_generic+0x125/0x200 [mgs]
            [Tue Oct 22 06:46:07 2019][502583.036765]  [<ffffffffc1447930>] ? mgs_completion_ast_barrier+0x20/0x20 [mgs]
            [Tue Oct 22 06:46:07 2019][502583.044126]  [<ffffffffc1447943>] ? mgs_completion_ast_ir+0x13/0x20 [mgs]
            [Tue Oct 22 06:46:07 2019][502583.051106]  [<ffffffffc0fbdc58>] ? ldlm_work_cp_ast_lock+0xa8/0x1d0 [ptlrpc]
            [Tue Oct 22 06:46:07 2019][502583.058392]  [<ffffffffc1005972>] ? ptlrpc_set_wait+0x72/0x790 [ptlrpc]
            [Tue Oct 22 06:46:07 2019][502583.065122]  [<ffffffffaf81e41d>] ? kmem_cache_alloc_node_trace+0x11d/0x210
            [Tue Oct 22 06:46:07 2019][502583.072240]  [<ffffffffc0d68a69>] ? lprocfs_counter_add+0xf9/0x160 [obdclass]
            [Tue Oct 22 06:46:07 2019][502583.079528]  [<ffffffffc0fbdbb0>] ? ldlm_work_gl_ast_lock+0x3a0/0x3a0 [ptlrpc]
            [Tue Oct 22 06:46:07 2019][502583.086915]  [<ffffffffc0ffc202>] ? ptlrpc_prep_set+0xd2/0x280 [ptlrpc]
            [Tue Oct 22 06:46:07 2019][502583.093672]  [<ffffffffc0fc3055>] ? ldlm_run_ast_work+0xd5/0x3a0 [ptlrpc]
            [Tue Oct 22 06:46:07 2019][502583.100649]  [<ffffffffc0fc452f>] ? __ldlm_reprocess_all+0x11f/0x360 [ptlrpc]
            [Tue Oct 22 06:46:07 2019][502583.107927]  [<ffffffffc0fc52a5>] ? ldlm_cancel_lock_for_export.isra.27+0x195/0x360 [ptlrpc]
            [Tue Oct 22 06:46:07 2019][502583.116509]  [<ffffffffc0fc54ac>] ? ldlm_cancel_locks_for_export_cb+0x3c/0x50 [ptlrpc]
            [Tue Oct 22 06:46:07 2019][502583.124535]  [<ffffffffc0a8efb0>] ? cfs_hash_for_each_relax+0x250/0x450 [libcfs]
            [Tue Oct 22 06:46:07 2019][502583.132112]  [<ffffffffc0fc5470>] ? ldlm_cancel_lock_for_export.isra.27+0x360/0x360 [ptlrpc]
            [Tue Oct 22 06:46:07 2019][502583.140690]  [<ffffffffc0fc5470>] ? ldlm_cancel_lock_for_export.isra.27+0x360/0x360 [ptlrpc]
            [Tue Oct 22 06:46:07 2019][502583.149260]  [<ffffffffc0a92510>] ? cfs_hash_for_each_empty+0x80/0x1d0 [libcfs]
            [Tue Oct 22 06:46:07 2019][502583.156714]  [<ffffffffc0fc57ba>] ? ldlm_export_cancel_locks+0xaa/0x180 [ptlrpc]
            [Tue Oct 22 06:46:07 2019][502583.164257]  [<ffffffffc0fee888>] ? ldlm_bl_thread_main+0x7b8/0xa40 [ptlrpc]
            [Tue Oct 22 06:46:07 2019][502583.171408]  [<ffffffffaf6d7c40>] ? wake_up_state+0x20/0x20
            [Tue Oct 22 06:46:07 2019][502583.177127]  [<ffffffffc0fee0d0>] ? ldlm_handle_bl_callback+0x4f0/0x4f0 [ptlrpc]
            [Tue Oct 22 06:46:07 2019][502583.184618]  [<ffffffffaf6c2e81>] ? kthread+0xd1/0xe0
            [Tue Oct 22 06:46:07 2019][502583.189804]  [<ffffffffaf6c2db0>] ? insert_kthread_work+0x40/0x40
            [Tue Oct 22 06:46:07 2019][502583.195988]  [<ffffffffafd77c24>] ? ret_from_fork_nospec_begin+0xe/0x21
            [Tue Oct 22 06:46:07 2019][502583.202696]  [<ffffffffaf6c2db0>] ? insert_kthread_work+0x40/0x40
            

            I'm attaching the full task dump from when this was happening as fir-md1-s1_20191022.log.

            We have restarted the MGS. Unmounting it made the server unresponsive.

            sthiell Stephane Thiell added a comment - edited

            This issue should probably be reopened because we have just seen it again with 2.12.3 RC1. After a client announced itself with a @tcp NID, it was very hard to remove it from Lustre. We had to evict the bogus tcp NID and use lnetctl peer del --prim_nid.

            And it didn't even work the first time:

            [root@fir-md1-s1 fir-MDT0000]# lnetctl peer show --nid 10.10.117.42@tcp
            peer:
                - primary nid: 10.10.117.42@tcp
                  Multi-Rail: True
                  peer ni:
                    - nid: 10.9.117.42@o2ib4
                      state: NA
                    - nid: 10.10.117.42@tcp
                      state: NA
            
            [root@fir-md1-s1 fir-MDT0000]# lnetctl peer del --prim_nid 10.10.117.42@tcp
            
            [root@fir-md1-s1 fir-MDT0000]# lnetctl peer show --nid 10.10.117.42@tcp
            peer:
                - primary nid: 10.10.117.42@tcp
                  Multi-Rail: False
                  peer ni:
                    - nid: 10.10.117.42@tcp
                      state: NA
            

            See, it removed the good one (o2ib4)!
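
            If the correct entry gets dropped like this, it should be possible to re-create it by hand; a hedged sketch (whether this is even needed depends on whether LNet discovery re-creates the peer on its own):

            # re-add the peer with its correct o2ib4 primary NID
            lnetctl peer add --prim_nid 10.9.117.42@o2ib4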

             

            With a combination of eviction + peer deletion, it worked:

            [root@fir-md1-s1 fir-MDT0000]# cat exports/10.10.117.42@tcp/uuid
            2f0c7362-5db5-a6c6-08eb-fa16109cc0f9
            
            [root@fir-md1-s1 fir-MDT0000]# echo 2f0c7362-5db5-a6c6-08eb-fa16109cc0f9 > evict_client 
            
            
            [root@fir-md1-s1 fir-MDT0000]# lnetctl peer del --prim_nid 10.10.117.42@tcp
            
            
            [root@fir-md1-s1 fir-MDT0000]# lnetctl peer show --nid 10.9.117.42@o2ib4
            peer:
                - primary nid: 10.9.117.42@o2ib4
                  Multi-Rail: True
                  peer ni:
                    - nid: 10.9.117.42@o2ib4
                      state: NA
            [root@fir-md1-s1 fir-MDT0000]# ls exports| grep tcp
            [root@fir-md1-s1 fir-MDT0000]# 
            

            This is a problem: we have IB-only routers and no tcp network/route defined on the servers, so I don't know why Lustre accepts it at all. The ldlm_bl thread at 100% is also still a problem:

            $ top
               PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
             40268 root      20   0       0      0      0 S 100.0  0.0 114:45.76 ldlm_bl_06
             16505 root      20   0  164284   4424   1548 R  22.2  0.0   0:00.04 top
             35331 root      20   0       0      0      0 S   5.6  0.0 100:37.82 kiblnd_sd_03_03
             35333 root      20   0       0      0      0 S   5.6  0.0   0:48.41 lnet_discovery
            

            I'm hesitant to dump the tasks right now, as the filesystem is heavily used.
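
            For what it's worth, the point about having no tcp network or route configured on the servers can be double-checked with lnetctl; a minimal sketch (run on the MDS):

            # show configured LNet networks (only o2ib is expected here)
            lnetctl net show

            # show configured LNet routes (no tcp remote network is expected here)
            lnetctl route show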

            pjones Peter Jones added a comment -

            ok - thanks Stephane


            sthiell Stephane Thiell added a comment -

            Yes, I confirm that once we got rid of all the bogus clients and restarted all the Lustre servers, everything went back to normal. The servers don't recover by themselves, so a full restart apparently is required. Other Lustre servers running 2.8 or 2.10 are not affected. Please feel free to mark this one as a duplicate of LU-11888. I also opened LU-11937 to find out why, in some cases, a NID with tcp0 is loaded on the clients.
            Thanks!
            Stephane

            pjones Peter Jones added a comment -

            Seems related to a ticket you already have


            sthiell Stephane Thiell added a comment -

            Ah! Just after submitting this ticket and attaching the files, I noticed that a possible cause could be that we once again have one client with a NID using @tcp, leading to unusable servers.

            For example, from lustre-log.1549473603.35530:

            @@@ Request sent has failed due to network error: [sent 1549473603/real 1549473603]  req@ffff912521032d00 x1623682053684704/t0(0) o106->fir-MDT0002@10.10.114.10@tcp:15/16 lens 296/280 e 0 to 1 dl 1549473614 ref 1 fl Rpc:eX/2/ffffffff rc -11/-1
            lib-move.c:lnet_handle_find_routed_path() no route to 10.10.114.10@tcp from <?>
            

            All our Lustre servers are exclusively IB. If this is the cause of the issue, this could be a duplicate of my previous ticket LU-11888.
            I'll investigate, get rid of this bad node, and then confirm whether that was the cause.
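
            A quick way to check which NIDs the suspect client is presenting (a minimal sketch; run on the client):

            # list the NIDs this node is using
            lctl list_nids

            # show the LNet networks configured on the node
            lnetctl net show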
            Best,
            Stephane


            People

              Assignee: ssmirnov Serguei Smirnov
              Reporter: sthiell Stephane Thiell
              Votes: 0
              Watchers: 9
