[LU-11936] High ldlm load, slow/unusable filesystem Created: 06/Feb/19  Updated: 27/Nov/19

Status: Reopened
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.0
Fix Version/s: None

Type: Bug Priority: Blocker
Reporter: Stephane Thiell Assignee: Serguei Smirnov
Resolution: Unresolved Votes: 0
Labels: None
Environment:

CentOS 7.6, Lustre 2.12.0, 3.10.0-957.1.3.el7_lustre.x86_64


Attachments: Text File fir-md1-s1-console.log     Text File fir-md1-s1_20191022.log     Text File fir-md1-s2-console.log     File lustre-log.1549473603.35530.gz    
Issue Links:
Duplicate
duplicates LU-11888 Unreachable client NID confusing Lust... Open
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

We upgraded our cluster Sherlock to Lustre 2.12 and put Fir (Lustre 2.12 servers) into production yesterday, but this morning the filesystem is unusable due to a very high load of ldlm threads on the MDS servers.

I can see plenty of these on the MDS:

[Wed Feb  6 09:19:25 2019][1695265.110496] LNet: Service thread pid 35530 was inactive for 350.85s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
[Wed Feb  6 09:19:25 2019][1695265.127609] Pid: 35530, comm: mdt02_024 3.10.0-957.1.3.el7_lustre.x86_64 #1 SMP Fri Dec 7 14:50:35 PST 2018
[Wed Feb  6 09:19:25 2019][1695265.137522] Call Trace:
[Wed Feb  6 09:19:25 2019][1695265.140184]  [<ffffffffc0e3a0bd>] ldlm_completion_ast+0x63d/0x920 [ptlrpc]
[Wed Feb  6 09:19:25 2019][1695265.147339]  [<ffffffffc0e3adcc>] ldlm_cli_enqueue_local+0x23c/0x870 [ptlrpc]
[Wed Feb  6 09:19:25 2019][1695265.154723]  [<ffffffffc15164ab>] mdt_object_local_lock+0x50b/0xb20 [mdt]
[Wed Feb  6 09:19:25 2019][1695265.161731]  [<ffffffffc1516b30>] mdt_object_lock_internal+0x70/0x3e0 [mdt]
[Wed Feb  6 09:19:25 2019][1695265.168906]  [<ffffffffc1517d1a>] mdt_getattr_name_lock+0x90a/0x1c30 [mdt]
[Wed Feb  6 09:19:25 2019][1695265.176002]  [<ffffffffc151fbb5>] mdt_intent_getattr+0x2b5/0x480 [mdt]
[Wed Feb  6 09:19:25 2019][1695265.182750]  [<ffffffffc151ca18>] mdt_intent_policy+0x2e8/0xd00 [mdt]
[Wed Feb  6 09:19:25 2019][1695265.189403]  [<ffffffffc0e20ec6>] ldlm_lock_enqueue+0x366/0xa60 [ptlrpc]
[Wed Feb  6 09:19:25 2019][1695265.196349]  [<ffffffffc0e498a7>] ldlm_handle_enqueue0+0xa47/0x15a0 [ptlrpc]
[Wed Feb  6 09:19:25 2019][1695265.203643]  [<ffffffffc0ed0302>] tgt_enqueue+0x62/0x210 [ptlrpc]
[Wed Feb  6 09:19:25 2019][1695265.210001]  [<ffffffffc0ed735a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
[Wed Feb  6 09:19:25 2019][1695265.217133]  [<ffffffffc0e7b92b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
[Wed Feb  6 09:19:25 2019][1695265.225045]  [<ffffffffc0e7f25c>] ptlrpc_main+0xafc/0x1fc0 [ptlrpc]
[Wed Feb  6 09:19:25 2019][1695265.231563]  [<ffffffffadcc1c31>] kthread+0xd1/0xe0
[Wed Feb  6 09:19:25 2019][1695265.236657]  [<ffffffffae374c24>] ret_from_fork_nospec_begin+0xe/0x21
[Wed Feb  6 09:19:25 2019][1695265.243306]  [<ffffffffffffffff>] 0xffffffffffffffff
[Wed Feb  6 09:19:25 2019][1695265.248518] LustreError: dumping log to /tmp/lustre-log.1549473603.35530

On Fir, we have two Lustre 2.12 MDS servers, fir-md1-s1 and fir-md1-s2, each with 2 MDTs. I dumped the current tasks to the console using sysrq and I'm attaching the full console log for both MDS servers. The servers don't crash, but clients become unusable/blocked; from time to time we can still access the filesystem. Any help would be appreciated. Thanks!
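For reference, a minimal sketch of how such a task dump can be triggered on the MDS (assumes the kernel.sysrq sysctl permits it):

# allow sysrq if it is not already enabled
echo 1 > /proc/sys/kernel/sysrq
# "t": dump the state and kernel stack of all tasks to the console/kernel log
echo t > /proc/sysrq-trigger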

Stephane



 Comments   
Comment by Stephane Thiell [ 06/Feb/19 ]

Ah! Just after submitting this ticket and attaching the files, I noticed a possible cause: we again have one client with a NID using @tcp, which leads to unusable servers.

For example, from lustre-log.1549473603.35530:

@@@ Request sent has failed due to network error: [sent 1549473603/real 1549473603]  req@ffff912521032d00 x1623682053684704/t0(0) o106->fir-MDT0002@10.10.114.10@tcp:15/16 lens 296/280 e 0 to 1 dl 1549473614 ref 1 fl Rpc:eX/2/ffffffff rc -11/-1
lib-move.c lnet_handle_find_routed_path() no route to 10.10.114.10@tcp from <?>

All our Lustre servers are exclusively IB. If this is the cause of the issue, this could be a duplicate of my previous ticket LU-11888.
I'll investigate, get rid of this bad node, and then confirm whether that was the cause.
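As a sketch of how such clients can be spotted on the MDS: any export that registered over tcp shows up under the MDT's exports, so something like the following should list them (target name taken from this filesystem; lctl resolves the proc path):

lctl get_param mdt.fir-MDT0000.exports.*.uuid | grep @tcp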
Best,
Stephane

Comment by Peter Jones [ 06/Feb/19 ]

Seems related to a ticket you already have

Comment by Stephane Thiell [ 07/Feb/19 ]

Yes, I confirm that when we get rid of all bogus clients and restart all Lustre servers, everything is back to normal. The servers don't recover by themselves, so a full restart is apparently required. Other Lustre servers running 2.8 or 2.10 are not affected. Please feel free to mark this one as a duplicate of LU-11888. I also opened LU-11937 to find out why, in some cases, a NID with tcp0 ends up configured on the clients.
Thanks!
Stephane

Comment by Peter Jones [ 07/Feb/19 ]

ok - thanks Stephane

Comment by Stephane Thiell [ 21/Oct/19 ]

This issue should probably be reopened because we have just seen it with 2.12.3 RC1. After a client announced itself with a @tcp NID, it was very hard to remove it from Lustre. We had to evict the bogus tcp NID and use lnetctl peer del --prim_nid.

And it didn't even work the first time:

[root@fir-md1-s1 fir-MDT0000]# lnetctl peer show --nid 10.10.117.42@tcp
peer:
    - primary nid: 10.10.117.42@tcp
      Multi-Rail: True
      peer ni:
        - nid: 10.9.117.42@o2ib4
          state: NA
        - nid: 10.10.117.42@tcp
          state: NA

[root@fir-md1-s1 fir-MDT0000]# lnetctl peer del --prim_nid 10.10.117.42@tcp

[root@fir-md1-s1 fir-MDT0000]# lnetctl peer show --nid 10.10.117.42@tcp
peer:
    - primary nid: 10.10.117.42@tcp
      Multi-Rail: False
      peer ni:
        - nid: 10.10.117.42@tcp
          state: NA

See, it removed the good one (o2ib4)!

 

With a combination of eviction + peer deletion, it worked:

[root@fir-md1-s1 fir-MDT0000]# cat exports/10.10.117.42@tcp/uuid
2f0c7362-5db5-a6c6-08eb-fa16109cc0f9

[root@fir-md1-s1 fir-MDT0000]# echo 2f0c7362-5db5-a6c6-08eb-fa16109cc0f9 > evict_client 


[root@fir-md1-s1 fir-MDT0000]# lnetctl peer del --prim_nid 10.10.117.42@tcp


[root@fir-md1-s1 fir-MDT0000]# lnetctl peer show --nid 10.9.117.42@o2ib4
peer:
    - primary nid: 10.9.117.42@o2ib4
      Multi-Rail: True
      peer ni:
        - nid: 10.9.117.42@o2ib4
          state: NA
[root@fir-md1-s1 fir-MDT0000]# ls exports| grep tcp
[root@fir-md1-s1 fir-MDT0000]# 

This is a problem: we have IB-only routers and no tcp network/route defined on the servers, so I don't know why Lustre accepts it at all. The ldlm_bl thread at 100% is also still a problem:

$ top
   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 40268 root      20   0       0      0      0 S 100.0  0.0 114:45.76 ldlm_bl_06
 16505 root      20   0  164284   4424   1548 R  22.2  0.0   0:00.04 top
 35331 root      20   0       0      0      0 S   5.6  0.0 100:37.82 kiblnd_sd_03_03
 35333 root      20   0       0      0      0 S   5.6  0.0   0:48.41 lnet_discovery
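Regarding Lustre accepting a @tcp NID at all: the locally configured LNet networks can be checked directly on the server, where only o2ib NIs should appear and no tcp NI (a sketch, output omitted):

# list the network interfaces LNet is configured with on this server
lnetctl net show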

I'm hesitant to dump the tasks right now, as the filesystem is heavily used.
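A less disruptive alternative to a full sysrq dump would be reading the kernel stack of just the spinning thread (a sketch; assumes /proc/<pid>/stack is available on this kernel, pid taken from the top output above):

# kernel stack of the busy ldlm_bl_06 thread only
cat /proc/40268/stack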

Comment by Stephane Thiell [ 22/Oct/19 ]

This is the task that was in an endless loop:

[Tue Oct 22 06:46:06 2019][502582.891745] ldlm_bl_06      R  running task        0 40268      2 0x00000080
[Tue Oct 22 06:46:06 2019][502582.898941] Call Trace:
[Tue Oct 22 06:46:07 2019][502582.901568]  [<ffffffffc10018d9>] ? ptlrpc_expire_one_request+0xf9/0x520 [ptlrpc]
[Tue Oct 22 06:46:07 2019][502582.909199]  [<ffffffffc1003da8>] ? ptlrpc_check_set.part.23+0x378/0x1df0 [ptlrpc]
[Tue Oct 22 06:46:07 2019][502582.916913]  [<ffffffffc100587b>] ? ptlrpc_check_set+0x5b/0xe0 [ptlrpc]
[Tue Oct 22 06:46:07 2019][502582.923673]  [<ffffffffc1005dea>] ? ptlrpc_set_wait+0x4ea/0x790 [ptlrpc]
[Tue Oct 22 06:46:07 2019][502582.930470]  [<ffffffffaf6d7c40>] ? wake_up_state+0x20/0x20
[Tue Oct 22 06:46:07 2019][502582.936184]  [<ffffffffc0fc3055>] ? ldlm_run_ast_work+0xd5/0x3a0 [ptlrpc]
[Tue Oct 22 06:46:07 2019][502582.943118]  [<ffffffffc0fc452f>] ? __ldlm_reprocess_all+0x11f/0x360 [ptlrpc]
[Tue Oct 22 06:46:07 2019][502582.950395]  [<ffffffffc0fc4783>] ? ldlm_reprocess_all+0x13/0x20 [ptlrpc]
[Tue Oct 22 06:46:07 2019][502582.957331]  [<ffffffffc0fdc01e>] ? ldlm_cli_cancel_local+0x29e/0x3f0 [ptlrpc]
[Tue Oct 22 06:46:07 2019][502582.964708]  [<ffffffffc0fe1b67>] ? ldlm_cli_cancel+0x157/0x620 [ptlrpc]
[Tue Oct 22 06:46:07 2019][502582.971541]  [<ffffffffaf81bd89>] ? ___slab_alloc+0x209/0x4f0
[Tue Oct 22 06:46:07 2019][502582.977440]  [<ffffffffc0fe20d4>] ? ldlm_blocking_ast_nocheck+0xa4/0x310 [ptlrpc]
[Tue Oct 22 06:46:07 2019][502582.985094]  [<ffffffffc0fe247a>] ? ldlm_blocking_ast+0x13a/0x170 [ptlrpc]
[Tue Oct 22 06:46:07 2019][502582.992129]  [<ffffffffc0fbc02c>] ? lock_res_and_lock+0x2c/0x50 [ptlrpc]
[Tue Oct 22 06:46:07 2019][502582.999006]  [<ffffffffc0fedcd8>] ? ldlm_handle_bl_callback+0xf8/0x4f0 [ptlrpc]
[Tue Oct 22 06:46:07 2019][502583.006463]  [<ffffffffc0fc00e0>] ? ldlm_lock_decref_internal+0x1a0/0xa30 [ptlrpc]
[Tue Oct 22 06:46:07 2019][502583.014202]  [<ffffffffc0d693b9>] ? class_handle2object+0xb9/0x1c0 [obdclass]
[Tue Oct 22 06:46:07 2019][502583.021483]  [<ffffffffc0fc0a70>] ? ldlm_lock_decref_and_cancel+0x80/0x150 [ptlrpc]
[Tue Oct 22 06:46:07 2019][502583.029249]  [<ffffffffc1447835>] ? mgs_completion_ast_generic+0x125/0x200 [mgs]
[Tue Oct 22 06:46:07 2019][502583.036765]  [<ffffffffc1447930>] ? mgs_completion_ast_barrier+0x20/0x20 [mgs]
[Tue Oct 22 06:46:07 2019][502583.044126]  [<ffffffffc1447943>] ? mgs_completion_ast_ir+0x13/0x20 [mgs]
[Tue Oct 22 06:46:07 2019][502583.051106]  [<ffffffffc0fbdc58>] ? ldlm_work_cp_ast_lock+0xa8/0x1d0 [ptlrpc]
[Tue Oct 22 06:46:07 2019][502583.058392]  [<ffffffffc1005972>] ? ptlrpc_set_wait+0x72/0x790 [ptlrpc]
[Tue Oct 22 06:46:07 2019][502583.065122]  [<ffffffffaf81e41d>] ? kmem_cache_alloc_node_trace+0x11d/0x210
[Tue Oct 22 06:46:07 2019][502583.072240]  [<ffffffffc0d68a69>] ? lprocfs_counter_add+0xf9/0x160 [obdclass]
[Tue Oct 22 06:46:07 2019][502583.079528]  [<ffffffffc0fbdbb0>] ? ldlm_work_gl_ast_lock+0x3a0/0x3a0 [ptlrpc]
[Tue Oct 22 06:46:07 2019][502583.086915]  [<ffffffffc0ffc202>] ? ptlrpc_prep_set+0xd2/0x280 [ptlrpc]
[Tue Oct 22 06:46:07 2019][502583.093672]  [<ffffffffc0fc3055>] ? ldlm_run_ast_work+0xd5/0x3a0 [ptlrpc]
[Tue Oct 22 06:46:07 2019][502583.100649]  [<ffffffffc0fc452f>] ? __ldlm_reprocess_all+0x11f/0x360 [ptlrpc]
[Tue Oct 22 06:46:07 2019][502583.107927]  [<ffffffffc0fc52a5>] ? ldlm_cancel_lock_for_export.isra.27+0x195/0x360 [ptlrpc]
[Tue Oct 22 06:46:07 2019][502583.116509]  [<ffffffffc0fc54ac>] ? ldlm_cancel_locks_for_export_cb+0x3c/0x50 [ptlrpc]
[Tue Oct 22 06:46:07 2019][502583.124535]  [<ffffffffc0a8efb0>] ? cfs_hash_for_each_relax+0x250/0x450 [libcfs]
[Tue Oct 22 06:46:07 2019][502583.132112]  [<ffffffffc0fc5470>] ? ldlm_cancel_lock_for_export.isra.27+0x360/0x360 [ptlrpc]
[Tue Oct 22 06:46:07 2019][502583.140690]  [<ffffffffc0fc5470>] ? ldlm_cancel_lock_for_export.isra.27+0x360/0x360 [ptlrpc]
[Tue Oct 22 06:46:07 2019][502583.149260]  [<ffffffffc0a92510>] ? cfs_hash_for_each_empty+0x80/0x1d0 [libcfs]
[Tue Oct 22 06:46:07 2019][502583.156714]  [<ffffffffc0fc57ba>] ? ldlm_export_cancel_locks+0xaa/0x180 [ptlrpc]
[Tue Oct 22 06:46:07 2019][502583.164257]  [<ffffffffc0fee888>] ? ldlm_bl_thread_main+0x7b8/0xa40 [ptlrpc]
[Tue Oct 22 06:46:07 2019][502583.171408]  [<ffffffffaf6d7c40>] ? wake_up_state+0x20/0x20
[Tue Oct 22 06:46:07 2019][502583.177127]  [<ffffffffc0fee0d0>] ? ldlm_handle_bl_callback+0x4f0/0x4f0 [ptlrpc]
[Tue Oct 22 06:46:07 2019][502583.184618]  [<ffffffffaf6c2e81>] ? kthread+0xd1/0xe0
[Tue Oct 22 06:46:07 2019][502583.189804]  [<ffffffffaf6c2db0>] ? insert_kthread_work+0x40/0x40
[Tue Oct 22 06:46:07 2019][502583.195988]  [<ffffffffafd77c24>] ? ret_from_fork_nospec_begin+0xe/0x21
[Tue Oct 22 06:46:07 2019][502583.202696]  [<ffffffffaf6c2db0>] ? insert_kthread_work+0x40/0x40

I'm attaching the full task dump taken while this was happening as fir-md1-s1_20191022.log.

We have restarted the MGS. Unmounting it made the server unresponsive.

Comment by Serguei Smirnov [ 22/Oct/19 ]

Hi Stephane,

If the @tcp NID seen on the server is the result of a misconfiguration on any of the clients, then the reported behaviour when removing the primary NID without evicting first may be expected, as the server will keep trying to rebuild its representation of the client.

Would it be possible to get configuration details from the server/router/client (via lnetctl export)?

It would make it easier to understand what is causing the issue and how ldlm is getting affected.
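For example, something like the following on each node (server, router, client) would capture the running LNet configuration as YAML (the output file name is just an example):

lnetctl export > /tmp/lnet-config-$(hostname -s).yaml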

Thanks,

Serguei.

Comment by Jesse Stroik [ 27/Nov/19 ]

Hi Serguei,

We're seeing a similar situation and I can reproduce it and provide you details. I added a comment to LU-11989 which includes a relevant snippet from our MDS logs: https://jira.whamcloud.com/browse/LU-11989?focusedCommentId=258916&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-258916

In our case, we have four cluster nodes that initially connected to one of our 2.12 file systems using the wrong NID, one behind a NAT. The NIDs trapped behind the NAT are 10.23.x.x@tcp.
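For what it's worth, one way to keep such clients from announcing the NATed tcp NID would be to pin the client's LNet network explicitly in /etc/modprobe.d/lustre.conf (a sketch; the network name and interface are placeholders for our setup):

# restrict the client to its IB NID so it cannot register with a 10.23.x.x@tcp NID
options lnet networks="o2ib(ib0)"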

Let me know if I can provide any other information.

Jesse Stroik
