Lustre / LU-11644

LNet: Service thread inactive for 300s causes client evictions

Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • Affects Version/s: Lustre 2.10.5, Lustre 2.12.1

    Description

      After updating to 2.10.5 we are now seeing periods of mass evictions from the servers. On the server we see the following stack trace:

      Nov  7 11:33:12 nbp8-oss7 kernel: [531465.033253] Pid: 11080, comm: ll_ost01_220 3.10.0-693.21.1.el7.20180508.x86_64.lustre2105 #1 SMP Mon Aug 27 23:04:41 UTC 2018
      Nov  7 11:33:12 nbp8-oss7 kernel: [531465.033260] Call Trace:
      Nov  7 11:33:12 nbp8-oss7 kernel: [531465.033274]  [<ffffffffa0c1d0e0>] ptlrpc_set_wait+0x4c0/0x920 [ptlrpc]
      Nov  7 11:33:12 nbp8-oss7 kernel: [531465.038258]  [<ffffffffa0bdae43>] ldlm_run_ast_work+0xd3/0x3a0 [ptlrpc]
      Nov  7 11:33:12 nbp8-oss7 kernel: [531465.038285]  [<ffffffffa0bfbabb>] ldlm_glimpse_locks+0x3b/0x100 [ptlrpc]
      Nov  7 11:33:12 nbp8-oss7 kernel: [531465.038294]  [<ffffffffa10e78a4>] ofd_intent_policy+0x444/0xa40 [ofd]
      Nov  7 11:33:12 nbp8-oss7 kernel: [531465.038318]  [<ffffffffa0bda2ba>] ldlm_lock_enqueue+0x38a/0x980 [ptlrpc]
      Nov  7 11:33:12 nbp8-oss7 kernel: [531465.038346]  [<ffffffffa0c03b53>] ldlm_handle_enqueue0+0x9d3/0x16a0 [ptlrpc]
      Nov  7 11:33:12 nbp8-oss7 kernel: [531465.038390]  [<ffffffffa0c89262>] tgt_enqueue+0x62/0x210 [ptlrpc]
      Nov  7 11:33:12 nbp8-oss7 kernel: [531465.038425]  [<ffffffffa0c8ceca>] tgt_request_handle+0x92a/0x1370 [ptlrpc]
      Nov  7 11:33:12 nbp8-oss7 kernel: [531465.038455]  [<ffffffffa0c354bb>] ptlrpc_server_handle_request+0x23b/0xaa0 [ptlrpc]
      Nov  7 11:33:12 nbp8-oss7 kernel: [531465.038484]  [<ffffffffa0c394a2>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
      Nov  7 11:33:12 nbp8-oss7 kernel: [531465.038489]  [<ffffffff810b1131>] kthread+0xd1/0xe0
      Nov  7 11:33:12 nbp8-oss7 kernel: [531465.038492]  [<ffffffff816a14f7>] ret_from_fork+0x77/0xb0
      Nov  7 11:33:12 nbp8-oss7 kernel: [531465.038512]  [<ffffffffffffffff>] 0xffffffffffffffff
      Nov  7 11:33:12 nbp8-oss7 kernel: [531465.038515] LustreError: dumping log to /tmp/lustre-log.1541619192.11080
      Nov  7 11:33:14 nbp8-oss7 kernel: [531467.254898] LNet: Service thread pid 9724 was inactive for 303.19s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      Nov  7 11:33:14 nbp8-oss7 kernel: [531467.310852] Pid: 9724, comm: ll_ost01_019 3.10.0-693.21.1.el7.20180508.x86_64.lustre2105 #1 SMP Mon Aug 27 23:04:41 UTC 2018
      Nov  7 11:33:14 nbp8-oss7 kernel: [531467.310854] Call Trace:
      Nov  7 11:33:14 nbp8-oss7 kernel: [531467.310866]  [<ffffffffa0c1d0e0>] ptlrpc_set_wait+0x4c0/0x920 [ptlrpc]
      Nov  7 11:33:14 nbp8-oss7 kernel: [531467.332869]  [<ffffffffa0bdae43>] ldlm_run_ast_work+0xd3/0x3a0 [ptlrpc]
      Nov  7 11:33:14 nbp8-oss7 kernel: [531467.332902]  [<ffffffffa0bfbabb>] ldlm_glimpse_locks+0x3b/0x100 [ptlrpc]
      Nov  7 11:33:14 nbp8-oss7 kernel: [531467.332912]  [<ffffffffa10e78a4>] ofd_intent_policy+0x444/0xa40 [ofd]
      Nov  7 11:33:14 nbp8-oss7 kernel: [531467.332936]  [<ffffffffa0bda2ba>] ldlm_lock_enqueue+0x38a/0x980 [ptlrpc]
      Nov  7 11:33:15 nbp8-oss7 kernel: [531467.332988]  [<ffffffffa0c03b53>] ldlm_handle_enqueue0+0x9d3/0x16a0 [ptlrpc]
      Nov  7 11:33:15 nbp8-oss7 kernel: [531467.333032]  [<ffffffffa0c89262>] tgt_enqueue+0x62/0x210 [ptlrpc]
      Nov  7 11:33:15 nbp8-oss7 kernel: [531467.333067]  [<ffffffffa0c8ceca>] tgt_request_handle+0x92a/0x1370 [ptlrpc]
      Nov  7 11:33:15 nbp8-oss7 kernel: [531467.333099]  [<ffffffffa0c354bb>] ptlrpc_server_handle_request+0x23b/0xaa0 [ptlrpc]
      Nov  7 11:33:15 nbp8-oss7 kernel: [531467.333128]  [<ffffffffa0c394a2>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
      Nov  7 11:33:15 nbp8-oss7 kernel: [531467.333134]  [<ffffffff810b1131>] kthread+0xd1/0xe0
      Nov  7 11:33:15 nbp8-oss7 kernel: [531467.333137]  [<ffffffff816a14f7>] ret_from_fork+0x77/0xb0
      Nov  7 11:33:15 nbp8-oss7 kernel: [531467.333158]  [<ffffffffffffffff>] 0xffffffffffffffff
      

      We will upload the dumped log to ftp:/uploads/LU11613/lustre-log.1541619192.11080

      We didn't have rpctrace or dlmtrace enabled, so the log may not be very useful.

      This could be related to https://jira.whamcloud.com/browse/LU-11613

       

       

      Attachments

        1. client_evictions_charts.pdf
          63 kB
        2. eviction_s611.06.05.19
          23 kB
        3. lnet_metrics_during_eviction.pdf
          398 kB
        4. nasa_lu11644.patch
          15 kB
        5. s214_bt.20200108.18.21.23
          990 kB
        6. s618.out
          37 kB
        7. zero.io.top.20210130.19.31.37
          480 kB


          Activity


            mhanafi Mahmoud Hanafi added a comment -

            We need a way to debug this issue. We are still hitting it, and the only way to work around it is by enabling rpctrace.

            mhanafi Mahmoud Hanafi added a comment -

            Enabling +rpctrace slows the server down just enough for the issue not to occur.
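
            For reference, a minimal sketch of toggling these debug flags and dumping the debug buffer with lctl (the debug_mb value and output path are only illustrative):

                lctl set_param debug=+rpctrace         # add RPC tracing to the debug mask
                lctl set_param debug=+dlmtrace         # optional: add LDLM tracing as well
                lctl set_param debug_mb=1024           # illustrative debug buffer size in MB
                lctl dk /tmp/lustre-debug.$(date +%s)  # dump and clear the in-memory debug log after an event
                lctl set_param debug=-rpctrace         # remove the flag again afterwards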

            adilger Andreas Dilger added a comment -

            One issue with elevating OBD_PING priority is that these RPCs are mostly "dead weight", existing only to keep the servers informed that the clients are still alive. I think the idle_timeout feature would instead allow idle clients to disconnect and reduce the OBD_PING traffic on the servers.

            Another approach is to prevent evicting clients if the service threads are not making progress on the RPC queue, or if the clients already have a pending OBD_PING RPC in the request queue, even if it hasn't been processed yet. That avoids interrupting useful work by prioritizing OBD_PING processing under normal usage, but also avoids mass evictions in the rare case that the server threads are overloaded.

            As for the LDLM contention issue, it would be useful to get a +rpctrace (and +ldlmtrace if possible) debug log from one of these events, to see what the root of the problem is. Fixing OBD_PING only addresses the symptom.

            mhanafi Mahmoud Hanafi added a comment -

            What about raising the priority of OBD_PING RPCs? That would ensure they are not dropped.

            adilger Andreas Dilger added a comment -

            Correct, the patch only improves diagnostics for similar situations.
            pjones Peter Jones added a comment -

            Do I understand correctly that the patch that landed was not a fix for the issue, but just made the issue easier to debug?

            gerrit Gerrit Updater added a comment -

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/37193/
            Subject: LU-11644 ptlrpc: show target name in req_history
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 83b6c6608e94c05766ed1eddad1b7f2fee5e79fa
            adilger Andreas Dilger added a comment -

            (Quoting Mahmoud) "For very large clusters it would also be nice to set the obd_ping rate and eviction timeout independently of at_min. At 10k nodes our OSSes average ~4000 obd_ping RPCs/sec."

            For newer Lustre versions (2.12 client+server), clients will disconnect from an OST in the background if they are not actively communicating with it. This reduces the number of pings from an idle client, and also reduces the load during recovery because fewer clients are connected.
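
            As a rough sketch of using that feature (this assumes the 2.12 osc.*.idle_timeout client tunable; verify the parameter name and semantics on your build before relying on it):

                lctl get_param osc.*.idle_timeout      # current idle disconnect timeout per OSC, in seconds
                lctl set_param osc.*.idle_timeout=60   # example value; 0 is assumed to disable idle disconnect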

            mhanafi Mahmoud Hanafi added a comment -

            For now we have enabled rpctrace. Long term, I think elevating ping RPC priority may be a good option. For very large clusters it would also be nice to set the obd_ping rate and eviction timeout independently of at_min. At 10k nodes our OSSes average ~4000 obd_ping RPCs/sec.
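
            For context, the server-side timeout tunables mentioned here can be inspected with lctl (a sketch; exact parameter availability may vary by version):

                lctl get_param timeout          # obd_timeout
                lctl get_param at_min at_max    # adaptive timeout bounds
                lctl get_param at_history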

            gerrit Gerrit Updater added a comment -

            Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/37193
            Subject: LU-11644 ptlrpc: show target name in req_history
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 3d7e754be0a0aec62cd74c977cab159796b4b8a8
            adilger Andreas Dilger added a comment (edited) -

            It looks like 2.10.x has the patch https://review.whamcloud.com/17345 "LU-7470 nrs: extend TBF with NID/JobID/OPCode expression", which allows you to rate-limit RPCs by opcode. While I would not normally suggest limiting LDLM traffic, you could try this for opc 101 (LDLM_ENQUEUE) to see if the problem can be avoided. I'm not sure of the specific rate limits you want, but looking at the recently attached graph, 4000 seems like a reasonable cap (i.e. above normal usage, but enough to slow down the huge spike of incoming RPCs). You may have a more fine-grained breakdown of RPCs by type to see how many of the RPCs in that spike are LDLM_ENQUEUE.
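
            A hedged sketch of such a rule, assuming the LU-7470 opcode-based TBF syntax on the OST service ("ldlm_limit" is a made-up rule name, and the service name and exact rule grammar should be checked against the Lustre manual for the running version):

                lctl set_param ost.OSS.ost.nrs_policies="tbf opcode"
                lctl set_param ost.OSS.ost.nrs_tbf_rule="start ldlm_limit opcode={ldlm_enqueue} rate=4000"

                # inspect or remove the rule later
                lctl get_param ost.OSS.ost.nrs_tbf_rule
                lctl set_param ost.OSS.ost.nrs_tbf_rule="stop ldlm_limit"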

            If enabling +rpctrace debugging is too heavy-weight, you could also try the RPC history functionality: enable it via "lctl set_param ost.OSS.*.req_buffer_history_max=10000", then grab it every second via "lctl get_param ost.OSS.*.req_history >> /var/log/lustre_oss_history-$(date +%Y%m%d%H%M)" (make sure you have a rule in /etc/logrotate.d/lustre.conf to clean up those logs if you cannot reproduce the problem quickly), and then post-process it after an event to get a detailed RPC log (sort -u is needed to discard duplicate records in a given logfile). The fields are:

            timestamp:target_nid:client_nid:client_xid:req_length:req_state:arrival_time:expiry_time opc number
            6780438299461025792:192.168.20.1@tcp:12345-192.168.20.159@tcp:x1654112309314912:328:Complete:1578693813.544069:-1.455930s(-43.0s) opc 101
            

            Since the RPC history does not involve printing to the logs (it just saves the request into a list), it should have less overhead than the kernel debug logging, and allow you to capture information when the problem happens.
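
            Putting the above together, a small collection sketch using those commands (the filenames and the one-second interval are only illustrative):

                lctl set_param ost.OSS.*.req_buffer_history_max=10000

                while sleep 1; do
                    lctl get_param ost.OSS.*.req_history \
                        >> /var/log/lustre_oss_history-$(date +%Y%m%d%H%M)
                done

                # after an event, de-duplicate the records before analysis
                sort -u /var/log/lustre_oss_history-* > /tmp/oss_history.merged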


            People

              Assignee: ashehata Amir Shehata (Inactive)
              Reporter: mhanafi Mahmoud Hanafi
              Votes: 1
              Watchers: 15
