[LU-11644] LNet: Service thread inactive for 300 causes client evictions

Details

    • Type: Bug
    • Priority: Major
    • Resolution: Unresolved
    • Affects Versions: Lustre 2.10.5, Lustre 2.12.1

    Description

      After updating to 2.10.5 we are now seeing periods of mass evictions from servers. On the server we see the following stack trace:

      Nov  7 11:33:12 nbp8-oss7 kernel: [531465.033253] Pid: 11080, comm: ll_ost01_220 3.10.0-693.21.1.el7.20180508.x86_64.lustre2105 #1 SMP Mon Aug 27 23:04:41 UTC 2018
      Nov  7 11:33:12 nbp8-oss7 kernel: [531465.033260] Call Trace:
      Nov  7 11:33:12 nbp8-oss7 kernel: [531465.033274]  [<ffffffffa0c1d0e0>] ptlrpc_set_wait+0x4c0/0x920 [ptlrpc]
      Nov  7 11:33:12 nbp8-oss7 kernel: [531465.038258]  [<ffffffffa0bdae43>] ldlm_run_ast_work+0xd3/0x3a0 [ptlrpc]
      Nov  7 11:33:12 nbp8-oss7 kernel: [531465.038285]  [<ffffffffa0bfbabb>] ldlm_glimpse_locks+0x3b/0x100 [ptlrpc]
      Nov  7 11:33:12 nbp8-oss7 kernel: [531465.038294]  [<ffffffffa10e78a4>] ofd_intent_policy+0x444/0xa40 [ofd]
      Nov  7 11:33:12 nbp8-oss7 kernel: [531465.038318]  [<ffffffffa0bda2ba>] ldlm_lock_enqueue+0x38a/0x980 [ptlrpc]
      Nov  7 11:33:12 nbp8-oss7 kernel: [531465.038346]  [<ffffffffa0c03b53>] ldlm_handle_enqueue0+0x9d3/0x16a0 [ptlrpc]
      Nov  7 11:33:12 nbp8-oss7 kernel: [531465.038390]  [<ffffffffa0c89262>] tgt_enqueue+0x62/0x210 [ptlrpc]
      Nov  7 11:33:12 nbp8-oss7 kernel: [531465.038425]  [<ffffffffa0c8ceca>] tgt_request_handle+0x92a/0x1370 [ptlrpc]
      Nov  7 11:33:12 nbp8-oss7 kernel: [531465.038455]  [<ffffffffa0c354bb>] ptlrpc_server_handle_request+0x23b/0xaa0 [ptlrpc]
      Nov  7 11:33:12 nbp8-oss7 kernel: [531465.038484]  [<ffffffffa0c394a2>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
      Nov  7 11:33:12 nbp8-oss7 kernel: [531465.038489]  [<ffffffff810b1131>] kthread+0xd1/0xe0
      Nov  7 11:33:12 nbp8-oss7 kernel: [531465.038492]  [<ffffffff816a14f7>] ret_from_fork+0x77/0xb0
      Nov  7 11:33:12 nbp8-oss7 kernel: [531465.038512]  [<ffffffffffffffff>] 0xffffffffffffffff
      Nov  7 11:33:12 nbp8-oss7 kernel: [531465.038515] LustreError: dumping log to /tmp/lustre-log.1541619192.11080
      Nov  7 11:33:14 nbp8-oss7 kernel: [531467.254898] LNet: Service thread pid 9724 was inactive for 303.19s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      Nov  7 11:33:14 nbp8-oss7 kernel: [531467.310852] Pid: 9724, comm: ll_ost01_019 3.10.0-693.21.1.el7.20180508.x86_64.lustre2105 #1 SMP Mon Aug 27 23:04:41 UTC 2018
      Nov  7 11:33:14 nbp8-oss7 kernel: [531467.310854] Call Trace:
      Nov  7 11:33:14 nbp8-oss7 kernel: [531467.310866]  [<ffffffffa0c1d0e0>] ptlrpc_set_wait+0x4c0/0x920 [ptlrpc]
      Nov  7 11:33:14 nbp8-oss7 kernel: [531467.332869]  [<ffffffffa0bdae43>] ldlm_run_ast_work+0xd3/0x3a0 [ptlrpc]
      Nov  7 11:33:14 nbp8-oss7 kernel: [531467.332902]  [<ffffffffa0bfbabb>] ldlm_glimpse_locks+0x3b/0x100 [ptlrpc]
      Nov  7 11:33:14 nbp8-oss7 kernel: [531467.332912]  [<ffffffffa10e78a4>] ofd_intent_policy+0x444/0xa40 [ofd]
      Nov  7 11:33:14 nbp8-oss7 kernel: [531467.332936]  [<ffffffffa0bda2ba>] ldlm_lock_enqueue+0x38a/0x980 [ptlrpc]
      Nov  7 11:33:15 nbp8-oss7 kernel: [531467.332988]  [<ffffffffa0c03b53>] ldlm_handle_enqueue0+0x9d3/0x16a0 [ptlrpc]
      Nov  7 11:33:15 nbp8-oss7 kernel: [531467.333032]  [<ffffffffa0c89262>] tgt_enqueue+0x62/0x210 [ptlrpc]
      Nov  7 11:33:15 nbp8-oss7 kernel: [531467.333067]  [<ffffffffa0c8ceca>] tgt_request_handle+0x92a/0x1370 [ptlrpc]
      Nov  7 11:33:15 nbp8-oss7 kernel: [531467.333099]  [<ffffffffa0c354bb>] ptlrpc_server_handle_request+0x23b/0xaa0 [ptlrpc]
      Nov  7 11:33:15 nbp8-oss7 kernel: [531467.333128]  [<ffffffffa0c394a2>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
      Nov  7 11:33:15 nbp8-oss7 kernel: [531467.333134]  [<ffffffff810b1131>] kthread+0xd1/0xe0
      Nov  7 11:33:15 nbp8-oss7 kernel: [531467.333137]  [<ffffffff816a14f7>] ret_from_fork+0x77/0xb0
      Nov  7 11:33:15 nbp8-oss7 kernel: [531467.333158]  [<ffffffffffffffff>] 0xffffffffffffffff
      

      Will upload to ftp:/uploads/LU11613/lustre-log.1541619192.11080

      We didn't have rpctrace or dlmtrace enabled, so the log may not be very useful.

      Could be related to https://jira.whamcloud.com/browse/LU-11613

       

       

      Attachments

        1. client_evictions_charts.pdf
          63 kB
        2. eviction_s611.06.05.19
          23 kB
        3. lnet_metrics_during_eviction.pdf
          398 kB
        4. nasa_lu11644.patch
          15 kB
        5. s214_bt.20200108.18.21.23
          990 kB
        6. s618.out
          37 kB
        7. zero.io.top.20210130.19.31.37
          480 kB

        Issue Links

          Activity

            [LU-11644] LNet: Service thread inactive for 300 causes client evictions

            Mahmoud, do you know what the client application is doing at this point in the run? Glimpse RPCs are generated when clients do stat() operations on files to get the size; these send an LDLM glimpse RPC for the OST object(s) in the file, which may in turn cause the OST to send RPCs to the client(s) holding the locks for the file if it is actively being written. So if multiple clients were doing parallel directory tree traversal in the same directory where other clients are writing, that could generate a lot of glimpses, as could an application that calls stat() repeatedly on a shared file for some reason (e.g. to poll for updates/completion).

            adilger Andreas Dilger added a comment
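            A minimal sketch of that access pattern, assuming one client node runs it as the writer while several other client nodes run it as pollers; each stat() from a non-writing client sends an LDLM glimpse enqueue to the OST, which then glimpses the writer's extent lock. The shared file path is hypothetical; adjust it to your mount point.

            import os
            import sys
            import time

            SHARED_FILE = "/mnt/lustre/shared_output.dat"  # hypothetical Lustre path

            def writer():
                # Keep appending so this client keeps the extent lock busy.
                with open(SHARED_FILE, "ab") as f:
                    while True:
                        f.write(b"x" * 4096)
                        f.flush()
                        time.sleep(0.01)

            def poller():
                # Each stat() from a client that does not hold the extent lock
                # results in a glimpse enqueue to the OST for the file size.
                while True:
                    os.stat(SHARED_FILE)

            if __name__ == "__main__":
                if len(sys.argv) > 1 and sys.argv[1] == "writer":
                    writer()
                else:
                    poller()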

            I was able to get a backtrace of all threads when the server's ping RPCs drop to zero. It shows 508 out of 512 ll_ost threads in ldlm_run_ast_work. This must block receiving all other RPCs.

            What options do we have to slow down the rate of ldlm_glimpse_enqueues?

            s214_bt.20200108.18.21.23

            mhanafi Mahmoud Hanafi added a comment
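            A rough sketch of how such a per-function tally can be produced from the attached backtrace dump, assuming the dump uses the same per-thread layout as the kernel traces above (a "comm: ll_ost..." header line followed by "[<addr>] function+0x..." frames); the script name is hypothetical.

            import re
            import sys
            from collections import Counter

            # usage: python3 tally_bt.py s214_bt.20200108.18.21.23
            threads = 0
            per_func_threads = Counter()
            in_ll_ost = False
            seen = set()

            with open(sys.argv[1]) as f:
                for line in f:
                    m = re.search(r'comm:\s+(\S+)', line)
                    if m:
                        in_ll_ost = m.group(1).startswith('ll_ost')
                        if in_ll_ost:
                            threads += 1
                            seen = set()   # functions already counted for this thread
                        continue
                    if not in_ll_ost:
                        continue
                    fn = re.search(r'\[<[0-9a-f]+>\]\s+([A-Za-z_][\w.]*)', line)
                    if fn and fn.group(1) not in seen:
                        seen.add(fn.group(1))
                        per_func_threads[fn.group(1)] += 1

            print(f"{threads} ll_ost threads total")
            for func, n in per_func_threads.most_common(15):
                print(f"{n:5d}  {func}")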

            I am trying to create a reproducer for the case where an OSS gets a large spike in ldlm_glimpse_enqueue RPCs. What is the best way to recreate this RPC workload?

            mhanafi Mahmoud Hanafi added a comment

            I was able to capture some RPC rates before client evictions. They showed that the server gets a large spike in ldlm_glimpse_enqueue RPCs that starves out the ping RPCs. I have two charts that show this. Somehow, when we enable +net debugging it slows things down and the ping RPCs don't get blocked. client_evictions_charts.pdf
            What can cause such a large spike in ldlm_glimpse_enqueue?

            mhanafi Mahmoud Hanafi added a comment
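            For reference, a quick sketch of how such per-interval rates can be sampled on the OSS by diffing cumulative counters from lctl. The parameter path (ost.OSS.ost.stats) and the counter names are assumptions about where ldlm_glimpse_enqueue and obd_ping are counted on this setup; adjust them to whatever the stats file actually reports.

            import subprocess
            import time

            PARAM = "ost.OSS.ost.stats"    # assumed location of the RPC counters
            COUNTERS = ("ldlm_glimpse_enqueue", "obd_ping")
            INTERVAL = 5                   # seconds between samples

            def read_counters():
                out = subprocess.check_output(["lctl", "get_param", "-n", PARAM],
                                              text=True)
                vals = {}
                for line in out.splitlines():
                    fields = line.split()
                    if fields and fields[0] in COUNTERS:
                        vals[fields[0]] = int(fields[1])   # second field: sample count
                return vals

            prev = read_counters()
            while True:
                time.sleep(INTERVAL)
                cur = read_counters()
                for name in COUNTERS:
                    rate = (cur.get(name, 0) - prev.get(name, 0)) / INTERVAL
                    print(f"{name}: {rate:.1f}/s", end="  ")
                print()
                prev = cur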

            Because we had the crash with the larger peers.lndprs_num_peers, we wanted to do more testing before installing it on the production filesystem. I haven't had time to get back to this yet... Since things are stable with +net debugging, it's a bit lower in priority.

            mhanafi Mahmoud Hanafi added a comment
            pjones Peter Jones added a comment -

            Mahmoud

            Do you have the dedicated time scheduled to run this test yet?

            Peter


            Hi Mahmoud, I noted this limitation in my initial comment about the patch. I wanted to get the patch out, and it would have taken longer to implement an iterative way of pulling up the peers in 4K chunks. But as long as the 16K value works, it should be OK.

            ashehata Amir Shehata (Inactive) added a comment
            mhanafi Mahmoud Hanafi added a comment - edited

            Found an issue with the debug patch. It is only reporting the first 4097 peers.

            nbptest2-srv1 ~ # ls -l /proc/fs/lustre/mdt/nbptest2-MDT0000/exports/| wc -l
            14260
            nbptest2-srv1 ~ # lnetctl peer show --lnd --net o2ib |grep ' nid:' | wc -l
            4097

            We recompiled with a larger value of peers.lndprs_num_peers = 16K.


            I tested the patch with 8000+ clients and saw no issues. It will take a few weeks to schedule dedicated time on our production filesystem. Will report back once I have more data.

             

            mhanafi Mahmoud Hanafi added a comment

            Thanks, will test and report back results.

            mhanafi Mahmoud Hanafi added a comment
            ashehata Amir Shehata (Inactive) added a comment - edited

            This took me a bit longer to get through.

            Here is a debug patch which can be used to monitor the internal iblnd queues and trigger an action when the queues get too large. I tested it locally, but I don't have a large cluster. So it'll be a good idea to test it on a larger cluster before deploying it on a live system.

            Patch is nasa_lu11644.patch.

            You can run:

            lnetctl peer show --lnd --net <net type: ex o2ib1> 

            This will dump output like:

             [root@trevis-407 ~]# lnetctl peer show --lnd --net o2ib
            lnd_peer:
                - nid: 172.16.1.6@o2ib
                  ni_nid: 172.16.1.7@o2ib
                  num_conns: 1
                  tx_queue: 0
                  accepting: 0
                  connecting: 0
                  reconnecting: 0
                  conn_races: 0
                  reconnected: 0
                  refcount: 2
                  max_frags: 256
                  queue_depth: 8
                  conns:
                    - refcount: 0
                      credits: 20
                      outstanding_credits: 8
                      reserved_credits: 1
                      early_rxs: 8
                      tx_noops: 0
                      tx_active: 0
                      tx_queue_nocred: 0
                      tx_queue_rsrvd: 0
                      tx_queue: 0
                      conn_state: 0
                      sends_posted: 3
                      queue_depth: 0
                      max_frags: 0
                - nid: 172.16.1.7@o2ib
                  ni_nid: 172.16.1.7@o2ib
                  num_conns: 2
                  tx_queue: 0
                  accepting: 0
                  connecting: 0
                  reconnecting: 0
                  conn_races: 0
                  reconnected: 0
                  refcount: 3
                  max_frags: 256
                  queue_depth: 8
                  conns:
                    - refcount: 0
                      credits: 20
                      outstanding_credits: 7
                      reserved_credits: 0
                      early_rxs: 8
                      tx_noops: 0
                      tx_active: 0
                      tx_queue_nocred: 0
                      tx_queue_rsrvd: 0
                      tx_queue: 0
                      conn_state: 0
                      sends_posted: 3
                      queue_depth: 0
                      max_frags: 0
                    - refcount: 16777224
                      credits: 20
                      outstanding_credits: 8
                      reserved_credits: 1
                      early_rxs: 8
                      tx_noops: 0
                      tx_active: 0
                      tx_queue_nocred: 0
                      tx_queue_rsrvd: 0
                      tx_queue: 0
                      conn_state: 0
                      sends_posted: 3
                      queue_depth: 0
                      max_frags: 0
            

            Under the conns section you can monitor the different queue sizes: tx_active is the one of interest at the moment. This will require a bit of experimentation on a heavily loaded system, to see the average size of these queues. Then you can have a python script (or similar) to trigger an action whenever the queues grow beyond the expected average. The discussed action was to initiate some MLNX debugging to capture more data.

            Note the patch grabs the first 4096 peers only.

            An example Python script is below. I used o2ib for the network and 300 for the expected average queue size. For the action, I simply print an output.

            import subprocess
            import time

            import yaml

            # Poll the iblnd peer/connection stats every 10 seconds and flag any
            # connection whose tx_active queue grows beyond the expected average
            # (here 300, on the o2ib network; adjust both for your site).
            while True:
                output = subprocess.check_output(
                    ['lnetctl', 'peer', 'show', '--lnd', '--net', 'o2ib'])
                y = yaml.safe_load(output) or {}
                for peer in y.get('lnd_peer', []):
                    for conn in peer.get('conns', []):
                        if conn['tx_active'] > 300:
                            print("peer:", peer['nid'],
                                  "tx_active is growing too large:", conn['tx_active'])
                time.sleep(10)
            
            

            People

              ashehata Amir Shehata (Inactive)
              mhanafi Mahmoud Hanafi
              Votes: 1
              Watchers: 15
