[LU-16220] lnet recovery_interval setting Created: 06/Oct/22  Updated: 26/Oct/22

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.8
Fix Version/s: None

Type: Question/Request
Priority: Major
Reporter: Mahmoud Hanafi
Assignee: Serguei Smirnov
Resolution: Unresolved
Votes: 0
Labels: None

Attachments: dk.rmmod3.r435i0n15.bz2

 Description   

If health_sensitivity=0 and the peer is offline, does recovery_interval play any role in how often the client pings the peer?
I ask because, when a filesystem is down, the clients trying to reconnect to the servers cause path-lookup storms on our InfiniBand fabric. Too many path lookups cause our subnet manager to lock up, requiring a restart.
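
The LNet knobs in question can be inspected and changed with lnetctl; a minimal sketch (the values shown are only examples, not recommendations):

# show the current global LNet settings, including health_sensitivity
lnetctl global show

# health feature disabled, as in this report
lnetctl set health_sensitivity 0

# recovery_interval (seconds) controls how often LNet re-probes
# interfaces/peers already placed in its recovery queues
lnetctl set recovery_interval 1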



 Comments   
Comment by Peter Jones [ 07/Oct/22 ]

Serguei

Could you please advise?

Thanks

Peter

Comment by Serguei Smirnov [ 07/Oct/22 ]

Hi,

Peer recovery shouldn't be happening if the health feature is disabled. You should be able to verify this with:

lnetctl debug recovery --peer 

There's a possibility that the client pings the server driven by the higher-level Lustre keepalive (pinger) mechanism, which, if I remember correctly, pings the server several times within the obd_timeout period if there's no other traffic. Another possibility is the LND trying to reconnect on its own.
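
If the pinger is the source, its cadence follows obd_timeout (it fires roughly every obd_timeout/4 seconds); a quick check on the client:

# current obd_timeout in seconds; the ptlrpc pinger wakes
# roughly every obd_timeout/4 seconds
lctl get_param timeout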

If you could share a net debug log from the client trying to reconnect, we could clarify what is going on in your case.
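
To capture that, something along these lines should work (a sketch; the buffer size and output path are just examples):

# enable net debugging on the client
lctl set_param debug=+net
# optionally enlarge the kernel debug buffer (size in MB)
lctl set_param debug_mb=512
# ... reproduce the reconnect attempts, then dump the log to a file
lctl dk /tmp/dk.client.log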

Thanks,

Serguei.


Comment by Mahmoud Hanafi [ 13/Oct/22 ]

Got some debugging info.
With the servers shut down, the clients are trying to reconnect. With our new filesystem, each target has 4 IP addresses to try. They keep trying over and over.


00000100:00080000:13.0:1665694986.626629:0:3025:0:(import.c:86:import_set_state_nolock()) ffff990a2dd1a800 nbptest4-OST0064_UUID: changing import state from CONNECTING to DISCONN
00000100:00080000:13.0:1665694986.626629:0:3025:0:(import.c:1422:ptlrpc_connect_interpret()) recovery of nbptest4-OST0064_UUID on 10.151.27.139@o2ib failed (-110)
00000100:00080000:13.0:1665694986.627063:0:234:0:(pinger.c:247:ptlrpc_pinger_process_import()) efda8ea5-1d5f-3073-1623-a227220573a0->nbptest4-OST0064_UUID: level DISCONN/3 force 1 force_next 0 deactive 0 pingable 1 suppress 0
00000100:00080000:13.0:1665694986.627065:0:234:0:(recover.c:58:ptlrpc_initiate_recovery()) nbptest4-OST0064_UUID: starting recovery
00000100:00080000:13.0:1665694986.627065:0:234:0:(import.c:86:import_set_state_nolock()) ffff990a2dd1a800 nbptest4-OST0064_UUID: changing import state from DISCONN to CONNECTING
00000100:00080000:13.0:1665694986.627066:0:234:0:(import.c:544:import_select_connection()) nbptest4-OST0064-osc-ffff990a2dd1a000: connect to NID 10.151.27.138@o2ib last attempt 1897596
00000100:00080000:13.0:1665694986.627067:0:234:0:(import.c:544:import_select_connection()) nbptest4-OST0064-osc-ffff990a2dd1a000: connect to NID 10.151.27.139@o2ib last attempt 1897596
00000100:00080000:13.0:1665694986.627068:0:234:0:(import.c:544:import_select_connection()) nbptest4-OST0064-osc-ffff990a2dd1a000: connect to NID 10.151.27.141@o2ib last attempt 1897590
00000100:00080000:13.0:1665694986.627070:0:234:0:(import.c:544:import_select_connection()) nbptest4-OST0064-osc-ffff990a2dd1a000: connect to NID 10.151.27.140@o2ib last attempt 1897591
00000100:00080000:13.0:1665694986.627071:0:234:0:(import.c:616:import_select_connection()) nbptest4-OST0064-osc-ffff990a2dd1a000: Connection changing to nbptest4-OST0064 (at 10.151.27.141@o2ib)
00000100:00080000:13.0:1665694986.627073:0:234:0:(import.c:624:import_select_connection()) nbptest4-OST0064-osc-ffff990a2dd1a000: import ffff990a2dd1a800 using connection 10.151.27.141@o2ib/10.151.27.141@o2ib
00000100:00100000:13.0:1665694986.627076:0:234:0:(import.c:817:ptlrpc_connect_import_locked()) @@@ (re)connect request (timeout 5)  req@ffff99040756b600 x1744618326358592/t0(0) o8->nbptest4-OST0064-osc-ffff990a2dd1a000@10.151.27.141@o2ib:28/4 lens 520/544 e 0 to 0 dl 0 ref 1 fl New:N/0/ffffffff rc 0/-1
00000100:00100000:13.0:1665694986.632012:0:3025:0:(client.c:2188:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1665694986/real 1665694986]  req@ffff99040756b600 x1744618326358592/t0(0) o8->nbptest4-OST0064-osc-ffff990a2dd1a000@10.151.27.141@o2ib:28/4 lens 520/544 e 0 to 1 dl 1665695266 ref 1 fl Rpc:eXN/0/ffffffff rc 0/-1
00000100:00100000:13.0:1665694986.632016:0:3025:0:(client.c:2217:ptlrpc_expire_one_request()) @@@ err -110, sent_state=CONNECTING (now=CONNECTING)  req@ffff99040756b600 x1744618326358592/t0(0) o8->nbptest4-OST0064-osc-ffff990a2dd1a000@10.151.27.141@o2ib:28/4 lens 520/544 e 0 to 1 dl 1665695266 ref 1 fl Rpc:eXN/0/ffffffff rc 0/-1
00000100:00080000:13.0:1665694986.632019:0:3025:0:(import.c:86:import_set_state_nolock()) ffff990a2dd1a800 nbptest4-OST0064_UUID: changing import state from CONNECTING to DISCONN
00000100:00080000:13.0:1665694986.632020:0:3025:0:(import.c:1422:ptlrpc_connect_interpret()) recovery of nbptest4-OST0064_UUID on 10.151.27.141@o2ib failed (-110)
00000100:00080000:13.0:1665694986.632471:0:234:0:(pinger.c:247:ptlrpc_pinger_process_import()) efda8ea5-1d5f-3073-1623-a227220573a0->nbptest4-OST0064_UUID: level DISCONN/3 force 1 force_next 0 deactive 0 pingable 1 suppress 0
00000100:00080000:13.0:1665694986.632473:0:234:0:(recover.c:58:ptlrpc_initiate_recovery()) nbptest4-OST0064_UUID: starting recovery
00000100:00080000:13.0:1665694986.632474:0:234:0:(import.c:86:import_set_state_nolock()) ffff990a2dd1a800 nbptest4-OST0064_UUID: changing import state from DISCONN to CONNECTING
00000100:00080000:13.0:1665694986.632476:0:234:0:(import.c:544:import_select_connection()) nbptest4-OST0064-osc-ffff990a2dd1a000: connect to NID 10.151.27.138@o2ib last attempt 1897596
00000100:00080000:13.0:1665694986.632477:0:234:0:(import.c:544:import_select_connection()) nbptest4-OST0064-osc-ffff990a2dd1a000: connect to NID 10.151.27.139@o2ib last attempt 1897596
00000100:00080000:13.0:1665694986.632479:0:234:0:(import.c:544:import_select_connection()) nbptest4-OST0064-osc-ffff990a2dd1a000: connect to NID 10.151.27.141@o2ib last attempt 1897596
00000100:00080000:13.0:1665694986.632481:0:234:0:(import.c:544:import_select_connection()) nbptest4-OST0064-osc-ffff990a2dd1a000: connect to NID 10.151.27.140@o2ib last attempt 1897591
00000100:00080000:13.0:1665694986.632483:0:234:0:(import.c:616:import_select_connection()) nbptest4-OST0064-osc-ffff990a2dd1a000: Connection changing to nbptest4-OST0064 (at 10.151.27.140@o2ib)
00000100:00080000:13.0:1665694986.632485:0:234:0:(import.c:624:import_select_connection()) nbptest4-OST0064-osc-ffff990a2dd1a000: import ffff990a2dd1a800 using connection 10.151.27.140@o2ib/10.151.27.140@o2ib 
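
Reading the trace: each connect request fails within milliseconds with -110, the pinger restarts recovery right away, and import_select_connection() rotates to the next of the four NIDs (10.151.27.138-141@o2ib), so the attempts cycle with essentially no back-off.
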
Comment by Mahmoud Hanafi [ 13/Oct/22 ]

Is there a way to pause/stop all client reconnect attempts? 

Comment by Serguei Smirnov [ 18/Oct/22 ]

Checked with adilger about this. Here's the summary.

If the FS stays mounted on the clients:

For OST connections, setting "osc.*.idle_timeout" before the servers go down should prevent reconnect attempts.

Setting "lctl set_param timeout=3600" will reduce the ping frequency to one ping every 15 min (the pinger fires every obd_timeout/4 seconds); a larger value can be used if needed.

To avoid client evictions, both settings need to be restored before the servers come back up.
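
Put together, a possible sequence (a sketch: the idle_timeout value is only an example, and the restore values assume the usual upstream defaults of timeout=100 and idle_timeout=20, so record your site's actual values first):

# record the current values so they can be restored later
lctl get_param timeout osc.*.idle_timeout

# before the servers go down:
lctl set_param osc.*.idle_timeout=5   # example value: let idle OSC imports disconnect
lctl set_param timeout=3600           # pinger interval becomes ~15 min

# before the servers come back up, restore the recorded values, e.g.:
lctl set_param timeout=100
lctl set_param osc.*.idle_timeout=20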

Thanks,

Serguei.

Comment by Mahmoud Hanafi [ 26/Oct/22 ]

Thank you, I will test these settings.
