[LU-16220] lnet recovery_interval setting Created: 06/Oct/22 Updated: 26/Oct/22 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.12.8 |
| Fix Version/s: | None |
| Type: | Question/Request | Priority: | Major |
| Reporter: | Mahmoud Hanafi | Assignee: | Serguei Smirnov |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None | ||
| Attachments: |
|
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
If health_sensitivity=0 and the peer is offline, does recovery_interval play any role in how often the client pings the peer? |
| Comments |
| Comment by Peter Jones [ 07/Oct/22 ] |
|
Serguei, could you please advise? Thanks, Peter |
| Comment by Serguei Smirnov [ 07/Oct/22 ] |
|
Hi, Peer recovery shouldn't be happening if the health feature is disabled. You should be able to verify this with "lnetctl debug recovery --peer". There's a possibility that the client pings the server driven by the higher-level Lustre keepalive mechanism, which, if I remember correctly, pings the server several times within the obd_timeout period if there's no other traffic. Another possibility is the LND trying to reconnect on its own. If you could share a net debug log from the client trying to reconnect, we could clarify what is going on in your case. Thanks, Serguei.
|
| Comment by Mahmoud Hanafi [ 13/Oct/22 ] |
|
Got some debugging info.
00000100:00080000:13.0:1665694986.626629:0:3025:0:(import.c:86:import_set_state_nolock()) ffff990a2dd1a800 nbptest4-OST0064_UUID: changing import state from CONNECTING to DISCONN
00000100:00080000:13.0:1665694986.626629:0:3025:0:(import.c:1422:ptlrpc_connect_interpret()) recovery of nbptest4-OST0064_UUID on 10.151.27.139@o2ib failed (-110)
00000100:00080000:13.0:1665694986.627063:0:234:0:(pinger.c:247:ptlrpc_pinger_process_import()) efda8ea5-1d5f-3073-1623-a227220573a0->nbptest4-OST0064_UUID: level DISCONN/3 force 1 force_next 0 deactive 0 pingable 1 suppress 0
00000100:00080000:13.0:1665694986.627065:0:234:0:(recover.c:58:ptlrpc_initiate_recovery()) nbptest4-OST0064_UUID: starting recovery
00000100:00080000:13.0:1665694986.627065:0:234:0:(import.c:86:import_set_state_nolock()) ffff990a2dd1a800 nbptest4-OST0064_UUID: changing import state from DISCONN to CONNECTING
00000100:00080000:13.0:1665694986.627066:0:234:0:(import.c:544:import_select_connection()) nbptest4-OST0064-osc-ffff990a2dd1a000: connect to NID 10.151.27.138@o2ib last attempt 1897596
00000100:00080000:13.0:1665694986.627067:0:234:0:(import.c:544:import_select_connection()) nbptest4-OST0064-osc-ffff990a2dd1a000: connect to NID 10.151.27.139@o2ib last attempt 1897596
00000100:00080000:13.0:1665694986.627068:0:234:0:(import.c:544:import_select_connection()) nbptest4-OST0064-osc-ffff990a2dd1a000: connect to NID 10.151.27.141@o2ib last attempt 1897590
00000100:00080000:13.0:1665694986.627070:0:234:0:(import.c:544:import_select_connection()) nbptest4-OST0064-osc-ffff990a2dd1a000: connect to NID 10.151.27.140@o2ib last attempt 1897591
00000100:00080000:13.0:1665694986.627071:0:234:0:(import.c:616:import_select_connection()) nbptest4-OST0064-osc-ffff990a2dd1a000: Connection changing to nbptest4-OST0064 (at 10.151.27.141@o2ib)
00000100:00080000:13.0:1665694986.627073:0:234:0:(import.c:624:import_select_connection()) nbptest4-OST0064-osc-ffff990a2dd1a000: import ffff990a2dd1a800 using connection 10.151.27.141@o2ib/10.151.27.141@o2ib
00000100:00100000:13.0:1665694986.627076:0:234:0:(import.c:817:ptlrpc_connect_import_locked()) @@@ (re)connect request (timeout 5) req@ffff99040756b600 x1744618326358592/t0(0) o8->nbptest4-OST0064-osc-ffff990a2dd1a000@10.151.27.141@o2ib:28/4 lens 520/544 e 0 to 0 dl 0 ref 1 fl New:N/0/ffffffff rc 0/-1
00000100:00100000:13.0:1665694986.632012:0:3025:0:(client.c:2188:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1665694986/real 1665694986] req@ffff99040756b600 x1744618326358592/t0(0) o8->nbptest4-OST0064-osc-ffff990a2dd1a000@10.151.27.141@o2ib:28/4 lens 520/544 e 0 to 1 dl 1665695266 ref 1 fl Rpc:eXN/0/ffffffff rc 0/-1
00000100:00100000:13.0:1665694986.632016:0:3025:0:(client.c:2217:ptlrpc_expire_one_request()) @@@ err -110, sent_state=CONNECTING (now=CONNECTING) req@ffff99040756b600 x1744618326358592/t0(0) o8->nbptest4-OST0064-osc-ffff990a2dd1a000@10.151.27.141@o2ib:28/4 lens 520/544 e 0 to 1 dl 1665695266 ref 1 fl Rpc:eXN/0/ffffffff rc 0/-1
00000100:00080000:13.0:1665694986.632019:0:3025:0:(import.c:86:import_set_state_nolock()) ffff990a2dd1a800 nbptest4-OST0064_UUID: changing import state from CONNECTING to DISCONN
00000100:00080000:13.0:1665694986.632020:0:3025:0:(import.c:1422:ptlrpc_connect_interpret()) recovery of nbptest4-OST0064_UUID on 10.151.27.141@o2ib failed (-110)
00000100:00080000:13.0:1665694986.632471:0:234:0:(pinger.c:247:ptlrpc_pinger_process_import()) efda8ea5-1d5f-3073-1623-a227220573a0->nbptest4-OST0064_UUID: level DISCONN/3 force 1 force_next 0 deactive 0 pingable 1 suppress 0
00000100:00080000:13.0:1665694986.632473:0:234:0:(recover.c:58:ptlrpc_initiate_recovery()) nbptest4-OST0064_UUID: starting recovery
00000100:00080000:13.0:1665694986.632474:0:234:0:(import.c:86:import_set_state_nolock()) ffff990a2dd1a800 nbptest4-OST0064_UUID: changing import state from DISCONN to CONNECTING
00000100:00080000:13.0:1665694986.632476:0:234:0:(import.c:544:import_select_connection()) nbptest4-OST0064-osc-ffff990a2dd1a000: connect to NID 10.151.27.138@o2ib last attempt 1897596
00000100:00080000:13.0:1665694986.632477:0:234:0:(import.c:544:import_select_connection()) nbptest4-OST0064-osc-ffff990a2dd1a000: connect to NID 10.151.27.139@o2ib last attempt 1897596
00000100:00080000:13.0:1665694986.632479:0:234:0:(import.c:544:import_select_connection()) nbptest4-OST0064-osc-ffff990a2dd1a000: connect to NID 10.151.27.141@o2ib last attempt 1897596
00000100:00080000:13.0:1665694986.632481:0:234:0:(import.c:544:import_select_connection()) nbptest4-OST0064-osc-ffff990a2dd1a000: connect to NID 10.151.27.140@o2ib last attempt 1897591
00000100:00080000:13.0:1665694986.632483:0:234:0:(import.c:616:import_select_connection()) nbptest4-OST0064-osc-ffff990a2dd1a000: Connection changing to nbptest4-OST0064 (at 10.151.27.140@o2ib)
00000100:00080000:13.0:1665694986.632485:0:234:0:(import.c:624:import_select_connection()) nbptest4-OST0064-osc-ffff990a2dd1a000: import ffff990a2dd1a800 using connection 10.151.27.140@o2ib/10.151.27.140@o2ib |
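The reconnect cycle in a log like the one above can be summarized with a short script. The following is only a sketch: the entry layout in `ENTRY_RE` is inferred from the lines in this ticket rather than taken from Lustre documentation, and the names `ENTRY_RE`, `STATE_RE`, `FAIL_RE` and `summarize` are hypothetical helpers, not part of any Lustre tool. It extracts the import state transitions and the failed connect attempts (the -110 return code is ETIMEDOUT on Linux).

```python
import re

# Assumed entry layout, inferred from the log lines in this ticket:
# subsystem:mask:cpu:timestamp:?:pid:?:(file:line:function()) message
ENTRY_RE = re.compile(
    r"^(?P<subsys>[0-9a-f]{8}):(?P<mask>[0-9a-f]{8}):[^:]+:"
    r"(?P<ts>\d+\.\d+):\d+:(?P<pid>\d+):\d+:"
    r"\((?P<file>[^:]+):(?P<line>\d+):(?P<func>\w+)\(\)\) (?P<msg>.*)$"
)
STATE_RE = re.compile(r"changing import state from (\w+) to (\w+)")
FAIL_RE = re.compile(r"recovery of (\S+) on (\S+) failed \((-?\d+)\)")


def summarize(lines):
    """Return (state transitions, failed connect attempts) found in the lines.

    Transitions are (timestamp, old_state, new_state) tuples;
    failures are (timestamp, nid, return_code) tuples.
    Lines that do not match the assumed layout are skipped.
    """
    transitions, failures = [], []
    for raw in lines:
        entry = ENTRY_RE.match(raw.strip())
        if entry is None:
            continue
        ts, msg = float(entry.group("ts")), entry.group("msg")
        state = STATE_RE.search(msg)
        failed = FAIL_RE.search(msg)
        if state:
            transitions.append((ts, state.group(1), state.group(2)))
        elif failed:
            failures.append((ts, failed.group(2), int(failed.group(3))))
    return transitions, failures


# Two entries copied from the log in this ticket.
sample = [
    "00000100:00080000:13.0:1665694986.626629:0:3025:0:(import.c:86:import_set_state_nolock()) "
    "ffff990a2dd1a800 nbptest4-OST0064_UUID: changing import state from CONNECTING to DISCONN",
    "00000100:00080000:13.0:1665694986.626629:0:3025:0:(import.c:1422:ptlrpc_connect_interpret()) "
    "recovery of nbptest4-OST0064_UUID on 10.151.27.139@o2ib failed (-110)",
]
transitions, failures = summarize(sample)
print(transitions)
print(failures)
```

Run over the full log (one entry per line), this makes it easier to see which NIDs are tried in turn and how quickly the pinger restarts recovery after each failure.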
| Comment by Mahmoud Hanafi [ 13/Oct/22 ] |
|
Is there a way to pause/stop all client reconnect attempts? |
| Comment by Serguei Smirnov [ 18/Oct/22 ] |
|
Checked with adilger about this. Here's the summary. If the FS stays mounted on the clients: For OST connections, setting "osc.*.idle_timeout" prior to the servers going down should prevent reconnect attempts. Running "lctl set_param timeout=3600" will reduce the ping frequency, lengthening the ping interval to 15 min. A larger value can be used if needed. To avoid client evictions, both settings need to be restored before the servers come back up. Thanks, Serguei. |
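The figures in the comment above (timeout=3600 giving a 15 min ping interval) imply that the ping interval is one quarter of the timeout. A minimal sketch of that arithmetic follows, assuming the same ratio holds for other values and that the interval never drops below one second; `ping_interval` is a hypothetical helper for planning, not a Lustre API, and the sample timeouts other than 3600 are purely illustrative.

```python
def ping_interval(obd_timeout: int) -> int:
    """Ping interval in seconds for a given timeout (seconds).

    Assumption: interval = timeout / 4 with a floor of 1 s, inferred from
    the 3600 s -> 15 min figure quoted in this ticket.
    """
    return max(obd_timeout // 4, 1)


# Illustrative timeouts; only 3600 appears in the ticket.
for timeout in (100, 3600, 7200):
    seconds = ping_interval(timeout)
    print(f"timeout={timeout}s -> ping every {seconds}s ({seconds / 60:.1f} min)")
```

Under this assumption, choosing a timeout is a matter of multiplying the desired ping interval by four, and the original value should be recorded beforehand (for example with "lctl get_param timeout") so it can be restored before the servers return.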
| Comment by Mahmoud Hanafi [ 26/Oct/22 ] |
|
Thank you, I will test this setting. |