[LU-11236] client MDT OST ENOTCONN loops Created: 13/Aug/18  Updated: 13/Aug/18

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: John Hammond Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: None

Issue Links:
Related
is related to LU-11119 A 'mv' of a file from a local file sy... Resolved
is related to LU-11227 client process hangs when lod_sync ac... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

When an OST is deactivated, some MDT-to-OST operations return -ENOTCONN. In some cases -ENOTCONN is then returned to the client as the Lustre message status, which the client ptlrpc code interprets as meaning that the client has been "abruptly disconnected" from the MDT. The client then reconnects and resends the operation, causing a loop.
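
The loop described above can be illustrated with a toy model. This is only a sketch and not Lustre code: the ToyMDT class, client_setattr() and the reactivate_after parameter are hypothetical names invented for illustration, and the model assumes the MDT passes the OST error straight through as the reply status.

```python
ENOTCONN = 107  # Linux errno value, matching the -107 seen in the logs


class ToyMDT:
    """Hypothetical stand-in for the MDT; not Lustre code."""

    def __init__(self):
        self.ost_active = False  # models osp...active=0
        self.connects = 0

    def connect(self):
        self.connects += 1

    def setattr(self):
        # While the OST is deactivated the sync to the OST fails, and the
        # error is passed back to the client as the reply status.
        return 0 if self.ost_active else -ENOTCONN


def client_setattr(mdt, reactivate_after, max_attempts=100):
    """Send the request, treating -ENOTCONN as 'connection lost'.

    Returns the number of sends needed before the request succeeded.
    """
    for attempt in range(1, max_attempts + 1):
        if attempt > reactivate_after:
            mdt.ost_active = True  # models the admin setting active=1
        rc = mdt.setattr()
        if rc == -ENOTCONN:
            mdt.connect()  # reconnect, then resend the same request
            continue
        return attempt
    raise RuntimeError("still looping after %d attempts" % max_attempts)


mdt = ToyMDT()
sends = client_setattr(mdt, reactivate_after=5)
print(sends, mdt.connects)  # prints "6 5"
```

In this model the client never makes progress on its own: the request only completes once the OST is marked active again, which matches the reproducer, where the chgrp finishes only after active=1 is set.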

o:~# bash $LUSTRE/tests/llmount.sh
...
o:~# lctl set_param osp.lustre-OST0000-osc-MDT0000.active=0
osp.lustre-OST0000-osc-MDT0000.active=0
o:~# touch /mnt/lustre/f0
o:~# chown sanity: /mnt/lustre/f0
o:~# lctl set_param debug=+trace
debug=+trace
o:~# lctl set_param debug_mb=64
debug_mb=64
o:~# lctl clear
o:~# sudo -u sanity chgrp gsanity0 /mnt/lustre/f0 &
[1] 12744
o:~# sleep 4
o:~# lctl set_param osp.lustre-OST0000-osc-MDT0000.active=1
osp.lustre-OST0000-osc-MDT0000.active=1
o:~# wait
[1]+  Done                    sudo -u sanity chgrp gsanity0 /mnt/lustre/f0
o:~# lctl dk > /tmp/1.dk

With lots of lines elided:

00000100:00100000:0.0:1534175675.393208:0:12746:0:(client.c:1625:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc chgrp:7dc6380a-74bd-0c12-bf7e-6c11809b0af5:12746:1608699144600192:0@lo:36
00000100:00100000:0.0:1534175675.393458:0:11330:0:(service.c:2129:ptlrpc_server_handle_request()) Handling RPC pname:cluuid+ref:pid:xid:nid:opc mdt00_001:7dc6380a-74bd-0c12-bf7e-6c11809b0af5+13:12746:x1608699144600192:12345-0@lo:36
00000004:00000001:0.0:1534175675.393963:0:11330:0:(osp_dev.c:815:osp_sync()) Process leaving (rc=18446744073709551509 : -107 : ffffffffffffff95)
00000004:00020000:0.0:1534175675.393967:0:11330:0:(lod_dev.c:1415:lod_sync()) lustre-MDT0000-mdtlov: can't sync ost 0: -107
00000004:00000001:0.0:1534175675.429293:0:11330:0:(lod_dev.c:1422:lod_sync()) Process leaving (rc=18446744073709551509 : -107 : ffffffffffffff95)
00000004:00000001:0.0:1534175675.429298:0:11330:0:(mdd_object.c:1174:mdd_attr_set()) Process leaving via out (rc=18446744073709551509 : -107 : 0xffffffffffffff95)
00000100:00100000:0.0:1534175675.429886:0:11330:0:(service.c:2179:ptlrpc_server_handle_request()) Handled RPC pname:cluuid+ref:pid:xid:nid:opc mdt00_001:7dc6380a-74bd-0c12-bf7e-6c11809b0af5+8:12746:x1608699144600192:12345-0@lo:36 Request processed in 36428us (36621us total) trans 0 rc -107/-107
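
The "Process leaving" lines print the same return code three ways: as an unsigned 64-bit decimal, as a signed decimal, and in hex. A minimal sketch of how the three forms relate, assuming a 64-bit two's-complement interpretation (to_signed64() is a helper name made up for this example):

```python
import errno


def to_signed64(value):
    """Reinterpret an unsigned 64-bit integer as a signed one."""
    value &= (1 << 64) - 1
    return value - (1 << 64) if value >= (1 << 63) else value


rc_unsigned = 18446744073709551509  # first form in the log line
rc_hex = 0xffffffffffffff95         # third form in the log line

print(to_signed64(rc_unsigned))  # prints -107
print(to_signed64(rc_hex))       # prints -107

# On Linux, errno 107 is ENOTCONN; other platforms may number it differently.
print(errno.errorcode.get(107))
```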

Client:
00000100:00000001:0.0:1534175675.430424:0:12746:0:(client.c:1284:ptlrpc_check_status()) Process leaving (rc=18446744073709551509 : -107 : ffffffffffffff95)
00000100:00000001:0.0:1534175675.430429:0:12746:0:(recover.c:232:ptlrpc_request_handle_notconn()) Process entered
00000100:00080000:0.0:1534175675.430430:0:12746:0:(recover.c:236:ptlrpc_request_handle_notconn()) import lustre-MDT0000-mdc-ffff8eb4e8bb6800 of lustre-MDT0000_UUID@192.168.122.131@tcp abruptly disconnected: reconnecting
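
To see how many times the client went around the loop, the dump in /tmp/1.dk can be scanned for the "abruptly disconnected" message. The following is a rough sketch: the regular expression is inferred only from the lines quoted in this ticket, and the field names (subsys, mask, cpu, stack, extpid) are assumptions about what each column means rather than an authoritative description of the debug log format.

```python
import re

# Pattern inferred from the log excerpts above; field names are assumptions.
LINE_RE = re.compile(
    r"^(?P<subsys>[0-9a-f]{8}):(?P<mask>[0-9a-f]{8}):(?P<cpu>[^:]+):"
    r"(?P<time>\d+\.\d+):(?P<stack>\d+):(?P<pid>\d+):(?P<extpid>\d+):"
    r"\((?P<file>[^:]+):(?P<line>\d+):(?P<func>\w+)\(\)\) (?P<msg>.*)$"
)


def parse_line(line):
    """Return a dict of fields, or None if the line does not match."""
    m = LINE_RE.match(line.strip())
    return m.groupdict() if m else None


def count_reconnects(lines):
    """Count 'abruptly disconnected' messages from the notconn handler."""
    total = 0
    for line in lines:
        rec = parse_line(line)
        if (rec and rec["func"] == "ptlrpc_request_handle_notconn"
                and "abruptly disconnected" in rec["msg"]):
            total += 1
    return total


# Three lines copied from the excerpts in this ticket.
sample = [
    "00000004:00020000:0.0:1534175675.393967:0:11330:0:"
    "(lod_dev.c:1415:lod_sync()) lustre-MDT0000-mdtlov: can't sync ost 0: -107",
    "00000100:00000001:0.0:1534175675.430429:0:12746:0:"
    "(recover.c:232:ptlrpc_request_handle_notconn()) Process entered",
    "00000100:00080000:0.0:1534175675.430430:0:12746:0:"
    "(recover.c:236:ptlrpc_request_handle_notconn()) import "
    "lustre-MDT0000-mdc-ffff8eb4e8bb6800 of "
    "lustre-MDT0000_UUID@192.168.122.131@tcp abruptly disconnected: reconnecting",
]

print(count_reconnects(sample))  # prints 1

# Against the real dump (path taken from the reproducer above):
# with open("/tmp/1.dk") as f:
#     print(count_reconnects(f))
```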
