[LU-13517] sanity test_812b: lfs quota should succeed Created: 05/May/20  Updated: 08/May/20

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Maloo Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This issue was created by maloo for Chris Horn <hornc@cray.com>

This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/7d35d4a5-67ab-46ef-80f1-9db5632510f5

test_812b failed with the following error:

lfs quota should succeed

Client dmesg shows:

[14761.105314] Lustre: DEBUG MARKER: /usr/sbin/lctl mark osc.lustre-OST0000-osc-ffff9418a40e1800.ost_server_uuid in CONNECTING state after 16 sec
[14765.511881] Lustre: 12533:0:(client.c:2244:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1588645836/real 1588645836]  req@ffff94189cd0d200 x1665810030678400/t0(0) o9->lustre-OST0000-osc-ffff9418a40e1800@10.9.4.109@tcp:28/4 lens 224/224 e 0 to 1 dl 1588645842 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:2.0'
[14766.670229] Lustre: DEBUG MARKER: osc.lustre-OST0000-osc-ffff9418a40e1800.ost_server_uuid in CONNECTING state after 16 sec
[14767.279200] LustreError: 11-0: lustre-OST0000-osc-ffff9418a40e1800: operation ost_connect to node 10.9.4.109@tcp failed: rc = -16
[14767.280448] LustreError: 20270:0:(osc_quota.c:308:osc_quotactl()) ptlrpc_queue_wait failed, rc: -5
[14767.996905] Lustre: DEBUG MARKER: /usr/sbin/lctl mark  sanity test_812b: @@@@@@ FAIL: lfs quota should succeed 

VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
sanity test_812b - lfs quota should succeed



 Comments   
Comment by Chris Horn [ 05/May/20 ]

test 812a and 812b do very similar things, so we can compare them to see where things went awry.

812a - Client: Sends disconnect RPC:

00000001:02000400:0.0:1588645784.707551:0:17493:0:(debug.c:492:libcfs_debug_mark_buffer()) DEBUG MARKER: trevis-10vm1.trevis.whamcloud.com: executing wait_import_state CONNECTING osc.lustre-OST0000-osc-ffff9418a40e1800.ost_server_uuid 40
00000100:00100000:1.0:1588645801.470595:0:12533:0:(client.c:1711:ptlrpc_send_new_req()) Sending RPC req@ffff94189cd0c900 pname:cluuid:pid:xid:nid:opc:job ptlrpcd_rcv:018ba189-a393-4:12533:1665810030674560:10.9.4.109@tcp:9:kworker/u4:3.0

812b - Client: Sends disconnect RPC:

00000001:02000400:1.0:1588645821.309597:0:19942:0:(debug.c:492:libcfs_debug_mark_buffer()) DEBUG MARKER: trevis-10vm1.trevis.whamcloud.com: executing wait_import_state CONNECTING osc.lustre-OST0000-osc-ffff9418a40e1800.ost_server_uuid 40
00000100:00100000:1.0:1588645836.526594:0:12533:0:(client.c:1711:ptlrpc_send_new_req()) Sending RPC req@ffff94189cd0d200 pname:cluuid:pid:xid:nid:opc:job ptlrpcd_rcv:018ba189-a393-4:12533:1665810030678400:10.9.4.109@tcp:9:kworker/u4:2.0

812a - Server: Receives disconnect RPC, delayed handling:

00000001:02000400:0.0:1588645784.388569:0:1338:0:(debug.c:492:libcfs_debug_mark_buffer()) DEBUG MARKER: trevis-10vm1.trevis.whamcloud.com: executing wait_import_state CONNECTING osc.lustre-OST0000-osc-ffff9418a40e1800.ost_server_uuid 40
00000100:00100000:1.0:1588645801.398631:0:7828:0:(service.c:2143:ptlrpc_server_handle_req_in()) got req x1665810030674560
00000100:00100000:1.0:1588645801.398667:0:7828:0:(service.c:2300:ptlrpc_server_handle_request()) Handling RPC req@ffff8ed5389f3a80 pname:cluuid+ref:pid:xid:nid:opc:job ll_ost00_037:018ba189-a393-4+5:12533:x1665810030674560:12345-10.9.4.107@tcp:9:kworker/u4:3.0

812b - Server: Receives disconnect RPC, delayed handling:

00000001:02000400:1.0:1588645820.951470:0:2333:0:(debug.c:492:libcfs_debug_mark_buffer()) DEBUG MARKER: trevis-10vm1.trevis.whamcloud.com: executing wait_import_state CONNECTING osc.lustre-OST0000-osc-ffff9418a40e1800.ost_server_uuid 40
00000100:00100000:1.0:1588645836.454391:0:29351:0:(service.c:2143:ptlrpc_server_handle_req_in()) got req x1665810030678400
00000100:00100000:1.0:1588645836.454426:0:29351:0:(service.c:2300:ptlrpc_server_handle_request()) Handling RPC req@ffff8ed52ba53a80 pname:cluuid+ref:pid:xid:nid:opc:job ll_ost00_030:018ba189-a393-4+5:12533:x1665810030678400:12345-10.9.4.107@tcp:9:kworker/u4:2.0

812a - Client: In CONNECTING

00000001:02000400:1.0:1588645802.505905:0:17709:0:(debug.c:492:libcfs_debug_mark_buffer()) DEBUG MARKER: /usr/sbin/lctl mark osc.lustre-OST0000-osc-ffff9418a40e1800.ost_server_uuid in CONNECTING state after 17 sec
00000001:02000400:0.0:1588645803.098041:0:17812:0:(debug.c:492:libcfs_debug_mark_buffer()) DEBUG MARKER: osc.lustre-OST0000-osc-ffff9418a40e1800.ost_server_uuid in CONNECTING state after 17 sec

812b - Client: In CONNECTING

00000001:02000400:0.0:1588645838.119645:0:20151:0:(debug.c:492:libcfs_debug_mark_buffer()) DEBUG MARKER: /usr/sbin/lctl mark osc.lustre-OST0000-osc-ffff9418a40e1800.ost_server_uuid in CONNECTING state after 16 sec

812a - Server: In CONNECTING

00000001:02000400:1.0:1588645802.188977:0:1344:0:(debug.c:492:libcfs_debug_mark_buffer()) DEBUG MARKER: /usr/sbin/lctl mark osc.lustre-OST0000-osc-ffff9418a40e1800.ost_server_uuid in CONNECTING state after 17 sec
00000001:02000400:1.0:1588645802.781945:0:1408:0:(debug.c:492:libcfs_debug_mark_buffer()) DEBUG MARKER: osc.lustre-OST0000-osc-ffff9418a40e1800.ost_server_uuid in CONNECTING state after 17 sec

812b - Server: In CONNECTING

00000001:02000400:1.0:1588645837.776525:0:2341:0:(debug.c:492:libcfs_debug_mark_buffer()) DEBUG MARKER: /usr/sbin/lctl mark osc.lustre-OST0000-osc-ffff9418a40e1800.ost_server_uuid in CONNECTING state after 16 sec
00000001:02000400:1.0:1588645838.378593:0:2405:0:(debug.c:492:libcfs_debug_mark_buffer()) DEBUG MARKER: osc.lustre-OST0000-osc-ffff9418a40e1800.ost_server_uuid in CONNECTING state after 16 sec

812a - Server: Handles disconnect RPC

00000001:02000400:1.0:1588645803.339396:0:1414:0:(debug.c:492:libcfs_debug_mark_buffer()) DEBUG MARKER: /usr/sbin/lctl set_param fail_loc=0 fail_val=0
00000100:00100000:1.0:1588645803.702376:0:7828:0:(service.c:2349:ptlrpc_server_handle_request()) Handled RPC req@ffff8ed5389f3a80 pname:cluuid+ref:pid:xid:nid:opc:job ll_ost00_037:018ba189-a393-4+2:12533:x1665810030674560:12345-10.9.4.107@tcp:9:kworker/u4:3.0 Request processed in 2303708us (2303789us total) trans 0 rc 0/0

812a - Client: Disconnect RPC reply

00000100:00080000:1.0:1588645803.775075:0:12533:0:(import.c:1779:ptlrpc_disconnect_idle_interpret()) @@@ inflight=2, refcount=6: rc = 0  req@ffff94189cd0c900 x1665810030674560/t0(0) o9->lustre-OST0000-osc-ffff9418a40e1800@10.9.4.109@tcp:28/4 lens 224/224 e 0 to 0 dl 1588645807 ref 1 fl Interpret:RNQU/0/0 rc 0/0 job:'kworker/u4:3.0'
00000100:00100000:1.0:1588645803.775151:0:12533:0:(client.c:2164:ptlrpc_check_set()) Completed RPC req@ffff94189cd0c900 pname:cluuid:pid:xid:nid:opc:job ptlrpcd_rcv:018ba189-a393-4:12533:1665810030674560:10.9.4.109@tcp:9:kworker/u4:3.0

812b - Client: Timeout disconnect RPC:

00000100:00000400:1.0:1588645842.526455:0:12533:0:(client.c:2244:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1588645836/real 1588645836]  req@ffff94189cd0d200 x1665810030678400/t0(0) o9->lustre-OST0000-osc-ffff9418a40e1800@10.9.4.109@tcp:28/4 lens 224/224 e 0 to 1 dl 1588645842 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:2.0'
00000100:00100000:1.0:1588645842.529484:0:12533:0:(client.c:2275:ptlrpc_expire_one_request()) @@@ err -110, sent_state=CONNECTING (now=CONNECTING)  req@ffff94189cd0d200 x1665810030678400/t0(0) o9->lustre-OST0000-osc-ffff9418a40e1800@10.9.4.109@tcp:28/4 lens 224/224 e 0 to 1 dl 1588645842 ref 1 fl Rpc:XNQU/0/ffffffff rc 0/-1 job:'kworker/u4:2.0'
00000100:00080000:1.0:1588645842.529492:0:12533:0:(import.c:1779:ptlrpc_disconnect_idle_interpret()) @@@ inflight=1, refcount=5: rc = -110  req@ffff94189cd0d200 x1665810030678400/t0(0) o9->lustre-OST0000-osc-ffff9418a40e1800@10.9.4.109@tcp:28/4 lens 224/224 e 0 to 1 dl 1588645842 ref 1 fl Interpret:EXNQU/0/ffffffff rc -110/-1 job:'kworker/u4:2.0'
00000100:00100000:1.0:1588645842.529499:0:12533:0:(client.c:2164:ptlrpc_check_set()) Completed RPC req@ffff94189cd0d200 pname:cluuid:pid:xid:nid:opc:job ptlrpcd_rcv:018ba189-a393-4:12533:1665810030678400:10.9.4.109@tcp:9:kworker/u4:2.0

812b - Server: Handles disconnect RPC

00000001:02000400:1.0:1588645843.912913:0:2411:0:(debug.c:492:libcfs_debug_mark_buffer()) DEBUG MARKER: /usr/sbin/lctl set_param fail_loc=0 fail_val=0
00000100:00100000:1.0:1588645844.257396:0:29351:0:(service.c:2349:ptlrpc_server_handle_request()) Handled RPC req@ffff8ed52ba53a80 pname:cluuid+ref:pid:xid:nid:opc:job ll_ost00_030:018ba189-a393-4+2:12533:x1665810030678400:12345-10.9.4.107@tcp:9:kworker/u4:2.0 Request processed in 7802987us (7803020us total) trans 0 rc 0/0

812b - Client: Drops disconnect RPC reply

00000001:02000400:1.0:1588645843.684878:0:20255:0:(debug.c:492:libcfs_debug_mark_buffer()) DEBUG MARKER: osc.lustre-OST0000-osc-ffff9418a40e1800.ost_server_uuid in CONNECTING state after 16 sec
00000400:00000100:0.0:1588645844.330131:0:12528:0:(lib-move.c:3907:lnet_parse_put()) Dropping PUT from 12345-10.9.4.109@tcp portal 4 match 1665810030678400 offset 224 length 224: 4

812a - Client: Reconnect to OST

00000100:00080000:1.0:1588645803.775085:0:12533:0:(import.c:544:import_select_connection()) lustre-OST0000-osc-ffff9418a40e1800: connect to NID 10.9.4.109@tcp last attempt 14697
00000100:00080000:1.0:1588645803.775089:0:12533:0:(import.c:627:import_select_connection()) lustre-OST0000-osc-ffff9418a40e1800: import ffff9418a40e3000 using connection 10.9.4.109@tcp/10.9.4.109@tcp
00000008:00100000:1.0:1588645803.775123:0:12533:0:(osc_request.c:2985:osc_reconnect()) ocd_connect_flags: 0xa0465af2e3640478 ocd_version: 34419968 ocd_grant: 8437760, lost: 0.
00000100:00100000:1.0:1588645803.775134:0:12533:0:(import.c:820:ptlrpc_connect_import_locked()) @@@ (re)connect request (timeout 5)  req@ffff94189cd0ed00 x1665810030674752/t0(0) o8->lustre-OST0000-osc-ffff9418a40e1800@10.9.4.109@tcp:28/4 lens 520/544 e 0 to 0 dl 0 ref 1 fl New:NQU/0/ffffffff rc 0/-1 job:''
00000100:00100000:1.0:1588645803.775173:0:12533:0:(client.c:1711:ptlrpc_send_new_req()) Sending RPC req@ffff94189cd0ed00 pname:cluuid:pid:xid:nid:opc:job ptlrpcd_rcv:018ba189-a393-4:12533:1665810030674752:10.9.4.109@tcp:8:ptlrpcd_rcv.0

812b - Client: Reconnect to OST

00000100:00080000:1.0:1588645844.290620:0:20270:0:(import.c:544:import_select_connection()) lustre-OST0000-osc-ffff9418a40e1800: connect to NID 10.9.4.109@tcp last attempt 14727
00000100:00080000:1.0:1588645844.290623:0:20270:0:(import.c:627:import_select_connection()) lustre-OST0000-osc-ffff9418a40e1800: import ffff9418a40e3000 using connection 10.9.4.109@tcp/10.9.4.109@tcp
00000008:00100000:1.0:1588645844.290638:0:20270:0:(osc_request.c:2985:osc_reconnect()) ocd_connect_flags: 0xa0465af2e3640478 ocd_version: 34419968 ocd_grant: 8437760, lost: 0.
00000100:00100000:1.0:1588645844.290646:0:20270:0:(import.c:820:ptlrpc_connect_import_locked()) @@@ (re)connect request (timeout 5)  req@ffff94189cd0e880 x1665810030678912/t0(0) o8->lustre-OST0000-osc-ffff9418a40e1800@10.9.4.109@tcp:28/4 lens 520/544 e 0 to 0 dl 0 ref 1 fl New:NQU/0/ffffffff rc 0/-1 job:''
00000100:00100000:1.0:1588645844.290669:0:12533:0:(client.c:1711:ptlrpc_send_new_req()) Sending RPC req@ffff94189cd0e880 pname:cluuid:pid:xid:nid:opc:job ptlrpcd_rcv:018ba189-a393-4:12533:1665810030678912:10.9.4.109@tcp:8:lfs.0

812a - Server: Handle connect RPC

00000100:00100000:1.0:1588645803.703084:0:7828:0:(service.c:2143:ptlrpc_server_handle_req_in()) got req x1665810030674752
00000100:00100000:1.0:1588645803.703100:0:7828:0:(nrs_fifo.c:179:nrs_fifo_req_get()) NRS start fifo request from 12345-10.9.4.107@tcp, seq: 142006
00000100:00100000:1.0:1588645803.703103:0:7828:0:(service.c:2300:ptlrpc_server_handle_request()) Handling RPC req@ffff8ed507f5d200 pname:cluuid+ref:pid:xid:nid:opc:job ll_ost00_037:0+-99:12533:x1665810030674752:12345-10.9.4.107@tcp:8:ptlrpcd_rcv.0
00010000:00080000:1.0:1588645803.703112:0:7828:0:(ldlm_lib.c:1288:target_handle_connect()) lustre-OST0000: connection from 018ba189-a393-4@10.9.4.107@tcp t0 exp           (null) cur 14629 last 0
00000020:00000080:1.0:1588645803.703122:0:7828:0:(genops.c:1363:class_connect()) connect: client 018ba189-a393-4, cookie 0x6041f8af644efcda
00002000:00100000:1.0:1588645803.703125:0:7828:0:(ofd_obd.c:238:ofd_parse_connect_data()) lustre-OST0000: cli 018ba189-a393-4/ffff8ed51ffa8400 ocd_connect_flags: 0xa0465af2e3640478 ocd_version: 20d3500 ocd_grant: 8437760 ocd_index: 0 ocd_group 0
00002000:00100000:1.0:1588645803.703130:0:7828:0:(ofd_obd.c:337:ofd_parse_connect_data()) lustre-OST0000: cli <unknown> supports cksum type f7, return f7
00000020:01000000:1.0:1588645803.703140:0:7828:0:(lprocfs_status_server.c:523:lprocfs_exp_setup()) using hash ffff8ed517793180
00002000:00080000:1.0:1588645803.703143:0:7828:0:(ofd_obd.c:476:ofd_obd_connect()) lustre-OST0000: get connection from MDS 0
00000100:00080000:1.0:1588645803.703152:0:7828:0:(import.c:86:import_set_state_nolock()) ffff8ed51bed2800 p���Ԏ��p�L6Վ��: changing import state from RECOVER to FULL
00000100:02000000:1.0:1588645803.703154:0:7828:0:(import.c:1622:ptlrpc_import_recovery_state_machine()) lustre-OST0000: Connection restored to 9b0a0fe9-8fa7-4 (at 10.9.4.107@tcp)
00000100:00100000:1.0:1588645803.703169:0:7828:0:(service.c:2349:ptlrpc_server_handle_request()) Handled RPC req@ffff8ed507f5d200 pname:cluuid+ref:pid:xid:nid:opc:job ll_ost00_037:018ba189-a393-4+6:12533:x1665810030674752:12345-10.9.4.107@tcp:8:ptlrpcd_rcv.0 Request processed in 66us (98us total) trans 0 rc 0/0

812b - Server: Handle connect RPC

00000100:00100000:1.0:1588645844.218486:0:7827:0:(service.c:2143:ptlrpc_server_handle_req_in()) got req x1665810030678912
00000100:00100000:1.0:1588645844.218514:0:7827:0:(nrs_fifo.c:179:nrs_fifo_req_get()) NRS start fifo request from 12345-10.9.4.107@tcp, seq: 142010
00000100:00100000:1.0:1588645844.218518:0:7827:0:(service.c:2300:ptlrpc_server_handle_request()) Handling RPC req@ffff8ed4f1831200 pname:cluuid+ref:pid:xid:nid:opc:job ll_ost00_036:0+-99:12533:x1665810030678912:12345-10.9.4.107@tcp:8:lfs.0
00010000:02000400:1.0:1588645844.218532:0:7827:0:(ldlm_lib.c:837:target_handle_reconnect()) lustre-OST0000: Client 018ba189-a393-4 (at 10.9.4.107@tcp) reconnecting
00010000:02000400:1.0:1588645844.220018:0:7827:0:(ldlm_lib.c:1268:target_handle_connect()) lustre-OST0000: Client 018ba189-a393-4 (at 10.9.4.107@tcp) refused connection, still busy with 6 references
00000100:00100000:1.0:1588645844.221141:0:7827:0:(service.c:2349:ptlrpc_server_handle_request()) Handled RPC req@ffff8ed4f1831200 pname:cluuid+ref:pid:xid:nid:opc:job ll_ost00_036:0+-99:12533:x1665810030678912:12345-10.9.4.107@tcp:8:lfs.0 Request processed in 2622us (2762us total) trans 0 rc -16/-16

812a - Client: Reconnect completes

00000100:00080000:1.0:1588645803.775698:0:12533:0:(import.c:86:import_set_state_nolock()) ffff9418a40e3000 lustre-OST0000_UUID: changing import state from CONNECTING to FULL

812b - Client: Reconnect fails - QUOTACTL fails as a result?

00000100:00100000:1.0:1588645844.295098:0:12533:0:(client.c:2164:ptlrpc_check_set()) Completed RPC req@ffff94189cd0e880 pname:cluuid:pid:xid:nid:opc:job ptlrpcd_rcv:018ba189-a393-4:12533:1665810030678912:10.9.4.109@tcp:8:lfs.0
00000100:02020000:1.0:1588645844.293876:0:12533:0:(client.c:1354:ptlrpc_check_status()) 11-0: lustre-OST0000-osc-ffff9418a40e1800: operation ost_connect to node 10.9.4.109@tcp failed: rc = -16
00000100:00100000:1.0:1588645844.295090:0:12533:0:(client.c:2847:ptlrpc_free_committed()) lustre-OST0000-osc-ffff9418a40e1800: committing for last_committed 30064774834 gen 5
00000100:00080000:1.0:1588645844.295095:0:12533:0:(import.c:86:import_set_state_nolock()) ffff9418a40e3000 lustre-OST0000_UUID: changing import state from CONNECTING to DISCONN
00000100:00080000:1.0:1588645844.295096:0:12533:0:(import.c:1408:ptlrpc_connect_interpret()) recovery of lustre-OST0000_UUID on 10.9.4.109@tcp failed (-16)
...
00000100:00100000:1.0:1588645844.295108:0:20270:0:(client.c:2164:ptlrpc_check_set()) Completed RPC req@ffff94189cd0d200 pname:cluuid:pid:xid:nid:opc:job lfs:018ba189-a393-4:20270:1665810030678976:10.9.4.109@tcp:19:lfs.0
00000001:00020000:1.0:1588645844.295125:0:20270:0:(osc_quota.c:308:osc_quotactl()) ptlrpc_queue_wait failed, rc: -5

The primary difference seems to be the timeout of the disconnect RPC on the client. I'm not super familiar with this code, but that timeout appears to have trickled down into eventually failing to recover the OST0 connection which in turn caused the -EIO for the quotactl.

Comment by Chris Horn [ 05/May/20 ]

This test failure was seen with patch https://review.whamcloud.com/#/c/38460/ . That patch modifies default LND timeout values. However, based on the logs above it doesn't appear that the patch is responsible for the failure. The timeout of the disconnect RPC happened after ~ 6 seconds, which is less than the new default LND timeout (16 seconds).

Generated at Sat Feb 10 03:01:57 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.