[LU-13517] sanity test_812b: lfs quota should succeed Created: 05/May/20 Updated: 08/May/20 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Maloo | Assignee: | WC Triage |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None | ||
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
This issue was created by maloo for Chris Horn <hornc@cray.com> This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/7d35d4a5-67ab-46ef-80f1-9db5632510f5 test_812b failed with the following error: lfs quota should succeed Client dmesg shows: [14761.105314] Lustre: DEBUG MARKER: /usr/sbin/lctl mark osc.lustre-OST0000-osc-ffff9418a40e1800.ost_server_uuid in CONNECTING state after 16 sec [14765.511881] Lustre: 12533:0:(client.c:2244:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1588645836/real 1588645836] req@ffff94189cd0d200 x1665810030678400/t0(0) o9->lustre-OST0000-osc-ffff9418a40e1800@10.9.4.109@tcp:28/4 lens 224/224 e 0 to 1 dl 1588645842 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:2.0' [14766.670229] Lustre: DEBUG MARKER: osc.lustre-OST0000-osc-ffff9418a40e1800.ost_server_uuid in CONNECTING state after 16 sec [14767.279200] LustreError: 11-0: lustre-OST0000-osc-ffff9418a40e1800: operation ost_connect to node 10.9.4.109@tcp failed: rc = -16 [14767.280448] LustreError: 20270:0:(osc_quota.c:308:osc_quotactl()) ptlrpc_queue_wait failed, rc: -5 [14767.996905] Lustre: DEBUG MARKER: /usr/sbin/lctl mark sanity test_812b: @@@@@@ FAIL: lfs quota should succeed VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV |
| Comments |
| Comment by Chris Horn [ 05/May/20 ] |
|
test 812a and 812b do very similar things, so we can compare them to see where things went awry. 812a - Client: Sends disconnect RPC: 00000001:02000400:0.0:1588645784.707551:0:17493:0:(debug.c:492:libcfs_debug_mark_buffer()) DEBUG MARKER: trevis-10vm1.trevis.whamcloud.com: executing wait_import_state CONNECTING osc.lustre-OST0000-osc-ffff9418a40e1800.ost_server_uuid 40 00000100:00100000:1.0:1588645801.470595:0:12533:0:(client.c:1711:ptlrpc_send_new_req()) Sending RPC req@ffff94189cd0c900 pname:cluuid:pid:xid:nid:opc:job ptlrpcd_rcv:018ba189-a393-4:12533:1665810030674560:10.9.4.109@tcp:9:kworker/u4:3.0 812b - Client: Sends disconnect RPC: 00000001:02000400:1.0:1588645821.309597:0:19942:0:(debug.c:492:libcfs_debug_mark_buffer()) DEBUG MARKER: trevis-10vm1.trevis.whamcloud.com: executing wait_import_state CONNECTING osc.lustre-OST0000-osc-ffff9418a40e1800.ost_server_uuid 40 00000100:00100000:1.0:1588645836.526594:0:12533:0:(client.c:1711:ptlrpc_send_new_req()) Sending RPC req@ffff94189cd0d200 pname:cluuid:pid:xid:nid:opc:job ptlrpcd_rcv:018ba189-a393-4:12533:1665810030678400:10.9.4.109@tcp:9:kworker/u4:2.0 812a - Server: Receives disconnect RPC, delayed handling: 00000001:02000400:0.0:1588645784.388569:0:1338:0:(debug.c:492:libcfs_debug_mark_buffer()) DEBUG MARKER: trevis-10vm1.trevis.whamcloud.com: executing wait_import_state CONNECTING osc.lustre-OST0000-osc-ffff9418a40e1800.ost_server_uuid 40 00000100:00100000:1.0:1588645801.398631:0:7828:0:(service.c:2143:ptlrpc_server_handle_req_in()) got req x1665810030674560 00000100:00100000:1.0:1588645801.398667:0:7828:0:(service.c:2300:ptlrpc_server_handle_request()) Handling RPC req@ffff8ed5389f3a80 pname:cluuid+ref:pid:xid:nid:opc:job ll_ost00_037:018ba189-a393-4+5:12533:x1665810030674560:12345-10.9.4.107@tcp:9:kworker/u4:3.0 812b - Server: Receives disconnect RPC, delayed handling: 00000001:02000400:1.0:1588645820.951470:0:2333:0:(debug.c:492:libcfs_debug_mark_buffer()) DEBUG MARKER: trevis-10vm1.trevis.whamcloud.com: executing wait_import_state CONNECTING osc.lustre-OST0000-osc-ffff9418a40e1800.ost_server_uuid 40 00000100:00100000:1.0:1588645836.454391:0:29351:0:(service.c:2143:ptlrpc_server_handle_req_in()) got req x1665810030678400 00000100:00100000:1.0:1588645836.454426:0:29351:0:(service.c:2300:ptlrpc_server_handle_request()) Handling RPC req@ffff8ed52ba53a80 pname:cluuid+ref:pid:xid:nid:opc:job ll_ost00_030:018ba189-a393-4+5:12533:x1665810030678400:12345-10.9.4.107@tcp:9:kworker/u4:2.0 812a - Client: In CONNECTING 00000001:02000400:1.0:1588645802.505905:0:17709:0:(debug.c:492:libcfs_debug_mark_buffer()) DEBUG MARKER: /usr/sbin/lctl mark osc.lustre-OST0000-osc-ffff9418a40e1800.ost_server_uuid in CONNECTING state after 17 sec 00000001:02000400:0.0:1588645803.098041:0:17812:0:(debug.c:492:libcfs_debug_mark_buffer()) DEBUG MARKER: osc.lustre-OST0000-osc-ffff9418a40e1800.ost_server_uuid in CONNECTING state after 17 sec 812b - Client: In CONNECTING 00000001:02000400:0.0:1588645838.119645:0:20151:0:(debug.c:492:libcfs_debug_mark_buffer()) DEBUG MARKER: /usr/sbin/lctl mark osc.lustre-OST0000-osc-ffff9418a40e1800.ost_server_uuid in CONNECTING state after 16 sec 812a - Server: In CONNECTING 00000001:02000400:1.0:1588645802.188977:0:1344:0:(debug.c:492:libcfs_debug_mark_buffer()) DEBUG MARKER: /usr/sbin/lctl mark osc.lustre-OST0000-osc-ffff9418a40e1800.ost_server_uuid in CONNECTING state after 17 sec 00000001:02000400:1.0:1588645802.781945:0:1408:0:(debug.c:492:libcfs_debug_mark_buffer()) DEBUG MARKER: osc.lustre-OST0000-osc-ffff9418a40e1800.ost_server_uuid in CONNECTING state after 17 sec 812b - Server: In CONNECTING 00000001:02000400:1.0:1588645837.776525:0:2341:0:(debug.c:492:libcfs_debug_mark_buffer()) DEBUG MARKER: /usr/sbin/lctl mark osc.lustre-OST0000-osc-ffff9418a40e1800.ost_server_uuid in CONNECTING state after 16 sec 00000001:02000400:1.0:1588645838.378593:0:2405:0:(debug.c:492:libcfs_debug_mark_buffer()) DEBUG MARKER: osc.lustre-OST0000-osc-ffff9418a40e1800.ost_server_uuid in CONNECTING state after 16 sec 812a - Server: Handles disconnect RPC 00000001:02000400:1.0:1588645803.339396:0:1414:0:(debug.c:492:libcfs_debug_mark_buffer()) DEBUG MARKER: /usr/sbin/lctl set_param fail_loc=0 fail_val=0 00000100:00100000:1.0:1588645803.702376:0:7828:0:(service.c:2349:ptlrpc_server_handle_request()) Handled RPC req@ffff8ed5389f3a80 pname:cluuid+ref:pid:xid:nid:opc:job ll_ost00_037:018ba189-a393-4+2:12533:x1665810030674560:12345-10.9.4.107@tcp:9:kworker/u4:3.0 Request processed in 2303708us (2303789us total) trans 0 rc 0/0 812a - Client: Disconnect RPC reply 00000100:00080000:1.0:1588645803.775075:0:12533:0:(import.c:1779:ptlrpc_disconnect_idle_interpret()) @@@ inflight=2, refcount=6: rc = 0 req@ffff94189cd0c900 x1665810030674560/t0(0) o9->lustre-OST0000-osc-ffff9418a40e1800@10.9.4.109@tcp:28/4 lens 224/224 e 0 to 0 dl 1588645807 ref 1 fl Interpret:RNQU/0/0 rc 0/0 job:'kworker/u4:3.0' 00000100:00100000:1.0:1588645803.775151:0:12533:0:(client.c:2164:ptlrpc_check_set()) Completed RPC req@ffff94189cd0c900 pname:cluuid:pid:xid:nid:opc:job ptlrpcd_rcv:018ba189-a393-4:12533:1665810030674560:10.9.4.109@tcp:9:kworker/u4:3.0 812b - Client: Timeout disconnect RPC: 00000100:00000400:1.0:1588645842.526455:0:12533:0:(client.c:2244:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1588645836/real 1588645836] req@ffff94189cd0d200 x1665810030678400/t0(0) o9->lustre-OST0000-osc-ffff9418a40e1800@10.9.4.109@tcp:28/4 lens 224/224 e 0 to 1 dl 1588645842 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:2.0' 00000100:00100000:1.0:1588645842.529484:0:12533:0:(client.c:2275:ptlrpc_expire_one_request()) @@@ err -110, sent_state=CONNECTING (now=CONNECTING) req@ffff94189cd0d200 x1665810030678400/t0(0) o9->lustre-OST0000-osc-ffff9418a40e1800@10.9.4.109@tcp:28/4 lens 224/224 e 0 to 1 dl 1588645842 ref 1 fl Rpc:XNQU/0/ffffffff rc 0/-1 job:'kworker/u4:2.0' 00000100:00080000:1.0:1588645842.529492:0:12533:0:(import.c:1779:ptlrpc_disconnect_idle_interpret()) @@@ inflight=1, refcount=5: rc = -110 req@ffff94189cd0d200 x1665810030678400/t0(0) o9->lustre-OST0000-osc-ffff9418a40e1800@10.9.4.109@tcp:28/4 lens 224/224 e 0 to 1 dl 1588645842 ref 1 fl Interpret:EXNQU/0/ffffffff rc -110/-1 job:'kworker/u4:2.0' 00000100:00100000:1.0:1588645842.529499:0:12533:0:(client.c:2164:ptlrpc_check_set()) Completed RPC req@ffff94189cd0d200 pname:cluuid:pid:xid:nid:opc:job ptlrpcd_rcv:018ba189-a393-4:12533:1665810030678400:10.9.4.109@tcp:9:kworker/u4:2.0 812b - Server: Handles disconnect RPC 00000001:02000400:1.0:1588645843.912913:0:2411:0:(debug.c:492:libcfs_debug_mark_buffer()) DEBUG MARKER: /usr/sbin/lctl set_param fail_loc=0 fail_val=0 00000100:00100000:1.0:1588645844.257396:0:29351:0:(service.c:2349:ptlrpc_server_handle_request()) Handled RPC req@ffff8ed52ba53a80 pname:cluuid+ref:pid:xid:nid:opc:job ll_ost00_030:018ba189-a393-4+2:12533:x1665810030678400:12345-10.9.4.107@tcp:9:kworker/u4:2.0 Request processed in 7802987us (7803020us total) trans 0 rc 0/0 812b - Client: Drops disconnect RPC reply 00000001:02000400:1.0:1588645843.684878:0:20255:0:(debug.c:492:libcfs_debug_mark_buffer()) DEBUG MARKER: osc.lustre-OST0000-osc-ffff9418a40e1800.ost_server_uuid in CONNECTING state after 16 sec 00000400:00000100:0.0:1588645844.330131:0:12528:0:(lib-move.c:3907:lnet_parse_put()) Dropping PUT from 12345-10.9.4.109@tcp portal 4 match 1665810030678400 offset 224 length 224: 4 812a - Client: Reconnect to OST 00000100:00080000:1.0:1588645803.775085:0:12533:0:(import.c:544:import_select_connection()) lustre-OST0000-osc-ffff9418a40e1800: connect to NID 10.9.4.109@tcp last attempt 14697 00000100:00080000:1.0:1588645803.775089:0:12533:0:(import.c:627:import_select_connection()) lustre-OST0000-osc-ffff9418a40e1800: import ffff9418a40e3000 using connection 10.9.4.109@tcp/10.9.4.109@tcp 00000008:00100000:1.0:1588645803.775123:0:12533:0:(osc_request.c:2985:osc_reconnect()) ocd_connect_flags: 0xa0465af2e3640478 ocd_version: 34419968 ocd_grant: 8437760, lost: 0. 00000100:00100000:1.0:1588645803.775134:0:12533:0:(import.c:820:ptlrpc_connect_import_locked()) @@@ (re)connect request (timeout 5) req@ffff94189cd0ed00 x1665810030674752/t0(0) o8->lustre-OST0000-osc-ffff9418a40e1800@10.9.4.109@tcp:28/4 lens 520/544 e 0 to 0 dl 0 ref 1 fl New:NQU/0/ffffffff rc 0/-1 job:'' 00000100:00100000:1.0:1588645803.775173:0:12533:0:(client.c:1711:ptlrpc_send_new_req()) Sending RPC req@ffff94189cd0ed00 pname:cluuid:pid:xid:nid:opc:job ptlrpcd_rcv:018ba189-a393-4:12533:1665810030674752:10.9.4.109@tcp:8:ptlrpcd_rcv.0 812b - Client: Reconnect to OST 00000100:00080000:1.0:1588645844.290620:0:20270:0:(import.c:544:import_select_connection()) lustre-OST0000-osc-ffff9418a40e1800: connect to NID 10.9.4.109@tcp last attempt 14727 00000100:00080000:1.0:1588645844.290623:0:20270:0:(import.c:627:import_select_connection()) lustre-OST0000-osc-ffff9418a40e1800: import ffff9418a40e3000 using connection 10.9.4.109@tcp/10.9.4.109@tcp 00000008:00100000:1.0:1588645844.290638:0:20270:0:(osc_request.c:2985:osc_reconnect()) ocd_connect_flags: 0xa0465af2e3640478 ocd_version: 34419968 ocd_grant: 8437760, lost: 0. 00000100:00100000:1.0:1588645844.290646:0:20270:0:(import.c:820:ptlrpc_connect_import_locked()) @@@ (re)connect request (timeout 5) req@ffff94189cd0e880 x1665810030678912/t0(0) o8->lustre-OST0000-osc-ffff9418a40e1800@10.9.4.109@tcp:28/4 lens 520/544 e 0 to 0 dl 0 ref 1 fl New:NQU/0/ffffffff rc 0/-1 job:'' 00000100:00100000:1.0:1588645844.290669:0:12533:0:(client.c:1711:ptlrpc_send_new_req()) Sending RPC req@ffff94189cd0e880 pname:cluuid:pid:xid:nid:opc:job ptlrpcd_rcv:018ba189-a393-4:12533:1665810030678912:10.9.4.109@tcp:8:lfs.0 812a - Server: Handle connect RPC 00000100:00100000:1.0:1588645803.703084:0:7828:0:(service.c:2143:ptlrpc_server_handle_req_in()) got req x1665810030674752 00000100:00100000:1.0:1588645803.703100:0:7828:0:(nrs_fifo.c:179:nrs_fifo_req_get()) NRS start fifo request from 12345-10.9.4.107@tcp, seq: 142006 00000100:00100000:1.0:1588645803.703103:0:7828:0:(service.c:2300:ptlrpc_server_handle_request()) Handling RPC req@ffff8ed507f5d200 pname:cluuid+ref:pid:xid:nid:opc:job ll_ost00_037:0+-99:12533:x1665810030674752:12345-10.9.4.107@tcp:8:ptlrpcd_rcv.0 00010000:00080000:1.0:1588645803.703112:0:7828:0:(ldlm_lib.c:1288:target_handle_connect()) lustre-OST0000: connection from 018ba189-a393-4@10.9.4.107@tcp t0 exp (null) cur 14629 last 0 00000020:00000080:1.0:1588645803.703122:0:7828:0:(genops.c:1363:class_connect()) connect: client 018ba189-a393-4, cookie 0x6041f8af644efcda 00002000:00100000:1.0:1588645803.703125:0:7828:0:(ofd_obd.c:238:ofd_parse_connect_data()) lustre-OST0000: cli 018ba189-a393-4/ffff8ed51ffa8400 ocd_connect_flags: 0xa0465af2e3640478 ocd_version: 20d3500 ocd_grant: 8437760 ocd_index: 0 ocd_group 0 00002000:00100000:1.0:1588645803.703130:0:7828:0:(ofd_obd.c:337:ofd_parse_connect_data()) lustre-OST0000: cli <unknown> supports cksum type f7, return f7 00000020:01000000:1.0:1588645803.703140:0:7828:0:(lprocfs_status_server.c:523:lprocfs_exp_setup()) using hash ffff8ed517793180 00002000:00080000:1.0:1588645803.703143:0:7828:0:(ofd_obd.c:476:ofd_obd_connect()) lustre-OST0000: get connection from MDS 0 00000100:00080000:1.0:1588645803.703152:0:7828:0:(import.c:86:import_set_state_nolock()) ffff8ed51bed2800 p���Ԏ��p�L6Վ��: changing import state from RECOVER to FULL 00000100:02000000:1.0:1588645803.703154:0:7828:0:(import.c:1622:ptlrpc_import_recovery_state_machine()) lustre-OST0000: Connection restored to 9b0a0fe9-8fa7-4 (at 10.9.4.107@tcp) 00000100:00100000:1.0:1588645803.703169:0:7828:0:(service.c:2349:ptlrpc_server_handle_request()) Handled RPC req@ffff8ed507f5d200 pname:cluuid+ref:pid:xid:nid:opc:job ll_ost00_037:018ba189-a393-4+6:12533:x1665810030674752:12345-10.9.4.107@tcp:8:ptlrpcd_rcv.0 Request processed in 66us (98us total) trans 0 rc 0/0 812b - Server: Handle connect RPC 00000100:00100000:1.0:1588645844.218486:0:7827:0:(service.c:2143:ptlrpc_server_handle_req_in()) got req x1665810030678912 00000100:00100000:1.0:1588645844.218514:0:7827:0:(nrs_fifo.c:179:nrs_fifo_req_get()) NRS start fifo request from 12345-10.9.4.107@tcp, seq: 142010 00000100:00100000:1.0:1588645844.218518:0:7827:0:(service.c:2300:ptlrpc_server_handle_request()) Handling RPC req@ffff8ed4f1831200 pname:cluuid+ref:pid:xid:nid:opc:job ll_ost00_036:0+-99:12533:x1665810030678912:12345-10.9.4.107@tcp:8:lfs.0 00010000:02000400:1.0:1588645844.218532:0:7827:0:(ldlm_lib.c:837:target_handle_reconnect()) lustre-OST0000: Client 018ba189-a393-4 (at 10.9.4.107@tcp) reconnecting 00010000:02000400:1.0:1588645844.220018:0:7827:0:(ldlm_lib.c:1268:target_handle_connect()) lustre-OST0000: Client 018ba189-a393-4 (at 10.9.4.107@tcp) refused connection, still busy with 6 references 00000100:00100000:1.0:1588645844.221141:0:7827:0:(service.c:2349:ptlrpc_server_handle_request()) Handled RPC req@ffff8ed4f1831200 pname:cluuid+ref:pid:xid:nid:opc:job ll_ost00_036:0+-99:12533:x1665810030678912:12345-10.9.4.107@tcp:8:lfs.0 Request processed in 2622us (2762us total) trans 0 rc -16/-16 812a - Client: Reconnect completes 00000100:00080000:1.0:1588645803.775698:0:12533:0:(import.c:86:import_set_state_nolock()) ffff9418a40e3000 lustre-OST0000_UUID: changing import state from CONNECTING to FULL 812b - Client: Reconnect fails - QUOTACTL fails as a result? 00000100:00100000:1.0:1588645844.295098:0:12533:0:(client.c:2164:ptlrpc_check_set()) Completed RPC req@ffff94189cd0e880 pname:cluuid:pid:xid:nid:opc:job ptlrpcd_rcv:018ba189-a393-4:12533:1665810030678912:10.9.4.109@tcp:8:lfs.0 00000100:02020000:1.0:1588645844.293876:0:12533:0:(client.c:1354:ptlrpc_check_status()) 11-0: lustre-OST0000-osc-ffff9418a40e1800: operation ost_connect to node 10.9.4.109@tcp failed: rc = -16 00000100:00100000:1.0:1588645844.295090:0:12533:0:(client.c:2847:ptlrpc_free_committed()) lustre-OST0000-osc-ffff9418a40e1800: committing for last_committed 30064774834 gen 5 00000100:00080000:1.0:1588645844.295095:0:12533:0:(import.c:86:import_set_state_nolock()) ffff9418a40e3000 lustre-OST0000_UUID: changing import state from CONNECTING to DISCONN 00000100:00080000:1.0:1588645844.295096:0:12533:0:(import.c:1408:ptlrpc_connect_interpret()) recovery of lustre-OST0000_UUID on 10.9.4.109@tcp failed (-16) ... 00000100:00100000:1.0:1588645844.295108:0:20270:0:(client.c:2164:ptlrpc_check_set()) Completed RPC req@ffff94189cd0d200 pname:cluuid:pid:xid:nid:opc:job lfs:018ba189-a393-4:20270:1665810030678976:10.9.4.109@tcp:19:lfs.0 00000001:00020000:1.0:1588645844.295125:0:20270:0:(osc_quota.c:308:osc_quotactl()) ptlrpc_queue_wait failed, rc: -5 The primary difference seems to be the timeout of the disconnect RPC on the client. I'm not super familiar with this code, but that timeout appears to have trickled down into eventually failing to recover the OST0 connection which in turn caused the -EIO for the quotactl. |
| Comment by Chris Horn [ 05/May/20 ] |
|
This test failure was seen with patch https://review.whamcloud.com/#/c/38460/ . That patch modifies default LND timeout values. However, based on the logs above it doesn't appear that the patch is responsible for the failure. The timeout of the disconnect RPC happened after ~ 6 seconds, which is less than the new default LND timeout (16 seconds). |