[LU-14764] Single OST quotas out of sync (with hung I/O) and acquire quota failed:-115 Created: 15/Jun/21 Updated: 23/Nov/22 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.12.6 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Stephane Thiell | Assignee: | Hongchao Zhang |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None | ||
| Environment: |
CentOS 7.9, ldiskfs |
||
| Attachments: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
We're enforcing group quota limits on Oak. They usually work great. After an MDS crash with 2.12.6 (we hit unrelated bug LU-14705) and a restart of the MDTs, one OST (oak-OST0135) did not come back in sync with the quota master, and it is still out of sync. A few groups are out of quota on this OST, and this is generating hung I/Os, at least for new files that are striped on this OST. As a temporary workaround, I have set max_create_count=0 on all MDTs for this OST (and this worked). Example of a group out of quota on OST0135 (index 309):

# lfs quota -v -g oak_cblish /oak
[...]
oak-OST0134_UUID
243010468 - 246011728 - - - - -
oak-OST0135_UUID
256548944* - 256542812 - - - - -
oak-OST0136_UUID
85301696 - 86780916 - - - - -
oak-OST0137_UUID
156684012 - 160709804 - - - - -
oak-OST0138_UUID
35652032 - 36721800 - - - - -
oak-OST0139_UUID
8830024 - 12589328 - - - - -
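
The asterisk on the oak-OST0135_UUID line marks the group as over the limit granted to that slave. For comparison, the quota master's view of the same group can be dumped on the MDS hosting MDT0000; the following is only a sketch, and the qmt parameter path and the <gid> placeholder are assumptions rather than commands taken from this ticket:

# On the MDS hosting MDT0000 (the quota master), dump the global group quota
# index and look up the group's numeric gid (<gid> is a placeholder):
lctl get_param qmt.oak-QMT0000.dt-0x0.glb-grp | grep -A 1 'id: *<gid>'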
Example of hung I/O on clients:

[2021-06-14T15:31:23-07:00] [3776399.960372] tar S ffff8b55ed7cb180 0 13047 4645 0x00000000
[2021-06-14T15:31:23-07:00] [3776399.968457] Call Trace:
[2021-06-14T15:31:23-07:00] [3776399.971387] [<ffffffffb90b042b>] ? recalc_sigpending+0x1b/0x70
[2021-06-14T15:31:23-07:00] [3776399.978194] [<ffffffffb9788eb9>] schedule+0x29/0x70
[2021-06-14T15:31:23-07:00] [3776399.983972] [<ffffffffc0e935b5>] cl_sync_io_wait+0x2b5/0x3d0 [obdclass]
[2021-06-14T15:31:23-07:00] [3776399.991654] [<ffffffffb90dadf0>] ? wake_up_state+0x20/0x20
[2021-06-14T15:31:23-07:00] [3776399.998377] [<ffffffffc0e93d78>] cl_io_submit_sync+0x178/0x270 [obdclass]
[2021-06-14T15:31:23-07:00] [3776400.006298] [<ffffffffc125d1d6>] vvp_io_commit_sync+0x106/0x340 [lustre]
[2021-06-14T15:31:23-07:00] [3776400.014084] [<ffffffffc125e5b6>] vvp_io_write_commit+0x4c6/0x600 [lustre]
[2021-06-14T15:31:23-07:00] [3776400.021971] [<ffffffffc125ed25>] vvp_io_write_start+0x635/0xa70 [lustre]
[2021-06-14T15:31:23-07:00] [3776400.029760] [<ffffffffc0e90225>] ? cl_lock_enqueue+0x65/0x120 [obdclass]
[2021-06-14T15:31:23-07:00] [3776400.037550] [<ffffffffc0e92788>] cl_io_start+0x68/0x130 [obdclass]
[2021-06-14T15:31:23-07:00] [3776400.044750] [<ffffffffc0e949fc>] cl_io_loop+0xcc/0x1c0 [obdclass]
[2021-06-14T15:31:23-07:00] [3776400.051848] [<ffffffffc121407b>] ll_file_io_generic+0x63b/0xc90 [lustre]
[2021-06-14T15:31:23-07:00] [3776400.059631] [<ffffffffc1214b69>] ll_file_aio_write+0x289/0x660 [lustre]
[2021-06-14T15:31:23-07:00] [3776400.067317] [<ffffffffc1215040>] ll_file_write+0x100/0x1c0 [lustre]
[2021-06-14T15:31:23-07:00] [3776400.074609] [<ffffffffb924de00>] vfs_write+0xc0/0x1f0
[2021-06-14T15:31:23-07:00] [3776400.080541] [<ffffffffb924ebdf>] SyS_write+0x7f/0xf0
[2021-06-14T15:31:23-07:00] [3776400.086380] [<ffffffffb9795f92>] system_call_fastpath+0x25/0x2a

I tried forcing a reintegration with:

echo 1 > /proc/fs/lustre/osd-ldiskfs/oak-OST0135/quota_slave/force_reint

but that didn't seem to help. The status of the quota slave on this OST is now:

[root@oak-io6-s2 ~]# cat /proc/fs/lustre/osd-ldiskfs/oak-OST0135/quota_slave/info
target name: oak-OST0135
pool ID: 0
type: dt
quota enabled: g
conn to master: setup
space acct: ugp
user uptodate: glb[0],slv[0],reint[0]
group uptodate: glb[0],slv[0],reint[1]
project uptodate: glb[0],slv[0],reint[0]

but it should likely be:

group uptodate: glb[1],slv[1],reint[0]

like on other healthy OSTs.
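
To see whether any other slave is stuck in the same state, the quota_slave info can be compared across all OSTs; this is a hedged sketch, where the "@oss" clush group and the lctl spelling of the quota_slave/info proc file are assumptions based on the paths used in this ticket:

# Group-quota reintegration state of every ldiskfs OST quota slave:
clush -w @oss 'lctl get_param osd-ldiskfs.*.quota_slave.info' | grep -E 'target name|group uptodate'

Any slave still reporting glb[0]/slv[0] or reint[1] for the group type is in the same situation as oak-OST0135.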
I've taken some debug logs with +trace and +quota from the OSS in question, which I'm attaching as oak-OST0135_quota_issue.log.gz. We can see some -115 (EINPROGRESS) errors in the logs:

[root@oak-io6-s2 ~]# grep qsd:oak-OST0135 oak-OST0135_quota_issue.log | head
00040000:04000000:46.0:1623797919.334289:0:107324:0:(qsd_handler.c:732:qsd_op_begin0()) $$$ op_begin space:12 qsd:oak-OST0135 qtype:grp id:8663 enforced:1 granted: 119761748 pending:0 waiting:0 req:0 usage: 119769696 qunit:0 qtune:0 edquot:0 default:no
00040000:04000000:46.0:1623797919.334291:0:107324:0:(qsd_handler.c:636:qsd_acquire()) $$$ acquiring:12 count=0 qsd:oak-OST0135 qtype:grp id:8663 enforced:1 granted: 119761748 pending:0 waiting:12 req:0 usage: 119769696 qunit:0 qtune:0 edquot:0 default:no
00040000:04000000:46.0:1623797919.334312:0:107324:0:(qsd_entry.c:247:qsd_refresh_usage()) $$$ disk usage: 119769696 qsd:oak-OST0135 qtype:grp id:8663 enforced:1 granted: 119761748 pending:0 waiting:12 req:0 usage: 119769696 qunit:0 qtune:0 edquot:0 default:no
00040000:04000000:46.0:1623797919.334316:0:107324:0:(qsd_handler.c:121:qsd_ready()) $$$ not up-to-date, dropping request and kicking off reintegration qsd:oak-OST0135 qtype:grp id:8663 enforced:1 granted: 119761748 pending:0 waiting:12 req:0 usage: 119769696 qunit:0 qtune:0 edquot:0 default:no
00040000:04000000:46.0:1623797919.334320:0:107324:0:(qsd_handler.c:751:qsd_op_begin0()) $$$ acquire quota failed:-115 qsd:oak-OST0135 qtype:grp id:8663 enforced:1 granted: 119761748 pending:0 waiting:12 req:0 usage: 119769696 qunit:0 qtune:0 edquot:0 default:no
00040000:04000000:46.0:1623797919.334330:0:107324:0:(qsd_entry.c:247:qsd_refresh_usage()) $$$ disk usage: 119769696 qsd:oak-OST0135 qtype:grp id:8663 enforced:1 granted: 119761748 pending:0 waiting:0 req:0 usage: 119769696 qunit:0 qtune:0 edquot:0 default:no
00040000:04000000:46.0:1623797919.334332:0:107324:0:(qsd_handler.c:222:qsd_calc_adjust()) $$$ overrun, reporting usage qsd:oak-OST0135 qtype:grp id:8663 enforced:1 granted: 119761748 pending:0 waiting:0 req:0 usage: 119769696 qunit:0 qtune:0 edquot:0 default:no
00040000:04000000:46.0:1623797919.334334:0:107324:0:(qsd_handler.c:121:qsd_ready()) $$$ not up-to-date, dropping request and kicking off reintegration qsd:oak-OST0135 qtype:grp id:8663 enforced:1 granted: 119761748 pending:0 waiting:0 req:0 usage: 119769696 qunit:0 qtune:0 edquot:0 default:no
00040000:04000000:46.0:1623797919.334336:0:107324:0:(qsd_handler.c:929:qsd_adjust()) $$$ delaying adjustment since qsd isn't ready qsd:oak-OST0135 qtype:grp id:8663 enforced:1 granted: 119761748 pending:0 waiting:0 req:0 usage: 119769696 qunit:0 qtune:0 edquot:0 default:no
00040000:04000000:18.0:1623797930.024599:0:271450:0:(qsd_entry.c:247:qsd_refresh_usage()) $$$ disk usage: 232050092 qsd:oak-OST0135 qtype:grp id:4722 enforced:1 granted: 268435456 pending:0 waiting:0 req:0 usage: 232050092 qunit:0 qtune:0 edquot:0 default:no

Please let me know if you have any idea of what could be wrong, how to troubleshoot further, and/or whether there is a way to force a resync in a different way. Otherwise, I'm going to restart this OST at a scheduled maintenance. Thanks! |
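
The max_create_count workaround from the description can be expressed roughly as follows; this is only a sketch: the osp parameter name is inferred from the oak-OST0135-osc-MDT0000 device shown later in this ticket, the lctl spellings of the proc files are assumptions, and 20000 is believed to be the usual max_create_count default but should be verified before restoring it:

# On each MDS, stop allocating new objects on the affected OST for every MDT:
lctl set_param osp.oak-OST0135-osc-MDT*.max_create_count=0

# On the OSS hosting OST0135, retry the reintegration and re-check the slave:
lctl set_param osd-ldiskfs.oak-OST0135.quota_slave.force_reint=1
lctl get_param osd-ldiskfs.oak-OST0135.quota_slave.info

# Once the OST is back in sync, restore object creation:
lctl set_param osp.oak-OST0135-osc-MDT*.max_create_count=20000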
| Comments |
| Comment by Peter Jones [ 16/Jun/21 ] |
|
Hongchao, could you please advise? Thanks, Peter |
| Comment by Hongchao Zhang [ 17/Jun/21 ] |
|
Hi, |
| Comment by Stephane Thiell [ 18/Jun/21 ] |
|
Hi Hongchao, Thanks! OK, I've tried, but I don't think I was able to capture the error. I'm not sure why: I don't see any -115 error on the QMT, nor even a mention of OST0135 itself. I used the +quota and +trace debug options on MDT0, created a file on OST index 309 (OST0135), and started a dd to write data to it. I also ran lfs quota -v during that time. I can see many other OSTs being queried, e.g.:

lookup slave index file for oak-MDT0000-lwp-OST006b_UUID

but I can't find OST0135... Just in case, I'm attaching the logs as oak-md1-s2-dk+quota+trace.log.gz |
| Comment by Stephane Thiell [ 18/Jun/21 ] |
|
Hongchao, for reference, here is the code in qsd_ready() (qsd_handler.c) that produced the message below:

111 /* In most case, reintegration must have been triggered (when enable
112 * quota or on OST start), however, in rare race condition (enabling
113 * quota when starting OSTs), we might miss triggering reintegration
114 * for some qqi.
115 *
116 * If the previous reintegration failed for some reason, we'll
117 * re-trigger it here as well. */
118 if (!qqi->qqi_glb_uptodate || !qqi->qqi_slv_uptodate) {
119 read_unlock(&qsd->qsd_lock);
120 LQUOTA_DEBUG(lqe, "not up-to-date, dropping request and "
121 "kicking off reintegration");
122 qsd_start_reint_thread(qqi);
123 RETURN(-EINPROGRESS);
124 }
which generated:

00040000:04000000:46.0:1623797919.334316:0:107324:0:(qsd_handler.c:121:qsd_ready()) $$$ not up-to-date, dropping request and kicking off reintegration qsd:oak-OST0135 qtype:grp id:8663 enforced:1 granted: 119761748 pending:0 waiting:12 req:0 usage: 119769696 qunit:0 qtune:0 edquot:0 default:no |
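
Since this path re-kicks qsd_start_reint_thread() on every request, one way to see whether the reintegration ever completes is to watch the uptodate flags; a small hedged sketch, assuming the lctl spelling of the quota_slave/info file quoted above:

# Watch the group-quota state on OST0135; on success glb/slv should flip to 1
# and reint should drop back to 0:
while true; do
    date
    lctl get_param osd-ldiskfs.oak-OST0135.quota_slave.info | grep 'group uptodate'
    sleep 10
done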
| Comment by Hongchao Zhang [ 22/Jun/21 ] |
|
Hi Stephane, Sorry! At the OSS hosting OST0135, is there a process named "qsd_reint_2.oak-OST0135"? This should be the quota reintegration kernel thread. |
| Comment by Stephane Thiell [ 22/Jun/21 ] |
|
Thanks Hongchao! I've attached a dump of all kernel tasks on the OSS as oak-io6-s2-LU-14764-sysrq-t.log. I can see this task:

[2021-06-22T08:56:09-07:00] [1031942.051787] qsd_reint_1.oak S ffff8b6f1a37e300 0 88577 2 0x00000080
[2021-06-22T08:56:09-07:00] [1031942.059042] Call Trace:
[2021-06-22T08:56:09-07:00] [1031942.061663] [<ffffffffab187229>] schedule+0x29/0x70
[2021-06-22T08:56:09-07:00] [1031942.066788] [<ffffffffab184c58>] schedule_timeout+0x168/0x2d0
[2021-06-22T08:56:09-07:00] [1031942.072782] [<ffffffffaaaad6e0>] ? __internal_add_timer+0x130/0x130
[2021-06-22T08:56:09-07:00] [1031942.079314] [<ffffffffc10f9600>] ? ptlrpc_init_rq_pool+0x110/0x110 [ptlrpc]
[2021-06-22T08:56:09-07:00] [1031942.086542] [<ffffffffc1103350>] ptlrpc_set_wait+0x480/0x790 [ptlrpc]
[2021-06-22T08:56:09-07:00] [1031942.093228] [<ffffffffaaadaf40>] ? wake_up_state+0x20/0x20
[2021-06-22T08:56:09-07:00] [1031942.098983] [<ffffffffc11036e3>] ptlrpc_queue_wait+0x83/0x230 [ptlrpc]
[2021-06-22T08:56:09-07:00] [1031942.105764] [<ffffffffc154d005>] qsd_fetch_index+0x175/0x490 [lquota]
[2021-06-22T08:56:09-07:00] [1031942.112451] [<ffffffffc15517b8>] qsd_reint_index+0x5c8/0x15d0 [lquota]
[2021-06-22T08:56:09-07:00] [1031942.119228] [<ffffffffc1553bb2>] qsd_reint_main+0x902/0xe90 [lquota]
[2021-06-22T08:56:09-07:00] [1031942.125834] [<ffffffffc15532b0>] ? qsd_reconciliation+0xaf0/0xaf0 [lquota]
[2021-06-22T08:56:09-07:00] [1031942.132955] [<ffffffffaaac5c21>] kthread+0xd1/0xe0
[2021-06-22T08:56:09-07:00] [1031942.137996] [<ffffffffaaac5b50>] ? insert_kthread_work+0x40/0x40
[2021-06-22T08:56:09-07:00] [1031942.144245] [<ffffffffab194ddd>] ret_from_fork_nospec_begin+0x7/0x21
[2021-06-22T08:56:09-07:00] [1031942.150845] [<ffffffffaaac5b50>] ? insert_kthread_work+0x40/0x40
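
This trace shows the reintegration thread waiting in qsd_fetch_index() for a reply from the quota master, which travels over the MDT0000 LWP connection. One quick check is to list the LWP devices on this OSS and their state; a hedged sketch, with the expected output line only illustrative:

# On the OSS hosting OST0135:
lctl dl | grep lwp
# Expected to include a line of the form (illustrative):
#   NN UP lwp oak-MDT0000-lwp-OST0135 <uuid> N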
| Comment by Hongchao Zhang [ 24/Jun/21 ] |
|
it seems the reintegration thread was stuck 00000100:00000001:8.0:1623797928.588752:0:88577:0:(client.c:2206:ptlrpc_expired_set()) Process entered 00000100:00000001:8.0:1623797928.588753:0:88577:0:(client.c:2241:ptlrpc_expired_set()) Process leaving (rc=1 : 1 : 1) 00000100:00000001:8.0:1623797928.588755:0:88577:0:(client.c:2294:ptlrpc_set_next_timeout()) Process entered 00000100:00000001:8.0:1623797928.588755:0:88577:0:(client.c:2330:ptlrpc_set_next_timeout()) Process leaving (rc=0 : 0 : 0) 00000100:00000001:8.0:1623797928.588756:0:88577:0:(client.c:1703:ptlrpc_check_set()) Process entered 00000100:00000001:8.0:1623797928.588757:0:88577:0:(client.c:2624:ptlrpc_unregister_reply()) Process leaving (rc=1 : 1 : 1) 00000100:00000001:8.0:1623797928.588758:0:88577:0:(client.c:1172:ptlrpc_import_delay_req()) Process entered 00000100:00000001:8.0:1623797928.588759:0:88577:0:(client.c:1227:ptlrpc_import_delay_req()) Process leaving (rc=1 : 1 : 1) 00000100:00000001:8.0:1623797928.588759:0:88577:0:(client.c:2117:ptlrpc_check_set()) Process leaving (rc=0 : 0 : 0) 00000100:00000001:8.0:1623797928.588760:0:88577:0:(client.c:1703:ptlrpc_check_set()) Process entered 00000100:00000001:8.0:1623797928.588760:0:88577:0:(client.c:2624:ptlrpc_unregister_reply()) Process leaving (rc=1 : 1 : 1) 00000100:00000001:8.0:1623797928.588760:0:88577:0:(client.c:1172:ptlrpc_import_delay_req()) Process entered 00000100:00000001:8.0:1623797928.588761:0:88577:0:(client.c:1227:ptlrpc_import_delay_req()) Process leaving (rc=1 : 1 : 1) 00000100:00000001:8.0:1623797928.588761:0:88577:0:(client.c:2117:ptlrpc_check_set()) Process leaving (rc=0 : 0 : 0) 00000100:00000001:8.0:1623797929.588669:0:88577:0:(client.c:2206:ptlrpc_expired_set()) Process entered 00000100:00000001:8.0:1623797929.588670:0:88577:0:(client.c:2241:ptlrpc_expired_set()) Process leaving (rc=1 : 1 : 1) 00000100:00000001:8.0:1623797929.588671:0:88577:0:(client.c:2294:ptlrpc_set_next_timeout()) Process entered 00000100:00000001:8.0:1623797929.588671:0:88577:0:(client.c:2330:ptlrpc_set_next_timeout()) Process leaving (rc=0 : 0 : 0) 00000100:00000001:8.0:1623797929.588672:0:88577:0:(client.c:1703:ptlrpc_check_set()) Process entered 00000100:00000001:8.0:1623797929.588673:0:88577:0:(client.c:2624:ptlrpc_unregister_reply()) Process leaving (rc=1 : 1 : 1) 00000100:00000001:8.0:1623797929.588673:0:88577:0:(client.c:1172:ptlrpc_import_delay_req()) Process entered 00000100:00000001:8.0:1623797929.588674:0:88577:0:(client.c:1227:ptlrpc_import_delay_req()) Process leaving (rc=1 : 1 : 1) 00000100:00000001:8.0:1623797929.588675:0:88577:0:(client.c:2117:ptlrpc_check_set()) Process leaving (rc=0 : 0 : 0) 00000100:00000001:8.0:1623797929.588675:0:88577:0:(client.c:1703:ptlrpc_check_set()) Process entered 00000100:00000001:8.0:1623797929.588676:0:88577:0:(client.c:2624:ptlrpc_unregister_reply()) Process leaving (rc=1 : 1 : 1) 00000100:00000001:8.0:1623797929.588676:0:88577:0:(client.c:1172:ptlrpc_import_delay_req()) Process entered 00000100:00000001:8.0:1623797929.588676:0:88577:0:(client.c:1227:ptlrpc_import_delay_req()) Process leaving (rc=1 : 1 : 1) 00000100:00000001:8.0:1623797929.588676:0:88577:0:(client.c:2117:ptlrpc_check_set()) Process leaving (rc=0 : 0 : 0) 00000100:00000001:8.0:1623797930.588710:0:88577:0:(client.c:2206:ptlrpc_expired_set()) Process entered 00000100:00000001:8.0:1623797930.588721:0:88577:0:(client.c:2241:ptlrpc_expired_set()) Process leaving (rc=1 : 1 : 1) 00000100:00000001:8.0:1623797930.588723:0:88577:0:(client.c:2294:ptlrpc_set_next_timeout()) Process 
entered 00000100:00000001:8.0:1623797930.588723:0:88577:0:(client.c:2330:ptlrpc_set_next_timeout()) Process leaving (rc=0 : 0 : 0) 00000100:00000001:8.0:1623797930.588724:0:88577:0:(client.c:1703:ptlrpc_check_set()) Process entered 00000100:00000001:8.0:1623797930.588725:0:88577:0:(client.c:2624:ptlrpc_unregister_reply()) Process leaving (rc=1 : 1 : 1) 00000100:00000001:8.0:1623797930.588725:0:88577:0:(client.c:1172:ptlrpc_import_delay_req()) Process entered 00000100:00000001:8.0:1623797930.588726:0:88577:0:(client.c:1227:ptlrpc_import_delay_req()) Process leaving (rc=1 : 1 : 1) 00000100:00000001:8.0:1623797930.588726:0:88577:0:(client.c:2117:ptlrpc_check_set()) Process leaving (rc=0 : 0 : 0) 00000100:00000001:8.0:1623797930.588727:0:88577:0:(client.c:1703:ptlrpc_check_set()) Process entered 00000100:00000001:8.0:1623797930.588727:0:88577:0:(client.c:2624:ptlrpc_unregister_reply()) Process leaving (rc=1 : 1 : 1) 00000100:00000001:8.0:1623797930.588727:0:88577:0:(client.c:1172:ptlrpc_import_delay_req()) Process entered 00000100:00000001:8.0:1623797930.588727:0:88577:0:(client.c:1227:ptlrpc_import_delay_req()) Process leaving (rc=1 : 1 : 1) 00000100:00000001:8.0:1623797930.588728:0:88577:0:(client.c:2117:ptlrpc_check_set()) Process leaving (rc=0 : 0 : 0) 00000100:00000001:8.0:1623797931.588670:0:88577:0:(client.c:2206:ptlrpc_expired_set()) Process entered 00000100:00000001:8.0:1623797931.588671:0:88577:0:(client.c:2241:ptlrpc_expired_set()) Process leaving (rc=1 : 1 : 1) 00000100:00000001:8.0:1623797931.588673:0:88577:0:(client.c:2294:ptlrpc_set_next_timeout()) Process entered 00000100:00000001:8.0:1623797931.588673:0:88577:0:(client.c:2330:ptlrpc_set_next_timeout()) Process leaving (rc=0 : 0 : 0) 00000100:00000001:8.0:1623797931.588674:0:88577:0:(client.c:1703:ptlrpc_check_set()) Process entered 00000100:00000001:8.0:1623797931.588675:0:88577:0:(client.c:2624:ptlrpc_unregister_reply()) Process leaving (rc=1 : 1 : 1) 00000100:00000001:8.0:1623797931.588676:0:88577:0:(client.c:1172:ptlrpc_import_delay_req()) Process entered 00000100:00000001:8.0:1623797931.588676:0:88577:0:(client.c:1227:ptlrpc_import_delay_req()) Process leaving (rc=1 : 1 : 1) 00000100:00000001:8.0:1623797931.588677:0:88577:0:(client.c:2117:ptlrpc_check_set()) Process leaving (rc=0 : 0 : 0) 00000100:00000001:8.0:1623797931.588677:0:88577:0:(client.c:1703:ptlrpc_check_set()) Process entered 00000100:00000001:8.0:1623797931.588677:0:88577:0:(client.c:2624:ptlrpc_unregister_reply()) Process leaving (rc=1 : 1 : 1) 00000100:00000001:8.0:1623797931.588678:0:88577:0:(client.c:1172:ptlrpc_import_delay_req()) Process entered 00000100:00000001:8.0:1623797931.588678:0:88577:0:(client.c:1227:ptlrpc_import_delay_req()) Process leaving (rc=1 : 1 : 1) 00000100:00000001:8.0:1623797931.588678:0:88577:0:(client.c:2117:ptlrpc_check_set()) Process leaving (rc=0 : 0 : 0) 00000100:00000001:8.0:1623797932.588711:0:88577:0:(client.c:2206:ptlrpc_expired_set()) Process entered 00000100:00000001:8.0:1623797932.588712:0:88577:0:(client.c:2241:ptlrpc_expired_set()) Process leaving (rc=1 : 1 : 1) 00000100:00000001:8.0:1623797932.588714:0:88577:0:(client.c:2294:ptlrpc_set_next_timeout()) Process entered 00000100:00000001:8.0:1623797932.588715:0:88577:0:(client.c:2330:ptlrpc_set_next_timeout()) Process leaving (rc=0 : 0 : 0) 00000100:00000001:8.0:1623797932.588715:0:88577:0:(client.c:1703:ptlrpc_check_set()) Process entered 00000100:00000001:8.0:1623797932.588717:0:88577:0:(client.c:2624:ptlrpc_unregister_reply()) Process leaving (rc=1 : 1 : 1) 
00000100:00000001:8.0:1623797932.588717:0:88577:0:(client.c:1172:ptlrpc_import_delay_req()) Process entered 00000100:00000001:8.0:1623797932.588719:0:88577:0:(client.c:1227:ptlrpc_import_delay_req()) Process leaving (rc=1 : 1 : 1) 00000100:00000001:8.0:1623797932.588720:0:88577:0:(client.c:2117:ptlrpc_check_set()) Process leaving (rc=0 : 0 : 0) 00000100:00000001:8.0:1623797932.588720:0:88577:0:(client.c:1703:ptlrpc_check_set()) Process entered 00000100:00000001:8.0:1623797932.588721:0:88577:0:(client.c:2624:ptlrpc_unregister_reply()) Process leaving (rc=1 : 1 : 1) 00000100:00000001:8.0:1623797932.588721:0:88577:0:(client.c:1172:ptlrpc_import_delay_req()) Process entered 00000100:00000001:8.0:1623797932.588722:0:88577:0:(client.c:1227:ptlrpc_import_delay_req()) Process leaving (rc=1 : 1 : 1) 00000100:00000001:8.0:1623797932.588722:0:88577:0:(client.c:2117:ptlrpc_check_set()) Process leaving (rc=0 : 0 : 0) 00000100:00000001:10.0:1623797933.588682:0:88577:0:(client.c:2206:ptlrpc_expired_set()) Process entered 00000100:00000001:10.0:1623797933.588682:0:88577:0:(client.c:2241:ptlrpc_expired_set()) Process leaving (rc=1 : 1 : 1) 00000100:00000001:10.0:1623797933.588683:0:88577:0:(client.c:2294:ptlrpc_set_next_timeout()) Process entered 00000100:00000001:10.0:1623797933.588684:0:88577:0:(client.c:2330:ptlrpc_set_next_timeout()) Process leaving (rc=0 : 0 : 0) 00000100:00000001:10.0:1623797933.588684:0:88577:0:(client.c:1703:ptlrpc_check_set()) Process entered 00000100:00000001:10.0:1623797933.588685:0:88577:0:(client.c:2624:ptlrpc_unregister_reply()) Process leaving (rc=1 : 1 : 1) 00000100:00000001:10.0:1623797933.588686:0:88577:0:(client.c:1172:ptlrpc_import_delay_req()) Process entered 00000100:00000001:10.0:1623797933.588687:0:88577:0:(client.c:1227:ptlrpc_import_delay_req()) Process leaving (rc=1 : 1 : 1) 00000100:00000001:10.0:1623797933.588688:0:88577:0:(client.c:2117:ptlrpc_check_set()) Process leaving (rc=0 : 0 : 0) 00000100:00000001:10.0:1623797933.588688:0:88577:0:(client.c:1703:ptlrpc_check_set()) Process entered 00000100:00000001:10.0:1623797933.588688:0:88577:0:(client.c:2624:ptlrpc_unregister_reply()) Process leaving (rc=1 : 1 : 1) 00000100:00000001:10.0:1623797933.588689:0:88577:0:(client.c:1172:ptlrpc_import_delay_req()) Process entered 00000100:00000001:10.0:1623797933.588689:0:88577:0:(client.c:1227:ptlrpc_import_delay_req()) Process leaving (rc=1 : 1 : 1) 00000100:00000001:10.0:1623797933.588689:0:88577:0:(client.c:2117:ptlrpc_check_set()) Process leaving (rc=0 : 0 : 0) 00000100:00000001:10.0:1623797934.588671:0:88577:0:(client.c:2206:ptlrpc_expired_set()) Process entered 00000100:00000001:10.0:1623797934.588672:0:88577:0:(client.c:2241:ptlrpc_expired_set()) Process leaving (rc=1 : 1 : 1) 00000100:00000001:10.0:1623797934.588674:0:88577:0:(client.c:2294:ptlrpc_set_next_timeout()) Process entered 00000100:00000001:10.0:1623797934.588674:0:88577:0:(client.c:2330:ptlrpc_set_next_timeout()) Process leaving (rc=0 : 0 : 0) 00000100:00000001:10.0:1623797934.588675:0:88577:0:(client.c:1703:ptlrpc_check_set()) Process entered 00000100:00000001:10.0:1623797934.588675:0:88577:0:(client.c:2624:ptlrpc_unregister_reply()) Process leaving (rc=1 : 1 : 1) 00000100:00000001:10.0:1623797934.588676:0:88577:0:(client.c:1172:ptlrpc_import_delay_req()) Process entered 00000100:00000001:10.0:1623797934.588676:0:88577:0:(client.c:1227:ptlrpc_import_delay_req()) Process leaving (rc=1 : 1 : 1) 00000100:00000001:10.0:1623797934.588677:0:88577:0:(client.c:2117:ptlrpc_check_set()) Process leaving (rc=0 : 0 : 
0) 00000100:00000001:10.0:1623797934.588677:0:88577:0:(client.c:1703:ptlrpc_check_set()) Process entered 00000100:00000001:10.0:1623797934.588677:0:88577:0:(client.c:2624:ptlrpc_unregister_reply()) Process leaving (rc=1 : 1 : 1) 00000100:00000001:10.0:1623797934.588678:0:88577:0:(client.c:1172:ptlrpc_import_delay_req()) Process entered 00000100:00000001:10.0:1623797934.588678:0:88577:0:(client.c:1227:ptlrpc_import_delay_req()) Process leaving (rc=1 : 1 : 1) 00000100:00000001:10.0:1623797934.588679:0:88577:0:(client.c:2117:ptlrpc_check_set()) Process leaving (rc=0 : 0 : 0)

This should be caused by a failed reconnection between OST0135 and MDT0000 (over the LWP connection):

int ptlrpc_check_set(const struct lu_env *env, struct ptlrpc_request_set *set)
{
...
if (req->rq_phase == RQ_PHASE_RPC) {
...
spin_lock(&imp->imp_lock);
if (ptlrpc_import_delay_req(imp, req, &status)) { /* <--- here, ptlrpc_import_delay_req() returns 1 */
/* put on delay list - only if we wait
* recovery finished - before send */
list_del_init(&req->rq_list);
list_add_tail(&req->rq_list,
&imp->imp_delayed_list);
spin_unlock(&imp->imp_lock);
continue;
}
...
}
...
Could you please get the output of "cat /proc/fs/lustre/mdt/oak-MDT0000/exports/xxx/export" on MDT0000? |
| Comment by Stephane Thiell [ 24/Jun/21 ] |
|
Hi Hongchao, This filesystem has 6 MDTs, indices 0 to 5. We can see that for the other MDTs, the OST is connected:

# clush -w @mds 'grep -H OST0135 /proc/fs/lustre/mdt/oak-MDT*/exports/10.0.2.104@o2ib5/export'
oak-md1-s2: /proc/fs/lustre/mdt/oak-MDT0003/exports/10.0.2.104@o2ib5/export:oak-MDT0003-lwp-OST0135_UUID:
oak-md1-s1: /proc/fs/lustre/mdt/oak-MDT0001/exports/10.0.2.104@o2ib5/export:oak-MDT0001-lwp-OST0135_UUID:
oak-md1-s1: /proc/fs/lustre/mdt/oak-MDT0002/exports/10.0.2.104@o2ib5/export:oak-MDT0002-lwp-OST0135_UUID:
oak-md2-s2: /proc/fs/lustre/mdt/oak-MDT0005/exports/10.0.2.104@o2ib5/export:oak-MDT0005-lwp-OST0135_UUID:
oak-md2-s1: /proc/fs/lustre/mdt/oak-MDT0004/exports/10.0.2.104@o2ib5/export:oak-MDT0004-lwp-OST0135_UUID:

For example, here is the export for MDT0001:

oak-MDT0001-lwp-OST0135_UUID:
name: oak-MDT0001
client: 10.0.2.104@o2ib5
connect_flags: [ version, adaptive_timeouts, lru_resize, fid_is_enabled, full20, lvb_type, lightweight_conn, lfsck, bulk_mbits ]
connect_data:
flags: 0x2041401043000020
instance: 47
target_version: 2.12.6.0
export_flags: [ ]
Is there anything more we could do to help troubleshoot this without having to restart the OST? For example, is there a way to reset the LWP connection between MDT0000 and OST0135 somehow? Thanks again for taking the time to look at this! |
| Comment by Stephane Thiell [ 24/Jun/21 ] |
|
Also, the OSP connection between MDT0000 and OST0135 seems initialized, unlike the LWP:

[root@oak-md1-s2 ~]# cat /sys/fs/lustre/osp/oak-OST0135-osc-MDT0000/active
1
[root@oak-md1-s2 ~]# cat /sys/fs/lustre/osp/oak-OST0135-osc-MDT0000/uuid
oak-MDT0000-mdtlov_UUID

It seems like this could be a consequence of the config llog problem we have been having, which I reported in LU-14695. |
| Comment by Hongchao Zhang [ 25/Jun/21 ] |
|
I have looked at the code related to LWP, but I can't find a way to force a reconnection yet, sorry! |
| Comment by Sergey Cheremencev [ 23/Nov/22 ] |
|
If you had pool quotas here, this looks similar to