[LU-14694] racer timeouts with LU-10948 Created: 20/May/21 Updated: 10/Aug/21 Resolved: 10/Aug/21 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Upstream |
| Fix Version/s: | Lustre 2.15.0 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Alex Zhuravlev | Assignee: | Alex Zhuravlev |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Issue Links: | |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
many racer runs time out; bisected that to the patch from LU-10948. |
| Comments |
| Comment by Andreas Dilger [ 20/May/21 ] |
|
Alex, what kind of timeout? Stuck processes (with stack?), unable to unmount, oops, other? Single client mountpoint or multiple? Previously Oleg and I had discussed whether opencache locks needed to time out sooner. Could you try running with ldlm.namespaces.*.lru_max_age=30 to see if that helps? |
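(For reference, the tunable above can be set at runtime with lctl; a minimal sketch, assuming it is run directly on the client node:)

# cut the DLM lock LRU max age down so cached locks, opencache locks
# included, are cancelled sooner; 30 is the value suggested above
lctl set_param ldlm.namespaces.*.lru_max_age=30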
| Comment by Andreas Dilger [ 20/May/21 ] |
|
Also, is this with opencache enabled, or just with the patch applied? |
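(A hedged sketch of how one might check that on the client; the parameter name is an assumption based on the LU-10948 opencache work, so verify it against your build:)

# assumed tunable from LU-10948: a non-zero threshold enables the
# open-count-based opencache heuristic on the client
lctl get_param llite.*.opencache_threshold_count
lctl set_param llite.*.opencache_threshold_count=100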
| Comment by Alex Zhuravlev [ 20/May/21 ] |
...
Resetting fail_loc on all nodes...done.
PASS 1 (2720s)
== racer test complete, duration 2768 sec ============================================================ 16:47:13 (1621529233)
Stopping clients: tmp.xMEAXwSwZi /mnt/lustre (opts:-f)
Stopping client tmp.xMEAXwSwZi /mnt/lustre opts:-f
Stopping clients: tmp.xMEAXwSwZi /mnt/lustre2 (opts:-f)
Stopping client tmp.xMEAXwSwZi /mnt/lustre2 opts:-f
Stopping /mnt/lustre-mds1 (opts:-f) on tmp.xMEAXwSwZi
modprobe: FATAL: Module dm-flakey not found.
Stopping /mnt/lustre-mds2 (opts:-f) on tmp.xMEAXwSwZi

then unique traces:

schedule,do_select,trace_hardirqs_off_thunk
  PIDs(1): "sshd":2287
schedule,do_select,_copy_to_user
  PIDs(1): "sshd":2587
schedule,top_trans_stop,lod_trans_stop,mdd_trans_stop,mdt_mfd_close,mdt_obd_disconnect,class_fail_export,ping_evictor_main
  PIDs(1): "ll_evictor":7754
schedule,ptlrpc_set_wait,ptlrpc_queue_wait,osp_send_update_req,osp_send_update_thread
  PIDs(1): "osp_up0-1":7989
schedule,top_trans_stop,lod_trans_stop,mdd_trans_stop,mdt_mfd_close,mdt_obd_disconnect,class_disconnect_export_list,class_disconnect_exports,class_cleanup,class_process_config,class_manual_cleanup,server_put_super,generic_shutdown_super,kill_anon_super,deactivate_locked_super,cleanup_mnt,task_work_run,exit_to_usermode_loop
  PIDs(1): "umount":535030
| Comment by Alex Zhuravlev [ 20/May/21 ] |
|
basically it's MDSCOUNT=2 REFORMAT=yes bash racer.sh |
| Comment by Alex Zhuravlev [ 20/May/21 ] |
this is the latest clean master |
| Comment by Alex Zhuravlev [ 21/May/21 ] |
|
sanity/29 fails as well:

== sanity test 29: IT_GETATTR regression ============================================================ 11:52:13 (1621597933)
Lustre: DEBUG MARKER: == sanity test 29: IT_GETATTR regression ============================================================ 11:52:13 (1621597933)
striped dir -i1 -c2 -H all_char /mnt/lustre/d29
first d29
Lustre: DEBUG MARKER: first d29
total 0
-rw-r--r-- 1 root root 0 May 21 11:52 foo
second d29
Lustre: DEBUG MARKER: second d29
total 0
-rw-r--r-- 1 root root 0 May 21 11:52 foo
done
Lustre: DEBUG MARKER: done
sanity test_29: @@@@@@ FAIL: CURRENT: 10 > 9
Lustre: DEBUG MARKER: sanity test_29: @@@@@@ FAIL: CURRENT: 10 > 9
  Trace dump:
  = ./../tests/test-framework.sh:6280:error()
  = sanity.sh:3398:test_29()
  = ./../tests/test-framework.sh:6583:run_one()
  = ./../tests/test-framework.sh:6630:run_one_logged()
  = ./../tests/test-framework.sh:6457:run_test()
  = sanity.sh:3410:main()
Dumping lctl log to /tmp/ltest-logs/sanity.test_29.*.1621597935.log
Dumping logs only on local client.
Resetting fail_loc on all nodes...done.
FAIL 29 (2s)
| Comment by Alex Zhuravlev [ 24/May/21 ] |
--- a/lustre/tests/racer.sh
+++ b/lustre/tests/racer.sh
@@ -180,6 +180,7 @@ test_1() {
 		lss_cleanup
 	fi
+	cancel_lru_locks mdc
 	return $rrc
 }
 run_test 1 "racer on clients: ${CLIENTS:-$(hostname)} DURATION=$DURATION"
this helps. It looks like open-handle locks are lost at umount, leaving the MDT to close the corresponding open files during export cleanup. |
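(For context, cancel_lru_locks is a test-framework.sh helper; roughly, as a sketch rather than the exact helper, it clears the client-side DLM LRU for the given namespace:)

# rough standalone equivalent of the test-framework helper used above;
# "mdc" selects the MDC namespaces, i.e. cached metadata and open locks
cancel_lru_locks() {
	lctl set_param -n ldlm.namespaces.*$1*.lru_size=clear
}
cancel_lru_locks mdc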
| Comment by Alex Zhuravlev [ 24/May/21 ] |
|
basically the problem is that:

1) racer stops the clients with umount -f, which sets obd_force;
2) with obd_force set, the client cancels its locks locally only and never notifies the MDT:

	/* obd_force == local only */
	ldlm_cli_cancel_unused(obd->obd_namespace, NULL,
			       obd->obd_force ? LCF_LOCAL : 0, NULL);

3) thus the MDT has to clear those "lost" locks and close the directories locally.

I "solved" the problem by disabling opencache for directories:

--- a/lustre/mdt/mdt_open.c
+++ b/lustre/mdt/mdt_open.c
@@ -820,6 +820,9 @@ static int mdt_object_open_lock(struct mdt_thread_info *info,
 	if (req_is_replay(mdt_info_req(info)))
 		RETURN(0);
 
+	if (S_ISDIR(lu_object_attr(&obj->mot_obj)))
+		open_flags &= ~MDS_OPEN_LOCK;
+
 	if (S_ISREG(lu_object_attr(&obj->mot_obj))) {

but it looks like we've got a few real problems to solve here. The most serious one, IMO, is to handle such a close on the MDT in a better manner, so that umount doesn't get stuck indefinitely. |
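(In other words, the racer.sh change above amounts to the following manual workaround at unmount time; a sketch, assuming a single client mount:)

# cancel cached MDC locks first, so open handles are dropped cleanly
# instead of being left for the MDT to close during export cleanup
lctl set_param -n ldlm.namespaces.*mdc*.lru_size=clear
umount -f /mnt/lustre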
| Comment by Alex Zhuravlev [ 24/May/21 ] |
|
probably a bit better option is to re-use the existing infrastructure to skip orphan removal:

--- a/lustre/mdt/mdt_handler.c
+++ b/lustre/mdt/mdt_handler.c
@@ -6438,7 +6438,8 @@ static int mdt_export_cleanup(struct obd_export *exp)
 			rc = mdt_ctxt_add_dirty_flag(&env, info, mfd);
 
 		/* Don't unlink orphan on failover umount, LU-184 */
-		if (exp->exp_flags & OBD_OPT_FAILOVER) {
+		if (exp->exp_flags & OBD_OPT_FAILOVER ||
+		    exp->exp_obd->obd_stopping) {
 			ma->ma_valid = MA_FLAGS;
 			ma->ma_attr_flags |= MDS_KEEP_ORPHAN;
 		}
| Comment by Alex Zhuravlev [ 25/May/21 ] |
|
I think we need something like the above on its own, as umount with a just-evicted client is a legal case. |
| Comment by Gerrit Updater [ 25/May/21 ] |
|
Alex Zhuravlev (bzzz@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/43783 |
| Comment by Gerrit Updater [ 10/Aug/21 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/43783/ |
| Comment by Peter Jones [ 10/Aug/21 ] |
|
Landed for 2.15 |