[LU-16105] sanityn: timeout while umounting clients at the end of the tests (on b2_12) Created: 24/Aug/22 Updated: 24/Aug/22 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Maloo | Assignee: | WC Triage |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
This issue was created by maloo for eaujames <eaujames@ddn.com>

This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/cd7e7af7-223c-4f05-8b0e-d56318504469

The client hangs on umount:

[17478.771636] umount D ffff9862f4a8e8e0 0 2998 2991 0x00000080
[17478.772966] Call Trace:
[17478.773430] [<ffffffff84389179>] schedule+0x29/0x70
[17478.774322] [<ffffffff84386d88>] schedule_timeout+0x168/0x2d0
[17478.775359] [<ffffffff83cad8b0>] ? requeue_timers+0x170/0x170
[17478.776404] [<ffffffffc0cfe809>] ll_kill_super+0x99/0x180 [lustre]
[17478.777555] [<ffffffffc087b5dd>] lustre_kill_super+0x3d/0x50 [obdclass]
[17478.778747] [<ffffffff83e5173e>] deactivate_locked_super+0x4e/0x70
[17478.779860] [<ffffffff83e51ec6>] deactivate_super+0x46/0x60
[17478.780873] [<ffffffff83e712ff>] cleanup_mnt+0x3f/0x80
[17478.781809] [<ffffffff83e71392>] __cleanup_mnt+0x12/0x20
[17478.782766] [<ffffffff83cc299b>] task_work_run+0xbb/0xe0
[17478.783720] [<ffffffff83c2cc65>] do_notify_resume+0xa5/0xc0
[17478.784739] [<ffffffff843962ef>] int_signal+0x12/0x17

This is because a statahead thread is hung:

[17478.577512] ll_sa_29315 S ffff98633b055860 0 29348 2 0x00000080
[17478.578841] Call Trace:
[17478.579302] [<ffffffff84389179>] schedule+0x29/0x70
[17478.580201] [<ffffffffc0a2e37e>] ldlm_lock_decref_internal+0x95e/0xa30 [ptlrpc]
[17478.581514] [<ffffffff83cc6f50>] ? wake_up_atomic_t+0x30/0x30
[17478.582567] [<ffffffffc0a3bc85>] failed_lock_cleanup.isra.20+0x95/0x230 [ptlrpc]
[17478.583922] [<ffffffffc0a3ecec>] ldlm_cli_enqueue_fini+0x14c/0xe00 [ptlrpc]
[17478.585363] [<ffffffffc0868721>] ? lprocfs_counter_sub+0xc1/0x130 [obdclass]
[17478.586635] [<ffffffffc0a41d91>] ldlm_cli_enqueue+0x421/0x930 [ptlrpc]
[17478.587822] [<ffffffffc0a3ce70>] ? ldlm_expired_completion_wait+0x220/0x220 [ptlrpc]
[17478.589334] [<ffffffffc0d16a40>] ? ll_md_need_convert+0x1b0/0x1b0 [lustre]
[17478.590607] [<ffffffffc0ba73f0>] ? mdc_changelog_cdev_finish+0x210/0x210 [mdc]
[17478.591890] [<ffffffffc0ba1a1a>] mdc_enqueue_base+0x33a/0x1cf0 [mdc]
[17478.593043] [<ffffffffc0ba3b35>] mdc_intent_lock+0x135/0x560 [mdc]
[17478.594174] [<ffffffffc0d16a40>] ? ll_md_need_convert+0x1b0/0x1b0 [lustre]
[17478.595431] [<ffffffffc0a3ce70>] ? ldlm_expired_completion_wait+0x220/0x220 [ptlrpc]
[17478.596820] [<ffffffffc0ba73f0>] ? mdc_changelog_cdev_finish+0x210/0x210 [mdc]
[17478.598114] [<ffffffffc0b91233>] mdc_read_page+0xb3/0x970 [mdc]
[17478.599191] [<ffffffff83ce594a>] ? check_preempt_wakeup+0x12a/0x220
[17478.600361] [<ffffffffc09fde23>] lmv_striped_read_page.isra.34+0x528/0xaac [lmv]
[17478.601696] [<ffffffffc09ed2e0>] lmv_read_page+0x330/0x3a0 [lmv]
[17478.602795] [<ffffffffc0cd97fc>] ll_get_dir_page+0xac/0x1b0 [lustre]
[17478.603946] [<ffffffffc0d16a40>] ? ll_md_need_convert+0x1b0/0x1b0 [lustre]
[17478.605206] [<ffffffffc0d2b9c7>] ll_statahead_thread+0x357/0x1c70 [lustre]
[17478.606454] [<ffffffff83ce482c>] ? update_curr+0x14c/0x1e0
[17478.607456] [<ffffffff83ce22d9>] ? pick_next_entity+0xa9/0x190
[17478.608510] [<ffffffff83ce26ec>] ? set_next_entity+0x3c/0xe0
[17478.609559] [<ffffffff83c2b59e>] ? __switch_to+0xce/0x580
[17478.610545] [<ffffffff83cdadf0>] ? wake_up_state+0x20/0x20
[17478.611563] [<ffffffffc0d2b670>] ? ll_agl_thread+0x4f0/0x4f0 [lustre]
[17478.612747] [<ffffffff83cc5e61>] kthread+0xd1/0xe0
[17478.613629] [<ffffffff83cc5d90>] ? insert_kthread_work+0x40/0x40
[17478.614721] [<ffffffff84395df7>] ret_from_fork_nospec_begin+0x21/0x21
[17478.615882] [<ffffffff83cc5d90>] ? insert_kthread_work+0x40/0x40
|
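For illustration, a minimal userspace C model of the dependency behind this hang (all identifiers below are invented for the sketch, not Lustre code): per the first trace, ll_kill_super() cannot return until the ll_sa_* statahead kthread exits, so a statahead thread parked on a wait queue that nothing will ever signal pins the umount forever.

#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t waitq = PTHREAD_COND_INITIALIZER;
static int wake_me;     /* never set: the waker never arrives */

/* Models the ll_sa_* kthread parked in ldlm_lock_decref_internal(). */
static void *statahead_thread(void *arg)
{
        pthread_mutex_lock(&lock);
        while (!wake_me)
                pthread_cond_wait(&waitq, &lock);
        pthread_mutex_unlock(&lock);
        return NULL;
}

int main(void)
{
        pthread_t sa;

        pthread_create(&sa, NULL, statahead_thread, NULL);
        /* Models ll_kill_super(): umount cannot complete until the
         * statahead thread exits, so this join blocks forever. */
        pthread_join(sa, NULL);
        return 0;
}

Running this program hangs at pthread_join(), just as the umount process above hangs in ll_kill_super() waiting for ll_sa_29315.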
| Comments |
| Comment by Etienne Aujames [ 24/Aug/22 ] |
|
The culprit seems to be "sanityn test_35":

[10553.623983] Lustre: *** cfs_fail_loc=317, val=2147484440***
[10553.625165] LustreError: 29348:0:(libcfs_fail.h:165:cfs_race()) cfs_race id 318 sleeping
[10553.857801] Lustre: DEBUG MARKER: /usr/sbin/lctl mark Wait for 29313 29315 for 60 sec...
[10554.093025] Lustre: DEBUG MARKER: Wait for 29313 29315 for 60 sec...
[10614.124212] Lustre: DEBUG MARKER: lctl set_param fail_loc=0x0
[10614.384327] Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0 fail_val=0 2>/dev/null
[10614.697061] Lustre: DEBUG MARKER: rc=0; val=$(/usr/sbin/lctl get_param -n catastrophe 2>&1); if [[ $? -eq 0 && $val -ne 0 ]]; then echo $(hostname -s): $val; rc=$val; fi; exit $rc
[10616.469762] Lustre: DEBUG MARKER: dmesg
[10617.167284] Lustre: DEBUG MARKER: /usr/sbin/lctl mark == sanityn test 36: handle ESTALE\/open-unlink correctly ============================================== 19:59:47 \(1661284787\)
[10617.398565] Lustre: DEBUG MARKER: == sanityn test 36: handle ESTALE/open-unlink correctly ============================================== 19:59:47 (1661284787)

Thread 29348 is never woken up by test_35 (CFS_RACE). This could be mitigated by backporting: |
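For background, a hedged userspace sketch of the CFS_RACE pairing semantics involved here (race_check() and every other identifier below are invented for the model; this is not the libcfs implementation): the first thread to hit an armed race id parks and only a second thread hitting the same id wakes it. Disarming the fail point (fail_loc=0x0) stops new threads from entering the race, but it does not signal a sleeper that is already parked, which is how thread 29348 was left behind.

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t race_waitq = PTHREAD_COND_INITIALIZER;
static unsigned int fail_loc;   /* 0x318 arms the race, 0 disarms it */
static int sleeper_parked;      /* set while the first thread sleeps */

/* First thread hitting an armed id parks; the second one wakes it. */
static void race_check(unsigned int id)
{
        pthread_mutex_lock(&lock);
        if (fail_loc != id) {           /* disarmed: no-op for new threads */
                pthread_mutex_unlock(&lock);
                return;
        }
        if (!sleeper_parked) {
                sleeper_parked = 1;
                fprintf(stderr, "race id %#x sleeping\n", id);
                while (sleeper_parked)  /* waits for a partner thread */
                        pthread_cond_wait(&race_waitq, &lock);
        } else {
                sleeper_parked = 0;     /* partner arrived: wake the sleeper */
                pthread_cond_broadcast(&race_waitq);
        }
        pthread_mutex_unlock(&lock);
}

static void *statahead_thread(void *arg)
{
        race_check(0x318);              /* parks here, like PID 29348 */
        return NULL;
}

int main(void)
{
        pthread_t sa;

        fail_loc = 0x318;               /* the test arms the race */
        pthread_create(&sa, NULL, statahead_thread, NULL);
        sleep(1);                       /* let the thread park */

        pthread_mutex_lock(&lock);
        fail_loc = 0;                   /* cleanup disarms, but wakes nobody */
        pthread_mutex_unlock(&lock);

        pthread_join(sa, NULL);         /* hangs: the partner never ran */
        return 0;
}

In this model the only way to release the sleeper is for a partner thread to call race_check(0x318) while it is parked; clearing fail_loc afterwards, as the test_35 cleanup does, leaves it blocked for good.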