[LU-16105] sanityn: timeout while umounting clients at the end of the tests (on b2_12) Created: 24/Aug/22  Updated: 24/Aug/22

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Maloo Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This issue was created by maloo for eaujames <eaujames@ddn.com>

This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/cd7e7af7-223c-4f05-8b0e-d56318504469

The client hangs on umount:

[17478.771636] umount          D ffff9862f4a8e8e0     0  2998   2991 0x00000080
[17478.772966] Call Trace:
[17478.773430]  [<ffffffff84389179>] schedule+0x29/0x70
[17478.774322]  [<ffffffff84386d88>] schedule_timeout+0x168/0x2d0
[17478.775359]  [<ffffffff83cad8b0>] ? requeue_timers+0x170/0x170
[17478.776404]  [<ffffffffc0cfe809>] ll_kill_super+0x99/0x180 [lustre]
[17478.777555]  [<ffffffffc087b5dd>] lustre_kill_super+0x3d/0x50 [obdclass]
[17478.778747]  [<ffffffff83e5173e>] deactivate_locked_super+0x4e/0x70
[17478.779860]  [<ffffffff83e51ec6>] deactivate_super+0x46/0x60
[17478.780873]  [<ffffffff83e712ff>] cleanup_mnt+0x3f/0x80
[17478.781809]  [<ffffffff83e71392>] __cleanup_mnt+0x12/0x20
[17478.782766]  [<ffffffff83cc299b>] task_work_run+0xbb/0xe0
[17478.783720]  [<ffffffff83c2cc65>] do_notify_resume+0xa5/0xc0
[17478.784739]  [<ffffffff843962ef>] int_signal+0x12/0x17

This is because a statahead thread is hung:

[17478.577512] ll_sa_29315     S ffff98633b055860     0 29348      2 0x00000080
[17478.578841] Call Trace:
[17478.579302]  [<ffffffff84389179>] schedule+0x29/0x70
[17478.580201]  [<ffffffffc0a2e37e>] ldlm_lock_decref_internal+0x95e/0xa30 [ptlrpc]
[17478.581514]  [<ffffffff83cc6f50>] ? wake_up_atomic_t+0x30/0x30
[17478.582567]  [<ffffffffc0a3bc85>] failed_lock_cleanup.isra.20+0x95/0x230 [ptlrpc]
[17478.583922]  [<ffffffffc0a3ecec>] ldlm_cli_enqueue_fini+0x14c/0xe00 [ptlrpc]
[17478.585363]  [<ffffffffc0868721>] ? lprocfs_counter_sub+0xc1/0x130 [obdclass]
[17478.586635]  [<ffffffffc0a41d91>] ldlm_cli_enqueue+0x421/0x930 [ptlrpc]
[17478.587822]  [<ffffffffc0a3ce70>] ? ldlm_expired_completion_wait+0x220/0x220 [ptlrpc]
[17478.589334]  [<ffffffffc0d16a40>] ? ll_md_need_convert+0x1b0/0x1b0 [lustre]
[17478.590607]  [<ffffffffc0ba73f0>] ? mdc_changelog_cdev_finish+0x210/0x210 [mdc]
[17478.591890]  [<ffffffffc0ba1a1a>] mdc_enqueue_base+0x33a/0x1cf0 [mdc]
[17478.593043]  [<ffffffffc0ba3b35>] mdc_intent_lock+0x135/0x560 [mdc]
[17478.594174]  [<ffffffffc0d16a40>] ? ll_md_need_convert+0x1b0/0x1b0 [lustre]
[17478.595431]  [<ffffffffc0a3ce70>] ? ldlm_expired_completion_wait+0x220/0x220 [ptlrpc]
[17478.596820]  [<ffffffffc0ba73f0>] ? mdc_changelog_cdev_finish+0x210/0x210 [mdc]
[17478.598114]  [<ffffffffc0b91233>] mdc_read_page+0xb3/0x970 [mdc]
[17478.599191]  [<ffffffff83ce594a>] ? check_preempt_wakeup+0x12a/0x220
[17478.600361]  [<ffffffffc09fde23>] lmv_striped_read_page.isra.34+0x528/0xaac [lmv]
[17478.601696]  [<ffffffffc09ed2e0>] lmv_read_page+0x330/0x3a0 [lmv]
[17478.602795]  [<ffffffffc0cd97fc>] ll_get_dir_page+0xac/0x1b0 [lustre]
[17478.603946]  [<ffffffffc0d16a40>] ? ll_md_need_convert+0x1b0/0x1b0 [lustre]
[17478.605206]  [<ffffffffc0d2b9c7>] ll_statahead_thread+0x357/0x1c70 [lustre]
[17478.606454]  [<ffffffff83ce482c>] ? update_curr+0x14c/0x1e0
[17478.607456]  [<ffffffff83ce22d9>] ? pick_next_entity+0xa9/0x190
[17478.608510]  [<ffffffff83ce26ec>] ? set_next_entity+0x3c/0xe0
[17478.609559]  [<ffffffff83c2b59e>] ? __switch_to+0xce/0x580
[17478.610545]  [<ffffffff83cdadf0>] ? wake_up_state+0x20/0x20
[17478.611563]  [<ffffffffc0d2b670>] ? ll_agl_thread+0x4f0/0x4f0 [lustre]
[17478.612747]  [<ffffffff83cc5e61>] kthread+0xd1/0xe0
[17478.613629]  [<ffffffff83cc5d90>] ? insert_kthread_work+0x40/0x40
[17478.614721]  [<ffffffff84395df7>] ret_from_fork_nospec_begin+0x21/0x21
[17478.615882]  [<ffffffff83cc5d90>] ? insert_kthread_work+0x40/0x40


 Comments   
Comment by Etienne Aujames [ 24/Aug/22 ]

The culprit seems to be "sanityn test_35":

[10553.623983] Lustre: *** cfs_fail_loc=317, val=2147484440***
[10553.625165] LustreError: 29348:0:(libcfs_fail.h:165:cfs_race()) cfs_race id 318 sleeping
[10553.857801] Lustre: DEBUG MARKER: /usr/sbin/lctl mark Wait for 29313 29315 for 60 sec...
[10554.093025] Lustre: DEBUG MARKER: Wait for 29313 29315 for 60 sec...
[10614.124212] Lustre: DEBUG MARKER: lctl set_param fail_loc=0x0
[10614.384327] Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0 	    fail_val=0 2>/dev/null
[10614.697061] Lustre: DEBUG MARKER: rc=0; val=$(/usr/sbin/lctl get_param -n catastrophe 2>&1); if [[ $? -eq 0 && $val -ne 0 ]]; then echo $(hostname -s): $val; rc=$val; fi; exit $rc
[10616.469762] Lustre: DEBUG MARKER: dmesg
[10617.167284] Lustre: DEBUG MARKER: /usr/sbin/lctl mark == sanityn test 36: handle ESTALE\/open-unlink correctly ============================================== 19:59:47 \(1661284787\)
[10617.398565] Lustre: DEBUG MARKER: == sanityn test 36: handle ESTALE/open-unlink correctly ============================================== 19:59:47 (1661284787)

Thread 29348 is never woken up by test_35 (CFS_RACE).
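For context on why the sleep can become unbounded: cfs_race() pairs two threads on the same race id, with the first thread to hit the id sleeping until a second thread hits the same id and wakes it. Below is a minimal user-space analogue of that pairing (a pthread-based sketch with made-up names, not the libcfs implementation) showing why the statahead thread, whose partner call never arrived, blocked forever:

/*
 * Minimal user-space analogue of the cfs_race() pairing behind the
 * "cfs_race id 318 sleeping" message above.  Build: cc -pthread race.c
 */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t race_lock  = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  race_waitq = PTHREAD_COND_INITIALIZER;
static int race_state;			/* 1 = a sleeper is waiting */

static void race(unsigned int id)
{
	pthread_mutex_lock(&race_lock);
	if (!race_state) {
		/* first caller: sleep until the partner shows up */
		fprintf(stderr, "cfs_race id %x sleeping\n", id);
		race_state = 1;
		while (race_state)
			pthread_cond_wait(&race_waitq, &race_lock);
		fprintf(stderr, "cfs_race id %x awake\n", id);
	} else {
		/* second caller: wake the sleeper */
		fprintf(stderr, "cfs_race id %x waking\n", id);
		race_state = 0;
		pthread_cond_broadcast(&race_waitq);
	}
	pthread_mutex_unlock(&race_lock);
}

static void *sleeper(void *arg)
{
	(void)arg;
	race(0x318);			/* sleeps: partner not here yet */
	return NULL;
}

int main(void)
{
	pthread_t t;

	pthread_create(&t, NULL, sleeper, NULL);
	sleep(1);
	race(0x318);			/* comment this out to reproduce the hang */
	pthread_join(t, NULL);
	return 0;
}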

This could be mitigated by backporting:
https://review.whamcloud.com/43161 ("LU-13358 libcfs: add timeout to cfs_race() to fix race")
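The patch above adds a timeout to the cfs_race() wait. The sketch below illustrates the same idea in user space (hypothetical names and an arbitrary 5-second timeout; not a copy of the actual libcfs change): a bounded wait turns a missed partner wakeup into a short delay instead of a permanent hang that later blocks umount:

/*
 * Sketch of the timeout-based mitigation.  Build: cc -pthread race_timed.c
 * With no partner call, the sleeper returns after ~5 s with ETIMEDOUT.
 */
#include <errno.h>
#include <pthread.h>
#include <stdio.h>
#include <time.h>

static pthread_mutex_t race_lock  = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  race_waitq = PTHREAD_COND_INITIALIZER;
static int race_state;

static void race_timed(unsigned int id)
{
	struct timespec deadline;
	int rc = 0;

	clock_gettime(CLOCK_REALTIME, &deadline);
	deadline.tv_sec += 5;		/* give up after 5 seconds */

	pthread_mutex_lock(&race_lock);
	if (!race_state) {
		fprintf(stderr, "cfs_race id %x sleeping\n", id);
		race_state = 1;
		while (race_state && rc != ETIMEDOUT)
			rc = pthread_cond_timedwait(&race_waitq, &race_lock,
						    &deadline);
		race_state = 0;		/* clean up if the partner never came */
		fprintf(stderr, "cfs_race id %x awake: rc=%d\n", id, rc);
	} else {
		race_state = 0;
		pthread_cond_broadcast(&race_waitq);
	}
	pthread_mutex_unlock(&race_lock);
}

int main(void)
{
	/* no partner ever calls race_timed(): no hang, just a bounded wait */
	race_timed(0x318);
	return 0;
}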
