[LU-10904] racer hangs on umount Created: 11/Apr/18  Updated: 23/Nov/23  Resolved: 23/Nov/23

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.11.0, Lustre 2.10.4
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: James Nunez (Inactive) Assignee: WC Triage
Resolution: Cannot Reproduce Votes: 0
Labels: None

Issue Links:
Related
is related to LU-10543 racer: /mnt/lustre2 is still busy, wa... Open
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

racer completes all tests, but hangs while unmounting the file system. We see this problem frequently.

 

On the client console, we see the umount command hung

[41678.588785] umount          D 0000000000000000     0 13439  13432 0x00000000
[41678.588786]  ffff880058147da8 ffff880058147de0 ffff880063b84cc0 ffff880058148000
[41678.588788]  ffff880058147de0 00000001009dda10 ffff88007fc10840 0000000000000000
[41678.588789]  ffff880058147dc0 ffffffff81611eb5 ffff88007fc10840 ffff880058147e68
[41678.588789] Call Trace:
[41678.588792]  [<ffffffff81611eb5>] schedule+0x35/0x80
[41678.588794]  [<ffffffff81614c71>] schedule_timeout+0x161/0x2d0
[41678.589164]  [<ffffffffa0d34837>] ll_kill_super+0x77/0x150 [lustre]
[41678.589219]  [<ffffffffa0838054>] lustre_kill_super+0x34/0x40 [obdclass]
[41678.589247]  [<ffffffff8120d4ff>] deactivate_locked_super+0x3f/0x70
[41678.589260]  [<ffffffff812289cb>] cleanup_mnt+0x3b/0x80
[41678.589264]  [<ffffffff8109d198>] task_work_run+0x78/0x90
[41678.589267]  [<ffffffff8107b60f>] exit_to_usermode_loop+0x91/0xc2
[41678.589304]  [<ffffffff81003ae5>] syscall_return_slowpath+0x85/0xa0
[41678.589327]  [<ffffffff816160a8>] int_ret_from_sys_call+0x25/0x9f
[41678.591515] DWARF2 unwinder stuck at int_ret_from_sys_call+0x25/0x9f

 

The stack trace for the hung umount on the client is common to all of these failures; what varies is the content of the other console logs.
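
From the client trace, umount has reached deactivate_locked_super() and is sleeping in schedule_timeout() called from ll_kill_super(). The fragment below is only a hypothetical sketch of that general "timed-poll until the superblock is idle" teardown pattern, not the actual Lustre code; the structure and field names are invented for illustration:

/*
 * Hypothetical sketch of the wait visible in the umount trace
 * (schedule_timeout() reached from ll_kill_super()) -- NOT the Lustre
 * implementation, just an illustration of a timed-poll teardown.
 */
#include <linux/atomic.h>
#include <linux/jiffies.h>
#include <linux/wait.h>

/* hypothetical per-mount state, invented for this illustration only */
struct demo_sb_info {
	atomic_t		busy;		/* work still referencing the sb */
	wait_queue_head_t	umount_waitq;	/* woken when busy drops to 0 */
};

static void demo_kill_super_wait(struct demo_sb_info *sbi)
{
	/*
	 * Re-check roughly once per second; each sleep ends up in
	 * schedule_timeout(), which is where the hung umount is parked.
	 * If "busy" never reaches zero, umount never returns.
	 */
	while (atomic_read(&sbi->busy) != 0)
		wait_event_timeout(sbi->umount_waitq,
				   atomic_read(&sbi->busy) == 0, HZ);
}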

 

In some cases, there is little information in the console logs as to why umount is hung. The following failures are examples of this

https://testing.hpdd.intel.com/test_sets/1a8349fe-3780-11e8-960d-52540065bddc

https://testing.hpdd.intel.com/test_sets/49efa63a-2e87-11e8-b74b-52540065bddc

 

 

Other racer hangs on unmount have more information in the client and MDS console logs.

 

On the MDS console log, we see two hung processes

[41561.767660] jbd2/vda1-8     D ffff880036476eb0     0   263      2 0x00000000
[41561.768587] Call Trace:
[41561.768925]  [<ffffffff816b2060>] ? bit_wait+0x50/0x50
[41561.769652]  [<ffffffff816b40e9>] schedule+0x29/0x70
[41561.770284]  [<ffffffff816b1a49>] schedule_timeout+0x239/0x2c0
[41561.771136]  [<ffffffff812fd2d0>] ? generic_make_request_checks+0x1a0/0x3a0
[41561.772022]  [<ffffffff81063f5e>] ? kvm_clock_get_cycles+0x1e/0x20
[41561.772875]  [<ffffffff816b2060>] ? bit_wait+0x50/0x50
[41561.773511]  [<ffffffff816b35ed>] io_schedule_timeout+0xad/0x130
[41561.774276]  [<ffffffff816b3688>] io_schedule+0x18/0x20
[41561.775031]  [<ffffffff816b2071>] bit_wait_io+0x11/0x50
[41561.775682]  [<ffffffff816b1b97>] __wait_on_bit+0x67/0x90
[41561.776361]  [<ffffffff816b2060>] ? bit_wait+0x50/0x50
[41561.777112]  [<ffffffff816b1c41>] out_of_line_wait_on_bit+0x81/0xb0
[41561.777910]  [<ffffffff810b5080>] ? wake_bit_function+0x40/0x40
[41561.778756]  [<ffffffff8123b3fa>] __wait_on_buffer+0x2a/0x30
[41561.779893]  [<ffffffffc00b87f1>] jbd2_journal_commit_transaction+0x1781/0x19b0 [jbd2]
[41561.780881]  [<ffffffff810c28a0>] ? finish_task_switch+0x50/0x170
[41561.781650]  [<ffffffffc00bdac9>] kjournald2+0xc9/0x260 [jbd2]
[41561.782385]  [<ffffffff810b4fc0>] ? wake_up_atomic_t+0x30/0x30
[41561.783226]  [<ffffffffc00bda00>] ? commit_timeout+0x10/0x10 [jbd2]
[41561.784024]  [<ffffffff810b4031>] kthread+0xd1/0xe0
[41561.784710]  [<ffffffff810b3f60>] ? insert_kthread_work+0x40/0x40
[41561.785478]  [<ffffffff816c0577>] ret_from_fork+0x77/0xb0
[41561.786238]  [<ffffffff810b3f60>] ? insert_kthread_work+0x40/0x40
…
[41561.874288] auditd          D ffff88007911bf40     0   464      1 0x00000000
[41561.875280] Call Trace:
[41561.875598]  [<ffffffff816b40e9>] schedule+0x29/0x70
[41561.876257]  [<ffffffffc00bd4a5>] jbd2_log_wait_commit+0xc5/0x140 [jbd2]
[41561.877183]  [<ffffffff810b4fc0>] ? wake_up_atomic_t+0x30/0x30
[41561.877931]  [<ffffffffc00bea92>] jbd2_complete_transaction+0x52/0xa0 [jbd2]
[41561.879009]  [<ffffffffc00e1d52>] ext4_sync_file+0x292/0x320 [ext4]
[41561.879805]  [<ffffffff81238477>] do_fsync+0x67/0xb0
[41561.880476]  [<ffffffff816c0655>] ? system_call_after_swapgs+0xa2/0x146
[41561.881309]  [<ffffffff81238760>] SyS_fsync+0x10/0x20
[41561.882029]  [<ffffffff816c0715>] system_call_fastpath+0x1c/0x21
[41561.882794]  [<ffffffff816c0661>] ? system_call_after_swapgs+0xae/0x146
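
Both MDS stacks appear to be waiting on the same jbd2 journal for vda1: the jbd2/vda1-8 kthread is stuck in __wait_on_buffer() inside jbd2_journal_commit_transaction() waiting for buffer I/O to complete, and auditd's fsync() is blocked in jbd2_log_wait_commit() waiting for that commit to finish, so the MDS looks stalled on local disk I/O rather than on anything Lustre-specific (both processes are in "D", uninterruptible sleep). For reference, the userspace side of what auditd is doing is just write() followed by fsync(); a minimal standalone example (the file path is arbitrary and only for demonstration):

/*
 * Minimal illustration of the path auditd is stuck on: write(), then
 * fsync(), which on ext4/ldiskfs waits for the jbd2 journal commit.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	int fd = open("/var/log/demo-audit.log",
		      O_WRONLY | O_CREAT | O_APPEND, 0644);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	const char msg[] = "audit record\n";
	if (write(fd, msg, strlen(msg)) < 0)
		perror("write");

	/* If the journal commit never completes (as in the MDS trace),
	 * this call never returns and the process shows up in D state. */
	if (fsync(fd) < 0)
		perror("fsync");

	close(fd);
	return 0;
}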

 

An example of this is

https://testing.hpdd.intel.com/test_sets/28935b90-2e9f-11e8-9e0e-52540065bddc

 

Other test hangs have additional information in the client console log. We see a couple of errors before racer completes

[38788.150606] Lustre: lustre-MDT0000-mdc-ffff880000015000: Connection restored to 10.2.8.165@tcp (at 10.2.8.165@tcp)
[38887.196337] LustreError: 8603:0:(statahead.c:1591:start_statahead_thread()) can't start ll_sa thread, rc: -4
[38920.523725] Lustre: 25748:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1523359053/real 1523359053]  req@ffff8800639b96c0 x1597353644599184/t0(0) o36->lustre-MDT0000-mdc-ffff88006424a000@10.2.8.165@tcp:12/10 lens 608/4768 e 0 to 1 dl 1523359107 ref 2 fl Rpc:IX/0/ffffffff rc 0/-1
[38920.523749] Lustre: 25748:0:(client.c:2114:ptlrpc_expire_one_request()) Skipped 2 previous similar messages
[38920.523799] Lustre: lustre-MDT0000-mdc-ffff88006424a000: Connection to lustre-MDT0000 (at 10.2.8.165@tcp) was lost; in progress operations using this service will wait for recovery to complete
[38920.523800] Lustre: Skipped 2 previous similar messages
[38920.523857] LustreError: 25748:0:(llite_lib.c:1511:ll_md_setattr()) md_setattr fails: rc = -4
[38920.531836] Lustre: lustre-MDT0000-mdc-ffff88006424a000: Connection restored to 10.2.8.165@tcp (at 10.2.8.165@tcp)
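
Both the start_statahead_thread() failure and the ll_md_setattr() failure report rc: -4, which on Linux is -EINTR ("Interrupted system call"); that is consistent with these operations being interrupted around the MDT connection loss and recovery shown in the same log. A quick standalone check of the mapping (not part of the test suite):

/* Print the symbolic meaning of the error code seen in the client log
 * (rc = -4).  On Linux, errno 4 is EINTR. */
#include <errno.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
	int rc = -4;                    /* value from the console log */

	printf("rc = %d -> %s (%s)\n", rc,
	       -rc == EINTR ? "EINTR" : "other",
	       strerror(-rc));
	return 0;
}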

 

An example of this is at

https://testing.hpdd.intel.com/test_sets/8209f4b2-3d46-11e8-8f8a-52540065bddc

 

Other clients have different information in the console log. Some have many of the following messages

[10286.777823] ll_sa_21479     D ffff880078c30000     0 21498      2 0x00000080
[10286.779544] Call Trace:
[10286.780839]  [<ffffffff816b40e9>] schedule+0x29/0x70
[10286.782475]  [<ffffffffc0cfa6d1>] ll_statahead_thread+0x651/0xbf0 [lustre]
[10286.784173]  [<ffffffffc0cfa080>] ? ll_agl_thread+0x3e0/0x3e0 [lustre]
[10286.785825]  [<ffffffff810b4031>] kthread+0xd1/0xe0
[10286.787342]  [<ffffffff810b3f60>] ? insert_kthread_work+0x40/0x40
[10286.788946]  [<ffffffff816c0577>] ret_from_fork+0x77/0xb0
[10286.790502]  [<ffffffff810b3f60>] ? insert_kthread_work+0x40/0x40
[10286.792104] ll_agl_21479    D ffff880078c35ee0     0 21499      2 0x00000080
[10286.793798] Call Trace:
[10286.795131]  [<ffffffff816b40e9>] schedule+0x29/0x70
[10286.796658]  [<ffffffffc0cf9f6b>] ll_agl_thread+0x2cb/0x3e0 [lustre]
[10286.798302]  [<ffffffffc0cf9ca0>] ? ll_agl_trigger+0x520/0x520 [lustre]
[10286.799953]  [<ffffffff810b4031>] kthread+0xd1/0xe0
[10286.801459]  [<ffffffff810b3f60>] ? insert_kthread_work+0x40/0x40
[10286.803059]  [<ffffffff816c0577>] ret_from_fork+0x77/0xb0
[10286.804593]  [<ffffffff810b3f60>] ? insert_kthread_work+0x40/0x40
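
The ll_sa (statahead) and ll_agl (async glimpse) service threads are kernel threads that park in schedule() until new work or a stop request wakes them; if teardown never wakes and stops them, they can keep the client mount busy at umount time. The fragment below is a generic, hypothetical sketch of that kthread pattern, not the Lustre implementation:

/*
 * Hypothetical sketch of the "service kthread parked until woken"
 * pattern seen in the ll_sa/ll_agl traces -- NOT the actual Lustre code.
 */
#include <linux/kthread.h>
#include <linux/sched.h>
#include <linux/types.h>
#include <linux/wait.h>

struct demo_worker {
	wait_queue_head_t waitq;	/* woken when work arrives or on stop */
	bool		  have_work;
};

static int demo_thread_fn(void *data)
{
	struct demo_worker *w = data;

	while (!kthread_should_stop()) {
		/* the ll_sa/ll_agl traces show the thread asleep here */
		wait_event_interruptible(w->waitq,
					 w->have_work || kthread_should_stop());
		if (w->have_work) {
			/* ... handle one batch of statahead/glimpse work ... */
			w->have_work = false;
		}
	}

	/*
	 * Teardown must wake the thread and call kthread_stop(); if that
	 * never happens, the thread stays parked and the mount stays busy.
	 */
	return 0;
}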

 

Logs for these types of failures are at

https://testing.hpdd.intel.com/test_sets/847d5946-3152-11e8-b6a0-52540065bddc

