Details
Type: Bug
Resolution: Cannot Reproduce
Priority: Minor
Fix Version/s: None
Affects Version/s: Lustre 2.11.0, Lustre 2.10.4
Labels: None
Severity: 3
Description
racer completes all tests, but hangs while unmounting the file system. We see this problem frequently.
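For reference, a reproduction sketch using the in-tree test scripts; the exact invocation and duration below are assumptions for illustration, not taken from this report:

# assumed reproduction steps; paths and DURATION are illustrative
cd lustre/tests
sh llmount.sh                 # set up and mount a test filesystem
DURATION=900 sh racer.sh      # run racer to completion
sh llmountcleanup.sh          # unmount -- this is where the hang appears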
On the client console, we see the umount command hung:
[41678.588785] umount D 0000000000000000 0 13439 13432 0x00000000
[41678.588786] ffff880058147da8 ffff880058147de0 ffff880063b84cc0 ffff880058148000
[41678.588788] ffff880058147de0 00000001009dda10 ffff88007fc10840 0000000000000000
[41678.588789] ffff880058147dc0 ffffffff81611eb5 ffff88007fc10840 ffff880058147e68
[41678.588789] Call Trace:
[41678.588792] [<ffffffff81611eb5>] schedule+0x35/0x80
[41678.588794] [<ffffffff81614c71>] schedule_timeout+0x161/0x2d0
[41678.589164] [<ffffffffa0d34837>] ll_kill_super+0x77/0x150 [lustre]
[41678.589219] [<ffffffffa0838054>] lustre_kill_super+0x34/0x40 [obdclass]
[41678.589247] [<ffffffff8120d4ff>] deactivate_locked_super+0x3f/0x70
[41678.589260] [<ffffffff812289cb>] cleanup_mnt+0x3b/0x80
[41678.589264] [<ffffffff8109d198>] task_work_run+0x78/0x90
[41678.589267] [<ffffffff8107b60f>] exit_to_usermode_loop+0x91/0xc2
[41678.589304] [<ffffffff81003ae5>] syscall_return_slowpath+0x85/0xa0
[41678.589327] [<ffffffff816160a8>] int_ret_from_sys_call+0x25/0x9f
[41678.591515] DWARF2 unwinder stuck at int_ret_from_sys_call+0x25/0x9f
This umount stack trace on the client is common to all of these failures; what varies is the content of the other console logs.
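When reproducing this by hand, the same blocked-task stacks can be captured on demand via the sysrq trigger; a minimal sketch (nothing here is specific to this report):

# confirm umount is in uninterruptible (D) sleep and see where it is waiting
ps -o pid,stat,wchan:32,cmd -C umount

# dump stacks of all blocked tasks to the console log, which produces
# traces like the one above (sysrq must be enabled first)
echo 1 > /proc/sys/kernel/sysrq
echo w > /proc/sysrq-trigger
dmesg | tail -n 60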
In some cases, there is little information in the console logs as to why umount is hung. The following failures are examples of this:
https://testing.hpdd.intel.com/test_sets/1a8349fe-3780-11e8-960d-52540065bddc
https://testing.hpdd.intel.com/test_sets/49efa63a-2e87-11e8-b74b-52540065bddc
Other racer hangs on unmount have more information in the client and MDS console logs.
On the MDS console, we see two hung processes:
[41561.767660] jbd2/vda1-8 D ffff880036476eb0 0 263 2 0x00000000
[41561.768587] Call Trace:
[41561.768925] [<ffffffff816b2060>] ? bit_wait+0x50/0x50
[41561.769652] [<ffffffff816b40e9>] schedule+0x29/0x70
[41561.770284] [<ffffffff816b1a49>] schedule_timeout+0x239/0x2c0
[41561.771136] [<ffffffff812fd2d0>] ? generic_make_request_checks+0x1a0/0x3a0
[41561.772022] [<ffffffff81063f5e>] ? kvm_clock_get_cycles+0x1e/0x20
[41561.772875] [<ffffffff816b2060>] ? bit_wait+0x50/0x50
[41561.773511] [<ffffffff816b35ed>] io_schedule_timeout+0xad/0x130
[41561.774276] [<ffffffff816b3688>] io_schedule+0x18/0x20
[41561.775031] [<ffffffff816b2071>] bit_wait_io+0x11/0x50
[41561.775682] [<ffffffff816b1b97>] __wait_on_bit+0x67/0x90
[41561.776361] [<ffffffff816b2060>] ? bit_wait+0x50/0x50
[41561.777112] [<ffffffff816b1c41>] out_of_line_wait_on_bit+0x81/0xb0
[41561.777910] [<ffffffff810b5080>] ? wake_bit_function+0x40/0x40
[41561.778756] [<ffffffff8123b3fa>] __wait_on_buffer+0x2a/0x30
[41561.779893] [<ffffffffc00b87f1>] jbd2_journal_commit_transaction+0x1781/0x19b0 [jbd2]
[41561.780881] [<ffffffff810c28a0>] ? finish_task_switch+0x50/0x170
[41561.781650] [<ffffffffc00bdac9>] kjournald2+0xc9/0x260 [jbd2]
[41561.782385] [<ffffffff810b4fc0>] ? wake_up_atomic_t+0x30/0x30
[41561.783226] [<ffffffffc00bda00>] ? commit_timeout+0x10/0x10 [jbd2]
[41561.784024] [<ffffffff810b4031>] kthread+0xd1/0xe0
[41561.784710] [<ffffffff810b3f60>] ? insert_kthread_work+0x40/0x40
[41561.785478] [<ffffffff816c0577>] ret_from_fork+0x77/0xb0
[41561.786238] [<ffffffff810b3f60>] ? insert_kthread_work+0x40/0x40
…
[41561.874288] auditd D ffff88007911bf40 0 464 1 0x00000000
[41561.875280] Call Trace:
[41561.875598] [<ffffffff816b40e9>] schedule+0x29/0x70
[41561.876257] [<ffffffffc00bd4a5>] jbd2_log_wait_commit+0xc5/0x140 [jbd2]
[41561.877183] [<ffffffff810b4fc0>] ? wake_up_atomic_t+0x30/0x30
[41561.877931] [<ffffffffc00bea92>] jbd2_complete_transaction+0x52/0xa0 [jbd2]
[41561.879009] [<ffffffffc00e1d52>] ext4_sync_file+0x292/0x320 [ext4]
[41561.879805] [<ffffffff81238477>] do_fsync+0x67/0xb0
[41561.880476] [<ffffffff816c0655>] ? system_call_after_swapgs+0xa2/0x146
[41561.881309] [<ffffffff81238760>] SyS_fsync+0x10/0x20
[41561.882029] [<ffffffff816c0715>] system_call_fastpath+0x1c/0x21
[41561.882794] [<ffffffff816c0661>] ? system_call_after_swapgs+0xae/0x146
An example of this is:
https://testing.hpdd.intel.com/test_sets/28935b90-2e9f-11e8-9e0e-52540065bddc
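Reading the two MDS traces together: auditd is blocked in fsync() waiting for a journal commit, while the jbd2/vda1-8 thread that would complete that commit is itself stuck in bit_wait_io() waiting for buffer writeback, so the journal cannot make progress until the device I/O completes. On a live node this can be checked with something like the following (the device names are taken from the trace above and will differ per setup):

# state of the journal thread for the MDT backing device (matches jbd2/vda1-8)
cat /proc/fs/jbd2/vda1-8/info

# outstanding I/O on the backing device
cat /sys/block/vda/stat
grep vda /proc/diskstats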
Other test hangs have additional information on a client; we see a couple of errors before racer completes:
[38788.150606] Lustre: lustre-MDT0000-mdc-ffff880000015000: Connection restored to 10.2.8.165@tcp (at 10.2.8.165@tcp)
[38887.196337] LustreError: 8603:0:(statahead.c:1591:start_statahead_thread()) can't start ll_sa thread, rc: -4
[38920.523725] Lustre: 25748:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1523359053/real 1523359053] req@ffff8800639b96c0 x1597353644599184/t0(0) o36->lustre-MDT0000-mdc-ffff88006424a000@10.2.8.165@tcp:12/10 lens 608/4768 e 0 to 1 dl 1523359107 ref 2 fl Rpc:IX/0/ffffffff rc 0/-1
[38920.523749] Lustre: 25748:0:(client.c:2114:ptlrpc_expire_one_request()) Skipped 2 previous similar messages
[38920.523799] Lustre: lustre-MDT0000-mdc-ffff88006424a000: Connection to lustre-MDT0000 (at 10.2.8.165@tcp) was lost; in progress operations using this service will wait for recovery to complete
[38920.523800] Lustre: Skipped 2 previous similar messages
[38920.523857] LustreError: 25748:0:(llite_lib.c:1511:ll_md_setattr()) md_setattr fails: rc = -4
[38920.531836] Lustre: lustre-MDT0000-mdc-ffff88006424a000: Connection restored to 10.2.8.165@tcp (at 10.2.8.165@tcp)
An example of this is at:
https://testing.hpdd.intel.com/test_sets/8209f4b2-3d46-11e8-8f8a-52540065bddc
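The rc: -4 in both messages is -EINTR ("Interrupted system call"): starting the ll_sa thread and the setattr RPC were both interrupted while the MDS connection was bouncing. The code can be decoded with, for example:

# decode the kernel error code from the messages above
python3 -c 'import errno, os; print(errno.errorcode[4], os.strerror(4))'
# -> EINTR Interrupted system call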
Other clients have different information in the console log; some have many messages like the following:
[10286.777823] ll_sa_21479 D ffff880078c30000 0 21498 2 0x00000080
[10286.779544] Call Trace:
[10286.780839] [<ffffffff816b40e9>] schedule+0x29/0x70
[10286.782475] [<ffffffffc0cfa6d1>] ll_statahead_thread+0x651/0xbf0 [lustre]
[10286.784173] [<ffffffffc0cfa080>] ? ll_agl_thread+0x3e0/0x3e0 [lustre]
[10286.785825] [<ffffffff810b4031>] kthread+0xd1/0xe0
[10286.787342] [<ffffffff810b3f60>] ? insert_kthread_work+0x40/0x40
[10286.788946] [<ffffffff816c0577>] ret_from_fork+0x77/0xb0
[10286.790502] [<ffffffff810b3f60>] ? insert_kthread_work+0x40/0x40
[10286.792104] ll_agl_21479 D ffff880078c35ee0 0 21499 2 0x00000080
[10286.793798] Call Trace:
[10286.795131] [<ffffffff816b40e9>] schedule+0x29/0x70
[10286.796658] [<ffffffffc0cf9f6b>] ll_agl_thread+0x2cb/0x3e0 [lustre]
[10286.798302] [<ffffffffc0cf9ca0>] ? ll_agl_trigger+0x520/0x520 [lustre]
[10286.799953] [<ffffffff810b4031>] kthread+0xd1/0xe0
[10286.801459] [<ffffffff810b3f60>] ? insert_kthread_work+0x40/0x40
[10286.803059] [<ffffffff816c0577>] ret_from_fork+0x77/0xb0
[10286.804593] [<ffffffff810b3f60>] ? insert_kthread_work+0x40/0x40
Logs for this type of failure are at:
https://testing.hpdd.intel.com/test_sets/847d5946-3152-11e8-b6a0-52540065bddc
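The ll_sa_*/ll_agl_* tasks above are the Lustre statahead and AGL (asynchronous glimpse lock) threads. If they are still parked in D state at unmount time, that would be consistent with the client's umount sitting in ll_kill_super waiting for them to exit. Their presence and activity can be checked with something like the following (the lctl parameter path may vary by Lustre version):

# look for leftover statahead/AGL threads and their scheduler state
ps -e -o pid,stat,wchan:30,comm | grep -E 'll_sa|ll_agl'

# per-mount statahead statistics
lctl get_param llite.*.statahead_stats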