Details
Type: Bug
Resolution: Cannot Reproduce
Priority: Minor
Fix Version/s: None
Affects Version/s: Lustre 2.11.0, Lustre 2.10.4
Labels: None
Severity: 3
Description
racer completes all tests, but hangs while unmounting the file system. We see this problem frequently.
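For reference, a reproduction sketch using the in-tree test scripts; the exact invocation and duration below are assumptions for illustration, not taken from this report:

# assumed reproduction steps; paths and DURATION are illustrative
cd lustre/tests
sh llmount.sh                 # set up and mount a test filesystem
DURATION=900 sh racer.sh      # run racer to completion
sh llmountcleanup.sh          # unmount -- this is where the hang appears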
On the client console, we see the umount command hung:
[41678.588785] umount D 0000000000000000 0 13439 13432 0x00000000
[41678.588786] ffff880058147da8 ffff880058147de0 ffff880063b84cc0 ffff880058148000
[41678.588788] ffff880058147de0 00000001009dda10 ffff88007fc10840 0000000000000000
[41678.588789] ffff880058147dc0 ffffffff81611eb5 ffff88007fc10840 ffff880058147e68
[41678.588789] Call Trace:
[41678.588792] [<ffffffff81611eb5>] schedule+0x35/0x80
[41678.588794] [<ffffffff81614c71>] schedule_timeout+0x161/0x2d0
[41678.589164] [<ffffffffa0d34837>] ll_kill_super+0x77/0x150 [lustre]
[41678.589219] [<ffffffffa0838054>] lustre_kill_super+0x34/0x40 [obdclass]
[41678.589247] [<ffffffff8120d4ff>] deactivate_locked_super+0x3f/0x70
[41678.589260] [<ffffffff812289cb>] cleanup_mnt+0x3b/0x80
[41678.589264] [<ffffffff8109d198>] task_work_run+0x78/0x90
[41678.589267] [<ffffffff8107b60f>] exit_to_usermode_loop+0x91/0xc2
[41678.589304] [<ffffffff81003ae5>] syscall_return_slowpath+0x85/0xa0
[41678.589327] [<ffffffff816160a8>] int_ret_from_sys_call+0x25/0x9f
[41678.591515] DWARF2 unwinder stuck at int_ret_from_sys_call+0x25/0x9f
This umount stack trace on the client is common to all of these failures; what varies is the content of the other console logs.
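When reproducing this by hand, the same blocked-task stacks can be captured on demand via the sysrq trigger; a minimal sketch (nothing here is specific to this report):

# confirm umount is in uninterruptible (D) sleep and see where it is waiting
ps -o pid,stat,wchan:32,cmd -C umount

# dump stacks of all blocked tasks to the console log, which produces
# traces like the one above (sysrq must be enabled first)
echo 1 > /proc/sys/kernel/sysrq
echo w > /proc/sysrq-trigger
dmesg | tail -n 60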
In some cases, there is little information in the console logs as to why umount is hung. The following failures are examples of this:
https://testing.hpdd.intel.com/test_sets/1a8349fe-3780-11e8-960d-52540065bddc
https://testing.hpdd.intel.com/test_sets/49efa63a-2e87-11e8-b74b-52540065bddc
Other racer hangs on unmount have more information in the client and MDS console logs.
On the MDS console, we see two hung processes:
[41561.767660] jbd2/vda1-8 D ffff880036476eb0 0 263 2 0x00000000
[41561.768587] Call Trace:
[41561.768925] [<ffffffff816b2060>] ? bit_wait+0x50/0x50
[41561.769652] [<ffffffff816b40e9>] schedule+0x29/0x70
[41561.770284] [<ffffffff816b1a49>] schedule_timeout+0x239/0x2c0
[41561.771136] [<ffffffff812fd2d0>] ? generic_make_request_checks+0x1a0/0x3a0
[41561.772022] [<ffffffff81063f5e>] ? kvm_clock_get_cycles+0x1e/0x20
[41561.772875] [<ffffffff816b2060>] ? bit_wait+0x50/0x50
[41561.773511] [<ffffffff816b35ed>] io_schedule_timeout+0xad/0x130
[41561.774276] [<ffffffff816b3688>] io_schedule+0x18/0x20
[41561.775031] [<ffffffff816b2071>] bit_wait_io+0x11/0x50
[41561.775682] [<ffffffff816b1b97>] __wait_on_bit+0x67/0x90
[41561.776361] [<ffffffff816b2060>] ? bit_wait+0x50/0x50
[41561.777112] [<ffffffff816b1c41>] out_of_line_wait_on_bit+0x81/0xb0
[41561.777910] [<ffffffff810b5080>] ? wake_bit_function+0x40/0x40
[41561.778756] [<ffffffff8123b3fa>] __wait_on_buffer+0x2a/0x30
[41561.779893] [<ffffffffc00b87f1>] jbd2_journal_commit_transaction+0x1781/0x19b0 [jbd2]
[41561.780881] [<ffffffff810c28a0>] ? finish_task_switch+0x50/0x170
[41561.781650] [<ffffffffc00bdac9>] kjournald2+0xc9/0x260 [jbd2]
[41561.782385] [<ffffffff810b4fc0>] ? wake_up_atomic_t+0x30/0x30
[41561.783226] [<ffffffffc00bda00>] ? commit_timeout+0x10/0x10 [jbd2]
[41561.784024] [<ffffffff810b4031>] kthread+0xd1/0xe0
[41561.784710] [<ffffffff810b3f60>] ? insert_kthread_work+0x40/0x40
[41561.785478] [<ffffffff816c0577>] ret_from_fork+0x77/0xb0
[41561.786238] [<ffffffff810b3f60>] ? insert_kthread_work+0x40/0x40
…
[41561.874288] auditd D ffff88007911bf40 0 464 1 0x00000000
[41561.875280] Call Trace:
[41561.875598] [<ffffffff816b40e9>] schedule+0x29/0x70
[41561.876257] [<ffffffffc00bd4a5>] jbd2_log_wait_commit+0xc5/0x140 [jbd2]
[41561.877183] [<ffffffff810b4fc0>] ? wake_up_atomic_t+0x30/0x30
[41561.877931] [<ffffffffc00bea92>] jbd2_complete_transaction+0x52/0xa0 [jbd2]
[41561.879009] [<ffffffffc00e1d52>] ext4_sync_file+0x292/0x320 [ext4]
[41561.879805] [<ffffffff81238477>] do_fsync+0x67/0xb0
[41561.880476] [<ffffffff816c0655>] ? system_call_after_swapgs+0xa2/0x146
[41561.881309] [<ffffffff81238760>] SyS_fsync+0x10/0x20
[41561.882029] [<ffffffff816c0715>] system_call_fastpath+0x1c/0x21
[41561.882794] [<ffffffff816c0661>] ? system_call_after_swapgs+0xae/0x146
An example of this is:
https://testing.hpdd.intel.com/test_sets/28935b90-2e9f-11e8-9e0e-52540065bddc
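Reading the two MDS traces together: auditd is blocked in fsync() waiting for a journal commit, while the jbd2/vda1-8 thread that would complete that commit is itself stuck in bit_wait_io() waiting for buffer writeback, so the journal cannot make progress until the device I/O completes. On a live node this can be checked with something like the following (the device names are taken from the trace above and will differ per setup):

# state of the journal thread for the MDT backing device (matches jbd2/vda1-8)
cat /proc/fs/jbd2/vda1-8/info

# outstanding I/O on the backing device
cat /sys/block/vda/stat
grep vda /proc/diskstats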
Other test hangs have additional information on a client; we see a couple of errors before racer completes:
[38788.150606] Lustre: lustre-MDT0000-mdc-ffff880000015000: Connection restored to 10.2.8.165@tcp (at 10.2.8.165@tcp)
[38887.196337] LustreError: 8603:0:(statahead.c:1591:start_statahead_thread()) can't start ll_sa thread, rc: -4
[38920.523725] Lustre: 25748:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1523359053/real 1523359053] req@ffff8800639b96c0 x1597353644599184/t0(0) o36->lustre-MDT0000-mdc-ffff88006424a000@10.2.8.165@tcp:12/10 lens 608/4768 e 0 to 1 dl 1523359107 ref 2 fl Rpc:IX/0/ffffffff rc 0/-1
[38920.523749] Lustre: 25748:0:(client.c:2114:ptlrpc_expire_one_request()) Skipped 2 previous similar messages
[38920.523799] Lustre: lustre-MDT0000-mdc-ffff88006424a000: Connection to lustre-MDT0000 (at 10.2.8.165@tcp) was lost; in progress operations using this service will wait for recovery to complete
[38920.523800] Lustre: Skipped 2 previous similar messages
[38920.523857] LustreError: 25748:0:(llite_lib.c:1511:ll_md_setattr()) md_setattr fails: rc = -4
[38920.531836] Lustre: lustre-MDT0000-mdc-ffff88006424a000: Connection restored to 10.2.8.165@tcp (at 10.2.8.165@tcp)
An example of this is at:
https://testing.hpdd.intel.com/test_sets/8209f4b2-3d46-11e8-8f8a-52540065bddc
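The rc: -4 in both messages is -EINTR ("Interrupted system call"): starting the ll_sa thread and the setattr RPC were both interrupted while the MDS connection was bouncing. The code can be decoded with, for example:

# decode the kernel error code from the messages above
python3 -c 'import errno, os; print(errno.errorcode[4], os.strerror(4))'
# -> EINTR Interrupted system call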
Other clients have different information in the console log; some have many messages like the following:
[10286.777823] ll_sa_21479 D ffff880078c30000 0 21498 2 0x00000080
[10286.779544] Call Trace:
[10286.780839] [<ffffffff816b40e9>] schedule+0x29/0x70
[10286.782475] [<ffffffffc0cfa6d1>] ll_statahead_thread+0x651/0xbf0 [lustre]
[10286.784173] [<ffffffffc0cfa080>] ? ll_agl_thread+0x3e0/0x3e0 [lustre]
[10286.785825] [<ffffffff810b4031>] kthread+0xd1/0xe0
[10286.787342] [<ffffffff810b3f60>] ? insert_kthread_work+0x40/0x40
[10286.788946] [<ffffffff816c0577>] ret_from_fork+0x77/0xb0
[10286.790502] [<ffffffff810b3f60>] ? insert_kthread_work+0x40/0x40
[10286.792104] ll_agl_21479 D ffff880078c35ee0 0 21499 2 0x00000080
[10286.793798] Call Trace:
[10286.795131] [<ffffffff816b40e9>] schedule+0x29/0x70
[10286.796658] [<ffffffffc0cf9f6b>] ll_agl_thread+0x2cb/0x3e0 [lustre]
[10286.798302] [<ffffffffc0cf9ca0>] ? ll_agl_trigger+0x520/0x520 [lustre]
[10286.799953] [<ffffffff810b4031>] kthread+0xd1/0xe0
[10286.801459] [<ffffffff810b3f60>] ? insert_kthread_work+0x40/0x40
[10286.803059] [<ffffffff816c0577>] ret_from_fork+0x77/0xb0
[10286.804593] [<ffffffff810b3f60>] ? insert_kthread_work+0x40/0x40
Logs for this type of failure are at:
https://testing.hpdd.intel.com/test_sets/847d5946-3152-11e8-b6a0-52540065bddc
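The ll_sa_*/ll_agl_* tasks above are the Lustre statahead and AGL (asynchronous glimpse lock) threads. If they are still parked in D state at unmount time, that would be consistent with the client's umount sitting in ll_kill_super waiting for them to exit. Their presence and activity can be checked with something like the following (the lctl parameter path may vary by Lustre version):

# look for leftover statahead/AGL threads and their scheduler state
ps -e -o pid,stat,wchan:30,comm | grep -E 'll_sa|ll_agl'

# per-mount statahead statistics
lctl get_param llite.*.statahead_stats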