[LU-11359] racer test 1 times out with client hung in dir_create.sh, ls, … and MDS in ldlm_completion_ast() Created: 10/Sep/18 Updated: 27/Jun/19 Resolved: 01/Jun/19 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.12.0, Lustre 2.10.7, Lustre 2.12.1 |
| Fix Version/s: | Lustre 2.13.0, Lustre 2.12.3 |
| Type: | Bug | Priority: | Major |
| Reporter: | James Nunez (Inactive) | Assignee: | Mikhail Pershin |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | DoM2, ORNL | ||
| Attachments: |
|
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
Looking at the logs for https://testing.whamcloud.com/test_sets/ada6008e-b433-11e8-b86b-52540065bddc, we see that racer test_1 hangs in in ls, dir_create.sh. On both clients, we see several call to dir_create.sh, multiple ls calls, getfattr, ln and others blocked/hung. From the first client’s (vm9) console log, we see many calls hung ending with “lustre: revalidate FID [0x200007161:0x1:0x0] error: rc = -4” [ 7722.701554] Lustre: DEBUG MARKER: DURATION=300 MDSCOUNT=1 OSTCOUNT=7 RACER_ENABLE_REMOTE_DIRS=false RACER_ENABLE_STRIPED_DIRS=false RACER_ENABLE_MIGRATION=false RACER_ENABLE_PFL=true RACER_ENABLE_DOM=true RACER_ENABLE_FLR=true LFS=/usr/bin/lfs /usr/lib64/lustre/t [ 7722.706903] Lustre: DEBUG MARKER: DURATION=300 MDSCOUNT=1 OSTCOUNT=7 RACER_ENABLE_REMOTE_DIRS=false RACER_ENABLE_STRIPED_DIRS=false RACER_ENABLE_MIGRATION=false RACER_ENABLE_PFL=true RACER_ENABLE_DOM=true RACER_ENABLE_FLR=true LFS=/usr/bin/lfs /usr/lib64/lustre/t [ 7726.780425] 3[19667]: segfault at 0 ip 0000000000403e5f sp 00007fff05523690 error 6 in 3 (deleted)[400000+6000] [ 7738.946240] Lustre: lfs: using old ioctl(LL_IOC_LOV_GETSTRIPE) on [0x200007163:0x167:0x0], use llapi_layout_get_by_path() [ 7858.011870] Lustre: lustre-OST0004-osc-ffff8e6e2480f800: Connection to lustre-OST0004 (at 10.9.4.117@tcp) was lost; in progress operations using this service will wait for recovery to complete [ 7858.015216] LustreError: 167-0: lustre-OST0004-osc-ffff8e6e2480f800: This client was evicted by lustre-OST0004; in progress operations using this service will fail. [ 7858.019049] LustreError: 1426:0:(ldlm_resource.c:1146:ldlm_resource_complain()) lustre-OST0004-osc-ffff8e6e2480f800: namespace resource [0x832e:0x0:0x0].0x0 (ffff8e6e0ce026c0) refcount nonzero (2) after lock cleanup; forcing cleanup. [ 7920.249204] INFO: task dir_create.sh:15717 blocked for more than 120 seconds. [ 7920.250063] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 7920.250808] dir_create.sh D ffff8e6e02b14f10 0 15717 15658 0x00000080 [ 7920.251566] Call Trace: [ 7920.251882] [<ffffffffc0db47e2>] ? ll_dcompare+0x72/0x2e0 [lustre] [ 7920.252506] [<ffffffffa4714f49>] schedule_preempt_disabled+0x29/0x70 [ 7920.253246] [<ffffffffa4712d07>] __mutex_lock_slowpath+0xc7/0x1d0 [ 7920.253901] [<ffffffffa47120ef>] mutex_lock+0x1f/0x2f [ 7920.254400] [<ffffffffa470bcde>] lookup_slow+0x33/0xa7 [ 7920.254914] [<ffffffffa422927f>] link_path_walk+0x80f/0x8b0 [ 7920.255597] [<ffffffffa422b75d>] ? do_last+0x66d/0x12c0 [ 7920.256108] [<ffffffffa422c465>] path_openat+0xb5/0x640 [ 7920.256614] [<ffffffffa422e01d>] do_filp_open+0x4d/0xb0 [ 7920.257118] [<ffffffffa423b4c4>] ? __alloc_fd+0xc4/0x170 [ 7920.257637] [<ffffffffa421a327>] do_sys_open+0x137/0x240 [ 7920.258159] [<ffffffffa47206d5>] ? system_call_after_swapgs+0xa2/0x146 [ 7920.258783] [<ffffffffa421a44e>] SyS_open+0x1e/0x20 [ 7920.259260] [<ffffffffa4720795>] system_call_fastpath+0x1c/0x21 [ 7920.259819] [<ffffffffa47206e1>] ? system_call_after_swapgs+0xae/0x146 [ 7920.260552] INFO: task cp:32379 blocked for more than 120 seconds. [ 7920.261136] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 7920.261877] cp D ffff8e6e2611af70 0 32379 15806 0x00000080 [ 7920.262581] Call Trace: [ 7920.262857] [<ffffffffc0df4cc5>] ? ll_i2suppgid+0x15/0x40 [lustre] [ 7920.263466] [<ffffffffc0df4d14>] ? 
ll_i2gids+0x24/0xb0 [lustre] [ 7920.264031] [<ffffffffa4714f49>] schedule_preempt_disabled+0x29/0x70 [ 7920.264635] [<ffffffffa4712d07>] __mutex_lock_slowpath+0xc7/0x1d0 [ 7920.265222] [<ffffffffa47120ef>] mutex_lock+0x1f/0x2f [ 7920.265711] [<ffffffffc0de785c>] ll_setattr_raw+0xb6c/0xee0 [lustre] [ 7920.266332] [<ffffffffc0de7c3c>] ll_setattr+0x6c/0xd0 [lustre] [ 7920.266897] [<ffffffffa4239f14>] notify_change+0x2c4/0x420 [ 7920.267428] [<ffffffffa424fab9>] utimes_common+0xd9/0x1c0 [ 7920.267952] [<ffffffffa424fcde>] do_utimes+0x13e/0x180 [ 7920.268452] [<ffffffffa424fe24>] SyS_utimensat+0x64/0xb0 [ 7920.268966] [<ffffffffa47206e1>] ? system_call_after_swapgs+0xae/0x146 [ 7920.269585] [<ffffffffa4720795>] system_call_fastpath+0x1c/0x21 [ 7920.270156] [<ffffffffa47206e1>] ? system_call_after_swapgs+0xae/0x146 [ 7920.270775] INFO: task cp:434 blocked for more than 120 seconds. [ 7920.271344] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 7920.272068] cp D ffff8e6df7cb8000 0 434 15776 0x00000080 [ 7920.272778] Call Trace: [ 7920.273028] [<ffffffffa4714f49>] schedule_preempt_disabled+0x29/0x70 [ 7920.273636] [<ffffffffa4712d07>] __mutex_lock_slowpath+0xc7/0x1d0 [ 7920.274225] [<ffffffffa47120ef>] mutex_lock+0x1f/0x2f [ 7920.274721] [<ffffffffc0e14bc3>] vvp_io_write_start+0x603/0x810 [lustre] [ 7920.275506] [<ffffffffc0978377>] ? cl_lock_request+0x67/0x1f0 [obdclass] [ 7920.276191] [<ffffffffc097a228>] cl_io_start+0x68/0x130 [obdclass] [ 7920.276803] [<ffffffffc097c611>] cl_io_loop+0x131/0xc70 [obdclass] [ 7920.277417] [<ffffffffc0dc9db2>] ll_file_io_generic+0x4e2/0xd10 [lustre] [ 7920.278064] [<ffffffffc0dcab32>] ll_file_aio_write+0x372/0x540 [lustre] [ 7920.278699] [<ffffffffc0dcada4>] ll_file_write+0xa4/0x170 [lustre] [ 7920.279300] [<ffffffffa421b490>] vfs_write+0xc0/0x1f0 [ 7920.279776] [<ffffffffa47206e1>] ? system_call_after_swapgs+0xae/0x146 [ 7920.280408] [<ffffffffa421c2bf>] SyS_write+0x7f/0xf0 [ 7920.280894] [<ffffffffa47206e1>] ? system_call_after_swapgs+0xae/0x146 [ 7920.281512] [<ffffffffa4720795>] system_call_fastpath+0x1c/0x21 [ 7920.282096] [<ffffffffa47206e1>] ? system_call_after_swapgs+0xae/0x146 [ 7920.282719] INFO: task getfattr:1159 blocked for more than 120 seconds. [ 7920.283347] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 7920.284090] getfattr D ffff8e6df815af70 0 1159 15746 0x00000080 [ 7920.284799] Call Trace: [ 7920.285054] [<ffffffffa4714f49>] schedule_preempt_disabled+0x29/0x70 [ 7920.285660] [<ffffffffa4712d07>] __mutex_lock_slowpath+0xc7/0x1d0 [ 7920.286244] [<ffffffffa47120ef>] mutex_lock+0x1f/0x2f [ 7920.286726] [<ffffffffa470bcde>] lookup_slow+0x33/0xa7 [ 7920.287231] [<ffffffffa42291d4>] link_path_walk+0x764/0x8b0 [ 7920.287770] [<ffffffffc094c943>] ? class_handle2object+0xc3/0x1c0 [obdclass] [ 7920.288453] [<ffffffffa422a24a>] path_lookupat+0x7a/0x8b0 [ 7920.288981] [<ffffffffa41f7b9b>] ? kmem_cache_alloc+0x19b/0x1f0 [ 7920.289547] [<ffffffffa422cd4f>] ? getname_flags+0x4f/0x1a0 [ 7920.290096] [<ffffffffa422aaab>] filename_lookup+0x2b/0xc0 [ 7920.290624] [<ffffffffa422dee7>] user_path_at_empty+0x67/0xc0 [ 7920.291185] [<ffffffffa4125092>] ? from_kgid_munged+0x12/0x20 [ 7920.291740] [<ffffffffa422142f>] ? cp_new_stat+0x14f/0x180 [ 7920.292278] [<ffffffffa422df51>] user_path_at+0x11/0x20 [ 7920.292775] [<ffffffffa4220f23>] vfs_fstatat+0x63/0xc0 [ 7920.293282] [<ffffffffa422148e>] SYSC_newstat+0x2e/0x60 [ 7920.293776] [<ffffffffa47206e1>] ? 
system_call_after_swapgs+0xae/0x146 [ 7920.294407] [<ffffffffa47206d5>] ? system_call_after_swapgs+0xa2/0x146 [ 7920.295029] [<ffffffffa47206e1>] ? system_call_after_swapgs+0xae/0x146 [ 7920.295645] [<ffffffffa47206d5>] ? system_call_after_swapgs+0xa2/0x146 [ 7920.296272] [<ffffffffa47206e1>] ? system_call_after_swapgs+0xae/0x146 [ 7920.296895] [<ffffffffa47206d5>] ? system_call_after_swapgs+0xa2/0x146 [ 7920.297512] [<ffffffffa47206e1>] ? system_call_after_swapgs+0xae/0x146 [ 7920.298132] [<ffffffffa47206d5>] ? system_call_after_swapgs+0xa2/0x146 [ 7920.298746] [<ffffffffa47206e1>] ? system_call_after_swapgs+0xae/0x146 [ 7920.299373] [<ffffffffa47206d5>] ? system_call_after_swapgs+0xa2/0x146 [ 7920.300000] [<ffffffffa47206e1>] ? system_call_after_swapgs+0xae/0x146 [ 7920.300616] [<ffffffffa422176e>] SyS_newstat+0xe/0x10 [ 7920.301103] [<ffffffffa4720795>] system_call_fastpath+0x1c/0x21 [ 7920.301668] [<ffffffffa47206e1>] ? system_call_after_swapgs+0xae/0x146 [ 7920.302299] INFO: task getfattr:1656 blocked for more than 120 seconds. [ 7920.302919] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 7920.303653] getfattr D ffff8e6e06b6eeb0 0 1656 15679 0x00000080 [ 7920.304366] Call Trace: [ 7920.304606] [<ffffffffa4714f49>] schedule_preempt_disabled+0x29/0x70 [ 7920.305221] [<ffffffffa4712d07>] __mutex_lock_slowpath+0xc7/0x1d0 [ 7920.305790] [<ffffffffa47120ef>] mutex_lock+0x1f/0x2f [ 7920.306292] [<ffffffffa470bcde>] lookup_slow+0x33/0xa7 [ 7920.306784] [<ffffffffa422aa08>] path_lookupat+0x838/0x8b0 [ 7920.307321] [<ffffffffa41f7b79>] ? kmem_cache_alloc+0x179/0x1f0 [ 7920.307887] [<ffffffffa422cd4f>] ? getname_flags+0x4f/0x1a0 [ 7920.308418] [<ffffffffa422aaab>] filename_lookup+0x2b/0xc0 [ 7920.308946] [<ffffffffa422dee7>] user_path_at_empty+0x67/0xc0 [ 7920.309494] [<ffffffffa422df51>] user_path_at+0x11/0x20 [ 7920.309998] [<ffffffffa4243f68>] SyS_listxattr+0x48/0xb0 [ 7920.310506] [<ffffffffa47206d5>] ? system_call_after_swapgs+0xa2/0x146 [ 7920.311125] [<ffffffffa47206e1>] ? system_call_after_swapgs+0xae/0x146 [ 7920.311749] [<ffffffffa4720795>] system_call_fastpath+0x1c/0x21 [ 7920.312320] [<ffffffffa47206e1>] ? system_call_after_swapgs+0xae/0x146 [ 7920.312939] INFO: task chmod:1674 blocked for more than 120 seconds. [ 7920.313533] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 7920.314270] chmod D ffff8e6e02b12f70 0 1674 15675 0x00000080 [ 7920.314990] Call Trace: [ 7920.315238] [<ffffffffa4714f49>] schedule_preempt_disabled+0x29/0x70 [ 7920.315829] [<ffffffffa4712d07>] __mutex_lock_slowpath+0xc7/0x1d0 [ 7920.316424] [<ffffffffa47120ef>] mutex_lock+0x1f/0x2f [ 7920.316921] [<ffffffffa470bcde>] lookup_slow+0x33/0xa7 [ 7920.317417] [<ffffffffa422aa08>] path_lookupat+0x838/0x8b0 [ 7920.317955] [<ffffffffa406814e>] ? kvm_clock_get_cycles+0x1e/0x20 [ 7920.318549] [<ffffffffa41f7b79>] ? kmem_cache_alloc+0x179/0x1f0 [ 7920.319125] [<ffffffffa422cd4f>] ? getname_flags+0x4f/0x1a0 [ 7920.319657] [<ffffffffa422aaab>] filename_lookup+0x2b/0xc0 [ 7920.320204] [<ffffffffa422dee7>] user_path_at_empty+0x67/0xc0 [ 7920.320749] [<ffffffffa471b57c>] ? __do_page_fault+0x1bc/0x4f0 [ 7920.321316] [<ffffffffa422df51>] user_path_at+0x11/0x20 [ 7920.321814] [<ffffffffa4219d07>] SyS_fchmodat+0x47/0xc0 [ 7920.322322] [<ffffffffa47206d5>] ? system_call_after_swapgs+0xa2/0x146 [ 7920.322942] [<ffffffffa47206e1>] ? system_call_after_swapgs+0xae/0x146 [ 7920.323558] [<ffffffffa4720795>] system_call_fastpath+0x1c/0x21 [ 7920.324124] [<ffffffffa47206e1>] ? 
system_call_after_swapgs+0xae/0x146 [ 7920.324743] INFO: task getfattr:1834 blocked for more than 120 seconds. [ 7920.325372] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 7920.326101] getfattr D ffff8e6df7068000 0 1834 15711 0x00000080 [ 7920.326802] Call Trace: [ 7920.327060] [<ffffffffa4714f49>] schedule_preempt_disabled+0x29/0x70 [ 7920.327661] [<ffffffffa4712d07>] __mutex_lock_slowpath+0xc7/0x1d0 [ 7920.328249] [<ffffffffa47120ef>] mutex_lock+0x1f/0x2f [ 7920.328745] [<ffffffffa470bcde>] lookup_slow+0x33/0xa7 [ 7920.329254] [<ffffffffa422aa08>] path_lookupat+0x838/0x8b0 [ 7920.329778] [<ffffffffa4242b12>] ? generic_getxattr+0x52/0x70 [ 7920.330338] [<ffffffffa41f7b79>] ? kmem_cache_alloc+0x179/0x1f0 [ 7920.331029] [<ffffffffa422cd4f>] ? getname_flags+0x4f/0x1a0 [ 7920.331651] [<ffffffffa422aaab>] filename_lookup+0x2b/0xc0 [ 7920.332278] [<ffffffffa422dee7>] user_path_at_empty+0x67/0xc0 [ 7920.332952] [<ffffffffa471b57c>] ? __do_page_fault+0x1bc/0x4f0 [ 7920.333913] [<ffffffffa422df51>] user_path_at+0x11/0x20 [ 7920.335027] [<ffffffffa4243d2a>] SyS_getxattr+0x4a/0xc0 [ 7920.336126] [<ffffffffa47206d5>] ? system_call_after_swapgs+0xa2/0x146 [ 7920.336980] [<ffffffffa47206e1>] ? system_call_after_swapgs+0xae/0x146 [ 7920.337720] [<ffffffffa4720795>] system_call_fastpath+0x1c/0x21 [ 7920.338301] [<ffffffffa47206e1>] ? system_call_after_swapgs+0xae/0x146 [ 7920.338928] INFO: task rm:1977 blocked for more than 120 seconds. [ 7920.339503] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 7920.340232] rm D ffff8e6e05499fa0 0 1977 15734 0x00000080 [ 7920.340937] Call Trace: [ 7920.341181] [<ffffffffa4714f49>] schedule_preempt_disabled+0x29/0x70 [ 7920.341784] [<ffffffffa4712d07>] __mutex_lock_slowpath+0xc7/0x1d0 [ 7920.342374] [<ffffffffa47120ef>] mutex_lock+0x1f/0x2f [ 7920.342857] [<ffffffffa422d265>] do_rmdir+0x165/0x220 [ 7920.343352] [<ffffffffa47206e1>] ? system_call_after_swapgs+0xae/0x146 [ 7920.343972] [<ffffffffa47206d5>] ? system_call_after_swapgs+0xa2/0x146 [ 7920.344586] [<ffffffffa47206e1>] ? system_call_after_swapgs+0xae/0x146 [ 7920.345204] [<ffffffffa47206d5>] ? system_call_after_swapgs+0xa2/0x146 [ 7920.345816] [<ffffffffa47206e1>] ? system_call_after_swapgs+0xae/0x146 [ 7920.346444] [<ffffffffa47206d5>] ? system_call_after_swapgs+0xa2/0x146 [ 7920.347063] [<ffffffffa422e555>] SyS_unlinkat+0x25/0x40 [ 7920.347565] [<ffffffffa4720795>] system_call_fastpath+0x1c/0x21 [ 7920.348130] [<ffffffffa47206e1>] ? system_call_after_swapgs+0xae/0x146 [ 7920.348747] INFO: task mv:2144 blocked for more than 120 seconds. [ 7920.349323] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 7920.350049] mv D ffff8e6e05498000 0 2144 15700 0x00000080 [ 7920.350756] Call Trace: [ 7920.351005] [<ffffffffa422a2f2>] ? path_lookupat+0x122/0x8b0 [ 7920.351547] [<ffffffffa4714f49>] schedule_preempt_disabled+0x29/0x70 [ 7920.352148] [<ffffffffa4712d07>] __mutex_lock_slowpath+0xc7/0x1d0 [ 7920.352728] [<ffffffffa47120ef>] mutex_lock+0x1f/0x2f [ 7920.353222] [<ffffffffa42276e5>] lock_rename+0x55/0xe0 [ 7920.353713] [<ffffffffa422d81f>] SYSC_renameat2+0x22f/0x5a0 [ 7920.354255] [<ffffffffa47206e1>] ? system_call_after_swapgs+0xae/0x146 [ 7920.354871] [<ffffffffa47206d5>] ? system_call_after_swapgs+0xa2/0x146 [ 7920.355489] [<ffffffffa47206e1>] ? system_call_after_swapgs+0xae/0x146 [ 7920.356110] [<ffffffffa47206d5>] ? 
system_call_after_swapgs+0xa2/0x146 [ 7920.356732] [<ffffffffa422ea0e>] SyS_renameat2+0xe/0x10 [ 7920.357247] [<ffffffffa422ea4e>] SyS_rename+0x1e/0x20 [ 7920.357731] [<ffffffffa4720795>] system_call_fastpath+0x1c/0x21 [ 7920.358303] [<ffffffffa47206e1>] ? system_call_after_swapgs+0xae/0x146 [ 7920.358930] INFO: task ln:2148 blocked for more than 120 seconds. [ 7920.359504] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 7920.360254] ln D ffff8e6df7cbeeb0 0 2148 15720 0x00000080 [ 7920.360973] Call Trace: [ 7920.361221] [<ffffffffa4714f49>] schedule_preempt_disabled+0x29/0x70 [ 7920.361812] [<ffffffffa4712d07>] __mutex_lock_slowpath+0xc7/0x1d0 [ 7920.362402] [<ffffffffa47120ef>] mutex_lock+0x1f/0x2f [ 7920.362896] [<ffffffffa470bcde>] lookup_slow+0x33/0xa7 [ 7920.363390] [<ffffffffa422aa08>] path_lookupat+0x838/0x8b0 [ 7920.363924] [<ffffffffa41944fb>] ? unlock_page+0x2b/0x30 [ 7920.364444] [<ffffffffa41c13f1>] ? do_read_fault.isra.60+0x121/0x1a0 [ 7920.365048] [<ffffffffa41f7b79>] ? kmem_cache_alloc+0x179/0x1f0 [ 7920.365610] [<ffffffffa422cd4f>] ? getname_flags+0x4f/0x1a0 [ 7920.366146] [<ffffffffa422aaab>] filename_lookup+0x2b/0xc0 [ 7920.366671] [<ffffffffa422dee7>] user_path_at_empty+0x67/0xc0 [ 7920.367232] [<ffffffffa41c7c3d>] ? handle_mm_fault+0x39d/0x9b0 [ 7920.367777] [<ffffffffa4125092>] ? from_kgid_munged+0x12/0x20 [ 7920.368329] [<ffffffffa422df51>] user_path_at+0x11/0x20 [ 7920.368821] [<ffffffffa4220f23>] vfs_fstatat+0x63/0xc0 [ 7920.369344] [<ffffffffa42214f1>] SYSC_newlstat+0x31/0x60 [ 7920.369857] [<ffffffffa47206e1>] ? system_call_after_swapgs+0xae/0x146 [ 7920.370495] [<ffffffffa47206d5>] ? system_call_after_swapgs+0xa2/0x146 [ 7920.371115] [<ffffffffa47206e1>] ? system_call_after_swapgs+0xae/0x146 [ 7920.371737] [<ffffffffa47206d5>] ? system_call_after_swapgs+0xa2/0x146 [ 7920.372363] [<ffffffffa47206e1>] ? system_call_after_swapgs+0xae/0x146 [ 7920.372987] [<ffffffffa47206d5>] ? system_call_after_swapgs+0xa2/0x146 [ 7920.373604] [<ffffffffa47206e1>] ? system_call_after_swapgs+0xae/0x146 [ 7920.374250] [<ffffffffa47206d5>] ? system_call_after_swapgs+0xa2/0x146 [ 7920.374872] [<ffffffffa47206e1>] ? system_call_after_swapgs+0xae/0x146 [ 7920.375495] [<ffffffffa47206d5>] ? system_call_after_swapgs+0xa2/0x146 [ 7920.376114] [<ffffffffa47206e1>] ? system_call_after_swapgs+0xae/0x146 [ 7920.376736] [<ffffffffa422177e>] SyS_newlstat+0xe/0x10 [ 7920.377240] [<ffffffffa4720795>] system_call_fastpath+0x1c/0x21 [ 7920.377806] [<ffffffffa47206e1>] ? 
system_call_after_swapgs+0xae/0x146 [ 8022.948194] Lustre: lustre-OST0004-osc-ffff8e6e2480f800: Connection restored to 10.9.4.117@tcp (at 10.9.4.117@tcp) [ 8023.080282] LustreError: 4670:0:(file.c:4366:ll_inode_revalidate_fini()) lustre: revalidate FID [0x200007161:0x1:0x0] error: rc = -4 [ 8023.080306] LustreError: 4664:0:(file.c:4366:ll_inode_revalidate_fini()) lustre: revalidate FID [0x200007161:0x1:0x0] error: rc = -4 From the second client’s (vm10) console log, we see similar calls as in the first client’s call stack [ 7722.357867] Lustre: DEBUG MARKER: == racer test 1: racer on clients: trevis-10vm10,trevis-10vm9.trevis.whamcloud.com DURATION=300 ====== 12:19:23 (1536495563) [ 7722.541747] Lustre: DEBUG MARKER: DURATION=300 MDSCOUNT=1 OSTCOUNT=7 RACER_ENABLE_REMOTE_DIRS=false RACER_ENABLE_STRIPED_DIRS=false RACER_ENABLE_MIGRATION=false RACER_ENABLE_PFL=true RACER_ENABLE_DOM=true RACER_ENABLE_FLR=true LFS=/usr/bin/lfs /usr/lib64/lustre/t [ 7722.542686] Lustre: DEBUG MARKER: DURATION=300 MDSCOUNT=1 OSTCOUNT=7 RACER_ENABLE_REMOTE_DIRS=false RACER_ENABLE_STRIPED_DIRS=false RACER_ENABLE_MIGRATION=false RACER_ENABLE_PFL=true RACER_ENABLE_DOM=true RACER_ENABLE_FLR=true LFS=/usr/bin/lfs /usr/lib64/lustre/t [ 7726.494344] Lustre: lfs: using old ioctl(LL_IOC_LOV_GETSTRIPE) on [0x200007164:0x60:0x0], use llapi_layout_get_by_path() [ 7920.259874] INFO: task dir_create.sh:14445 blocked for more than 120 seconds. [ 7920.260874] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 7920.261634] dir_create.sh D ffff9bac39240fd0 0 14445 14433 0x00000080 [ 7920.262414] Call Trace: [ 7920.262749] [<ffffffffc0e427e2>] ? ll_dcompare+0x72/0x2e0 [lustre] [ 7920.263384] [<ffffffff88f14f49>] schedule_preempt_disabled+0x29/0x70 [ 7920.264027] [<ffffffff88f12d07>] __mutex_lock_slowpath+0xc7/0x1d0 [ 7920.264623] [<ffffffff88f120ef>] mutex_lock+0x1f/0x2f [ 7920.265151] [<ffffffff88f0bcde>] lookup_slow+0x33/0xa7 [ 7920.265692] [<ffffffff88a2927f>] link_path_walk+0x80f/0x8b0 [ 7920.266253] [<ffffffff88a2b75d>] ? do_last+0x66d/0x12c0 [ 7920.266782] [<ffffffff88a2c465>] path_openat+0xb5/0x640 [ 7920.267314] [<ffffffff88a2e01d>] do_filp_open+0x4d/0xb0 [ 7920.267848] [<ffffffff88a3b4c4>] ? __alloc_fd+0xc4/0x170 [ 7920.268392] [<ffffffff88a1a327>] do_sys_open+0x137/0x240 [ 7920.268943] [<ffffffff88f206d5>] ? system_call_after_swapgs+0xa2/0x146 [ 7920.269595] [<ffffffff88a1a44e>] SyS_open+0x1e/0x20 [ 7920.270104] [<ffffffff88f20795>] system_call_fastpath+0x1c/0x21 [ 7920.270693] [<ffffffff88f206e1>] ? system_call_after_swapgs+0xae/0x146 [ 7920.271383] INFO: task ls:23458 blocked for more than 120 seconds. [ 7920.271998] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 7920.272770] ls D ffff9bac1fd82f70 0 23458 14490 0x00000080 [ 7920.273503] Call Trace: [ 7920.273774] [<ffffffff88f14f49>] schedule_preempt_disabled+0x29/0x70 [ 7920.274406] [<ffffffff88f12d07>] __mutex_lock_slowpath+0xc7/0x1d0 [ 7920.275021] [<ffffffff88f120ef>] mutex_lock+0x1f/0x2f [ 7920.275525] [<ffffffff88a2b37f>] do_last+0x28f/0x12c0 [ 7920.276048] [<ffffffff88a2c487>] path_openat+0xd7/0x640 [ 7920.276582] [<ffffffffc0e4e929>] ? ll_file_data_put+0x89/0x180 [lustre] [ 7920.277251] [<ffffffff88a2e01d>] do_filp_open+0x4d/0xb0 [ 7920.277784] [<ffffffff88a3b4c4>] ? __alloc_fd+0xc4/0x170 [ 7920.278318] [<ffffffff88a1a327>] do_sys_open+0x137/0x240 [ 7920.278858] [<ffffffff88f206d5>] ? 
system_call_after_swapgs+0xa2/0x146 [ 7920.279515] [<ffffffff88a1a464>] SyS_openat+0x14/0x20 [ 7920.280033] [<ffffffff88f20795>] system_call_fastpath+0x1c/0x21 [ 7920.280614] [<ffffffff88f206e1>] ? system_call_after_swapgs+0xae/0x146 [ 7920.281265] INFO: task ls:23462 blocked for more than 120 seconds. [ 7920.281868] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 7920.282636] ls D ffff9bac1fc13f40 0 23462 14490 0x00000080 [ 7920.283377] Call Trace: [ 7920.283638] [<ffffffff88f14f49>] schedule_preempt_disabled+0x29/0x70 [ 7920.284279] [<ffffffff88f12d07>] __mutex_lock_slowpath+0xc7/0x1d0 [ 7920.284885] [<ffffffff88f120ef>] mutex_lock+0x1f/0x2f [ 7920.285399] [<ffffffff88a2b37f>] do_last+0x28f/0x12c0 [ 7920.285922] [<ffffffff88a2c487>] path_openat+0xd7/0x640 [ 7920.286448] [<ffffffffc0e4e929>] ? ll_file_data_put+0x89/0x180 [lustre] [ 7920.287110] [<ffffffff88a2e01d>] do_filp_open+0x4d/0xb0 [ 7920.287631] [<ffffffff88a3b4c4>] ? __alloc_fd+0xc4/0x170 [ 7920.288173] [<ffffffff88a1a327>] do_sys_open+0x137/0x240 [ 7920.288717] [<ffffffff88f206d5>] ? system_call_after_swapgs+0xa2/0x146 [ 7920.289361] [<ffffffff88a1a464>] SyS_openat+0x14/0x20 [ 7920.289878] [<ffffffff88f20795>] system_call_fastpath+0x1c/0x21 [ 7920.290470] [<ffffffff88f206e1>] ? system_call_after_swapgs+0xae/0x146 [ 7920.291141] INFO: task ls:23464 blocked for more than 120 seconds. [ 7920.291746] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 7920.292497] ls D ffff9bac20318000 0 23464 14490 0x00000080 [ 7920.293256] Call Trace: [ 7920.293500] [<ffffffff88f14f49>] schedule_preempt_disabled+0x29/0x70 [ 7920.294150] [<ffffffff88f12d07>] __mutex_lock_slowpath+0xc7/0x1d0 [ 7920.294767] [<ffffffff88f120ef>] mutex_lock+0x1f/0x2f [ 7920.295280] [<ffffffff88a2b37f>] do_last+0x28f/0x12c0 [ 7920.295790] [<ffffffff88a2c487>] path_openat+0xd7/0x640 [ 7920.296327] [<ffffffffc0e4e929>] ? ll_file_data_put+0x89/0x180 [lustre] [ 7920.296990] [<ffffffff88a2e01d>] do_filp_open+0x4d/0xb0 [ 7920.297510] [<ffffffff88a3b4c4>] ? __alloc_fd+0xc4/0x170 [ 7920.298056] [<ffffffff88a1a327>] do_sys_open+0x137/0x240 [ 7920.298587] [<ffffffff88f206d5>] ? system_call_after_swapgs+0xa2/0x146 [ 7920.299256] [<ffffffff88a1a464>] SyS_openat+0x14/0x20 [ 7920.299775] [<ffffffff88f20795>] system_call_fastpath+0x1c/0x21 [ 7920.300367] [<ffffffff88f206e1>] ? system_call_after_swapgs+0xae/0x146 [ 7920.301015] INFO: task ls:23466 blocked for more than 120 seconds. [ 7920.301612] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 7920.302395] ls D ffff9bac2031af70 0 23466 14490 0x00000080 [ 7920.303136] Call Trace: [ 7920.303397] [<ffffffff88f14f49>] schedule_preempt_disabled+0x29/0x70 [ 7920.304035] [<ffffffff88f12d07>] __mutex_lock_slowpath+0xc7/0x1d0 [ 7920.304632] [<ffffffff88f120ef>] mutex_lock+0x1f/0x2f [ 7920.305155] [<ffffffff88a2b37f>] do_last+0x28f/0x12c0 [ 7920.305673] [<ffffffff88a2c487>] path_openat+0xd7/0x640 [ 7920.306200] [<ffffffff88a2cd4f>] ? getname_flags+0x4f/0x1a0 [ 7920.306762] [<ffffffffc0e4e929>] ? ll_file_data_put+0x89/0x180 [lustre] [ 7920.307431] [<ffffffff88a2e01d>] do_filp_open+0x4d/0xb0 [ 7920.307963] [<ffffffff88a3b4c4>] ? __alloc_fd+0xc4/0x170 [ 7920.308493] [<ffffffff88a1a327>] do_sys_open+0x137/0x240 [ 7920.309037] [<ffffffff88f206d5>] ? system_call_after_swapgs+0xa2/0x146 [ 7920.309683] [<ffffffff88a1a464>] SyS_openat+0x14/0x20 [ 7920.310196] [<ffffffff88f20795>] system_call_fastpath+0x1c/0x21 [ 7920.310786] [<ffffffff88f206e1>] ? 
system_call_after_swapgs+0xae/0x146 [ 7920.311447] INFO: task getfattr:1379 blocked for more than 120 seconds. [ 7920.312100] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 7920.312876] getfattr D ffff9bac1d229fa0 0 1379 14538 0x00000080 [ 7920.313611] Call Trace: [ 7920.313884] [<ffffffff88f14f49>] schedule_preempt_disabled+0x29/0x70 [ 7920.314513] [<ffffffff88f12d07>] __mutex_lock_slowpath+0xc7/0x1d0 [ 7920.315128] [<ffffffff88f120ef>] mutex_lock+0x1f/0x2f [ 7920.315631] [<ffffffff88f0bcde>] lookup_slow+0x33/0xa7 [ 7920.316162] [<ffffffff88a2aa08>] path_lookupat+0x838/0x8b0 [ 7920.316722] [<ffffffff88a42b12>] ? generic_getxattr+0x52/0x70 [ 7920.317294] [<ffffffff88a42f6a>] ? vfs_getxattr+0x8a/0xb0 [ 7920.317855] [<ffffffff889f7a35>] ? kmem_cache_alloc+0x35/0x1f0 [ 7920.318436] [<ffffffff88a2cd4f>] ? getname_flags+0x4f/0x1a0 [ 7920.319001] [<ffffffff88a2aaab>] filename_lookup+0x2b/0xc0 [ 7920.319544] [<ffffffff88a2dee7>] user_path_at_empty+0x67/0xc0 [ 7920.320131] [<ffffffff88f1b57c>] ? __do_page_fault+0x1bc/0x4f0 [ 7920.320718] [<ffffffff88a2df51>] user_path_at+0x11/0x20 [ 7920.321247] [<ffffffff88a43d2a>] SyS_getxattr+0x4a/0xc0 [ 7920.321777] [<ffffffff88f206d5>] ? system_call_after_swapgs+0xa2/0x146 [ 7920.322439] [<ffffffff88f206e1>] ? system_call_after_swapgs+0xae/0x146 [ 7920.323091] [<ffffffff88f20795>] system_call_fastpath+0x1c/0x21 [ 7920.323689] [<ffffffff88f206e1>] ? system_call_after_swapgs+0xae/0x146 [ 7920.324352] INFO: task ln:2464 blocked for more than 120 seconds. [ 7920.324955] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 7920.325711] ln D ffff9bac21380fd0 0 2464 14485 0x00000080 [ 7920.326458] Call Trace: [ 7920.326712] [<ffffffff88a2a2f2>] ? path_lookupat+0x122/0x8b0 [ 7920.327296] [<ffffffff88f14f49>] schedule_preempt_disabled+0x29/0x70 [ 7920.327932] [<ffffffff88f12d07>] __mutex_lock_slowpath+0xc7/0x1d0 [ 7920.328532] [<ffffffff88f120ef>] mutex_lock+0x1f/0x2f [ 7920.329053] [<ffffffff88a2abc5>] filename_create+0x85/0x180 [ 7920.329608] [<ffffffff889f7a35>] ? kmem_cache_alloc+0x35/0x1f0 [ 7920.330201] [<ffffffff88a2cd4f>] ? getname_flags+0x4f/0x1a0 [ 7920.330775] [<ffffffff88a2cdc4>] ? getname_flags+0xc4/0x1a0 [ 7920.331338] [<ffffffff88a2d071>] user_path_create+0x41/0x60 [ 7920.331898] [<ffffffff88a2e5e8>] SyS_symlinkat+0x58/0x110 [ 7920.332441] [<ffffffff88f206d5>] ? system_call_after_swapgs+0xa2/0x146 [ 7920.333091] [<ffffffff88f206e1>] ? system_call_after_swapgs+0xae/0x146 [ 7920.333755] [<ffffffff88a2e6b6>] SyS_symlink+0x16/0x20 [ 7920.334279] [<ffffffff88f20795>] system_call_fastpath+0x1c/0x21 [ 7920.334869] [<ffffffff88f206e1>] ? system_call_after_swapgs+0xae/0x146 [ 7920.335512] INFO: task ls:2582 blocked for more than 120 seconds. [ 7920.336114] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 7920.336890] ls D ffff9bac1d329fa0 0 2582 14455 0x00000080 [ 7920.337619] Call Trace: [ 7920.337894] [<ffffffff88f14f49>] schedule_preempt_disabled+0x29/0x70 [ 7920.338519] [<ffffffff88f12d07>] __mutex_lock_slowpath+0xc7/0x1d0 [ 7920.339135] [<ffffffff88f120ef>] mutex_lock+0x1f/0x2f [ 7920.339640] [<ffffffff88a2b37f>] do_last+0x28f/0x12c0 [ 7920.340160] [<ffffffff88a2c487>] path_openat+0xd7/0x640 [ 7920.340700] [<ffffffffc0e4e929>] ? ll_file_data_put+0x89/0x180 [lustre] [ 7920.341373] [<ffffffff88a2e01d>] do_filp_open+0x4d/0xb0 [ 7920.341902] [<ffffffff88a3b4c4>] ? __alloc_fd+0xc4/0x170 [ 7920.342438] [<ffffffff88a1a327>] do_sys_open+0x137/0x240 [ 7920.342981] [<ffffffff88f206d5>] ? 
system_call_after_swapgs+0xa2/0x146 [ 7920.343618] [<ffffffff88a1a464>] SyS_openat+0x14/0x20 [ 7920.344142] [<ffffffff88f20795>] system_call_fastpath+0x1c/0x21 [ 7920.344757] [<ffffffff88f206e1>] ? system_call_after_swapgs+0xae/0x146 [ 7920.345402] INFO: task ls:2583 blocked for more than 120 seconds. [ 7920.346003] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 7920.346774] ls D ffff9bac1d32dee0 0 2583 14455 0x00000080 [ 7920.347511] Call Trace: [ 7920.347787] [<ffffffff88f14f49>] schedule_preempt_disabled+0x29/0x70 [ 7920.348417] [<ffffffff88f12d07>] __mutex_lock_slowpath+0xc7/0x1d0 [ 7920.349033] [<ffffffff88f120ef>] mutex_lock+0x1f/0x2f [ 7920.349539] [<ffffffff88a2b37f>] do_last+0x28f/0x12c0 [ 7920.350059] [<ffffffff88a2c487>] path_openat+0xd7/0x640 [ 7920.350587] [<ffffffffc0e4e929>] ? ll_file_data_put+0x89/0x180 [lustre] [ 7920.351271] [<ffffffff88a2e01d>] do_filp_open+0x4d/0xb0 [ 7920.351805] [<ffffffff88a3b4c4>] ? __alloc_fd+0xc4/0x170 [ 7920.352339] [<ffffffff88a1a327>] do_sys_open+0x137/0x240 [ 7920.352881] [<ffffffff88f206d5>] ? system_call_after_swapgs+0xa2/0x146 [ 7920.353524] [<ffffffff88a1a464>] SyS_openat+0x14/0x20 [ 7920.354050] [<ffffffff88f20795>] system_call_fastpath+0x1c/0x21 [ 7920.354631] [<ffffffff88f206e1>] ? system_call_after_swapgs+0xae/0x146 [ 7920.355301] INFO: task ls:2584 blocked for more than 120 seconds. [ 7920.355903] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 7920.356685] ls D ffff9bac1e540fd0 0 2584 14455 0x00000080 [ 7920.357413] Call Trace: [ 7920.357688] [<ffffffff88f14f49>] schedule_preempt_disabled+0x29/0x70 [ 7920.358315] [<ffffffff88f12d07>] __mutex_lock_slowpath+0xc7/0x1d0 [ 7920.358927] [<ffffffff88f120ef>] mutex_lock+0x1f/0x2f [ 7920.359432] [<ffffffff88a2b37f>] do_last+0x28f/0x12c0 [ 7920.359955] [<ffffffff88a2c487>] path_openat+0xd7/0x640 [ 7920.360484] [<ffffffffc0e4e929>] ? ll_file_data_put+0x89/0x180 [lustre] [ 7920.361166] [<ffffffff88a2e01d>] do_filp_open+0x4d/0xb0 [ 7920.361698] [<ffffffff88a3b4c4>] ? __alloc_fd+0xc4/0x170 [ 7920.362231] [<ffffffff88a1a327>] do_sys_open+0x137/0x240 [ 7920.362768] [<ffffffff88f206d5>] ? system_call_after_swapgs+0xa2/0x146 [ 7920.363411] [<ffffffff88a1a464>] SyS_openat+0x14/0x20 [ 7920.363931] [<ffffffff88f20795>] system_call_fastpath+0x1c/0x21 [ 7920.364512] [<ffffffff88f206e1>] ? system_call_after_swapgs+0xae/0x146 [ 8022.975327] LustreError: 4302:0:(file.c:4366:ll_inode_revalidate_fini()) lustre: revalidate FID [0x200007161:0x1:0x0] error: rc = -4 [ 8022.975389] LustreError: 4303:0:(file.c:4366:ll_inode_revalidate_fini()) lustre: revalidate FID [0x200007161:0x1:0x0] error: rc = -4 [ 8022.975622] LustreError: 4305:0:(file.c:4366:ll_inode_revalidate_fini()) lustre: revalidate FID [0x200007161:0x1:0x0] error: rc = -4 Description continued in the first comment due to JIRA description line limits. |
| Comments |
| Comment by James Nunez (Inactive) [ 10/Sep/18 ] |
|
From the MDS (vm12) console log, we see two different call traces [ 7639.633530] Lustre: DEBUG MARKER: == racer test 1: racer on clients: trevis-10vm10,trevis-10vm9.trevis.whamcloud.com DURATION=300 ====== 12:19:23 (1536495563) [ 7665.045291] LNetError: 14176:0:(lib-msg.c:794:lnet_is_health_check()) Msg is in inconsistent state, don't perform health checking (-125, 0) [ 7665.095106] LNetError: 14176:0:(lib-msg.c:794:lnet_is_health_check()) Msg is in inconsistent state, don't perform health checking (-125, 0) [ 7672.008894] Lustre: lustre-MDT0000: Client 5302dedd-995b-ce8d-7b96-eb4bc47e114c (at 10.9.4.116@tcp) reconnecting [ 7672.056678] Lustre: lustre-MDT0000: Client 9f0e7201-adaf-34dc-28ac-c83a77146718 (at 10.9.4.116@tcp) reconnecting [ 7714.299107] LNet: Service thread pid 799 was inactive for 40.12s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes: [ 7714.300812] Pid: 799, comm: mdt00_029 3.10.0-862.9.1.el7_lustre.x86_64 #1 SMP Thu Aug 30 12:08:10 UTC 2018 [ 7714.301760] Call Trace: [ 7714.302043] [<ffffffffc0d49f61>] ldlm_completion_ast+0x5b1/0x920 [ptlrpc] [ 7714.302943] [<ffffffffc0d4ad83>] ldlm_cli_enqueue_local+0x233/0x860 [ptlrpc] [ 7714.303794] [<ffffffffc11a5c7e>] mdt_dom_discard_data+0xfe/0x130 [mdt] [ 7714.304710] [<ffffffffc118355e>] mdt_reint_unlink+0xc2e/0x1490 [mdt] [ 7714.305408] [<ffffffffc1187293>] mdt_reint_rec+0x83/0x210 [mdt] [ 7714.306092] [<ffffffffc11661e2>] mdt_reint_internal+0x6b2/0xa80 [mdt] [ 7714.306798] [<ffffffffc1171447>] mdt_reint+0x67/0x140 [mdt] [ 7714.307401] [<ffffffffc0de831a>] tgt_request_handle+0xaea/0x1580 [ptlrpc] [ 7714.308224] [<ffffffffc0d8b48b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc] [ 7714.309155] [<ffffffffc0d8ecc4>] ptlrpc_main+0xb14/0x1fb0 [ptlrpc] [ 7714.309868] [<ffffffff842bb621>] kthread+0xd1/0xe0 [ 7714.310415] [<ffffffff849205f7>] ret_from_fork_nospec_end+0x0/0x39 [ 7714.311127] [<ffffffffffffffff>] 0xffffffffffffffff [ 7714.311701] LustreError: dumping log to /tmp/lustre-log.1536495638.799 [ 7714.635443] LNet: Service thread pid 789 was inactive for 40.15s. The thread might be hung, or it might only be slow and will resume later. 
Dumping the stack trace for debugging purposes: [ 7714.637166] Pid: 789, comm: mdt00_019 3.10.0-862.9.1.el7_lustre.x86_64 #1 SMP Thu Aug 30 12:08:10 UTC 2018 [ 7714.638114] Call Trace: [ 7714.638375] [<ffffffffc0d49f61>] ldlm_completion_ast+0x5b1/0x920 [ptlrpc] [ 7714.639203] [<ffffffffc0d4ad83>] ldlm_cli_enqueue_local+0x233/0x860 [ptlrpc] [ 7714.640025] [<ffffffffc116a3d7>] mdt_object_local_lock+0x4e7/0xb20 [mdt] [ 7714.640791] [<ffffffffc116aa80>] mdt_object_lock_internal+0x70/0x330 [mdt] [ 7714.641511] [<ffffffffc116baea>] mdt_getattr_name_lock+0x83a/0x1c00 [mdt] [ 7714.642266] [<ffffffffc1173ae5>] mdt_intent_getattr+0x2b5/0x480 [mdt] [ 7714.642972] [<ffffffffc11709c8>] mdt_intent_policy+0x2f8/0xd10 [mdt] [ 7714.643703] [<ffffffffc0d30e9e>] ldlm_lock_enqueue+0x34e/0xa50 [ptlrpc] [ 7714.644423] [<ffffffffc0d59503>] ldlm_handle_enqueue0+0x903/0x1520 [ptlrpc] [ 7714.645221] [<ffffffffc0ddf9d2>] tgt_enqueue+0x62/0x210 [ptlrpc] [ 7714.645923] [<ffffffffc0de831a>] tgt_request_handle+0xaea/0x1580 [ptlrpc] [ 7714.646752] [<ffffffffc0d8b48b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc] [ 7714.647555] [<ffffffffc0d8ecc4>] ptlrpc_main+0xb14/0x1fb0 [ptlrpc] [ 7714.648273] [<ffffffff842bb621>] kthread+0xd1/0xe0 [ 7714.648818] [<ffffffff849205f7>] ret_from_fork_nospec_end+0x0/0x39 [ 7714.649469] [<ffffffffffffffff>] 0xffffffffffffffff [ 7714.650057] LNet: Service thread pid 794 was inactive for 40.16s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes: [ 7714.651656] Pid: 794, comm: mdt00_024 3.10.0-862.9.1.el7_lustre.x86_64 #1 SMP Thu Aug 30 12:08:10 UTC 2018 [ 7714.652669] Call Trace: [ 7714.652933] [<ffffffffc0d49f61>] ldlm_completion_ast+0x5b1/0x920 [ptlrpc] [ 7714.653706] [<ffffffffc0d4ad83>] ldlm_cli_enqueue_local+0x233/0x860 [ptlrpc] [ 7714.654462] [<ffffffffc116a3d7>] mdt_object_local_lock+0x4e7/0xb20 [mdt] [ 7714.655245] [<ffffffffc116aa80>] mdt_object_lock_internal+0x70/0x330 [mdt] [ 7714.656002] [<ffffffffc116baea>] mdt_getattr_name_lock+0x83a/0x1c00 [mdt] [ 7714.656771] [<ffffffffc1173ae5>] mdt_intent_getattr+0x2b5/0x480 [mdt] [ 7714.657451] [<ffffffffc11709c8>] mdt_intent_policy+0x2f8/0xd10 [mdt] [ 7714.658173] [<ffffffffc0d30e9e>] ldlm_lock_enqueue+0x34e/0xa50 [ptlrpc] [ 7714.658913] [<ffffffffc0d59503>] ldlm_handle_enqueue0+0x903/0x1520 [ptlrpc] [ 7714.659703] [<ffffffffc0ddf9d2>] tgt_enqueue+0x62/0x210 [ptlrpc] [ 7714.660370] [<ffffffffc0de831a>] tgt_request_handle+0xaea/0x1580 [ptlrpc] [ 7714.661152] [<ffffffffc0d8b48b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc] [ 7714.661989] [<ffffffffc0d8ecc4>] ptlrpc_main+0xb14/0x1fb0 [ptlrpc] [ 7714.662754] [<ffffffff842bb621>] kthread+0xd1/0xe0 [ 7714.663276] [<ffffffff849205f7>] ret_from_fork_nospec_end+0x0/0x39 [ 7714.663956] [<ffffffffffffffff>] 0xffffffffffffffff [ 7714.664534] LNet: Service thread pid 23174 was inactive for 40.17s. The thread might be hung, or it might only be slow and will resume later. 
Dumping the stack trace for debugging purposes: [ 7714.666149] Pid: 23174, comm: mdt00_000 3.10.0-862.9.1.el7_lustre.x86_64 #1 SMP Thu Aug 30 12:08:10 UTC 2018 [ 7714.667158] Call Trace: [ 7714.667418] [<ffffffffc0d49f61>] ldlm_completion_ast+0x5b1/0x920 [ptlrpc] [ 7714.668145] [<ffffffffc0d4ad83>] ldlm_cli_enqueue_local+0x233/0x860 [ptlrpc] [ 7714.668924] [<ffffffffc116a3d7>] mdt_object_local_lock+0x4e7/0xb20 [mdt] [ 7714.669715] [<ffffffffc116aa80>] mdt_object_lock_internal+0x70/0x330 [mdt] [ 7714.670451] [<ffffffffc116baea>] mdt_getattr_name_lock+0x83a/0x1c00 [mdt] [ 7714.671210] [<ffffffffc1173ae5>] mdt_intent_getattr+0x2b5/0x480 [mdt] [ 7714.671919] [<ffffffffc11709c8>] mdt_intent_policy+0x2f8/0xd10 [mdt] [ 7714.672646] [<ffffffffc0d30e9e>] ldlm_lock_enqueue+0x34e/0xa50 [ptlrpc] [ 7714.673361] [<ffffffffc0d59503>] ldlm_handle_enqueue0+0x903/0x1520 [ptlrpc] [ 7714.674149] [<ffffffffc0ddf9d2>] tgt_enqueue+0x62/0x210 [ptlrpc] [ 7714.674837] [<ffffffffc0de831a>] tgt_request_handle+0xaea/0x1580 [ptlrpc] [ 7714.675567] [<ffffffffc0d8b48b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc] [ 7714.676400] [<ffffffffc0d8ecc4>] ptlrpc_main+0xb14/0x1fb0 [ptlrpc] [ 7714.677113] [<ffffffff842bb621>] kthread+0xd1/0xe0 [ 7714.677694] [<ffffffff849205f7>] ret_from_fork_nospec_end+0x0/0x39 [ 7714.678339] [<ffffffffffffffff>] 0xffffffffffffffff [ 7714.678912] LNet: Service thread pid 780 was inactive for 40.19s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes: [ 7714.680500] Pid: 780, comm: mdt00_010 3.10.0-862.9.1.el7_lustre.x86_64 #1 SMP Thu Aug 30 12:08:10 UTC 2018 [ 7714.681504] Call Trace: [ 7714.681757] [<ffffffffc0d49f61>] ldlm_completion_ast+0x5b1/0x920 [ptlrpc] [ 7714.682485] [<ffffffffc0d4ad83>] ldlm_cli_enqueue_local+0x233/0x860 [ptlrpc] [ 7714.683292] [<ffffffffc116a3d7>] mdt_object_local_lock+0x4e7/0xb20 [mdt] [ 7714.684030] [<ffffffffc116aa80>] mdt_object_lock_internal+0x70/0x330 [mdt] [ 7714.684838] [<ffffffffc116baea>] mdt_getattr_name_lock+0x83a/0x1c00 [mdt] [ 7714.685551] [<ffffffffc1173ae5>] mdt_intent_getattr+0x2b5/0x480 [mdt] [ 7714.686284] [<ffffffffc11709c8>] mdt_intent_policy+0x2f8/0xd10 [mdt] [ 7714.686982] [<ffffffffc0d30e9e>] ldlm_lock_enqueue+0x34e/0xa50 [ptlrpc] [ 7714.687737] [<ffffffffc0d59503>] ldlm_handle_enqueue0+0x903/0x1520 [ptlrpc] [ 7714.688484] [<ffffffffc0ddf9d2>] tgt_enqueue+0x62/0x210 [ptlrpc] [ 7714.689195] [<ffffffffc0de831a>] tgt_request_handle+0xaea/0x1580 [ptlrpc] [ 7714.689959] [<ffffffffc0d8b48b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc] [ 7714.690822] [<ffffffffc0d8ecc4>] ptlrpc_main+0xb14/0x1fb0 [ptlrpc] [ 7714.691494] [<ffffffff842bb621>] kthread+0xd1/0xe0 [ 7714.692044] [<ffffffff849205f7>] ret_from_fork_nospec_end+0x0/0x39 [ 7714.692782] [<ffffffffffffffff>] 0xffffffffffffffff [ 7714.693316] LNet: Service thread pid 777 was inactive for 40.04s. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one. [ 7714.694582] LNet: Service thread pid 790 was inactive for 40.05s. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one. … [ 7714.719633] LNet: Service thread pid 26048 was inactive for 40.54s. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one. 
[ 7974.175116] LustreError: 26048:0:(ldlm_request.c:129:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1536495598, 300s ago); not entering recovery in server code, just going back to sleep ns: mdt-lustre-MDT0000_UUID lock: ffff8840df351d40/0x106c1d2b325048e0 lrc: 3/1,0 mode: --/PR res: [0x200007164:0x28a:0x0].0x0 bits 0x13/0x8 rrc: 7 type: IBT flags: 0x40210000000000 nid: local remote: 0x0 expref: -99 pid: 26048 timeout: 0 lvb_type: 0 [ 7974.177105] LustreError: 799:0:(ldlm_request.c:129:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1536495598, 300s ago); not entering recovery in server code, just going back to sleep ns: mdt-lustre-MDT0000_UUID lock: ffff8840d1882000/0x106c1d2b3250492d lrc: 3/0,1 mode: --/PW res: [0x200007164:0x28a:0x0].0x0 bits 0x40/0x0 rrc: 8 type: IBT flags: 0x40010080000000 nid: local remote: 0x0 expref: -99 pid: 799 timeout: 0 lvb_type: 0 … [ 8817.787115] Lustre: 852:0:(service.c:1365:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (4/-122), not sending early reply [ 8817.787115] req@ffff8840e518cf00 x1611124532145872/t0(0) o101->5302dedd-995b-ce8d-7b96-eb4bc47e114c@10.9.4.116@tcp:21/0 lens 696/3384 e 0 to 0 dl 1536496746 ref 2 fl Interpret:/0/0 rc 0/0 [ 8963.195139] Pid: 26045, comm: mdt00_004 3.10.0-862.9.1.el7_lustre.x86_64 #1 SMP Thu Aug 30 12:08:10 UTC 2018 [ 8963.196167] Call Trace: [ 8963.196428] [<ffffffffc0d49fed>] ldlm_completion_ast+0x63d/0x920 [ptlrpc] [ 8963.197171] [<ffffffffc0d4ad83>] ldlm_cli_enqueue_local+0x233/0x860 [ptlrpc] [ 8963.197908] [<ffffffffc116a3d7>] mdt_object_local_lock+0x4e7/0xb20 [mdt] [ 8963.198606] [<ffffffffc116aa80>] mdt_object_lock_internal+0x70/0x330 [mdt] [ 8963.199325] [<ffffffffc116baea>] mdt_getattr_name_lock+0x83a/0x1c00 [mdt] [ 8963.200048] [<ffffffffc1173ae5>] mdt_intent_getattr+0x2b5/0x480 [mdt] [ 8963.200908] [<ffffffffc11709c8>] mdt_intent_policy+0x2f8/0xd10 [mdt] [ 8963.201572] [<ffffffffc0d30e9e>] ldlm_lock_enqueue+0x34e/0xa50 [ptlrpc] [ 8963.202422] [<ffffffffc0d59503>] ldlm_handle_enqueue0+0x903/0x1520 [ptlrpc] [ 8963.203254] [<ffffffffc0ddf9d2>] tgt_enqueue+0x62/0x210 [ptlrpc] [ 8963.203961] [<ffffffffc0de831a>] tgt_request_handle+0xaea/0x1580 [ptlrpc] [ 8963.204759] [<ffffffffc0d8b48b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc] [ 8963.205562] [<ffffffffc0d8ecc4>] ptlrpc_main+0xb14/0x1fb0 [ptlrpc] [ 8963.206288] [<ffffffff842bb621>] kthread+0xd1/0xe0 [ 8963.206849] [<ffffffff849205f7>] ret_from_fork_nospec_end+0x0/0x39 [ 8963.207489] [<ffffffffffffffff>] 0xffffffffffffffff [ 8963.208090] LustreError: dumping log to /tmp/lustre-log.1536496887.26045 [ 9027.173147] Lustre: 890:0:(service.c:1365:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/-150), not sending early reply [ 9027.173147] req@ffff8840fbf86100 x1611131923340304/t0(0) o101->0e5f252a-cf6c-a3fd-2317-8699dfe62e19@10.9.4.115@tcp:231/0 lens 1344/3528 e 0 to 0 dl 1536496956 ref 2 fl Interpret:/0/0 rc 0/0 [ 9027.176014] Lustre: 890:0:(service.c:1365:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/-150), not sending early reply [ 9027.176014] req@ffff8840fbf87600 x1611131923339728/t0(0) o101->0e5f252a-cf6c-a3fd-2317-8699dfe62e19@10.9.4.115@tcp:231/0 lens 696/3384 e 0 to 0 dl 1536496956 ref 2 fl Interpret:/0/0 rc 0/0 [ 9479.291115] Pid: 851, comm: mdt00_033 3.10.0-862.9.1.el7_lustre.x86_64 #1 SMP Thu Aug 30 12:08:10 UTC 2018 [ 9479.292156] Call Trace: [ 9479.292409] [<ffffffffc0d49fed>] ldlm_completion_ast+0x63d/0x920 [ptlrpc] [ 9479.293140] [<ffffffffc0d4ad83>] 
ldlm_cli_enqueue_local+0x233/0x860 [ptlrpc] [ 9479.293865] [<ffffffffc116a3d7>] mdt_object_local_lock+0x4e7/0xb20 [mdt] [ 9479.294551] [<ffffffffc116aa80>] mdt_object_lock_internal+0x70/0x330 [mdt] [ 9479.295253] [<ffffffffc116baea>] mdt_getattr_name_lock+0x83a/0x1c00 [mdt] [ 9479.295970] [<ffffffffc1173ae5>] mdt_intent_getattr+0x2b5/0x480 [mdt] [ 9479.296757] [<ffffffffc11709c8>] mdt_intent_policy+0x2f8/0xd10 [mdt] [ 9479.297414] [<ffffffffc0d30e9e>] ldlm_lock_enqueue+0x34e/0xa50 [ptlrpc] [ 9479.298105] [<ffffffffc0d59503>] ldlm_handle_enqueue0+0x903/0x1520 [ptlrpc] [ 9479.298819] [<ffffffffc0ddf9d2>] tgt_enqueue+0x62/0x210 [ptlrpc] [ 9479.299462] [<ffffffffc0de831a>] tgt_request_handle+0xaea/0x1580 [ptlrpc] [ 9479.300176] [<ffffffffc0d8b48b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc] [ 9479.301195] [<ffffffffc0d8ecc4>] ptlrpc_main+0xb14/0x1fb0 [ptlrpc] [ 9479.301890] [<ffffffff842bb621>] kthread+0xd1/0xe0 [ 9479.302397] [<ffffffff849205f7>] ret_from_fork_nospec_end+0x0/0x39 [ 9479.303067] [<ffffffffffffffff>] 0xffffffffffffffff … There are several other test sessions that time out with the same call traces. |
| Comment by Peter Jones [ 11/Sep/18 ] |
|
Lai, could you please advise? Thanks, Peter |
| Comment by Mikhail Pershin [ 31/Oct/18 ] |
|
I am taking this ticket; it looks related to DoM data discard, and Oleg has reported similar issues as well. After several local runs and investigations I have found that the issue is related to the mdt_dom_discard_data() function. Although it runs after all local objects are unlocked, it is still in the middle of unlink RPC processing, so no reply is sent for the unlink RPC until mdt_dom_discard_data() finishes. The problem is that a PW lock is held on the parent directory during that time, and that lock will not be canceled until the REP-ACK is received or a local commit happens, so all other operations in the same directory wait for the DoM discard lock to complete, which in turn may wait for the client lock to be canceled. Meanwhile, we don't need to wait here at all; the only thing the discard should do is send the blocking AST to the client. That can be fixed by avoiding the wait in the completion callback for that lock or, per discussion with Vitaly, with the LDLM_FL_CANCEL_ON_BLOCK flag. The discard lock will issue a blocking AST to the client and cancel the server lock immediately, without waiting for the client's response. |
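To make the dependency chain above concrete, here is a minimal standalone sketch in plain C with pthreads. It is not Lustre code and does not claim to be the actual patch: the function and variable names (parent_dir_lock, client_cancelled, unlink_discard_*) are invented for illustration, and a mutex stands in for the parent-directory PW lock. unlink_discard_waiting() models the problematic behaviour, where the unlink reply is held back until the client cancels its DoM lock while the parent lock is still held; unlink_discard_cancel_on_block() models the proposed behaviour, where the blocking AST is sent and the server-side lock is dropped without waiting.

```c
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t parent_dir_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t cancel_mtx      = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cancel_cv       = PTHREAD_COND_INITIALIZER;
static int client_cancelled;

/* Stand-in for a client that processes a blocking AST and cancels its DoM
 * lock after some delay (in the reported hang this takes far too long). */
static void *client_thread(void *arg)
{
    (void)arg;
    sleep(2);
    pthread_mutex_lock(&cancel_mtx);
    client_cancelled = 1;
    pthread_cond_signal(&cancel_cv);
    pthread_mutex_unlock(&cancel_mtx);
    return NULL;
}

/* Problematic shape: hold the parent-directory lock and wait for the client
 * cancel before replying; every other operation in the directory queues up. */
static void unlink_discard_waiting(void)
{
    pthread_mutex_lock(&parent_dir_lock);
    printf("unlink(wait): blocking AST sent, waiting for client cancel\n");
    pthread_mutex_lock(&cancel_mtx);
    while (!client_cancelled)
        pthread_cond_wait(&cancel_cv, &cancel_mtx);
    pthread_mutex_unlock(&cancel_mtx);
    pthread_mutex_unlock(&parent_dir_lock);
    printf("unlink(wait): reply sent only after the cancel arrived\n");
}

/* Shape of the fix: send the blocking AST, drop the server-side lock and
 * reply immediately; the client cancel completes on its own later. */
static void unlink_discard_cancel_on_block(void)
{
    pthread_mutex_lock(&parent_dir_lock);
    printf("unlink(no-wait): blocking AST sent, not waiting for cancel\n");
    pthread_mutex_unlock(&parent_dir_lock);
    printf("unlink(no-wait): reply sent immediately\n");
}

int main(void)
{
    pthread_t tid;

    pthread_create(&tid, NULL, client_thread, NULL);
    unlink_discard_cancel_on_block(); /* returns at once */
    unlink_discard_waiting();         /* returns only after the client cancel */
    pthread_join(tid, NULL);
    return 0;
}
```

Built with `cc -pthread`, the no-wait variant prints and returns immediately, while the waiting variant only returns after the simulated two-second client cancel, which is exactly the window during which every other operation on the parent directory would stall.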
| Comment by Gerrit Updater [ 31/Oct/18 ] |
|
Mike Pershin (mpershin@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/33522 |
| Comment by Gerrit Updater [ 20/Jan/19 ] |
|
Mike Pershin (mpershin@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34071 |
| Comment by Sarah Liu [ 05/Feb/19 ] |
|
seeing similar error on soak which is running b2_10-ib build #98 on MDS 0, ldiskfs [29779.010372] Lustre: 19293:0:(client.c:2114:ptlrpc_expire_one_request()) Skipped 1086809 previous similar messages [29876.244787] LNet: Service thread pid 18585 was inactive for 200.55s. The thread might be hung, or it might only be slow and will resume later. Du mping the stack trace for debugging purposes: [29876.263748] Pid: 18585, comm: mdt00_011 3.10.0-957.el7_lustre.x86_64 #1 SMP Mon Jan 7 20:06:41 UTC 2019 [29876.274243] Call Trace: [29876.276986] [<ffffffffc0f08a11>] ldlm_completion_ast+0x5b1/0x920 [ptlrpc] [29876.284750] [<ffffffffc0f09b53>] ldlm_cli_enqueue_local+0x233/0x860 [ptlrpc] [29876.292803] [<ffffffffc140ba1c>] mdt_reint_rename_or_migrate.isra.39+0x67c/0x860 [mdt] [29876.301769] [<ffffffffc140bc33>] mdt_reint_rename+0x13/0x20 [mdt] [29876.308686] [<ffffffffc140fc73>] mdt_reint_rec+0x83/0x210 [mdt] [29876.315425] [<ffffffffc13f118b>] mdt_reint_internal+0x5fb/0x9c0 [mdt] [29876.322728] [<ffffffffc13fcf77>] mdt_reint+0x67/0x140 [mdt] [29876.329068] [<ffffffffc0fa12aa>] tgt_request_handle+0x92a/0x1370 [ptlrpc] [29876.336793] [<ffffffffc0f49d5b>] ptlrpc_server_handle_request+0x23b/0xaa0 [ptlrpc] [29876.345386] [<ffffffffc0f4d4a2>] ptlrpc_main+0xa92/0x1e40 [ptlrpc] [29876.352425] [<ffffffff820c1c31>] kthread+0xd1/0xe0 [29876.357875] [<ffffffff82774c37>] ret_from_fork_nospec_end+0x0/0x39 [29876.364884] [<ffffffffffffffff>] 0xffffffffffffffff [29876.370470] LustreError: dumping log to /tmp/lustre-log.1549315990.18585 [29877.464680] LNet: Service thread pid 12550 was inactive for 201.09s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes: [29877.483630] Pid: 12550, comm: mdt00_004 3.10.0-957.el7_lustre.x86_64 #1 SMP Mon Jan 7 20:06:41 UTC 2019 [29877.494147] Call Trace: [29877.496902] [<ffffffffc0f08a11>] ldlm_completion_ast+0x5b1/0x920 [ptlrpc] [29877.504696] [<ffffffffc0f09b53>] ldlm_cli_enqueue_local+0x233/0x860 [ptlrpc] [29877.512742] [<ffffffffc140ba1c>] mdt_reint_rename_or_migrate.isra.39+0x67c/0x860 [mdt] [29877.521727] [<ffffffffc140bc33>] mdt_reint_rename+0x13/0x20 [mdt] [29877.528646] [<ffffffffc140fc73>] mdt_reint_rec+0x83/0x210 [mdt] [29877.535379] [<ffffffffc13f118b>] mdt_reint_internal+0x5fb/0x9c0 [mdt] [29877.542705] [<ffffffffc13fcf77>] mdt_reint+0x67/0x140 [mdt] [29877.549066] [<ffffffffc0fa12aa>] tgt_request_handle+0x92a/0x1370 [ptlrpc] [29877.556816] [<ffffffffc0f49d5b>] ptlrpc_server_handle_request+0x23b/0xaa0 [ptlrpc] [29877.565461] [<ffffffffc0f4d4a2>] ptlrpc_main+0xa92/0x1e40 [ptlrpc] [29877.572517] [<ffffffff820c1c31>] kthread+0xd1/0xe0 [29877.577977] [<ffffffff82774c37>] ret_from_fork_nospec_end+0x0/0x39 [29877.584994] [<ffffffffffffffff>] 0xffffffffffffffff [29898.915502] Lustre: MGS: haven't heard from client 892c8e1b-d3a7-b366-d6fc-9c8c5cb531f5 (at 192.168.1.110@o2ib) in 227 seconds. I think it's dead, and I am evicting it. 
exp ffff96f12caae800, cur 1549316013 expire 1549315863 last 1549315786 [29926.439830] Lustre: 12363:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1549316014/real 0] req@ffff96ecd3f80f00 x1624544455753472/t0(0) o38->soaked-MDT0002-osp-MDT0000@192.168.1.110@o2ib:24/4 lens 520/544 e 0 to 1 dl 1549316040 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1 [29926.470993] Lustre: 12363:0:(client.c:2114:ptlrpc_expire_one_request()) Skipped 1867318 previous similar messages [29931.342923] LNet: 12352:0:(o2iblnd_cb.c:3192:kiblnd_check_conns()) Timed out tx for 192.168.1.110@o2ib: 1 seconds [29944.343184] LNet: 12352:0:(o2iblnd_cb.c:3192:kiblnd_check_conns()) Timed out tx for 192.168.1.110@o2ib: 12 seconds [29956.343414] LNet: 12352:0:(o2iblnd_cb.c:3192:kiblnd_check_conns()) Timed out tx for 192.168.1.110@o2ib: 6 seconds [29956.354882] LNet: 12352:0:(o2iblnd_cb.c:3192:kiblnd_check_conns()) Skipped 2 previous similar messages [29969.343674] LNet: 12352:0:(o2iblnd_cb.c:3192:kiblnd_check_conns()) Timed out tx for 192.168.1.110@o2ib: 4 seconds [29969.355136] LNet: 12352:0:(o2iblnd_cb.c:3192:kiblnd_check_conns()) Skipped 2 previous similar messages [29975.691774] LustreError: 18585:0:(ldlm_request.c:130:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1549315789, 300s ago); not entering recovery in server code, just going back to sleep ns: mdt-soaked-MDT0000_UUID lock: ffff96ece5a0a200/0xcacba66f98c78b0 lrc: 3/0,1 mode: --/EX res: [0x200000004:0x1:0x0].0x0 bits 0x2 rrc: 12 type: IBT flags: 0x40210400000020 nid: local remote: 0x0 expref: -99 pid: 18585 timeout: 0 lvb_type: 0 [29975.692225] Lustre: soaked-MDT0000: Received new LWP connection from 192.168.1.109@o2ib, removing former export from same NID [29975.692228] Lustre: Skipped 1 previous similar message [29975.692273] Lustre: soaked-MDT0000: Connection restored to 192.168.1.109@o2ib (at 192.168.1.109@o2ib) [29975.764275] LustreError: dumping log to /tmp/lustre-log.1549316090.18585 [29976.374841] LustreError: 12550:0:(ldlm_request.c:130:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1549315790, 300s ago); not entering recovery in server code, just going back to sleep ns: mdt-soaked-MDT0000_UUID lock: ffff96ece3468a00/0xcacba66f98ca080 lrc: 3/0,1 mode: --/EX res: [0x200000004:0x1:0x0].0x0 bits 0x2 rrc: 12 type: IBT flags: 0x40210000000000 nid: local remote: 0x0 expref: -99 pid: 12550 timeout: 0 lvb_type: 0 [29982.343941] LNet: 12352:0:(o2iblnd_cb.c:3192:kiblnd_check_conns()) Timed out tx for 192.168.1.110@o2ib: 17 seconds [29982.355508] LNet: 12352:0:(o2iblnd_cb.c:3192:kiblnd_check_conns()) Skipped 2 previous similar messages [29994.344188] LNet: 12352:0:(o2iblnd_cb.c:3192:kiblnd_check_conns()) Timed out tx for 192.168.1.110@o2ib: 4 seconds [29994.355649] LNet: 12352:0:(o2iblnd_cb.c:3192:kiblnd_check_conns()) Skipped 1 previous similar message [30048.309887] Lustre: MGS: Connection restored to 192.168.1.110@o2ib (at 192.168.1.110@o2ib) [30048.319166] Lustre: Skipped 1 previous similar message [30048.677813] Lustre: soaked-MDT0000: Received new LWP connection from 192.168.1.110@o2ib, removing former export from same NID [30048.690935] Lustre: Skipped 1 previous similar message [30048.696694] Lustre: soaked-MDT0000: Client soaked-MDT0002-mdtlov_UUID (at 192.168.1.110@o2ib) refused connection, still busy with 26 references [30050.367592] LustreError: 19293:0:(ldlm_lockd.c:697:ldlm_handle_ast_error()) ### client (nid 192.168.1.110@o2ib) returned error from completion AST 
(req@ffff96ecde662d00 x1624544455740352 status -107 rc -107), evict it ns: mdt-soaked-MDT0000_UUID lock: ffff96ece65b3600/0xcacba66f98c8e43 lrc: 3/0,0 mode: EX/EX res: [0x20000cf25:0xa64:0x0].0x0 bits 0x2 rrc: 3 type: IBT flags: 0x40000001000000 nid: 192.168.1.110@o2ib remote: 0x3aa7d720cf16e00 expref: 25 pid: 32934 timeout: 0 lvb_type: 0 [30050.367613] LustreError: 138-a: soaked-MDT0000: A client on nid 192.168.1.110@o2ib was evicted due to a lock completion callback time out: rc -107 [30050.430003] LustreError: 19293:0:(ldlm_lockd.c:697:ldlm_handle_ast_error()) Skipped 1 previous similar message |
| Comment by Patrick Farrell (Inactive) [ 15/Mar/19 ] |
|
Attached is a log of the execution of mdt_dom_discard_data() with this patch. The patch does not achieve the intended effect of getting another context: we're waiting for the lock request to complete, rather than putting it on the waiting queue, so we're sleeping and executing the completion AST in the original thread context. |
| Comment by Mikhail Pershin [ 15/Mar/19 ] |
|
Patrick, that is not quite so; there is no waiting for the lock request to complete. It only takes the lock in the same context and sends blocking ASTs, but it does not wait for them to complete; that was the idea. The problem was in the generic ldlm_completion_ast(), which waits for all blocking locks to be canceled in the execution context. |
| Comment by Patrick Farrell (Inactive) [ 15/Mar/19 ] |
|
Mike, take a look at the log if you haven't. It waits for them to complete. I added a few more comments in the patch. |
| Comment by Mikhail Pershin [ 15/Mar/19 ] |
|
Patrick, I don't see in the log you've sent that it 'waits'. It executes the lock in the same context, yes: it processes the lock policy, finds conflicting locks, sends blocking ASTs, and exits. With the generic ldlm_completion_ast() it would also 'wait' for all blocking locks to be canceled; that is the main problem, because it waits for other clients' locks to be canceled in the current context, but with the patch it doesn't. It often happens that the client cancels the blocking locks right away and the server receives the cancels before the completion AST is even called; in that case everything finishes in a single context, but that is not a problem. |
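For readers not following Gerrit, here is a minimal sketch of the distinction Mike describes (standalone C, not the Lustre API; the toy_lock structure and function names are assumptions made for illustration): the generic completion callback sleeps in the server thread's own context until conflicting locks are cancelled and the lock is granted, while the no-wait shape used for the discard lock returns as soon as the blocking ASTs have been issued.

```c
#include <stdbool.h>
#include <stdio.h>

struct toy_lock {
    bool granted;   /* true once all conflicting (client) locks are gone */
};

/* Generic shape: the calling server thread waits in its own context until
 * the conflicting locks have been cancelled and the lock is granted. */
static int completion_ast_generic(struct toy_lock *lk)
{
    while (!lk->granted)
        ;   /* the real code sleeps on a waitqueue rather than spinning */
    return 0;
}

/* No-wait shape: blocking ASTs were already sent by the enqueue path;
 * simply return so the RPC reply is not delayed by the client cancels. */
static int completion_ast_nowait(struct toy_lock *lk)
{
    (void)lk;
    return 0;
}

int main(void)
{
    struct toy_lock granted_lock = { .granted = true };
    struct toy_lock pending_lock = { .granted = false };

    printf("generic on granted lock: %d\n",
           completion_ast_generic(&granted_lock));
    printf("no-wait on pending lock: %d\n",
           completion_ast_nowait(&pending_lock));
    return 0;
}
```

The design point is only where the wait happens: with the no-wait shape the client cancels still occur, but they no longer hold up the thread that is processing the unlink RPC.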
| Comment by Patrick Farrell (Inactive) [ 15/Mar/19 ] |
|
In case anyone is not following Gerrit, the details are complicated, but Mike is correct. My comments can be disregarded. |
| Comment by Stephane Thiell [ 26/Apr/19 ] |
|
This is a follow-up to |
| Comment by Peter Jones [ 26/Apr/19 ] |
|
Thanks for the update Stephane - this is really good news! |
| Comment by Stephane Thiell [ 26/Apr/19 ] |
|
Hmm. While we haven't seen new call traces with mdt_dom_discard_data(), we just had an event with a lot of ldlm_completion_ast() and Robinhood was stuck. The filesystem was still usable overall, but I noticed a few nodes that couldn't access a specific set of files. I restarted the Robinhood server and after some time the system recovered by itself. This may be related to LU-11358, so I'll attach MDS logs there. |
| Comment by Mikhail Pershin [ 27/Apr/19 ] |
|
Stephane, this can also be a result of |
| Comment by Stephane Thiell [ 27/Apr/19 ] |
|
Mike, ok, thanks. I'll follow this one too. I have seen a few more traces on the MDS last night, from time to time (see an example of one of these 'events' below). I'm not sure about the impact on users yet, but I assume some file access must be blocked, at least temporarily. We stopped Robinhood (which is not critical here), so I don't think it's coming from it. Our plan is not to touch anything unless we see a real user-facing problem, but do you think https://review.whamcloud.com/#/c/33077 would be worth trying if we ever see that it has an impact on production?

Apr 27 06:59:36 fir-md1-s1 kernel: Lustre: fir-MDT0000: Client 79273348-f676-2905-42da-85a87e1ba2d5 (at 10.9.107.14@o2ib4) reconnecting
Apr 27 06:59:36 fir-md1-s1 kernel: Lustre: Skipped 299 previous similar messages
Apr 27 07:08:23 fir-md1-s1 kernel: Lustre: fir-MDT0000: Connection restored to (at 10.9.107.14@o2ib4)
Apr 27 07:08:23 fir-md1-s1 kernel: Lustre: Skipped 299 previous similar messages
Apr 27 07:09:56 fir-md1-s1 kernel: Lustre: fir-MDT0000: Client 79273348-f676-2905-42da-85a87e1ba2d5 (at 10.9.107.14@o2ib4) reconnecting
Apr 27 07:09:56 fir-md1-s1 kernel: Lustre: Skipped 299 previous similar messages
Apr 27 07:18:43 fir-md1-s1 kernel: Lustre: fir-MDT0000: Connection restored to (at 10.9.107.14@o2ib4)
Apr 27 07:18:43 fir-md1-s1 kernel: Lustre: Skipped 299 previous similar messages
Apr 27 07:20:16 fir-md1-s1 kernel: Lustre: fir-MDT0000: Client 79273348-f676-2905-42da-85a87e1ba2d5 (at 10.9.107.14@o2ib4) reconnecting
Apr 27 07:20:16 fir-md1-s1 kernel: Lustre: Skipped 299 previous similar messages
Apr 27 07:29:03 fir-md1-s1 kernel: Lustre: fir-MDT0000: Connection restored to 4a715ce4-44e5-9eb5-fd19-43a9ef3ba9b7 (at 10.9.107.10@o2ib4)
Apr 27 07:29:03 fir-md1-s1 kernel: Lustre: Skipped 299 previous similar messages
Apr 27 07:30:36 fir-md1-s1 kernel: Lustre: fir-MDT0000: Client 8937ba9c-f8a3-cff7-01bf-a35c8d69bba5 (at 10.9.107.10@o2ib4) reconnecting
Apr 27 07:30:36 fir-md1-s1 kernel: Lustre: Skipped 299 previous similar messages
Apr 27 07:39:23 fir-md1-s1 kernel: Lustre: fir-MDT0000: Connection restored to 4a715ce4-44e5-9eb5-fd19-43a9ef3ba9b7 (at 10.9.107.10@o2ib4)
Apr 27 07:39:23 fir-md1-s1 kernel: Lustre: Skipped 299 previous similar messages
Apr 27 07:40:37 fir-md1-s1 kernel: LNet: Service thread pid 114792 completed after 3541.44s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
Apr 27 07:40:37 fir-md1-s1 kernel: LNet: Skipped 1 previous similar message
Apr 27 07:40:56 fir-md1-s1 kernel: Lustre: fir-MDT0000: Client 8937ba9c-f8a3-cff7-01bf-a35c8d69bba5 (at 10.9.107.10@o2ib4) reconnecting
Apr 27 07:40:56 fir-md1-s1 kernel: Lustre: Skipped 299 previous similar messages
Apr 27 07:41:02 fir-md1-s1 kernel: Lustre: 105058:0:(service.c:1372:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/-5), not sending early reply
req@ffff8b5d7a9ad100 x1631534730383072/t0(0) o101->5735cd86-3a30-362c-bc05-c634d3fa1859@10.9.107.11@o2ib4:7/0 lens 632/0 e 0 to 0 dl 1556376067 ref 2 fl Interpret:/0/ffffffff rc 0/-1
Apr 27 07:41:02 fir-md1-s1 kernel: Lustre: 105058:0:(service.c:1372:ptlrpc_at_send_early_reply()) Skipped 30 previous similar messages
Apr 27 07:42:07 fir-md1-s1 kernel: LustreError: 104389:0:(ldlm_request.c:129:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1556376037, 90s ago); not entering recovery in server code, just going back to sleep ns: mdt-fir-MDT0000_UUID lock: ffff8b4d410f0240/0x378007f7dcffbc77 lrc: 3/1,0 mode: --/PR r
Apr 27 07:42:07 fir-md1-s1 kernel: LustreError: 104389:0:(ldlm_request.c:129:ldlm_expired_completion_wait()) Skipped 2 previous similar messages
Apr 27 07:43:07 fir-md1-s1 kernel: LustreError: 20444:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 150s: evicting client at 10.9.104.19@o2ib4 ns: mdt-fir-MDT0000_UUID lock: ffff8b60e5ace780/0x378007f7ac0a6892 lrc: 3/0,0 mode: PW/PW res: [0x20001a221:0x311:0x0].0x0 bits 0x40/0x0 rrc
Apr 27 07:43:07 fir-md1-s1 kernel: LustreError: 104728:0:(ldlm_lockd.c:1357:ldlm_handle_enqueue0()) ### lock on destroyed export ffff8b5101ad2000 ns: mdt-fir-MDT0000_UUID lock: ffff8b5ededbe300/0x378007f7ac0a6bc5 lrc: 5/0,0 mode: PW/PW res: [0x20001a221:0x311:0x0].0x0 bits 0x40/0x0 rrc: 50 type: IBT flags: 0x5020040
Apr 27 07:43:07 fir-md1-s1 kernel: LNet: Service thread pid 104728 completed after 3691.00s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
Apr 27 07:43:07 fir-md1-s1 kernel: LNet: Service thread pid 105239 completed after 3691.00s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
Apr 27 07:43:32 fir-md1-s1 kernel: Lustre: 105124:0:(service.c:1372:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/-5), not sending early reply
req@ffff8b6231114500 x1631546041738256/t0(0) o101->5c33d5d2-2621-e59d-0e36-98a7cb2caa9e@10.9.104.17@o2ib4:7/0 lens 568/0 e 0 to 0 dl 1556376217 ref 2 fl Interpret:/0/ffffffff rc 0/-1
Apr 27 07:43:32 fir-md1-s1 kernel: Lustre: 105124:0:(service.c:1372:ptlrpc_at_send_early_reply()) Skipped 1 previous similar message
Apr 27 07:43:37 fir-md1-s1 kernel: LustreError: 20444:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 30s: evicting client at 10.9.104.17@o2ib4 ns: mdt-fir-MDT0000_UUID lock: ffff8b57efe82880/0x378007f7ac0a6d3f lrc: 3/0,0 mode: PW/PW res: [0x20001a221:0x311:0x0].0x0 bits 0x40/0x0 rrc:
Apr 27 07:43:37 fir-md1-s1 kernel: LNet: Service thread pid 104337 completed after 3720.97s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
Apr 27 07:43:37 fir-md1-s1 kernel: LustreError: 104691:0:(ldlm_lockd.c:1357:ldlm_handle_enqueue0()) ### lock on destroyed export ffff8b5101ad2000 ns: mdt-fir-MDT0000_UUID lock: ffff8b567d303a80/0x378007f7ac0a6e88 lrc: 3/0,0 mode: PR/PR res: [0x20001a221:0x311:0x0].0x0 bits 0x1b/0x0 rrc: 44 type: IBT flags: 0x5020040
Apr 27 07:43:37 fir-md1-s1 kernel: Lustre: 104691:0:(service.c:2165:ptlrpc_server_handle_request()) @@@ Request took longer than estimated (3719:2s); client may timeout. req@ffff8b5e67f11200 x1631585913250576/t0(0) o101->8a37f7b1-3efc-30e9-f8d1-739df6680357@10.9.104.19@o2ib4:6/0 lens 584/1168 e 0 to 0 dl 1556376215
Apr 27 07:43:37 fir-md1-s1 kernel: Lustre: 104691:0:(service.c:2165:ptlrpc_server_handle_request()) Skipped 1 previous similar message
Apr 27 07:43:37 fir-md1-s1 kernel: LNet: Skipped 4 previous similar messages
Apr 27 07:43:43 fir-md1-s1 kernel: LustreError: 82844:0:(ldlm_lockd.c:2322:ldlm_cancel_handler()) ldlm_cancel from 10.9.104.17@o2ib4 arrived at 1556376223 with bad export cookie 3999205229385308541
Apr 27 07:43:58 fir-md1-s1 kernel: LNet: Service thread pid 114991 was inactive for 200.64s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Apr 27 07:43:58 fir-md1-s1 kernel: LNet: Skipped 4 previous similar messages
Apr 27 07:43:58 fir-md1-s1 kernel: Pid: 114991, comm: mdt02_103 3.10.0-957.1.3.el7_lustre.x86_64 #1 SMP Fri Dec 7 14:50:35 PST 2018
Apr 27 07:43:58 fir-md1-s1 kernel: Call Trace:
Apr 27 07:43:58 fir-md1-s1 kernel: [<ffffffffc0d7ae25>] ldlm_completion_ast+0x4e5/0x890 [ptlrpc]
Apr 27 07:43:58 fir-md1-s1 kernel: [<ffffffffc0d7bc8c>] ldlm_cli_enqueue_local+0x23c/0x870 [ptlrpc]
Apr 27 07:43:58 fir-md1-s1 kernel: [<ffffffffc146b4bb>] mdt_object_local_lock+0x50b/0xb20 [mdt]
Apr 27 07:43:58 fir-md1-s1 kernel: [<ffffffffc146bb40>] mdt_object_lock_internal+0x70/0x3e0 [mdt]
Apr 27 07:43:58 fir-md1-s1 kernel: [<ffffffffc1474e45>] mdt_intent_getxattr+0xb5/0x270 [mdt]
Apr 27 07:43:58 fir-md1-s1 kernel: [<ffffffffc1471a28>] mdt_intent_policy+0x2e8/0xd00 [mdt]
Apr 27 07:43:58 fir-md1-s1 kernel: [<ffffffffc0d61d26>] ldlm_lock_enqueue+0x366/0xa60 [ptlrpc]
Apr 27 07:43:58 fir-md1-s1 kernel: [<ffffffffc0d8a6d7>] ldlm_handle_enqueue0+0xa47/0x15a0 [ptlrpc]
Apr 27 07:43:58 fir-md1-s1 kernel: [<ffffffffc0e110b2>] tgt_enqueue+0x62/0x210 [ptlrpc]
Apr 27 07:43:58 fir-md1-s1 kernel: [<ffffffffc0e1810a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
Apr 27 07:43:58 fir-md1-s1 kernel: [<ffffffffc0dbc6db>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
Apr 27 07:43:58 fir-md1-s1 kernel: [<ffffffffc0dc000c>] ptlrpc_main+0xafc/0x1fc0 [ptlrpc]
Apr 27 07:43:58 fir-md1-s1 kernel: [<ffffffff9b4c1c31>] kthread+0xd1/0xe0
Apr 27 07:43:58 fir-md1-s1 kernel: [<ffffffff9bb74c24>] ret_from_fork_nospec_begin+0xe/0x21
Apr 27 07:43:58 fir-md1-s1 kernel: [<ffffffffffffffff>] 0xffffffffffffffff
Apr 27 07:43:58 fir-md1-s1 kernel: LustreError: dumping log to /tmp/lustre-log.1556376238.114991
Apr 27 07:43:58 fir-md1-s1 kernel: Pid: 104389, comm: mdt01_004 3.10.0-957.1.3.el7_lustre.x86_64 #1 SMP Fri Dec 7 14:50:35 PST 2018
Apr 27 07:43:58 fir-md1-s1 kernel: Call Trace:
Apr 27 07:43:58 fir-md1-s1 kernel: [<ffffffffc0d7ae25>] ldlm_completion_ast+0x4e5/0x890 [ptlrpc]
Apr 27 07:43:58 fir-md1-s1 kernel: [<ffffffffc0d7bc8c>] ldlm_cli_enqueue_local+0x23c/0x870 [ptlrpc]
Apr 27 07:43:58 fir-md1-s1 kernel: [<ffffffffc146b4bb>] mdt_object_local_lock+0x50b/0xb20 [mdt]
Apr 27 07:43:58 fir-md1-s1 kernel: [<ffffffffc146bb40>] mdt_object_lock_internal+0x70/0x3e0 [mdt]
Apr 27 07:43:58 fir-md1-s1 kernel: [<ffffffffc1474e45>] mdt_intent_getxattr+0xb5/0x270 [mdt]
Apr 27 07:43:58 fir-md1-s1 kernel: [<ffffffffc1471a28>] mdt_intent_policy+0x2e8/0xd00 [mdt]
Apr 27 07:43:58 fir-md1-s1 kernel: [<ffffffffc0d61d26>] ldlm_lock_enqueue+0x366/0xa60 [ptlrpc]
Apr 27 07:43:58 fir-md1-s1 kernel: [<ffffffffc0d8a6d7>] ldlm_handle_enqueue0+0xa47/0x15a0 [ptlrpc]
Apr 27 07:43:58 fir-md1-s1 kernel: [<ffffffffc0e110b2>] tgt_enqueue+0x62/0x210 [ptlrpc]
Apr 27 07:43:58 fir-md1-s1 kernel: [<ffffffffc0e1810a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
Apr 27 07:43:58 fir-md1-s1 kernel: [<ffffffffc0dbc6db>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
Apr 27 07:43:58 fir-md1-s1 kernel: [<ffffffffc0dc000c>] ptlrpc_main+0xafc/0x1fc0 [ptlrpc]
Apr 27 07:43:58 fir-md1-s1 kernel: [<ffffffff9b4c1c31>] kthread+0xd1/0xe0
Apr 27 07:43:58 fir-md1-s1 kernel: [<ffffffff9bb74c24>] ret_from_fork_nospec_begin+0xe/0x21
Apr 27 07:43:58 fir-md1-s1 kernel: [<ffffffffffffffff>] 0xffffffffffffffff
Apr 27 07:44:07 fir-md1-s1 kernel: Lustre: 105128:0:(service.c:2165:ptlrpc_server_handle_request()) @@@ Request took longer than estimated (3750:1s); client may timeout. req@ffff8b5e67f12700 x1631586050523824/t0(0) o101->0fafc81c-d2f9-5fcc-1c5e-9d205df82025@10.9.104.20@o2ib4:6/0 lens 480/536 e 0 to 0 dl 1556376246
Apr 27 07:44:07 fir-md1-s1 kernel: LustreError: 104957:0:(ldlm_lockd.c:1357:ldlm_handle_enqueue0()) ### lock on destroyed export ffff8b5f75d7a000 ns: mdt-fir-MDT0000_UUID lock: ffff8b52d16cf980/0x378007f7ac0a79a9 lrc: 3/0,0 mode: PR/PR res: [0x20001a221:0x311:0x0].0x0 bits 0x20/0x0 rrc: 33 type: IBT flags: 0x5020000
Apr 27 07:44:07 fir-md1-s1 kernel: LustreError: 104957:0:(ldlm_lockd.c:1357:ldlm_handle_enqueue0()) Skipped 1 previous similar message
Apr 27 07:44:07 fir-md1-s1 kernel: Lustre: 105128:0:(service.c:2165:ptlrpc_server_handle_request()) Skipped 10 previous similar messages
Apr 27 08:25:20 fir-md1-s1 kernel: Lustre: fir-MDT0000: haven't heard from client f8efffa2-3213-b7c8-ccdc-9a5668e213bd (at 10.8.25.13@o2ib6) in 227 seconds. I think it's dead, and I am evicting it. exp ffff8b72fa1f6c00, cur 1556378720 expire 1556378570 last 1556378493
Apr 27 08:25:20 fir-md1-s1 kernel: Lustre: Skipped 2 previous similar messages
Apr 27 08:25:36 fir-md1-s1 kernel: Lustre: MGS: Connection restored to 8bfed59e-0ae2-8175-d0ee-2a61e691b9d0 (at 10.8.25.13@o2ib6)
|
| Comment by Stephane Thiell [ 29/Apr/19 ] |
|
We completed patching all of our clients (~1,400) last weekend. We haven't seen any new {{mdt_dom_discard_data}} call traces, so I consider this issue closed. However, we likely hit |
| Comment by Gerrit Updater [ 04/May/19 ] |
|
Mike Pershin (mpershin@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34804 |
| Comment by James A Simmons [ 22/May/19 ] |
|
We are also seeing this issue on our production system. I want to add that it can be reproduced by running the io500 script. |
| Comment by Peter Jones [ 25/May/19 ] |
|
James, have you tested the current patch (and Peter |
| Comment by Gerrit Updater [ 01/Jun/19 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/34071/ |
| Comment by Peter Jones [ 01/Jun/19 ] |
|
Landed for 2.13 |
| Comment by James A Simmons [ 01/Jun/19 ] |
|
Sorry, I missed the message. I have tested the |
| Comment by Stephane Thiell [ 06/Jun/19 ] |
|
The latest patchset of https://review.whamcloud.com/34071/ introduced the use of lu_env_find() (due to aa82cc83
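For anyone backporting this, here is a rough, self-contained model of the per-thread lookup that lu_env_find() appears to provide (finding an environment previously bound to the current thread). The names and the pthread-key mechanism below are toy assumptions for illustration, not anything from the Lustre tree.

{code:c}
/*
 * Toy model of a per-thread environment registry, illustrating the
 * kind of lookup lu_env_find() provides.  Names are hypothetical;
 * this is not the Lustre code.
 */
#include <pthread.h>
#include <stdio.h>

struct toy_env {
	int request_id;		/* whatever per-request state the thread carries */
};

static pthread_key_t env_key;

/* Bind an environment to the current thread. */
static void toy_env_add(struct toy_env *env)
{
	pthread_setspecific(env_key, env);
}

/* Find the environment bound to the current thread (the idea behind lu_env_find()). */
static struct toy_env *toy_env_find(void)
{
	return pthread_getspecific(env_key);
}

static void deep_helper(void)
{
	/* Deep in a call chain with no env argument, recover it by lookup. */
	struct toy_env *env = toy_env_find();

	if (env)
		printf("thread found its env, request_id=%d\n", env->request_id);
	else
		printf("no env bound to this thread\n");
}

static void *service_thread(void *arg)
{
	struct toy_env env = { .request_id = (int)(long)arg };

	toy_env_add(&env);
	deep_helper();
	toy_env_add(NULL);	/* unbind before the env goes out of scope */
	return NULL;
}

int main(void)
{
	pthread_t threads[2];

	pthread_key_create(&env_key, NULL);
	for (long i = 0; i < 2; i++)
		pthread_create(&threads[i], NULL, service_thread, (void *)i);
	for (int i = 0; i < 2; i++)
		pthread_join(threads[i], NULL);
	pthread_key_delete(env_key);
	return 0;
}
{code}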
| Comment by Patrick Farrell (Inactive) [ 06/Jun/19 ] |
|
|
| Comment by Stephane Thiell [ 06/Jun/19 ] |
|
Ah, cool. Thanks! |
| Comment by Gerrit Updater [ 11/Jun/19 ] |
|
Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/35197 |
| Comment by Gerrit Updater [ 27/Jun/19 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/35197/ |