[LU-11359] racer test 1 times out with client hung in dir_create.sh, ls, … and MDS in ldlm_completion_ast() Created: 10/Sep/18  Updated: 27/Jun/19  Resolved: 01/Jun/19

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.0, Lustre 2.10.7, Lustre 2.12.1
Fix Version/s: Lustre 2.13.0, Lustre 2.12.3

Type: Bug Priority: Major
Reporter: James Nunez (Inactive) Assignee: Mikhail Pershin
Resolution: Fixed Votes: 0
Labels: DoM2, ORNL

Attachments: Text File mdt_dom_discard_data_thread.log    
Issue Links:
Duplicate
is duplicated by LU-12091 DNE/DOM: client evictions -108 Resolved
is duplicated by LU-12121 DOM: BRW to missing obj? Resolved
Related
is related to LU-12037 Possible DNE issue leading to hung fi... Resolved
is related to LU-12321 Unlink speed needs to be improved in ... Resolved
is related to LU-6758 racer test_1: test failed to respond ... Closed
is related to LU-11751 racer deadlocks due to DOM glimpse re... Closed
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Looking at the logs for https://testing.whamcloud.com/test_sets/ada6008e-b433-11e8-b86b-52540065bddc, we see that racer test_1 hangs in ls and dir_create.sh. On both clients, several dir_create.sh invocations, multiple ls calls, getfattr, ln, and other processes are blocked/hung.

From the first client’s (vm9) console log, we see many hung tasks, ending with “lustre: revalidate FID [0x200007161:0x1:0x0] error: rc = -4”:

[ 7722.701554] Lustre: DEBUG MARKER: DURATION=300 			MDSCOUNT=1 OSTCOUNT=7			RACER_ENABLE_REMOTE_DIRS=false 			RACER_ENABLE_STRIPED_DIRS=false 			RACER_ENABLE_MIGRATION=false 			RACER_ENABLE_PFL=true 			RACER_ENABLE_DOM=true 			RACER_ENABLE_FLR=true 			LFS=/usr/bin/lfs 			/usr/lib64/lustre/t
[ 7722.706903] Lustre: DEBUG MARKER: DURATION=300 			MDSCOUNT=1 OSTCOUNT=7			RACER_ENABLE_REMOTE_DIRS=false 			RACER_ENABLE_STRIPED_DIRS=false 			RACER_ENABLE_MIGRATION=false 			RACER_ENABLE_PFL=true 			RACER_ENABLE_DOM=true 			RACER_ENABLE_FLR=true 			LFS=/usr/bin/lfs 			/usr/lib64/lustre/t
[ 7726.780425] 3[19667]: segfault at 0 ip 0000000000403e5f sp 00007fff05523690 error 6 in 3 (deleted)[400000+6000]
[ 7738.946240] Lustre: lfs: using old ioctl(LL_IOC_LOV_GETSTRIPE) on [0x200007163:0x167:0x0], use llapi_layout_get_by_path()
[ 7858.011870] Lustre: lustre-OST0004-osc-ffff8e6e2480f800: Connection to lustre-OST0004 (at 10.9.4.117@tcp) was lost; in progress operations using this service will wait for recovery to complete
[ 7858.015216] LustreError: 167-0: lustre-OST0004-osc-ffff8e6e2480f800: This client was evicted by lustre-OST0004; in progress operations using this service will fail.
[ 7858.019049] LustreError: 1426:0:(ldlm_resource.c:1146:ldlm_resource_complain()) lustre-OST0004-osc-ffff8e6e2480f800: namespace resource [0x832e:0x0:0x0].0x0 (ffff8e6e0ce026c0) refcount nonzero (2) after lock cleanup; forcing cleanup.
[ 7920.249204] INFO: task dir_create.sh:15717 blocked for more than 120 seconds.
[ 7920.250063] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 7920.250808] dir_create.sh   D ffff8e6e02b14f10     0 15717  15658 0x00000080
[ 7920.251566] Call Trace:
[ 7920.251882]  [<ffffffffc0db47e2>] ? ll_dcompare+0x72/0x2e0 [lustre]
[ 7920.252506]  [<ffffffffa4714f49>] schedule_preempt_disabled+0x29/0x70
[ 7920.253246]  [<ffffffffa4712d07>] __mutex_lock_slowpath+0xc7/0x1d0
[ 7920.253901]  [<ffffffffa47120ef>] mutex_lock+0x1f/0x2f
[ 7920.254400]  [<ffffffffa470bcde>] lookup_slow+0x33/0xa7
[ 7920.254914]  [<ffffffffa422927f>] link_path_walk+0x80f/0x8b0
[ 7920.255597]  [<ffffffffa422b75d>] ? do_last+0x66d/0x12c0
[ 7920.256108]  [<ffffffffa422c465>] path_openat+0xb5/0x640
[ 7920.256614]  [<ffffffffa422e01d>] do_filp_open+0x4d/0xb0
[ 7920.257118]  [<ffffffffa423b4c4>] ? __alloc_fd+0xc4/0x170
[ 7920.257637]  [<ffffffffa421a327>] do_sys_open+0x137/0x240
[ 7920.258159]  [<ffffffffa47206d5>] ? system_call_after_swapgs+0xa2/0x146
[ 7920.258783]  [<ffffffffa421a44e>] SyS_open+0x1e/0x20
[ 7920.259260]  [<ffffffffa4720795>] system_call_fastpath+0x1c/0x21
[ 7920.259819]  [<ffffffffa47206e1>] ? system_call_after_swapgs+0xae/0x146
[ 7920.260552] INFO: task cp:32379 blocked for more than 120 seconds.
[ 7920.261136] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 7920.261877] cp              D ffff8e6e2611af70     0 32379  15806 0x00000080
[ 7920.262581] Call Trace:
[ 7920.262857]  [<ffffffffc0df4cc5>] ? ll_i2suppgid+0x15/0x40 [lustre]
[ 7920.263466]  [<ffffffffc0df4d14>] ? ll_i2gids+0x24/0xb0 [lustre]
[ 7920.264031]  [<ffffffffa4714f49>] schedule_preempt_disabled+0x29/0x70
[ 7920.264635]  [<ffffffffa4712d07>] __mutex_lock_slowpath+0xc7/0x1d0
[ 7920.265222]  [<ffffffffa47120ef>] mutex_lock+0x1f/0x2f
[ 7920.265711]  [<ffffffffc0de785c>] ll_setattr_raw+0xb6c/0xee0 [lustre]
[ 7920.266332]  [<ffffffffc0de7c3c>] ll_setattr+0x6c/0xd0 [lustre]
[ 7920.266897]  [<ffffffffa4239f14>] notify_change+0x2c4/0x420
[ 7920.267428]  [<ffffffffa424fab9>] utimes_common+0xd9/0x1c0
[ 7920.267952]  [<ffffffffa424fcde>] do_utimes+0x13e/0x180
[ 7920.268452]  [<ffffffffa424fe24>] SyS_utimensat+0x64/0xb0
[ 7920.268966]  [<ffffffffa47206e1>] ? system_call_after_swapgs+0xae/0x146
[ 7920.269585]  [<ffffffffa4720795>] system_call_fastpath+0x1c/0x21
[ 7920.270156]  [<ffffffffa47206e1>] ? system_call_after_swapgs+0xae/0x146
[ 7920.270775] INFO: task cp:434 blocked for more than 120 seconds.
[ 7920.271344] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 7920.272068] cp              D ffff8e6df7cb8000     0   434  15776 0x00000080
[ 7920.272778] Call Trace:
[ 7920.273028]  [<ffffffffa4714f49>] schedule_preempt_disabled+0x29/0x70
[ 7920.273636]  [<ffffffffa4712d07>] __mutex_lock_slowpath+0xc7/0x1d0
[ 7920.274225]  [<ffffffffa47120ef>] mutex_lock+0x1f/0x2f
[ 7920.274721]  [<ffffffffc0e14bc3>] vvp_io_write_start+0x603/0x810 [lustre]
[ 7920.275506]  [<ffffffffc0978377>] ? cl_lock_request+0x67/0x1f0 [obdclass]
[ 7920.276191]  [<ffffffffc097a228>] cl_io_start+0x68/0x130 [obdclass]
[ 7920.276803]  [<ffffffffc097c611>] cl_io_loop+0x131/0xc70 [obdclass]
[ 7920.277417]  [<ffffffffc0dc9db2>] ll_file_io_generic+0x4e2/0xd10 [lustre]
[ 7920.278064]  [<ffffffffc0dcab32>] ll_file_aio_write+0x372/0x540 [lustre]
[ 7920.278699]  [<ffffffffc0dcada4>] ll_file_write+0xa4/0x170 [lustre]
[ 7920.279300]  [<ffffffffa421b490>] vfs_write+0xc0/0x1f0
[ 7920.279776]  [<ffffffffa47206e1>] ? system_call_after_swapgs+0xae/0x146
[ 7920.280408]  [<ffffffffa421c2bf>] SyS_write+0x7f/0xf0
[ 7920.280894]  [<ffffffffa47206e1>] ? system_call_after_swapgs+0xae/0x146
[ 7920.281512]  [<ffffffffa4720795>] system_call_fastpath+0x1c/0x21
[ 7920.282096]  [<ffffffffa47206e1>] ? system_call_after_swapgs+0xae/0x146
[ 7920.282719] INFO: task getfattr:1159 blocked for more than 120 seconds.
[ 7920.283347] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 7920.284090] getfattr        D ffff8e6df815af70     0  1159  15746 0x00000080
[ 7920.284799] Call Trace:
[ 7920.285054]  [<ffffffffa4714f49>] schedule_preempt_disabled+0x29/0x70
[ 7920.285660]  [<ffffffffa4712d07>] __mutex_lock_slowpath+0xc7/0x1d0
[ 7920.286244]  [<ffffffffa47120ef>] mutex_lock+0x1f/0x2f
[ 7920.286726]  [<ffffffffa470bcde>] lookup_slow+0x33/0xa7
[ 7920.287231]  [<ffffffffa42291d4>] link_path_walk+0x764/0x8b0
[ 7920.287770]  [<ffffffffc094c943>] ? class_handle2object+0xc3/0x1c0 [obdclass]
[ 7920.288453]  [<ffffffffa422a24a>] path_lookupat+0x7a/0x8b0
[ 7920.288981]  [<ffffffffa41f7b9b>] ? kmem_cache_alloc+0x19b/0x1f0
[ 7920.289547]  [<ffffffffa422cd4f>] ? getname_flags+0x4f/0x1a0
[ 7920.290096]  [<ffffffffa422aaab>] filename_lookup+0x2b/0xc0
[ 7920.290624]  [<ffffffffa422dee7>] user_path_at_empty+0x67/0xc0
[ 7920.291185]  [<ffffffffa4125092>] ? from_kgid_munged+0x12/0x20
[ 7920.291740]  [<ffffffffa422142f>] ? cp_new_stat+0x14f/0x180
[ 7920.292278]  [<ffffffffa422df51>] user_path_at+0x11/0x20
[ 7920.292775]  [<ffffffffa4220f23>] vfs_fstatat+0x63/0xc0
[ 7920.293282]  [<ffffffffa422148e>] SYSC_newstat+0x2e/0x60
[ 7920.293776]  [<ffffffffa47206e1>] ? system_call_after_swapgs+0xae/0x146
[ 7920.294407]  [<ffffffffa47206d5>] ? system_call_after_swapgs+0xa2/0x146
[ 7920.295029]  [<ffffffffa47206e1>] ? system_call_after_swapgs+0xae/0x146
[ 7920.295645]  [<ffffffffa47206d5>] ? system_call_after_swapgs+0xa2/0x146
[ 7920.296272]  [<ffffffffa47206e1>] ? system_call_after_swapgs+0xae/0x146
[ 7920.296895]  [<ffffffffa47206d5>] ? system_call_after_swapgs+0xa2/0x146
[ 7920.297512]  [<ffffffffa47206e1>] ? system_call_after_swapgs+0xae/0x146
[ 7920.298132]  [<ffffffffa47206d5>] ? system_call_after_swapgs+0xa2/0x146
[ 7920.298746]  [<ffffffffa47206e1>] ? system_call_after_swapgs+0xae/0x146
[ 7920.299373]  [<ffffffffa47206d5>] ? system_call_after_swapgs+0xa2/0x146
[ 7920.300000]  [<ffffffffa47206e1>] ? system_call_after_swapgs+0xae/0x146
[ 7920.300616]  [<ffffffffa422176e>] SyS_newstat+0xe/0x10
[ 7920.301103]  [<ffffffffa4720795>] system_call_fastpath+0x1c/0x21
[ 7920.301668]  [<ffffffffa47206e1>] ? system_call_after_swapgs+0xae/0x146
[ 7920.302299] INFO: task getfattr:1656 blocked for more than 120 seconds.
[ 7920.302919] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 7920.303653] getfattr        D ffff8e6e06b6eeb0     0  1656  15679 0x00000080
[ 7920.304366] Call Trace:
[ 7920.304606]  [<ffffffffa4714f49>] schedule_preempt_disabled+0x29/0x70
[ 7920.305221]  [<ffffffffa4712d07>] __mutex_lock_slowpath+0xc7/0x1d0
[ 7920.305790]  [<ffffffffa47120ef>] mutex_lock+0x1f/0x2f
[ 7920.306292]  [<ffffffffa470bcde>] lookup_slow+0x33/0xa7
[ 7920.306784]  [<ffffffffa422aa08>] path_lookupat+0x838/0x8b0
[ 7920.307321]  [<ffffffffa41f7b79>] ? kmem_cache_alloc+0x179/0x1f0
[ 7920.307887]  [<ffffffffa422cd4f>] ? getname_flags+0x4f/0x1a0
[ 7920.308418]  [<ffffffffa422aaab>] filename_lookup+0x2b/0xc0
[ 7920.308946]  [<ffffffffa422dee7>] user_path_at_empty+0x67/0xc0
[ 7920.309494]  [<ffffffffa422df51>] user_path_at+0x11/0x20
[ 7920.309998]  [<ffffffffa4243f68>] SyS_listxattr+0x48/0xb0
[ 7920.310506]  [<ffffffffa47206d5>] ? system_call_after_swapgs+0xa2/0x146
[ 7920.311125]  [<ffffffffa47206e1>] ? system_call_after_swapgs+0xae/0x146
[ 7920.311749]  [<ffffffffa4720795>] system_call_fastpath+0x1c/0x21
[ 7920.312320]  [<ffffffffa47206e1>] ? system_call_after_swapgs+0xae/0x146
[ 7920.312939] INFO: task chmod:1674 blocked for more than 120 seconds.
[ 7920.313533] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 7920.314270] chmod           D ffff8e6e02b12f70     0  1674  15675 0x00000080
[ 7920.314990] Call Trace:
[ 7920.315238]  [<ffffffffa4714f49>] schedule_preempt_disabled+0x29/0x70
[ 7920.315829]  [<ffffffffa4712d07>] __mutex_lock_slowpath+0xc7/0x1d0
[ 7920.316424]  [<ffffffffa47120ef>] mutex_lock+0x1f/0x2f
[ 7920.316921]  [<ffffffffa470bcde>] lookup_slow+0x33/0xa7
[ 7920.317417]  [<ffffffffa422aa08>] path_lookupat+0x838/0x8b0
[ 7920.317955]  [<ffffffffa406814e>] ? kvm_clock_get_cycles+0x1e/0x20
[ 7920.318549]  [<ffffffffa41f7b79>] ? kmem_cache_alloc+0x179/0x1f0
[ 7920.319125]  [<ffffffffa422cd4f>] ? getname_flags+0x4f/0x1a0
[ 7920.319657]  [<ffffffffa422aaab>] filename_lookup+0x2b/0xc0
[ 7920.320204]  [<ffffffffa422dee7>] user_path_at_empty+0x67/0xc0
[ 7920.320749]  [<ffffffffa471b57c>] ? __do_page_fault+0x1bc/0x4f0
[ 7920.321316]  [<ffffffffa422df51>] user_path_at+0x11/0x20
[ 7920.321814]  [<ffffffffa4219d07>] SyS_fchmodat+0x47/0xc0
[ 7920.322322]  [<ffffffffa47206d5>] ? system_call_after_swapgs+0xa2/0x146
[ 7920.322942]  [<ffffffffa47206e1>] ? system_call_after_swapgs+0xae/0x146
[ 7920.323558]  [<ffffffffa4720795>] system_call_fastpath+0x1c/0x21
[ 7920.324124]  [<ffffffffa47206e1>] ? system_call_after_swapgs+0xae/0x146
[ 7920.324743] INFO: task getfattr:1834 blocked for more than 120 seconds.
[ 7920.325372] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 7920.326101] getfattr        D ffff8e6df7068000     0  1834  15711 0x00000080
[ 7920.326802] Call Trace:
[ 7920.327060]  [<ffffffffa4714f49>] schedule_preempt_disabled+0x29/0x70
[ 7920.327661]  [<ffffffffa4712d07>] __mutex_lock_slowpath+0xc7/0x1d0
[ 7920.328249]  [<ffffffffa47120ef>] mutex_lock+0x1f/0x2f
[ 7920.328745]  [<ffffffffa470bcde>] lookup_slow+0x33/0xa7
[ 7920.329254]  [<ffffffffa422aa08>] path_lookupat+0x838/0x8b0
[ 7920.329778]  [<ffffffffa4242b12>] ? generic_getxattr+0x52/0x70
[ 7920.330338]  [<ffffffffa41f7b79>] ? kmem_cache_alloc+0x179/0x1f0
[ 7920.331029]  [<ffffffffa422cd4f>] ? getname_flags+0x4f/0x1a0
[ 7920.331651]  [<ffffffffa422aaab>] filename_lookup+0x2b/0xc0
[ 7920.332278]  [<ffffffffa422dee7>] user_path_at_empty+0x67/0xc0
[ 7920.332952]  [<ffffffffa471b57c>] ? __do_page_fault+0x1bc/0x4f0
[ 7920.333913]  [<ffffffffa422df51>] user_path_at+0x11/0x20
[ 7920.335027]  [<ffffffffa4243d2a>] SyS_getxattr+0x4a/0xc0
[ 7920.336126]  [<ffffffffa47206d5>] ? system_call_after_swapgs+0xa2/0x146
[ 7920.336980]  [<ffffffffa47206e1>] ? system_call_after_swapgs+0xae/0x146
[ 7920.337720]  [<ffffffffa4720795>] system_call_fastpath+0x1c/0x21
[ 7920.338301]  [<ffffffffa47206e1>] ? system_call_after_swapgs+0xae/0x146
[ 7920.338928] INFO: task rm:1977 blocked for more than 120 seconds.
[ 7920.339503] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 7920.340232] rm              D ffff8e6e05499fa0     0  1977  15734 0x00000080
[ 7920.340937] Call Trace:
[ 7920.341181]  [<ffffffffa4714f49>] schedule_preempt_disabled+0x29/0x70
[ 7920.341784]  [<ffffffffa4712d07>] __mutex_lock_slowpath+0xc7/0x1d0
[ 7920.342374]  [<ffffffffa47120ef>] mutex_lock+0x1f/0x2f
[ 7920.342857]  [<ffffffffa422d265>] do_rmdir+0x165/0x220
[ 7920.343352]  [<ffffffffa47206e1>] ? system_call_after_swapgs+0xae/0x146
[ 7920.343972]  [<ffffffffa47206d5>] ? system_call_after_swapgs+0xa2/0x146
[ 7920.344586]  [<ffffffffa47206e1>] ? system_call_after_swapgs+0xae/0x146
[ 7920.345204]  [<ffffffffa47206d5>] ? system_call_after_swapgs+0xa2/0x146
[ 7920.345816]  [<ffffffffa47206e1>] ? system_call_after_swapgs+0xae/0x146
[ 7920.346444]  [<ffffffffa47206d5>] ? system_call_after_swapgs+0xa2/0x146
[ 7920.347063]  [<ffffffffa422e555>] SyS_unlinkat+0x25/0x40
[ 7920.347565]  [<ffffffffa4720795>] system_call_fastpath+0x1c/0x21
[ 7920.348130]  [<ffffffffa47206e1>] ? system_call_after_swapgs+0xae/0x146
[ 7920.348747] INFO: task mv:2144 blocked for more than 120 seconds.
[ 7920.349323] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 7920.350049] mv              D ffff8e6e05498000     0  2144  15700 0x00000080
[ 7920.350756] Call Trace:
[ 7920.351005]  [<ffffffffa422a2f2>] ? path_lookupat+0x122/0x8b0
[ 7920.351547]  [<ffffffffa4714f49>] schedule_preempt_disabled+0x29/0x70
[ 7920.352148]  [<ffffffffa4712d07>] __mutex_lock_slowpath+0xc7/0x1d0
[ 7920.352728]  [<ffffffffa47120ef>] mutex_lock+0x1f/0x2f
[ 7920.353222]  [<ffffffffa42276e5>] lock_rename+0x55/0xe0
[ 7920.353713]  [<ffffffffa422d81f>] SYSC_renameat2+0x22f/0x5a0
[ 7920.354255]  [<ffffffffa47206e1>] ? system_call_after_swapgs+0xae/0x146
[ 7920.354871]  [<ffffffffa47206d5>] ? system_call_after_swapgs+0xa2/0x146
[ 7920.355489]  [<ffffffffa47206e1>] ? system_call_after_swapgs+0xae/0x146
[ 7920.356110]  [<ffffffffa47206d5>] ? system_call_after_swapgs+0xa2/0x146
[ 7920.356732]  [<ffffffffa422ea0e>] SyS_renameat2+0xe/0x10
[ 7920.357247]  [<ffffffffa422ea4e>] SyS_rename+0x1e/0x20
[ 7920.357731]  [<ffffffffa4720795>] system_call_fastpath+0x1c/0x21
[ 7920.358303]  [<ffffffffa47206e1>] ? system_call_after_swapgs+0xae/0x146
[ 7920.358930] INFO: task ln:2148 blocked for more than 120 seconds.
[ 7920.359504] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 7920.360254] ln              D ffff8e6df7cbeeb0     0  2148  15720 0x00000080
[ 7920.360973] Call Trace:
[ 7920.361221]  [<ffffffffa4714f49>] schedule_preempt_disabled+0x29/0x70
[ 7920.361812]  [<ffffffffa4712d07>] __mutex_lock_slowpath+0xc7/0x1d0
[ 7920.362402]  [<ffffffffa47120ef>] mutex_lock+0x1f/0x2f
[ 7920.362896]  [<ffffffffa470bcde>] lookup_slow+0x33/0xa7
[ 7920.363390]  [<ffffffffa422aa08>] path_lookupat+0x838/0x8b0
[ 7920.363924]  [<ffffffffa41944fb>] ? unlock_page+0x2b/0x30
[ 7920.364444]  [<ffffffffa41c13f1>] ? do_read_fault.isra.60+0x121/0x1a0
[ 7920.365048]  [<ffffffffa41f7b79>] ? kmem_cache_alloc+0x179/0x1f0
[ 7920.365610]  [<ffffffffa422cd4f>] ? getname_flags+0x4f/0x1a0
[ 7920.366146]  [<ffffffffa422aaab>] filename_lookup+0x2b/0xc0
[ 7920.366671]  [<ffffffffa422dee7>] user_path_at_empty+0x67/0xc0
[ 7920.367232]  [<ffffffffa41c7c3d>] ? handle_mm_fault+0x39d/0x9b0
[ 7920.367777]  [<ffffffffa4125092>] ? from_kgid_munged+0x12/0x20
[ 7920.368329]  [<ffffffffa422df51>] user_path_at+0x11/0x20
[ 7920.368821]  [<ffffffffa4220f23>] vfs_fstatat+0x63/0xc0
[ 7920.369344]  [<ffffffffa42214f1>] SYSC_newlstat+0x31/0x60
[ 7920.369857]  [<ffffffffa47206e1>] ? system_call_after_swapgs+0xae/0x146
[ 7920.370495]  [<ffffffffa47206d5>] ? system_call_after_swapgs+0xa2/0x146
[ 7920.371115]  [<ffffffffa47206e1>] ? system_call_after_swapgs+0xae/0x146
[ 7920.371737]  [<ffffffffa47206d5>] ? system_call_after_swapgs+0xa2/0x146
[ 7920.372363]  [<ffffffffa47206e1>] ? system_call_after_swapgs+0xae/0x146
[ 7920.372987]  [<ffffffffa47206d5>] ? system_call_after_swapgs+0xa2/0x146
[ 7920.373604]  [<ffffffffa47206e1>] ? system_call_after_swapgs+0xae/0x146
[ 7920.374250]  [<ffffffffa47206d5>] ? system_call_after_swapgs+0xa2/0x146
[ 7920.374872]  [<ffffffffa47206e1>] ? system_call_after_swapgs+0xae/0x146
[ 7920.375495]  [<ffffffffa47206d5>] ? system_call_after_swapgs+0xa2/0x146
[ 7920.376114]  [<ffffffffa47206e1>] ? system_call_after_swapgs+0xae/0x146
[ 7920.376736]  [<ffffffffa422177e>] SyS_newlstat+0xe/0x10
[ 7920.377240]  [<ffffffffa4720795>] system_call_fastpath+0x1c/0x21
[ 7920.377806]  [<ffffffffa47206e1>] ? system_call_after_swapgs+0xae/0x146
[ 8022.948194] Lustre: lustre-OST0004-osc-ffff8e6e2480f800: Connection restored to 10.9.4.117@tcp (at 10.9.4.117@tcp)
[ 8023.080282] LustreError: 4670:0:(file.c:4366:ll_inode_revalidate_fini()) lustre: revalidate FID [0x200007161:0x1:0x0] error: rc = -4
[ 8023.080306] LustreError: 4664:0:(file.c:4366:ll_inode_revalidate_fini()) lustre: revalidate FID [0x200007161:0x1:0x0] error: rc = -4

From the second client’s (vm10) console log, we see hung tasks similar to those on the first client:

[ 7722.357867] Lustre: DEBUG MARKER: == racer test 1: racer on clients: trevis-10vm10,trevis-10vm9.trevis.whamcloud.com DURATION=300 ====== 12:19:23 (1536495563)
[ 7722.541747] Lustre: DEBUG MARKER: DURATION=300 			MDSCOUNT=1 OSTCOUNT=7			RACER_ENABLE_REMOTE_DIRS=false 			RACER_ENABLE_STRIPED_DIRS=false 			RACER_ENABLE_MIGRATION=false 			RACER_ENABLE_PFL=true 			RACER_ENABLE_DOM=true 			RACER_ENABLE_FLR=true 			LFS=/usr/bin/lfs 			/usr/lib64/lustre/t
[ 7722.542686] Lustre: DEBUG MARKER: DURATION=300 			MDSCOUNT=1 OSTCOUNT=7			RACER_ENABLE_REMOTE_DIRS=false 			RACER_ENABLE_STRIPED_DIRS=false 			RACER_ENABLE_MIGRATION=false 			RACER_ENABLE_PFL=true 			RACER_ENABLE_DOM=true 			RACER_ENABLE_FLR=true 			LFS=/usr/bin/lfs 			/usr/lib64/lustre/t
[ 7726.494344] Lustre: lfs: using old ioctl(LL_IOC_LOV_GETSTRIPE) on [0x200007164:0x60:0x0], use llapi_layout_get_by_path()
[ 7920.259874] INFO: task dir_create.sh:14445 blocked for more than 120 seconds.
[ 7920.260874] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 7920.261634] dir_create.sh   D ffff9bac39240fd0     0 14445  14433 0x00000080
[ 7920.262414] Call Trace:
[ 7920.262749]  [<ffffffffc0e427e2>] ? ll_dcompare+0x72/0x2e0 [lustre]
[ 7920.263384]  [<ffffffff88f14f49>] schedule_preempt_disabled+0x29/0x70
[ 7920.264027]  [<ffffffff88f12d07>] __mutex_lock_slowpath+0xc7/0x1d0
[ 7920.264623]  [<ffffffff88f120ef>] mutex_lock+0x1f/0x2f
[ 7920.265151]  [<ffffffff88f0bcde>] lookup_slow+0x33/0xa7
[ 7920.265692]  [<ffffffff88a2927f>] link_path_walk+0x80f/0x8b0
[ 7920.266253]  [<ffffffff88a2b75d>] ? do_last+0x66d/0x12c0
[ 7920.266782]  [<ffffffff88a2c465>] path_openat+0xb5/0x640
[ 7920.267314]  [<ffffffff88a2e01d>] do_filp_open+0x4d/0xb0
[ 7920.267848]  [<ffffffff88a3b4c4>] ? __alloc_fd+0xc4/0x170
[ 7920.268392]  [<ffffffff88a1a327>] do_sys_open+0x137/0x240
[ 7920.268943]  [<ffffffff88f206d5>] ? system_call_after_swapgs+0xa2/0x146
[ 7920.269595]  [<ffffffff88a1a44e>] SyS_open+0x1e/0x20
[ 7920.270104]  [<ffffffff88f20795>] system_call_fastpath+0x1c/0x21
[ 7920.270693]  [<ffffffff88f206e1>] ? system_call_after_swapgs+0xae/0x146
[ 7920.271383] INFO: task ls:23458 blocked for more than 120 seconds.
[ 7920.271998] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 7920.272770] ls              D ffff9bac1fd82f70     0 23458  14490 0x00000080
[ 7920.273503] Call Trace:
[ 7920.273774]  [<ffffffff88f14f49>] schedule_preempt_disabled+0x29/0x70
[ 7920.274406]  [<ffffffff88f12d07>] __mutex_lock_slowpath+0xc7/0x1d0
[ 7920.275021]  [<ffffffff88f120ef>] mutex_lock+0x1f/0x2f
[ 7920.275525]  [<ffffffff88a2b37f>] do_last+0x28f/0x12c0
[ 7920.276048]  [<ffffffff88a2c487>] path_openat+0xd7/0x640
[ 7920.276582]  [<ffffffffc0e4e929>] ? ll_file_data_put+0x89/0x180 [lustre]
[ 7920.277251]  [<ffffffff88a2e01d>] do_filp_open+0x4d/0xb0
[ 7920.277784]  [<ffffffff88a3b4c4>] ? __alloc_fd+0xc4/0x170
[ 7920.278318]  [<ffffffff88a1a327>] do_sys_open+0x137/0x240
[ 7920.278858]  [<ffffffff88f206d5>] ? system_call_after_swapgs+0xa2/0x146
[ 7920.279515]  [<ffffffff88a1a464>] SyS_openat+0x14/0x20
[ 7920.280033]  [<ffffffff88f20795>] system_call_fastpath+0x1c/0x21
[ 7920.280614]  [<ffffffff88f206e1>] ? system_call_after_swapgs+0xae/0x146
[ 7920.281265] INFO: task ls:23462 blocked for more than 120 seconds.
[ 7920.281868] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 7920.282636] ls              D ffff9bac1fc13f40     0 23462  14490 0x00000080
[ 7920.283377] Call Trace:
[ 7920.283638]  [<ffffffff88f14f49>] schedule_preempt_disabled+0x29/0x70
[ 7920.284279]  [<ffffffff88f12d07>] __mutex_lock_slowpath+0xc7/0x1d0
[ 7920.284885]  [<ffffffff88f120ef>] mutex_lock+0x1f/0x2f
[ 7920.285399]  [<ffffffff88a2b37f>] do_last+0x28f/0x12c0
[ 7920.285922]  [<ffffffff88a2c487>] path_openat+0xd7/0x640
[ 7920.286448]  [<ffffffffc0e4e929>] ? ll_file_data_put+0x89/0x180 [lustre]
[ 7920.287110]  [<ffffffff88a2e01d>] do_filp_open+0x4d/0xb0
[ 7920.287631]  [<ffffffff88a3b4c4>] ? __alloc_fd+0xc4/0x170
[ 7920.288173]  [<ffffffff88a1a327>] do_sys_open+0x137/0x240
[ 7920.288717]  [<ffffffff88f206d5>] ? system_call_after_swapgs+0xa2/0x146
[ 7920.289361]  [<ffffffff88a1a464>] SyS_openat+0x14/0x20
[ 7920.289878]  [<ffffffff88f20795>] system_call_fastpath+0x1c/0x21
[ 7920.290470]  [<ffffffff88f206e1>] ? system_call_after_swapgs+0xae/0x146
[ 7920.291141] INFO: task ls:23464 blocked for more than 120 seconds.
[ 7920.291746] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 7920.292497] ls              D ffff9bac20318000     0 23464  14490 0x00000080
[ 7920.293256] Call Trace:
[ 7920.293500]  [<ffffffff88f14f49>] schedule_preempt_disabled+0x29/0x70
[ 7920.294150]  [<ffffffff88f12d07>] __mutex_lock_slowpath+0xc7/0x1d0
[ 7920.294767]  [<ffffffff88f120ef>] mutex_lock+0x1f/0x2f
[ 7920.295280]  [<ffffffff88a2b37f>] do_last+0x28f/0x12c0
[ 7920.295790]  [<ffffffff88a2c487>] path_openat+0xd7/0x640
[ 7920.296327]  [<ffffffffc0e4e929>] ? ll_file_data_put+0x89/0x180 [lustre]
[ 7920.296990]  [<ffffffff88a2e01d>] do_filp_open+0x4d/0xb0
[ 7920.297510]  [<ffffffff88a3b4c4>] ? __alloc_fd+0xc4/0x170
[ 7920.298056]  [<ffffffff88a1a327>] do_sys_open+0x137/0x240
[ 7920.298587]  [<ffffffff88f206d5>] ? system_call_after_swapgs+0xa2/0x146
[ 7920.299256]  [<ffffffff88a1a464>] SyS_openat+0x14/0x20
[ 7920.299775]  [<ffffffff88f20795>] system_call_fastpath+0x1c/0x21
[ 7920.300367]  [<ffffffff88f206e1>] ? system_call_after_swapgs+0xae/0x146
[ 7920.301015] INFO: task ls:23466 blocked for more than 120 seconds.
[ 7920.301612] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 7920.302395] ls              D ffff9bac2031af70     0 23466  14490 0x00000080
[ 7920.303136] Call Trace:
[ 7920.303397]  [<ffffffff88f14f49>] schedule_preempt_disabled+0x29/0x70
[ 7920.304035]  [<ffffffff88f12d07>] __mutex_lock_slowpath+0xc7/0x1d0
[ 7920.304632]  [<ffffffff88f120ef>] mutex_lock+0x1f/0x2f
[ 7920.305155]  [<ffffffff88a2b37f>] do_last+0x28f/0x12c0
[ 7920.305673]  [<ffffffff88a2c487>] path_openat+0xd7/0x640
[ 7920.306200]  [<ffffffff88a2cd4f>] ? getname_flags+0x4f/0x1a0
[ 7920.306762]  [<ffffffffc0e4e929>] ? ll_file_data_put+0x89/0x180 [lustre]
[ 7920.307431]  [<ffffffff88a2e01d>] do_filp_open+0x4d/0xb0
[ 7920.307963]  [<ffffffff88a3b4c4>] ? __alloc_fd+0xc4/0x170
[ 7920.308493]  [<ffffffff88a1a327>] do_sys_open+0x137/0x240
[ 7920.309037]  [<ffffffff88f206d5>] ? system_call_after_swapgs+0xa2/0x146
[ 7920.309683]  [<ffffffff88a1a464>] SyS_openat+0x14/0x20
[ 7920.310196]  [<ffffffff88f20795>] system_call_fastpath+0x1c/0x21
[ 7920.310786]  [<ffffffff88f206e1>] ? system_call_after_swapgs+0xae/0x146
[ 7920.311447] INFO: task getfattr:1379 blocked for more than 120 seconds.
[ 7920.312100] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 7920.312876] getfattr        D ffff9bac1d229fa0     0  1379  14538 0x00000080
[ 7920.313611] Call Trace:
[ 7920.313884]  [<ffffffff88f14f49>] schedule_preempt_disabled+0x29/0x70
[ 7920.314513]  [<ffffffff88f12d07>] __mutex_lock_slowpath+0xc7/0x1d0
[ 7920.315128]  [<ffffffff88f120ef>] mutex_lock+0x1f/0x2f
[ 7920.315631]  [<ffffffff88f0bcde>] lookup_slow+0x33/0xa7
[ 7920.316162]  [<ffffffff88a2aa08>] path_lookupat+0x838/0x8b0
[ 7920.316722]  [<ffffffff88a42b12>] ? generic_getxattr+0x52/0x70
[ 7920.317294]  [<ffffffff88a42f6a>] ? vfs_getxattr+0x8a/0xb0
[ 7920.317855]  [<ffffffff889f7a35>] ? kmem_cache_alloc+0x35/0x1f0
[ 7920.318436]  [<ffffffff88a2cd4f>] ? getname_flags+0x4f/0x1a0
[ 7920.319001]  [<ffffffff88a2aaab>] filename_lookup+0x2b/0xc0
[ 7920.319544]  [<ffffffff88a2dee7>] user_path_at_empty+0x67/0xc0
[ 7920.320131]  [<ffffffff88f1b57c>] ? __do_page_fault+0x1bc/0x4f0
[ 7920.320718]  [<ffffffff88a2df51>] user_path_at+0x11/0x20
[ 7920.321247]  [<ffffffff88a43d2a>] SyS_getxattr+0x4a/0xc0
[ 7920.321777]  [<ffffffff88f206d5>] ? system_call_after_swapgs+0xa2/0x146
[ 7920.322439]  [<ffffffff88f206e1>] ? system_call_after_swapgs+0xae/0x146
[ 7920.323091]  [<ffffffff88f20795>] system_call_fastpath+0x1c/0x21
[ 7920.323689]  [<ffffffff88f206e1>] ? system_call_after_swapgs+0xae/0x146
[ 7920.324352] INFO: task ln:2464 blocked for more than 120 seconds.
[ 7920.324955] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 7920.325711] ln              D ffff9bac21380fd0     0  2464  14485 0x00000080
[ 7920.326458] Call Trace:
[ 7920.326712]  [<ffffffff88a2a2f2>] ? path_lookupat+0x122/0x8b0
[ 7920.327296]  [<ffffffff88f14f49>] schedule_preempt_disabled+0x29/0x70
[ 7920.327932]  [<ffffffff88f12d07>] __mutex_lock_slowpath+0xc7/0x1d0
[ 7920.328532]  [<ffffffff88f120ef>] mutex_lock+0x1f/0x2f
[ 7920.329053]  [<ffffffff88a2abc5>] filename_create+0x85/0x180
[ 7920.329608]  [<ffffffff889f7a35>] ? kmem_cache_alloc+0x35/0x1f0
[ 7920.330201]  [<ffffffff88a2cd4f>] ? getname_flags+0x4f/0x1a0
[ 7920.330775]  [<ffffffff88a2cdc4>] ? getname_flags+0xc4/0x1a0
[ 7920.331338]  [<ffffffff88a2d071>] user_path_create+0x41/0x60
[ 7920.331898]  [<ffffffff88a2e5e8>] SyS_symlinkat+0x58/0x110
[ 7920.332441]  [<ffffffff88f206d5>] ? system_call_after_swapgs+0xa2/0x146
[ 7920.333091]  [<ffffffff88f206e1>] ? system_call_after_swapgs+0xae/0x146
[ 7920.333755]  [<ffffffff88a2e6b6>] SyS_symlink+0x16/0x20
[ 7920.334279]  [<ffffffff88f20795>] system_call_fastpath+0x1c/0x21
[ 7920.334869]  [<ffffffff88f206e1>] ? system_call_after_swapgs+0xae/0x146
[ 7920.335512] INFO: task ls:2582 blocked for more than 120 seconds.
[ 7920.336114] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 7920.336890] ls              D ffff9bac1d329fa0     0  2582  14455 0x00000080
[ 7920.337619] Call Trace:
[ 7920.337894]  [<ffffffff88f14f49>] schedule_preempt_disabled+0x29/0x70
[ 7920.338519]  [<ffffffff88f12d07>] __mutex_lock_slowpath+0xc7/0x1d0
[ 7920.339135]  [<ffffffff88f120ef>] mutex_lock+0x1f/0x2f
[ 7920.339640]  [<ffffffff88a2b37f>] do_last+0x28f/0x12c0
[ 7920.340160]  [<ffffffff88a2c487>] path_openat+0xd7/0x640
[ 7920.340700]  [<ffffffffc0e4e929>] ? ll_file_data_put+0x89/0x180 [lustre]
[ 7920.341373]  [<ffffffff88a2e01d>] do_filp_open+0x4d/0xb0
[ 7920.341902]  [<ffffffff88a3b4c4>] ? __alloc_fd+0xc4/0x170
[ 7920.342438]  [<ffffffff88a1a327>] do_sys_open+0x137/0x240
[ 7920.342981]  [<ffffffff88f206d5>] ? system_call_after_swapgs+0xa2/0x146
[ 7920.343618]  [<ffffffff88a1a464>] SyS_openat+0x14/0x20
[ 7920.344142]  [<ffffffff88f20795>] system_call_fastpath+0x1c/0x21
[ 7920.344757]  [<ffffffff88f206e1>] ? system_call_after_swapgs+0xae/0x146
[ 7920.345402] INFO: task ls:2583 blocked for more than 120 seconds.
[ 7920.346003] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 7920.346774] ls              D ffff9bac1d32dee0     0  2583  14455 0x00000080
[ 7920.347511] Call Trace:
[ 7920.347787]  [<ffffffff88f14f49>] schedule_preempt_disabled+0x29/0x70
[ 7920.348417]  [<ffffffff88f12d07>] __mutex_lock_slowpath+0xc7/0x1d0
[ 7920.349033]  [<ffffffff88f120ef>] mutex_lock+0x1f/0x2f
[ 7920.349539]  [<ffffffff88a2b37f>] do_last+0x28f/0x12c0
[ 7920.350059]  [<ffffffff88a2c487>] path_openat+0xd7/0x640
[ 7920.350587]  [<ffffffffc0e4e929>] ? ll_file_data_put+0x89/0x180 [lustre]
[ 7920.351271]  [<ffffffff88a2e01d>] do_filp_open+0x4d/0xb0
[ 7920.351805]  [<ffffffff88a3b4c4>] ? __alloc_fd+0xc4/0x170
[ 7920.352339]  [<ffffffff88a1a327>] do_sys_open+0x137/0x240
[ 7920.352881]  [<ffffffff88f206d5>] ? system_call_after_swapgs+0xa2/0x146
[ 7920.353524]  [<ffffffff88a1a464>] SyS_openat+0x14/0x20
[ 7920.354050]  [<ffffffff88f20795>] system_call_fastpath+0x1c/0x21
[ 7920.354631]  [<ffffffff88f206e1>] ? system_call_after_swapgs+0xae/0x146
[ 7920.355301] INFO: task ls:2584 blocked for more than 120 seconds.
[ 7920.355903] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 7920.356685] ls              D ffff9bac1e540fd0     0  2584  14455 0x00000080
[ 7920.357413] Call Trace:
[ 7920.357688]  [<ffffffff88f14f49>] schedule_preempt_disabled+0x29/0x70
[ 7920.358315]  [<ffffffff88f12d07>] __mutex_lock_slowpath+0xc7/0x1d0
[ 7920.358927]  [<ffffffff88f120ef>] mutex_lock+0x1f/0x2f
[ 7920.359432]  [<ffffffff88a2b37f>] do_last+0x28f/0x12c0
[ 7920.359955]  [<ffffffff88a2c487>] path_openat+0xd7/0x640
[ 7920.360484]  [<ffffffffc0e4e929>] ? ll_file_data_put+0x89/0x180 [lustre]
[ 7920.361166]  [<ffffffff88a2e01d>] do_filp_open+0x4d/0xb0
[ 7920.361698]  [<ffffffff88a3b4c4>] ? __alloc_fd+0xc4/0x170
[ 7920.362231]  [<ffffffff88a1a327>] do_sys_open+0x137/0x240
[ 7920.362768]  [<ffffffff88f206d5>] ? system_call_after_swapgs+0xa2/0x146
[ 7920.363411]  [<ffffffff88a1a464>] SyS_openat+0x14/0x20
[ 7920.363931]  [<ffffffff88f20795>] system_call_fastpath+0x1c/0x21
[ 7920.364512]  [<ffffffff88f206e1>] ? system_call_after_swapgs+0xae/0x146
[ 8022.975327] LustreError: 4302:0:(file.c:4366:ll_inode_revalidate_fini()) lustre: revalidate FID [0x200007161:0x1:0x0] error: rc = -4
[ 8022.975389] LustreError: 4303:0:(file.c:4366:ll_inode_revalidate_fini()) lustre: revalidate FID [0x200007161:0x1:0x0] error: rc = -4
[ 8022.975622] LustreError: 4305:0:(file.c:4366:ll_inode_revalidate_fini()) lustre: revalidate FID [0x200007161:0x1:0x0] error: rc = -4

Description continued in the first comment due to JIRA description line limits.



 Comments   
Comment by James Nunez (Inactive) [ 10/Sep/18 ]

From the MDS (vm12) console log, we see two different call traces

[ 7639.633530] Lustre: DEBUG MARKER: == racer test 1: racer on clients: trevis-10vm10,trevis-10vm9.trevis.whamcloud.com DURATION=300 ====== 12:19:23 (1536495563)
[ 7665.045291] LNetError: 14176:0:(lib-msg.c:794:lnet_is_health_check()) Msg is in inconsistent state, don't perform health checking (-125, 0)
[ 7665.095106] LNetError: 14176:0:(lib-msg.c:794:lnet_is_health_check()) Msg is in inconsistent state, don't perform health checking (-125, 0)
[ 7672.008894] Lustre: lustre-MDT0000: Client 5302dedd-995b-ce8d-7b96-eb4bc47e114c (at 10.9.4.116@tcp) reconnecting
[ 7672.056678] Lustre: lustre-MDT0000: Client 9f0e7201-adaf-34dc-28ac-c83a77146718 (at 10.9.4.116@tcp) reconnecting
[ 7714.299107] LNet: Service thread pid 799 was inactive for 40.12s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
[ 7714.300812] Pid: 799, comm: mdt00_029 3.10.0-862.9.1.el7_lustre.x86_64 #1 SMP Thu Aug 30 12:08:10 UTC 2018
[ 7714.301760] Call Trace:
[ 7714.302043]  [<ffffffffc0d49f61>] ldlm_completion_ast+0x5b1/0x920 [ptlrpc]
[ 7714.302943]  [<ffffffffc0d4ad83>] ldlm_cli_enqueue_local+0x233/0x860 [ptlrpc]
[ 7714.303794]  [<ffffffffc11a5c7e>] mdt_dom_discard_data+0xfe/0x130 [mdt]
[ 7714.304710]  [<ffffffffc118355e>] mdt_reint_unlink+0xc2e/0x1490 [mdt]
[ 7714.305408]  [<ffffffffc1187293>] mdt_reint_rec+0x83/0x210 [mdt]
[ 7714.306092]  [<ffffffffc11661e2>] mdt_reint_internal+0x6b2/0xa80 [mdt]
[ 7714.306798]  [<ffffffffc1171447>] mdt_reint+0x67/0x140 [mdt]
[ 7714.307401]  [<ffffffffc0de831a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
[ 7714.308224]  [<ffffffffc0d8b48b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
[ 7714.309155]  [<ffffffffc0d8ecc4>] ptlrpc_main+0xb14/0x1fb0 [ptlrpc]
[ 7714.309868]  [<ffffffff842bb621>] kthread+0xd1/0xe0
[ 7714.310415]  [<ffffffff849205f7>] ret_from_fork_nospec_end+0x0/0x39
[ 7714.311127]  [<ffffffffffffffff>] 0xffffffffffffffff
[ 7714.311701] LustreError: dumping log to /tmp/lustre-log.1536495638.799
[ 7714.635443] LNet: Service thread pid 789 was inactive for 40.15s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
[ 7714.637166] Pid: 789, comm: mdt00_019 3.10.0-862.9.1.el7_lustre.x86_64 #1 SMP Thu Aug 30 12:08:10 UTC 2018
[ 7714.638114] Call Trace:
[ 7714.638375]  [<ffffffffc0d49f61>] ldlm_completion_ast+0x5b1/0x920 [ptlrpc]
[ 7714.639203]  [<ffffffffc0d4ad83>] ldlm_cli_enqueue_local+0x233/0x860 [ptlrpc]
[ 7714.640025]  [<ffffffffc116a3d7>] mdt_object_local_lock+0x4e7/0xb20 [mdt]
[ 7714.640791]  [<ffffffffc116aa80>] mdt_object_lock_internal+0x70/0x330 [mdt]
[ 7714.641511]  [<ffffffffc116baea>] mdt_getattr_name_lock+0x83a/0x1c00 [mdt]
[ 7714.642266]  [<ffffffffc1173ae5>] mdt_intent_getattr+0x2b5/0x480 [mdt]
[ 7714.642972]  [<ffffffffc11709c8>] mdt_intent_policy+0x2f8/0xd10 [mdt]
[ 7714.643703]  [<ffffffffc0d30e9e>] ldlm_lock_enqueue+0x34e/0xa50 [ptlrpc]
[ 7714.644423]  [<ffffffffc0d59503>] ldlm_handle_enqueue0+0x903/0x1520 [ptlrpc]
[ 7714.645221]  [<ffffffffc0ddf9d2>] tgt_enqueue+0x62/0x210 [ptlrpc]
[ 7714.645923]  [<ffffffffc0de831a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
[ 7714.646752]  [<ffffffffc0d8b48b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
[ 7714.647555]  [<ffffffffc0d8ecc4>] ptlrpc_main+0xb14/0x1fb0 [ptlrpc]
[ 7714.648273]  [<ffffffff842bb621>] kthread+0xd1/0xe0
[ 7714.648818]  [<ffffffff849205f7>] ret_from_fork_nospec_end+0x0/0x39
[ 7714.649469]  [<ffffffffffffffff>] 0xffffffffffffffff
[ 7714.650057] LNet: Service thread pid 794 was inactive for 40.16s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
[ 7714.651656] Pid: 794, comm: mdt00_024 3.10.0-862.9.1.el7_lustre.x86_64 #1 SMP Thu Aug 30 12:08:10 UTC 2018
[ 7714.652669] Call Trace:
[ 7714.652933]  [<ffffffffc0d49f61>] ldlm_completion_ast+0x5b1/0x920 [ptlrpc]
[ 7714.653706]  [<ffffffffc0d4ad83>] ldlm_cli_enqueue_local+0x233/0x860 [ptlrpc]
[ 7714.654462]  [<ffffffffc116a3d7>] mdt_object_local_lock+0x4e7/0xb20 [mdt]
[ 7714.655245]  [<ffffffffc116aa80>] mdt_object_lock_internal+0x70/0x330 [mdt]
[ 7714.656002]  [<ffffffffc116baea>] mdt_getattr_name_lock+0x83a/0x1c00 [mdt]
[ 7714.656771]  [<ffffffffc1173ae5>] mdt_intent_getattr+0x2b5/0x480 [mdt]
[ 7714.657451]  [<ffffffffc11709c8>] mdt_intent_policy+0x2f8/0xd10 [mdt]
[ 7714.658173]  [<ffffffffc0d30e9e>] ldlm_lock_enqueue+0x34e/0xa50 [ptlrpc]
[ 7714.658913]  [<ffffffffc0d59503>] ldlm_handle_enqueue0+0x903/0x1520 [ptlrpc]
[ 7714.659703]  [<ffffffffc0ddf9d2>] tgt_enqueue+0x62/0x210 [ptlrpc]
[ 7714.660370]  [<ffffffffc0de831a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
[ 7714.661152]  [<ffffffffc0d8b48b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
[ 7714.661989]  [<ffffffffc0d8ecc4>] ptlrpc_main+0xb14/0x1fb0 [ptlrpc]
[ 7714.662754]  [<ffffffff842bb621>] kthread+0xd1/0xe0
[ 7714.663276]  [<ffffffff849205f7>] ret_from_fork_nospec_end+0x0/0x39
[ 7714.663956]  [<ffffffffffffffff>] 0xffffffffffffffff
[ 7714.664534] LNet: Service thread pid 23174 was inactive for 40.17s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
[ 7714.666149] Pid: 23174, comm: mdt00_000 3.10.0-862.9.1.el7_lustre.x86_64 #1 SMP Thu Aug 30 12:08:10 UTC 2018
[ 7714.667158] Call Trace:
[ 7714.667418]  [<ffffffffc0d49f61>] ldlm_completion_ast+0x5b1/0x920 [ptlrpc]
[ 7714.668145]  [<ffffffffc0d4ad83>] ldlm_cli_enqueue_local+0x233/0x860 [ptlrpc]
[ 7714.668924]  [<ffffffffc116a3d7>] mdt_object_local_lock+0x4e7/0xb20 [mdt]
[ 7714.669715]  [<ffffffffc116aa80>] mdt_object_lock_internal+0x70/0x330 [mdt]
[ 7714.670451]  [<ffffffffc116baea>] mdt_getattr_name_lock+0x83a/0x1c00 [mdt]
[ 7714.671210]  [<ffffffffc1173ae5>] mdt_intent_getattr+0x2b5/0x480 [mdt]
[ 7714.671919]  [<ffffffffc11709c8>] mdt_intent_policy+0x2f8/0xd10 [mdt]
[ 7714.672646]  [<ffffffffc0d30e9e>] ldlm_lock_enqueue+0x34e/0xa50 [ptlrpc]
[ 7714.673361]  [<ffffffffc0d59503>] ldlm_handle_enqueue0+0x903/0x1520 [ptlrpc]
[ 7714.674149]  [<ffffffffc0ddf9d2>] tgt_enqueue+0x62/0x210 [ptlrpc]
[ 7714.674837]  [<ffffffffc0de831a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
[ 7714.675567]  [<ffffffffc0d8b48b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
[ 7714.676400]  [<ffffffffc0d8ecc4>] ptlrpc_main+0xb14/0x1fb0 [ptlrpc]
[ 7714.677113]  [<ffffffff842bb621>] kthread+0xd1/0xe0
[ 7714.677694]  [<ffffffff849205f7>] ret_from_fork_nospec_end+0x0/0x39
[ 7714.678339]  [<ffffffffffffffff>] 0xffffffffffffffff
[ 7714.678912] LNet: Service thread pid 780 was inactive for 40.19s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
[ 7714.680500] Pid: 780, comm: mdt00_010 3.10.0-862.9.1.el7_lustre.x86_64 #1 SMP Thu Aug 30 12:08:10 UTC 2018
[ 7714.681504] Call Trace:
[ 7714.681757]  [<ffffffffc0d49f61>] ldlm_completion_ast+0x5b1/0x920 [ptlrpc]
[ 7714.682485]  [<ffffffffc0d4ad83>] ldlm_cli_enqueue_local+0x233/0x860 [ptlrpc]
[ 7714.683292]  [<ffffffffc116a3d7>] mdt_object_local_lock+0x4e7/0xb20 [mdt]
[ 7714.684030]  [<ffffffffc116aa80>] mdt_object_lock_internal+0x70/0x330 [mdt]
[ 7714.684838]  [<ffffffffc116baea>] mdt_getattr_name_lock+0x83a/0x1c00 [mdt]
[ 7714.685551]  [<ffffffffc1173ae5>] mdt_intent_getattr+0x2b5/0x480 [mdt]
[ 7714.686284]  [<ffffffffc11709c8>] mdt_intent_policy+0x2f8/0xd10 [mdt]
[ 7714.686982]  [<ffffffffc0d30e9e>] ldlm_lock_enqueue+0x34e/0xa50 [ptlrpc]
[ 7714.687737]  [<ffffffffc0d59503>] ldlm_handle_enqueue0+0x903/0x1520 [ptlrpc]
[ 7714.688484]  [<ffffffffc0ddf9d2>] tgt_enqueue+0x62/0x210 [ptlrpc]
[ 7714.689195]  [<ffffffffc0de831a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
[ 7714.689959]  [<ffffffffc0d8b48b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
[ 7714.690822]  [<ffffffffc0d8ecc4>] ptlrpc_main+0xb14/0x1fb0 [ptlrpc]
[ 7714.691494]  [<ffffffff842bb621>] kthread+0xd1/0xe0
[ 7714.692044]  [<ffffffff849205f7>] ret_from_fork_nospec_end+0x0/0x39
[ 7714.692782]  [<ffffffffffffffff>] 0xffffffffffffffff
[ 7714.693316] LNet: Service thread pid 777 was inactive for 40.04s. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one.
[ 7714.694582] LNet: Service thread pid 790 was inactive for 40.05s. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one.
…
[ 7714.719633] LNet: Service thread pid 26048 was inactive for 40.54s. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one.
[ 7974.175116] LustreError: 26048:0:(ldlm_request.c:129:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1536495598, 300s ago); not entering recovery in server code, just going back to sleep ns: mdt-lustre-MDT0000_UUID lock: ffff8840df351d40/0x106c1d2b325048e0 lrc: 3/1,0 mode: --/PR res: [0x200007164:0x28a:0x0].0x0 bits 0x13/0x8 rrc: 7 type: IBT flags: 0x40210000000000 nid: local remote: 0x0 expref: -99 pid: 26048 timeout: 0 lvb_type: 0
[ 7974.177105] LustreError: 799:0:(ldlm_request.c:129:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1536495598, 300s ago); not entering recovery in server code, just going back to sleep ns: mdt-lustre-MDT0000_UUID lock: ffff8840d1882000/0x106c1d2b3250492d lrc: 3/0,1 mode: --/PW res: [0x200007164:0x28a:0x0].0x0 bits 0x40/0x0 rrc: 8 type: IBT flags: 0x40010080000000 nid: local remote: 0x0 expref: -99 pid: 799 timeout: 0 lvb_type: 0
…
[ 8817.787115] Lustre: 852:0:(service.c:1365:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (4/-122), not sending early reply
[ 8817.787115]   req@ffff8840e518cf00 x1611124532145872/t0(0) o101->5302dedd-995b-ce8d-7b96-eb4bc47e114c@10.9.4.116@tcp:21/0 lens 696/3384 e 0 to 0 dl 1536496746 ref 2 fl Interpret:/0/0 rc 0/0
[ 8963.195139] Pid: 26045, comm: mdt00_004 3.10.0-862.9.1.el7_lustre.x86_64 #1 SMP Thu Aug 30 12:08:10 UTC 2018
[ 8963.196167] Call Trace:
[ 8963.196428]  [<ffffffffc0d49fed>] ldlm_completion_ast+0x63d/0x920 [ptlrpc]
[ 8963.197171]  [<ffffffffc0d4ad83>] ldlm_cli_enqueue_local+0x233/0x860 [ptlrpc]
[ 8963.197908]  [<ffffffffc116a3d7>] mdt_object_local_lock+0x4e7/0xb20 [mdt]
[ 8963.198606]  [<ffffffffc116aa80>] mdt_object_lock_internal+0x70/0x330 [mdt]
[ 8963.199325]  [<ffffffffc116baea>] mdt_getattr_name_lock+0x83a/0x1c00 [mdt]
[ 8963.200048]  [<ffffffffc1173ae5>] mdt_intent_getattr+0x2b5/0x480 [mdt]
[ 8963.200908]  [<ffffffffc11709c8>] mdt_intent_policy+0x2f8/0xd10 [mdt]
[ 8963.201572]  [<ffffffffc0d30e9e>] ldlm_lock_enqueue+0x34e/0xa50 [ptlrpc]
[ 8963.202422]  [<ffffffffc0d59503>] ldlm_handle_enqueue0+0x903/0x1520 [ptlrpc]
[ 8963.203254]  [<ffffffffc0ddf9d2>] tgt_enqueue+0x62/0x210 [ptlrpc]
[ 8963.203961]  [<ffffffffc0de831a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
[ 8963.204759]  [<ffffffffc0d8b48b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
[ 8963.205562]  [<ffffffffc0d8ecc4>] ptlrpc_main+0xb14/0x1fb0 [ptlrpc]
[ 8963.206288]  [<ffffffff842bb621>] kthread+0xd1/0xe0
[ 8963.206849]  [<ffffffff849205f7>] ret_from_fork_nospec_end+0x0/0x39
[ 8963.207489]  [<ffffffffffffffff>] 0xffffffffffffffff
[ 8963.208090] LustreError: dumping log to /tmp/lustre-log.1536496887.26045
[ 9027.173147] Lustre: 890:0:(service.c:1365:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/-150), not sending early reply
[ 9027.173147]   req@ffff8840fbf86100 x1611131923340304/t0(0) o101->0e5f252a-cf6c-a3fd-2317-8699dfe62e19@10.9.4.115@tcp:231/0 lens 1344/3528 e 0 to 0 dl 1536496956 ref 2 fl Interpret:/0/0 rc 0/0
[ 9027.176014] Lustre: 890:0:(service.c:1365:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/-150), not sending early reply
[ 9027.176014]   req@ffff8840fbf87600 x1611131923339728/t0(0) o101->0e5f252a-cf6c-a3fd-2317-8699dfe62e19@10.9.4.115@tcp:231/0 lens 696/3384 e 0 to 0 dl 1536496956 ref 2 fl Interpret:/0/0 rc 0/0
[ 9479.291115] Pid: 851, comm: mdt00_033 3.10.0-862.9.1.el7_lustre.x86_64 #1 SMP Thu Aug 30 12:08:10 UTC 2018
[ 9479.292156] Call Trace:
[ 9479.292409]  [<ffffffffc0d49fed>] ldlm_completion_ast+0x63d/0x920 [ptlrpc]
[ 9479.293140]  [<ffffffffc0d4ad83>] ldlm_cli_enqueue_local+0x233/0x860 [ptlrpc]
[ 9479.293865]  [<ffffffffc116a3d7>] mdt_object_local_lock+0x4e7/0xb20 [mdt]
[ 9479.294551]  [<ffffffffc116aa80>] mdt_object_lock_internal+0x70/0x330 [mdt]
[ 9479.295253]  [<ffffffffc116baea>] mdt_getattr_name_lock+0x83a/0x1c00 [mdt]
[ 9479.295970]  [<ffffffffc1173ae5>] mdt_intent_getattr+0x2b5/0x480 [mdt]
[ 9479.296757]  [<ffffffffc11709c8>] mdt_intent_policy+0x2f8/0xd10 [mdt]
[ 9479.297414]  [<ffffffffc0d30e9e>] ldlm_lock_enqueue+0x34e/0xa50 [ptlrpc]
[ 9479.298105]  [<ffffffffc0d59503>] ldlm_handle_enqueue0+0x903/0x1520 [ptlrpc]
[ 9479.298819]  [<ffffffffc0ddf9d2>] tgt_enqueue+0x62/0x210 [ptlrpc]
[ 9479.299462]  [<ffffffffc0de831a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
[ 9479.300176]  [<ffffffffc0d8b48b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
[ 9479.301195]  [<ffffffffc0d8ecc4>] ptlrpc_main+0xb14/0x1fb0 [ptlrpc]
[ 9479.301890]  [<ffffffff842bb621>] kthread+0xd1/0xe0
[ 9479.302397]  [<ffffffff849205f7>] ret_from_fork_nospec_end+0x0/0x39
[ 9479.303067]  [<ffffffffffffffff>] 0xffffffffffffffff
…

There are several other test sessions that time out with the same call traces.
https://testing.whamcloud.com/test_sets/5dd05488-b401-11e8-b86b-52540065bddc (the call stack with ‘mdt_dom_discard_data’ is not hung)
https://testing.whamcloud.com/test_sets/abf52952-b2a5-11e8-b86b-52540065bddc

Comment by Peter Jones [ 11/Sep/18 ]

Lai

Could you please advise?

Thanks

Peter

Comment by Mikhail Pershin [ 31/Oct/18 ]

I am taking this ticket; it looks related to DOM data discard, and Oleg has reported similar issues as well. After several local runs and investigations I have found that the issue is related to the mdt_dom_discard_data() function. Although it runs after all local objects are unlocked, it is still in the middle of unlink RPC processing, so no reply is sent for the unlink RPC until mdt_dom_discard_data() finishes. The problem is that a PW lock is held on the parent directory during that time, and that lock will not be canceled until the REP-ACK is received or a local commit happens, so all other operations in the same directory wait for the DOM discard lock completion, which in turn may wait for a client lock to be canceled. Meanwhile, we don't need to wait here at all; the only thing the discard should do is send the blocking AST (bl_ast) to the client. That can be fixed by avoiding the wait in the completion callback for that lock or, per discussion with Vitaly, with the LDLM_FL_CANCEL_ON_BLOCK flag. The discard lock will issue a blocking AST to the client and cancel the server lock immediately without waiting for the client response.
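To make the proposed mechanism concrete, here is a minimal illustrative sketch of the idea, not the actual patch (see https://review.whamcloud.com/33522 for the real change): enqueue the server-side PW lock on the object's DOM inodebit with LDLM_FL_CANCEL_ON_BLOCK set, so conflicting client locks receive blocking ASTs but the service thread does not sleep in ldlm_completion_ast() waiting for their cancels. The helper name and the enqueue wrapper below are assumptions for illustration only.

/*
 * Illustrative sketch only (LU-11359): the function name and the
 * enqueue wrapper are hypothetical; the real change is in
 * mdt_dom_discard_data().
 */
static int dom_discard_enqueue_nowait(struct ldlm_namespace *ns,
                                      const struct ldlm_res_id *res_id,
                                      struct lustre_handle *lockh)
{
        union ldlm_policy_data policy = {
                /* only the DoM inodebit needs to be conflicted out */
                .l_inodebits = { .bits = MDS_INODELOCK_DOM },
        };
        /*
         * CANCEL_ON_BLOCK: blocking ASTs are still sent to clients
         * caching DoM data, but the server-side lock is cancelled
         * right away instead of waiting for the clients' cancels.
         */
        __u64 flags = LDLM_FL_CANCEL_ON_BLOCK;

        /* hypothetical wrapper around the server-side lock enqueue */
        return server_side_enqueue(ns, res_id, LDLM_IBITS, &policy,
                                   LCK_PW, &flags, lockh);
}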

Comment by Gerrit Updater [ 31/Oct/18 ]

Mike Pershin (mpershin@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/33522
Subject: LU-11359 mdt: fix mdt_dom_discard_data() timeouts
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 0a947c24473666c9f6511ad786f147f8fc90c1ec

Comment by Gerrit Updater [ 20/Jan/19 ]

Mike Pershin (mpershin@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34071
Subject: LU-11359 mdt: fix mdt_dom_discard_data() timeouts
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: e5b2a1998bb845a02ef788eda526221898bfec96

Comment by Sarah Liu [ 05/Feb/19 ]

Seeing a similar error on soak, which is running b2_10-ib build #98.

on MDS 0, ldiskfs

[29779.010372] Lustre: 19293:0:(client.c:2114:ptlrpc_expire_one_request()) Skipped 1086809 previous similar messages
[29876.244787] LNet: Service thread pid 18585 was inactive for 200.55s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
[29876.263748] Pid: 18585, comm: mdt00_011 3.10.0-957.el7_lustre.x86_64 #1 SMP Mon Jan 7 20:06:41 UTC 2019
[29876.274243] Call Trace:
[29876.276986]  [<ffffffffc0f08a11>] ldlm_completion_ast+0x5b1/0x920 [ptlrpc]
[29876.284750]  [<ffffffffc0f09b53>] ldlm_cli_enqueue_local+0x233/0x860 [ptlrpc]
[29876.292803]  [<ffffffffc140ba1c>] mdt_reint_rename_or_migrate.isra.39+0x67c/0x860 [mdt]
[29876.301769]  [<ffffffffc140bc33>] mdt_reint_rename+0x13/0x20 [mdt]
[29876.308686]  [<ffffffffc140fc73>] mdt_reint_rec+0x83/0x210 [mdt]
[29876.315425]  [<ffffffffc13f118b>] mdt_reint_internal+0x5fb/0x9c0 [mdt]
[29876.322728]  [<ffffffffc13fcf77>] mdt_reint+0x67/0x140 [mdt]
[29876.329068]  [<ffffffffc0fa12aa>] tgt_request_handle+0x92a/0x1370 [ptlrpc]
[29876.336793]  [<ffffffffc0f49d5b>] ptlrpc_server_handle_request+0x23b/0xaa0 [ptlrpc]
[29876.345386]  [<ffffffffc0f4d4a2>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
[29876.352425]  [<ffffffff820c1c31>] kthread+0xd1/0xe0
[29876.357875]  [<ffffffff82774c37>] ret_from_fork_nospec_end+0x0/0x39
[29876.364884]  [<ffffffffffffffff>] 0xffffffffffffffff
[29876.370470] LustreError: dumping log to /tmp/lustre-log.1549315990.18585
[29877.464680] LNet: Service thread pid 12550 was inactive for 201.09s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
[29877.483630] Pid: 12550, comm: mdt00_004 3.10.0-957.el7_lustre.x86_64 #1 SMP Mon Jan 7 20:06:41 UTC 2019
[29877.494147] Call Trace:
[29877.496902]  [<ffffffffc0f08a11>] ldlm_completion_ast+0x5b1/0x920 [ptlrpc]
[29877.504696]  [<ffffffffc0f09b53>] ldlm_cli_enqueue_local+0x233/0x860 [ptlrpc]
[29877.512742]  [<ffffffffc140ba1c>] mdt_reint_rename_or_migrate.isra.39+0x67c/0x860 [mdt]
[29877.521727]  [<ffffffffc140bc33>] mdt_reint_rename+0x13/0x20 [mdt]
[29877.528646]  [<ffffffffc140fc73>] mdt_reint_rec+0x83/0x210 [mdt]
[29877.535379]  [<ffffffffc13f118b>] mdt_reint_internal+0x5fb/0x9c0 [mdt]
[29877.542705]  [<ffffffffc13fcf77>] mdt_reint+0x67/0x140 [mdt]
[29877.549066]  [<ffffffffc0fa12aa>] tgt_request_handle+0x92a/0x1370 [ptlrpc]
[29877.556816]  [<ffffffffc0f49d5b>] ptlrpc_server_handle_request+0x23b/0xaa0 [ptlrpc]
[29877.565461]  [<ffffffffc0f4d4a2>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
[29877.572517]  [<ffffffff820c1c31>] kthread+0xd1/0xe0
[29877.577977]  [<ffffffff82774c37>] ret_from_fork_nospec_end+0x0/0x39
[29877.584994]  [<ffffffffffffffff>] 0xffffffffffffffff
[29898.915502] Lustre: MGS: haven't heard from client 892c8e1b-d3a7-b366-d6fc-9c8c5cb531f5 (at 192.168.1.110@o2ib) in 227 seconds. I think it's dead, and I am evicting it. exp ffff96f12caae800, cur 1549316013 expire 1549315863 last 1549315786
[29926.439830] Lustre: 12363:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1549316014/real 0]  req@ffff96ecd3f80f00 x1624544455753472/t0(0) o38->soaked-MDT0002-osp-MDT0000@192.168.1.110@o2ib:24/4 lens 520/544 e 0 to 1 dl 1549316040 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
[29926.470993] Lustre: 12363:0:(client.c:2114:ptlrpc_expire_one_request()) Skipped 1867318 previous similar messages
[29931.342923] LNet: 12352:0:(o2iblnd_cb.c:3192:kiblnd_check_conns()) Timed out tx for 192.168.1.110@o2ib: 1 seconds
[29944.343184] LNet: 12352:0:(o2iblnd_cb.c:3192:kiblnd_check_conns()) Timed out tx for 192.168.1.110@o2ib: 12 seconds
[29956.343414] LNet: 12352:0:(o2iblnd_cb.c:3192:kiblnd_check_conns()) Timed out tx for 192.168.1.110@o2ib: 6 seconds
[29956.354882] LNet: 12352:0:(o2iblnd_cb.c:3192:kiblnd_check_conns()) Skipped 2 previous similar messages
[29969.343674] LNet: 12352:0:(o2iblnd_cb.c:3192:kiblnd_check_conns()) Timed out tx for 192.168.1.110@o2ib: 4 seconds
[29969.355136] LNet: 12352:0:(o2iblnd_cb.c:3192:kiblnd_check_conns()) Skipped 2 previous similar messages
[29975.691774] LustreError: 18585:0:(ldlm_request.c:130:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1549315789, 300s ago); not entering recovery in server code, just going back to sleep ns: mdt-soaked-MDT0000_UUID lock: ffff96ece5a0a200/0xcacba66f98c78b0 lrc: 3/0,1 mode: --/EX res: [0x200000004:0x1:0x0].0x0 bits 0x2 rrc: 12 type: IBT flags: 0x40210400000020 nid: local remote: 0x0 expref: -99 pid: 18585 timeout: 0 lvb_type: 0
[29975.692225] Lustre: soaked-MDT0000: Received new LWP connection from 192.168.1.109@o2ib, removing former export from same NID
[29975.692228] Lustre: Skipped 1 previous similar message
[29975.692273] Lustre: soaked-MDT0000: Connection restored to 192.168.1.109@o2ib (at 192.168.1.109@o2ib)
[29975.764275] LustreError: dumping log to /tmp/lustre-log.1549316090.18585
[29976.374841] LustreError: 12550:0:(ldlm_request.c:130:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1549315790, 300s ago); not entering recovery in server code, just going back to sleep ns: mdt-soaked-MDT0000_UUID lock: ffff96ece3468a00/0xcacba66f98ca080 lrc: 3/0,1 mode: --/EX res: [0x200000004:0x1:0x0].0x0 bits 0x2 rrc: 12 type: IBT flags: 0x40210000000000 nid: local remote: 0x0 expref: -99 pid: 12550 timeout: 0 lvb_type: 0
[29982.343941] LNet: 12352:0:(o2iblnd_cb.c:3192:kiblnd_check_conns()) Timed out tx for 192.168.1.110@o2ib: 17 seconds
[29982.355508] LNet: 12352:0:(o2iblnd_cb.c:3192:kiblnd_check_conns()) Skipped 2 previous similar messages
[29994.344188] LNet: 12352:0:(o2iblnd_cb.c:3192:kiblnd_check_conns()) Timed out tx for 192.168.1.110@o2ib: 4 seconds
[29994.355649] LNet: 12352:0:(o2iblnd_cb.c:3192:kiblnd_check_conns()) Skipped 1 previous similar message
[30048.309887] Lustre: MGS: Connection restored to 192.168.1.110@o2ib (at 192.168.1.110@o2ib)
[30048.319166] Lustre: Skipped 1 previous similar message
[30048.677813] Lustre: soaked-MDT0000: Received new LWP connection from 192.168.1.110@o2ib, removing former export from same NID
[30048.690935] Lustre: Skipped 1 previous similar message
[30048.696694] Lustre: soaked-MDT0000: Client soaked-MDT0002-mdtlov_UUID (at 192.168.1.110@o2ib) refused connection, still busy with 26 references
[30050.367592] LustreError: 19293:0:(ldlm_lockd.c:697:ldlm_handle_ast_error()) ### client (nid 192.168.1.110@o2ib) returned error from completion AST (req@ffff96ecde662d00 x1624544455740352 status -107 rc -107), evict it ns: mdt-soaked-MDT0000_UUID lock: ffff96ece65b3600/0xcacba66f98c8e43 lrc: 3/0,0 mode: EX/EX res: [0x20000cf25:0xa64:0x0].0x0 bits 0x2 rrc: 3 type: IBT flags: 0x40000001000000 nid: 192.168.1.110@o2ib remote: 0x3aa7d720cf16e00 expref: 25 pid: 32934 timeout: 0 lvb_type: 0
[30050.367613] LustreError: 138-a: soaked-MDT0000: A client on nid 192.168.1.110@o2ib was evicted due to a lock completion callback time out: rc -107
[30050.430003] LustreError: 19293:0:(ldlm_lockd.c:697:ldlm_handle_ast_error()) Skipped 1 previous similar message
Comment by Patrick Farrell (Inactive) [ 15/Mar/19 ]

Attached is a log of the execution of mdt_dom_discard_data with this patch. The patch does not achieve the intended effect of getting the work into another context. We're waiting for the lock request to complete, rather than putting it on the waiting queue, so we're sleeping and executing the completion AST in the original thread context.

Comment by Mikhail Pershin [ 15/Mar/19 ]

Patrick, it is not quite so; there is no waiting for the lock request to complete. It only takes the lock in the same context and sends blocking ASTs, but doesn't wait for them to complete; that was the idea. The problem was in the generic ldlm_completion_ast(), which waits for all blocking locks to cancel in the execution context.

Comment by Patrick Farrell (Inactive) [ 15/Mar/19 ]

Mike,

Take a look at the log if you haven't.  It waits for them to complete.  I added a few more comments in the patch.

Comment by Mikhail Pershin [ 15/Mar/19 ]

Patrick, I don't see in the log you've sent that it 'waits'. It takes the lock in the same context, yes: it processes the lock policy, finds conflicting locks, sends blocking ASTs, and exits. With the generic ldlm_completion_ast() it would also 'wait' for all blocking locks to cancel. That is the main problem, because it would wait for other clients' locks to cancel in the current context, but with the patch it doesn't. Often the client cancels the blocking locks right away and the server gets the cancels before calling the completion AST; in that case everything finishes in a single context, but that is not a problem.

Comment by Patrick Farrell (Inactive) [ 15/Mar/19 ]

In case anyone is not following Gerrit, the details are complicated, but Mike is correct.  My comments can be disregarded. 
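For readers not following Gerrit, the distinction discussed above can be summarized as follows: the generic ldlm_completion_ast() puts the service thread to sleep until the lock is actually granted (i.e. until conflicting client locks have been cancelled), whereas the patched discard path uses a completion callback that returns as soon as the blocking ASTs have been sent. A hypothetical sketch of such a non-waiting callback is below; the function name is illustrative and not necessarily the one used in the patch.

/*
 * Hypothetical non-waiting completion callback for the DoM discard
 * lock. Unlike the generic ldlm_completion_ast(), it does not sleep
 * waiting for conflicting client locks to be cancelled, so the unlink
 * RPC can be replied to without blocking on the clients.
 */
static int dom_discard_completion_nowait(struct ldlm_lock *lock,
                                         __u64 flags, void *data)
{
        /* blocking ASTs were already queued during enqueue;
         * nothing to wait for here */
        return 0;
}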

Comment by Stephane Thiell [ 26/Apr/19 ]

This is a follow-up to LU-12037. We're using this patch and I wanted to provide a first update before the weekend. The patch above has simply changed our life. While we still have 3 unpatched clients on Sherlock (~1400-node cluster) that are running long jobs, I can already say that this has been a tremendous change, in a very good way! After patching about 1200 nodes (roughly 200 nodes remaining), we started to see a significant drop in the number of ldlm_completion_ast() call traces on the MDS (after at least one with mdt_dom_discard_data()). This morning, Robinhood was still blocked (perhaps I missed this part, but we couldn't run Robinhood, or at least not for long; the Lustre client there kept disconnecting/reconnecting from the MDTs), so this morning I restarted all 4 of Fir's MDTs and now the logs are so pristine clean I'm even wondering if Lustre is running. If someone is doing the same, I recommend: (1) patching servers, (2) patching clients, (3) restarting all MDTs to clear dead locks. Perhaps this should be added to the 2.12.1 changelog. Anyway, I'll report back next week to confirm that everything is alright in that respect. Thanks so much!

Comment by Peter Jones [ 26/Apr/19 ]

Thanks for the update Stephane - this is really good news!

Comment by Stephane Thiell [ 26/Apr/19 ]

Hmm. While we haven't seen new call traces with mdt_dom_discard_data(), we just had an event with a lot of ldlm_completion_ast() and Robinhood was stuck. The filesystem was still usable overall, but I noticed a few nodes that couldn't access a specific set of files. I restarted the Robinhood server and after some time the system recovered by itself. This may be related to LU-11358, so I'll attach MDS logs there.

Comment by Mikhail Pershin [ 27/Apr/19 ]

Stephane, this can also be a result of LU-11751.

Comment by Stephane Thiell [ 27/Apr/19 ]

Mike, ok thanks. I'll follow that one too. I saw a few more traces on the MDS last night, from time to time (see an example of one of these 'events' below). Not sure about the impact on users yet, but I assume some file access must be blocked, at least temporarily. We stopped Robinhood (which is not critical here), so I don't think it's coming from that. Our plan is not to touch anything unless we see a real user-facing problem, but do you think https://review.whamcloud.com/#/c/33077 would be worth trying if we ever see an impact on production?

Apr 27 06:59:36 fir-md1-s1 kernel: Lustre: fir-MDT0000: Client 79273348-f676-2905-42da-85a87e1ba2d5 (at 10.9.107.14@o2ib4) reconnecting
Apr 27 06:59:36 fir-md1-s1 kernel: Lustre: Skipped 299 previous similar messages
Apr 27 07:08:23 fir-md1-s1 kernel: Lustre: fir-MDT0000: Connection restored to  (at 10.9.107.14@o2ib4)
Apr 27 07:08:23 fir-md1-s1 kernel: Lustre: Skipped 299 previous similar messages
Apr 27 07:09:56 fir-md1-s1 kernel: Lustre: fir-MDT0000: Client 79273348-f676-2905-42da-85a87e1ba2d5 (at 10.9.107.14@o2ib4) reconnecting
Apr 27 07:09:56 fir-md1-s1 kernel: Lustre: Skipped 299 previous similar messages
Apr 27 07:18:43 fir-md1-s1 kernel: Lustre: fir-MDT0000: Connection restored to  (at 10.9.107.14@o2ib4)
Apr 27 07:18:43 fir-md1-s1 kernel: Lustre: Skipped 299 previous similar messages
Apr 27 07:20:16 fir-md1-s1 kernel: Lustre: fir-MDT0000: Client 79273348-f676-2905-42da-85a87e1ba2d5 (at 10.9.107.14@o2ib4) reconnecting
Apr 27 07:20:16 fir-md1-s1 kernel: Lustre: Skipped 299 previous similar messages
Apr 27 07:29:03 fir-md1-s1 kernel: Lustre: fir-MDT0000: Connection restored to 4a715ce4-44e5-9eb5-fd19-43a9ef3ba9b7 (at 10.9.107.10@o2ib4)
Apr 27 07:29:03 fir-md1-s1 kernel: Lustre: Skipped 299 previous similar messages
Apr 27 07:30:36 fir-md1-s1 kernel: Lustre: fir-MDT0000: Client 8937ba9c-f8a3-cff7-01bf-a35c8d69bba5 (at 10.9.107.10@o2ib4) reconnecting
Apr 27 07:30:36 fir-md1-s1 kernel: Lustre: Skipped 299 previous similar messages
Apr 27 07:39:23 fir-md1-s1 kernel: Lustre: fir-MDT0000: Connection restored to 4a715ce4-44e5-9eb5-fd19-43a9ef3ba9b7 (at 10.9.107.10@o2ib4)
Apr 27 07:39:23 fir-md1-s1 kernel: Lustre: Skipped 299 previous similar messages
Apr 27 07:40:37 fir-md1-s1 kernel: LNet: Service thread pid 114792 completed after 3541.44s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
Apr 27 07:40:37 fir-md1-s1 kernel: LNet: Skipped 1 previous similar message
Apr 27 07:40:56 fir-md1-s1 kernel: Lustre: fir-MDT0000: Client 8937ba9c-f8a3-cff7-01bf-a35c8d69bba5 (at 10.9.107.10@o2ib4) reconnecting
Apr 27 07:40:56 fir-md1-s1 kernel: Lustre: Skipped 299 previous similar messages
Apr 27 07:41:02 fir-md1-s1 kernel: Lustre: 105058:0:(service.c:1372:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/-5), not sending early reply
                                     req@ffff8b5d7a9ad100 x1631534730383072/t0(0) o101->5735cd86-3a30-362c-bc05-c634d3fa1859@10.9.107.11@o2ib4:7/0 lens 632/0 e 0 to 0 dl 1556376067 ref 2 fl Interpret:/0/ffffffff rc 0/-1
Apr 27 07:41:02 fir-md1-s1 kernel: Lustre: 105058:0:(service.c:1372:ptlrpc_at_send_early_reply()) Skipped 30 previous similar messages
Apr 27 07:42:07 fir-md1-s1 kernel: LustreError: 104389:0:(ldlm_request.c:129:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1556376037, 90s ago); not entering recovery in server code, just going back to sleep ns: mdt-fir-MDT0000_UUID lock: ffff8b4d410f0240/0x378007f7dcffbc77 lrc: 3/1,0 mode: --/PR r
Apr 27 07:42:07 fir-md1-s1 kernel: LustreError: 104389:0:(ldlm_request.c:129:ldlm_expired_completion_wait()) Skipped 2 previous similar messages
Apr 27 07:43:07 fir-md1-s1 kernel: LustreError: 20444:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 150s: evicting client at 10.9.104.19@o2ib4  ns: mdt-fir-MDT0000_UUID lock: ffff8b60e5ace780/0x378007f7ac0a6892 lrc: 3/0,0 mode: PW/PW res: [0x20001a221:0x311:0x0].0x0 bits 0x40/0x0 rrc
Apr 27 07:43:07 fir-md1-s1 kernel: LustreError: 104728:0:(ldlm_lockd.c:1357:ldlm_handle_enqueue0()) ### lock on destroyed export ffff8b5101ad2000 ns: mdt-fir-MDT0000_UUID lock: ffff8b5ededbe300/0x378007f7ac0a6bc5 lrc: 5/0,0 mode: PW/PW res: [0x20001a221:0x311:0x0].0x0 bits 0x40/0x0 rrc: 50 type: IBT flags: 0x5020040
Apr 27 07:43:07 fir-md1-s1 kernel: LNet: Service thread pid 104728 completed after 3691.00s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
Apr 27 07:43:07 fir-md1-s1 kernel: LNet: Service thread pid 105239 completed after 3691.00s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
Apr 27 07:43:32 fir-md1-s1 kernel: Lustre: 105124:0:(service.c:1372:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/-5), not sending early reply
                                     req@ffff8b6231114500 x1631546041738256/t0(0) o101->5c33d5d2-2621-e59d-0e36-98a7cb2caa9e@10.9.104.17@o2ib4:7/0 lens 568/0 e 0 to 0 dl 1556376217 ref 2 fl Interpret:/0/ffffffff rc 0/-1
Apr 27 07:43:32 fir-md1-s1 kernel: Lustre: 105124:0:(service.c:1372:ptlrpc_at_send_early_reply()) Skipped 1 previous similar message
Apr 27 07:43:37 fir-md1-s1 kernel: LustreError: 20444:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 30s: evicting client at 10.9.104.17@o2ib4  ns: mdt-fir-MDT0000_UUID lock: ffff8b57efe82880/0x378007f7ac0a6d3f lrc: 3/0,0 mode: PW/PW res: [0x20001a221:0x311:0x0].0x0 bits 0x40/0x0 rrc:
Apr 27 07:43:37 fir-md1-s1 kernel: LNet: Service thread pid 104337 completed after 3720.97s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
Apr 27 07:43:37 fir-md1-s1 kernel: LustreError: 104691:0:(ldlm_lockd.c:1357:ldlm_handle_enqueue0()) ### lock on destroyed export ffff8b5101ad2000 ns: mdt-fir-MDT0000_UUID lock: ffff8b567d303a80/0x378007f7ac0a6e88 lrc: 3/0,0 mode: PR/PR res: [0x20001a221:0x311:0x0].0x0 bits 0x1b/0x0 rrc: 44 type: IBT flags: 0x5020040
Apr 27 07:43:37 fir-md1-s1 kernel: Lustre: 104691:0:(service.c:2165:ptlrpc_server_handle_request()) @@@ Request took longer than estimated (3719:2s); client may timeout.  req@ffff8b5e67f11200 x1631585913250576/t0(0) o101->8a37f7b1-3efc-30e9-f8d1-739df6680357@10.9.104.19@o2ib4:6/0 lens 584/1168 e 0 to 0 dl 1556376215
Apr 27 07:43:37 fir-md1-s1 kernel: Lustre: 104691:0:(service.c:2165:ptlrpc_server_handle_request()) Skipped 1 previous similar message
Apr 27 07:43:37 fir-md1-s1 kernel: LNet: Skipped 4 previous similar messages
Apr 27 07:43:43 fir-md1-s1 kernel: LustreError: 82844:0:(ldlm_lockd.c:2322:ldlm_cancel_handler()) ldlm_cancel from 10.9.104.17@o2ib4 arrived at 1556376223 with bad export cookie 3999205229385308541
Apr 27 07:43:58 fir-md1-s1 kernel: LNet: Service thread pid 114991 was inactive for 200.64s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Apr 27 07:43:58 fir-md1-s1 kernel: LNet: Skipped 4 previous similar messages
Apr 27 07:43:58 fir-md1-s1 kernel: Pid: 114991, comm: mdt02_103 3.10.0-957.1.3.el7_lustre.x86_64 #1 SMP Fri Dec 7 14:50:35 PST 2018
Apr 27 07:43:58 fir-md1-s1 kernel: Call Trace:
Apr 27 07:43:58 fir-md1-s1 kernel:  [<ffffffffc0d7ae25>] ldlm_completion_ast+0x4e5/0x890 [ptlrpc]
Apr 27 07:43:58 fir-md1-s1 kernel:  [<ffffffffc0d7bc8c>] ldlm_cli_enqueue_local+0x23c/0x870 [ptlrpc]
Apr 27 07:43:58 fir-md1-s1 kernel:  [<ffffffffc146b4bb>] mdt_object_local_lock+0x50b/0xb20 [mdt]
Apr 27 07:43:58 fir-md1-s1 kernel:  [<ffffffffc146bb40>] mdt_object_lock_internal+0x70/0x3e0 [mdt]
Apr 27 07:43:58 fir-md1-s1 kernel:  [<ffffffffc1474e45>] mdt_intent_getxattr+0xb5/0x270 [mdt]
Apr 27 07:43:58 fir-md1-s1 kernel:  [<ffffffffc1471a28>] mdt_intent_policy+0x2e8/0xd00 [mdt]
Apr 27 07:43:58 fir-md1-s1 kernel:  [<ffffffffc0d61d26>] ldlm_lock_enqueue+0x366/0xa60 [ptlrpc]
Apr 27 07:43:58 fir-md1-s1 kernel:  [<ffffffffc0d8a6d7>] ldlm_handle_enqueue0+0xa47/0x15a0 [ptlrpc]
Apr 27 07:43:58 fir-md1-s1 kernel:  [<ffffffffc0e110b2>] tgt_enqueue+0x62/0x210 [ptlrpc]
Apr 27 07:43:58 fir-md1-s1 kernel:  [<ffffffffc0e1810a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
Apr 27 07:43:58 fir-md1-s1 kernel:  [<ffffffffc0dbc6db>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
Apr 27 07:43:58 fir-md1-s1 kernel:  [<ffffffffc0dc000c>] ptlrpc_main+0xafc/0x1fc0 [ptlrpc]
Apr 27 07:43:58 fir-md1-s1 kernel:  [<ffffffff9b4c1c31>] kthread+0xd1/0xe0
Apr 27 07:43:58 fir-md1-s1 kernel:  [<ffffffff9bb74c24>] ret_from_fork_nospec_begin+0xe/0x21
Apr 27 07:43:58 fir-md1-s1 kernel:  [<ffffffffffffffff>] 0xffffffffffffffff
Apr 27 07:43:58 fir-md1-s1 kernel: LustreError: dumping log to /tmp/lustre-log.1556376238.114991
Apr 27 07:43:58 fir-md1-s1 kernel: Pid: 104389, comm: mdt01_004 3.10.0-957.1.3.el7_lustre.x86_64 #1 SMP Fri Dec 7 14:50:35 PST 2018
Apr 27 07:43:58 fir-md1-s1 kernel: Call Trace:
Apr 27 07:43:58 fir-md1-s1 kernel:  [<ffffffffc0d7ae25>] ldlm_completion_ast+0x4e5/0x890 [ptlrpc]
Apr 27 07:43:58 fir-md1-s1 kernel:  [<ffffffffc0d7bc8c>] ldlm_cli_enqueue_local+0x23c/0x870 [ptlrpc]
Apr 27 07:43:58 fir-md1-s1 kernel:  [<ffffffffc146b4bb>] mdt_object_local_lock+0x50b/0xb20 [mdt]
Apr 27 07:43:58 fir-md1-s1 kernel:  [<ffffffffc146bb40>] mdt_object_lock_internal+0x70/0x3e0 [mdt]
Apr 27 07:43:58 fir-md1-s1 kernel:  [<ffffffffc1474e45>] mdt_intent_getxattr+0xb5/0x270 [mdt]
Apr 27 07:43:58 fir-md1-s1 kernel:  [<ffffffffc1471a28>] mdt_intent_policy+0x2e8/0xd00 [mdt]
Apr 27 07:43:58 fir-md1-s1 kernel:  [<ffffffffc0d61d26>] ldlm_lock_enqueue+0x366/0xa60 [ptlrpc]
Apr 27 07:43:58 fir-md1-s1 kernel:  [<ffffffffc0d8a6d7>] ldlm_handle_enqueue0+0xa47/0x15a0 [ptlrpc]
Apr 27 07:43:58 fir-md1-s1 kernel:  [<ffffffffc0e110b2>] tgt_enqueue+0x62/0x210 [ptlrpc]
Apr 27 07:43:58 fir-md1-s1 kernel:  [<ffffffffc0e1810a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
Apr 27 07:43:58 fir-md1-s1 kernel:  [<ffffffffc0dbc6db>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
Apr 27 07:43:58 fir-md1-s1 kernel:  [<ffffffffc0dc000c>] ptlrpc_main+0xafc/0x1fc0 [ptlrpc]
Apr 27 07:43:58 fir-md1-s1 kernel:  [<ffffffff9b4c1c31>] kthread+0xd1/0xe0
Apr 27 07:43:58 fir-md1-s1 kernel:  [<ffffffff9bb74c24>] ret_from_fork_nospec_begin+0xe/0x21
Apr 27 07:43:58 fir-md1-s1 kernel:  [<ffffffffffffffff>] 0xffffffffffffffff
Apr 27 07:44:07 fir-md1-s1 kernel: Lustre: 105128:0:(service.c:2165:ptlrpc_server_handle_request()) @@@ Request took longer than estimated (3750:1s); client may timeout.  req@ffff8b5e67f12700 x1631586050523824/t0(0) o101->0fafc81c-d2f9-5fcc-1c5e-9d205df82025@10.9.104.20@o2ib4:6/0 lens 480/536 e 0 to 0 dl 1556376246 
Apr 27 07:44:07 fir-md1-s1 kernel: LustreError: 104957:0:(ldlm_lockd.c:1357:ldlm_handle_enqueue0()) ### lock on destroyed export ffff8b5f75d7a000 ns: mdt-fir-MDT0000_UUID lock: ffff8b52d16cf980/0x378007f7ac0a79a9 lrc: 3/0,0 mode: PR/PR res: [0x20001a221:0x311:0x0].0x0 bits 0x20/0x0 rrc: 33 type: IBT flags: 0x5020000
Apr 27 07:44:07 fir-md1-s1 kernel: LustreError: 104957:0:(ldlm_lockd.c:1357:ldlm_handle_enqueue0()) Skipped 1 previous similar message
Apr 27 07:44:07 fir-md1-s1 kernel: Lustre: 105128:0:(service.c:2165:ptlrpc_server_handle_request()) Skipped 10 previous similar messages
Apr 27 08:25:20 fir-md1-s1 kernel: Lustre: fir-MDT0000: haven't heard from client f8efffa2-3213-b7c8-ccdc-9a5668e213bd (at 10.8.25.13@o2ib6) in 227 seconds. I think it's dead, and I am evicting it. exp ffff8b72fa1f6c00, cur 1556378720 expire 1556378570 last 1556378493
Apr 27 08:25:20 fir-md1-s1 kernel: Lustre: Skipped 2 previous similar messages
Apr 27 08:25:36 fir-md1-s1 kernel: Lustre: MGS: Connection restored to 8bfed59e-0ae2-8175-d0ee-2a61e691b9d0 (at 10.8.25.13@o2ib6)
Comment by Stephane Thiell [ 29/Apr/19 ]

We completed patching all our clients (~1,400) last weekend. We haven't seen any new {{mdt_dom_discard_data}} call traces, so I consider this issue closed. However, we likely hit LU-11751 last weekend and I have already followed up there. Thanks!

Comment by Gerrit Updater [ 04/May/19 ]

Mike Pershin (mpershin@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34804
Subject: LU-11359 mdt: more debug for lu_object_put() issue
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: a67945fc08744fa70c7b56a31adbc35ffa412dce

Comment by James A Simmons [ 22/May/19 ]

We are also seeing this issue on our production system. I want to add that it can be reproduced by running the io500 script.

Comment by Peter Jones [ 25/May/19 ]

James

Have you tested the current patch (and LU-11285)?

Peter

Comment by Gerrit Updater [ 01/Jun/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/34071/
Subject: LU-11359 mdt: fix mdt_dom_discard_data() timeouts
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 9c028e74c2202a8a481557c4cb22225734aaf19f

Comment by Peter Jones [ 01/Jun/19 ]

Landed for 2.13

Comment by James A Simmons [ 01/Jun/19 ]

Sorry, I missed the message. I have tested the LU-11359 patch but not LU-11285.

Comment by Stephane Thiell [ 06/Jun/19 ]

The latest patchset of https://review.whamcloud.com/34071/ introduced the use of lu_env_find() (due to aa82cc83 "LU-12034 obdclass: put all service's env on the list"), so it no longer compiles on b2_12. I wanted to rebase our branch against 2.12.2 and also use the patch that landed on master (patchset 20) instead of patchset 12, which we are currently running in production. What is your recommendation? Or could you please backport it to b2_12? Thanks much!
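
For context on why this fails to build: on master, LU-12034 keeps every service thread's lu_env on a list so that code deep in the MDT can look up the current thread's env via lu_env_find() instead of having it passed down the call chain; b2_12 has no such list, so the new patchset's lookup does not compile there. Below is a rough standalone sketch of that lookup pattern, a toy model only, not Lustre code; toy_env, env_register() and env_find() are invented names.

    /*
     * Toy model (not Lustre code) of a per-thread "environment" registry:
     * each worker registers its env keyed by its thread id, and helper code
     * can later find the current thread's env without it being passed down,
     * which is roughly the convenience lu_env_find() provides after LU-12034.
     */
    #include <pthread.h>
    #include <stdio.h>

    #define MAX_THREADS 16

    struct toy_env { int tag; };

    static struct {
        pthread_t       owner;
        struct toy_env *env;
    } registry[MAX_THREADS];
    static pthread_mutex_t reg_mtx = PTHREAD_MUTEX_INITIALIZER;

    static void env_register(struct toy_env *env)
    {
        pthread_mutex_lock(&reg_mtx);
        for (int i = 0; i < MAX_THREADS; i++) {
            if (registry[i].env == NULL) {
                registry[i].owner = pthread_self();
                registry[i].env = env;
                break;
            }
        }
        pthread_mutex_unlock(&reg_mtx);
    }

    static struct toy_env *env_find(void)
    {
        struct toy_env *found = NULL;

        pthread_mutex_lock(&reg_mtx);
        for (int i = 0; i < MAX_THREADS; i++) {
            if (registry[i].env &&
                pthread_equal(registry[i].owner, pthread_self())) {
                found = registry[i].env;
                break;
            }
        }
        pthread_mutex_unlock(&reg_mtx);
        return found;
    }

    /* Deep helper that takes no env argument -- it looks the env up instead. */
    static void helper_without_env_argument(void)
    {
        struct toy_env *env = env_find();

        printf("helper found env tag %d for this thread\n", env ? env->tag : -1);
    }

    static void *worker(void *arg)
    {
        struct toy_env env = { .tag = (int)(long)arg };

        env_register(&env);
        helper_without_env_argument();
        return NULL;
    }

    int main(void)
    {
        pthread_t t1, t2;

        pthread_create(&t1, NULL, worker, (void *)1L);
        pthread_create(&t2, NULL, worker, (void *)2L);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        return 0;
    }

Each worker registers its own env and the helper, which receives no env argument, still finds the right one for its thread; without the registry the helper would need the env threaded through every call, which is why the master-only patchset cannot simply be cherry-picked onto b2_12.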

Comment by Patrick Farrell (Inactive) [ 06/Jun/19 ]

LU-12034 is being ported to b2_12, Stephane - Note the LTS12 tag there.

Comment by Stephane Thiell [ 06/Jun/19 ]

Ah, cool. Thanks!

Comment by Gerrit Updater [ 11/Jun/19 ]

Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/35197
Subject: LU-11359 mdt: fix mdt_dom_discard_data() timeouts
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: 69bc9c24f5ea8e9a952ad5ab76dc5e36f6b24b66

Comment by Gerrit Updater [ 27/Jun/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/35197/
Subject: LU-11359 mdt: fix mdt_dom_discard_data() timeouts
Project: fs/lustre-release
Branch: b2_12
Current Patch Set:
Commit: e5810126b3fb488a3fed37e085e3ca4ae585324c
