LU-11359

racer test 1 times out with client hung in dir_create.sh, ls, … and MDS in ldlm_completion_ast()

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.13.0, Lustre 2.12.3
    • Affects Version/s: Lustre 2.12.0, Lustre 2.10.7, Lustre 2.12.1
    • Severity: 3

    Description

      Looking at the logs for https://testing.whamcloud.com/test_sets/ada6008e-b433-11e8-b86b-52540065bddc, we see that racer test_1 hangs in ls and dir_create.sh. On both clients, several calls to dir_create.sh, multiple ls calls, getfattr, ln, and others are blocked/hung.

      From the first client’s (vm9) console log, we see many hung tasks, ending with “lustre: revalidate FID [0x200007161:0x1:0x0] error: rc = -4”:

      [ 7722.701554] Lustre: DEBUG MARKER: DURATION=300 			MDSCOUNT=1 OSTCOUNT=7			RACER_ENABLE_REMOTE_DIRS=false 			RACER_ENABLE_STRIPED_DIRS=false 			RACER_ENABLE_MIGRATION=false 			RACER_ENABLE_PFL=true 			RACER_ENABLE_DOM=true 			RACER_ENABLE_FLR=true 			LFS=/usr/bin/lfs 			/usr/lib64/lustre/t
      [ 7722.706903] Lustre: DEBUG MARKER: DURATION=300 			MDSCOUNT=1 OSTCOUNT=7			RACER_ENABLE_REMOTE_DIRS=false 			RACER_ENABLE_STRIPED_DIRS=false 			RACER_ENABLE_MIGRATION=false 			RACER_ENABLE_PFL=true 			RACER_ENABLE_DOM=true 			RACER_ENABLE_FLR=true 			LFS=/usr/bin/lfs 			/usr/lib64/lustre/t
      [ 7726.780425] 3[19667]: segfault at 0 ip 0000000000403e5f sp 00007fff05523690 error 6 in 3 (deleted)[400000+6000]
      [ 7738.946240] Lustre: lfs: using old ioctl(LL_IOC_LOV_GETSTRIPE) on [0x200007163:0x167:0x0], use llapi_layout_get_by_path()
      [ 7858.011870] Lustre: lustre-OST0004-osc-ffff8e6e2480f800: Connection to lustre-OST0004 (at 10.9.4.117@tcp) was lost; in progress operations using this service will wait for recovery to complete
      [ 7858.015216] LustreError: 167-0: lustre-OST0004-osc-ffff8e6e2480f800: This client was evicted by lustre-OST0004; in progress operations using this service will fail.
      [ 7858.019049] LustreError: 1426:0:(ldlm_resource.c:1146:ldlm_resource_complain()) lustre-OST0004-osc-ffff8e6e2480f800: namespace resource [0x832e:0x0:0x0].0x0 (ffff8e6e0ce026c0) refcount nonzero (2) after lock cleanup; forcing cleanup.
      [ 7920.249204] INFO: task dir_create.sh:15717 blocked for more than 120 seconds.
      [ 7920.250063] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [ 7920.250808] dir_create.sh   D ffff8e6e02b14f10     0 15717  15658 0x00000080
      [ 7920.251566] Call Trace:
      [ 7920.251882]  [<ffffffffc0db47e2>] ? ll_dcompare+0x72/0x2e0 [lustre]
      [ 7920.252506]  [<ffffffffa4714f49>] schedule_preempt_disabled+0x29/0x70
      [ 7920.253246]  [<ffffffffa4712d07>] __mutex_lock_slowpath+0xc7/0x1d0
      [ 7920.253901]  [<ffffffffa47120ef>] mutex_lock+0x1f/0x2f
      [ 7920.254400]  [<ffffffffa470bcde>] lookup_slow+0x33/0xa7
      [ 7920.254914]  [<ffffffffa422927f>] link_path_walk+0x80f/0x8b0
      [ 7920.255597]  [<ffffffffa422b75d>] ? do_last+0x66d/0x12c0
      [ 7920.256108]  [<ffffffffa422c465>] path_openat+0xb5/0x640
      [ 7920.256614]  [<ffffffffa422e01d>] do_filp_open+0x4d/0xb0
      [ 7920.257118]  [<ffffffffa423b4c4>] ? __alloc_fd+0xc4/0x170
      [ 7920.257637]  [<ffffffffa421a327>] do_sys_open+0x137/0x240
      [ 7920.258159]  [<ffffffffa47206d5>] ? system_call_after_swapgs+0xa2/0x146
      [ 7920.258783]  [<ffffffffa421a44e>] SyS_open+0x1e/0x20
      [ 7920.259260]  [<ffffffffa4720795>] system_call_fastpath+0x1c/0x21
      [ 7920.259819]  [<ffffffffa47206e1>] ? system_call_after_swapgs+0xae/0x146
      [ 7920.260552] INFO: task cp:32379 blocked for more than 120 seconds.
      [ 7920.261136] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [ 7920.261877] cp              D ffff8e6e2611af70     0 32379  15806 0x00000080
      [ 7920.262581] Call Trace:
      [ 7920.262857]  [<ffffffffc0df4cc5>] ? ll_i2suppgid+0x15/0x40 [lustre]
      [ 7920.263466]  [<ffffffffc0df4d14>] ? ll_i2gids+0x24/0xb0 [lustre]
      [ 7920.264031]  [<ffffffffa4714f49>] schedule_preempt_disabled+0x29/0x70
      [ 7920.264635]  [<ffffffffa4712d07>] __mutex_lock_slowpath+0xc7/0x1d0
      [ 7920.265222]  [<ffffffffa47120ef>] mutex_lock+0x1f/0x2f
      [ 7920.265711]  [<ffffffffc0de785c>] ll_setattr_raw+0xb6c/0xee0 [lustre]
      [ 7920.266332]  [<ffffffffc0de7c3c>] ll_setattr+0x6c/0xd0 [lustre]
      [ 7920.266897]  [<ffffffffa4239f14>] notify_change+0x2c4/0x420
      [ 7920.267428]  [<ffffffffa424fab9>] utimes_common+0xd9/0x1c0
      [ 7920.267952]  [<ffffffffa424fcde>] do_utimes+0x13e/0x180
      [ 7920.268452]  [<ffffffffa424fe24>] SyS_utimensat+0x64/0xb0
      [ 7920.268966]  [<ffffffffa47206e1>] ? system_call_after_swapgs+0xae/0x146
      [ 7920.269585]  [<ffffffffa4720795>] system_call_fastpath+0x1c/0x21
      [ 7920.270156]  [<ffffffffa47206e1>] ? system_call_after_swapgs+0xae/0x146
      [ 7920.270775] INFO: task cp:434 blocked for more than 120 seconds.
      [ 7920.271344] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [ 7920.272068] cp              D ffff8e6df7cb8000     0   434  15776 0x00000080
      [ 7920.272778] Call Trace:
      [ 7920.273028]  [<ffffffffa4714f49>] schedule_preempt_disabled+0x29/0x70
      [ 7920.273636]  [<ffffffffa4712d07>] __mutex_lock_slowpath+0xc7/0x1d0
      [ 7920.274225]  [<ffffffffa47120ef>] mutex_lock+0x1f/0x2f
      [ 7920.274721]  [<ffffffffc0e14bc3>] vvp_io_write_start+0x603/0x810 [lustre]
      [ 7920.275506]  [<ffffffffc0978377>] ? cl_lock_request+0x67/0x1f0 [obdclass]
      [ 7920.276191]  [<ffffffffc097a228>] cl_io_start+0x68/0x130 [obdclass]
      [ 7920.276803]  [<ffffffffc097c611>] cl_io_loop+0x131/0xc70 [obdclass]
      [ 7920.277417]  [<ffffffffc0dc9db2>] ll_file_io_generic+0x4e2/0xd10 [lustre]
      [ 7920.278064]  [<ffffffffc0dcab32>] ll_file_aio_write+0x372/0x540 [lustre]
      [ 7920.278699]  [<ffffffffc0dcada4>] ll_file_write+0xa4/0x170 [lustre]
      [ 7920.279300]  [<ffffffffa421b490>] vfs_write+0xc0/0x1f0
      [ 7920.279776]  [<ffffffffa47206e1>] ? system_call_after_swapgs+0xae/0x146
      [ 7920.280408]  [<ffffffffa421c2bf>] SyS_write+0x7f/0xf0
      [ 7920.280894]  [<ffffffffa47206e1>] ? system_call_after_swapgs+0xae/0x146
      [ 7920.281512]  [<ffffffffa4720795>] system_call_fastpath+0x1c/0x21
      [ 7920.282096]  [<ffffffffa47206e1>] ? system_call_after_swapgs+0xae/0x146
      [ 7920.282719] INFO: task getfattr:1159 blocked for more than 120 seconds.
      [ 7920.283347] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [ 7920.284090] getfattr        D ffff8e6df815af70     0  1159  15746 0x00000080
      [ 7920.284799] Call Trace:
      [ 7920.285054]  [<ffffffffa4714f49>] schedule_preempt_disabled+0x29/0x70
      [ 7920.285660]  [<ffffffffa4712d07>] __mutex_lock_slowpath+0xc7/0x1d0
      [ 7920.286244]  [<ffffffffa47120ef>] mutex_lock+0x1f/0x2f
      [ 7920.286726]  [<ffffffffa470bcde>] lookup_slow+0x33/0xa7
      [ 7920.287231]  [<ffffffffa42291d4>] link_path_walk+0x764/0x8b0
      [ 7920.287770]  [<ffffffffc094c943>] ? class_handle2object+0xc3/0x1c0 [obdclass]
      [ 7920.288453]  [<ffffffffa422a24a>] path_lookupat+0x7a/0x8b0
      [ 7920.288981]  [<ffffffffa41f7b9b>] ? kmem_cache_alloc+0x19b/0x1f0
      [ 7920.289547]  [<ffffffffa422cd4f>] ? getname_flags+0x4f/0x1a0
      [ 7920.290096]  [<ffffffffa422aaab>] filename_lookup+0x2b/0xc0
      [ 7920.290624]  [<ffffffffa422dee7>] user_path_at_empty+0x67/0xc0
      [ 7920.291185]  [<ffffffffa4125092>] ? from_kgid_munged+0x12/0x20
      [ 7920.291740]  [<ffffffffa422142f>] ? cp_new_stat+0x14f/0x180
      [ 7920.292278]  [<ffffffffa422df51>] user_path_at+0x11/0x20
      [ 7920.292775]  [<ffffffffa4220f23>] vfs_fstatat+0x63/0xc0
      [ 7920.293282]  [<ffffffffa422148e>] SYSC_newstat+0x2e/0x60
      [ 7920.293776]  [<ffffffffa47206e1>] ? system_call_after_swapgs+0xae/0x146
      [ 7920.294407]  [<ffffffffa47206d5>] ? system_call_after_swapgs+0xa2/0x146
      [ 7920.295029]  [<ffffffffa47206e1>] ? system_call_after_swapgs+0xae/0x146
      [ 7920.295645]  [<ffffffffa47206d5>] ? system_call_after_swapgs+0xa2/0x146
      [ 7920.296272]  [<ffffffffa47206e1>] ? system_call_after_swapgs+0xae/0x146
      [ 7920.296895]  [<ffffffffa47206d5>] ? system_call_after_swapgs+0xa2/0x146
      [ 7920.297512]  [<ffffffffa47206e1>] ? system_call_after_swapgs+0xae/0x146
      [ 7920.298132]  [<ffffffffa47206d5>] ? system_call_after_swapgs+0xa2/0x146
      [ 7920.298746]  [<ffffffffa47206e1>] ? system_call_after_swapgs+0xae/0x146
      [ 7920.299373]  [<ffffffffa47206d5>] ? system_call_after_swapgs+0xa2/0x146
      [ 7920.300000]  [<ffffffffa47206e1>] ? system_call_after_swapgs+0xae/0x146
      [ 7920.300616]  [<ffffffffa422176e>] SyS_newstat+0xe/0x10
      [ 7920.301103]  [<ffffffffa4720795>] system_call_fastpath+0x1c/0x21
      [ 7920.301668]  [<ffffffffa47206e1>] ? system_call_after_swapgs+0xae/0x146
      [ 7920.302299] INFO: task getfattr:1656 blocked for more than 120 seconds.
      [ 7920.302919] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [ 7920.303653] getfattr        D ffff8e6e06b6eeb0     0  1656  15679 0x00000080
      [ 7920.304366] Call Trace:
      [ 7920.304606]  [<ffffffffa4714f49>] schedule_preempt_disabled+0x29/0x70
      [ 7920.305221]  [<ffffffffa4712d07>] __mutex_lock_slowpath+0xc7/0x1d0
      [ 7920.305790]  [<ffffffffa47120ef>] mutex_lock+0x1f/0x2f
      [ 7920.306292]  [<ffffffffa470bcde>] lookup_slow+0x33/0xa7
      [ 7920.306784]  [<ffffffffa422aa08>] path_lookupat+0x838/0x8b0
      [ 7920.307321]  [<ffffffffa41f7b79>] ? kmem_cache_alloc+0x179/0x1f0
      [ 7920.307887]  [<ffffffffa422cd4f>] ? getname_flags+0x4f/0x1a0
      [ 7920.308418]  [<ffffffffa422aaab>] filename_lookup+0x2b/0xc0
      [ 7920.308946]  [<ffffffffa422dee7>] user_path_at_empty+0x67/0xc0
      [ 7920.309494]  [<ffffffffa422df51>] user_path_at+0x11/0x20
      [ 7920.309998]  [<ffffffffa4243f68>] SyS_listxattr+0x48/0xb0
      [ 7920.310506]  [<ffffffffa47206d5>] ? system_call_after_swapgs+0xa2/0x146
      [ 7920.311125]  [<ffffffffa47206e1>] ? system_call_after_swapgs+0xae/0x146
      [ 7920.311749]  [<ffffffffa4720795>] system_call_fastpath+0x1c/0x21
      [ 7920.312320]  [<ffffffffa47206e1>] ? system_call_after_swapgs+0xae/0x146
      [ 7920.312939] INFO: task chmod:1674 blocked for more than 120 seconds.
      [ 7920.313533] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [ 7920.314270] chmod           D ffff8e6e02b12f70     0  1674  15675 0x00000080
      [ 7920.314990] Call Trace:
      [ 7920.315238]  [<ffffffffa4714f49>] schedule_preempt_disabled+0x29/0x70
      [ 7920.315829]  [<ffffffffa4712d07>] __mutex_lock_slowpath+0xc7/0x1d0
      [ 7920.316424]  [<ffffffffa47120ef>] mutex_lock+0x1f/0x2f
      [ 7920.316921]  [<ffffffffa470bcde>] lookup_slow+0x33/0xa7
      [ 7920.317417]  [<ffffffffa422aa08>] path_lookupat+0x838/0x8b0
      [ 7920.317955]  [<ffffffffa406814e>] ? kvm_clock_get_cycles+0x1e/0x20
      [ 7920.318549]  [<ffffffffa41f7b79>] ? kmem_cache_alloc+0x179/0x1f0
      [ 7920.319125]  [<ffffffffa422cd4f>] ? getname_flags+0x4f/0x1a0
      [ 7920.319657]  [<ffffffffa422aaab>] filename_lookup+0x2b/0xc0
      [ 7920.320204]  [<ffffffffa422dee7>] user_path_at_empty+0x67/0xc0
      [ 7920.320749]  [<ffffffffa471b57c>] ? __do_page_fault+0x1bc/0x4f0
      [ 7920.321316]  [<ffffffffa422df51>] user_path_at+0x11/0x20
      [ 7920.321814]  [<ffffffffa4219d07>] SyS_fchmodat+0x47/0xc0
      [ 7920.322322]  [<ffffffffa47206d5>] ? system_call_after_swapgs+0xa2/0x146
      [ 7920.322942]  [<ffffffffa47206e1>] ? system_call_after_swapgs+0xae/0x146
      [ 7920.323558]  [<ffffffffa4720795>] system_call_fastpath+0x1c/0x21
      [ 7920.324124]  [<ffffffffa47206e1>] ? system_call_after_swapgs+0xae/0x146
      [ 7920.324743] INFO: task getfattr:1834 blocked for more than 120 seconds.
      [ 7920.325372] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [ 7920.326101] getfattr        D ffff8e6df7068000     0  1834  15711 0x00000080
      [ 7920.326802] Call Trace:
      [ 7920.327060]  [<ffffffffa4714f49>] schedule_preempt_disabled+0x29/0x70
      [ 7920.327661]  [<ffffffffa4712d07>] __mutex_lock_slowpath+0xc7/0x1d0
      [ 7920.328249]  [<ffffffffa47120ef>] mutex_lock+0x1f/0x2f
      [ 7920.328745]  [<ffffffffa470bcde>] lookup_slow+0x33/0xa7
      [ 7920.329254]  [<ffffffffa422aa08>] path_lookupat+0x838/0x8b0
      [ 7920.329778]  [<ffffffffa4242b12>] ? generic_getxattr+0x52/0x70
      [ 7920.330338]  [<ffffffffa41f7b79>] ? kmem_cache_alloc+0x179/0x1f0
      [ 7920.331029]  [<ffffffffa422cd4f>] ? getname_flags+0x4f/0x1a0
      [ 7920.331651]  [<ffffffffa422aaab>] filename_lookup+0x2b/0xc0
      [ 7920.332278]  [<ffffffffa422dee7>] user_path_at_empty+0x67/0xc0
      [ 7920.332952]  [<ffffffffa471b57c>] ? __do_page_fault+0x1bc/0x4f0
      [ 7920.333913]  [<ffffffffa422df51>] user_path_at+0x11/0x20
      [ 7920.335027]  [<ffffffffa4243d2a>] SyS_getxattr+0x4a/0xc0
      [ 7920.336126]  [<ffffffffa47206d5>] ? system_call_after_swapgs+0xa2/0x146
      [ 7920.336980]  [<ffffffffa47206e1>] ? system_call_after_swapgs+0xae/0x146
      [ 7920.337720]  [<ffffffffa4720795>] system_call_fastpath+0x1c/0x21
      [ 7920.338301]  [<ffffffffa47206e1>] ? system_call_after_swapgs+0xae/0x146
      [ 7920.338928] INFO: task rm:1977 blocked for more than 120 seconds.
      [ 7920.339503] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [ 7920.340232] rm              D ffff8e6e05499fa0     0  1977  15734 0x00000080
      [ 7920.340937] Call Trace:
      [ 7920.341181]  [<ffffffffa4714f49>] schedule_preempt_disabled+0x29/0x70
      [ 7920.341784]  [<ffffffffa4712d07>] __mutex_lock_slowpath+0xc7/0x1d0
      [ 7920.342374]  [<ffffffffa47120ef>] mutex_lock+0x1f/0x2f
      [ 7920.342857]  [<ffffffffa422d265>] do_rmdir+0x165/0x220
      [ 7920.343352]  [<ffffffffa47206e1>] ? system_call_after_swapgs+0xae/0x146
      [ 7920.343972]  [<ffffffffa47206d5>] ? system_call_after_swapgs+0xa2/0x146
      [ 7920.344586]  [<ffffffffa47206e1>] ? system_call_after_swapgs+0xae/0x146
      [ 7920.345204]  [<ffffffffa47206d5>] ? system_call_after_swapgs+0xa2/0x146
      [ 7920.345816]  [<ffffffffa47206e1>] ? system_call_after_swapgs+0xae/0x146
      [ 7920.346444]  [<ffffffffa47206d5>] ? system_call_after_swapgs+0xa2/0x146
      [ 7920.347063]  [<ffffffffa422e555>] SyS_unlinkat+0x25/0x40
      [ 7920.347565]  [<ffffffffa4720795>] system_call_fastpath+0x1c/0x21
      [ 7920.348130]  [<ffffffffa47206e1>] ? system_call_after_swapgs+0xae/0x146
      [ 7920.348747] INFO: task mv:2144 blocked for more than 120 seconds.
      [ 7920.349323] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [ 7920.350049] mv              D ffff8e6e05498000     0  2144  15700 0x00000080
      [ 7920.350756] Call Trace:
      [ 7920.351005]  [<ffffffffa422a2f2>] ? path_lookupat+0x122/0x8b0
      [ 7920.351547]  [<ffffffffa4714f49>] schedule_preempt_disabled+0x29/0x70
      [ 7920.352148]  [<ffffffffa4712d07>] __mutex_lock_slowpath+0xc7/0x1d0
      [ 7920.352728]  [<ffffffffa47120ef>] mutex_lock+0x1f/0x2f
      [ 7920.353222]  [<ffffffffa42276e5>] lock_rename+0x55/0xe0
      [ 7920.353713]  [<ffffffffa422d81f>] SYSC_renameat2+0x22f/0x5a0
      [ 7920.354255]  [<ffffffffa47206e1>] ? system_call_after_swapgs+0xae/0x146
      [ 7920.354871]  [<ffffffffa47206d5>] ? system_call_after_swapgs+0xa2/0x146
      [ 7920.355489]  [<ffffffffa47206e1>] ? system_call_after_swapgs+0xae/0x146
      [ 7920.356110]  [<ffffffffa47206d5>] ? system_call_after_swapgs+0xa2/0x146
      [ 7920.356732]  [<ffffffffa422ea0e>] SyS_renameat2+0xe/0x10
      [ 7920.357247]  [<ffffffffa422ea4e>] SyS_rename+0x1e/0x20
      [ 7920.357731]  [<ffffffffa4720795>] system_call_fastpath+0x1c/0x21
      [ 7920.358303]  [<ffffffffa47206e1>] ? system_call_after_swapgs+0xae/0x146
      [ 7920.358930] INFO: task ln:2148 blocked for more than 120 seconds.
      [ 7920.359504] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [ 7920.360254] ln              D ffff8e6df7cbeeb0     0  2148  15720 0x00000080
      [ 7920.360973] Call Trace:
      [ 7920.361221]  [<ffffffffa4714f49>] schedule_preempt_disabled+0x29/0x70
      [ 7920.361812]  [<ffffffffa4712d07>] __mutex_lock_slowpath+0xc7/0x1d0
      [ 7920.362402]  [<ffffffffa47120ef>] mutex_lock+0x1f/0x2f
      [ 7920.362896]  [<ffffffffa470bcde>] lookup_slow+0x33/0xa7
      [ 7920.363390]  [<ffffffffa422aa08>] path_lookupat+0x838/0x8b0
      [ 7920.363924]  [<ffffffffa41944fb>] ? unlock_page+0x2b/0x30
      [ 7920.364444]  [<ffffffffa41c13f1>] ? do_read_fault.isra.60+0x121/0x1a0
      [ 7920.365048]  [<ffffffffa41f7b79>] ? kmem_cache_alloc+0x179/0x1f0
      [ 7920.365610]  [<ffffffffa422cd4f>] ? getname_flags+0x4f/0x1a0
      [ 7920.366146]  [<ffffffffa422aaab>] filename_lookup+0x2b/0xc0
      [ 7920.366671]  [<ffffffffa422dee7>] user_path_at_empty+0x67/0xc0
      [ 7920.367232]  [<ffffffffa41c7c3d>] ? handle_mm_fault+0x39d/0x9b0
      [ 7920.367777]  [<ffffffffa4125092>] ? from_kgid_munged+0x12/0x20
      [ 7920.368329]  [<ffffffffa422df51>] user_path_at+0x11/0x20
      [ 7920.368821]  [<ffffffffa4220f23>] vfs_fstatat+0x63/0xc0
      [ 7920.369344]  [<ffffffffa42214f1>] SYSC_newlstat+0x31/0x60
      [ 7920.369857]  [<ffffffffa47206e1>] ? system_call_after_swapgs+0xae/0x146
      [ 7920.370495]  [<ffffffffa47206d5>] ? system_call_after_swapgs+0xa2/0x146
      [ 7920.371115]  [<ffffffffa47206e1>] ? system_call_after_swapgs+0xae/0x146
      [ 7920.371737]  [<ffffffffa47206d5>] ? system_call_after_swapgs+0xa2/0x146
      [ 7920.372363]  [<ffffffffa47206e1>] ? system_call_after_swapgs+0xae/0x146
      [ 7920.372987]  [<ffffffffa47206d5>] ? system_call_after_swapgs+0xa2/0x146
      [ 7920.373604]  [<ffffffffa47206e1>] ? system_call_after_swapgs+0xae/0x146
      [ 7920.374250]  [<ffffffffa47206d5>] ? system_call_after_swapgs+0xa2/0x146
      [ 7920.374872]  [<ffffffffa47206e1>] ? system_call_after_swapgs+0xae/0x146
      [ 7920.375495]  [<ffffffffa47206d5>] ? system_call_after_swapgs+0xa2/0x146
      [ 7920.376114]  [<ffffffffa47206e1>] ? system_call_after_swapgs+0xae/0x146
      [ 7920.376736]  [<ffffffffa422177e>] SyS_newlstat+0xe/0x10
      [ 7920.377240]  [<ffffffffa4720795>] system_call_fastpath+0x1c/0x21
      [ 7920.377806]  [<ffffffffa47206e1>] ? system_call_after_swapgs+0xae/0x146
      [ 8022.948194] Lustre: lustre-OST0004-osc-ffff8e6e2480f800: Connection restored to 10.9.4.117@tcp (at 10.9.4.117@tcp)
      [ 8023.080282] LustreError: 4670:0:(file.c:4366:ll_inode_revalidate_fini()) lustre: revalidate FID [0x200007161:0x1:0x0] error: rc = -4
      [ 8023.080306] LustreError: 4664:0:(file.c:4366:ll_inode_revalidate_fini()) lustre: revalidate FID [0x200007161:0x1:0x0] error: rc = -4
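      The "rc = -4" in the revalidate messages above is a negated errno value, i.e. -EINTR (the operation was interrupted, here presumably by the eviction/recovery). A small generic helper for decoding such codes — not part of any Lustre tooling, just a sketch:

      ```python
      import errno
      import os

      def decode_rc(rc):
          """Map a negative kernel return code (e.g. -4) to its errno name and message."""
          err = -rc
          return errno.errorcode.get(err, "unknown"), os.strerror(err)

      print(decode_rc(-4))  # EINTR: the call was interrupted by a signal
      ```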
      

      From the second client’s (vm10) console log, we see call stacks similar to those on the first client:

      [ 7722.357867] Lustre: DEBUG MARKER: == racer test 1: racer on clients: trevis-10vm10,trevis-10vm9.trevis.whamcloud.com DURATION=300 ====== 12:19:23 (1536495563)
      [ 7722.541747] Lustre: DEBUG MARKER: DURATION=300 			MDSCOUNT=1 OSTCOUNT=7			RACER_ENABLE_REMOTE_DIRS=false 			RACER_ENABLE_STRIPED_DIRS=false 			RACER_ENABLE_MIGRATION=false 			RACER_ENABLE_PFL=true 			RACER_ENABLE_DOM=true 			RACER_ENABLE_FLR=true 			LFS=/usr/bin/lfs 			/usr/lib64/lustre/t
      [ 7722.542686] Lustre: DEBUG MARKER: DURATION=300 			MDSCOUNT=1 OSTCOUNT=7			RACER_ENABLE_REMOTE_DIRS=false 			RACER_ENABLE_STRIPED_DIRS=false 			RACER_ENABLE_MIGRATION=false 			RACER_ENABLE_PFL=true 			RACER_ENABLE_DOM=true 			RACER_ENABLE_FLR=true 			LFS=/usr/bin/lfs 			/usr/lib64/lustre/t
      [ 7726.494344] Lustre: lfs: using old ioctl(LL_IOC_LOV_GETSTRIPE) on [0x200007164:0x60:0x0], use llapi_layout_get_by_path()
      [ 7920.259874] INFO: task dir_create.sh:14445 blocked for more than 120 seconds.
      [ 7920.260874] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [ 7920.261634] dir_create.sh   D ffff9bac39240fd0     0 14445  14433 0x00000080
      [ 7920.262414] Call Trace:
      [ 7920.262749]  [<ffffffffc0e427e2>] ? ll_dcompare+0x72/0x2e0 [lustre]
      [ 7920.263384]  [<ffffffff88f14f49>] schedule_preempt_disabled+0x29/0x70
      [ 7920.264027]  [<ffffffff88f12d07>] __mutex_lock_slowpath+0xc7/0x1d0
      [ 7920.264623]  [<ffffffff88f120ef>] mutex_lock+0x1f/0x2f
      [ 7920.265151]  [<ffffffff88f0bcde>] lookup_slow+0x33/0xa7
      [ 7920.265692]  [<ffffffff88a2927f>] link_path_walk+0x80f/0x8b0
      [ 7920.266253]  [<ffffffff88a2b75d>] ? do_last+0x66d/0x12c0
      [ 7920.266782]  [<ffffffff88a2c465>] path_openat+0xb5/0x640
      [ 7920.267314]  [<ffffffff88a2e01d>] do_filp_open+0x4d/0xb0
      [ 7920.267848]  [<ffffffff88a3b4c4>] ? __alloc_fd+0xc4/0x170
      [ 7920.268392]  [<ffffffff88a1a327>] do_sys_open+0x137/0x240
      [ 7920.268943]  [<ffffffff88f206d5>] ? system_call_after_swapgs+0xa2/0x146
      [ 7920.269595]  [<ffffffff88a1a44e>] SyS_open+0x1e/0x20
      [ 7920.270104]  [<ffffffff88f20795>] system_call_fastpath+0x1c/0x21
      [ 7920.270693]  [<ffffffff88f206e1>] ? system_call_after_swapgs+0xae/0x146
      [ 7920.271383] INFO: task ls:23458 blocked for more than 120 seconds.
      [ 7920.271998] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [ 7920.272770] ls              D ffff9bac1fd82f70     0 23458  14490 0x00000080
      [ 7920.273503] Call Trace:
      [ 7920.273774]  [<ffffffff88f14f49>] schedule_preempt_disabled+0x29/0x70
      [ 7920.274406]  [<ffffffff88f12d07>] __mutex_lock_slowpath+0xc7/0x1d0
      [ 7920.275021]  [<ffffffff88f120ef>] mutex_lock+0x1f/0x2f
      [ 7920.275525]  [<ffffffff88a2b37f>] do_last+0x28f/0x12c0
      [ 7920.276048]  [<ffffffff88a2c487>] path_openat+0xd7/0x640
      [ 7920.276582]  [<ffffffffc0e4e929>] ? ll_file_data_put+0x89/0x180 [lustre]
      [ 7920.277251]  [<ffffffff88a2e01d>] do_filp_open+0x4d/0xb0
      [ 7920.277784]  [<ffffffff88a3b4c4>] ? __alloc_fd+0xc4/0x170
      [ 7920.278318]  [<ffffffff88a1a327>] do_sys_open+0x137/0x240
      [ 7920.278858]  [<ffffffff88f206d5>] ? system_call_after_swapgs+0xa2/0x146
      [ 7920.279515]  [<ffffffff88a1a464>] SyS_openat+0x14/0x20
      [ 7920.280033]  [<ffffffff88f20795>] system_call_fastpath+0x1c/0x21
      [ 7920.280614]  [<ffffffff88f206e1>] ? system_call_after_swapgs+0xae/0x146
      [ 7920.281265] INFO: task ls:23462 blocked for more than 120 seconds.
      [ 7920.281868] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [ 7920.282636] ls              D ffff9bac1fc13f40     0 23462  14490 0x00000080
      [ 7920.283377] Call Trace:
      [ 7920.283638]  [<ffffffff88f14f49>] schedule_preempt_disabled+0x29/0x70
      [ 7920.284279]  [<ffffffff88f12d07>] __mutex_lock_slowpath+0xc7/0x1d0
      [ 7920.284885]  [<ffffffff88f120ef>] mutex_lock+0x1f/0x2f
      [ 7920.285399]  [<ffffffff88a2b37f>] do_last+0x28f/0x12c0
      [ 7920.285922]  [<ffffffff88a2c487>] path_openat+0xd7/0x640
      [ 7920.286448]  [<ffffffffc0e4e929>] ? ll_file_data_put+0x89/0x180 [lustre]
      [ 7920.287110]  [<ffffffff88a2e01d>] do_filp_open+0x4d/0xb0
      [ 7920.287631]  [<ffffffff88a3b4c4>] ? __alloc_fd+0xc4/0x170
      [ 7920.288173]  [<ffffffff88a1a327>] do_sys_open+0x137/0x240
      [ 7920.288717]  [<ffffffff88f206d5>] ? system_call_after_swapgs+0xa2/0x146
      [ 7920.289361]  [<ffffffff88a1a464>] SyS_openat+0x14/0x20
      [ 7920.289878]  [<ffffffff88f20795>] system_call_fastpath+0x1c/0x21
      [ 7920.290470]  [<ffffffff88f206e1>] ? system_call_after_swapgs+0xae/0x146
      [ 7920.291141] INFO: task ls:23464 blocked for more than 120 seconds.
      [ 7920.291746] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [ 7920.292497] ls              D ffff9bac20318000     0 23464  14490 0x00000080
      [ 7920.293256] Call Trace:
      [ 7920.293500]  [<ffffffff88f14f49>] schedule_preempt_disabled+0x29/0x70
      [ 7920.294150]  [<ffffffff88f12d07>] __mutex_lock_slowpath+0xc7/0x1d0
      [ 7920.294767]  [<ffffffff88f120ef>] mutex_lock+0x1f/0x2f
      [ 7920.295280]  [<ffffffff88a2b37f>] do_last+0x28f/0x12c0
      [ 7920.295790]  [<ffffffff88a2c487>] path_openat+0xd7/0x640
      [ 7920.296327]  [<ffffffffc0e4e929>] ? ll_file_data_put+0x89/0x180 [lustre]
      [ 7920.296990]  [<ffffffff88a2e01d>] do_filp_open+0x4d/0xb0
      [ 7920.297510]  [<ffffffff88a3b4c4>] ? __alloc_fd+0xc4/0x170
      [ 7920.298056]  [<ffffffff88a1a327>] do_sys_open+0x137/0x240
      [ 7920.298587]  [<ffffffff88f206d5>] ? system_call_after_swapgs+0xa2/0x146
      [ 7920.299256]  [<ffffffff88a1a464>] SyS_openat+0x14/0x20
      [ 7920.299775]  [<ffffffff88f20795>] system_call_fastpath+0x1c/0x21
      [ 7920.300367]  [<ffffffff88f206e1>] ? system_call_after_swapgs+0xae/0x146
      [ 7920.301015] INFO: task ls:23466 blocked for more than 120 seconds.
      [ 7920.301612] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [ 7920.302395] ls              D ffff9bac2031af70     0 23466  14490 0x00000080
      [ 7920.303136] Call Trace:
      [ 7920.303397]  [<ffffffff88f14f49>] schedule_preempt_disabled+0x29/0x70
      [ 7920.304035]  [<ffffffff88f12d07>] __mutex_lock_slowpath+0xc7/0x1d0
      [ 7920.304632]  [<ffffffff88f120ef>] mutex_lock+0x1f/0x2f
      [ 7920.305155]  [<ffffffff88a2b37f>] do_last+0x28f/0x12c0
      [ 7920.305673]  [<ffffffff88a2c487>] path_openat+0xd7/0x640
      [ 7920.306200]  [<ffffffff88a2cd4f>] ? getname_flags+0x4f/0x1a0
      [ 7920.306762]  [<ffffffffc0e4e929>] ? ll_file_data_put+0x89/0x180 [lustre]
      [ 7920.307431]  [<ffffffff88a2e01d>] do_filp_open+0x4d/0xb0
      [ 7920.307963]  [<ffffffff88a3b4c4>] ? __alloc_fd+0xc4/0x170
      [ 7920.308493]  [<ffffffff88a1a327>] do_sys_open+0x137/0x240
      [ 7920.309037]  [<ffffffff88f206d5>] ? system_call_after_swapgs+0xa2/0x146
      [ 7920.309683]  [<ffffffff88a1a464>] SyS_openat+0x14/0x20
      [ 7920.310196]  [<ffffffff88f20795>] system_call_fastpath+0x1c/0x21
      [ 7920.310786]  [<ffffffff88f206e1>] ? system_call_after_swapgs+0xae/0x146
      [ 7920.311447] INFO: task getfattr:1379 blocked for more than 120 seconds.
      [ 7920.312100] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [ 7920.312876] getfattr        D ffff9bac1d229fa0     0  1379  14538 0x00000080
      [ 7920.313611] Call Trace:
      [ 7920.313884]  [<ffffffff88f14f49>] schedule_preempt_disabled+0x29/0x70
      [ 7920.314513]  [<ffffffff88f12d07>] __mutex_lock_slowpath+0xc7/0x1d0
      [ 7920.315128]  [<ffffffff88f120ef>] mutex_lock+0x1f/0x2f
      [ 7920.315631]  [<ffffffff88f0bcde>] lookup_slow+0x33/0xa7
      [ 7920.316162]  [<ffffffff88a2aa08>] path_lookupat+0x838/0x8b0
      [ 7920.316722]  [<ffffffff88a42b12>] ? generic_getxattr+0x52/0x70
      [ 7920.317294]  [<ffffffff88a42f6a>] ? vfs_getxattr+0x8a/0xb0
      [ 7920.317855]  [<ffffffff889f7a35>] ? kmem_cache_alloc+0x35/0x1f0
      [ 7920.318436]  [<ffffffff88a2cd4f>] ? getname_flags+0x4f/0x1a0
      [ 7920.319001]  [<ffffffff88a2aaab>] filename_lookup+0x2b/0xc0
      [ 7920.319544]  [<ffffffff88a2dee7>] user_path_at_empty+0x67/0xc0
      [ 7920.320131]  [<ffffffff88f1b57c>] ? __do_page_fault+0x1bc/0x4f0
      [ 7920.320718]  [<ffffffff88a2df51>] user_path_at+0x11/0x20
      [ 7920.321247]  [<ffffffff88a43d2a>] SyS_getxattr+0x4a/0xc0
      [ 7920.321777]  [<ffffffff88f206d5>] ? system_call_after_swapgs+0xa2/0x146
      [ 7920.322439]  [<ffffffff88f206e1>] ? system_call_after_swapgs+0xae/0x146
      [ 7920.323091]  [<ffffffff88f20795>] system_call_fastpath+0x1c/0x21
      [ 7920.323689]  [<ffffffff88f206e1>] ? system_call_after_swapgs+0xae/0x146
      [ 7920.324352] INFO: task ln:2464 blocked for more than 120 seconds.
      [ 7920.324955] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [ 7920.325711] ln              D ffff9bac21380fd0     0  2464  14485 0x00000080
      [ 7920.326458] Call Trace:
      [ 7920.326712]  [<ffffffff88a2a2f2>] ? path_lookupat+0x122/0x8b0
      [ 7920.327296]  [<ffffffff88f14f49>] schedule_preempt_disabled+0x29/0x70
      [ 7920.327932]  [<ffffffff88f12d07>] __mutex_lock_slowpath+0xc7/0x1d0
      [ 7920.328532]  [<ffffffff88f120ef>] mutex_lock+0x1f/0x2f
      [ 7920.329053]  [<ffffffff88a2abc5>] filename_create+0x85/0x180
      [ 7920.329608]  [<ffffffff889f7a35>] ? kmem_cache_alloc+0x35/0x1f0
      [ 7920.330201]  [<ffffffff88a2cd4f>] ? getname_flags+0x4f/0x1a0
      [ 7920.330775]  [<ffffffff88a2cdc4>] ? getname_flags+0xc4/0x1a0
      [ 7920.331338]  [<ffffffff88a2d071>] user_path_create+0x41/0x60
      [ 7920.331898]  [<ffffffff88a2e5e8>] SyS_symlinkat+0x58/0x110
      [ 7920.332441]  [<ffffffff88f206d5>] ? system_call_after_swapgs+0xa2/0x146
      [ 7920.333091]  [<ffffffff88f206e1>] ? system_call_after_swapgs+0xae/0x146
      [ 7920.333755]  [<ffffffff88a2e6b6>] SyS_symlink+0x16/0x20
      [ 7920.334279]  [<ffffffff88f20795>] system_call_fastpath+0x1c/0x21
      [ 7920.334869]  [<ffffffff88f206e1>] ? system_call_after_swapgs+0xae/0x146
      [ 7920.335512] INFO: task ls:2582 blocked for more than 120 seconds.
      [ 7920.336114] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [ 7920.336890] ls              D ffff9bac1d329fa0     0  2582  14455 0x00000080
      [ 7920.337619] Call Trace:
      [ 7920.337894]  [<ffffffff88f14f49>] schedule_preempt_disabled+0x29/0x70
      [ 7920.338519]  [<ffffffff88f12d07>] __mutex_lock_slowpath+0xc7/0x1d0
      [ 7920.339135]  [<ffffffff88f120ef>] mutex_lock+0x1f/0x2f
      [ 7920.339640]  [<ffffffff88a2b37f>] do_last+0x28f/0x12c0
      [ 7920.340160]  [<ffffffff88a2c487>] path_openat+0xd7/0x640
      [ 7920.340700]  [<ffffffffc0e4e929>] ? ll_file_data_put+0x89/0x180 [lustre]
      [ 7920.341373]  [<ffffffff88a2e01d>] do_filp_open+0x4d/0xb0
      [ 7920.341902]  [<ffffffff88a3b4c4>] ? __alloc_fd+0xc4/0x170
      [ 7920.342438]  [<ffffffff88a1a327>] do_sys_open+0x137/0x240
      [ 7920.342981]  [<ffffffff88f206d5>] ? system_call_after_swapgs+0xa2/0x146
      [ 7920.343618]  [<ffffffff88a1a464>] SyS_openat+0x14/0x20
      [ 7920.344142]  [<ffffffff88f20795>] system_call_fastpath+0x1c/0x21
      [ 7920.344757]  [<ffffffff88f206e1>] ? system_call_after_swapgs+0xae/0x146
      [ 7920.345402] INFO: task ls:2583 blocked for more than 120 seconds.
      [ 7920.346003] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [ 7920.346774] ls              D ffff9bac1d32dee0     0  2583  14455 0x00000080
      [ 7920.347511] Call Trace:
      [ 7920.347787]  [<ffffffff88f14f49>] schedule_preempt_disabled+0x29/0x70
      [ 7920.348417]  [<ffffffff88f12d07>] __mutex_lock_slowpath+0xc7/0x1d0
      [ 7920.349033]  [<ffffffff88f120ef>] mutex_lock+0x1f/0x2f
      [ 7920.349539]  [<ffffffff88a2b37f>] do_last+0x28f/0x12c0
      [ 7920.350059]  [<ffffffff88a2c487>] path_openat+0xd7/0x640
      [ 7920.350587]  [<ffffffffc0e4e929>] ? ll_file_data_put+0x89/0x180 [lustre]
      [ 7920.351271]  [<ffffffff88a2e01d>] do_filp_open+0x4d/0xb0
      [ 7920.351805]  [<ffffffff88a3b4c4>] ? __alloc_fd+0xc4/0x170
      [ 7920.352339]  [<ffffffff88a1a327>] do_sys_open+0x137/0x240
      [ 7920.352881]  [<ffffffff88f206d5>] ? system_call_after_swapgs+0xa2/0x146
      [ 7920.353524]  [<ffffffff88a1a464>] SyS_openat+0x14/0x20
      [ 7920.354050]  [<ffffffff88f20795>] system_call_fastpath+0x1c/0x21
      [ 7920.354631]  [<ffffffff88f206e1>] ? system_call_after_swapgs+0xae/0x146
      [ 7920.355301] INFO: task ls:2584 blocked for more than 120 seconds.
      [ 7920.355903] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [ 7920.356685] ls              D ffff9bac1e540fd0     0  2584  14455 0x00000080
      [ 7920.357413] Call Trace:
      [ 7920.357688]  [<ffffffff88f14f49>] schedule_preempt_disabled+0x29/0x70
      [ 7920.358315]  [<ffffffff88f12d07>] __mutex_lock_slowpath+0xc7/0x1d0
      [ 7920.358927]  [<ffffffff88f120ef>] mutex_lock+0x1f/0x2f
      [ 7920.359432]  [<ffffffff88a2b37f>] do_last+0x28f/0x12c0
      [ 7920.359955]  [<ffffffff88a2c487>] path_openat+0xd7/0x640
      [ 7920.360484]  [<ffffffffc0e4e929>] ? ll_file_data_put+0x89/0x180 [lustre]
      [ 7920.361166]  [<ffffffff88a2e01d>] do_filp_open+0x4d/0xb0
      [ 7920.361698]  [<ffffffff88a3b4c4>] ? __alloc_fd+0xc4/0x170
      [ 7920.362231]  [<ffffffff88a1a327>] do_sys_open+0x137/0x240
      [ 7920.362768]  [<ffffffff88f206d5>] ? system_call_after_swapgs+0xa2/0x146
      [ 7920.363411]  [<ffffffff88a1a464>] SyS_openat+0x14/0x20
      [ 7920.363931]  [<ffffffff88f20795>] system_call_fastpath+0x1c/0x21
      [ 7920.364512]  [<ffffffff88f206e1>] ? system_call_after_swapgs+0xae/0x146
      [ 8022.975327] LustreError: 4302:0:(file.c:4366:ll_inode_revalidate_fini()) lustre: revalidate FID [0x200007161:0x1:0x0] error: rc = -4
      [ 8022.975389] LustreError: 4303:0:(file.c:4366:ll_inode_revalidate_fini()) lustre: revalidate FID [0x200007161:0x1:0x0] error: rc = -4
      [ 8022.975622] LustreError: 4305:0:(file.c:4366:ll_inode_revalidate_fini()) lustre: revalidate FID [0x200007161:0x1:0x0] error: rc = -4
      

      Description continued in the first comment due to JIRA description line limits.

      Attachments

        Issue Links

          Activity

            [LU-11359] racer test 1 times out with client hung in dir_create.sh, ls, … and MDS in ldlm_completion_ast()

            Ah, cool. Thanks!

sthiell Stephane Thiell added a comment

            LU-12034 is being ported to b2_12, Stephane - Note the LTS12 tag there.

pfarrell Patrick Farrell (Inactive) added a comment

The latest patchset of https://review.whamcloud.com/34071/ introduced the use of lu_env_find (due to aa82cc83 LU-12034 obdclass: put all service's env on the list), so it no longer compiles on b2_12. I wanted to rebase our branch against 2.12.2 and also use the patch that landed on master (patchset 20) instead of patchset 12, which we are currently running in production. What is your recommendation? Or could you please backport it to b2_12? Thanks much!

sthiell Stephane Thiell added a comment

Sorry, I missed the message. I have tested the LU-11359 patch but not LU-11285.

simmonsja James A Simmons added a comment
            pjones Peter Jones added a comment -

            Landed for 2.13


            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/34071/
            Subject: LU-11359 mdt: fix mdt_dom_discard_data() timeouts
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 9c028e74c2202a8a481557c4cb22225734aaf19f

gerrit Gerrit Updater added a comment
            pjones Peter Jones added a comment -

            James

            Have you tested the current patch (and LU-11285)?

            Peter

            simmonsja James A Simmons added a comment - - edited

We are also seeing this issue on our production system. I want to add that it can be reproduced by running the io500 script.


            Mike Pershin (mpershin@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34804
            Subject: LU-11359 mdt: more debug for lu_object_put() issue
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: a67945fc08744fa70c7b56a31adbc35ffa412dce

gerrit Gerrit Updater added a comment

We completed the patching of all our clients (~1,400) last weekend. We haven't seen any new mdt_dom_discard_data call traces, so I consider this issue closed. However, we likely hit LU-11751 last weekend and have already followed up there. Thanks!

sthiell Stephane Thiell added a comment

            Mike, ok thanks. I'll follow this one too. I have seen a few more traces on MDS last night, from time to time (see an example of one of these 'events' below). Not sure about the impact on users yet, but I assume some file access must be blocked, at least temporarily. We stopped Robinhood (which is not critical here) so it's not coming from it I think. Our plan is not to touch anything unless we see a real user-facing problem at this point, but do you think https://review.whamcloud.com/#/c/33077 would be worth trying if ever we see that it has an impact on production?

            Apr 27 06:59:36 fir-md1-s1 kernel: Lustre: fir-MDT0000: Client 79273348-f676-2905-42da-85a87e1ba2d5 (at 10.9.107.14@o2ib4) reconnecting
            Apr 27 06:59:36 fir-md1-s1 kernel: Lustre: Skipped 299 previous similar messages
            Apr 27 07:08:23 fir-md1-s1 kernel: Lustre: fir-MDT0000: Connection restored to  (at 10.9.107.14@o2ib4)
            Apr 27 07:08:23 fir-md1-s1 kernel: Lustre: Skipped 299 previous similar messages
            Apr 27 07:09:56 fir-md1-s1 kernel: Lustre: fir-MDT0000: Client 79273348-f676-2905-42da-85a87e1ba2d5 (at 10.9.107.14@o2ib4) reconnecting
            Apr 27 07:09:56 fir-md1-s1 kernel: Lustre: Skipped 299 previous similar messages
            Apr 27 07:18:43 fir-md1-s1 kernel: Lustre: fir-MDT0000: Connection restored to  (at 10.9.107.14@o2ib4)
            Apr 27 07:18:43 fir-md1-s1 kernel: Lustre: Skipped 299 previous similar messages
            Apr 27 07:20:16 fir-md1-s1 kernel: Lustre: fir-MDT0000: Client 79273348-f676-2905-42da-85a87e1ba2d5 (at 10.9.107.14@o2ib4) reconnecting
            Apr 27 07:20:16 fir-md1-s1 kernel: Lustre: Skipped 299 previous similar messages
            Apr 27 07:29:03 fir-md1-s1 kernel: Lustre: fir-MDT0000: Connection restored to 4a715ce4-44e5-9eb5-fd19-43a9ef3ba9b7 (at 10.9.107.10@o2ib4)
            Apr 27 07:29:03 fir-md1-s1 kernel: Lustre: Skipped 299 previous similar messages
            Apr 27 07:30:36 fir-md1-s1 kernel: Lustre: fir-MDT0000: Client 8937ba9c-f8a3-cff7-01bf-a35c8d69bba5 (at 10.9.107.10@o2ib4) reconnecting
            Apr 27 07:30:36 fir-md1-s1 kernel: Lustre: Skipped 299 previous similar messages
            Apr 27 07:39:23 fir-md1-s1 kernel: Lustre: fir-MDT0000: Connection restored to 4a715ce4-44e5-9eb5-fd19-43a9ef3ba9b7 (at 10.9.107.10@o2ib4)
            Apr 27 07:39:23 fir-md1-s1 kernel: Lustre: Skipped 299 previous similar messages
            Apr 27 07:40:37 fir-md1-s1 kernel: LNet: Service thread pid 114792 completed after 3541.44s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
            Apr 27 07:40:37 fir-md1-s1 kernel: LNet: Skipped 1 previous similar message
            Apr 27 07:40:56 fir-md1-s1 kernel: Lustre: fir-MDT0000: Client 8937ba9c-f8a3-cff7-01bf-a35c8d69bba5 (at 10.9.107.10@o2ib4) reconnecting
            Apr 27 07:40:56 fir-md1-s1 kernel: Lustre: Skipped 299 previous similar messages
            Apr 27 07:41:02 fir-md1-s1 kernel: Lustre: 105058:0:(service.c:1372:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/-5), not sending early reply
                                                 req@ffff8b5d7a9ad100 x1631534730383072/t0(0) o101->5735cd86-3a30-362c-bc05-c634d3fa1859@10.9.107.11@o2ib4:7/0 lens 632/0 e 0 to 0 dl 1556376067 ref 2 fl Interpret:/0/ffffffff rc 0/-1
            Apr 27 07:41:02 fir-md1-s1 kernel: Lustre: 105058:0:(service.c:1372:ptlrpc_at_send_early_reply()) Skipped 30 previous similar messages
            Apr 27 07:42:07 fir-md1-s1 kernel: LustreError: 104389:0:(ldlm_request.c:129:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1556376037, 90s ago); not entering recovery in server code, just going back to sleep ns: mdt-fir-MDT0000_UUID lock: ffff8b4d410f0240/0x378007f7dcffbc77 lrc: 3/1,0 mode: --/PR r
            Apr 27 07:42:07 fir-md1-s1 kernel: LustreError: 104389:0:(ldlm_request.c:129:ldlm_expired_completion_wait()) Skipped 2 previous similar messages
            Apr 27 07:43:07 fir-md1-s1 kernel: LustreError: 20444:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 150s: evicting client at 10.9.104.19@o2ib4  ns: mdt-fir-MDT0000_UUID lock: ffff8b60e5ace780/0x378007f7ac0a6892 lrc: 3/0,0 mode: PW/PW res: [0x20001a221:0x311:0x0].0x0 bits 0x40/0x0 rrc
            Apr 27 07:43:07 fir-md1-s1 kernel: LustreError: 104728:0:(ldlm_lockd.c:1357:ldlm_handle_enqueue0()) ### lock on destroyed export ffff8b5101ad2000 ns: mdt-fir-MDT0000_UUID lock: ffff8b5ededbe300/0x378007f7ac0a6bc5 lrc: 5/0,0 mode: PW/PW res: [0x20001a221:0x311:0x0].0x0 bits 0x40/0x0 rrc: 50 type: IBT flags: 0x5020040
            Apr 27 07:43:07 fir-md1-s1 kernel: LNet: Service thread pid 104728 completed after 3691.00s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
            Apr 27 07:43:07 fir-md1-s1 kernel: LNet: Service thread pid 105239 completed after 3691.00s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
            Apr 27 07:43:32 fir-md1-s1 kernel: Lustre: 105124:0:(service.c:1372:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/-5), not sending early reply
                                                 req@ffff8b6231114500 x1631546041738256/t0(0) o101->5c33d5d2-2621-e59d-0e36-98a7cb2caa9e@10.9.104.17@o2ib4:7/0 lens 568/0 e 0 to 0 dl 1556376217 ref 2 fl Interpret:/0/ffffffff rc 0/-1
            Apr 27 07:43:32 fir-md1-s1 kernel: Lustre: 105124:0:(service.c:1372:ptlrpc_at_send_early_reply()) Skipped 1 previous similar message
            Apr 27 07:43:37 fir-md1-s1 kernel: LustreError: 20444:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 30s: evicting client at 10.9.104.17@o2ib4  ns: mdt-fir-MDT0000_UUID lock: ffff8b57efe82880/0x378007f7ac0a6d3f lrc: 3/0,0 mode: PW/PW res: [0x20001a221:0x311:0x0].0x0 bits 0x40/0x0 rrc:
            Apr 27 07:43:37 fir-md1-s1 kernel: LNet: Service thread pid 104337 completed after 3720.97s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
            Apr 27 07:43:37 fir-md1-s1 kernel: LustreError: 104691:0:(ldlm_lockd.c:1357:ldlm_handle_enqueue0()) ### lock on destroyed export ffff8b5101ad2000 ns: mdt-fir-MDT0000_UUID lock: ffff8b567d303a80/0x378007f7ac0a6e88 lrc: 3/0,0 mode: PR/PR res: [0x20001a221:0x311:0x0].0x0 bits 0x1b/0x0 rrc: 44 type: IBT flags: 0x5020040
            Apr 27 07:43:37 fir-md1-s1 kernel: Lustre: 104691:0:(service.c:2165:ptlrpc_server_handle_request()) @@@ Request took longer than estimated (3719:2s); client may timeout.  req@ffff8b5e67f11200 x1631585913250576/t0(0) o101->8a37f7b1-3efc-30e9-f8d1-739df6680357@10.9.104.19@o2ib4:6/0 lens 584/1168 e 0 to 0 dl 1556376215
            Apr 27 07:43:37 fir-md1-s1 kernel: Lustre: 104691:0:(service.c:2165:ptlrpc_server_handle_request()) Skipped 1 previous similar message
            Apr 27 07:43:37 fir-md1-s1 kernel: LNet: Skipped 4 previous similar messages
            Apr 27 07:43:43 fir-md1-s1 kernel: LustreError: 82844:0:(ldlm_lockd.c:2322:ldlm_cancel_handler()) ldlm_cancel from 10.9.104.17@o2ib4 arrived at 1556376223 with bad export cookie 3999205229385308541
            Apr 27 07:43:58 fir-md1-s1 kernel: LNet: Service thread pid 114991 was inactive for 200.64s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
            Apr 27 07:43:58 fir-md1-s1 kernel: LNet: Skipped 4 previous similar messages
            Apr 27 07:43:58 fir-md1-s1 kernel: Pid: 114991, comm: mdt02_103 3.10.0-957.1.3.el7_lustre.x86_64 #1 SMP Fri Dec 7 14:50:35 PST 2018
            Apr 27 07:43:58 fir-md1-s1 kernel: Call Trace:
            Apr 27 07:43:58 fir-md1-s1 kernel:  [<ffffffffc0d7ae25>] ldlm_completion_ast+0x4e5/0x890 [ptlrpc]
            Apr 27 07:43:58 fir-md1-s1 kernel:  [<ffffffffc0d7bc8c>] ldlm_cli_enqueue_local+0x23c/0x870 [ptlrpc]
            Apr 27 07:43:58 fir-md1-s1 kernel:  [<ffffffffc146b4bb>] mdt_object_local_lock+0x50b/0xb20 [mdt]
            Apr 27 07:43:58 fir-md1-s1 kernel:  [<ffffffffc146bb40>] mdt_object_lock_internal+0x70/0x3e0 [mdt]
            Apr 27 07:43:58 fir-md1-s1 kernel:  [<ffffffffc1474e45>] mdt_intent_getxattr+0xb5/0x270 [mdt]
            Apr 27 07:43:58 fir-md1-s1 kernel:  [<ffffffffc1471a28>] mdt_intent_policy+0x2e8/0xd00 [mdt]
            Apr 27 07:43:58 fir-md1-s1 kernel:  [<ffffffffc0d61d26>] ldlm_lock_enqueue+0x366/0xa60 [ptlrpc]
            Apr 27 07:43:58 fir-md1-s1 kernel:  [<ffffffffc0d8a6d7>] ldlm_handle_enqueue0+0xa47/0x15a0 [ptlrpc]
            Apr 27 07:43:58 fir-md1-s1 kernel:  [<ffffffffc0e110b2>] tgt_enqueue+0x62/0x210 [ptlrpc]
            Apr 27 07:43:58 fir-md1-s1 kernel:  [<ffffffffc0e1810a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
            Apr 27 07:43:58 fir-md1-s1 kernel:  [<ffffffffc0dbc6db>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
            Apr 27 07:43:58 fir-md1-s1 kernel:  [<ffffffffc0dc000c>] ptlrpc_main+0xafc/0x1fc0 [ptlrpc]
            Apr 27 07:43:58 fir-md1-s1 kernel:  [<ffffffff9b4c1c31>] kthread+0xd1/0xe0
            Apr 27 07:43:58 fir-md1-s1 kernel:  [<ffffffff9bb74c24>] ret_from_fork_nospec_begin+0xe/0x21
            Apr 27 07:43:58 fir-md1-s1 kernel:  [<ffffffffffffffff>] 0xffffffffffffffff
            Apr 27 07:43:58 fir-md1-s1 kernel: LustreError: dumping log to /tmp/lustre-log.1556376238.114991
            Apr 27 07:43:58 fir-md1-s1 kernel: Pid: 104389, comm: mdt01_004 3.10.0-957.1.3.el7_lustre.x86_64 #1 SMP Fri Dec 7 14:50:35 PST 2018
            Apr 27 07:43:58 fir-md1-s1 kernel: Call Trace:
            Apr 27 07:43:58 fir-md1-s1 kernel:  [<ffffffffc0d7ae25>] ldlm_completion_ast+0x4e5/0x890 [ptlrpc]
            Apr 27 07:43:58 fir-md1-s1 kernel:  [<ffffffffc0d7bc8c>] ldlm_cli_enqueue_local+0x23c/0x870 [ptlrpc]
            Apr 27 07:43:58 fir-md1-s1 kernel:  [<ffffffffc146b4bb>] mdt_object_local_lock+0x50b/0xb20 [mdt]
            Apr 27 07:43:58 fir-md1-s1 kernel:  [<ffffffffc146bb40>] mdt_object_lock_internal+0x70/0x3e0 [mdt]
            Apr 27 07:43:58 fir-md1-s1 kernel:  [<ffffffffc1474e45>] mdt_intent_getxattr+0xb5/0x270 [mdt]
            Apr 27 07:43:58 fir-md1-s1 kernel:  [<ffffffffc1471a28>] mdt_intent_policy+0x2e8/0xd00 [mdt]
            Apr 27 07:43:58 fir-md1-s1 kernel:  [<ffffffffc0d61d26>] ldlm_lock_enqueue+0x366/0xa60 [ptlrpc]
            Apr 27 07:43:58 fir-md1-s1 kernel:  [<ffffffffc0d8a6d7>] ldlm_handle_enqueue0+0xa47/0x15a0 [ptlrpc]
            Apr 27 07:43:58 fir-md1-s1 kernel:  [<ffffffffc0e110b2>] tgt_enqueue+0x62/0x210 [ptlrpc]
            Apr 27 07:43:58 fir-md1-s1 kernel:  [<ffffffffc0e1810a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
            Apr 27 07:43:58 fir-md1-s1 kernel:  [<ffffffffc0dbc6db>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
            Apr 27 07:43:58 fir-md1-s1 kernel:  [<ffffffffc0dc000c>] ptlrpc_main+0xafc/0x1fc0 [ptlrpc]
            Apr 27 07:43:58 fir-md1-s1 kernel:  [<ffffffff9b4c1c31>] kthread+0xd1/0xe0
            Apr 27 07:43:58 fir-md1-s1 kernel:  [<ffffffff9bb74c24>] ret_from_fork_nospec_begin+0xe/0x21
            Apr 27 07:43:58 fir-md1-s1 kernel:  [<ffffffffffffffff>] 0xffffffffffffffff
            Apr 27 07:44:07 fir-md1-s1 kernel: Lustre: 105128:0:(service.c:2165:ptlrpc_server_handle_request()) @@@ Request took longer than estimated (3750:1s); client may timeout.  req@ffff8b5e67f12700 x1631586050523824/t0(0) o101->0fafc81c-d2f9-5fcc-1c5e-9d205df82025@10.9.104.20@o2ib4:6/0 lens 480/536 e 0 to 0 dl 1556376246 
            Apr 27 07:44:07 fir-md1-s1 kernel: LustreError: 104957:0:(ldlm_lockd.c:1357:ldlm_handle_enqueue0()) ### lock on destroyed export ffff8b5f75d7a000 ns: mdt-fir-MDT0000_UUID lock: ffff8b52d16cf980/0x378007f7ac0a79a9 lrc: 3/0,0 mode: PR/PR res: [0x20001a221:0x311:0x0].0x0 bits 0x20/0x0 rrc: 33 type: IBT flags: 0x5020000
            Apr 27 07:44:07 fir-md1-s1 kernel: LustreError: 104957:0:(ldlm_lockd.c:1357:ldlm_handle_enqueue0()) Skipped 1 previous similar message
            Apr 27 07:44:07 fir-md1-s1 kernel: Lustre: 105128:0:(service.c:2165:ptlrpc_server_handle_request()) Skipped 10 previous similar messages
            Apr 27 08:25:20 fir-md1-s1 kernel: Lustre: fir-MDT0000: haven't heard from client f8efffa2-3213-b7c8-ccdc-9a5668e213bd (at 10.8.25.13@o2ib6) in 227 seconds. I think it's dead, and I am evicting it. exp ffff8b72fa1f6c00, cur 1556378720 expire 1556378570 last 1556378493
            Apr 27 08:25:20 fir-md1-s1 kernel: Lustre: Skipped 2 previous similar messages
            Apr 27 08:25:36 fir-md1-s1 kernel: Lustre: MGS: Connection restored to 8bfed59e-0ae2-8175-d0ee-2a61e691b9d0 (at 10.8.25.13@o2ib6)
            
sthiell Stephane Thiell added a comment

            People

Assignee: tappro Mikhail Pershin
Reporter: jamesanunez James Nunez (Inactive)
Votes: 0
Watchers: 11

              Dates

                Created:
                Updated:
                Resolved: