Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-11359

racer test 1 times out with client hung in dir_create.sh, ls, … and MDS in ldlm_completion_ast()

    XMLWordPrintable

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: Lustre 2.12.0, Lustre 2.10.7, Lustre 2.12.1
    • Fix Version/s: Lustre 2.13.0, Lustre 2.12.3
    • Labels:
    • Severity:
      3
    • Rank (Obsolete):
      9223372036854775807

      Description

      Looking at the logs for https://testing.whamcloud.com/test_sets/ada6008e-b433-11e8-b86b-52540065bddc, we see that racer test_1 hangs in in ls, dir_create.sh. On both clients, we see several call to dir_create.sh, multiple ls calls, getfattr, ln and others blocked/hung.

      From the first client’s (vm9) console log, we see many calls hung ending with “lustre: revalidate FID [0x200007161:0x1:0x0] error: rc = -4”

      [ 7722.701554] Lustre: DEBUG MARKER: DURATION=300 			MDSCOUNT=1 OSTCOUNT=7			RACER_ENABLE_REMOTE_DIRS=false 			RACER_ENABLE_STRIPED_DIRS=false 			RACER_ENABLE_MIGRATION=false 			RACER_ENABLE_PFL=true 			RACER_ENABLE_DOM=true 			RACER_ENABLE_FLR=true 			LFS=/usr/bin/lfs 			/usr/lib64/lustre/t
      [ 7722.706903] Lustre: DEBUG MARKER: DURATION=300 			MDSCOUNT=1 OSTCOUNT=7			RACER_ENABLE_REMOTE_DIRS=false 			RACER_ENABLE_STRIPED_DIRS=false 			RACER_ENABLE_MIGRATION=false 			RACER_ENABLE_PFL=true 			RACER_ENABLE_DOM=true 			RACER_ENABLE_FLR=true 			LFS=/usr/bin/lfs 			/usr/lib64/lustre/t
      [ 7726.780425] 3[19667]: segfault at 0 ip 0000000000403e5f sp 00007fff05523690 error 6 in 3 (deleted)[400000+6000]
      [ 7738.946240] Lustre: lfs: using old ioctl(LL_IOC_LOV_GETSTRIPE) on [0x200007163:0x167:0x0], use llapi_layout_get_by_path()
      [ 7858.011870] Lustre: lustre-OST0004-osc-ffff8e6e2480f800: Connection to lustre-OST0004 (at 10.9.4.117@tcp) was lost; in progress operations using this service will wait for recovery to complete
      [ 7858.015216] LustreError: 167-0: lustre-OST0004-osc-ffff8e6e2480f800: This client was evicted by lustre-OST0004; in progress operations using this service will fail.
      [ 7858.019049] LustreError: 1426:0:(ldlm_resource.c:1146:ldlm_resource_complain()) lustre-OST0004-osc-ffff8e6e2480f800: namespace resource [0x832e:0x0:0x0].0x0 (ffff8e6e0ce026c0) refcount nonzero (2) after lock cleanup; forcing cleanup.
      [ 7920.249204] INFO: task dir_create.sh:15717 blocked for more than 120 seconds.
      [ 7920.250063] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [ 7920.250808] dir_create.sh   D ffff8e6e02b14f10     0 15717  15658 0x00000080
      [ 7920.251566] Call Trace:
      [ 7920.251882]  [<ffffffffc0db47e2>] ? ll_dcompare+0x72/0x2e0 [lustre]
      [ 7920.252506]  [<ffffffffa4714f49>] schedule_preempt_disabled+0x29/0x70
      [ 7920.253246]  [<ffffffffa4712d07>] __mutex_lock_slowpath+0xc7/0x1d0
      [ 7920.253901]  [<ffffffffa47120ef>] mutex_lock+0x1f/0x2f
      [ 7920.254400]  [<ffffffffa470bcde>] lookup_slow+0x33/0xa7
      [ 7920.254914]  [<ffffffffa422927f>] link_path_walk+0x80f/0x8b0
      [ 7920.255597]  [<ffffffffa422b75d>] ? do_last+0x66d/0x12c0
      [ 7920.256108]  [<ffffffffa422c465>] path_openat+0xb5/0x640
      [ 7920.256614]  [<ffffffffa422e01d>] do_filp_open+0x4d/0xb0
      [ 7920.257118]  [<ffffffffa423b4c4>] ? __alloc_fd+0xc4/0x170
      [ 7920.257637]  [<ffffffffa421a327>] do_sys_open+0x137/0x240
      [ 7920.258159]  [<ffffffffa47206d5>] ? system_call_after_swapgs+0xa2/0x146
      [ 7920.258783]  [<ffffffffa421a44e>] SyS_open+0x1e/0x20
      [ 7920.259260]  [<ffffffffa4720795>] system_call_fastpath+0x1c/0x21
      [ 7920.259819]  [<ffffffffa47206e1>] ? system_call_after_swapgs+0xae/0x146
      [ 7920.260552] INFO: task cp:32379 blocked for more than 120 seconds.
      [ 7920.261136] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [ 7920.261877] cp              D ffff8e6e2611af70     0 32379  15806 0x00000080
      [ 7920.262581] Call Trace:
      [ 7920.262857]  [<ffffffffc0df4cc5>] ? ll_i2suppgid+0x15/0x40 [lustre]
      [ 7920.263466]  [<ffffffffc0df4d14>] ? ll_i2gids+0x24/0xb0 [lustre]
      [ 7920.264031]  [<ffffffffa4714f49>] schedule_preempt_disabled+0x29/0x70
      [ 7920.264635]  [<ffffffffa4712d07>] __mutex_lock_slowpath+0xc7/0x1d0
      [ 7920.265222]  [<ffffffffa47120ef>] mutex_lock+0x1f/0x2f
      [ 7920.265711]  [<ffffffffc0de785c>] ll_setattr_raw+0xb6c/0xee0 [lustre]
      [ 7920.266332]  [<ffffffffc0de7c3c>] ll_setattr+0x6c/0xd0 [lustre]
      [ 7920.266897]  [<ffffffffa4239f14>] notify_change+0x2c4/0x420
      [ 7920.267428]  [<ffffffffa424fab9>] utimes_common+0xd9/0x1c0
      [ 7920.267952]  [<ffffffffa424fcde>] do_utimes+0x13e/0x180
      [ 7920.268452]  [<ffffffffa424fe24>] SyS_utimensat+0x64/0xb0
      [ 7920.268966]  [<ffffffffa47206e1>] ? system_call_after_swapgs+0xae/0x146
      [ 7920.269585]  [<ffffffffa4720795>] system_call_fastpath+0x1c/0x21
      [ 7920.270156]  [<ffffffffa47206e1>] ? system_call_after_swapgs+0xae/0x146
      [ 7920.270775] INFO: task cp:434 blocked for more than 120 seconds.
      [ 7920.271344] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [ 7920.272068] cp              D ffff8e6df7cb8000     0   434  15776 0x00000080
      [ 7920.272778] Call Trace:
      [ 7920.273028]  [<ffffffffa4714f49>] schedule_preempt_disabled+0x29/0x70
      [ 7920.273636]  [<ffffffffa4712d07>] __mutex_lock_slowpath+0xc7/0x1d0
      [ 7920.274225]  [<ffffffffa47120ef>] mutex_lock+0x1f/0x2f
      [ 7920.274721]  [<ffffffffc0e14bc3>] vvp_io_write_start+0x603/0x810 [lustre]
      [ 7920.275506]  [<ffffffffc0978377>] ? cl_lock_request+0x67/0x1f0 [obdclass]
      [ 7920.276191]  [<ffffffffc097a228>] cl_io_start+0x68/0x130 [obdclass]
      [ 7920.276803]  [<ffffffffc097c611>] cl_io_loop+0x131/0xc70 [obdclass]
      [ 7920.277417]  [<ffffffffc0dc9db2>] ll_file_io_generic+0x4e2/0xd10 [lustre]
      [ 7920.278064]  [<ffffffffc0dcab32>] ll_file_aio_write+0x372/0x540 [lustre]
      [ 7920.278699]  [<ffffffffc0dcada4>] ll_file_write+0xa4/0x170 [lustre]
      [ 7920.279300]  [<ffffffffa421b490>] vfs_write+0xc0/0x1f0
      [ 7920.279776]  [<ffffffffa47206e1>] ? system_call_after_swapgs+0xae/0x146
      [ 7920.280408]  [<ffffffffa421c2bf>] SyS_write+0x7f/0xf0
      [ 7920.280894]  [<ffffffffa47206e1>] ? system_call_after_swapgs+0xae/0x146
      [ 7920.281512]  [<ffffffffa4720795>] system_call_fastpath+0x1c/0x21
      [ 7920.282096]  [<ffffffffa47206e1>] ? system_call_after_swapgs+0xae/0x146
      [ 7920.282719] INFO: task getfattr:1159 blocked for more than 120 seconds.
      [ 7920.283347] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [ 7920.284090] getfattr        D ffff8e6df815af70     0  1159  15746 0x00000080
      [ 7920.284799] Call Trace:
      [ 7920.285054]  [<ffffffffa4714f49>] schedule_preempt_disabled+0x29/0x70
      [ 7920.285660]  [<ffffffffa4712d07>] __mutex_lock_slowpath+0xc7/0x1d0
      [ 7920.286244]  [<ffffffffa47120ef>] mutex_lock+0x1f/0x2f
      [ 7920.286726]  [<ffffffffa470bcde>] lookup_slow+0x33/0xa7
      [ 7920.287231]  [<ffffffffa42291d4>] link_path_walk+0x764/0x8b0
      [ 7920.287770]  [<ffffffffc094c943>] ? class_handle2object+0xc3/0x1c0 [obdclass]
      [ 7920.288453]  [<ffffffffa422a24a>] path_lookupat+0x7a/0x8b0
      [ 7920.288981]  [<ffffffffa41f7b9b>] ? kmem_cache_alloc+0x19b/0x1f0
      [ 7920.289547]  [<ffffffffa422cd4f>] ? getname_flags+0x4f/0x1a0
      [ 7920.290096]  [<ffffffffa422aaab>] filename_lookup+0x2b/0xc0
      [ 7920.290624]  [<ffffffffa422dee7>] user_path_at_empty+0x67/0xc0
      [ 7920.291185]  [<ffffffffa4125092>] ? from_kgid_munged+0x12/0x20
      [ 7920.291740]  [<ffffffffa422142f>] ? cp_new_stat+0x14f/0x180
      [ 7920.292278]  [<ffffffffa422df51>] user_path_at+0x11/0x20
      [ 7920.292775]  [<ffffffffa4220f23>] vfs_fstatat+0x63/0xc0
      [ 7920.293282]  [<ffffffffa422148e>] SYSC_newstat+0x2e/0x60
      [ 7920.293776]  [<ffffffffa47206e1>] ? system_call_after_swapgs+0xae/0x146
      [ 7920.294407]  [<ffffffffa47206d5>] ? system_call_after_swapgs+0xa2/0x146
      [ 7920.295029]  [<ffffffffa47206e1>] ? system_call_after_swapgs+0xae/0x146
      [ 7920.295645]  [<ffffffffa47206d5>] ? system_call_after_swapgs+0xa2/0x146
      [ 7920.296272]  [<ffffffffa47206e1>] ? system_call_after_swapgs+0xae/0x146
      [ 7920.296895]  [<ffffffffa47206d5>] ? system_call_after_swapgs+0xa2/0x146
      [ 7920.297512]  [<ffffffffa47206e1>] ? system_call_after_swapgs+0xae/0x146
      [ 7920.298132]  [<ffffffffa47206d5>] ? system_call_after_swapgs+0xa2/0x146
      [ 7920.298746]  [<ffffffffa47206e1>] ? system_call_after_swapgs+0xae/0x146
      [ 7920.299373]  [<ffffffffa47206d5>] ? system_call_after_swapgs+0xa2/0x146
      [ 7920.300000]  [<ffffffffa47206e1>] ? system_call_after_swapgs+0xae/0x146
      [ 7920.300616]  [<ffffffffa422176e>] SyS_newstat+0xe/0x10
      [ 7920.301103]  [<ffffffffa4720795>] system_call_fastpath+0x1c/0x21
      [ 7920.301668]  [<ffffffffa47206e1>] ? system_call_after_swapgs+0xae/0x146
      [ 7920.302299] INFO: task getfattr:1656 blocked for more than 120 seconds.
      [ 7920.302919] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [ 7920.303653] getfattr        D ffff8e6e06b6eeb0     0  1656  15679 0x00000080
      [ 7920.304366] Call Trace:
      [ 7920.304606]  [<ffffffffa4714f49>] schedule_preempt_disabled+0x29/0x70
      [ 7920.305221]  [<ffffffffa4712d07>] __mutex_lock_slowpath+0xc7/0x1d0
      [ 7920.305790]  [<ffffffffa47120ef>] mutex_lock+0x1f/0x2f
      [ 7920.306292]  [<ffffffffa470bcde>] lookup_slow+0x33/0xa7
      [ 7920.306784]  [<ffffffffa422aa08>] path_lookupat+0x838/0x8b0
      [ 7920.307321]  [<ffffffffa41f7b79>] ? kmem_cache_alloc+0x179/0x1f0
      [ 7920.307887]  [<ffffffffa422cd4f>] ? getname_flags+0x4f/0x1a0
      [ 7920.308418]  [<ffffffffa422aaab>] filename_lookup+0x2b/0xc0
      [ 7920.308946]  [<ffffffffa422dee7>] user_path_at_empty+0x67/0xc0
      [ 7920.309494]  [<ffffffffa422df51>] user_path_at+0x11/0x20
      [ 7920.309998]  [<ffffffffa4243f68>] SyS_listxattr+0x48/0xb0
      [ 7920.310506]  [<ffffffffa47206d5>] ? system_call_after_swapgs+0xa2/0x146
      [ 7920.311125]  [<ffffffffa47206e1>] ? system_call_after_swapgs+0xae/0x146
      [ 7920.311749]  [<ffffffffa4720795>] system_call_fastpath+0x1c/0x21
      [ 7920.312320]  [<ffffffffa47206e1>] ? system_call_after_swapgs+0xae/0x146
      [ 7920.312939] INFO: task chmod:1674 blocked for more than 120 seconds.
      [ 7920.313533] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [ 7920.314270] chmod           D ffff8e6e02b12f70     0  1674  15675 0x00000080
      [ 7920.314990] Call Trace:
      [ 7920.315238]  [<ffffffffa4714f49>] schedule_preempt_disabled+0x29/0x70
      [ 7920.315829]  [<ffffffffa4712d07>] __mutex_lock_slowpath+0xc7/0x1d0
      [ 7920.316424]  [<ffffffffa47120ef>] mutex_lock+0x1f/0x2f
      [ 7920.316921]  [<ffffffffa470bcde>] lookup_slow+0x33/0xa7
      [ 7920.317417]  [<ffffffffa422aa08>] path_lookupat+0x838/0x8b0
      [ 7920.317955]  [<ffffffffa406814e>] ? kvm_clock_get_cycles+0x1e/0x20
      [ 7920.318549]  [<ffffffffa41f7b79>] ? kmem_cache_alloc+0x179/0x1f0
      [ 7920.319125]  [<ffffffffa422cd4f>] ? getname_flags+0x4f/0x1a0
      [ 7920.319657]  [<ffffffffa422aaab>] filename_lookup+0x2b/0xc0
      [ 7920.320204]  [<ffffffffa422dee7>] user_path_at_empty+0x67/0xc0
      [ 7920.320749]  [<ffffffffa471b57c>] ? __do_page_fault+0x1bc/0x4f0
      [ 7920.321316]  [<ffffffffa422df51>] user_path_at+0x11/0x20
      [ 7920.321814]  [<ffffffffa4219d07>] SyS_fchmodat+0x47/0xc0
      [ 7920.322322]  [<ffffffffa47206d5>] ? system_call_after_swapgs+0xa2/0x146
      [ 7920.322942]  [<ffffffffa47206e1>] ? system_call_after_swapgs+0xae/0x146
      [ 7920.323558]  [<ffffffffa4720795>] system_call_fastpath+0x1c/0x21
      [ 7920.324124]  [<ffffffffa47206e1>] ? system_call_after_swapgs+0xae/0x146
      [ 7920.324743] INFO: task getfattr:1834 blocked for more than 120 seconds.
      [ 7920.325372] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [ 7920.326101] getfattr        D ffff8e6df7068000     0  1834  15711 0x00000080
      [ 7920.326802] Call Trace:
      [ 7920.327060]  [<ffffffffa4714f49>] schedule_preempt_disabled+0x29/0x70
      [ 7920.327661]  [<ffffffffa4712d07>] __mutex_lock_slowpath+0xc7/0x1d0
      [ 7920.328249]  [<ffffffffa47120ef>] mutex_lock+0x1f/0x2f
      [ 7920.328745]  [<ffffffffa470bcde>] lookup_slow+0x33/0xa7
      [ 7920.329254]  [<ffffffffa422aa08>] path_lookupat+0x838/0x8b0
      [ 7920.329778]  [<ffffffffa4242b12>] ? generic_getxattr+0x52/0x70
      [ 7920.330338]  [<ffffffffa41f7b79>] ? kmem_cache_alloc+0x179/0x1f0
      [ 7920.331029]  [<ffffffffa422cd4f>] ? getname_flags+0x4f/0x1a0
      [ 7920.331651]  [<ffffffffa422aaab>] filename_lookup+0x2b/0xc0
      [ 7920.332278]  [<ffffffffa422dee7>] user_path_at_empty+0x67/0xc0
      [ 7920.332952]  [<ffffffffa471b57c>] ? __do_page_fault+0x1bc/0x4f0
      [ 7920.333913]  [<ffffffffa422df51>] user_path_at+0x11/0x20
      [ 7920.335027]  [<ffffffffa4243d2a>] SyS_getxattr+0x4a/0xc0
      [ 7920.336126]  [<ffffffffa47206d5>] ? system_call_after_swapgs+0xa2/0x146
      [ 7920.336980]  [<ffffffffa47206e1>] ? system_call_after_swapgs+0xae/0x146
      [ 7920.337720]  [<ffffffffa4720795>] system_call_fastpath+0x1c/0x21
      [ 7920.338301]  [<ffffffffa47206e1>] ? system_call_after_swapgs+0xae/0x146
      [ 7920.338928] INFO: task rm:1977 blocked for more than 120 seconds.
      [ 7920.339503] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [ 7920.340232] rm              D ffff8e6e05499fa0     0  1977  15734 0x00000080
      [ 7920.340937] Call Trace:
      [ 7920.341181]  [<ffffffffa4714f49>] schedule_preempt_disabled+0x29/0x70
      [ 7920.341784]  [<ffffffffa4712d07>] __mutex_lock_slowpath+0xc7/0x1d0
      [ 7920.342374]  [<ffffffffa47120ef>] mutex_lock+0x1f/0x2f
      [ 7920.342857]  [<ffffffffa422d265>] do_rmdir+0x165/0x220
      [ 7920.343352]  [<ffffffffa47206e1>] ? system_call_after_swapgs+0xae/0x146
      [ 7920.343972]  [<ffffffffa47206d5>] ? system_call_after_swapgs+0xa2/0x146
      [ 7920.344586]  [<ffffffffa47206e1>] ? system_call_after_swapgs+0xae/0x146
      [ 7920.345204]  [<ffffffffa47206d5>] ? system_call_after_swapgs+0xa2/0x146
      [ 7920.345816]  [<ffffffffa47206e1>] ? system_call_after_swapgs+0xae/0x146
      [ 7920.346444]  [<ffffffffa47206d5>] ? system_call_after_swapgs+0xa2/0x146
      [ 7920.347063]  [<ffffffffa422e555>] SyS_unlinkat+0x25/0x40
      [ 7920.347565]  [<ffffffffa4720795>] system_call_fastpath+0x1c/0x21
      [ 7920.348130]  [<ffffffffa47206e1>] ? system_call_after_swapgs+0xae/0x146
      [ 7920.348747] INFO: task mv:2144 blocked for more than 120 seconds.
      [ 7920.349323] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [ 7920.350049] mv              D ffff8e6e05498000     0  2144  15700 0x00000080
      [ 7920.350756] Call Trace:
      [ 7920.351005]  [<ffffffffa422a2f2>] ? path_lookupat+0x122/0x8b0
      [ 7920.351547]  [<ffffffffa4714f49>] schedule_preempt_disabled+0x29/0x70
      [ 7920.352148]  [<ffffffffa4712d07>] __mutex_lock_slowpath+0xc7/0x1d0
      [ 7920.352728]  [<ffffffffa47120ef>] mutex_lock+0x1f/0x2f
      [ 7920.353222]  [<ffffffffa42276e5>] lock_rename+0x55/0xe0
      [ 7920.353713]  [<ffffffffa422d81f>] SYSC_renameat2+0x22f/0x5a0
      [ 7920.354255]  [<ffffffffa47206e1>] ? system_call_after_swapgs+0xae/0x146
      [ 7920.354871]  [<ffffffffa47206d5>] ? system_call_after_swapgs+0xa2/0x146
      [ 7920.355489]  [<ffffffffa47206e1>] ? system_call_after_swapgs+0xae/0x146
      [ 7920.356110]  [<ffffffffa47206d5>] ? system_call_after_swapgs+0xa2/0x146
      [ 7920.356732]  [<ffffffffa422ea0e>] SyS_renameat2+0xe/0x10
      [ 7920.357247]  [<ffffffffa422ea4e>] SyS_rename+0x1e/0x20
      [ 7920.357731]  [<ffffffffa4720795>] system_call_fastpath+0x1c/0x21
      [ 7920.358303]  [<ffffffffa47206e1>] ? system_call_after_swapgs+0xae/0x146
      [ 7920.358930] INFO: task ln:2148 blocked for more than 120 seconds.
      [ 7920.359504] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [ 7920.360254] ln              D ffff8e6df7cbeeb0     0  2148  15720 0x00000080
      [ 7920.360973] Call Trace:
      [ 7920.361221]  [<ffffffffa4714f49>] schedule_preempt_disabled+0x29/0x70
      [ 7920.361812]  [<ffffffffa4712d07>] __mutex_lock_slowpath+0xc7/0x1d0
      [ 7920.362402]  [<ffffffffa47120ef>] mutex_lock+0x1f/0x2f
      [ 7920.362896]  [<ffffffffa470bcde>] lookup_slow+0x33/0xa7
      [ 7920.363390]  [<ffffffffa422aa08>] path_lookupat+0x838/0x8b0
      [ 7920.363924]  [<ffffffffa41944fb>] ? unlock_page+0x2b/0x30
      [ 7920.364444]  [<ffffffffa41c13f1>] ? do_read_fault.isra.60+0x121/0x1a0
      [ 7920.365048]  [<ffffffffa41f7b79>] ? kmem_cache_alloc+0x179/0x1f0
      [ 7920.365610]  [<ffffffffa422cd4f>] ? getname_flags+0x4f/0x1a0
      [ 7920.366146]  [<ffffffffa422aaab>] filename_lookup+0x2b/0xc0
      [ 7920.366671]  [<ffffffffa422dee7>] user_path_at_empty+0x67/0xc0
      [ 7920.367232]  [<ffffffffa41c7c3d>] ? handle_mm_fault+0x39d/0x9b0
      [ 7920.367777]  [<ffffffffa4125092>] ? from_kgid_munged+0x12/0x20
      [ 7920.368329]  [<ffffffffa422df51>] user_path_at+0x11/0x20
      [ 7920.368821]  [<ffffffffa4220f23>] vfs_fstatat+0x63/0xc0
      [ 7920.369344]  [<ffffffffa42214f1>] SYSC_newlstat+0x31/0x60
      [ 7920.369857]  [<ffffffffa47206e1>] ? system_call_after_swapgs+0xae/0x146
      [ 7920.370495]  [<ffffffffa47206d5>] ? system_call_after_swapgs+0xa2/0x146
      [ 7920.371115]  [<ffffffffa47206e1>] ? system_call_after_swapgs+0xae/0x146
      [ 7920.371737]  [<ffffffffa47206d5>] ? system_call_after_swapgs+0xa2/0x146
      [ 7920.372363]  [<ffffffffa47206e1>] ? system_call_after_swapgs+0xae/0x146
      [ 7920.372987]  [<ffffffffa47206d5>] ? system_call_after_swapgs+0xa2/0x146
      [ 7920.373604]  [<ffffffffa47206e1>] ? system_call_after_swapgs+0xae/0x146
      [ 7920.374250]  [<ffffffffa47206d5>] ? system_call_after_swapgs+0xa2/0x146
      [ 7920.374872]  [<ffffffffa47206e1>] ? system_call_after_swapgs+0xae/0x146
      [ 7920.375495]  [<ffffffffa47206d5>] ? system_call_after_swapgs+0xa2/0x146
      [ 7920.376114]  [<ffffffffa47206e1>] ? system_call_after_swapgs+0xae/0x146
      [ 7920.376736]  [<ffffffffa422177e>] SyS_newlstat+0xe/0x10
      [ 7920.377240]  [<ffffffffa4720795>] system_call_fastpath+0x1c/0x21
      [ 7920.377806]  [<ffffffffa47206e1>] ? system_call_after_swapgs+0xae/0x146
      [ 8022.948194] Lustre: lustre-OST0004-osc-ffff8e6e2480f800: Connection restored to 10.9.4.117@tcp (at 10.9.4.117@tcp)
      [ 8023.080282] LustreError: 4670:0:(file.c:4366:ll_inode_revalidate_fini()) lustre: revalidate FID [0x200007161:0x1:0x0] error: rc = -4
      [ 8023.080306] LustreError: 4664:0:(file.c:4366:ll_inode_revalidate_fini()) lustre: revalidate FID [0x200007161:0x1:0x0] error: rc = -4
      

      From the second client’s (vm10) console log, we see similar calls as in the first client’s call stack

      [ 7722.357867] Lustre: DEBUG MARKER: == racer test 1: racer on clients: trevis-10vm10,trevis-10vm9.trevis.whamcloud.com DURATION=300 ====== 12:19:23 (1536495563)
      [ 7722.541747] Lustre: DEBUG MARKER: DURATION=300 			MDSCOUNT=1 OSTCOUNT=7			RACER_ENABLE_REMOTE_DIRS=false 			RACER_ENABLE_STRIPED_DIRS=false 			RACER_ENABLE_MIGRATION=false 			RACER_ENABLE_PFL=true 			RACER_ENABLE_DOM=true 			RACER_ENABLE_FLR=true 			LFS=/usr/bin/lfs 			/usr/lib64/lustre/t
      [ 7722.542686] Lustre: DEBUG MARKER: DURATION=300 			MDSCOUNT=1 OSTCOUNT=7			RACER_ENABLE_REMOTE_DIRS=false 			RACER_ENABLE_STRIPED_DIRS=false 			RACER_ENABLE_MIGRATION=false 			RACER_ENABLE_PFL=true 			RACER_ENABLE_DOM=true 			RACER_ENABLE_FLR=true 			LFS=/usr/bin/lfs 			/usr/lib64/lustre/t
      [ 7726.494344] Lustre: lfs: using old ioctl(LL_IOC_LOV_GETSTRIPE) on [0x200007164:0x60:0x0], use llapi_layout_get_by_path()
      [ 7920.259874] INFO: task dir_create.sh:14445 blocked for more than 120 seconds.
      [ 7920.260874] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [ 7920.261634] dir_create.sh   D ffff9bac39240fd0     0 14445  14433 0x00000080
      [ 7920.262414] Call Trace:
      [ 7920.262749]  [<ffffffffc0e427e2>] ? ll_dcompare+0x72/0x2e0 [lustre]
      [ 7920.263384]  [<ffffffff88f14f49>] schedule_preempt_disabled+0x29/0x70
      [ 7920.264027]  [<ffffffff88f12d07>] __mutex_lock_slowpath+0xc7/0x1d0
      [ 7920.264623]  [<ffffffff88f120ef>] mutex_lock+0x1f/0x2f
      [ 7920.265151]  [<ffffffff88f0bcde>] lookup_slow+0x33/0xa7
      [ 7920.265692]  [<ffffffff88a2927f>] link_path_walk+0x80f/0x8b0
      [ 7920.266253]  [<ffffffff88a2b75d>] ? do_last+0x66d/0x12c0
      [ 7920.266782]  [<ffffffff88a2c465>] path_openat+0xb5/0x640
      [ 7920.267314]  [<ffffffff88a2e01d>] do_filp_open+0x4d/0xb0
      [ 7920.267848]  [<ffffffff88a3b4c4>] ? __alloc_fd+0xc4/0x170
      [ 7920.268392]  [<ffffffff88a1a327>] do_sys_open+0x137/0x240
      [ 7920.268943]  [<ffffffff88f206d5>] ? system_call_after_swapgs+0xa2/0x146
      [ 7920.269595]  [<ffffffff88a1a44e>] SyS_open+0x1e/0x20
      [ 7920.270104]  [<ffffffff88f20795>] system_call_fastpath+0x1c/0x21
      [ 7920.270693]  [<ffffffff88f206e1>] ? system_call_after_swapgs+0xae/0x146
      [ 7920.271383] INFO: task ls:23458 blocked for more than 120 seconds.
      [ 7920.271998] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [ 7920.272770] ls              D ffff9bac1fd82f70     0 23458  14490 0x00000080
      [ 7920.273503] Call Trace:
      [ 7920.273774]  [<ffffffff88f14f49>] schedule_preempt_disabled+0x29/0x70
      [ 7920.274406]  [<ffffffff88f12d07>] __mutex_lock_slowpath+0xc7/0x1d0
      [ 7920.275021]  [<ffffffff88f120ef>] mutex_lock+0x1f/0x2f
      [ 7920.275525]  [<ffffffff88a2b37f>] do_last+0x28f/0x12c0
      [ 7920.276048]  [<ffffffff88a2c487>] path_openat+0xd7/0x640
      [ 7920.276582]  [<ffffffffc0e4e929>] ? ll_file_data_put+0x89/0x180 [lustre]
      [ 7920.277251]  [<ffffffff88a2e01d>] do_filp_open+0x4d/0xb0
      [ 7920.277784]  [<ffffffff88a3b4c4>] ? __alloc_fd+0xc4/0x170
      [ 7920.278318]  [<ffffffff88a1a327>] do_sys_open+0x137/0x240
      [ 7920.278858]  [<ffffffff88f206d5>] ? system_call_after_swapgs+0xa2/0x146
      [ 7920.279515]  [<ffffffff88a1a464>] SyS_openat+0x14/0x20
      [ 7920.280033]  [<ffffffff88f20795>] system_call_fastpath+0x1c/0x21
      [ 7920.280614]  [<ffffffff88f206e1>] ? system_call_after_swapgs+0xae/0x146
      [ 7920.281265] INFO: task ls:23462 blocked for more than 120 seconds.
      [ 7920.281868] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [ 7920.282636] ls              D ffff9bac1fc13f40     0 23462  14490 0x00000080
      [ 7920.283377] Call Trace:
      [ 7920.283638]  [<ffffffff88f14f49>] schedule_preempt_disabled+0x29/0x70
      [ 7920.284279]  [<ffffffff88f12d07>] __mutex_lock_slowpath+0xc7/0x1d0
      [ 7920.284885]  [<ffffffff88f120ef>] mutex_lock+0x1f/0x2f
      [ 7920.285399]  [<ffffffff88a2b37f>] do_last+0x28f/0x12c0
      [ 7920.285922]  [<ffffffff88a2c487>] path_openat+0xd7/0x640
      [ 7920.286448]  [<ffffffffc0e4e929>] ? ll_file_data_put+0x89/0x180 [lustre]
      [ 7920.287110]  [<ffffffff88a2e01d>] do_filp_open+0x4d/0xb0
      [ 7920.287631]  [<ffffffff88a3b4c4>] ? __alloc_fd+0xc4/0x170
      [ 7920.288173]  [<ffffffff88a1a327>] do_sys_open+0x137/0x240
      [ 7920.288717]  [<ffffffff88f206d5>] ? system_call_after_swapgs+0xa2/0x146
      [ 7920.289361]  [<ffffffff88a1a464>] SyS_openat+0x14/0x20
      [ 7920.289878]  [<ffffffff88f20795>] system_call_fastpath+0x1c/0x21
      [ 7920.290470]  [<ffffffff88f206e1>] ? system_call_after_swapgs+0xae/0x146
      [ 7920.291141] INFO: task ls:23464 blocked for more than 120 seconds.
      [ 7920.291746] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [ 7920.292497] ls              D ffff9bac20318000     0 23464  14490 0x00000080
      [ 7920.293256] Call Trace:
      [ 7920.293500]  [<ffffffff88f14f49>] schedule_preempt_disabled+0x29/0x70
      [ 7920.294150]  [<ffffffff88f12d07>] __mutex_lock_slowpath+0xc7/0x1d0
      [ 7920.294767]  [<ffffffff88f120ef>] mutex_lock+0x1f/0x2f
      [ 7920.295280]  [<ffffffff88a2b37f>] do_last+0x28f/0x12c0
      [ 7920.295790]  [<ffffffff88a2c487>] path_openat+0xd7/0x640
      [ 7920.296327]  [<ffffffffc0e4e929>] ? ll_file_data_put+0x89/0x180 [lustre]
      [ 7920.296990]  [<ffffffff88a2e01d>] do_filp_open+0x4d/0xb0
      [ 7920.297510]  [<ffffffff88a3b4c4>] ? __alloc_fd+0xc4/0x170
      [ 7920.298056]  [<ffffffff88a1a327>] do_sys_open+0x137/0x240
      [ 7920.298587]  [<ffffffff88f206d5>] ? system_call_after_swapgs+0xa2/0x146
      [ 7920.299256]  [<ffffffff88a1a464>] SyS_openat+0x14/0x20
      [ 7920.299775]  [<ffffffff88f20795>] system_call_fastpath+0x1c/0x21
      [ 7920.300367]  [<ffffffff88f206e1>] ? system_call_after_swapgs+0xae/0x146
      [ 7920.301015] INFO: task ls:23466 blocked for more than 120 seconds.
      [ 7920.301612] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [ 7920.302395] ls              D ffff9bac2031af70     0 23466  14490 0x00000080
      [ 7920.303136] Call Trace:
      [ 7920.303397]  [<ffffffff88f14f49>] schedule_preempt_disabled+0x29/0x70
      [ 7920.304035]  [<ffffffff88f12d07>] __mutex_lock_slowpath+0xc7/0x1d0
      [ 7920.304632]  [<ffffffff88f120ef>] mutex_lock+0x1f/0x2f
      [ 7920.305155]  [<ffffffff88a2b37f>] do_last+0x28f/0x12c0
      [ 7920.305673]  [<ffffffff88a2c487>] path_openat+0xd7/0x640
      [ 7920.306200]  [<ffffffff88a2cd4f>] ? getname_flags+0x4f/0x1a0
      [ 7920.306762]  [<ffffffffc0e4e929>] ? ll_file_data_put+0x89/0x180 [lustre]
      [ 7920.307431]  [<ffffffff88a2e01d>] do_filp_open+0x4d/0xb0
      [ 7920.307963]  [<ffffffff88a3b4c4>] ? __alloc_fd+0xc4/0x170
      [ 7920.308493]  [<ffffffff88a1a327>] do_sys_open+0x137/0x240
      [ 7920.309037]  [<ffffffff88f206d5>] ? system_call_after_swapgs+0xa2/0x146
      [ 7920.309683]  [<ffffffff88a1a464>] SyS_openat+0x14/0x20
      [ 7920.310196]  [<ffffffff88f20795>] system_call_fastpath+0x1c/0x21
      [ 7920.310786]  [<ffffffff88f206e1>] ? system_call_after_swapgs+0xae/0x146
      [ 7920.311447] INFO: task getfattr:1379 blocked for more than 120 seconds.
      [ 7920.312100] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [ 7920.312876] getfattr        D ffff9bac1d229fa0     0  1379  14538 0x00000080
      [ 7920.313611] Call Trace:
      [ 7920.313884]  [<ffffffff88f14f49>] schedule_preempt_disabled+0x29/0x70
      [ 7920.314513]  [<ffffffff88f12d07>] __mutex_lock_slowpath+0xc7/0x1d0
      [ 7920.315128]  [<ffffffff88f120ef>] mutex_lock+0x1f/0x2f
      [ 7920.315631]  [<ffffffff88f0bcde>] lookup_slow+0x33/0xa7
      [ 7920.316162]  [<ffffffff88a2aa08>] path_lookupat+0x838/0x8b0
      [ 7920.316722]  [<ffffffff88a42b12>] ? generic_getxattr+0x52/0x70
      [ 7920.317294]  [<ffffffff88a42f6a>] ? vfs_getxattr+0x8a/0xb0
      [ 7920.317855]  [<ffffffff889f7a35>] ? kmem_cache_alloc+0x35/0x1f0
      [ 7920.318436]  [<ffffffff88a2cd4f>] ? getname_flags+0x4f/0x1a0
      [ 7920.319001]  [<ffffffff88a2aaab>] filename_lookup+0x2b/0xc0
      [ 7920.319544]  [<ffffffff88a2dee7>] user_path_at_empty+0x67/0xc0
      [ 7920.320131]  [<ffffffff88f1b57c>] ? __do_page_fault+0x1bc/0x4f0
      [ 7920.320718]  [<ffffffff88a2df51>] user_path_at+0x11/0x20
      [ 7920.321247]  [<ffffffff88a43d2a>] SyS_getxattr+0x4a/0xc0
      [ 7920.321777]  [<ffffffff88f206d5>] ? system_call_after_swapgs+0xa2/0x146
      [ 7920.322439]  [<ffffffff88f206e1>] ? system_call_after_swapgs+0xae/0x146
      [ 7920.323091]  [<ffffffff88f20795>] system_call_fastpath+0x1c/0x21
      [ 7920.323689]  [<ffffffff88f206e1>] ? system_call_after_swapgs+0xae/0x146
      [ 7920.324352] INFO: task ln:2464 blocked for more than 120 seconds.
      [ 7920.324955] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [ 7920.325711] ln              D ffff9bac21380fd0     0  2464  14485 0x00000080
      [ 7920.326458] Call Trace:
      [ 7920.326712]  [<ffffffff88a2a2f2>] ? path_lookupat+0x122/0x8b0
      [ 7920.327296]  [<ffffffff88f14f49>] schedule_preempt_disabled+0x29/0x70
      [ 7920.327932]  [<ffffffff88f12d07>] __mutex_lock_slowpath+0xc7/0x1d0
      [ 7920.328532]  [<ffffffff88f120ef>] mutex_lock+0x1f/0x2f
      [ 7920.329053]  [<ffffffff88a2abc5>] filename_create+0x85/0x180
      [ 7920.329608]  [<ffffffff889f7a35>] ? kmem_cache_alloc+0x35/0x1f0
      [ 7920.330201]  [<ffffffff88a2cd4f>] ? getname_flags+0x4f/0x1a0
      [ 7920.330775]  [<ffffffff88a2cdc4>] ? getname_flags+0xc4/0x1a0
      [ 7920.331338]  [<ffffffff88a2d071>] user_path_create+0x41/0x60
      [ 7920.331898]  [<ffffffff88a2e5e8>] SyS_symlinkat+0x58/0x110
      [ 7920.332441]  [<ffffffff88f206d5>] ? system_call_after_swapgs+0xa2/0x146
      [ 7920.333091]  [<ffffffff88f206e1>] ? system_call_after_swapgs+0xae/0x146
      [ 7920.333755]  [<ffffffff88a2e6b6>] SyS_symlink+0x16/0x20
      [ 7920.334279]  [<ffffffff88f20795>] system_call_fastpath+0x1c/0x21
      [ 7920.334869]  [<ffffffff88f206e1>] ? system_call_after_swapgs+0xae/0x146
      [ 7920.335512] INFO: task ls:2582 blocked for more than 120 seconds.
      [ 7920.336114] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [ 7920.336890] ls              D ffff9bac1d329fa0     0  2582  14455 0x00000080
      [ 7920.337619] Call Trace:
      [ 7920.337894]  [<ffffffff88f14f49>] schedule_preempt_disabled+0x29/0x70
      [ 7920.338519]  [<ffffffff88f12d07>] __mutex_lock_slowpath+0xc7/0x1d0
      [ 7920.339135]  [<ffffffff88f120ef>] mutex_lock+0x1f/0x2f
      [ 7920.339640]  [<ffffffff88a2b37f>] do_last+0x28f/0x12c0
      [ 7920.340160]  [<ffffffff88a2c487>] path_openat+0xd7/0x640
      [ 7920.340700]  [<ffffffffc0e4e929>] ? ll_file_data_put+0x89/0x180 [lustre]
      [ 7920.341373]  [<ffffffff88a2e01d>] do_filp_open+0x4d/0xb0
      [ 7920.341902]  [<ffffffff88a3b4c4>] ? __alloc_fd+0xc4/0x170
      [ 7920.342438]  [<ffffffff88a1a327>] do_sys_open+0x137/0x240
      [ 7920.342981]  [<ffffffff88f206d5>] ? system_call_after_swapgs+0xa2/0x146
      [ 7920.343618]  [<ffffffff88a1a464>] SyS_openat+0x14/0x20
      [ 7920.344142]  [<ffffffff88f20795>] system_call_fastpath+0x1c/0x21
      [ 7920.344757]  [<ffffffff88f206e1>] ? system_call_after_swapgs+0xae/0x146
      [ 7920.345402] INFO: task ls:2583 blocked for more than 120 seconds.
      [ 7920.346003] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [ 7920.346774] ls              D ffff9bac1d32dee0     0  2583  14455 0x00000080
      [ 7920.347511] Call Trace:
      [ 7920.347787]  [<ffffffff88f14f49>] schedule_preempt_disabled+0x29/0x70
      [ 7920.348417]  [<ffffffff88f12d07>] __mutex_lock_slowpath+0xc7/0x1d0
      [ 7920.349033]  [<ffffffff88f120ef>] mutex_lock+0x1f/0x2f
      [ 7920.349539]  [<ffffffff88a2b37f>] do_last+0x28f/0x12c0
      [ 7920.350059]  [<ffffffff88a2c487>] path_openat+0xd7/0x640
      [ 7920.350587]  [<ffffffffc0e4e929>] ? ll_file_data_put+0x89/0x180 [lustre]
      [ 7920.351271]  [<ffffffff88a2e01d>] do_filp_open+0x4d/0xb0
      [ 7920.351805]  [<ffffffff88a3b4c4>] ? __alloc_fd+0xc4/0x170
      [ 7920.352339]  [<ffffffff88a1a327>] do_sys_open+0x137/0x240
      [ 7920.352881]  [<ffffffff88f206d5>] ? system_call_after_swapgs+0xa2/0x146
      [ 7920.353524]  [<ffffffff88a1a464>] SyS_openat+0x14/0x20
      [ 7920.354050]  [<ffffffff88f20795>] system_call_fastpath+0x1c/0x21
      [ 7920.354631]  [<ffffffff88f206e1>] ? system_call_after_swapgs+0xae/0x146
      [ 7920.355301] INFO: task ls:2584 blocked for more than 120 seconds.
      [ 7920.355903] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [ 7920.356685] ls              D ffff9bac1e540fd0     0  2584  14455 0x00000080
      [ 7920.357413] Call Trace:
      [ 7920.357688]  [<ffffffff88f14f49>] schedule_preempt_disabled+0x29/0x70
      [ 7920.358315]  [<ffffffff88f12d07>] __mutex_lock_slowpath+0xc7/0x1d0
      [ 7920.358927]  [<ffffffff88f120ef>] mutex_lock+0x1f/0x2f
      [ 7920.359432]  [<ffffffff88a2b37f>] do_last+0x28f/0x12c0
      [ 7920.359955]  [<ffffffff88a2c487>] path_openat+0xd7/0x640
      [ 7920.360484]  [<ffffffffc0e4e929>] ? ll_file_data_put+0x89/0x180 [lustre]
      [ 7920.361166]  [<ffffffff88a2e01d>] do_filp_open+0x4d/0xb0
      [ 7920.361698]  [<ffffffff88a3b4c4>] ? __alloc_fd+0xc4/0x170
      [ 7920.362231]  [<ffffffff88a1a327>] do_sys_open+0x137/0x240
      [ 7920.362768]  [<ffffffff88f206d5>] ? system_call_after_swapgs+0xa2/0x146
      [ 7920.363411]  [<ffffffff88a1a464>] SyS_openat+0x14/0x20
      [ 7920.363931]  [<ffffffff88f20795>] system_call_fastpath+0x1c/0x21
      [ 7920.364512]  [<ffffffff88f206e1>] ? system_call_after_swapgs+0xae/0x146
      [ 8022.975327] LustreError: 4302:0:(file.c:4366:ll_inode_revalidate_fini()) lustre: revalidate FID [0x200007161:0x1:0x0] error: rc = -4
      [ 8022.975389] LustreError: 4303:0:(file.c:4366:ll_inode_revalidate_fini()) lustre: revalidate FID [0x200007161:0x1:0x0] error: rc = -4
      [ 8022.975622] LustreError: 4305:0:(file.c:4366:ll_inode_revalidate_fini()) lustre: revalidate FID [0x200007161:0x1:0x0] error: rc = -4
      

      Description continued in the first comment due to JIRA description line limits.

        Attachments

          Issue Links

            Activity

              People

              Assignee:
              tappro Mikhail Pershin
              Reporter:
              jamesanunez James Nunez
              Votes:
              0 Vote for this issue
              Watchers:
              12 Start watching this issue

                Dates

                Created:
                Updated:
                Resolved: