Lustre / LU-11358

racer test 1 hangs in locking with DNE

Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.12.0, Lustre 2.12.1, Lustre 2.12.3, Lustre 2.12.6
    • Severity: 3

    Description

      For https://testing.whamcloud.com/test_sets/122c1c1e-b529-11e8-8c12-52540065bddc, racer test_1 hangs while running in a DNE environment.
      On the first client (vm1), the dir_create.sh task is hung. From the client console log, we see:

      [  641.878967] LustreError: 29399:0:(lov_object.c:1208:lov_layout_change()) lustre-clilov-ffff8c19f9a62000: cannot apply new layout on [0x240000404:0x23d:0x0] : rc = -22
      [  641.881511] LustreError: 29399:0:(lcommon_cl.c:187:cl_file_inode_init()) Failure to initialize cl object [0x240000404:0x23d:0x0]: -22
      [  649.716504] LustreError: 5568:0:(lov_object.c:1208:lov_layout_change()) lustre-clilov-ffff8c19f9a62000: cannot apply new layout on [0x240000404:0x23d:0x0] : rc = -22
      [  649.719224] LustreError: 5568:0:(lcommon_cl.c:187:cl_file_inode_init()) Failure to initialize cl object [0x240000404:0x23d:0x0]: -22
      [  715.441686] Lustre: lustre-OST0003-osc-ffff8c19f9a62000: Connection to lustre-OST0003 (at 10.2.4.95@tcp) was lost; in progress operations using this service will wait for recovery to complete
      [  715.444729] LustreError: 167-0: lustre-OST0003-osc-ffff8c19f9a62000: This client was evicted by lustre-OST0003; in progress operations using this service will fail.
      [  715.446755] LustreError: 18480:0:(ldlm_resource.c:1146:ldlm_resource_complain()) lustre-OST0003-osc-ffff8c19f9a62000: namespace resource [0x57:0x0:0x0].0x0 (ffff8c19dec760c0) refcount nonzero (1) after lock cleanup; forcing cleanup.
      [  719.586481] Lustre: 4306:0:(client.c:2126:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1536596966/real 1536596966]  req@ffff8c19e35b1800 x1611238269322960/t0(0) o36->lustre-MDT0000-mdc-ffff8c19fbdd9000@10.2.4.96@tcp:12/10 lens 608/4768 e 0 to 1 dl 1536597121 ref 2 fl Rpc:IX/0/ffffffff rc 0/-1
      [  719.589415] LustreError: 4306:0:(llite_lib.c:1559:ll_md_setattr()) md_setattr fails: rc = -4
      [  719.593107] Lustre: lustre-MDT0000-mdc-ffff8c19fbdd9000: Connection restored to 10.2.4.96@tcp (at 10.2.4.96@tcp)
      [  840.233545] INFO: task dir_create.sh:28323 blocked for more than 120 seconds.
      [  840.234422] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [  840.235280] dir_create.sh   D ffff8c19e2766eb0     0 28323      1 0x00000084
      [  840.236116] Call Trace:
      [  840.236610]  [<ffffffffc080985c>] ? jobid_interpret_string+0x12c/0x1c0 [obdclass]
      [  840.237611]  [<ffffffff9f514029>] schedule+0x29/0x70
      [  840.238134]  [<ffffffff9f511999>] schedule_timeout+0x239/0x2c0
      [  840.238757]  [<ffffffff9f1516dd>] ? radix_tree_gang_lookup+0xcd/0x150
      [  840.239748]  [<ffffffffc0ba9ab7>] ? osc_page_gang_lookup+0x307/0x370 [osc]
      [  840.240501]  [<ffffffff9f5143dd>] wait_for_completion+0xfd/0x140
      [  840.241191]  [<ffffffff9eecf670>] ? wake_up_state+0x20/0x20
      [  840.241775]  [<ffffffffc0ba3688>] osc_io_setattr_end+0xe8/0x200 [osc]
      [  840.242554]  [<ffffffffc0bf4fa0>] ? lov_io_iter_fini_wrapper+0x50/0x50 [lov]
      [  840.243333]  [<ffffffffc0803f70>] cl_io_end+0x60/0x150 [obdclass]
      [  840.243993]  [<ffffffffc0bf507b>] lov_io_end_wrapper+0xdb/0xe0 [lov]
      [  840.244789]  [<ffffffffc0bf5507>] lov_io_call.isra.5+0x87/0x140 [lov]
      [  840.245626]  [<ffffffffc0bf55f6>] lov_io_end+0x36/0xb0 [lov]
      [  840.246235]  [<ffffffffc0803f70>] cl_io_end+0x60/0x150 [obdclass]
      [  840.246867]  [<ffffffffc080661f>] cl_io_loop+0x13f/0xc70 [obdclass]
      [  840.247719]  [<ffffffffc0c959c0>] cl_setattr_ost+0x250/0x3c0 [lustre]
      [  840.248424]  [<ffffffffc0c71b6f>] ll_setattr_raw+0xe7f/0xee0 [lustre]
      [  840.249146]  [<ffffffffc0c71c3c>] ll_setattr+0x6c/0xd0 [lustre]
      [  840.249881]  [<ffffffff9f039f14>] notify_change+0x2c4/0x420
      [  840.250539]  [<ffffffff9f0f3a5e>] ? process_measurement+0x8e/0x250
      [  840.251186]  [<ffffffff9f019185>] do_truncate+0x75/0xc0
      [  840.251733]  [<ffffffff9f02b717>] do_last+0x627/0x12c0
      [  840.252361]  [<ffffffff9f02c487>] path_openat+0xd7/0x640
      [  840.252995]  [<ffffffff9f02e01d>] do_filp_open+0x4d/0xb0
      [  840.253633]  [<ffffffff9f03b4c4>] ? __alloc_fd+0xc4/0x170
      [  840.254238]  [<ffffffff9f01a327>] do_sys_open+0x137/0x240
      [  840.254805]  [<ffffffff9f5206d5>] ? system_call_after_swapgs+0xa2/0x146
      [  840.255637]  [<ffffffff9f01a44e>] SyS_open+0x1e/0x20
      [  840.256167]  [<ffffffff9f520795>] system_call_fastpath+0x1c/0x21
      [  840.256788]  [<ffffffff9f5206e1>] ? system_call_after_swapgs+0xae/0x146
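
The negative rc values in the log above are negated Linux errno codes (rc = -22 is EINVAL from the failed layout change, rc = -4 is EINTR from the interrupted setattr). A minimal sketch for decoding them during triage, assuming the Linux errno numbering the console logs use:

```python
import errno

def rc_name(rc):
    """Map a Lustre console 'rc = -N' value to its errno symbol.

    Assumes Linux errno numbering, which is what these console logs use.
    """
    return errno.errorcode.get(-rc, "UNKNOWN")

# rc values seen in this ticket's logs:
for rc in (-22, -4, -116, -125):
    print(rc, rc_name(rc))
```

On Linux this decodes to EINVAL and EINTR for the client errors above, and ESTALE and ECANCELED for the llog-cancel and LNet health messages on the servers.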
      

      On the second client (vm2), several tasks are blocked (setfattr, mv, ln, touch, lfs, cat, chmod, …), with the log finally reporting "md_setattr fails: rc = -4". From the client console log, we see:

      [  960.234201] INFO: task setfattr:4379 blocked for more than 120 seconds.
      [  960.235152] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [  960.236088] setfattr        D ffff88319d295ee0     0  4379  10275 0x00000080
      [  960.236836] Call Trace:
      [  960.237118]  [<ffffffff89e27040>] ? generic_permission+0x110/0x1d0
      [  960.237930]  [<ffffffff8a314f49>] schedule_preempt_disabled+0x29/0x70
      [  960.238864]  [<ffffffff8a312d07>] __mutex_lock_slowpath+0xc7/0x1d0
      [  960.239574]  [<ffffffff8a3120ef>] mutex_lock+0x1f/0x2f
      [  960.240178]  [<ffffffff89e432af>] vfs_removexattr+0x5f/0x130
      [  960.241016]  [<ffffffff89e433d5>] removexattr+0x55/0x80
      [  960.241569]  [<ffffffff89e2ccdd>] ? putname+0x3d/0x60
      [  960.242271]  [<ffffffff89e2def2>] ? user_path_at_empty+0x72/0xc0
      [  960.242971]  [<ffffffff89e1e418>] ? __sb_start_write+0x58/0x110
      [  960.243748]  [<ffffffff8a31b57c>] ? __do_page_fault+0x1bc/0x4f0
      [  960.244509]  [<ffffffff89e3e03c>] ? mnt_want_write+0x2c/0x50
      [  960.245274]  [<ffffffff89e441d4>] SyS_removexattr+0x94/0xd0
      [  960.245851]  [<ffffffff8a320795>] system_call_fastpath+0x1c/0x21
      [  960.246428]  [<ffffffff8a3206e1>] ? system_call_after_swapgs+0xae/0x146
      [  960.247093] INFO: task setfattr:4708 blocked for more than 120 seconds.
      [  960.247889] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [  960.248730] setfattr        D ffff88319d285ee0     0  4708  10256 0x00000080
      [  960.249474] Call Trace:
      [  960.249731]  [<ffffffff89e27040>] ? generic_permission+0x110/0x1d0
      [  960.250601]  [<ffffffff8a314f49>] schedule_preempt_disabled+0x29/0x70
      [  960.251286]  [<ffffffff8a312d07>] __mutex_lock_slowpath+0xc7/0x1d0
      [  960.251901]  [<ffffffff8a3120ef>] mutex_lock+0x1f/0x2f
      [  960.252397]  [<ffffffff89e432af>] vfs_removexattr+0x5f/0x130
      [  960.253026]  [<ffffffff89e433d5>] removexattr+0x55/0x80
      [  960.253722]  [<ffffffff89e2ccdd>] ? putname+0x3d/0x60
      [  960.254310]  [<ffffffff89e2def2>] ? user_path_at_empty+0x72/0xc0
      [  960.255005]  [<ffffffff89e1e418>] ? __sb_start_write+0x58/0x110
      [  960.255580]  [<ffffffff8a31b57c>] ? __do_page_fault+0x1bc/0x4f0
      [  960.256302]  [<ffffffff89e3e03c>] ? mnt_want_write+0x2c/0x50
      [  960.256943]  [<ffffffff89e441d4>] SyS_removexattr+0x94/0xd0
      [  960.257558]  [<ffffffff8a320795>] system_call_fastpath+0x1c/0x21
      [  960.258238]  [<ffffffff8a3206e1>] ? system_call_after_swapgs+0xae/0x146
      [  960.258922] INFO: task mv:4894 blocked for more than 120 seconds.
      [  960.259514] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [  960.260406] mv              D ffff8831abf83f40     0  4894  10227 0x00000080
      [  960.261156] Call Trace:
      [  960.261461]  [<ffffffff89ed1072>] ? security_inode_permission+0x22/0x30
      [  960.262321]  [<ffffffff8a314f49>] schedule_preempt_disabled+0x29/0x70
      [  960.263071]  [<ffffffff8a312d07>] __mutex_lock_slowpath+0xc7/0x1d0
      [  960.263708]  [<ffffffff8a3120ef>] mutex_lock+0x1f/0x2f
      [  960.264213]  [<ffffffff8a30bcde>] lookup_slow+0x33/0xa7
      [  960.264920]  [<ffffffff89e2aa08>] path_lookupat+0x838/0x8b0
      [  960.265479]  [<ffffffff89df7a35>] ? kmem_cache_alloc+0x35/0x1f0
      [  960.266156]  [<ffffffff89e2cd4f>] ? getname_flags+0x4f/0x1a0
      [  960.266720]  [<ffffffff89e2aaab>] filename_lookup+0x2b/0xc0
      [  960.267252]  [<ffffffff89e2dee7>] user_path_at_empty+0x67/0xc0
      [  960.267959]  [<ffffffff89d25092>] ? from_kgid_munged+0x12/0x20
      [  960.268570]  [<ffffffff89e2142f>] ? cp_new_stat+0x14f/0x180
      [  960.269147]  [<ffffffff89e2df51>] user_path_at+0x11/0x20
      [  960.269667]  [<ffffffff89e20f23>] vfs_fstatat+0x63/0xc0
      [  960.270189]  [<ffffffff89e214f1>] SYSC_newlstat+0x31/0x60
      [  960.270976]  [<ffffffff8a3206e1>] ? system_call_after_swapgs+0xae/0x146
      [  960.271851]  [<ffffffff8a3206d5>] ? system_call_after_swapgs+0xa2/0x146
      [  960.272595]  [<ffffffff8a3206e1>] ? system_call_after_swapgs+0xae/0x146
      [  960.273350]  [<ffffffff8a3206d5>] ? system_call_after_swapgs+0xa2/0x146
      [  960.274040]  [<ffffffff8a3206e1>] ? system_call_after_swapgs+0xae/0x146
      [  960.274856]  [<ffffffff8a3206d5>] ? system_call_after_swapgs+0xa2/0x146
      [  960.275794]  [<ffffffff8a3206e1>] ? system_call_after_swapgs+0xae/0x146
      [  960.276704]  [<ffffffff8a3206d5>] ? system_call_after_swapgs+0xa2/0x146
      [  960.277453]  [<ffffffff8a3206e1>] ? system_call_after_swapgs+0xae/0x146
      [  960.278135]  [<ffffffff8a3206d5>] ? system_call_after_swapgs+0xa2/0x146
      [  960.278810]  [<ffffffff8a3206e1>] ? system_call_after_swapgs+0xae/0x146
      [  960.279628]  [<ffffffff89e2177e>] SyS_newlstat+0xe/0x10
      [  960.280204]  [<ffffffff8a320795>] system_call_fastpath+0x1c/0x21
      [  960.280914]  [<ffffffff8a3206e1>] ? system_call_after_swapgs+0xae/0x146
      [  960.281654] INFO: task ln:5093 blocked for more than 120 seconds.
      [  960.282238] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [  960.283146] ln              D ffff88319d24af70     0  5093  10229 0x00000080
      [  960.283887] Call Trace:
      [  960.284172]  [<ffffffff89e2a2f2>] ? path_lookupat+0x122/0x8b0
      [  960.284908]  [<ffffffff89df5a59>] ? ___slab_alloc+0x209/0x4f0
      [  960.285587]  [<ffffffff8a314f49>] schedule_preempt_disabled+0x29/0x70
      [  960.286261]  [<ffffffff8a312d07>] __mutex_lock_slowpath+0xc7/0x1d0
      [  960.287045]  [<ffffffff8a3120ef>] mutex_lock+0x1f/0x2f
      [  960.287796]  [<ffffffff89e2abc5>] filename_create+0x85/0x180
      [  960.288559]  [<ffffffff89df7b9b>] ? kmem_cache_alloc+0x19b/0x1f0
      [  960.289344]  [<ffffffff89e2cd4f>] ? getname_flags+0x4f/0x1a0
      [  960.290171]  [<ffffffff89e2cdc4>] ? getname_flags+0xc4/0x1a0
      [  960.290901]  [<ffffffff89e2d071>] user_path_create+0x41/0x60
      [  960.291687]  [<ffffffff89e2e5e8>] SyS_symlinkat+0x58/0x110
      [  960.292335]  [<ffffffff8a3206d5>] ? system_call_after_swapgs+0xa2/0x146
      [  960.293033]  [<ffffffff8a3206e1>] ? system_call_after_swapgs+0xae/0x146
      [  960.293880]  [<ffffffff89e2e6b6>] SyS_symlink+0x16/0x20
      [  960.294530]  [<ffffffff8a320795>] system_call_fastpath+0x1c/0x21
      [  960.295233]  [<ffffffff8a3206e1>] ? system_call_after_swapgs+0xae/0x146
      [  960.295935] INFO: task touch:5326 blocked for more than 120 seconds.
      [  960.296628] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [  960.297687] touch           D ffff8831aad8eeb0     0  5326  10280 0x00000080
      [  960.298689] Call Trace:
      [  960.298949]  [<ffffffff8a314f49>] schedule_preempt_disabled+0x29/0x70
      [  960.299603]  [<ffffffff8a312d07>] __mutex_lock_slowpath+0xc7/0x1d0
      [  960.300357]  [<ffffffff8a3120ef>] mutex_lock+0x1f/0x2f
      [  960.300970]  [<ffffffff89e2b37f>] do_last+0x28f/0x12c0
      [  960.301611]  [<ffffffff89e2c487>] path_openat+0xd7/0x640
      [  960.302331]  [<ffffffff89e2e01d>] do_filp_open+0x4d/0xb0
      [  960.303145]  [<ffffffff89e3b4c4>] ? __alloc_fd+0xc4/0x170
      [  960.303935]  [<ffffffff89e1a327>] do_sys_open+0x137/0x240
      [  960.304594]  [<ffffffff8a3206d5>] ? system_call_after_swapgs+0xa2/0x146
      [  960.305239]  [<ffffffff89e1a44e>] SyS_open+0x1e/0x20
      [  960.305818]  [<ffffffff8a320795>] system_call_fastpath+0x1c/0x21
      [  960.306572]  [<ffffffff8a3206e1>] ? system_call_after_swapgs+0xae/0x146
      [  960.307450] INFO: task setfattr:5397 blocked for more than 120 seconds.
      [  960.308286] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [  960.309232] setfattr        D ffff8831a6433f40     0  5397  10258 0x00000080
      [  960.310116] Call Trace:
      [  960.310390]  [<ffffffff89e27040>] ? generic_permission+0x110/0x1d0
      [  960.311113]  [<ffffffff8a314f49>] schedule_preempt_disabled+0x29/0x70
      [  960.312040]  [<ffffffff8a312d07>] __mutex_lock_slowpath+0xc7/0x1d0
      [  960.312870]  [<ffffffff8a3120ef>] mutex_lock+0x1f/0x2f
      [  960.313448]  [<ffffffff89e43612>] vfs_setxattr+0x62/0xc0
      [  960.314029]  [<ffffffff89e4379c>] setxattr+0x12c/0x1c0
      [  960.314591]  [<ffffffff89df6ec2>] ? kmem_cache_free+0x1e2/0x200
      [  960.315167]  [<ffffffff89e2ccdd>] ? putname+0x3d/0x60
      [  960.315661]  [<ffffffff89e2def2>] ? user_path_at_empty+0x72/0xc0
      [  960.316354]  [<ffffffff89e1e418>] ? __sb_start_write+0x58/0x110
      [  960.316947]  [<ffffffff8a31b57c>] ? __do_page_fault+0x1bc/0x4f0
      [  960.317592]  [<ffffffff89e43a97>] SyS_setxattr+0xb7/0x100
      [  960.318115]  [<ffffffff8a320795>] system_call_fastpath+0x1c/0x21
      [  960.318753]  [<ffffffff8a3206e1>] ? system_call_after_swapgs+0xae/0x146
      [  960.319582] INFO: task ln:5444 blocked for more than 120 seconds.
      [  960.320167] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [  960.320996] ln              D ffff8831abf86eb0     0  5444  10267 0x00000080
      [  960.321733] Call Trace:
      [  960.322040]  [<ffffffff89e2a2f2>] ? path_lookupat+0x122/0x8b0
      [  960.322603]  [<ffffffff89df5a59>] ? ___slab_alloc+0x209/0x4f0
      [  960.323223]  [<ffffffff8a314f49>] schedule_preempt_disabled+0x29/0x70
      [  960.323886]  [<ffffffff8a312d07>] __mutex_lock_slowpath+0xc7/0x1d0
      [  960.324486]  [<ffffffff8a3120ef>] mutex_lock+0x1f/0x2f
      [  960.325130]  [<ffffffff89e2abc5>] filename_create+0x85/0x180
      [  960.325688]  [<ffffffff89df7b9b>] ? kmem_cache_alloc+0x19b/0x1f0
      [  960.326313]  [<ffffffff89e2cd4f>] ? getname_flags+0x4f/0x1a0
      [  960.326988]  [<ffffffff89e2cdc4>] ? getname_flags+0xc4/0x1a0
      [  960.327537]  [<ffffffff89e2d071>] user_path_create+0x41/0x60
      [  960.328165]  [<ffffffff89e2e5e8>] SyS_symlinkat+0x58/0x110
      [  960.328705]  [<ffffffff8a3206d5>] ? system_call_after_swapgs+0xa2/0x146
      [  960.329367]  [<ffffffff8a3206e1>] ? system_call_after_swapgs+0xae/0x146
      [  960.330116]  [<ffffffff89e2e6b6>] SyS_symlink+0x16/0x20
      [  960.330631]  [<ffffffff8a320795>] system_call_fastpath+0x1c/0x21
      [  960.331304]  [<ffffffff8a3206e1>] ? system_call_after_swapgs+0xae/0x146
      [  960.332054] INFO: task lfs:10586 blocked for more than 120 seconds.
      [  960.332725] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [  960.333471] lfs             D ffff8831abe38fd0     0 10586  10241 0x00000080
      [  960.334225] Call Trace:
      [  960.334539]  [<ffffffffc0d167e2>] ? ll_dcompare+0x72/0x2e0 [lustre]
      [  960.335222]  [<ffffffff8a314f49>] schedule_preempt_disabled+0x29/0x70
      [  960.335853]  [<ffffffff8a312d07>] __mutex_lock_slowpath+0xc7/0x1d0
      [  960.336513]  [<ffffffff8a3120ef>] mutex_lock+0x1f/0x2f
      [  960.337026]  [<ffffffffc0d32a9f>] ll_migrate+0x16f/0xcc0 [lustre]
      [  960.337669]  [<ffffffffc0d21a35>] ll_dir_ioctl+0x59c5/0x6560 [lustre]
      [  960.338587]  [<ffffffff89dc5a9c>] ? handle_pte_fault+0x2dc/0xc30
      [  960.339201]  [<ffffffff89e3dde4>] ? mntput+0x24/0x40
      [  960.339693]  [<ffffffff89dc7c3d>] ? handle_mm_fault+0x39d/0x9b0
      [  960.340288]  [<ffffffff89e30350>] do_vfs_ioctl+0x350/0x560
      [  960.340855]  [<ffffffff8a31b57c>] ? __do_page_fault+0x1bc/0x4f0
      [  960.341908]  [<ffffffff89e30601>] SyS_ioctl+0xa1/0xc0
      [  960.342852]  [<ffffffff8a3206d5>] ? system_call_after_swapgs+0xa2/0x146
      [  960.343595]  [<ffffffff8a320795>] system_call_fastpath+0x1c/0x21
      [  960.344181]  [<ffffffff8a3206e1>] ? system_call_after_swapgs+0xae/0x146
      [  960.345175] INFO: task cat:10688 blocked for more than 120 seconds.
      [  960.346073] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [  960.346939] cat             D ffff8831a95acf10     0 10688  10231 0x00000080
      [  960.347930] Call Trace:
      [  960.348208]  [<ffffffff89ed0a1e>] ? security_capable+0x1e/0x20
      [  960.348956]  [<ffffffff8a314f49>] schedule_preempt_disabled+0x29/0x70
      [  960.350080]  [<ffffffff8a312d07>] __mutex_lock_slowpath+0xc7/0x1d0
      [  960.350886]  [<ffffffff8a3120ef>] mutex_lock+0x1f/0x2f
      [  960.351527]  [<ffffffff8a30bcde>] lookup_slow+0x33/0xa7
      [  960.352142]  [<ffffffff89e2927f>] link_path_walk+0x80f/0x8b0
      [  960.352945]  [<ffffffff89e2c465>] path_openat+0xb5/0x640
      [  960.353620]  [<ffffffff89cf9172>] ? ktime_get+0x52/0xe0
      [  960.354387]  [<ffffffffc0d285ef>] ? ll_getattr+0x22f/0x580 [lustre]
      [  960.355148]  [<ffffffff89e2e01d>] do_filp_open+0x4d/0xb0
      [  960.355707]  [<ffffffff89e3b4c4>] ? __alloc_fd+0xc4/0x170
      [  960.356563]  [<ffffffff89e1a327>] do_sys_open+0x137/0x240
      [  960.357315]  [<ffffffff8a3206d5>] ? system_call_after_swapgs+0xa2/0x146
      [  960.358497]  [<ffffffff89e1a44e>] SyS_open+0x1e/0x20
      [  960.359026]  [<ffffffff8a320795>] system_call_fastpath+0x1c/0x21
      [  960.359646]  [<ffffffff8a3206e1>] ? system_call_after_swapgs+0xae/0x146
      [  960.360691] INFO: task chmod:11121 blocked for more than 120 seconds.
      [  960.361650] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [  960.362816] chmod           D ffff8831a32f8fd0     0 11121  10234 0x00000080
      [  960.363567] Call Trace:
      [  960.363824]  [<ffffffff89e2cd4f>] ? getname_flags+0x4f/0x1a0
      [  960.364484]  [<ffffffff8a314f49>] schedule_preempt_disabled+0x29/0x70
      [  960.365104]  [<ffffffff8a312d07>] __mutex_lock_slowpath+0xc7/0x1d0
      [  960.365787]  [<ffffffff8a3120ef>] mutex_lock+0x1f/0x2f
      [  960.366323]  [<ffffffff89e189c8>] chmod_common+0x78/0x160
      [  960.366924]  [<ffffffff8a31b57c>] ? __do_page_fault+0x1bc/0x4f0
      [  960.367607]  [<ffffffff89e19d17>] SyS_fchmodat+0x57/0xc0
      [  960.368125]  [<ffffffff8a320795>] system_call_fastpath+0x1c/0x21
      [  960.368777]  [<ffffffff8a3206e1>] ? system_call_after_swapgs+0xae/0x146
      [  990.121506] LustreError: 4369:0:(llite_lib.c:1559:ll_md_setattr()) md_setattr fails: rc = -4
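
With this many hung-task splats, it helps to summarize which commands are blocked before digging into individual traces. A small sketch (the regex is an assumption based on the standard hung-task message format shown above, not part of any test suite):

```python
import re

# Matches the kernel hung-task detector's report line.
HUNG_RE = re.compile(r"INFO: task (\S+):(\d+) blocked for more than (\d+) seconds")

def hung_tasks(console_log):
    """Return (name, pid) pairs for every hung-task report in a console log."""
    return [(m.group(1), int(m.group(2))) for m in HUNG_RE.finditer(console_log)]

log = """[  960.234201] INFO: task setfattr:4379 blocked for more than 120 seconds.
[  960.258922] INFO: task mv:4894 blocked for more than 120 seconds."""
print(hung_tasks(log))  # → [('setfattr', 4379), ('mv', 4894)]
```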
      

      On MDS2 and MDS4 (vm5), we see the following call traces in the console log:

      [  259.348108] Lustre: DEBUG MARKER: == racer test 1: racer on clients: onyx-30vm1.onyx.whamcloud.com,onyx-30vm2 DURATION=300 ============= 16:25:55 (1536596755)
      [  351.363812] LNet: Service thread pid 22468 was inactive for 62.20s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      [  351.367067] Pid: 22468, comm: mdt00_009 3.10.0-862.9.1.el7_lustre.x86_64 #1 SMP Thu Aug 30 12:08:10 UTC 2018
      [  351.368822] Call Trace:
      [  351.369374]  [<ffffffffc0d3df61>] ldlm_completion_ast+0x5b1/0x920 [ptlrpc]
      [  351.370937]  [<ffffffffc0d3ed83>] ldlm_cli_enqueue_local+0x233/0x860 [ptlrpc]
      [  351.372390]  [<ffffffffc10e63d7>] mdt_object_local_lock+0x4e7/0xb20 [mdt]
      [  351.373833]  [<ffffffffc10e6a80>] mdt_object_lock_internal+0x70/0x330 [mdt]
      [  351.375390]  [<ffffffffc10e6d60>] mdt_object_lock+0x20/0x30 [mdt]
      [  351.376680]  [<ffffffffc112186b>] mdt_brw_enqueue+0x44b/0x760 [mdt]
      [  351.377955]  [<ffffffffc10d44bf>] mdt_intent_brw+0x1f/0x30 [mdt]
      [  351.379141]  [<ffffffffc10ec9c8>] mdt_intent_policy+0x2f8/0xd10 [mdt]
      [  351.380500]  [<ffffffffc0d24e9e>] ldlm_lock_enqueue+0x34e/0xa50 [ptlrpc]
      [  351.381929]  [<ffffffffc0d4d503>] ldlm_handle_enqueue0+0x903/0x1520 [ptlrpc]
      [  351.383341]  [<ffffffffc0dd39b2>] tgt_enqueue+0x62/0x210 [ptlrpc]
      [  351.384680]  [<ffffffffc0ddc2fa>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
      [  351.386091]  [<ffffffffc0d7f48b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
      [  351.387592]  [<ffffffffc0d82cc4>] ptlrpc_main+0xb14/0x1fb0 [ptlrpc]
      [  351.388691]  [<ffffffffb9abb621>] kthread+0xd1/0xe0
      [  351.389324]  [<ffffffffba1205f7>] ret_from_fork_nospec_end+0x0/0x39
      [  351.390034]  [<ffffffffffffffff>] 0xffffffffffffffff
      [  351.390691] LustreError: dumping log to /tmp/lustre-log.1536596849.22468
      [  351.992686] Pid: 22484, comm: mdt00_019 3.10.0-862.9.1.el7_lustre.x86_64 #1 SMP Thu Aug 30 12:08:10 UTC 2018
      [  351.994620] Call Trace:
      [  351.995150]  [<ffffffffc0d3df61>] ldlm_completion_ast+0x5b1/0x920 [ptlrpc]
      [  351.996680]  [<ffffffffc0d3ed83>] ldlm_cli_enqueue_local+0x233/0x860 [ptlrpc]
      [  351.998519]  [<ffffffffc10e63d7>] mdt_object_local_lock+0x4e7/0xb20 [mdt]
      [  352.000268]  [<ffffffffc10e6a80>] mdt_object_lock_internal+0x70/0x330 [mdt]
      [  352.001606]  [<ffffffffc10e8f1a>] mdt_object_find_lock+0x6a/0x1a0 [mdt]
      [  352.002977]  [<ffffffffc1104b4e>] mdt_reint_setxattr+0x1ce/0xf20 [mdt]
      [  352.004283]  [<ffffffffc1103293>] mdt_reint_rec+0x83/0x210 [mdt]
      [  352.005561]  [<ffffffffc10e21e2>] mdt_reint_internal+0x6b2/0xa80 [mdt]
      [  352.006896]  [<ffffffffc10ed447>] mdt_reint+0x67/0x140 [mdt]
      [  352.007959]  [<ffffffffc0ddc2fa>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
      [  352.009256]  [<ffffffffc0d7f48b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
      [  352.010650]  [<ffffffffc0d82cc4>] ptlrpc_main+0xb14/0x1fb0 [ptlrpc]
      [  352.011847]  [<ffffffffb9abb621>] kthread+0xd1/0xe0
      [  352.012767]  [<ffffffffba1205f7>] ret_from_fork_nospec_end+0x0/0x39
      [  352.013960]  [<ffffffffffffffff>] 0xffffffffffffffff
      [  352.899805] Pid: 22464, comm: mdt00_005 3.10.0-862.9.1.el7_lustre.x86_64 #1 SMP Thu Aug 30 12:08:10 UTC 2018
      [  352.901685] Call Trace:
      [  352.902186]  [<ffffffffc0d3df61>] ldlm_completion_ast+0x5b1/0x920 [ptlrpc]
      [  352.903489]  [<ffffffffc0d3ed83>] ldlm_cli_enqueue_local+0x233/0x860 [ptlrpc]
      [  352.905019]  [<ffffffffc10e63d7>] mdt_object_local_lock+0x4e7/0xb20 [mdt]
      [  352.906256]  [<ffffffffc10e6a80>] mdt_object_lock_internal+0x70/0x330 [mdt]
      [  352.907593]  [<ffffffffc10e6df7>] mdt_object_lock_try+0x27/0xb0 [mdt]
      [  352.908889]  [<ffffffffc10e8547>] mdt_getattr_name_lock+0x1297/0x1c00 [mdt]
      [  352.910163]  [<ffffffffc10efae5>] mdt_intent_getattr+0x2b5/0x480 [mdt]
      [  352.911631]  [<ffffffffc10ec9c8>] mdt_intent_policy+0x2f8/0xd10 [mdt]
      [  352.912921]  [<ffffffffc0d24e9e>] ldlm_lock_enqueue+0x34e/0xa50 [ptlrpc]
      [  352.914176]  [<ffffffffc0d4d503>] ldlm_handle_enqueue0+0x903/0x1520 [ptlrpc]
      [  352.915530]  [<ffffffffc0dd39b2>] tgt_enqueue+0x62/0x210 [ptlrpc]
      [  352.916805]  [<ffffffffc0ddc2fa>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
      [  352.918088]  [<ffffffffc0d7f48b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
      [  352.919538]  [<ffffffffc0d82cc4>] ptlrpc_main+0xb14/0x1fb0 [ptlrpc]
      [  352.920825]  [<ffffffffb9abb621>] kthread+0xd1/0xe0
      [  352.921765]  [<ffffffffba1205f7>] ret_from_fork_nospec_end+0x0/0x39
      [  352.922967]  [<ffffffffffffffff>] 0xffffffffffffffff
      [  352.924148] LustreError: dumping log to /tmp/lustre-log.1536596850.22464
      [  353.155770] Pid: 14392, comm: mdt00_002 3.10.0-862.9.1.el7_lustre.x86_64 #1 SMP Thu Aug 30 12:08:10 UTC 2018
      [  353.157590] Call Trace:
      [  353.158104]  [<ffffffffc0d3df61>] ldlm_completion_ast+0x5b1/0x920 [ptlrpc]
      [  353.159532]  [<ffffffffc0d3ed83>] ldlm_cli_enqueue_local+0x233/0x860 [ptlrpc]
      [  353.160957]  [<ffffffffc10e63d7>] mdt_object_local_lock+0x4e7/0xb20 [mdt]
      [  353.162198]  [<ffffffffc10e6a80>] mdt_object_lock_internal+0x70/0x330 [mdt]
      [  353.163514]  [<ffffffffc10e6df7>] mdt_object_lock_try+0x27/0xb0 [mdt]
      [  353.164788]  [<ffffffffc10e8547>] mdt_getattr_name_lock+0x1297/0x1c00 [mdt]
      [  353.166069]  [<ffffffffc10efae5>] mdt_intent_getattr+0x2b5/0x480 [mdt]
      [  353.167514]  [<ffffffffc10ec9c8>] mdt_intent_policy+0x2f8/0xd10 [mdt]
      [  353.168828]  [<ffffffffc0d24e9e>] ldlm_lock_enqueue+0x34e/0xa50 [ptlrpc]
      [  353.170114]  [<ffffffffc0d4d503>] ldlm_handle_enqueue0+0x903/0x1520 [ptlrpc]
      [  353.171477]  [<ffffffffc0dd39b2>] tgt_enqueue+0x62/0x210 [ptlrpc]
      [  353.172752]  [<ffffffffc0ddc2fa>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
      [  353.174046]  [<ffffffffc0d7f48b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
      [  353.175514]  [<ffffffffc0d82cc4>] ptlrpc_main+0xb14/0x1fb0 [ptlrpc]
      [  353.176790]  [<ffffffffb9abb621>] kthread+0xd1/0xe0
      [  353.177702]  [<ffffffffba1205f7>] ret_from_fork_nospec_end+0x0/0x39
      [  353.179138]  [<ffffffffffffffff>] 0xffffffffffffffff
      [  354.435795] Pid: 22466, comm: mdt00_007 3.10.0-862.9.1.el7_lustre.x86_64 #1 SMP Thu Aug 30 12:08:10 UTC 2018
      [  354.437623] Call Trace:
      [  354.438121]  [<ffffffffc0d3df61>] ldlm_completion_ast+0x5b1/0x920 [ptlrpc]
      [  354.439416]  [<ffffffffc0d3ed83>] ldlm_cli_enqueue_local+0x233/0x860 [ptlrpc]
      [  354.440744]  [<ffffffffc10e63d7>] mdt_object_local_lock+0x4e7/0xb20 [mdt]
      [  354.442189]  [<ffffffffc10e6a80>] mdt_object_lock_internal+0x70/0x330 [mdt]
      [  354.443492]  [<ffffffffc10e6df7>] mdt_object_lock_try+0x27/0xb0 [mdt]
      [  354.444784]  [<ffffffffc10e8547>] mdt_getattr_name_lock+0x1297/0x1c00 [mdt]
      [  354.446055]  [<ffffffffc10efae5>] mdt_intent_getattr+0x2b5/0x480 [mdt]
      [  354.447309]  [<ffffffffc10ec9c8>] mdt_intent_policy+0x2f8/0xd10 [mdt]
      [  354.448497]  [<ffffffffc0d24e9e>] ldlm_lock_enqueue+0x34e/0xa50 [ptlrpc]
      [  354.450007]  [<ffffffffc0d4d503>] ldlm_handle_enqueue0+0x903/0x1520 [ptlrpc]
      [  354.451294]  [<ffffffffc0dd39b2>] tgt_enqueue+0x62/0x210 [ptlrpc]
      [  354.452545]  [<ffffffffc0ddc2fa>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
      [  354.453934]  [<ffffffffc0d7f48b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
      [  354.455351]  [<ffffffffc0d82cc4>] ptlrpc_main+0xb14/0x1fb0 [ptlrpc]
      [  354.456626]  [<ffffffffb9abb621>] kthread+0xd1/0xe0
      [  354.457598]  [<ffffffffba1205f7>] ret_from_fork_nospec_end+0x0/0x39
      [  354.458800]  [<ffffffffffffffff>] 0xffffffffffffffff
      [  354.459764] LustreError: dumping log to /tmp/lustre-log.1536596852.22466
      [  355.715812] LNet: Service thread pid 22486 was inactive for 62.01s. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one.
      [  355.718240] LustreError: dumping log to /tmp/lustre-log.1536596853.22486
      [  355.843781] LNet: Service thread pid 22494 was inactive for 62.11s. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one.
      [  390.164855] LustreError: 14386:0:(ldlm_lockd.c:251:expired_lock_main()) ### lock callback timer expired after 99s: evicting client at 10.2.4.94@tcp  ns: mdt-lustre-MDT0001_UUID lock: ffff92c1dc98ed80/0x1d42b9ca91f94e31 lrc: 3/0,0 mode: PW/PW res: [0x240000402:0x5d:0x0].0x0 bits 0x40/0x0 rrc: 12 type: IBT flags: 0x60200400000020 nid: 10.2.4.94@tcp remote: 0xef3380579c7186d1 expref: 88 pid: 22490 timeout: 390 lvb_type: 0
      [  390.173809] LustreError: 22468:0:(ldlm_lockd.c:1342:ldlm_handle_enqueue0()) ### lock on destroyed export ffff92c1dd765400 ns: mdt-lustre-MDT0001_UUID lock: ffff92c1dc98f680/0x1d42b9ca91f94ed2 lrc: 3/0,0 mode: PW/PW res: [0x240000402:0x5d:0x0].0x0 bits 0x40/0x0 rrc: 9 type: IBT flags: 0x50200000000000 nid: 10.2.4.94@tcp remote: 0xef3380579c718717 expref: 18 pid: 22468 timeout: 0 lvb_type: 0
      [  390.179772] LNet: Service thread pid 22468 completed after 101.02s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
      [  390.180082] LNet: Service thread pid 22484 completed after 100.40s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
      [  390.180474] LNet: Service thread pid 22466 completed after 97.81s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
      [  390.180626] LustreError: 14392:0:(ldlm_lockd.c:1342:ldlm_handle_enqueue0()) ### lock on destroyed export ffff92c1dd765400 ns: mdt-lustre-MDT0001_UUID lock: ffff92c1dd03f8c0/0x1d42b9ca91f96aa8 lrc: 3/0,0 mode: PR/PR res: [0x240000402:0x5d:0x0].0x0 bits 0x1b/0x0 rrc: 7 type: IBT flags: 0x50200000000000 nid: 10.2.4.94@tcp remote: 0xef3380579c7192d9 expref: 4 pid: 14392 timeout: 0 lvb_type: 0
      [  390.180862] LNet: Service thread pid 14392 completed after 99.10s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
      [  390.181233] LNet: Service thread pid 22464 completed after 99.30s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
      [  390.181528] LNet: Service thread pid 22486 completed after 96.47s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
      [  390.181813] LNet: Service thread pid 22494 completed after 96.45s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
      [  468.993496] LNetError: 14367:0:(lib-msg.c:794:lnet_is_health_check()) Msg is in inconsistent state, don't perform health checking (-125, 0)
      [  468.994385] LNetError: 14366:0:(lib-msg.c:794:lnet_is_health_check()) Msg is in inconsistent state, don't perform health checking (-125, 0)
      [  555.134662] LustreError: 14422:0:(llog_cat.c:801:llog_cat_cancel_records()) lustre-MDT0000-osp-MDT0001: fail to cancel 1 of 1 llog-records: rc = -116
      [  555.136979] LustreError: 14422:0:(llog_cat.c:801:llog_cat_cancel_records()) lustre-MDT0000-osp-MDT0001: fail to cancel 1 of 1 llog-records: rc = -116
      [  555.139399] LustreError: 14422:0:(llog_cat.c:801:llog_cat_cancel_records()) lustre-MDT0000-osp-MDT0001: fail to cancel 1 of 1 llog-records: rc = -116
      [  555.141809] LustreError: 14422:0:(llog_cat.c:801:llog_cat_cancel_records()) lustre-MDT0000-osp-MDT0001: fail to cancel 1 of 1 llog-records: rc = -116
      [  555.144689] LustreError: 14422:0:(llog_cat.c:801:llog_cat_cancel_records()) lustre-MDT0000-osp-MDT0001: fail to cancel 1 of 1 llog-records: rc = -116
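
The eviction and the "lock on destroyed export" errors above all name the same resource, [0x240000402:0x5d:0x0] on MDT0001. Extracting the FID and lock mode from the ldlm dump lines makes it easy to confirm the blocked threads are contending on one object; a sketch assuming the lock-dump format shown above:

```python
import re

# Matches 'mode: PW/PW res: [0x240000402:0x5d:0x0]' fragments in ldlm lock dumps.
LOCK_RE = re.compile(r"mode: (\w+)/\w+ res: \[(0x[0-9a-f]+):(0x[0-9a-f]+):(0x[0-9a-f]+)\]")

def lock_resources(console_log):
    """Extract (granted_mode, fid) pairs from ldlm lock dump lines."""
    return [(m.group(1), ":".join(m.group(2, 3, 4)))
            for m in LOCK_RE.finditer(console_log)]

line = ("### lock callback timer expired after 99s: evicting client ... "
        "mode: PW/PW res: [0x240000402:0x5d:0x0].0x0 bits 0x40/0x0")
print(lock_resources(line))  # → [('PW', '0x240000402:0x5d:0x0')]
```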
      

      Description continued in the first comment due to JIRA description line limits.

Attachments

Issue Links

Activity

[LU-11358] racer test 1 hangs in locking with DNE

            We're seeing a very similar hang in a non-DNE environment; https://testing.whamcloud.com/test_sets/dcea0894-fd71-11e9-bbc3-52540065bddc.

            If this is not the same hang, I can open a new ticket.

            jamesanunez James Nunez (Inactive) added a comment

            We are seeing this issue, or something very close to it, running racer with DNE/ZFS: https://testing.whamcloud.com/test_sets/d49634dc-df31-11e9-b62b-52540065bddc

            On the MDS2 console, we see an additional LNet health error:

            [75858.646568] Lustre: DEBUG MARKER: == racer test 1: racer on clients: trevis-48vm11.trevis.whamcloud.com,trevis-48vm12 DURATION=900 ===== 07:13:24 (1569309204)
            [75928.264277] LNetError: 1597:0:(lib-msg.c:816:lnet_is_health_check()) Msg is in inconsistent state, don't perform health checking (-125, 0)
            [75971.956667] Lustre: lustre-MDT0001: Client 826f72cb-2390-4ab9-e63f-b2fb6334f6ca (at 10.9.3.135@tcp) reconnecting
            [76114.156474] LNet: Service thread pid 22105 was inactive for 62.10s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
            [76114.159530] Pid: 22105, comm: mdt00_014 3.10.0-1062.1.1.el7_lustre.x86_64 #1 SMP Wed Sep 18 04:48:28 UTC 2019
            [76114.161351] Call Trace:
            [76114.161955]  [<ffffffffc117fae0>] ldlm_completion_ast+0x430/0x860 [ptlrpc]
            [76114.163733]  [<ffffffffc1180601>] ldlm_cli_enqueue_local+0x231/0x830 [ptlrpc]
            [76114.165034]  [<ffffffffc153252b>] mdt_object_local_lock+0x50b/0xb20 [mdt]
            [76114.166470]  [<ffffffffc1532bb0>] mdt_object_lock_internal+0x70/0x360 [mdt]
            [76114.167741]  [<ffffffffc1532efc>] mdt_reint_object_lock+0x2c/0x60 [mdt]
            [76114.168988]  [<ffffffffc154a81c>] mdt_reint_striped_lock+0x8c/0x510 [mdt]
            [76114.170334]  [<ffffffffc154b754>] mdt_reint_unlink+0x764/0x14b0 [mdt]
            [76114.171605]  [<ffffffffc1550283>] mdt_reint_rec+0x83/0x210 [mdt]
            [76114.172706]  [<ffffffffc152e1b3>] mdt_reint_internal+0x6e3/0xaf0 [mdt]
            [76114.173982]  [<ffffffffc1539587>] mdt_reint+0x67/0x140 [mdt]
            [76114.175070]  [<ffffffffc121deca>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
            [76114.176550]  [<ffffffffc11c330b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
            [76114.177947]  [<ffffffffc11c6c6c>] ptlrpc_main+0xb2c/0x1460 [ptlrpc]
            [76114.179173]  [<ffffffffb6cc50d1>] kthread+0xd1/0xe0
            [76114.180206]  [<ffffffffb738cd37>] ret_from_fork_nospec_end+0x0/0x39
            [76114.181483]  [<ffffffffffffffff>] 0xffffffffffffffff
            
            jamesanunez James Nunez (Inactive) added a comment
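When triaging console logs like the one above, it helps to pull out just the watchdog events so the hung MDT threads can be matched against the stack dumps that follow. The helper below is a hypothetical sketch, not part of Lustre or the test harness; the sample lines are copied from the traces quoted in this ticket.

```shell
# Hypothetical helper: extract "Service thread pid N was inactive for Xs"
# events from a saved console log, printing "PID DURATION" pairs.
hung_threads() {
    grep -o 'Service thread pid [0-9]* was inactive for [0-9.]*s' |
        awk '{print $4, $8}'
}

# Sample lines taken from the console logs quoted in this ticket.
sample_log='[76114.156474] LNet: Service thread pid 22105 was inactive for 62.10s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
[  409.301465] LNet: Service thread pid 23636 was inactive for 62.65s. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one.'

printf '%s\n' "$sample_log" | hung_threads
# prints:
# 22105 62.10s
# 23636 62.65s
```

On a real system one would feed it the collected console log file instead of the embedded sample.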

            We are running 2.12.0 + patches, including LU-11359, and just had an event with a lot of these call traces. For a long time, one or two ldlm_bl threads per MDS were consuming 100% CPU. The system recovered on its own after we killed the robinhood server (which reads changelogs):

            sthiell Stephane Thiell added a comment - MDT0000 and MDT0002 are in fir-md1-s1_compl_ast.log; MDT0001 and MDT0003 are in fir-md1-s2_compl_ast.log.
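For events like this, the busy ldlm_bl threads can be spotted from per-thread CPU usage. The sketch below is hypothetical (not from the ticket); on a live MDS one would pipe `ps -eLo tid,pcpu,comm` into it, while here it runs on fabricated sample data for illustration.

```shell
# Hypothetical helper: given "TID %CPU COMM" lines (e.g. from
# `ps -eLo tid,pcpu,comm`), print kernel threads above 90% CPU.
busy_threads() {
    awk '$2 + 0 > 90 {print $3, "tid=" $1, "cpu=" $2 "%"}'
}

# Fabricated sample resembling ps output on an affected MDS.
ps_sample='  1201  0.3 mdt00_014
  1340 99.7 ldlm_bl_03
  1341 98.2 ldlm_bl_07
  1402  1.1 ptlrpcd_00'

printf '%s\n' "$ps_sample" | busy_threads
# prints:
# ldlm_bl_03 tid=1340 cpu=99.7%
# ldlm_bl_07 tid=1341 cpu=98.2%
```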
            sarah Sarah Liu added a comment - Another one on b2_12: https://testing.whamcloud.com/test_sets/5f8412e0-5b42-11e9-a256-52540065bddc
            pjones Peter Jones added a comment -

            Lai

            Could you please comment?

            Thanks

            Peter


            JIRA won't allow me to add more lines to the description. So, here is the rest of the description:

            On MDS1 and MDS3 (vm4), we see similar call traces in the MDS console log:

            [  259.315768] Lustre: DEBUG MARKER: == racer test 1: racer on clients: onyx-30vm1.onyx.whamcloud.com,onyx-30vm2 DURATION=300 ============= 16:25:55 (1536596755)
            [  408.065620] LNet: Service thread pid 23632 was inactive for 62.12s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
            [  408.067505] Pid: 23632, comm: mdt00_012 3.10.0-862.9.1.el7_lustre.x86_64 #1 SMP Thu Aug 30 12:08:10 UTC 2018
            [  408.068612] Call Trace:
            [  408.068963]  [<ffffffffc0bb8f61>] ldlm_completion_ast+0x5b1/0x920 [ptlrpc]
            [  408.070119]  [<ffffffffc0bb9d83>] ldlm_cli_enqueue_local+0x233/0x860 [ptlrpc]
            [  408.071034]  [<ffffffffc0fd93d7>] mdt_object_local_lock+0x4e7/0xb20 [mdt]
            [  408.071999]  [<ffffffffc0fd9a80>] mdt_object_lock_internal+0x70/0x330 [mdt]
            [  408.073061]  [<ffffffffc0fd9df7>] mdt_object_lock_try+0x27/0xb0 [mdt]
            [  408.073925]  [<ffffffffc0ffd098>] mdt_object_open_lock+0x688/0xac0 [mdt]
            [  408.074920]  [<ffffffffc10017b3>] mdt_reint_open+0x14e3/0x3160 [mdt]
            [  408.075798]  [<ffffffffc0ff6293>] mdt_reint_rec+0x83/0x210 [mdt]
            [  408.076651]  [<ffffffffc0fd51e2>] mdt_reint_internal+0x6b2/0xa80 [mdt]
            [  408.077560]  [<ffffffffc0fe1922>] mdt_intent_open+0x82/0x350 [mdt]
            [  408.078413]  [<ffffffffc0fdf9c8>] mdt_intent_policy+0x2f8/0xd10 [mdt]
            [  408.079311]  [<ffffffffc0b9fe9e>] ldlm_lock_enqueue+0x34e/0xa50 [ptlrpc]
            [  408.080371]  [<ffffffffc0bc8503>] ldlm_handle_enqueue0+0x903/0x1520 [ptlrpc]
            [  408.081507]  [<ffffffffc0c4e9b2>] tgt_enqueue+0x62/0x210 [ptlrpc]
            [  408.082415]  [<ffffffffc0c572fa>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
            [  408.083497]  [<ffffffffc0bfa48b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
            [  408.084776]  [<ffffffffc0bfdcc4>] ptlrpc_main+0xb14/0x1fb0 [ptlrpc]
            [  408.085796]  [<ffffffff964bb621>] kthread+0xd1/0xe0
            [  408.086716]  [<ffffffff96b205f7>] ret_from_fork_nospec_end+0x0/0x39
            [  408.087793]  [<ffffffffffffffff>] 0xffffffffffffffff
            [  408.088619] LustreError: dumping log to /tmp/lustre-log.1536596905.23632
            [  408.671328] Pid: 23651, comm: mdt00_026 3.10.0-862.9.1.el7_lustre.x86_64 #1 SMP Thu Aug 30 12:08:10 UTC 2018
            [  408.672721] Call Trace:
            [  408.673309]  [<ffffffffc0bb8f61>] ldlm_completion_ast+0x5b1/0x920 [ptlrpc]
            [  408.674856]  [<ffffffffc0bb9d83>] ldlm_cli_enqueue_local+0x233/0x860 [ptlrpc]
            [  408.676257]  [<ffffffffc0fd93d7>] mdt_object_local_lock+0x4e7/0xb20 [mdt]
            [  408.677598]  [<ffffffffc0fd9a80>] mdt_object_lock_internal+0x70/0x330 [mdt]
            [  408.678935]  [<ffffffffc0fdb1c5>] mdt_getattr_name_lock+0xf15/0x1c00 [mdt]
            [  408.680486]  [<ffffffffc0fe2ae5>] mdt_intent_getattr+0x2b5/0x480 [mdt]
            [  408.681934]  [<ffffffffc0fdf9c8>] mdt_intent_policy+0x2f8/0xd10 [mdt]
            [  408.683203]  [<ffffffffc0b9fe9e>] ldlm_lock_enqueue+0x34e/0xa50 [ptlrpc]
            [  408.684711]  [<ffffffffc0bc8503>] ldlm_handle_enqueue0+0x903/0x1520 [ptlrpc]
            [  408.686102]  [<ffffffffc0c4e9b2>] tgt_enqueue+0x62/0x210 [ptlrpc]
            [  408.687452]  [<ffffffffc0c572fa>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
            [  408.688845]  [<ffffffffc0bfa48b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
            [  408.689943]  [<ffffffffc0bfdcc4>] ptlrpc_main+0xb14/0x1fb0 [ptlrpc]
            [  408.690671]  [<ffffffff964bb621>] kthread+0xd1/0xe0
            [  408.691215]  [<ffffffff96b205f7>] ret_from_fork_nospec_end+0x0/0x39
            [  408.691910]  [<ffffffffffffffff>] 0xffffffffffffffff
            [  408.692489] Pid: 23638, comm: mdt00_018 3.10.0-862.9.1.el7_lustre.x86_64 #1 SMP Thu Aug 30 12:08:10 UTC 2018
            [  408.693465] Call Trace:
            [  408.693756]  [<ffffffffc0bb8f61>] ldlm_completion_ast+0x5b1/0x920 [ptlrpc]
            [  408.694549]  [<ffffffffc0bb9d83>] ldlm_cli_enqueue_local+0x233/0x860 [ptlrpc]
            [  408.695315]  [<ffffffffc0fd93d7>] mdt_object_local_lock+0x4e7/0xb20 [mdt]
            [  408.696111]  [<ffffffffc0fd9a80>] mdt_object_lock_internal+0x70/0x330 [mdt]
            [  408.696874]  [<ffffffffc0fd9d9c>] mdt_reint_object_lock+0x2c/0x60 [mdt]
            [  408.697605]  [<ffffffffc0feb977>] mdt_reint_striped_lock+0x87/0x4d0 [mdt]
            [  408.698334]  [<ffffffffc0ff2036>] mdt_reint_unlink+0x706/0x1490 [mdt]
            [  408.699045]  [<ffffffffc0ff6293>] mdt_reint_rec+0x83/0x210 [mdt]
            [  408.699719]  [<ffffffffc0fd51e2>] mdt_reint_internal+0x6b2/0xa80 [mdt]
            [  408.700429]  [<ffffffffc0fe0447>] mdt_reint+0x67/0x140 [mdt]
            [  408.701056]  [<ffffffffc0c572fa>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
            [  408.701892]  [<ffffffffc0bfa48b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
            [  408.702758]  [<ffffffffc0bfdcc4>] ptlrpc_main+0xb14/0x1fb0 [ptlrpc]
            [  408.703467]  [<ffffffff964bb621>] kthread+0xd1/0xe0
            [  408.704024]  [<ffffffff96b205f7>] ret_from_fork_nospec_end+0x0/0x39
            [  408.704732]  [<ffffffffffffffff>] 0xffffffffffffffff
            [  408.705624] Pid: 23625, comm: mdt00_007 3.10.0-862.9.1.el7_lustre.x86_64 #1 SMP Thu Aug 30 12:08:10 UTC 2018
            [  408.706776] Call Trace:
            [  408.707061]  [<ffffffffc0bb8f61>] ldlm_completion_ast+0x5b1/0x920 [ptlrpc]
            [  408.707872]  [<ffffffffc0bb9d83>] ldlm_cli_enqueue_local+0x233/0x860 [ptlrpc]
            [  408.708669]  [<ffffffffc0fd93d7>] mdt_object_local_lock+0x4e7/0xb20 [mdt]
            [  408.709414]  [<ffffffffc0fd9a80>] mdt_object_lock_internal+0x70/0x330 [mdt]
            [  408.710173]  [<ffffffffc0fdaaea>] mdt_getattr_name_lock+0x83a/0x1c00 [mdt]
            [  408.710928]  [<ffffffffc0fe2ae5>] mdt_intent_getattr+0x2b5/0x480 [mdt]
            [  408.711653]  [<ffffffffc0fdf9c8>] mdt_intent_policy+0x2f8/0xd10 [mdt]
            [  408.712341]  [<ffffffffc0b9fe9e>] ldlm_lock_enqueue+0x34e/0xa50 [ptlrpc]
            [  408.713115]  [<ffffffffc0bc8503>] ldlm_handle_enqueue0+0x903/0x1520 [ptlrpc]
            [  408.713906]  [<ffffffffc0c4e9b2>] tgt_enqueue+0x62/0x210 [ptlrpc]
            [  408.714665]  [<ffffffffc0c572fa>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
            [  408.715429]  [<ffffffffc0bfa48b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
            [  408.716277]  [<ffffffffc0bfdcc4>] ptlrpc_main+0xb14/0x1fb0 [ptlrpc]
            [  408.717000]  [<ffffffff964bb621>] kthread+0xd1/0xe0
            [  408.717564]  [<ffffffff96b205f7>] ret_from_fork_nospec_end+0x0/0x39
            [  408.718255]  [<ffffffffffffffff>] 0xffffffffffffffff
            [  408.718856] LustreError: dumping log to /tmp/lustre-log.1536596906.23625
            [  409.275793] Pid: 23641, comm: mdt00_021 3.10.0-862.9.1.el7_lustre.x86_64 #1 SMP Thu Aug 30 12:08:10 UTC 2018
            [  409.277732] Call Trace:
            [  409.278221]  [<ffffffffc0bb8f61>] ldlm_completion_ast+0x5b1/0x920 [ptlrpc]
            [  409.280058]  [<ffffffffc0bb9d83>] ldlm_cli_enqueue_local+0x233/0x860 [ptlrpc]
            [  409.281479]  [<ffffffffc0fd93d7>] mdt_object_local_lock+0x4e7/0xb20 [mdt]
            [  409.282821]  [<ffffffffc0fd9a80>] mdt_object_lock_internal+0x70/0x330 [mdt]
            [  409.284152]  [<ffffffffc0fdaaea>] mdt_getattr_name_lock+0x83a/0x1c00 [mdt]
            [  409.285482]  [<ffffffffc0fe2ae5>] mdt_intent_getattr+0x2b5/0x480 [mdt]
            [  409.286819]  [<ffffffffc0fdf9c8>] mdt_intent_policy+0x2f8/0xd10 [mdt]
            [  409.288208]  [<ffffffffc0b9fe9e>] ldlm_lock_enqueue+0x34e/0xa50 [ptlrpc]
            [  409.289500]  [<ffffffffc0bc8503>] ldlm_handle_enqueue0+0x903/0x1520 [ptlrpc]
            [  409.290818]  [<ffffffffc0c4e9b2>] tgt_enqueue+0x62/0x210 [ptlrpc]
            [  409.293953]  [<ffffffffc0c572fa>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
            [  409.295296]  [<ffffffffc0bfa48b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
            [  409.296786]  [<ffffffffc0bfdcc4>] ptlrpc_main+0xb14/0x1fb0 [ptlrpc]
            [  409.298362]  [<ffffffff964bb621>] kthread+0xd1/0xe0
            [  409.299510]  [<ffffffff96b205f7>] ret_from_fork_nospec_end+0x0/0x39
            [  409.300537]  [<ffffffffffffffff>] 0xffffffffffffffff
            [  409.301465] LNet: Service thread pid 23636 was inactive for 62.65s. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one.
            [  409.302975] LNet: Service thread pid 23633 was inactive for 62.64s. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one.
            [  445.161620] LustreError: 14404:0:(ldlm_lockd.c:251:expired_lock_main()) ### lock callback timer expired after 99s: evicting client at 10.2.4.94@tcp  ns: mdt-lustre-MDT0000_UUID lock: ffff9c691e927d40/0x6cd7fdda56e1ffdf lrc: 3/0,0 mode: EX/EX res: [0x200000405:0x1f6:0x0].0x0 bits 0x4/0x0 rrc: 7 type: IBT flags: 0x60200400000020 nid: 10.2.4.94@tcp remote: 0xef3380579c724faf expref: 137 pid: 23632 timeout: 445 lvb_type: 0
            [  445.169506] LNet: Service thread pid 23632 completed after 99.22s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
            [  445.169656] LNet: Service thread pid 23638 completed after 99.12s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
            [  445.170016] LNet: Service thread pid 23651 completed after 98.81s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
            [  445.172273] LNet: Service thread pid 23633 completed after 98.51s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
            [  445.179865] LNet: Service thread pid 23625 completed after 98.58s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
            [  445.180073] LNet: Service thread pid 23641 completed after 98.62s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
            [  445.190969] LNet: Service thread pid 23636 completed after 98.54s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
            [  469.495444] LNetError: 14384:0:(lib-msg.c:794:lnet_is_health_check()) Msg is in inconsistent state, don't perform health checking (-125, 0)
            

            The call traces presented here have similarities to LU-5886 and LU-10525. Although this looks similar to LU-10852, the call traces for LU-10852 all have 'osp_md_object_lock()' in the call stack, and the hangs described here do not.
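Since the distinguishing feature versus LU-10852 is the presence of osp_md_object_lock() in the stack dumps, a grep over a saved console log can triage a new occurrence. This is a hypothetical helper, not part of any test suite; the excerpt is taken from the traces quoted above.

```shell
# Hypothetical triage helper: LU-10852 traces all contain
# osp_md_object_lock(), while the hangs reported here do not, so a grep
# over a saved console log separates the two cases.
classify_hang() {
    if grep -q 'osp_md_object_lock' "$1"; then
        echo "osp_md_object_lock present: likely LU-10852"
    else
        echo "no osp_md_object_lock: consistent with LU-11358"
    fi
}

# Minimal excerpt from the traces quoted above (no osp_md_object_lock frame).
trace_file=$(mktemp)
cat > "$trace_file" <<'EOF'
[<ffffffffc0bb8f61>] ldlm_completion_ast+0x5b1/0x920 [ptlrpc]
[<ffffffffc0fd93d7>] mdt_object_local_lock+0x4e7/0xb20 [mdt]
EOF

classify_hang "$trace_file"
# prints: no osp_md_object_lock: consistent with LU-11358
```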

            We've seen these call traces in console logs in other test sessions:
            https://testing.whamcloud.com/test_sets/1237377a-b529-11e8-8c12-52540065bddc

            jamesanunez James Nunez (Inactive) added a comment

            People

              laisiyao Lai Siyao
              jamesanunez James Nunez (Inactive)
              Votes:
              0
              Watchers:
              6

              Dates

                Created:
                Updated: