Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-11358

racer test 1 hangs in locking with DNE

    XMLWordPrintable

Details

    • Bug
    • Resolution: Unresolved
    • Major
    • None
    • Lustre 2.12.0, Lustre 2.12.1, Lustre 2.12.3, Lustre 2.12.6
    • 3
    • 9223372036854775807

    Description

      For https://testing.whamcloud.com/test_sets/122c1c1e-b529-11e8-8c12-52540065bddc, we see that racer test_1 hangs while running in a DNE environment.
      On the first client (vm1), we see that dir_create is hung. From the client console log, we see

      [  641.878967] LustreError: 29399:0:(lov_object.c:1208:lov_layout_change()) lustre-clilov-ffff8c19f9a62000: cannot apply new layout on [0x240000404:0x23d:0x0] : rc = -22
      [  641.881511] LustreError: 29399:0:(lcommon_cl.c:187:cl_file_inode_init()) Failure to initialize cl object [0x240000404:0x23d:0x0]: -22
      [  649.716504] LustreError: 5568:0:(lov_object.c:1208:lov_layout_change()) lustre-clilov-ffff8c19f9a62000: cannot apply new layout on [0x240000404:0x23d:0x0] : rc = -22
      [  649.719224] LustreError: 5568:0:(lcommon_cl.c:187:cl_file_inode_init()) Failure to initialize cl object [0x240000404:0x23d:0x0]: -22
      [  715.441686] Lustre: lustre-OST0003-osc-ffff8c19f9a62000: Connection to lustre-OST0003 (at 10.2.4.95@tcp) was lost; in progress operations using this service will wait for recovery to complete
      [  715.444729] LustreError: 167-0: lustre-OST0003-osc-ffff8c19f9a62000: This client was evicted by lustre-OST0003; in progress operations using this service will fail.
      [  715.446755] LustreError: 18480:0:(ldlm_resource.c:1146:ldlm_resource_complain()) lustre-OST0003-osc-ffff8c19f9a62000: namespace resource [0x57:0x0:0x0].0x0 (ffff8c19dec760c0) refcount nonzero (1) after lock cleanup; forcing cleanup.
      [  719.586481] Lustre: 4306:0:(client.c:2126:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1536596966/real 1536596966]  req@ffff8c19e35b1800 x1611238269322960/t0(0) o36->lustre-MDT0000-mdc-ffff8c19fbdd9000@10.2.4.96@tcp:12/10 lens 608/4768 e 0 to 1 dl 1536597121 ref 2 fl Rpc:IX/0/ffffffff rc 0/-1
      [  719.589415] LustreError: 4306:0:(llite_lib.c:1559:ll_md_setattr()) md_setattr fails: rc = -4
      [  719.593107] Lustre: lustre-MDT0000-mdc-ffff8c19fbdd9000: Connection restored to 10.2.4.96@tcp (at 10.2.4.96@tcp)
      [  840.233545] INFO: task dir_create.sh:28323 blocked for more than 120 seconds.
      [  840.234422] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [  840.235280] dir_create.sh   D ffff8c19e2766eb0     0 28323      1 0x00000084
      [  840.236116] Call Trace:
      [  840.236610]  [<ffffffffc080985c>] ? jobid_interpret_string+0x12c/0x1c0 [obdclass]
      [  840.237611]  [<ffffffff9f514029>] schedule+0x29/0x70
      [  840.238134]  [<ffffffff9f511999>] schedule_timeout+0x239/0x2c0
      [  840.238757]  [<ffffffff9f1516dd>] ? radix_tree_gang_lookup+0xcd/0x150
      [  840.239748]  [<ffffffffc0ba9ab7>] ? osc_page_gang_lookup+0x307/0x370 [osc]
      [  840.240501]  [<ffffffff9f5143dd>] wait_for_completion+0xfd/0x140
      [  840.241191]  [<ffffffff9eecf670>] ? wake_up_state+0x20/0x20
      [  840.241775]  [<ffffffffc0ba3688>] osc_io_setattr_end+0xe8/0x200 [osc]
      [  840.242554]  [<ffffffffc0bf4fa0>] ? lov_io_iter_fini_wrapper+0x50/0x50 [lov]
      [  840.243333]  [<ffffffffc0803f70>] cl_io_end+0x60/0x150 [obdclass]
      [  840.243993]  [<ffffffffc0bf507b>] lov_io_end_wrapper+0xdb/0xe0 [lov]
      [  840.244789]  [<ffffffffc0bf5507>] lov_io_call.isra.5+0x87/0x140 [lov]
      [  840.245626]  [<ffffffffc0bf55f6>] lov_io_end+0x36/0xb0 [lov]
      [  840.246235]  [<ffffffffc0803f70>] cl_io_end+0x60/0x150 [obdclass]
      [  840.246867]  [<ffffffffc080661f>] cl_io_loop+0x13f/0xc70 [obdclass]
      [  840.247719]  [<ffffffffc0c959c0>] cl_setattr_ost+0x250/0x3c0 [lustre]
      [  840.248424]  [<ffffffffc0c71b6f>] ll_setattr_raw+0xe7f/0xee0 [lustre]
      [  840.249146]  [<ffffffffc0c71c3c>] ll_setattr+0x6c/0xd0 [lustre]
      [  840.249881]  [<ffffffff9f039f14>] notify_change+0x2c4/0x420
      [  840.250539]  [<ffffffff9f0f3a5e>] ? process_measurement+0x8e/0x250
      [  840.251186]  [<ffffffff9f019185>] do_truncate+0x75/0xc0
      [  840.251733]  [<ffffffff9f02b717>] do_last+0x627/0x12c0
      [  840.252361]  [<ffffffff9f02c487>] path_openat+0xd7/0x640
      [  840.252995]  [<ffffffff9f02e01d>] do_filp_open+0x4d/0xb0
      [  840.253633]  [<ffffffff9f03b4c4>] ? __alloc_fd+0xc4/0x170
      [  840.254238]  [<ffffffff9f01a327>] do_sys_open+0x137/0x240
      [  840.254805]  [<ffffffff9f5206d5>] ? system_call_after_swapgs+0xa2/0x146
      [  840.255637]  [<ffffffff9f01a44e>] SyS_open+0x1e/0x20
      [  840.256167]  [<ffffffff9f520795>] system_call_fastpath+0x1c/0x21
      [  840.256788]  [<ffffffff9f5206e1>] ? system_call_after_swapgs+0xae/0x146
      

      On the second client (vm2), we see it is stuck in setfattr, mv, ln, touch, lfs, cat, chmod, … ending finally with “md_setattr fails: rc = -4”. From the client console log, we see:

      [  960.234201] INFO: task setfattr:4379 blocked for more than 120 seconds.
      [  960.235152] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [  960.236088] setfattr        D ffff88319d295ee0     0  4379  10275 0x00000080
      [  960.236836] Call Trace:
      [  960.237118]  [<ffffffff89e27040>] ? generic_permission+0x110/0x1d0
      [  960.237930]  [<ffffffff8a314f49>] schedule_preempt_disabled+0x29/0x70
      [  960.238864]  [<ffffffff8a312d07>] __mutex_lock_slowpath+0xc7/0x1d0
      [  960.239574]  [<ffffffff8a3120ef>] mutex_lock+0x1f/0x2f
      [  960.240178]  [<ffffffff89e432af>] vfs_removexattr+0x5f/0x130
      [  960.241016]  [<ffffffff89e433d5>] removexattr+0x55/0x80
      [  960.241569]  [<ffffffff89e2ccdd>] ? putname+0x3d/0x60
      [  960.242271]  [<ffffffff89e2def2>] ? user_path_at_empty+0x72/0xc0
      [  960.242971]  [<ffffffff89e1e418>] ? __sb_start_write+0x58/0x110
      [  960.243748]  [<ffffffff8a31b57c>] ? __do_page_fault+0x1bc/0x4f0
      [  960.244509]  [<ffffffff89e3e03c>] ? mnt_want_write+0x2c/0x50
      [  960.245274]  [<ffffffff89e441d4>] SyS_removexattr+0x94/0xd0
      [  960.245851]  [<ffffffff8a320795>] system_call_fastpath+0x1c/0x21
      [  960.246428]  [<ffffffff8a3206e1>] ? system_call_after_swapgs+0xae/0x146
      [  960.247093] INFO: task setfattr:4708 blocked for more than 120 seconds.
      [  960.247889] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [  960.248730] setfattr        D ffff88319d285ee0     0  4708  10256 0x00000080
      [  960.249474] Call Trace:
      [  960.249731]  [<ffffffff89e27040>] ? generic_permission+0x110/0x1d0
      [  960.250601]  [<ffffffff8a314f49>] schedule_preempt_disabled+0x29/0x70
      [  960.251286]  [<ffffffff8a312d07>] __mutex_lock_slowpath+0xc7/0x1d0
      [  960.251901]  [<ffffffff8a3120ef>] mutex_lock+0x1f/0x2f
      [  960.252397]  [<ffffffff89e432af>] vfs_removexattr+0x5f/0x130
      [  960.253026]  [<ffffffff89e433d5>] removexattr+0x55/0x80
      [  960.253722]  [<ffffffff89e2ccdd>] ? putname+0x3d/0x60
      [  960.254310]  [<ffffffff89e2def2>] ? user_path_at_empty+0x72/0xc0
      [  960.255005]  [<ffffffff89e1e418>] ? __sb_start_write+0x58/0x110
      [  960.255580]  [<ffffffff8a31b57c>] ? __do_page_fault+0x1bc/0x4f0
      [  960.256302]  [<ffffffff89e3e03c>] ? mnt_want_write+0x2c/0x50
      [  960.256943]  [<ffffffff89e441d4>] SyS_removexattr+0x94/0xd0
      [  960.257558]  [<ffffffff8a320795>] system_call_fastpath+0x1c/0x21
      [  960.258238]  [<ffffffff8a3206e1>] ? system_call_after_swapgs+0xae/0x146
      [  960.258922] INFO: task mv:4894 blocked for more than 120 seconds.
      [  960.259514] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [  960.260406] mv              D ffff8831abf83f40     0  4894  10227 0x00000080
      [  960.261156] Call Trace:
      [  960.261461]  [<ffffffff89ed1072>] ? security_inode_permission+0x22/0x30
      [  960.262321]  [<ffffffff8a314f49>] schedule_preempt_disabled+0x29/0x70
      [  960.263071]  [<ffffffff8a312d07>] __mutex_lock_slowpath+0xc7/0x1d0
      [  960.263708]  [<ffffffff8a3120ef>] mutex_lock+0x1f/0x2f
      [  960.264213]  [<ffffffff8a30bcde>] lookup_slow+0x33/0xa7
      [  960.264920]  [<ffffffff89e2aa08>] path_lookupat+0x838/0x8b0
      [  960.265479]  [<ffffffff89df7a35>] ? kmem_cache_alloc+0x35/0x1f0
      [  960.266156]  [<ffffffff89e2cd4f>] ? getname_flags+0x4f/0x1a0
      [  960.266720]  [<ffffffff89e2aaab>] filename_lookup+0x2b/0xc0
      [  960.267252]  [<ffffffff89e2dee7>] user_path_at_empty+0x67/0xc0
      [  960.267959]  [<ffffffff89d25092>] ? from_kgid_munged+0x12/0x20
      [  960.268570]  [<ffffffff89e2142f>] ? cp_new_stat+0x14f/0x180
      [  960.269147]  [<ffffffff89e2df51>] user_path_at+0x11/0x20
      [  960.269667]  [<ffffffff89e20f23>] vfs_fstatat+0x63/0xc0
      [  960.270189]  [<ffffffff89e214f1>] SYSC_newlstat+0x31/0x60
      [  960.270976]  [<ffffffff8a3206e1>] ? system_call_after_swapgs+0xae/0x146
      [  960.271851]  [<ffffffff8a3206d5>] ? system_call_after_swapgs+0xa2/0x146
      [  960.272595]  [<ffffffff8a3206e1>] ? system_call_after_swapgs+0xae/0x146
      [  960.273350]  [<ffffffff8a3206d5>] ? system_call_after_swapgs+0xa2/0x146
      [  960.274040]  [<ffffffff8a3206e1>] ? system_call_after_swapgs+0xae/0x146
      [  960.274856]  [<ffffffff8a3206d5>] ? system_call_after_swapgs+0xa2/0x146
      [  960.275794]  [<ffffffff8a3206e1>] ? system_call_after_swapgs+0xae/0x146
      [  960.276704]  [<ffffffff8a3206d5>] ? system_call_after_swapgs+0xa2/0x146
      [  960.277453]  [<ffffffff8a3206e1>] ? system_call_after_swapgs+0xae/0x146
      [  960.278135]  [<ffffffff8a3206d5>] ? system_call_after_swapgs+0xa2/0x146
      [  960.278810]  [<ffffffff8a3206e1>] ? system_call_after_swapgs+0xae/0x146
      [  960.279628]  [<ffffffff89e2177e>] SyS_newlstat+0xe/0x10
      [  960.280204]  [<ffffffff8a320795>] system_call_fastpath+0x1c/0x21
      [  960.280914]  [<ffffffff8a3206e1>] ? system_call_after_swapgs+0xae/0x146
      [  960.281654] INFO: task ln:5093 blocked for more than 120 seconds.
      [  960.282238] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [  960.283146] ln              D ffff88319d24af70     0  5093  10229 0x00000080
      [  960.283887] Call Trace:
      [  960.284172]  [<ffffffff89e2a2f2>] ? path_lookupat+0x122/0x8b0
      [  960.284908]  [<ffffffff89df5a59>] ? ___slab_alloc+0x209/0x4f0
      [  960.285587]  [<ffffffff8a314f49>] schedule_preempt_disabled+0x29/0x70
      [  960.286261]  [<ffffffff8a312d07>] __mutex_lock_slowpath+0xc7/0x1d0
      [  960.287045]  [<ffffffff8a3120ef>] mutex_lock+0x1f/0x2f
      [  960.287796]  [<ffffffff89e2abc5>] filename_create+0x85/0x180
      [  960.288559]  [<ffffffff89df7b9b>] ? kmem_cache_alloc+0x19b/0x1f0
      [  960.289344]  [<ffffffff89e2cd4f>] ? getname_flags+0x4f/0x1a0
      [  960.290171]  [<ffffffff89e2cdc4>] ? getname_flags+0xc4/0x1a0
      [  960.290901]  [<ffffffff89e2d071>] user_path_create+0x41/0x60
      [  960.291687]  [<ffffffff89e2e5e8>] SyS_symlinkat+0x58/0x110
      [  960.292335]  [<ffffffff8a3206d5>] ? system_call_after_swapgs+0xa2/0x146
      [  960.293033]  [<ffffffff8a3206e1>] ? system_call_after_swapgs+0xae/0x146
      [  960.293880]  [<ffffffff89e2e6b6>] SyS_symlink+0x16/0x20
      [  960.294530]  [<ffffffff8a320795>] system_call_fastpath+0x1c/0x21
      [  960.295233]  [<ffffffff8a3206e1>] ? system_call_after_swapgs+0xae/0x146
      [  960.295935] INFO: task touch:5326 blocked for more than 120 seconds.
      [  960.296628] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [  960.297687] touch           D ffff8831aad8eeb0     0  5326  10280 0x00000080
      [  960.298689] Call Trace:
      [  960.298949]  [<ffffffff8a314f49>] schedule_preempt_disabled+0x29/0x70
      [  960.299603]  [<ffffffff8a312d07>] __mutex_lock_slowpath+0xc7/0x1d0
      [  960.300357]  [<ffffffff8a3120ef>] mutex_lock+0x1f/0x2f
      [  960.300970]  [<ffffffff89e2b37f>] do_last+0x28f/0x12c0
      [  960.301611]  [<ffffffff89e2c487>] path_openat+0xd7/0x640
      [  960.302331]  [<ffffffff89e2e01d>] do_filp_open+0x4d/0xb0
      [  960.303145]  [<ffffffff89e3b4c4>] ? __alloc_fd+0xc4/0x170
      [  960.303935]  [<ffffffff89e1a327>] do_sys_open+0x137/0x240
      [  960.304594]  [<ffffffff8a3206d5>] ? system_call_after_swapgs+0xa2/0x146
      [  960.305239]  [<ffffffff89e1a44e>] SyS_open+0x1e/0x20
      [  960.305818]  [<ffffffff8a320795>] system_call_fastpath+0x1c/0x21
      [  960.306572]  [<ffffffff8a3206e1>] ? system_call_after_swapgs+0xae/0x146
      [  960.307450] INFO: task setfattr:5397 blocked for more than 120 seconds.
      [  960.308286] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [  960.309232] setfattr        D ffff8831a6433f40     0  5397  10258 0x00000080
      [  960.310116] Call Trace:
      [  960.310390]  [<ffffffff89e27040>] ? generic_permission+0x110/0x1d0
      [  960.311113]  [<ffffffff8a314f49>] schedule_preempt_disabled+0x29/0x70
      [  960.312040]  [<ffffffff8a312d07>] __mutex_lock_slowpath+0xc7/0x1d0
      [  960.312870]  [<ffffffff8a3120ef>] mutex_lock+0x1f/0x2f
      [  960.313448]  [<ffffffff89e43612>] vfs_setxattr+0x62/0xc0
      [  960.314029]  [<ffffffff89e4379c>] setxattr+0x12c/0x1c0
      [  960.314591]  [<ffffffff89df6ec2>] ? kmem_cache_free+0x1e2/0x200
      [  960.315167]  [<ffffffff89e2ccdd>] ? putname+0x3d/0x60
      [  960.315661]  [<ffffffff89e2def2>] ? user_path_at_empty+0x72/0xc0
      [  960.316354]  [<ffffffff89e1e418>] ? __sb_start_write+0x58/0x110
      [  960.316947]  [<ffffffff8a31b57c>] ? __do_page_fault+0x1bc/0x4f0
      [  960.317592]  [<ffffffff89e43a97>] SyS_setxattr+0xb7/0x100
      [  960.318115]  [<ffffffff8a320795>] system_call_fastpath+0x1c/0x21
      [  960.318753]  [<ffffffff8a3206e1>] ? system_call_after_swapgs+0xae/0x146
      [  960.319582] INFO: task ln:5444 blocked for more than 120 seconds.
      [  960.320167] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [  960.320996] ln              D ffff8831abf86eb0     0  5444  10267 0x00000080
      [  960.321733] Call Trace:
      [  960.322040]  [<ffffffff89e2a2f2>] ? path_lookupat+0x122/0x8b0
      [  960.322603]  [<ffffffff89df5a59>] ? ___slab_alloc+0x209/0x4f0
      [  960.323223]  [<ffffffff8a314f49>] schedule_preempt_disabled+0x29/0x70
      [  960.323886]  [<ffffffff8a312d07>] __mutex_lock_slowpath+0xc7/0x1d0
      [  960.324486]  [<ffffffff8a3120ef>] mutex_lock+0x1f/0x2f
      [  960.325130]  [<ffffffff89e2abc5>] filename_create+0x85/0x180
      [  960.325688]  [<ffffffff89df7b9b>] ? kmem_cache_alloc+0x19b/0x1f0
      [  960.326313]  [<ffffffff89e2cd4f>] ? getname_flags+0x4f/0x1a0
      [  960.326988]  [<ffffffff89e2cdc4>] ? getname_flags+0xc4/0x1a0
      [  960.327537]  [<ffffffff89e2d071>] user_path_create+0x41/0x60
      [  960.328165]  [<ffffffff89e2e5e8>] SyS_symlinkat+0x58/0x110
      [  960.328705]  [<ffffffff8a3206d5>] ? system_call_after_swapgs+0xa2/0x146
      [  960.329367]  [<ffffffff8a3206e1>] ? system_call_after_swapgs+0xae/0x146
      [  960.330116]  [<ffffffff89e2e6b6>] SyS_symlink+0x16/0x20
      [  960.330631]  [<ffffffff8a320795>] system_call_fastpath+0x1c/0x21
      [  960.331304]  [<ffffffff8a3206e1>] ? system_call_after_swapgs+0xae/0x146
      [  960.332054] INFO: task lfs:10586 blocked for more than 120 seconds.
      [  960.332725] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [  960.333471] lfs             D ffff8831abe38fd0     0 10586  10241 0x00000080
      [  960.334225] Call Trace:
      [  960.334539]  [<ffffffffc0d167e2>] ? ll_dcompare+0x72/0x2e0 [lustre]
      [  960.335222]  [<ffffffff8a314f49>] schedule_preempt_disabled+0x29/0x70
      [  960.335853]  [<ffffffff8a312d07>] __mutex_lock_slowpath+0xc7/0x1d0
      [  960.336513]  [<ffffffff8a3120ef>] mutex_lock+0x1f/0x2f
      [  960.337026]  [<ffffffffc0d32a9f>] ll_migrate+0x16f/0xcc0 [lustre]
      [  960.337669]  [<ffffffffc0d21a35>] ll_dir_ioctl+0x59c5/0x6560 [lustre]
      [  960.338587]  [<ffffffff89dc5a9c>] ? handle_pte_fault+0x2dc/0xc30
      [  960.339201]  [<ffffffff89e3dde4>] ? mntput+0x24/0x40
      [  960.339693]  [<ffffffff89dc7c3d>] ? handle_mm_fault+0x39d/0x9b0
      [  960.340288]  [<ffffffff89e30350>] do_vfs_ioctl+0x350/0x560
      [  960.340855]  [<ffffffff8a31b57c>] ? __do_page_fault+0x1bc/0x4f0
      [  960.341908]  [<ffffffff89e30601>] SyS_ioctl+0xa1/0xc0
      [  960.342852]  [<ffffffff8a3206d5>] ? system_call_after_swapgs+0xa2/0x146
      [  960.343595]  [<ffffffff8a320795>] system_call_fastpath+0x1c/0x21
      [  960.344181]  [<ffffffff8a3206e1>] ? system_call_after_swapgs+0xae/0x146
      [  960.345175] INFO: task cat:10688 blocked for more than 120 seconds.
      [  960.346073] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [  960.346939] cat             D ffff8831a95acf10     0 10688  10231 0x00000080
      [  960.347930] Call Trace:
      [  960.348208]  [<ffffffff89ed0a1e>] ? security_capable+0x1e/0x20
      [  960.348956]  [<ffffffff8a314f49>] schedule_preempt_disabled+0x29/0x70
      [  960.350080]  [<ffffffff8a312d07>] __mutex_lock_slowpath+0xc7/0x1d0
      [  960.350886]  [<ffffffff8a3120ef>] mutex_lock+0x1f/0x2f
      [  960.351527]  [<ffffffff8a30bcde>] lookup_slow+0x33/0xa7
      [  960.352142]  [<ffffffff89e2927f>] link_path_walk+0x80f/0x8b0
      [  960.352945]  [<ffffffff89e2c465>] path_openat+0xb5/0x640
      [  960.353620]  [<ffffffff89cf9172>] ? ktime_get+0x52/0xe0
      [  960.354387]  [<ffffffffc0d285ef>] ? ll_getattr+0x22f/0x580 [lustre]
      [  960.355148]  [<ffffffff89e2e01d>] do_filp_open+0x4d/0xb0
      [  960.355707]  [<ffffffff89e3b4c4>] ? __alloc_fd+0xc4/0x170
      [  960.356563]  [<ffffffff89e1a327>] do_sys_open+0x137/0x240
      [  960.357315]  [<ffffffff8a3206d5>] ? system_call_after_swapgs+0xa2/0x146
      [  960.358497]  [<ffffffff89e1a44e>] SyS_open+0x1e/0x20
      [  960.359026]  [<ffffffff8a320795>] system_call_fastpath+0x1c/0x21
      [  960.359646]  [<ffffffff8a3206e1>] ? system_call_after_swapgs+0xae/0x146
      [  960.360691] INFO: task chmod:11121 blocked for more than 120 seconds.
      [  960.361650] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [  960.362816] chmod           D ffff8831a32f8fd0     0 11121  10234 0x00000080
      [  960.363567] Call Trace:
      [  960.363824]  [<ffffffff89e2cd4f>] ? getname_flags+0x4f/0x1a0
      [  960.364484]  [<ffffffff8a314f49>] schedule_preempt_disabled+0x29/0x70
      [  960.365104]  [<ffffffff8a312d07>] __mutex_lock_slowpath+0xc7/0x1d0
      [  960.365787]  [<ffffffff8a3120ef>] mutex_lock+0x1f/0x2f
      [  960.366323]  [<ffffffff89e189c8>] chmod_common+0x78/0x160
      [  960.366924]  [<ffffffff8a31b57c>] ? __do_page_fault+0x1bc/0x4f0
      [  960.367607]  [<ffffffff89e19d17>] SyS_fchmodat+0x57/0xc0
      [  960.368125]  [<ffffffff8a320795>] system_call_fastpath+0x1c/0x21
      [  960.368777]  [<ffffffff8a3206e1>] ? system_call_after_swapgs+0xae/0x146
      [  990.121506] LustreError: 4369:0:(llite_lib.c:1559:ll_md_setattr()) md_setattr fails: rc = -4
      

      On MDS2, 4 (vm5), we see the following three call traces in the console log:

      [  259.348108] Lustre: DEBUG MARKER: == racer test 1: racer on clients: onyx-30vm1.onyx.whamcloud.com,onyx-30vm2 DURATION=300 ============= 16:25:55 (1536596755)
      [  351.363812] LNet: Service thread pid 22468 was inactive for 62.20s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      [  351.367067] Pid: 22468, comm: mdt00_009 3.10.0-862.9.1.el7_lustre.x86_64 #1 SMP Thu Aug 30 12:08:10 UTC 2018
      [  351.368822] Call Trace:
      [  351.369374]  [<ffffffffc0d3df61>] ldlm_completion_ast+0x5b1/0x920 [ptlrpc]
      [  351.370937]  [<ffffffffc0d3ed83>] ldlm_cli_enqueue_local+0x233/0x860 [ptlrpc]
      [  351.372390]  [<ffffffffc10e63d7>] mdt_object_local_lock+0x4e7/0xb20 [mdt]
      [  351.373833]  [<ffffffffc10e6a80>] mdt_object_lock_internal+0x70/0x330 [mdt]
      [  351.375390]  [<ffffffffc10e6d60>] mdt_object_lock+0x20/0x30 [mdt]
      [  351.376680]  [<ffffffffc112186b>] mdt_brw_enqueue+0x44b/0x760 [mdt]
      [  351.377955]  [<ffffffffc10d44bf>] mdt_intent_brw+0x1f/0x30 [mdt]
      [  351.379141]  [<ffffffffc10ec9c8>] mdt_intent_policy+0x2f8/0xd10 [mdt]
      [  351.380500]  [<ffffffffc0d24e9e>] ldlm_lock_enqueue+0x34e/0xa50 [ptlrpc]
      [  351.381929]  [<ffffffffc0d4d503>] ldlm_handle_enqueue0+0x903/0x1520 [ptlrpc]
      [  351.383341]  [<ffffffffc0dd39b2>] tgt_enqueue+0x62/0x210 [ptlrpc]
      [  351.384680]  [<ffffffffc0ddc2fa>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
      [  351.386091]  [<ffffffffc0d7f48b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
      [  351.387592]  [<ffffffffc0d82cc4>] ptlrpc_main+0xb14/0x1fb0 [ptlrpc]
      [  351.388691]  [<ffffffffb9abb621>] kthread+0xd1/0xe0
      [  351.389324]  [<ffffffffba1205f7>] ret_from_fork_nospec_end+0x0/0x39
      [  351.390034]  [<ffffffffffffffff>] 0xffffffffffffffff
      [  351.390691] LustreError: dumping log to /tmp/lustre-log.1536596849.22468
      [  351.992686] Pid: 22484, comm: mdt00_019 3.10.0-862.9.1.el7_lustre.x86_64 #1 SMP Thu Aug 30 12:08:10 UTC 2018
      [  351.994620] Call Trace:
      [  351.995150]  [<ffffffffc0d3df61>] ldlm_completion_ast+0x5b1/0x920 [ptlrpc]
      [  351.996680]  [<ffffffffc0d3ed83>] ldlm_cli_enqueue_local+0x233/0x860 [ptlrpc]
      [  351.998519]  [<ffffffffc10e63d7>] mdt_object_local_lock+0x4e7/0xb20 [mdt]
      [  352.000268]  [<ffffffffc10e6a80>] mdt_object_lock_internal+0x70/0x330 [mdt]
      [  352.001606]  [<ffffffffc10e8f1a>] mdt_object_find_lock+0x6a/0x1a0 [mdt]
      [  352.002977]  [<ffffffffc1104b4e>] mdt_reint_setxattr+0x1ce/0xf20 [mdt]
      [  352.004283]  [<ffffffffc1103293>] mdt_reint_rec+0x83/0x210 [mdt]
      [  352.005561]  [<ffffffffc10e21e2>] mdt_reint_internal+0x6b2/0xa80 [mdt]
      [  352.006896]  [<ffffffffc10ed447>] mdt_reint+0x67/0x140 [mdt]
      [  352.007959]  [<ffffffffc0ddc2fa>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
      [  352.009256]  [<ffffffffc0d7f48b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
      [  352.010650]  [<ffffffffc0d82cc4>] ptlrpc_main+0xb14/0x1fb0 [ptlrpc]
      [  352.011847]  [<ffffffffb9abb621>] kthread+0xd1/0xe0
      [  352.012767]  [<ffffffffba1205f7>] ret_from_fork_nospec_end+0x0/0x39
      [  352.013960]  [<ffffffffffffffff>] 0xffffffffffffffff
      [  352.899805] Pid: 22464, comm: mdt00_005 3.10.0-862.9.1.el7_lustre.x86_64 #1 SMP Thu Aug 30 12:08:10 UTC 2018
      [  352.901685] Call Trace:
      [  352.902186]  [<ffffffffc0d3df61>] ldlm_completion_ast+0x5b1/0x920 [ptlrpc]
      [  352.903489]  [<ffffffffc0d3ed83>] ldlm_cli_enqueue_local+0x233/0x860 [ptlrpc]
      [  352.905019]  [<ffffffffc10e63d7>] mdt_object_local_lock+0x4e7/0xb20 [mdt]
      [  352.906256]  [<ffffffffc10e6a80>] mdt_object_lock_internal+0x70/0x330 [mdt]
      [  352.907593]  [<ffffffffc10e6df7>] mdt_object_lock_try+0x27/0xb0 [mdt]
      [  352.908889]  [<ffffffffc10e8547>] mdt_getattr_name_lock+0x1297/0x1c00 [mdt]
      [  352.910163]  [<ffffffffc10efae5>] mdt_intent_getattr+0x2b5/0x480 [mdt]
      [  352.911631]  [<ffffffffc10ec9c8>] mdt_intent_policy+0x2f8/0xd10 [mdt]
      [  352.912921]  [<ffffffffc0d24e9e>] ldlm_lock_enqueue+0x34e/0xa50 [ptlrpc]
      [  352.914176]  [<ffffffffc0d4d503>] ldlm_handle_enqueue0+0x903/0x1520 [ptlrpc]
      [  352.915530]  [<ffffffffc0dd39b2>] tgt_enqueue+0x62/0x210 [ptlrpc]
      [  352.916805]  [<ffffffffc0ddc2fa>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
      [  352.918088]  [<ffffffffc0d7f48b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
      [  352.919538]  [<ffffffffc0d82cc4>] ptlrpc_main+0xb14/0x1fb0 [ptlrpc]
      [  352.920825]  [<ffffffffb9abb621>] kthread+0xd1/0xe0
      [  352.921765]  [<ffffffffba1205f7>] ret_from_fork_nospec_end+0x0/0x39
      [  352.922967]  [<ffffffffffffffff>] 0xffffffffffffffff
      [  352.924148] LustreError: dumping log to /tmp/lustre-log.1536596850.22464
      [  353.155770] Pid: 14392, comm: mdt00_002 3.10.0-862.9.1.el7_lustre.x86_64 #1 SMP Thu Aug 30 12:08:10 UTC 2018
      [  353.157590] Call Trace:
      [  353.158104]  [<ffffffffc0d3df61>] ldlm_completion_ast+0x5b1/0x920 [ptlrpc]
      [  353.159532]  [<ffffffffc0d3ed83>] ldlm_cli_enqueue_local+0x233/0x860 [ptlrpc]
      [  353.160957]  [<ffffffffc10e63d7>] mdt_object_local_lock+0x4e7/0xb20 [mdt]
      [  353.162198]  [<ffffffffc10e6a80>] mdt_object_lock_internal+0x70/0x330 [mdt]
      [  353.163514]  [<ffffffffc10e6df7>] mdt_object_lock_try+0x27/0xb0 [mdt]
      [  353.164788]  [<ffffffffc10e8547>] mdt_getattr_name_lock+0x1297/0x1c00 [mdt]
      [  353.166069]  [<ffffffffc10efae5>] mdt_intent_getattr+0x2b5/0x480 [mdt]
      [  353.167514]  [<ffffffffc10ec9c8>] mdt_intent_policy+0x2f8/0xd10 [mdt]
      [  353.168828]  [<ffffffffc0d24e9e>] ldlm_lock_enqueue+0x34e/0xa50 [ptlrpc]
      [  353.170114]  [<ffffffffc0d4d503>] ldlm_handle_enqueue0+0x903/0x1520 [ptlrpc]
      [  353.171477]  [<ffffffffc0dd39b2>] tgt_enqueue+0x62/0x210 [ptlrpc]
      [  353.172752]  [<ffffffffc0ddc2fa>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
      [  353.174046]  [<ffffffffc0d7f48b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
      [  353.175514]  [<ffffffffc0d82cc4>] ptlrpc_main+0xb14/0x1fb0 [ptlrpc]
      [  353.176790]  [<ffffffffb9abb621>] kthread+0xd1/0xe0
      [  353.177702]  [<ffffffffba1205f7>] ret_from_fork_nospec_end+0x0/0x39
      [  353.179138]  [<ffffffffffffffff>] 0xffffffffffffffff
      [  354.435795] Pid: 22466, comm: mdt00_007 3.10.0-862.9.1.el7_lustre.x86_64 #1 SMP Thu Aug 30 12:08:10 UTC 2018
      [  354.437623] Call Trace:
      [  354.438121]  [<ffffffffc0d3df61>] ldlm_completion_ast+0x5b1/0x920 [ptlrpc]
      [  354.439416]  [<ffffffffc0d3ed83>] ldlm_cli_enqueue_local+0x233/0x860 [ptlrpc]
      [  354.440744]  [<ffffffffc10e63d7>] mdt_object_local_lock+0x4e7/0xb20 [mdt]
      [  354.442189]  [<ffffffffc10e6a80>] mdt_object_lock_internal+0x70/0x330 [mdt]
      [  354.443492]  [<ffffffffc10e6df7>] mdt_object_lock_try+0x27/0xb0 [mdt]
      [  354.444784]  [<ffffffffc10e8547>] mdt_getattr_name_lock+0x1297/0x1c00 [mdt]
      [  354.446055]  [<ffffffffc10efae5>] mdt_intent_getattr+0x2b5/0x480 [mdt]
      [  354.447309]  [<ffffffffc10ec9c8>] mdt_intent_policy+0x2f8/0xd10 [mdt]
      [  354.448497]  [<ffffffffc0d24e9e>] ldlm_lock_enqueue+0x34e/0xa50 [ptlrpc]
      [  354.450007]  [<ffffffffc0d4d503>] ldlm_handle_enqueue0+0x903/0x1520 [ptlrpc]
      [  354.451294]  [<ffffffffc0dd39b2>] tgt_enqueue+0x62/0x210 [ptlrpc]
      [  354.452545]  [<ffffffffc0ddc2fa>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
      [  354.453934]  [<ffffffffc0d7f48b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
      [  354.455351]  [<ffffffffc0d82cc4>] ptlrpc_main+0xb14/0x1fb0 [ptlrpc]
      [  354.456626]  [<ffffffffb9abb621>] kthread+0xd1/0xe0
      [  354.457598]  [<ffffffffba1205f7>] ret_from_fork_nospec_end+0x0/0x39
      [  354.458800]  [<ffffffffffffffff>] 0xffffffffffffffff
      [  354.459764] LustreError: dumping log to /tmp/lustre-log.1536596852.22466
      [  355.715812] LNet: Service thread pid 22486 was inactive for 62.01s. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one.
      [  355.718240] LustreError: dumping log to /tmp/lustre-log.1536596853.22486
      [  355.843781] LNet: Service thread pid 22494 was inactive for 62.11s. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one.
      [  390.164855] LustreError: 14386:0:(ldlm_lockd.c:251:expired_lock_main()) ### lock callback timer expired after 99s: evicting client at 10.2.4.94@tcp  ns: mdt-lustre-MDT0001_UUID lock: ffff92c1dc98ed80/0x1d42b9ca91f94e31 lrc: 3/0,0 mode: PW/PW res: [0x240000402:0x5d:0x0].0x0 bits 0x40/0x0 rrc: 12 type: IBT flags: 0x60200400000020 nid: 10.2.4.94@tcp remote: 0xef3380579c7186d1 expref: 88 pid: 22490 timeout: 390 lvb_type: 0
      [  390.173809] LustreError: 22468:0:(ldlm_lockd.c:1342:ldlm_handle_enqueue0()) ### lock on destroyed export ffff92c1dd765400 ns: mdt-lustre-MDT0001_UUID lock: ffff92c1dc98f680/0x1d42b9ca91f94ed2 lrc: 3/0,0 mode: PW/PW res: [0x240000402:0x5d:0x0].0x0 bits 0x40/0x0 rrc: 9 type: IBT flags: 0x50200000000000 nid: 10.2.4.94@tcp remote: 0xef3380579c718717 expref: 18 pid: 22468 timeout: 0 lvb_type: 0
      [  390.179772] LNet: Service thread pid 22468 completed after 101.02s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
      [  390.180082] LNet: Service thread pid 22484 completed after 100.40s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
      [  390.180474] LNet: Service thread pid 22466 completed after 97.81s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
      [  390.180626] LustreError: 14392:0:(ldlm_lockd.c:1342:ldlm_handle_enqueue0()) ### lock on destroyed export ffff92c1dd765400 ns: mdt-lustre-MDT0001_UUID lock: ffff92c1dd03f8c0/0x1d42b9ca91f96aa8 lrc: 3/0,0 mode: PR/PR res: [0x240000402:0x5d:0x0].0x0 bits 0x1b/0x0 rrc: 7 type: IBT flags: 0x50200000000000 nid: 10.2.4.94@tcp remote: 0xef3380579c7192d9 expref: 4 pid: 14392 timeout: 0 lvb_type: 0
      [  390.180862] LNet: Service thread pid 14392 completed after 99.10s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
      [  390.181233] LNet: Service thread pid 22464 completed after 99.30s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
      [  390.181528] LNet: Service thread pid 22486 completed after 96.47s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
      [  390.181813] LNet: Service thread pid 22494 completed after 96.45s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
      [  468.993496] LNetError: 14367:0:(lib-msg.c:794:lnet_is_health_check()) Msg is in inconsistent state, don't perform health checking (-125, 0)
      [  468.994385] LNetError: 14366:0:(lib-msg.c:794:lnet_is_health_check()) Msg is in inconsistent state, don't perform health checking (-125, 0)
      [  555.134662] LustreError: 14422:0:(llog_cat.c:801:llog_cat_cancel_records()) lustre-MDT0000-osp-MDT0001: fail to cancel 1 of 1 llog-records: rc = -116
      [  555.136979] LustreError: 14422:0:(llog_cat.c:801:llog_cat_cancel_records()) lustre-MDT0000-osp-MDT0001: fail to cancel 1 of 1 llog-records: rc = -116
      [  555.139399] LustreError: 14422:0:(llog_cat.c:801:llog_cat_cancel_records()) lustre-MDT0000-osp-MDT0001: fail to cancel 1 of 1 llog-records: rc = -116
      [  555.141809] LustreError: 14422:0:(llog_cat.c:801:llog_cat_cancel_records()) lustre-MDT0000-osp-MDT0001: fail to cancel 1 of 1 llog-records: rc = -116
      [  555.144689] LustreError: 14422:0:(llog_cat.c:801:llog_cat_cancel_records()) lustre-MDT0000-osp-MDT0001: fail to cancel 1 of 1 llog-records: rc = -116
      

      Description continued in the first comment due to JIRA description line limits.

      Attachments

        1. fir-md1-s1_compl_ast.log
          159 kB
          Stephane Thiell
        2. fir-md1-s2_compl_ast.log
          84 kB
          Stephane Thiell

        Issue Links

          Activity

            People

              laisiyao Lai Siyao
              jamesanunez James Nunez (Inactive)
              Votes:
              0 Vote for this issue
              Watchers:
              6 Start watching this issue

              Dates

                Created:
                Updated: