Details
-
Bug
-
Resolution: Unresolved
-
Major
-
None
-
Lustre 2.12.0, Lustre 2.12.1, Lustre 2.12.3, Lustre 2.12.6
-
3
-
9223372036854775807
Description
For https://testing.whamcloud.com/test_sets/122c1c1e-b529-11e8-8c12-52540065bddc, we see that racer test_1 hangs while running in a DNE environment.
On the first client (vm1), we see that dir_create is hung. From the client console log, we see
[ 641.878967] LustreError: 29399:0:(lov_object.c:1208:lov_layout_change()) lustre-clilov-ffff8c19f9a62000: cannot apply new layout on [0x240000404:0x23d:0x0] : rc = -22 [ 641.881511] LustreError: 29399:0:(lcommon_cl.c:187:cl_file_inode_init()) Failure to initialize cl object [0x240000404:0x23d:0x0]: -22 [ 649.716504] LustreError: 5568:0:(lov_object.c:1208:lov_layout_change()) lustre-clilov-ffff8c19f9a62000: cannot apply new layout on [0x240000404:0x23d:0x0] : rc = -22 [ 649.719224] LustreError: 5568:0:(lcommon_cl.c:187:cl_file_inode_init()) Failure to initialize cl object [0x240000404:0x23d:0x0]: -22 [ 715.441686] Lustre: lustre-OST0003-osc-ffff8c19f9a62000: Connection to lustre-OST0003 (at 10.2.4.95@tcp) was lost; in progress operations using this service will wait for recovery to complete [ 715.444729] LustreError: 167-0: lustre-OST0003-osc-ffff8c19f9a62000: This client was evicted by lustre-OST0003; in progress operations using this service will fail. [ 715.446755] LustreError: 18480:0:(ldlm_resource.c:1146:ldlm_resource_complain()) lustre-OST0003-osc-ffff8c19f9a62000: namespace resource [0x57:0x0:0x0].0x0 (ffff8c19dec760c0) refcount nonzero (1) after lock cleanup; forcing cleanup. [ 719.586481] Lustre: 4306:0:(client.c:2126:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1536596966/real 1536596966] req@ffff8c19e35b1800 x1611238269322960/t0(0) o36->lustre-MDT0000-mdc-ffff8c19fbdd9000@10.2.4.96@tcp:12/10 lens 608/4768 e 0 to 1 dl 1536597121 ref 2 fl Rpc:IX/0/ffffffff rc 0/-1 [ 719.589415] LustreError: 4306:0:(llite_lib.c:1559:ll_md_setattr()) md_setattr fails: rc = -4 [ 719.593107] Lustre: lustre-MDT0000-mdc-ffff8c19fbdd9000: Connection restored to 10.2.4.96@tcp (at 10.2.4.96@tcp) [ 840.233545] INFO: task dir_create.sh:28323 blocked for more than 120 seconds. [ 840.234422] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 840.235280] dir_create.sh D ffff8c19e2766eb0 0 28323 1 0x00000084 [ 840.236116] Call Trace: [ 840.236610] [<ffffffffc080985c>] ? jobid_interpret_string+0x12c/0x1c0 [obdclass] [ 840.237611] [<ffffffff9f514029>] schedule+0x29/0x70 [ 840.238134] [<ffffffff9f511999>] schedule_timeout+0x239/0x2c0 [ 840.238757] [<ffffffff9f1516dd>] ? radix_tree_gang_lookup+0xcd/0x150 [ 840.239748] [<ffffffffc0ba9ab7>] ? osc_page_gang_lookup+0x307/0x370 [osc] [ 840.240501] [<ffffffff9f5143dd>] wait_for_completion+0xfd/0x140 [ 840.241191] [<ffffffff9eecf670>] ? wake_up_state+0x20/0x20 [ 840.241775] [<ffffffffc0ba3688>] osc_io_setattr_end+0xe8/0x200 [osc] [ 840.242554] [<ffffffffc0bf4fa0>] ? lov_io_iter_fini_wrapper+0x50/0x50 [lov] [ 840.243333] [<ffffffffc0803f70>] cl_io_end+0x60/0x150 [obdclass] [ 840.243993] [<ffffffffc0bf507b>] lov_io_end_wrapper+0xdb/0xe0 [lov] [ 840.244789] [<ffffffffc0bf5507>] lov_io_call.isra.5+0x87/0x140 [lov] [ 840.245626] [<ffffffffc0bf55f6>] lov_io_end+0x36/0xb0 [lov] [ 840.246235] [<ffffffffc0803f70>] cl_io_end+0x60/0x150 [obdclass] [ 840.246867] [<ffffffffc080661f>] cl_io_loop+0x13f/0xc70 [obdclass] [ 840.247719] [<ffffffffc0c959c0>] cl_setattr_ost+0x250/0x3c0 [lustre] [ 840.248424] [<ffffffffc0c71b6f>] ll_setattr_raw+0xe7f/0xee0 [lustre] [ 840.249146] [<ffffffffc0c71c3c>] ll_setattr+0x6c/0xd0 [lustre] [ 840.249881] [<ffffffff9f039f14>] notify_change+0x2c4/0x420 [ 840.250539] [<ffffffff9f0f3a5e>] ? process_measurement+0x8e/0x250 [ 840.251186] [<ffffffff9f019185>] do_truncate+0x75/0xc0 [ 840.251733] [<ffffffff9f02b717>] do_last+0x627/0x12c0 [ 840.252361] [<ffffffff9f02c487>] path_openat+0xd7/0x640 [ 840.252995] [<ffffffff9f02e01d>] do_filp_open+0x4d/0xb0 [ 840.253633] [<ffffffff9f03b4c4>] ? __alloc_fd+0xc4/0x170 [ 840.254238] [<ffffffff9f01a327>] do_sys_open+0x137/0x240 [ 840.254805] [<ffffffff9f5206d5>] ? system_call_after_swapgs+0xa2/0x146 [ 840.255637] [<ffffffff9f01a44e>] SyS_open+0x1e/0x20 [ 840.256167] [<ffffffff9f520795>] system_call_fastpath+0x1c/0x21 [ 840.256788] [<ffffffff9f5206e1>] ? system_call_after_swapgs+0xae/0x146
On the second client (vm2), we see it is stuck in setfattr, mv, ln, touch, lfs, cat, chmod, … ending finally with “md_setattr fails: rc = -4”. From the client console log, we see:
[ 960.234201] INFO: task setfattr:4379 blocked for more than 120 seconds. [ 960.235152] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 960.236088] setfattr D ffff88319d295ee0 0 4379 10275 0x00000080 [ 960.236836] Call Trace: [ 960.237118] [<ffffffff89e27040>] ? generic_permission+0x110/0x1d0 [ 960.237930] [<ffffffff8a314f49>] schedule_preempt_disabled+0x29/0x70 [ 960.238864] [<ffffffff8a312d07>] __mutex_lock_slowpath+0xc7/0x1d0 [ 960.239574] [<ffffffff8a3120ef>] mutex_lock+0x1f/0x2f [ 960.240178] [<ffffffff89e432af>] vfs_removexattr+0x5f/0x130 [ 960.241016] [<ffffffff89e433d5>] removexattr+0x55/0x80 [ 960.241569] [<ffffffff89e2ccdd>] ? putname+0x3d/0x60 [ 960.242271] [<ffffffff89e2def2>] ? user_path_at_empty+0x72/0xc0 [ 960.242971] [<ffffffff89e1e418>] ? __sb_start_write+0x58/0x110 [ 960.243748] [<ffffffff8a31b57c>] ? __do_page_fault+0x1bc/0x4f0 [ 960.244509] [<ffffffff89e3e03c>] ? mnt_want_write+0x2c/0x50 [ 960.245274] [<ffffffff89e441d4>] SyS_removexattr+0x94/0xd0 [ 960.245851] [<ffffffff8a320795>] system_call_fastpath+0x1c/0x21 [ 960.246428] [<ffffffff8a3206e1>] ? system_call_after_swapgs+0xae/0x146 [ 960.247093] INFO: task setfattr:4708 blocked for more than 120 seconds. [ 960.247889] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 960.248730] setfattr D ffff88319d285ee0 0 4708 10256 0x00000080 [ 960.249474] Call Trace: [ 960.249731] [<ffffffff89e27040>] ? generic_permission+0x110/0x1d0 [ 960.250601] [<ffffffff8a314f49>] schedule_preempt_disabled+0x29/0x70 [ 960.251286] [<ffffffff8a312d07>] __mutex_lock_slowpath+0xc7/0x1d0 [ 960.251901] [<ffffffff8a3120ef>] mutex_lock+0x1f/0x2f [ 960.252397] [<ffffffff89e432af>] vfs_removexattr+0x5f/0x130 [ 960.253026] [<ffffffff89e433d5>] removexattr+0x55/0x80 [ 960.253722] [<ffffffff89e2ccdd>] ? putname+0x3d/0x60 [ 960.254310] [<ffffffff89e2def2>] ? user_path_at_empty+0x72/0xc0 [ 960.255005] [<ffffffff89e1e418>] ? __sb_start_write+0x58/0x110 [ 960.255580] [<ffffffff8a31b57c>] ? __do_page_fault+0x1bc/0x4f0 [ 960.256302] [<ffffffff89e3e03c>] ? mnt_want_write+0x2c/0x50 [ 960.256943] [<ffffffff89e441d4>] SyS_removexattr+0x94/0xd0 [ 960.257558] [<ffffffff8a320795>] system_call_fastpath+0x1c/0x21 [ 960.258238] [<ffffffff8a3206e1>] ? system_call_after_swapgs+0xae/0x146 [ 960.258922] INFO: task mv:4894 blocked for more than 120 seconds. [ 960.259514] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 960.260406] mv D ffff8831abf83f40 0 4894 10227 0x00000080 [ 960.261156] Call Trace: [ 960.261461] [<ffffffff89ed1072>] ? security_inode_permission+0x22/0x30 [ 960.262321] [<ffffffff8a314f49>] schedule_preempt_disabled+0x29/0x70 [ 960.263071] [<ffffffff8a312d07>] __mutex_lock_slowpath+0xc7/0x1d0 [ 960.263708] [<ffffffff8a3120ef>] mutex_lock+0x1f/0x2f [ 960.264213] [<ffffffff8a30bcde>] lookup_slow+0x33/0xa7 [ 960.264920] [<ffffffff89e2aa08>] path_lookupat+0x838/0x8b0 [ 960.265479] [<ffffffff89df7a35>] ? kmem_cache_alloc+0x35/0x1f0 [ 960.266156] [<ffffffff89e2cd4f>] ? getname_flags+0x4f/0x1a0 [ 960.266720] [<ffffffff89e2aaab>] filename_lookup+0x2b/0xc0 [ 960.267252] [<ffffffff89e2dee7>] user_path_at_empty+0x67/0xc0 [ 960.267959] [<ffffffff89d25092>] ? from_kgid_munged+0x12/0x20 [ 960.268570] [<ffffffff89e2142f>] ? cp_new_stat+0x14f/0x180 [ 960.269147] [<ffffffff89e2df51>] user_path_at+0x11/0x20 [ 960.269667] [<ffffffff89e20f23>] vfs_fstatat+0x63/0xc0 [ 960.270189] [<ffffffff89e214f1>] SYSC_newlstat+0x31/0x60 [ 960.270976] [<ffffffff8a3206e1>] ? system_call_after_swapgs+0xae/0x146 [ 960.271851] [<ffffffff8a3206d5>] ? system_call_after_swapgs+0xa2/0x146 [ 960.272595] [<ffffffff8a3206e1>] ? system_call_after_swapgs+0xae/0x146 [ 960.273350] [<ffffffff8a3206d5>] ? system_call_after_swapgs+0xa2/0x146 [ 960.274040] [<ffffffff8a3206e1>] ? system_call_after_swapgs+0xae/0x146 [ 960.274856] [<ffffffff8a3206d5>] ? system_call_after_swapgs+0xa2/0x146 [ 960.275794] [<ffffffff8a3206e1>] ? system_call_after_swapgs+0xae/0x146 [ 960.276704] [<ffffffff8a3206d5>] ? system_call_after_swapgs+0xa2/0x146 [ 960.277453] [<ffffffff8a3206e1>] ? system_call_after_swapgs+0xae/0x146 [ 960.278135] [<ffffffff8a3206d5>] ? system_call_after_swapgs+0xa2/0x146 [ 960.278810] [<ffffffff8a3206e1>] ? system_call_after_swapgs+0xae/0x146 [ 960.279628] [<ffffffff89e2177e>] SyS_newlstat+0xe/0x10 [ 960.280204] [<ffffffff8a320795>] system_call_fastpath+0x1c/0x21 [ 960.280914] [<ffffffff8a3206e1>] ? system_call_after_swapgs+0xae/0x146 [ 960.281654] INFO: task ln:5093 blocked for more than 120 seconds. [ 960.282238] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 960.283146] ln D ffff88319d24af70 0 5093 10229 0x00000080 [ 960.283887] Call Trace: [ 960.284172] [<ffffffff89e2a2f2>] ? path_lookupat+0x122/0x8b0 [ 960.284908] [<ffffffff89df5a59>] ? ___slab_alloc+0x209/0x4f0 [ 960.285587] [<ffffffff8a314f49>] schedule_preempt_disabled+0x29/0x70 [ 960.286261] [<ffffffff8a312d07>] __mutex_lock_slowpath+0xc7/0x1d0 [ 960.287045] [<ffffffff8a3120ef>] mutex_lock+0x1f/0x2f [ 960.287796] [<ffffffff89e2abc5>] filename_create+0x85/0x180 [ 960.288559] [<ffffffff89df7b9b>] ? kmem_cache_alloc+0x19b/0x1f0 [ 960.289344] [<ffffffff89e2cd4f>] ? getname_flags+0x4f/0x1a0 [ 960.290171] [<ffffffff89e2cdc4>] ? getname_flags+0xc4/0x1a0 [ 960.290901] [<ffffffff89e2d071>] user_path_create+0x41/0x60 [ 960.291687] [<ffffffff89e2e5e8>] SyS_symlinkat+0x58/0x110 [ 960.292335] [<ffffffff8a3206d5>] ? system_call_after_swapgs+0xa2/0x146 [ 960.293033] [<ffffffff8a3206e1>] ? system_call_after_swapgs+0xae/0x146 [ 960.293880] [<ffffffff89e2e6b6>] SyS_symlink+0x16/0x20 [ 960.294530] [<ffffffff8a320795>] system_call_fastpath+0x1c/0x21 [ 960.295233] [<ffffffff8a3206e1>] ? system_call_after_swapgs+0xae/0x146 [ 960.295935] INFO: task touch:5326 blocked for more than 120 seconds. [ 960.296628] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 960.297687] touch D ffff8831aad8eeb0 0 5326 10280 0x00000080 [ 960.298689] Call Trace: [ 960.298949] [<ffffffff8a314f49>] schedule_preempt_disabled+0x29/0x70 [ 960.299603] [<ffffffff8a312d07>] __mutex_lock_slowpath+0xc7/0x1d0 [ 960.300357] [<ffffffff8a3120ef>] mutex_lock+0x1f/0x2f [ 960.300970] [<ffffffff89e2b37f>] do_last+0x28f/0x12c0 [ 960.301611] [<ffffffff89e2c487>] path_openat+0xd7/0x640 [ 960.302331] [<ffffffff89e2e01d>] do_filp_open+0x4d/0xb0 [ 960.303145] [<ffffffff89e3b4c4>] ? __alloc_fd+0xc4/0x170 [ 960.303935] [<ffffffff89e1a327>] do_sys_open+0x137/0x240 [ 960.304594] [<ffffffff8a3206d5>] ? system_call_after_swapgs+0xa2/0x146 [ 960.305239] [<ffffffff89e1a44e>] SyS_open+0x1e/0x20 [ 960.305818] [<ffffffff8a320795>] system_call_fastpath+0x1c/0x21 [ 960.306572] [<ffffffff8a3206e1>] ? system_call_after_swapgs+0xae/0x146 [ 960.307450] INFO: task setfattr:5397 blocked for more than 120 seconds. [ 960.308286] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 960.309232] setfattr D ffff8831a6433f40 0 5397 10258 0x00000080 [ 960.310116] Call Trace: [ 960.310390] [<ffffffff89e27040>] ? generic_permission+0x110/0x1d0 [ 960.311113] [<ffffffff8a314f49>] schedule_preempt_disabled+0x29/0x70 [ 960.312040] [<ffffffff8a312d07>] __mutex_lock_slowpath+0xc7/0x1d0 [ 960.312870] [<ffffffff8a3120ef>] mutex_lock+0x1f/0x2f [ 960.313448] [<ffffffff89e43612>] vfs_setxattr+0x62/0xc0 [ 960.314029] [<ffffffff89e4379c>] setxattr+0x12c/0x1c0 [ 960.314591] [<ffffffff89df6ec2>] ? kmem_cache_free+0x1e2/0x200 [ 960.315167] [<ffffffff89e2ccdd>] ? putname+0x3d/0x60 [ 960.315661] [<ffffffff89e2def2>] ? user_path_at_empty+0x72/0xc0 [ 960.316354] [<ffffffff89e1e418>] ? __sb_start_write+0x58/0x110 [ 960.316947] [<ffffffff8a31b57c>] ? __do_page_fault+0x1bc/0x4f0 [ 960.317592] [<ffffffff89e43a97>] SyS_setxattr+0xb7/0x100 [ 960.318115] [<ffffffff8a320795>] system_call_fastpath+0x1c/0x21 [ 960.318753] [<ffffffff8a3206e1>] ? system_call_after_swapgs+0xae/0x146 [ 960.319582] INFO: task ln:5444 blocked for more than 120 seconds. [ 960.320167] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 960.320996] ln D ffff8831abf86eb0 0 5444 10267 0x00000080 [ 960.321733] Call Trace: [ 960.322040] [<ffffffff89e2a2f2>] ? path_lookupat+0x122/0x8b0 [ 960.322603] [<ffffffff89df5a59>] ? ___slab_alloc+0x209/0x4f0 [ 960.323223] [<ffffffff8a314f49>] schedule_preempt_disabled+0x29/0x70 [ 960.323886] [<ffffffff8a312d07>] __mutex_lock_slowpath+0xc7/0x1d0 [ 960.324486] [<ffffffff8a3120ef>] mutex_lock+0x1f/0x2f [ 960.325130] [<ffffffff89e2abc5>] filename_create+0x85/0x180 [ 960.325688] [<ffffffff89df7b9b>] ? kmem_cache_alloc+0x19b/0x1f0 [ 960.326313] [<ffffffff89e2cd4f>] ? getname_flags+0x4f/0x1a0 [ 960.326988] [<ffffffff89e2cdc4>] ? getname_flags+0xc4/0x1a0 [ 960.327537] [<ffffffff89e2d071>] user_path_create+0x41/0x60 [ 960.328165] [<ffffffff89e2e5e8>] SyS_symlinkat+0x58/0x110 [ 960.328705] [<ffffffff8a3206d5>] ? system_call_after_swapgs+0xa2/0x146 [ 960.329367] [<ffffffff8a3206e1>] ? system_call_after_swapgs+0xae/0x146 [ 960.330116] [<ffffffff89e2e6b6>] SyS_symlink+0x16/0x20 [ 960.330631] [<ffffffff8a320795>] system_call_fastpath+0x1c/0x21 [ 960.331304] [<ffffffff8a3206e1>] ? system_call_after_swapgs+0xae/0x146 [ 960.332054] INFO: task lfs:10586 blocked for more than 120 seconds. [ 960.332725] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 960.333471] lfs D ffff8831abe38fd0 0 10586 10241 0x00000080 [ 960.334225] Call Trace: [ 960.334539] [<ffffffffc0d167e2>] ? ll_dcompare+0x72/0x2e0 [lustre] [ 960.335222] [<ffffffff8a314f49>] schedule_preempt_disabled+0x29/0x70 [ 960.335853] [<ffffffff8a312d07>] __mutex_lock_slowpath+0xc7/0x1d0 [ 960.336513] [<ffffffff8a3120ef>] mutex_lock+0x1f/0x2f [ 960.337026] [<ffffffffc0d32a9f>] ll_migrate+0x16f/0xcc0 [lustre] [ 960.337669] [<ffffffffc0d21a35>] ll_dir_ioctl+0x59c5/0x6560 [lustre] [ 960.338587] [<ffffffff89dc5a9c>] ? handle_pte_fault+0x2dc/0xc30 [ 960.339201] [<ffffffff89e3dde4>] ? mntput+0x24/0x40 [ 960.339693] [<ffffffff89dc7c3d>] ? handle_mm_fault+0x39d/0x9b0 [ 960.340288] [<ffffffff89e30350>] do_vfs_ioctl+0x350/0x560 [ 960.340855] [<ffffffff8a31b57c>] ? __do_page_fault+0x1bc/0x4f0 [ 960.341908] [<ffffffff89e30601>] SyS_ioctl+0xa1/0xc0 [ 960.342852] [<ffffffff8a3206d5>] ? system_call_after_swapgs+0xa2/0x146 [ 960.343595] [<ffffffff8a320795>] system_call_fastpath+0x1c/0x21 [ 960.344181] [<ffffffff8a3206e1>] ? system_call_after_swapgs+0xae/0x146 [ 960.345175] INFO: task cat:10688 blocked for more than 120 seconds. [ 960.346073] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 960.346939] cat D ffff8831a95acf10 0 10688 10231 0x00000080 [ 960.347930] Call Trace: [ 960.348208] [<ffffffff89ed0a1e>] ? security_capable+0x1e/0x20 [ 960.348956] [<ffffffff8a314f49>] schedule_preempt_disabled+0x29/0x70 [ 960.350080] [<ffffffff8a312d07>] __mutex_lock_slowpath+0xc7/0x1d0 [ 960.350886] [<ffffffff8a3120ef>] mutex_lock+0x1f/0x2f [ 960.351527] [<ffffffff8a30bcde>] lookup_slow+0x33/0xa7 [ 960.352142] [<ffffffff89e2927f>] link_path_walk+0x80f/0x8b0 [ 960.352945] [<ffffffff89e2c465>] path_openat+0xb5/0x640 [ 960.353620] [<ffffffff89cf9172>] ? ktime_get+0x52/0xe0 [ 960.354387] [<ffffffffc0d285ef>] ? ll_getattr+0x22f/0x580 [lustre] [ 960.355148] [<ffffffff89e2e01d>] do_filp_open+0x4d/0xb0 [ 960.355707] [<ffffffff89e3b4c4>] ? __alloc_fd+0xc4/0x170 [ 960.356563] [<ffffffff89e1a327>] do_sys_open+0x137/0x240 [ 960.357315] [<ffffffff8a3206d5>] ? system_call_after_swapgs+0xa2/0x146 [ 960.358497] [<ffffffff89e1a44e>] SyS_open+0x1e/0x20 [ 960.359026] [<ffffffff8a320795>] system_call_fastpath+0x1c/0x21 [ 960.359646] [<ffffffff8a3206e1>] ? system_call_after_swapgs+0xae/0x146 [ 960.360691] INFO: task chmod:11121 blocked for more than 120 seconds. [ 960.361650] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 960.362816] chmod D ffff8831a32f8fd0 0 11121 10234 0x00000080 [ 960.363567] Call Trace: [ 960.363824] [<ffffffff89e2cd4f>] ? getname_flags+0x4f/0x1a0 [ 960.364484] [<ffffffff8a314f49>] schedule_preempt_disabled+0x29/0x70 [ 960.365104] [<ffffffff8a312d07>] __mutex_lock_slowpath+0xc7/0x1d0 [ 960.365787] [<ffffffff8a3120ef>] mutex_lock+0x1f/0x2f [ 960.366323] [<ffffffff89e189c8>] chmod_common+0x78/0x160 [ 960.366924] [<ffffffff8a31b57c>] ? __do_page_fault+0x1bc/0x4f0 [ 960.367607] [<ffffffff89e19d17>] SyS_fchmodat+0x57/0xc0 [ 960.368125] [<ffffffff8a320795>] system_call_fastpath+0x1c/0x21 [ 960.368777] [<ffffffff8a3206e1>] ? system_call_after_swapgs+0xae/0x146 [ 990.121506] LustreError: 4369:0:(llite_lib.c:1559:ll_md_setattr()) md_setattr fails: rc = -4
On MDS2, 4 (vm5), we see the following three call traces in the console log:
[ 259.348108] Lustre: DEBUG MARKER: == racer test 1: racer on clients: onyx-30vm1.onyx.whamcloud.com,onyx-30vm2 DURATION=300 ============= 16:25:55 (1536596755) [ 351.363812] LNet: Service thread pid 22468 was inactive for 62.20s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes: [ 351.367067] Pid: 22468, comm: mdt00_009 3.10.0-862.9.1.el7_lustre.x86_64 #1 SMP Thu Aug 30 12:08:10 UTC 2018 [ 351.368822] Call Trace: [ 351.369374] [<ffffffffc0d3df61>] ldlm_completion_ast+0x5b1/0x920 [ptlrpc] [ 351.370937] [<ffffffffc0d3ed83>] ldlm_cli_enqueue_local+0x233/0x860 [ptlrpc] [ 351.372390] [<ffffffffc10e63d7>] mdt_object_local_lock+0x4e7/0xb20 [mdt] [ 351.373833] [<ffffffffc10e6a80>] mdt_object_lock_internal+0x70/0x330 [mdt] [ 351.375390] [<ffffffffc10e6d60>] mdt_object_lock+0x20/0x30 [mdt] [ 351.376680] [<ffffffffc112186b>] mdt_brw_enqueue+0x44b/0x760 [mdt] [ 351.377955] [<ffffffffc10d44bf>] mdt_intent_brw+0x1f/0x30 [mdt] [ 351.379141] [<ffffffffc10ec9c8>] mdt_intent_policy+0x2f8/0xd10 [mdt] [ 351.380500] [<ffffffffc0d24e9e>] ldlm_lock_enqueue+0x34e/0xa50 [ptlrpc] [ 351.381929] [<ffffffffc0d4d503>] ldlm_handle_enqueue0+0x903/0x1520 [ptlrpc] [ 351.383341] [<ffffffffc0dd39b2>] tgt_enqueue+0x62/0x210 [ptlrpc] [ 351.384680] [<ffffffffc0ddc2fa>] tgt_request_handle+0xaea/0x1580 [ptlrpc] [ 351.386091] [<ffffffffc0d7f48b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc] [ 351.387592] [<ffffffffc0d82cc4>] ptlrpc_main+0xb14/0x1fb0 [ptlrpc] [ 351.388691] [<ffffffffb9abb621>] kthread+0xd1/0xe0 [ 351.389324] [<ffffffffba1205f7>] ret_from_fork_nospec_end+0x0/0x39 [ 351.390034] [<ffffffffffffffff>] 0xffffffffffffffff [ 351.390691] LustreError: dumping log to /tmp/lustre-log.1536596849.22468 [ 351.992686] Pid: 22484, comm: mdt00_019 3.10.0-862.9.1.el7_lustre.x86_64 #1 SMP Thu Aug 30 12:08:10 UTC 2018 [ 351.994620] Call Trace: [ 351.995150] [<ffffffffc0d3df61>] ldlm_completion_ast+0x5b1/0x920 [ptlrpc] [ 351.996680] [<ffffffffc0d3ed83>] ldlm_cli_enqueue_local+0x233/0x860 [ptlrpc] [ 351.998519] [<ffffffffc10e63d7>] mdt_object_local_lock+0x4e7/0xb20 [mdt] [ 352.000268] [<ffffffffc10e6a80>] mdt_object_lock_internal+0x70/0x330 [mdt] [ 352.001606] [<ffffffffc10e8f1a>] mdt_object_find_lock+0x6a/0x1a0 [mdt] [ 352.002977] [<ffffffffc1104b4e>] mdt_reint_setxattr+0x1ce/0xf20 [mdt] [ 352.004283] [<ffffffffc1103293>] mdt_reint_rec+0x83/0x210 [mdt] [ 352.005561] [<ffffffffc10e21e2>] mdt_reint_internal+0x6b2/0xa80 [mdt] [ 352.006896] [<ffffffffc10ed447>] mdt_reint+0x67/0x140 [mdt] [ 352.007959] [<ffffffffc0ddc2fa>] tgt_request_handle+0xaea/0x1580 [ptlrpc] [ 352.009256] [<ffffffffc0d7f48b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc] [ 352.010650] [<ffffffffc0d82cc4>] ptlrpc_main+0xb14/0x1fb0 [ptlrpc] [ 352.011847] [<ffffffffb9abb621>] kthread+0xd1/0xe0 [ 352.012767] [<ffffffffba1205f7>] ret_from_fork_nospec_end+0x0/0x39 [ 352.013960] [<ffffffffffffffff>] 0xffffffffffffffff [ 352.899805] Pid: 22464, comm: mdt00_005 3.10.0-862.9.1.el7_lustre.x86_64 #1 SMP Thu Aug 30 12:08:10 UTC 2018 [ 352.901685] Call Trace: [ 352.902186] [<ffffffffc0d3df61>] ldlm_completion_ast+0x5b1/0x920 [ptlrpc] [ 352.903489] [<ffffffffc0d3ed83>] ldlm_cli_enqueue_local+0x233/0x860 [ptlrpc] [ 352.905019] [<ffffffffc10e63d7>] mdt_object_local_lock+0x4e7/0xb20 [mdt] [ 352.906256] [<ffffffffc10e6a80>] mdt_object_lock_internal+0x70/0x330 [mdt] [ 352.907593] [<ffffffffc10e6df7>] mdt_object_lock_try+0x27/0xb0 [mdt] [ 352.908889] [<ffffffffc10e8547>] mdt_getattr_name_lock+0x1297/0x1c00 [mdt] [ 352.910163] [<ffffffffc10efae5>] mdt_intent_getattr+0x2b5/0x480 [mdt] [ 352.911631] [<ffffffffc10ec9c8>] mdt_intent_policy+0x2f8/0xd10 [mdt] [ 352.912921] [<ffffffffc0d24e9e>] ldlm_lock_enqueue+0x34e/0xa50 [ptlrpc] [ 352.914176] [<ffffffffc0d4d503>] ldlm_handle_enqueue0+0x903/0x1520 [ptlrpc] [ 352.915530] [<ffffffffc0dd39b2>] tgt_enqueue+0x62/0x210 [ptlrpc] [ 352.916805] [<ffffffffc0ddc2fa>] tgt_request_handle+0xaea/0x1580 [ptlrpc] [ 352.918088] [<ffffffffc0d7f48b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc] [ 352.919538] [<ffffffffc0d82cc4>] ptlrpc_main+0xb14/0x1fb0 [ptlrpc] [ 352.920825] [<ffffffffb9abb621>] kthread+0xd1/0xe0 [ 352.921765] [<ffffffffba1205f7>] ret_from_fork_nospec_end+0x0/0x39 [ 352.922967] [<ffffffffffffffff>] 0xffffffffffffffff [ 352.924148] LustreError: dumping log to /tmp/lustre-log.1536596850.22464 [ 353.155770] Pid: 14392, comm: mdt00_002 3.10.0-862.9.1.el7_lustre.x86_64 #1 SMP Thu Aug 30 12:08:10 UTC 2018 [ 353.157590] Call Trace: [ 353.158104] [<ffffffffc0d3df61>] ldlm_completion_ast+0x5b1/0x920 [ptlrpc] [ 353.159532] [<ffffffffc0d3ed83>] ldlm_cli_enqueue_local+0x233/0x860 [ptlrpc] [ 353.160957] [<ffffffffc10e63d7>] mdt_object_local_lock+0x4e7/0xb20 [mdt] [ 353.162198] [<ffffffffc10e6a80>] mdt_object_lock_internal+0x70/0x330 [mdt] [ 353.163514] [<ffffffffc10e6df7>] mdt_object_lock_try+0x27/0xb0 [mdt] [ 353.164788] [<ffffffffc10e8547>] mdt_getattr_name_lock+0x1297/0x1c00 [mdt] [ 353.166069] [<ffffffffc10efae5>] mdt_intent_getattr+0x2b5/0x480 [mdt] [ 353.167514] [<ffffffffc10ec9c8>] mdt_intent_policy+0x2f8/0xd10 [mdt] [ 353.168828] [<ffffffffc0d24e9e>] ldlm_lock_enqueue+0x34e/0xa50 [ptlrpc] [ 353.170114] [<ffffffffc0d4d503>] ldlm_handle_enqueue0+0x903/0x1520 [ptlrpc] [ 353.171477] [<ffffffffc0dd39b2>] tgt_enqueue+0x62/0x210 [ptlrpc] [ 353.172752] [<ffffffffc0ddc2fa>] tgt_request_handle+0xaea/0x1580 [ptlrpc] [ 353.174046] [<ffffffffc0d7f48b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc] [ 353.175514] [<ffffffffc0d82cc4>] ptlrpc_main+0xb14/0x1fb0 [ptlrpc] [ 353.176790] [<ffffffffb9abb621>] kthread+0xd1/0xe0 [ 353.177702] [<ffffffffba1205f7>] ret_from_fork_nospec_end+0x0/0x39 [ 353.179138] [<ffffffffffffffff>] 0xffffffffffffffff [ 354.435795] Pid: 22466, comm: mdt00_007 3.10.0-862.9.1.el7_lustre.x86_64 #1 SMP Thu Aug 30 12:08:10 UTC 2018 [ 354.437623] Call Trace: [ 354.438121] [<ffffffffc0d3df61>] ldlm_completion_ast+0x5b1/0x920 [ptlrpc] [ 354.439416] [<ffffffffc0d3ed83>] ldlm_cli_enqueue_local+0x233/0x860 [ptlrpc] [ 354.440744] [<ffffffffc10e63d7>] mdt_object_local_lock+0x4e7/0xb20 [mdt] [ 354.442189] [<ffffffffc10e6a80>] mdt_object_lock_internal+0x70/0x330 [mdt] [ 354.443492] [<ffffffffc10e6df7>] mdt_object_lock_try+0x27/0xb0 [mdt] [ 354.444784] [<ffffffffc10e8547>] mdt_getattr_name_lock+0x1297/0x1c00 [mdt] [ 354.446055] [<ffffffffc10efae5>] mdt_intent_getattr+0x2b5/0x480 [mdt] [ 354.447309] [<ffffffffc10ec9c8>] mdt_intent_policy+0x2f8/0xd10 [mdt] [ 354.448497] [<ffffffffc0d24e9e>] ldlm_lock_enqueue+0x34e/0xa50 [ptlrpc] [ 354.450007] [<ffffffffc0d4d503>] ldlm_handle_enqueue0+0x903/0x1520 [ptlrpc] [ 354.451294] [<ffffffffc0dd39b2>] tgt_enqueue+0x62/0x210 [ptlrpc] [ 354.452545] [<ffffffffc0ddc2fa>] tgt_request_handle+0xaea/0x1580 [ptlrpc] [ 354.453934] [<ffffffffc0d7f48b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc] [ 354.455351] [<ffffffffc0d82cc4>] ptlrpc_main+0xb14/0x1fb0 [ptlrpc] [ 354.456626] [<ffffffffb9abb621>] kthread+0xd1/0xe0 [ 354.457598] [<ffffffffba1205f7>] ret_from_fork_nospec_end+0x0/0x39 [ 354.458800] [<ffffffffffffffff>] 0xffffffffffffffff [ 354.459764] LustreError: dumping log to /tmp/lustre-log.1536596852.22466 [ 355.715812] LNet: Service thread pid 22486 was inactive for 62.01s. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one. [ 355.718240] LustreError: dumping log to /tmp/lustre-log.1536596853.22486 [ 355.843781] LNet: Service thread pid 22494 was inactive for 62.11s. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one. [ 390.164855] LustreError: 14386:0:(ldlm_lockd.c:251:expired_lock_main()) ### lock callback timer expired after 99s: evicting client at 10.2.4.94@tcp ns: mdt-lustre-MDT0001_UUID lock: ffff92c1dc98ed80/0x1d42b9ca91f94e31 lrc: 3/0,0 mode: PW/PW res: [0x240000402:0x5d:0x0].0x0 bits 0x40/0x0 rrc: 12 type: IBT flags: 0x60200400000020 nid: 10.2.4.94@tcp remote: 0xef3380579c7186d1 expref: 88 pid: 22490 timeout: 390 lvb_type: 0 [ 390.173809] LustreError: 22468:0:(ldlm_lockd.c:1342:ldlm_handle_enqueue0()) ### lock on destroyed export ffff92c1dd765400 ns: mdt-lustre-MDT0001_UUID lock: ffff92c1dc98f680/0x1d42b9ca91f94ed2 lrc: 3/0,0 mode: PW/PW res: [0x240000402:0x5d:0x0].0x0 bits 0x40/0x0 rrc: 9 type: IBT flags: 0x50200000000000 nid: 10.2.4.94@tcp remote: 0xef3380579c718717 expref: 18 pid: 22468 timeout: 0 lvb_type: 0 [ 390.179772] LNet: Service thread pid 22468 completed after 101.02s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources). [ 390.180082] LNet: Service thread pid 22484 completed after 100.40s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources). [ 390.180474] LNet: Service thread pid 22466 completed after 97.81s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources). [ 390.180626] LustreError: 14392:0:(ldlm_lockd.c:1342:ldlm_handle_enqueue0()) ### lock on destroyed export ffff92c1dd765400 ns: mdt-lustre-MDT0001_UUID lock: ffff92c1dd03f8c0/0x1d42b9ca91f96aa8 lrc: 3/0,0 mode: PR/PR res: [0x240000402:0x5d:0x0].0x0 bits 0x1b/0x0 rrc: 7 type: IBT flags: 0x50200000000000 nid: 10.2.4.94@tcp remote: 0xef3380579c7192d9 expref: 4 pid: 14392 timeout: 0 lvb_type: 0 [ 390.180862] LNet: Service thread pid 14392 completed after 99.10s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources). [ 390.181233] LNet: Service thread pid 22464 completed after 99.30s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources). [ 390.181528] LNet: Service thread pid 22486 completed after 96.47s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources). [ 390.181813] LNet: Service thread pid 22494 completed after 96.45s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources). [ 468.993496] LNetError: 14367:0:(lib-msg.c:794:lnet_is_health_check()) Msg is in inconsistent state, don't perform health checking (-125, 0) [ 468.994385] LNetError: 14366:0:(lib-msg.c:794:lnet_is_health_check()) Msg is in inconsistent state, don't perform health checking (-125, 0) [ 555.134662] LustreError: 14422:0:(llog_cat.c:801:llog_cat_cancel_records()) lustre-MDT0000-osp-MDT0001: fail to cancel 1 of 1 llog-records: rc = -116 [ 555.136979] LustreError: 14422:0:(llog_cat.c:801:llog_cat_cancel_records()) lustre-MDT0000-osp-MDT0001: fail to cancel 1 of 1 llog-records: rc = -116 [ 555.139399] LustreError: 14422:0:(llog_cat.c:801:llog_cat_cancel_records()) lustre-MDT0000-osp-MDT0001: fail to cancel 1 of 1 llog-records: rc = -116 [ 555.141809] LustreError: 14422:0:(llog_cat.c:801:llog_cat_cancel_records()) lustre-MDT0000-osp-MDT0001: fail to cancel 1 of 1 llog-records: rc = -116 [ 555.144689] LustreError: 14422:0:(llog_cat.c:801:llog_cat_cancel_records()) lustre-MDT0000-osp-MDT0001: fail to cancel 1 of 1 llog-records: rc = -116
Description continued in the first comment due to JIRA description line limits.