[LU-11751] racer deadlocks due to DOM glimpse request Created: 10/Dec/18  Updated: 16/Jan/22  Resolved: 16/Jan/22

Status: Closed
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.0, Lustre 2.12.2
Fix Version/s: None

Type: Bug Priority: Major
Reporter: James Nunez (Inactive) Assignee: Mikhail Pershin
Resolution: Cannot Reproduce Votes: 0
Labels: None

Attachments: Text File fir-md1-s1-kernel-20190428.log     Text File fir-md1-s2-kernel-20190428.log    
Issue Links:
Related
is related to LU-11359 racer test 1 times out with client hu... Resolved
is related to LU-11285 don't stop on the first blocked lock ... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

racer test_1 hangs in locking. This issue looks a lot like LU-10852, but the stack trace is a bit different and this failure is not in a DNE environment. So, opening a new ticket.

Looking at the logs at https://testing.whamcloud.com/test_sets/1c19f7d0-f9d7-11e8-b216-52540065bddc, we can see that many processes on the clients are blocked. From client 2 (vm9), in the client console we see

[46895.114548] LustreError: 11-0: lustre-MDT0000-mdc-ffff9d96fa9a7800: operation ldlm_enqueue to node 10.2.8.143@tcp failed: rc = -107
[46895.115945] Lustre: lustre-MDT0000-mdc-ffff9d96fa9a7800: Connection to lustre-MDT0000 (at 10.2.8.143@tcp) was lost; in progress operations using this service will wait for recovery to complete
[46895.119762] LustreError: 167-0: lustre-MDT0000-mdc-ffff9d96fa9a7800: This client was evicted by lustre-MDT0000; in progress operations using this service will fail.
[46895.133418] LustreError: 2635:0:(file.c:4393:ll_inode_revalidate_fini()) lustre: revalidate FID [0x200000403:0x942:0x0] error: rc = -5
[46895.140482] LustreError: 2167:0:(vvp_io.c:1495:vvp_io_init()) lustre: refresh file layout [0x200000405:0x70a:0x0] error -108.
[46895.146261] LustreError: 3286:0:(lmv_obd.c:1250:lmv_fid_alloc()) Can't alloc new fid, rc -19
[46895.147487] LustreError: 18765:0:(mdc_locks.c:1257:mdc_intent_getattr_async_interpret()) ldlm_cli_enqueue_fini: -108
[46895.168736] LustreError: 18884:0:(mdc_request.c:1427:mdc_read_page()) lustre-MDT0000-mdc-ffff9d96fa9a7800: [0x200000402:0x23c:0x0] lock enqueue fails: rc = -108
[46895.208950] LustreError: 19579:0:(file.c:216:ll_close_inode_openhandle()) lustre-clilmv-ffff9d96fa9a7800: inode [0x200000402:0x23c:0x0] mdc close failed: rc = -108
[46895.222023] LustreError: 12501:0:(ldlm_resource.c:1146:ldlm_resource_complain()) lustre-MDT0000-mdc-ffff9d96fa9a7800: namespace resource [0x200000402:0x23d:0x0].0x0 (ffff9d96f98e9e40) refcount nonzero (1) after lock cleanup; forcing cleanup.
[46895.633613] LustreError: 13031:0:(file.c:4393:ll_inode_revalidate_fini()) lustre: revalidate FID [0x200000007:0x1:0x0] error: rc = -108
[46895.634939] LustreError: 13031:0:(file.c:4393:ll_inode_revalidate_fini()) Skipped 609 previous similar messages
[46895.635482] Lustre: lustre-MDT0000-mdc-ffff9d96fa9a7800: Connection restored to 10.2.8.143@tcp (at 10.2.8.143@tcp)
[46920.333393] INFO: task setfattr:32731 blocked for more than 120 seconds.
[46920.334207] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[46920.335014] setfattr        D ffff9d9695c36eb0     0 32731    638 0x00000080
[46920.335773] Call Trace:
[46920.336068]  [<ffffffffae22ae2e>] ? generic_permission+0x15e/0x1d0
[46920.336704]  [<ffffffffae719e59>] schedule_preempt_disabled+0x29/0x70
[46920.337337]  [<ffffffffae717c17>] __mutex_lock_slowpath+0xc7/0x1d0
[46920.337958]  [<ffffffffae716fff>] mutex_lock+0x1f/0x2f
[46920.338471]  [<ffffffffae24700f>] vfs_removexattr+0x5f/0x130
[46920.339043]  [<ffffffffae247135>] removexattr+0x55/0x80
[46920.339578]  [<ffffffffae230a7d>] ? putname+0x3d/0x60
[46920.340101]  [<ffffffffae231c92>] ? user_path_at_empty+0x72/0xc0
[46920.340711]  [<ffffffffae2221c8>] ? __sb_start_write+0x58/0x110
[46920.341297]  [<ffffffffae72056c>] ? __do_page_fault+0x1bc/0x4f0
[46920.341900]  [<ffffffffae241d9c>] ? mnt_want_write+0x2c/0x50
[46920.342455]  [<ffffffffae247f34>] SyS_removexattr+0x94/0xd0
[46920.343017]  [<ffffffffae72579b>] system_call_fastpath+0x22/0x27
[46920.343616]  [<ffffffffae7256e1>] ? system_call_after_swapgs+0xae/0x146
[46920.344266] INFO: task cp:3161 blocked for more than 120 seconds.
[46920.344878] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[46920.345642] cp              D ffff9d96e3c45ee0     0  3161    645 0x00000080
[46920.346377] Call Trace:
[46920.346659]  [<ffffffffae2d5152>] ? security_inode_permission+0x22/0x30
[46920.347301]  [<ffffffffae719e59>] schedule_preempt_disabled+0x29/0x70
[46920.347946]  [<ffffffffae717c17>] __mutex_lock_slowpath+0xc7/0x1d0
[46920.348543]  [<ffffffffae716fff>] mutex_lock+0x1f/0x2f
[46920.349083]  [<ffffffffae710bee>] lookup_slow+0x33/0xa7
[46920.349608]  [<ffffffffae22e7a8>] path_lookupat+0x838/0x8b0
[46920.350178]  [<ffffffffae1970cb>] ? unlock_page+0x2b/0x30
[46920.350732]  [<ffffffffae1faf39>] ? kmem_cache_alloc+0x179/0x1f0
[46920.351320]  [<ffffffffae230aef>] ? getname_flags+0x4f/0x1a0
[46920.351894]  [<ffffffffae22e84b>] filename_lookup+0x2b/0xc0
[46920.352442]  [<ffffffffae231c87>] user_path_at_empty+0x67/0xc0
[46920.353036]  [<ffffffffae127b72>] ? from_kgid_munged+0x12/0x20
[46920.353630]  [<ffffffffae2251df>] ? cp_new_stat+0x14f/0x180
[46920.354180]  [<ffffffffae231cf1>] user_path_at+0x11/0x20
[46920.354718]  [<ffffffffae224cd3>] vfs_fstatat+0x63/0xc0
[46920.355234]  [<ffffffffae22523e>] SYSC_newstat+0x2e/0x60
[46920.355776]  [<ffffffffae7256e1>] ? system_call_after_swapgs+0xae/0x146
[46920.356427]  [<ffffffffae7256d5>] ? system_call_after_swapgs+0xa2/0x146
[46920.357093]  [<ffffffffae7256e1>] ? system_call_after_swapgs+0xae/0x146
[46920.357748]  [<ffffffffae7256d5>] ? system_call_after_swapgs+0xa2/0x146
[46920.358394]  [<ffffffffae7256e1>] ? system_call_after_swapgs+0xae/0x146
[46920.359050]  [<ffffffffae7256d5>] ? system_call_after_swapgs+0xa2/0x146
[46920.359715]  [<ffffffffae7256e1>] ? system_call_after_swapgs+0xae/0x146
[46920.360355]  [<ffffffffae7256d5>] ? system_call_after_swapgs+0xa2/0x146
[46920.361018]  [<ffffffffae7256e1>] ? system_call_after_swapgs+0xae/0x146
[46920.361676]  [<ffffffffae7256d5>] ? system_call_after_swapgs+0xa2/0x146
[46920.362314]  [<ffffffffae7256e1>] ? system_call_after_swapgs+0xae/0x146
[46920.362967]  [<ffffffffae22551e>] SyS_newstat+0xe/0x10
[46920.363469]  [<ffffffffae72579b>] system_call_fastpath+0x22/0x27
[46920.364082]  [<ffffffffae7256e1>] ? system_call_after_swapgs+0xae/0x146
[46920.364739] INFO: task ln:3169 blocked for more than 120 seconds.
[46920.365340] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[46920.366106] ln              D ffff9d96d8966eb0     0  3169    616 0x00000080
[46920.366863] Call Trace:
[46920.367122]  [<ffffffffae22e092>] ? path_lookupat+0x122/0x8b0
[46920.367698]  [<ffffffffae1f8e19>] ? ___slab_alloc+0x209/0x4f0
[46920.368264]  [<ffffffffae719e59>] schedule_preempt_disabled+0x29/0x70

On client 1 (vm10), we see dir_create, ls, mkdir, and other processes blocked

[46775.380960] LustreError: 167-0: lustre-MDT0000-mdc-ffff9b35bbd7d000: This client was evicted by lustre-MDT0000; in progress operations using this service will fail.
[46775.392616] LustreError: 14241:0:(file.c:4393:ll_inode_revalidate_fini()) lustre: revalidate FID [0x200000401:0x122:0x0] error: rc = -5
[46775.394009] LustreError: 14241:0:(file.c:4393:ll_inode_revalidate_fini()) Skipped 2 previous similar messages
[46775.395224] LustreError: 30588:0:(mdc_locks.c:1257:mdc_intent_getattr_async_interpret()) ldlm_cli_enqueue_fini: -108
[46775.402186] LustreError: 18530:0:(file.c:216:ll_close_inode_openhandle()) lustre-clilmv-ffff9b35bbd7d000: inode [0x200000404:0x148:0x0] mdc close failed: rc = -108
[46775.403682] LustreError: 18530:0:(file.c:216:ll_close_inode_openhandle()) Skipped 1 previous similar message
[46775.415258] LustreError: 17871:0:(llite_lib.c:1547:ll_md_setattr()) md_setattr fails: rc = -108
[46775.421674] LustreError: 18956:0:(lmv_obd.c:1250:lmv_fid_alloc()) Can't alloc new fid, rc -19
[46775.431989] Lustre: lustre-MDT0000-mdc-ffff9b35bbd7d000: Connection restored to 10.2.8.143@tcp (at 10.2.8.143@tcp)
[46920.246223] INFO: task dir_create.sh:4489 blocked for more than 120 seconds.
[46920.247112] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[46920.247952] dir_create.sh   D ffff9b35baf1af70     0  4489   4447 0x00000080
[46920.248755] Call Trace:
[46920.249111]  [<ffffffffc0ea57e2>] ? ll_dcompare+0x72/0x2e0 [lustre]
[46920.249780]  [<ffffffff94319e59>] schedule_preempt_disabled+0x29/0x70
[46920.250439]  [<ffffffff94317c17>] __mutex_lock_slowpath+0xc7/0x1d0
[46920.251074]  [<ffffffff94316fff>] mutex_lock+0x1f/0x2f
[46920.251612]  [<ffffffff94310bee>] lookup_slow+0x33/0xa7
[46920.252169]  [<ffffffff93e2d01f>] link_path_walk+0x80f/0x8b0
[46920.252746]  [<ffffffff93e30205>] path_openat+0xb5/0x640
[46920.253285]  [<ffffffff93e31dbd>] do_filp_open+0x4d/0xb0
[46920.253848]  [<ffffffff93e3f1e4>] ? __alloc_fd+0xc4/0x170
[46920.254401]  [<ffffffff93e1e0d7>] do_sys_open+0x137/0x240
[46920.254959]  [<ffffffff943256d5>] ? system_call_after_swapgs+0xa2/0x146
[46920.255631]  [<ffffffff93e1e1fe>] SyS_open+0x1e/0x20
[46920.256150]  [<ffffffff9432579b>] system_call_fastpath+0x22/0x27
[46920.256760]  [<ffffffff943256e1>] ? system_call_after_swapgs+0xae/0x146
[46920.257485] INFO: task ls:6624 blocked for more than 120 seconds.
[46920.258108] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[46920.258901] ls              D ffff9b359eb6eeb0     0  6624   4495 0x00000080
[46920.259661] Call Trace:
[46920.259932]  [<ffffffff93cc6081>] ? in_group_p+0x31/0x40
[46920.260487]  [<ffffffff94319e59>] schedule_preempt_disabled+0x29/0x70
[46920.261139]  [<ffffffff94317c17>] __mutex_lock_slowpath+0xc7/0x1d0
[46920.261777]  [<ffffffff94316fff>] mutex_lock+0x1f/0x2f
[46920.262300]  [<ffffffff93e2f11f>] do_last+0x28f/0x12c0
[46920.262826]  [<ffffffff93e30227>] path_openat+0xd7/0x640
[46920.263362]  [<ffffffff93e30aef>] ? getname_flags+0x4f/0x1a0
[46920.263957]  [<ffffffffc0eb2159>] ? ll_file_data_put+0x89/0x180 [lustre]
[46920.264625]  [<ffffffff93e31dbd>] do_filp_open+0x4d/0xb0
[46920.265173]  [<ffffffff93e3f1e4>] ? __alloc_fd+0xc4/0x170
[46920.265739]  [<ffffffff93e1e0d7>] do_sys_open+0x137/0x240
[46920.266281]  [<ffffffff943256d5>] ? system_call_after_swapgs+0xa2/0x146
[46920.266939]  [<ffffffff93e1e214>] SyS_openat+0x14/0x20
[46920.267473]  [<ffffffff9432579b>] system_call_fastpath+0x22/0x27
[46920.268081]  [<ffffffff943256e1>] ? system_call_after_swapgs+0xae/0x146
[46920.268778] INFO: task ln:8693 blocked for more than 120 seconds.
[46920.269388] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[46920.270180] ln              D ffff9b3595468fd0     0  8693   4475 0x00000080
[46920.270940] Call Trace:
[46920.271201]  [<ffffffff93e2e092>] ? path_lookupat+0x122/0x8b0
[46920.271790]  [<ffffffff94319e59>] schedule_preempt_disabled+0x29/0x70

On the MDS (vm12), we see many stack traces like the following

[46559.329558] Lustre: DEBUG MARKER: == racer test 1: racer on clients: onyx-41vm10,onyx-41vm9.onyx.whamcloud.com DURATION=900 ============ 09:34:14 (1544088854)
[46564.017497] LustreError: 5798:0:(mdt_lvb.c:430:mdt_lvbo_fill()) lustre-MDT0000: small buffer size 544 for EA 568 (max_mdsize 568): rc = -34
[46643.846755] LNet: Service thread pid 5790 was inactive for 62.06s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
[46643.848574] Pid: 5790, comm: mdt00_011 3.10.0-862.14.4.el7_lustre.x86_64 #1 SMP Wed Dec 5 03:41:24 UTC 2018
[46643.849563] Call Trace:
[46643.849882]  [<ffffffffc0ef8031>] ldlm_completion_ast+0x5b1/0x920 [ptlrpc]
[46643.850848]  [<ffffffffc0ef8dcc>] ldlm_cli_enqueue_local+0x23c/0x870 [ptlrpc]
[46643.851624]  [<ffffffffc11e951b>] mdt_object_local_lock+0x50b/0xb20 [mdt]
[46643.852657]  [<ffffffffc11e9ba0>] mdt_object_lock_internal+0x70/0x360 [mdt]
[46643.853470]  [<ffffffffc11e9f47>] mdt_object_lock_try+0x27/0xb0 [mdt]
[46643.854167]  [<ffffffffc11eb687>] mdt_getattr_name_lock+0x1287/0x1c30 [mdt]
[46643.854970]  [<ffffffffc11f2ba5>] mdt_intent_getattr+0x2b5/0x480 [mdt]
[46643.855694]  [<ffffffffc11efa08>] mdt_intent_policy+0x2e8/0xd00 [mdt]
[46643.856459]  [<ffffffffc0edeec6>] ldlm_lock_enqueue+0x366/0xa60 [ptlrpc]
[46643.857209]  [<ffffffffc0f078a7>] ldlm_handle_enqueue0+0xa47/0x15a0 [ptlrpc]
[46643.858036]  [<ffffffffc0f8e242>] tgt_enqueue+0x62/0x210 [ptlrpc]
[46643.858866]  [<ffffffffc0f9529a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
[46643.859648]  [<ffffffffc0f3991b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
[46643.860560]  [<ffffffffc0f3d24c>] ptlrpc_main+0xafc/0x1fb0 [ptlrpc]
[46643.861315]  [<ffffffff95cbdf21>] kthread+0xd1/0xe0
[46643.861876]  [<ffffffff963255f7>] ret_from_fork_nospec_end+0x0/0x39
[46643.862590]  [<ffffffffffffffff>] 0xffffffffffffffff
[46643.863234] LustreError: dumping log to /tmp/lustre-log.1544088939.5790
…
[47031.433294] Pid: 5782, comm: mdt00_004 3.10.0-862.14.4.el7_lustre.x86_64 #1 SMP Wed Dec 5 03:41:24 UTC 2018
[47031.434325] Call Trace:
[47031.434609]  [<ffffffffc0ef8031>] ldlm_completion_ast+0x5b1/0x920 [ptlrpc]
[47031.435489]  [<ffffffffc0ef8dcc>] ldlm_cli_enqueue_local+0x23c/0x870 [ptlrpc]
[47031.436322]  [<ffffffffc11e951b>] mdt_object_local_lock+0x50b/0xb20 [mdt]
[47031.437360]  [<ffffffffc11e9ba0>] mdt_object_lock_internal+0x70/0x360 [mdt]
[47031.438194]  [<ffffffffc11ec09a>] mdt_object_find_lock+0x6a/0x1a0 [mdt]
[47031.438983]  [<ffffffffc120aa5e>] mdt_reint_setxattr+0x1ce/0xfd0 [mdt]
[47031.439782]  [<ffffffffc12085d3>] mdt_reint_rec+0x83/0x210 [mdt]
[47031.440507]  [<ffffffffc11e51a3>] mdt_reint_internal+0x6e3/0xaf0 [mdt]
[47031.441242]  [<ffffffffc11f0487>] mdt_reint+0x67/0x140 [mdt]
[47031.441942]  [<ffffffffc0f9529a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
[47031.442833]  [<ffffffffc0f3991b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
[47031.443725]  [<ffffffffc0f3d24c>] ptlrpc_main+0xafc/0x1fb0 [ptlrpc]
[47031.444586]  [<ffffffff95cbdf21>] kthread+0xd1/0xe0
[47031.445234]  [<ffffffff963255f7>] ret_from_fork_nospec_end+0x0/0x39
[47031.445925]  [<ffffffffffffffff>] 0xffffffffffffffff
[47031.446664] LustreError: dumping log to /tmp/lustre-log.1544089327.5782

Additional logs for this issue are at
https://testing.whamcloud.com/sub_tests/04a3b75e-f729-11e8-bfe1-52540065bddc
https://testing.whamcloud.com/sub_tests/24b8ffb6-f9ce-11e8-bb6b-52540065bddc



 Comments   
Comment by Alex Zhuravlev [ 12/Dec/18 ]

I was unable to reproduce this with RACER_ENABLE_DOM=false

 

Comment by Peter Jones [ 12/Dec/18 ]

Mike

Is this something that you already know about?

Peter

Comment by Mikhail Pershin [ 12/Dec/18 ]

It looks like a duplicate of LU-11359.

Comment by Mikhail Pershin [ 27/Apr/19 ]

This looks like LU-11359 but has a different cause. The problem exists with DOM files and is related to glimpse. When a getattr RPC returns to the MDC without the size, llite immediately asks for the size attribute with a glimpse lock request. At that moment the original GETATTR lock is still in a busy state, usually with 0x13 inodebits, while the glimpse is a PR lock with the DOM bit (0x40). The problem is that any other lock enqueued between these two may get stuck behind the first one and block the glimpse in the waiting queue, yet the first lock cannot finish until the glimpse completes. This situation appeared along with DOM because glimpses on OSTs involve a different namespace and resource.

I think the patch from LU-11285 will help with this problem, because locks with non-intersecting ibits will no longer be blocked in the waiting queue, but ideally the glimpse should be done atomically. I also have a 'glimpse-ahead' patch for that, which was considered an optimization, but it will take time to rebase it onto current master and pass testing. That will be done under this ticket.

Comment by Patrick Farrell (Inactive) [ 27/Apr/19 ]

Mike,

Is it not possible to make the PR DOM lock part of the getattr request on MDC?  Or would that cause problems (or at least be wasteful) if there was an existing DOM lock?

Comment by Mikhail Pershin [ 28/Apr/19 ]

Patrick, it already works that way when we CAN return the DOM lock; if we cannot, a separate glimpse is issued to get the size. What I am thinking about is having the server do the glimpse in advance when it cannot return the DOM bit, since a glimpse request is then sure to follow.

Comment by Mikhail Pershin [ 28/Apr/19 ]

But first I have to check how glimpse works right now; my understanding is that it should not be blocked in the waiting queue.

Comment by Stephane Thiell [ 28/Apr/19 ]

I believe we are hitting this issue right now; an example is FID 0x2c001ad81:0xe26:0x0, but there are others (0x13 + 0x40):

Apr 28 10:16:47 fir-md1-s1 kernel: LustreError: 35049:0:(ldlm_request.c:129:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1556471717, 90s ago); not entering recovery in server code, just going back to sleep ns: mdt-fir-MDT0002_UUID lock: ffff8b5012ae7980/0x378007fdc6c24f96 lrc: 3/1,0 mode: --/PR res: [0x2c001ad81:0xe26:0x0].0x0 bits 0x13/0x8 rrc: 143 type: IBT flags: 0x40210400000020 nid: local remote: 0x0 expref: -99 pid: 35049 timeout: 0 lvb_type: 0

Apr 28 10:17:47 fir-md1-s1 kernel: LustreError: 20444:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 149s: evicting client at 10.8.7.17@o2ib6 ns: mdt-fir-MDT0002_UUID lock: ffff8b51f73e72c0/0x378007fdc5f51cf6 lrc: 3/0,0 mode: PW/PW res: [0x2c001ad81:0xe26:0x0].0x0 bits 0x40/0x0 rrc: 143 type: IBT flags: 0x60200400000020 nid: 10.8.7.17@o2ib6 remote: 0x1b8755abd081ffc4 expref: 10 pid: 34487 timeout: 471701 lvb_type: 0
[root@fir-rbh01 ~]# lfs fid2path /fir 0x2c001ad81:0xe26:0x0
/fir/users/jmz/geo_activity/activitySplit/logs/testRunOutput
[root@fir-rbh01 ~]# stat /fir/users/jmz/geo_activity/activitySplit/logs/testRunOutput
  File: ‘/fir/users/jmz/geo_activity/activitySplit/logs/testRunOutput’
  Size: 956       	Blocks: 16         IO Block: 4194304 regular file
Device: e64e03a8h/3863872424d	Inode: 198160228309536294  Links: 1
Access: (0644/-rw-r--r--)  Uid: (331789/     jmz)   Gid: (11886/    euan)
Access: 2019-04-28 01:46:50.000000000 -0700
Modify: 2019-04-28 10:28:15.000000000 -0700
Change: 2019-04-28 10:28:15.000000000 -0700
 Birth: -

followed after some time by the typical ldlm_completion_ast traces.

Even restarting the MDTs doesn't seem to fix the issue... what can we do apart from remounting with abort_recov?
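
(For reference, remounting a single MDT with recovery aborted is typically done along these lines; the device path and mount point below are placeholders, not this system's actual names:)

umount /mnt/mdt3
mount -t lustre -o abort_recov /dev/mapper/mdt3 /mnt/mdt3   # remount the target with recovery aborted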

Comment by Stephane Thiell [ 28/Apr/19 ]

Attached fir-md1-s1-kernel-20190428.log and fir-md1-s2-kernel-20190428.log, which are the kernel logs from both MDSes.

Comment by Stephane Thiell [ 28/Apr/19 ]

I commented in LU-11285; I think we have a great dk trace with +dlmtrace. Hope that helps to fix this definitively! For us, the issue has been resolved (for now) after holding this user's jobs and restarting MDT3 with abort_recov (it didn't work otherwise...).
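
(For reference, a dk trace with dlmtrace enabled is usually gathered roughly as follows; the output path is just an example:)

lctl set_param debug=+dlmtrace   # add DLM tracing to the kernel debug mask
lctl clear                       # optionally empty the debug buffer first
# ... reproduce the hang ...
lctl dk /tmp/dlmtrace.log        # dump the kernel debug buffer to a file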

Comment by Mikhail Pershin [ 29/Apr/19 ]

Stephane, thanks for the logs. I haven't found the reason yet; I'm working on it.

Comment by Stephane Thiell [ 01/May/19 ]

Mike, I just checked, and since this last event (about two days now) I haven't seen any new ldlm_completion_ast call traces at all on either MDS, which is surprising (in a good way!). Before that, we had many of them per day. After the problem reported above I added the patch for LU-11285, so... perhaps it did help. We're currently running 2.12.0 plus the following patches:

5081aa7 LU-10777 dom: disable read-on-open with resend
455d39b LU-11285 ldlm: reprocess whole waiting queue for IBITS
2eec4f8 LU-12018 quota: do not start a thread under memory pressure
1819063 LU-11359 mdt: fix mdt_dom_discard_data() timeouts
3ed10f4 LU-11964 mdc: prevent glimpse lock count grow
565011c LU-12037 mdt: add option for cross-MDT rename
b6be1d9 LU-12065 lnd: increase CQ entries
6b2c97b LU-12037 mdt: call mdt_dom_discard_data() after rename unlock

Also, we're now 100% sure that all clients are running with LU-11359, so async_discard is enabled everywhere; I guess this is important too.

I'll update if things change (which usually occurs shortly after posting a comment here... )

Comment by Stephane Thiell [ 06/May/19 ]

Quick update... our Fir system (running the patches above) has been stable through last week and last weekend. Note that we're "only" running patchset #11 of LU-11359, and I see that it has evolved quite a bit since then...

Comment by Peter Jones [ 11/May/19 ]

This seems encouraging enough to consider this ticket a duplicate (though it is hard to pin down exactly which patch it duplicates).

Comment by James Nunez (Inactive) [ 10/Feb/20 ]

Mike - I think we are still seeing this issue. Would you please take a look at the racer hang at https://testing.whamcloud.com/test_sets/611e867e-4b76-11ea-a1c8-52540065bddc ?

If this is not the same hang, I'll open a new ticket.

I know this ticket was closed as a duplicate, but I don't know what ticket it is duplicating and, thus, can't see if that one is still open.

Comment by Mikhail Pershin [ 10/Feb/20 ]

James, I don't see evidence for that so far: neither the server nor the client has traces of DOM locks, but the server stack traces contain osp_md_object_lock and mdt_rename_lock calls, so I tend to think this is a DNE issue, like LU-12037. Can you check what the mdt.*.enable_remote_rename parameter is on the servers? This is probably a remote-rename DNE issue.
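
(For reference, the parameter named above can be checked on each MDS with a plain lctl query, e.g.:)

lctl get_param mdt.*.enable_remote_rename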
