[LU-14300] sanity-sec test 18 hangs in txg_quiesce on MDS Created: 06/Jan/21 Updated: 15/Oct/21 Resolved: 11/Oct/21 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.14.0 |
| Fix Version/s: | Lustre 2.15.0 |
| Type: | Bug | Priority: | Minor |
| Reporter: | James Nunez (Inactive) | Assignee: | Sergey Cheremencev |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Environment: |
ZFS and single MDS |
||
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
sanity-sec test_18 hangs. It looks like this issue started on 15 MAY 2020 with https://testing.whamcloud.com/test_sets/8514ff93-a5ac-45f0-9584-dff4d4e42eae for Lustre 2.13.53.163.

Looking at the suite_log for https://testing.whamcloud.com/test_sets/6875529f-ecc2-432e-9f20-e8810a08448c, the last thing we see is:

CMD: trevis-201vm1.trevis.whamcloud.com runas -u60003 -g60003 -G60003 lfs quota -q /mnt/lustre
running as uid/gid/euid/egid 60003/60003/60003/60003, groups: 60003
 [lfs] [quota] [-q] [/mnt/lustre]
CMD: trevis-201vm2 chmod 000 /mnt/lustre/d18.sanity-sec
CMD: trevis-201vm2 chmod 001 /mnt/lustre/d18.sanity-sec
CMD: trevis-201vm2 chmod 002 /mnt/lustre/d18.sanity-sec
CMD: trevis-201vm2 chmod 003 /mnt/lustre/d18.sanity-sec
CMD: trevis-201vm2 chmod 004 /mnt/lustre/d18.sanity-sec
CMD: trevis-201vm2 chmod 005 /mnt/lustre/d18.sanity-sec
CMD: trevis-201vm2 chmod 006 /mnt/lustre/d18.sanity-sec

Looking at the MDS (vm4) console log, we see the hang:

[32726.826140] Lustre: mdt00_003: service thread pid 819155 was inactive for 41.957 seconds. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
[32726.826328] Pid: 816983, comm: mdt_rdpg00_000 4.18.0-240.1.1.el8_lustre.x86_64 #1 SMP Wed Dec 9 21:09:51 UTC 2020
[32726.829004] Lustre: Skipped 1 previous similar message
[32726.830561] Call Trace TBD:
[32726.831789] Pid: 819155, comm: mdt00_003 4.18.0-240.1.1.el8_lustre.x86_64 #1 SMP Wed Dec 9 21:09:51 UTC 2020
[32726.833277] Call Trace TBD:
[32808.745683] INFO: task txg_quiesce:816666 blocked for more than 120 seconds.
[32808.746826] Tainted: P OE --------- - - 4.18.0-240.1.1.el8_lustre.x86_64 #1
[32808.748143] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[32808.749328] txg_quiesce D 0 816666 2 0x80004080
[32808.750184] Call Trace:
[32808.750601] __schedule+0x2a6/0x700
[32808.751160] schedule+0x38/0xa0
[32808.751671] cv_wait_common+0xfb/0x130 [spl]
[32808.752325] ? finish_wait+0x80/0x80
[32808.752958] txg_quiesce_thread+0x2ac/0x3a0 [zfs]
[32808.753741] ? txg_sync_thread+0x480/0x480 [zfs]
[32808.754470] ? __thread_exit+0x20/0x20 [spl]
[32808.755136] thread_generic_wrapper+0x6f/0x80 [spl]
[32808.755892] kthread+0x112/0x130
[32808.756397] ? kthread_flush_work_fn+0x10/0x10
[32808.757084] ret_from_fork+0x35/0x40
[32808.757657] INFO: task mdt_rdpg00_000:816983 blocked for more than 120 seconds.
[32808.758764] Tainted: P OE --------- - - 4.18.0-240.1.1.el8_lustre.x86_64 #1
[32808.760077] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[32808.761260] mdt_rdpg00_000 D 0 816983 2 0x80004080
[32808.762114] Call Trace:
[32808.762502] __schedule+0x2a6/0x700
[32808.763046] schedule+0x38/0xa0
[32808.763536] rwsem_down_write_slowpath+0x32d/0x4e0
[32808.764303] ? dmu_tx_callback_register+0x57/0x80 [zfs]
[32808.765124] qti_lqes_write_lock+0x87/0x190 [lquota]
[32808.765896] qmt_dqacq0+0x4eb/0x2380 [lquota]
[32808.766566] ? lqe_locate_find+0x546/0x7c0 [lquota]
[32808.767316] ? qmt_dqacq+0x668/0x790 [lquota]
[32808.767996] qmt_dqacq+0x668/0x790 [lquota]
[32808.768669] mdt_quota_dqacq+0x59/0x120 [mdt]
[32808.769391] tgt_request_handle+0xc78/0x1910 [ptlrpc]
[32808.770205] ptlrpc_server_handle_request+0x31a/0xba0 [ptlrpc]
[32808.771130] ptlrpc_main+0xba4/0x14a0 [ptlrpc]
[32808.771910] ? __schedule+0x2ae/0x700
[32808.772528] ? ptlrpc_register_service+0xfb0/0xfb0 [ptlrpc]
[32808.773381] kthread+0x112/0x130
[32808.773908] ? kthread_flush_work_fn+0x10/0x10
[32808.774601] ret_from_fork+0x35/0x40
[32808.775157] INFO: task qmt_reba_lustre:817021 blocked for more than 120 seconds.
[32808.776291] Tainted: P OE --------- - - 4.18.0-240.1.1.el8_lustre.x86_64 #1
[32808.777627] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[32808.778806] qmt_reba_lustre D 0 817021 2 0x80004080
[32808.779652] Call Trace:
[32808.780033] __schedule+0x2a6/0x700
[32808.780586] schedule+0x38/0xa0
[32808.781080] rwsem_down_write_slowpath+0x32d/0x4e0
[32808.781827] ? cfs_hash_bd_lookup_intent+0x2a/0xc0 [libcfs]
[32808.782688] qmt_reba_thread+0x6ca/0x9b0 [lquota]
[32808.783407] ? qmt_glimpse_lock.isra.19+0xfb0/0xfb0 [lquota]
[32808.784269] kthread+0x112/0x130
[32808.784777] ? kthread_flush_work_fn+0x10/0x10
[32808.785452] ret_from_fork+0x35/0x40
[32808.786011] INFO: task mdt00_003:819155 blocked for more than 120 seconds.
[32808.787045] Tainted: P OE --------- - - 4.18.0-240.1.1.el8_lustre.x86_64 #1
[32808.788345] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[32808.789522] mdt00_003 D 0 819155 2 0x80004080
[32808.790365] Call Trace:
[32808.790760] __schedule+0x2a6/0x700
[32808.791290] ? __wake_up_common+0x7a/0x190
[32808.791922] schedule+0x38/0xa0
[32808.792406] io_schedule+0x12/0x40
[32808.792941] cv_wait_common+0xaf/0x130 [spl]
[32808.793592] ? finish_wait+0x80/0x80
[32808.794184] txg_wait_synced_impl+0xc2/0x110 [zfs]
[32808.794967] txg_wait_synced+0xc/0x40 [zfs]
[32808.795619] osd_trans_stop+0x51f/0x560 [osd_zfs]
[32808.796374] top_trans_stop+0x9f/0xd30 [ptlrpc]
[32808.797089] ? mdd_acl_chmod+0x185/0x4b0 [mdd]
[32808.797791] lod_trans_stop+0x9b/0x340 [lod]
[32808.798453] ? mdd_attr_set+0x54a/0xc20 [mdd]
[32808.799129] mdd_trans_stop+0x29/0x17f [mdd]
[32808.799808] mdt_reint_setattr+0xc7a/0x12a0 [mdt]
[32808.800565] ? nodemap_get_from_exp+0xea/0x270 [ptlrpc]
[32808.801372] mdt_reint_rec+0x11f/0x250 [mdt]
[32808.802044] mdt_reint_internal+0x498/0x780 [mdt]
[32808.802778] mdt_reint+0x5e/0x100 [mdt]
[32808.803400] tgt_request_handle+0xc78/0x1910 [ptlrpc]
[32808.804212] ptlrpc_server_handle_request+0x31a/0xba0 [ptlrpc]
[32808.805135] ptlrpc_main+0xba4/0x14a0 [ptlrpc]
[32808.805824] ? __schedule+0x2ae/0x700
[32808.806412] ? ptlrpc_register_service+0xfb0/0xfb0 [ptlrpc]
[32808.808415] kthread+0x112/0x130
[32808.808950] ? kthread_flush_work_fn+0x10/0x10
[32808.809648] ret_from_fork+0x35/0x40

Logs for more hangs are at |
| Comments |
| Comment by James Nunez (Inactive) [ 06/Jan/21 ] |
|
Sergey - Thanks |
| Comment by Peter Jones [ 18/Jan/21 ] |
|
Sergey, how is your investigation progressing? |
| Comment by Sergey Cheremencev [ 21/Jan/21 ] |
|
It seems qmt is stuck due to problems at the ZFS layer:

[25302.786844] Pid: 15606, comm: mdt_rdpg00_004 3.10.0-1127.19.1.el7.x86_64 #1 SMP Tue Aug 25 17:23:54 UTC 2020
[25302.788312] Call Trace:
[25302.788742] [<ffffffffc065d325>] cv_wait_common+0x125/0x150 [spl]
[25302.789802] [<ffffffffc065d365>] __cv_wait+0x15/0x20 [spl]
[25302.790749] [<ffffffffc07afa5b>] dmu_tx_wait+0x20b/0x3b0 [zfs]
[25302.791765] [<ffffffffc07afc91>] dmu_tx_assign+0x91/0x490 [zfs]
[25302.792780] [<ffffffffc11a0109>] osd_trans_start+0x199/0x440 [osd_zfs]
[25302.793914] [<ffffffffc1159351>] qmt_trans_start_with_slv+0x331/0x7f0 [lquota]
[25302.795114] [<ffffffffc1150c83>] qmt_dqacq0+0x2d3/0x1ac0 [lquota]
[25302.796090] [<ffffffffc1152d77>] qmt_dqacq+0x707/0x810 [lquota]
[25302.797089] [<ffffffffc1332fd2>] mdt_quota_dqacq+0x62/0x150 [mdt]
[25302.798125] [<ffffffffc0fbf6ba>] tgt_request_handle+0x7ea/0x1750 [ptlrpc]
[25302.799602] [<ffffffffc0f5f1c6>] ptlrpc_server_handle_request+0x256/0xb10 [ptlrpc]
[25302.800862] [<ffffffffc0f63d1c>] ptlrpc_main+0xb3c/0x14e0 [ptlrpc]
[25302.801919] [<ffffffffa94c6691>] kthread+0xd1/0xe0
[25302.802755] [<ffffffffa9b92d37>] ret_from_fork_nospec_end+0x0/0x39
[25302.803791] [<ffffffffffffffff>] 0xffffffffffffffff

As qmt_dqacq0 takes the write semaphore, some of the quota threads that also need this semaphore may also hang:

[23041.077700] INFO: task qmt_reba_lustre:31100 blocked for more than 120 seconds.
[23041.078865] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[23041.080041] qmt_reba_lustre D ffff9606d327d230 0 31100 2 0x00000080
[23041.081222] Call Trace:
[23041.081617] [<ffffffffa9b85da9>] schedule+0x29/0x70
[23041.082386] [<ffffffffa9b874e5>] rwsem_down_write_failed+0x215/0x3c0
[23041.083442] [<ffffffffa9796777>] call_rwsem_down_write_failed+0x17/0x30
[23041.084458] [<ffffffffa9b850fd>] down_write+0x2d/0x3d
[23041.085322] [<ffffffffc1157b24>] qmt_reba_thread+0xa54/0xa70 [lquota]
[23041.086332] [<ffffffffc11570d0>] ? qmt_glimpse_lock.isra.15+0xaa0/0xaa0 [lquota]
[23041.087526] [<ffffffffa94c6691>] kthread+0xd1/0xe0
[23041.088287] [<ffffffffa94c65c0>] ? insert_kthread_work+0x40/0x40
[23041.089278] [<ffffffffa9b92d37>] ret_from_fork_nospec_begin+0x21/0x21
[23041.090278] [<ffffffffa94c65c0>] ? insert_kthread_work+0x40/0x40

There are also several hung tasks not related to quota, but related to ZFS:

[22920.963274] INFO: task txg_quiesce:30858 blocked for more than 120 seconds.
[22920.964413] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[22920.965608] txg_quiesce D ffff9606ffc1acc0 0 30858 2 0x00000080
[22920.966876] Call Trace:
[22920.967286] [<ffffffffa94d3820>] ? task_rq_unlock+0x20/0x20
[22920.968202] [<ffffffffa9b85da9>] schedule+0x29/0x70
[22920.968981] [<ffffffffc065d325>] cv_wait_common+0x125/0x150 [spl]
[22920.969995] [<ffffffffa94c7780>] ? wake_up_atomic_t+0x30/0x30
[22920.970890] [<ffffffffc065d365>] __cv_wait+0x15/0x20 [spl]
[22920.971852] [<ffffffffc081165b>] txg_quiesce_thread+0x2cb/0x3c0 [zfs]
[22920.972878] [<ffffffffc0811390>] ? txg_init+0x2b0/0x2b0 [zfs]
[22920.973845] [<ffffffffc0664e03>] thread_generic_wrapper+0x73/0x80 [spl]
[22920.974875] [<ffffffffc0664d90>] ? __thread_exit+0x20/0x20 [spl]
[22920.975863] [<ffffffffa94c6691>] kthread+0xd1/0xe0
[22920.976638] [<ffffffffa94c65c0>] ? insert_kthread_work+0x40/0x40
[22920.977638] [<ffffffffa9b92d37>] ret_from_fork_nospec_begin+0x21/0x21
[22920.978633] [<ffffffffa94c65c0>] ? insert_kthread_work+0x40/0x40
[22920.979624] INFO: task qmt_reba_lustre:31100 blocked for more than 120 seconds.

And:

[22920.993099] INFO: task mdt00_003:638 blocked for more than 120 seconds.
[22920.994173] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[22920.995356] mdt00_003 D ffff9606de0f9070 0 638 2 0x00000080
[22920.996534] Call Trace:
[22920.996979] [<ffffffffa9b85da9>] schedule+0x29/0x70
[22920.997749] [<ffffffffa9b838b1>] schedule_timeout+0x221/0x2d0
[22920.998702] [<ffffffffa94db9a2>] ? default_wake_function+0x12/0x20
[22920.999675] [<ffffffffa946d39e>] ? kvm_clock_get_cycles+0x1e/0x20
[22921.000681] [<ffffffffa9b8549d>] io_schedule_timeout+0xad/0x130
[22921.001601] [<ffffffffa94c7306>] ? prepare_to_wait_exclusive+0x56/0x90
[22921.002668] [<ffffffffa9b85538>] io_schedule+0x18/0x20
[22921.003472] [<ffffffffc065d2b2>] cv_wait_common+0xb2/0x150 [spl]
[22921.004458] [<ffffffffa94c7780>] ? wake_up_atomic_t+0x30/0x30
[22921.005357] [<ffffffffc065d388>] __cv_wait_io+0x18/0x20 [spl]
[22921.006336] [<ffffffffc0810e75>] txg_wait_synced_impl+0xe5/0x130 [zfs]
[22921.007375] [<ffffffffc0810ed0>] txg_wait_synced+0x10/0x50 [zfs]
[22921.008375] [<ffffffffc11a39bd>] osd_trans_stop+0x54d/0x5f0 [osd_zfs]
[22921.009430] [<ffffffffc0fcf9a3>] top_trans_stop+0xa3/0xbf0 [ptlrpc]
[22921.010470] [<ffffffffc146f7ff>] ? lod_xattr_get+0x37f/0x890 [lod]
[22921.011438] [<ffffffffc1449afc>] lod_trans_stop+0x25c/0x340 [lod]
[22921.012452] [<ffffffffc15071f9>] ? mdd_acl_chmod+0x169/0x490 [mdd]
[22921.013419] [<ffffffffc150961e>] mdd_trans_stop+0x2e/0x174 [mdd]
[22921.014412] [<ffffffffc14fd2df>] mdd_attr_set+0x65f/0xc80 [mdd]
[22921.015356] [<ffffffffc1362436>] mdt_attr_set+0x1c6/0x7b0 [mdt]
[22921.016349] [<ffffffffc13633aa>] mdt_reint_setattr+0x70a/0xf40 [mdt]
[22921.017346] [<ffffffffc13514c1>] ? mdt_root_squash+0x21/0x430 [mdt]
[22921.018391] [<ffffffffc1350d3b>] ? ucred_set_audit_enabled.isra.15+0x3b/0x60 [mdt]
[22921.019571] [<ffffffffc13663e3>] mdt_reint_rec+0x83/0x210 [mdt]
[22921.020566] [<ffffffffc133ea70>] mdt_reint_internal+0x720/0xaf0 [mdt]
[22921.021574] [<ffffffffc134a607>] mdt_reint+0x67/0x140 [mdt]
[22921.022527] [<ffffffffc0fbf6ba>] tgt_request_handle+0x7ea/0x1750 [ptlrpc]
[22921.023591] [<ffffffffc0bccbee>] ? libcfs_nid2str_r+0xfe/0x130 [lnet]
[22921.024674] [<ffffffffc0f5f1c6>] ptlrpc_server_handle_request+0x256/0xb10 [ptlrpc]
[22921.025863] [<ffffffffc0f63d1c>] ptlrpc_main+0xb3c/0x14e0 [ptlrpc]
[22921.026879] [<ffffffffa94d53fe>] ? finish_task_switch+0x4e/0x1c0
[22921.027835] [<ffffffffc0f631e0>] ? ptlrpc_register_service+0xf90/0xf90 [ptlrpc]
[22921.029014] [<ffffffffa94c6691>] kthread+0xd1/0xe0
[22921.029772] [<ffffffffa94c65c0>] ? insert_kthread_work+0x40/0x40
[22921.030761] [<ffffffffa9b92d37>] ret_from_fork_nospec_begin+0x21/0x21
[22921.031759] [<ffffffffa94c65c0>] ? insert_kthread_work+0x40/0x40

I believe the problem is in the ZFS layer, which becomes stuck, causing other tasks to hang. |
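The pile-up described above can be pictured with a minimal userspace sketch. This is only a pthread analogy, not Lustre or ZFS code, and every name in it is hypothetical: one writer takes a lock (standing in for the per-ID quota semaphore) and then blocks indefinitely, the way the trace shows qmt_dqacq0 stuck in dmu_tx_wait(); every later thread that needs the same lock then shows up as blocked too, even though only the first one is truly stuck. The program hangs by design when run.

```c
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

/* Stand-in for the per-ID quota semaphore (purely illustrative name). */
static pthread_rwlock_t lqe_sem = PTHREAD_RWLOCK_INITIALIZER;

static void *stuck_holder(void *arg)
{
	(void)arg;
	pthread_rwlock_wrlock(&lqe_sem);  /* takes the write lock first */
	printf("holder: write lock taken, now blocked 'in ZFS' forever\n");
	pause();                          /* stands in for dmu_tx_wait() never returning */
	pthread_rwlock_unlock(&lqe_sem);
	return NULL;
}

static void *victim(void *arg)
{
	long id = (long)arg;

	printf("victim %ld: waiting for the write lock...\n", id);
	pthread_rwlock_wrlock(&lqe_sem);  /* queues behind the stuck holder */
	pthread_rwlock_unlock(&lqe_sem);
	return NULL;
}

int main(void)
{
	pthread_t h, v[2];

	pthread_create(&h, NULL, stuck_holder, NULL);
	sleep(1);                         /* let the holder win the lock */
	for (long i = 0; i < 2; i++)
		pthread_create(&v[i], NULL, victim, (void *)i);
	pthread_join(h, NULL);            /* never returns: the program hangs by design */
	return 0;
}
```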
| Comment by Cory Spitz [ 21/Jan/21 ] |
|
jamesanunez, just double-checking here, have you seen any instances of this failure with ldiskfs backends? |
| Comment by James Nunez (Inactive) [ 26/Jan/21 ] |
|
Cory - No, we do not see this hang with ldiskfs backends. So far, we've only seen this for ZFS. |
| Comment by Alex Zhuravlev [ 28/Jan/21 ] |
|
Lots of threads were waiting for a transaction:

[26058.883940] mdt_rdpg00_036 D ffff9606de7be2a0 0 15624 2 0x00000080
[26058.885070] Call Trace:
[26058.885515] [<ffffffffa9b85da9>] schedule+0x29/0x70
[26058.886284] [<ffffffffc065d325>] cv_wait_common+0x125/0x150 [spl]
[26058.887285] [<ffffffffa94c7780>] ? wake_up_atomic_t+0x30/0x30
[26058.888176] [<ffffffffc065d365>] __cv_wait+0x15/0x20 [spl]
[26058.889126] [<ffffffffc07afa5b>] dmu_tx_wait+0x20b/0x3b0 [zfs]
[26058.890036] [<ffffffffa9b84022>] ? mutex_lock+0x12/0x2f
[26058.890920] [<ffffffffc07afc91>] dmu_tx_assign+0x91/0x490 [zfs]
[26058.891845] [<ffffffffc11a0109>] osd_trans_start+0x199/0x440 [osd_zfs]
[26058.892911] [<ffffffffc1159351>] qmt_trans_start_with_slv+0x331/0x7f0 [lquota]
[26058.894032] [<ffffffffc1150c83>] qmt_dqacq0+0x2d3/0x1ac0 [lquota]
[26058.895052] [<ffffffffc115b573>] ? qti_lqes_add+0x1a3/0x5c0 [lquota]
[26058.896037] [<ffffffffc1163854>] ? qmt_pool_lqes_lookup+0x2b4/0x8f0 [lquota]
[26058.897182] [<ffffffffc1152d77>] qmt_dqacq+0x707/0x810 [lquota]
...
[26057.768904] [<ffffffffc0810e75>] txg_wait_synced_impl+0xe5/0x130 [zfs]
[26057.769985] [<ffffffffc0810ed0>] txg_wait_synced+0x10/0x50 [zfs]
[26057.770930] [<ffffffffc11a39bd>] osd_trans_stop+0x54d/0x5f0 [osd_zfs]
[26057.772015] [<ffffffffc0fcf9a3>] top_trans_stop+0xa3/0xbf0 [ptlrpc]
[26057.773004] [<ffffffffc146f7ff>] ? lod_xattr_get+0x37f/0x890 [lod]
[26057.774018] [<ffffffffc1449afc>] lod_trans_stop+0x25c/0x340 [lod]
[26057.774972] [<ffffffffc15071f9>] ? mdd_acl_chmod+0x169/0x490 [mdd]
[26057.775985] [<ffffffffc150961e>] mdd_trans_stop+0x2e/0x174 [mdd]
[26057.776919] [<ffffffffc14fd2df>] mdd_attr_set+0x65f/0xc80 [mdd]
[26057.777886] [<ffffffffc1362436>] mdt_attr_set+0x1c6/0x7b0 [mdt]
[26057.778810] [<ffffffffc13633aa>] mdt_reint_setattr+0x70a/0xf40 [mdt]
[26057.779852] [<ffffffffc13514c1>] ? mdt_root_squash+0x21/0x430 [mdt]
[26057.780825] [<ffffffffc1350d3b>] ? ucred_set_audit_enabled.isra.15+0x3b/0x60 [mdt]
[26057.782047] [<ffffffffc13663e3>] mdt_reint_rec+0x83/0x210 [mdt]
[26057.782974] [<ffffffffc133ea70>] mdt_reint_internal+0x720/0xaf0 [mdt]
[26057.784028] [<ffffffffc134a607>] mdt_reint+0x67/0x140 [mdt]

Notice that none of the threads above hold a quota-related mutex/lock. But then another thread, holding a transaction, waits for the quota semaphore:

[26057.805469] mdt_rdpg00_002 D ffff9606f92b6a40 0 7935 2 0x00000080
[26057.806680] Call Trace:
[26057.807100] [<ffffffffa9b85da9>] schedule+0x29/0x70
[26057.807954] [<ffffffffa9b874e5>] rwsem_down_write_failed+0x215/0x3c0
[26057.808966] [<ffffffffa9796777>] call_rwsem_down_write_failed+0x17/0x30
[26057.810072] [<ffffffffc065e6e0>] ? spl_kmem_alloc+0xf0/0x150 [spl]
[26057.811046] [<ffffffffa9b850fd>] down_write+0x2d/0x3d
[26057.811903] [<ffffffffc115d381>] qti_lqes_write_lock+0xb1/0x1b0 [lquota]
[26057.813020] [<ffffffffc1150c9e>] qmt_dqacq0+0x2ee/0x1ac0 [lquota]
[26057.813998] [<ffffffffa9626702>] ? kmem_cache_free+0x1e2/0x200
[26057.814980] [<ffffffffc1163854>] ? qmt_pool_lqes_lookup+0x2b4/0x8f0 [lquota]
[26057.816096] [<ffffffffc1152d77>] qmt_dqacq+0x707/0x810 [lquota]
[26057.817117] [<ffffffffc0f52aff>] ? lustre_pack_reply_flags+0x6f/0x1e0 [ptlrpc]
[26057.818231] [<ffffffffc1332fd2>] mdt_quota_dqacq+0x62/0x150 [mdt] |
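One way to read the observation above is as a cross-layer cycle: a thread that holds an open transaction waits for the quota semaphore, the semaphore holder waits for the transaction machinery to make progress, and the transaction machinery cannot advance until every open transaction closes. The pthread model below is a hypothetical sketch of that reading, not Lustre or ZFS code; the tiny "txg" is just a counter plus a flag, and the sleeps only force the interleaving so the stall is reproducible. Run it and all three threads block, mirroring the traces above.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

/* Stand-in for the per-ID quota semaphore taken by qti_lqes_write_lock(). */
static pthread_rwlock_t lqe_sem = PTHREAD_RWLOCK_INITIALIZER;

/* A toy "txg": sync may only complete once no transaction is open. */
static pthread_mutex_t txg_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  txg_cv   = PTHREAD_COND_INITIALIZER;
static int  open_txs;
static bool synced;

static void tx_begin(void)
{
	pthread_mutex_lock(&txg_lock);
	open_txs++;
	pthread_mutex_unlock(&txg_lock);
}

static void tx_end(void)
{
	pthread_mutex_lock(&txg_lock);
	open_txs--;
	pthread_cond_broadcast(&txg_cv);
	pthread_mutex_unlock(&txg_lock);
}

static void txg_wait_synced_model(void)
{
	pthread_mutex_lock(&txg_lock);
	while (!synced)
		pthread_cond_wait(&txg_cv, &txg_lock);
	pthread_mutex_unlock(&txg_lock);
}

/* "Quiesce/sync" thread: may declare the txg synced only when no tx is open. */
static void *sync_thread(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&txg_lock);
	while (open_txs > 0)                     /* stuck: T1 never closes its tx */
		pthread_cond_wait(&txg_cv, &txg_lock);
	synced = true;
	pthread_cond_broadcast(&txg_cv);
	pthread_mutex_unlock(&txg_lock);
	return NULL;
}

/* T1: holds an open transaction, then wants the quota semaphore. */
static void *tx_then_lock(void *arg)
{
	(void)arg;
	tx_begin();
	printf("T1: tx open, waiting for lqe_sem\n");
	pthread_rwlock_wrlock(&lqe_sem);         /* blocks: T2 holds lqe_sem */
	pthread_rwlock_unlock(&lqe_sem);
	tx_end();
	return NULL;
}

/* T2: holds the quota semaphore, then waits for the txg to sync. */
static void *lock_then_wait_sync(void *arg)
{
	(void)arg;
	pthread_rwlock_wrlock(&lqe_sem);
	printf("T2: lqe_sem held, waiting for sync\n");
	txg_wait_synced_model();                 /* blocks: sync waits on T1's tx */
	pthread_rwlock_unlock(&lqe_sem);
	return NULL;
}

int main(void)
{
	pthread_t t1, t2, ts;

	/* The sleeps only force the interleaving seen in the traces. */
	pthread_create(&t2, NULL, lock_then_wait_sync, NULL);
	sleep(1);
	pthread_create(&t1, NULL, tx_then_lock, NULL);
	sleep(1);
	pthread_create(&ts, NULL, sync_thread, NULL);
	pthread_join(t1, NULL);                  /* never returns: the cycle is complete */
	return 0;
}
```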
| Comment by Gerrit Updater [ 28/Jan/21 ] |
|
James Nunez (jnunez@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/41354 |
| Comment by Alex Zhuravlev [ 28/Jan/21 ] |
|
While I'm not 100% sure, I haven't seen any evidence or explanation of why this is ZFS-related. |
| Comment by James Nunez (Inactive) [ 01/Feb/21 ] |
|
Just a note to document where we've seen this issue and for what configurations. We've seen this issue only on the master branch starting with Lustre 2.13.53.163 for ZFS with a single MDS (not DNE) for full and full-patchless test sessions. We've seen this hang for the following RHEL and ZFS versions: |
| Comment by Sergey Cheremencev [ 01/Feb/21 ] |
|
Yes, Alex is right in his comment. |
| Comment by Gerrit Updater [ 02/Feb/21 ] |
|
James Nunez (jnunez@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/41384 |
| Comment by James Nunez (Inactive) [ 02/Feb/21 ] |
|
Sergey - I don't have a crash dump, but debug logs are collected. Each node has a debug log; look for .debug.; https://testing.whamcloud.com/test_sets/94c956a6-8eee-472d-8613-1324746db4a7. Is that what you're looking for / is it helpful, or do we need to set debug to -1? |
| Comment by Gerrit Updater [ 02/Feb/21 ] |
|
James Nunez (jnunez@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/41390 |
| Comment by Alex Zhuravlev [ 02/Feb/21 ] |
|
qti_lqes_write_lock() can be taking a few locks. Are those ordered somehow? Otherwise, I guess, a few threads can hit a deadlock. |
| Comment by Sergey Cheremencev [ 02/Feb/21 ] |
No, those are not ordered. I agree this could be one of the possible reasons. |
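A standard way to remove this class of ABBA deadlock is to impose a single global order on the locks, for example by sorting them by address before acquisition. The fragment below is only a generic illustration of that idea under hypothetical names; it is not the code of the patches that eventually landed, which may resolve the problem differently.

```c
#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>

/* Total order on lock addresses, used as the global locking order. */
static int cmp_lock_ptr(const void *a, const void *b)
{
	uintptr_t la = (uintptr_t)*(pthread_rwlock_t *const *)a;
	uintptr_t lb = (uintptr_t)*(pthread_rwlock_t *const *)b;

	return (la > lb) - (la < lb);
}

/* Write-lock every lock in the set, always in ascending address order. */
static void lqes_write_lock_ordered(pthread_rwlock_t **locks, int n)
{
	qsort(locks, n, sizeof(*locks), cmp_lock_ptr);
	for (int i = 0; i < n; i++)
		pthread_rwlock_wrlock(locks[i]);
}

/* Release in reverse order so critical sections stay properly nested. */
static void lqes_write_unlock(pthread_rwlock_t **locks, int n)
{
	for (int i = n - 1; i >= 0; i--)
		pthread_rwlock_unlock(locks[i]);
}

int main(void)
{
	pthread_rwlock_t a = PTHREAD_RWLOCK_INITIALIZER;
	pthread_rwlock_t b = PTHREAD_RWLOCK_INITIALIZER;
	/* Two threads may pass {&a, &b} or {&b, &a}; sorting makes the order identical. */
	pthread_rwlock_t *set[] = { &b, &a };

	lqes_write_lock_ordered(set, 2);
	lqes_write_unlock(set, 2);
	return 0;
}
```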
| Comment by Gerrit Updater [ 15/Apr/21 ] |
|
Sergey Cheremencev (sergey.cheremencev@hpe.com) uploaded a new patch: https://review.whamcloud.com/43325 |
| Comment by Gerrit Updater [ 15/Apr/21 ] |
|
Sergey Cheremencev (sergey.cheremencev@hpe.com) uploaded a new patch: https://review.whamcloud.com/43326 |
| Comment by Gerrit Updater [ 22/Jul/21 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/43326/ |
| Comment by Sergey Cheremencev [ 04/Aug/21 ] |
|
The issue is solved by https://review.whamcloud.com/43326/. James, do you agree to close the issue? I guess https://review.whamcloud.com/#/c/41390/ should be abandoned. |
| Comment by James Nunez (Inactive) [ 04/Aug/21 ] |
|
It looks like we haven't seen this hang in a while. Thus, I'm fine with closing this ticket. The patch https://review.whamcloud.com/#/c/43325/ has not landed. Do we want to land this patch before closing this ticket? |
| Comment by Sergey Cheremencev [ 05/Aug/21 ] |
|
https://review.whamcloud.com/#/c/43325/ is not needed as it has been landed as a part of https://review.whamcloud.com/43326/. |