[LU-10048] osd-ldiskfs to truncate outside of main transaction Created: 29/Sep/17  Updated: 07/Jul/22  Resolved: 27/Aug/19

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.0, Lustre 2.10.4, Lustre 2.10.5
Fix Version/s: Lustre 2.13.0

Type: Bug Priority: Major
Reporter: Alex Zhuravlev Assignee: Alex Zhuravlev
Resolution: Fixed Votes: 0
Labels: None

Attachments: File crash_bt_jbd2_locked_20210412.log    
Issue Links:
Blocker
is blocked by LU-11764 osd_declare_qid() should not rely on ... Open
Duplicate
is duplicated by LU-11613 MDS and OSS locked up wait_transactio... Resolved
is duplicated by LU-8786 Terrible i/o performance of a test ap... Resolved
Related
is related to LU-5152 Can't enforce block quota when unpriv... Resolved
is related to LU-8806 LFSCK hangs on MDT - osp_precreate_cl... Resolved
is related to LU-11685 removing file names and freeing inode... Open
is related to LU-11465 OSS/MDS deadlock in 2.10.5 Resolved
is related to LU-5994 DT transaction start and object lock ... Resolved
is related to LU-13234 ldiskfs/namei.c:3310 ldiskfs_orphan_a... Resolved
is related to LU-12977 fix i_mutex for ldiskfs_truncate() in... Resolved

 Description   

This is needed to implement the (transaction first; locking next) order, so that locking is unified among MDT/OST/OUT.
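A minimal sketch of why one acquisition order matters, written in Python purely for illustration. The path names and the two resource labels are hypothetical and are not taken from the Lustre code. The helper flags pairs of code paths that take the same two resources in opposite order, which is the general pattern that can deadlock.

```python
from itertools import combinations


def conflicting_orders(paths):
    """Return (path_a, path_b, first, second) tuples where path_a takes
    `first` before `second` but path_b takes them in the opposite order."""
    conflicts = []
    for (name_a, order_a), (name_b, order_b) in combinations(paths.items(), 2):
        # Resources used by both paths, listed in path_a's acquisition order.
        shared = [res for res in order_a if res in order_b]
        for first, second in combinations(shared, 2):
            if order_b.index(first) > order_b.index(second):
                conflicts.append((name_a, name_b, first, second))
    return conflicts


# Hypothetical acquisition orders, for illustration only.
mixed = {
    "path_a": ["transaction", "object_lock"],
    "path_b": ["object_lock", "transaction"],
}
unified = {
    "path_a": ["transaction", "object_lock"],
    "path_b": ["transaction", "object_lock"],
}

print(conflicting_orders(mixed))    # one conflict between path_a and path_b
print(conflicting_orders(unified))  # no conflicts
```

With every path using the (transaction first; locking next) order, the helper reports no conflicting pair.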



 Comments   
Comment by Andreas Dilger [ 27/Oct/17 ]

The https://review.whamcloud.com/27488 patch is for ldiskfs, while the LU-8806 patch is for ZFS. Presumably they both need to be fixed?

Comment by Gerrit Updater [ 13/Feb/18 ]

Alex Zhuravlev (alexey.zhuravlev@intel.com) uploaded a new patch: https://review.whamcloud.com/31293
Subject: LU-10048 ofd: take local locks within transaction
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: a0ff9fb6374689e50674901f3ff3ede857d61848

Comment by Gerrit Updater [ 14/Jun/18 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/27488/
Subject: LU-10048 osd: async truncate
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: cf29a5e7bfa254ccfcea023028fe7da80503c512

Comment by Andreas Dilger [ 08/Oct/18 ]

Still one more patch to land.

Comment by Lukasz Flis [ 10/Oct/18 ]

Is there a backport for b2_10 available or planned?

Alex pointed to this issue as a duplicate of LU-11465, but we were not able to cherry-pick the changes into b2_10.

We are experiencing MDT/OST lock-ups on 2.10.5, a few times a day in the worst case.

Comment by Gerrit Updater [ 06/Nov/18 ]

Alex Zhuravlev (bzzz@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/33586
Subject: LU-10048 osd: async truncate
Project: fs/lustre-release
Branch: b2_10
Current Patch Set: 1
Commit: 414b3ed9fd5aa975da26b8847e9cfe8a188b59ce

Comment by Gerrit Updater [ 06/Nov/18 ]

Alex Zhuravlev (bzzz@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/33587
Subject: LU-10048 ofd: take local locks within transaction
Project: fs/lustre-release
Branch: b2_10
Current Patch Set: 1
Commit: 7b6cdb405a519b11537316956affc4bfa4467f8d

Comment by Mahmoud Hanafi [ 16/Nov/18 ]

Are the 2.10 patches OK to use, or do they still need additional work?

Comment by Peter Jones [ 17/Nov/18 ]

Mahmoud

I would recommend holding off for now

Peter

Comment by Andreas Dilger [ 20/Nov/18 ]

I think the https://review.whamcloud.com/33586 patch "LU-10048 osd: async truncate" is relatively safe - it has been in master for several months already, but I'm not sure it will fix the issue completely. Also, the patch https://review.whamcloud.com/33682 "Revert LU-5152 quota: enforce block quota for chgrp" may also help with MDT/OST lockups. It reverts a patch that was landed in 2.10.4 that introduced a circular dependency between servers in the quota handling.

Comment by Gerrit Updater [ 27/Aug/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/31293/
Subject: LU-10048 ofd: take local locks within transaction
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 9f79d4488fbb466647d1d09c2e6a1d3555d062fc

Comment by Peter Jones [ 27/Aug/19 ]

So...is this ok to mark as resolved now?

Comment by Peter Jones [ 27/Aug/19 ]

Andreas thinks yes

Comment by Gerrit Updater [ 04/Mar/20 ]

Mark Roper (markroper@gmail.com) uploaded a new patch: https://review.whamcloud.com/37797
Subject: LU-10048 ofd: take local locks within transaction
Project: fs/lustre-release
Branch: b2_10
Current Patch Set: 1
Commit: afa505898952b9c094bd6fb4d9911032e4eefb68

Comment by Aurelien Degremont (Inactive) [ 05/Mar/20 ]

We submitted a backport of this patch to b2_10 if anybody is looking for it.

@Lukasz, in case you are still looking for it.

Comment by Gerrit Updater [ 12/Apr/21 ]

Etienne AUJAMES (eaujames@ddn.com) uploaded a new patch: https://review.whamcloud.com/43277
Subject: LU-10048 ofd: take local locks within transaction
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: 4dc5634dcdf3381a7ea698605a3c7a345f689a4c

Comment by Etienne Aujames [ 12/Apr/21 ]

Hello,

I have backported https://review.whamcloud.com/43277 ("LU-10048 ofd: take local locks within transaction") along with https://review.whamcloud.com/43278 ("LU-13093 osd: fix osd_attr_set race").

It seems we trigger this bug in 2.12.6 with an external journal (on a flash device) for a rotational disk while several migrate threads are running on a Robinhood client (512 threads).

The jbd2 journal seems to be locked, with the transaction in T_LOCKED:

transaction_t.t_state = T_LOCKED
transaction_t.t_handle_count = 40
transaction_t.t_updates = 1
number of tasks in j_wait_transaction_locked = 324
number of tasks in j_wait_update = 0

Comment by Alex Zhuravlev [ 12/Apr/21 ]

There is yet another patch under LU-10048: "LU-10048 osd: async truncate", commit cf29a5e7bfa254ccfcea023028fe7da80503c512 in master.

Comment by Alex Zhuravlev [ 12/Apr/21 ]

As for the fast journal: we have a number of nodes running all the tests on RAM-backed devices 24 hours a day.

Comment by Etienne Aujames [ 12/Apr/21 ]

The "LU-10048 osd: async truncate" patch has already landed on b2_12.

Comment by Alex Zhuravlev [ 12/Apr/21 ]

Please generate a full backtrace (echo t >/proc/sysrq-trigger) and attach it to the ticket.

Comment by Etienne Aujames [ 12/Apr/21 ]

The issue seems to occur relatively often with a lot of migrate threads (4 times today, on different OSTs).

We will test the backports quickly in a production environment (maybe this week).

I will try to get a backtrace from the crash dump (manually triggered) tomorrow.

Comment by Etienne Aujames [ 14/Apr/21 ]

I have added our backtrace to this ticket: crash_bt_jbd2_locked_20210412.log

Comment by Etienne Aujames [ 20/Apr/21 ]

Hello Alex,

Did you have time to look at our backtrace?

If you need more data from the crash dump, I can get you some.

Comment by Etienne Aujames [ 11/May/21 ]

Hello,

We have applied https://review.whamcloud.com/43277 ("LU-10048 ofd: take local locks within transaction") + https://review.whamcloud.com/43278 ("LU-13093 osd: fix osd_attr_set race") on the problematic filesystem. The issue has not occurred since.

Before that, we were able to reproduce this issue within 5 to 10 minutes with many creations of small files and several threads doing file migrations ("lfs migrate" between OSTs).
