[LU-10859] Deadlock with heavy memory pressure Created: 28/Mar/18 Updated: 11/May/20 Resolved: 09/Apr/18 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.12.0, Lustre 2.10.4 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Wang Shilong (Inactive) | Assignee: | Wang Shilong (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | patch | ||
| Environment: |
RHEL7 Server |
||
| Issue Links: |
|
||||
| Severity: | 3 | ||||
| Rank (Obsolete): | 9223372036854775807 | ||||
| Description |
On one customer site, we hit the following deadlock:

Thread 1:
ofd_object_punch
osd_punch
ldiskfs_truncate
ldiskfs_inode_attach_jinode
...
do_try_to_free_pages
lu_cache_shrink
mutex_lock --> tries to take @lu_sites_guard

Thread 2 (kswapd):
kthread
shrink_slab
lu_cache_shrink
mutex_lock --> already held
...
dqget
ldiskfs_acquire_dquot
jbd2__journal_start --> blocked waiting for more credits

Thread 3:
kthread
kjournald2
jbd2_journal_commit_transaction --> blocked waiting for Thread 2 to finish, since Thread 1 added a handle to the transaction

So the deadlock happens because Thread 1 waits for Thread 2 and Thread 2 waits for Thread 3, but Thread 3 waits for Thread 1. The problem still exists even after switching @lu_sites_guard to a read/write lock, since we hold the write lock in lu_cache_shrink().

Fixed the problem by making ldiskfs_inode_attach_jinode() use GFP_NOFS. |
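The circular wait described above can be sketched as a small wait-for graph. This is only an illustrative model, not Lustre or kernel code: the thread labels, `find_cycle`, and `without_fs_reclaim` are hypothetical names, and the "fixed" graph assumes that an allocation made without filesystem reclaim permitted no longer enters lu_cache_shrink from Thread 1.

```python
from typing import Dict, List, Optional

# Wait-for edges taken from the ticket's summary sentence: each thread maps
# to the thread it is blocked on. The labels are illustrative, not symbols.
WAITS_FOR: Dict[str, str] = {
    "thread1_truncate": "thread2_kswapd",       # wants @lu_sites_guard
    "thread2_kswapd": "thread3_kjournald2",     # wants journal credits
    "thread3_kjournald2": "thread1_truncate",   # wants Thread 1's handle done
}


def find_cycle(waits_for: Dict[str, str], start: str) -> Optional[List[str]]:
    """Follow wait-for edges from `start`; return the cycle found, or None."""
    seen: List[str] = []
    node: Optional[str] = start
    while node is not None and node not in seen:
        seen.append(node)
        node = waits_for.get(node)
    if node is None:
        return None  # the chain ended at a thread that is not blocked
    # `node` was visited before, so everything from that visit on is the cycle.
    return seen[seen.index(node):] + [node]


def without_fs_reclaim(waits_for: Dict[str, str]) -> Dict[str, str]:
    """Model of the fix (assumption): Thread 1 no longer blocks on the shrinker lock."""
    fixed = dict(waits_for)
    fixed.pop("thread1_truncate", None)
    return fixed


if __name__ == "__main__":
    print("before:", find_cycle(WAITS_FOR, "thread1_truncate"))
    print("after: ", find_cycle(without_fs_reclaim(WAITS_FOR), "thread1_truncate"))
```

Removing the single edge out of Thread 1 is enough to break the cycle in this model, which matches the approach of changing only the allocation in ldiskfs_inode_attach_jinode().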
| Comments |
| Comment by Gerrit Updater [ 28/Mar/18 ] |
|
Wang Shilong (wshilong@ddn.com) uploaded a new patch: https://review.whamcloud.com/31806 |
| Comment by Peter Jones [ 28/Mar/18 ] |
|
Thanks wangshilong
|
| Comment by Bruno Faccini (Inactive) [ 28/Mar/18 ] |
|
Yes, thanks for creating both this public ticket and the Gerrit change in my place ... |
| Comment by Wang Shilong (Inactive) [ 28/Mar/18 ] |
|
Hi Bruno, sorry about that; this issue is rather urgent for us. |
| Comment by Andreas Dilger [ 28/Mar/18 ] |
|
Bob, can you please add a follow-on patch for SLES, either using the same patch (if it applies cleanly) or new patches as needed, once this initial patch has passed review & testing. |
| Comment by Wang Shilong (Inactive) [ 28/Mar/18 ] |
|
Yup, Andreas, I should have included SLES updates too, but it looks a bit hard for me to get the source code for it. Ihara also reminded me that we need a patch for ubuntu14+16 on master too.
|
| Comment by Peter Jones [ 28/Mar/18 ] |
|
We only support ubuntu16 on master - not ubuntu14 |
| Comment by Chris Hunter (Inactive) [ 28/Mar/18 ] |
|
Is this related to |
| Comment by Wang Shilong (Inactive) [ 28/Mar/18 ] |
|
Hello Chris, our ES3 already includes that patch, but we still hit the problem, so this is a different issue.
|
| Comment by Chris Hunter (Inactive) [ 29/Mar/18 ] |
|
The kernel_patch filename says "GPF_NOFS", but the alloc flag is GFP_NOFS.
|
| Comment by Gerrit Updater [ 29/Mar/18 ] |
|
Bob Glossman (bob.glossman@intel.com) uploaded a new patch: https://review.whamcloud.com/31825 |
| Comment by Bob Glossman (Inactive) [ 29/Mar/18 ] |
done |
| Comment by Wang Shilong (Inactive) [ 30/Mar/18 ] |
|
Hello chunteraa, thanks for the reminder; I refreshed the patch to fix that. |
| Comment by Gerrit Updater [ 09/Apr/18 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/31806/ |
| Comment by Peter Jones [ 09/Apr/18 ] |
|
Landed for 2.12 |
| Comment by Wang Shilong (Inactive) [ 17/Apr/18 ] |
|
FYI, we should include this fix in the b2_10 LTS branch. |
| Comment by Gerrit Updater [ 18/Apr/18 ] |
|
Minh Diep (minh.diep@intel.com) uploaded a new patch: https://review.whamcloud.com/32058 |
| Comment by Gerrit Updater [ 03/May/18 ] |
|
John L. Hammond (john.hammond@intel.com) merged in patch https://review.whamcloud.com/32058/ |
| Comment by Chris Horn [ 31/May/18 ] |
|
Bob Glossman (bob.glossman@intel.com) uploaded a new patch: https://review.whamcloud.com/31825
Edit: Nevermind, I misread the patches |