[LU-10859] Deadlock with heavy memory pressure Created: 28/Mar/18  Updated: 11/May/20  Resolved: 09/Apr/18

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.12.0, Lustre 2.10.4

Type: Bug Priority: Minor
Reporter: Wang Shilong (Inactive) Assignee: Wang Shilong (Inactive)
Resolution: Fixed Votes: 0
Labels: patch
Environment:

RHEL7 Server
Lustre 2.7.x series


Issue Links:
Related
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   
At one customer site, we hit the following deadlock:

    Thread 1:

    ofd_object_punch

     osd_punch

      ldiskfs_truncate

       ldiskfs_inode_attach_jinode

         ...

         do_try_to_free_pages

          lu_cache_shrink

           mutex_lock --> tries to take @lu_sites_guard

    

    Thread 2 (kswapd):

    kthread

     shrink_slab

      lu_cache_shrink

        mutex_lock --> already holds @lu_sites_guard

         ...

         dqget

          ldiskfs_acquire_dquot

           jbd2__journal_start --> blocked waiting for more credits

    

    Thread 3:

    kthread

     kjournald2

      jbd2_journal_commit_transaction --> blocked waiting for Thread 1 to finish,

                                 since Thread 1 added a handle to the transaction.

    

    So the deadlock happens because Thread 1 waits for Thread 2, Thread 2 waits for Thread 3,

    and Thread 3 waits for Thread 1.

    

    This problem still exists even after switching @lu_sites_guard

    to a read/write lock, since we hold the write lock in lu_cache_shrink().

    

    Fix the problem by making ldiskfs_inode_attach_jinode() use

    GFP_NOFS.

 



 Comments   
Comment by Gerrit Updater [ 28/Mar/18 ]

Wang Shilong (wshilong@ddn.com) uploaded a new patch: https://review.whamcloud.com/31806
Subject: LU-10859 ldiskfs: fix deadlock with heavy preassure
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 475c42e60f16281b9fd5f928559e2f0a4a4b6952

Comment by Peter Jones [ 28/Mar/18 ]

Thanks wangshilong

 

Comment by Bruno Faccini (Inactive) [ 28/Mar/18 ]

Yes, thanks for creating both this public ticket and the Gerrit change in my place ...

Comment by Wang Shilong (Inactive) [ 28/Mar/18 ]

Hi Bruno, sorry about that; this issue is rather urgent for us.

Comment by Andreas Dilger [ 28/Mar/18 ]

Bob, can you please add a follow-on patch for SLES, either using the same patch (if it applies cleanly) or new patches as needed, once this initial patch has passed review & testing.

Comment by Wang Shilong (Inactive) [ 28/Mar/18 ]

Yup, Andreas, I should have included the SLES updates too, but it is a bit hard for me to get the source code for it. Ihara also reminded me that we need patches for ubuntu14+16 on master too.

 

Comment by Peter Jones [ 28/Mar/18 ]

We only support ubuntu16 on master - not ubuntu14

Comment by Chris Hunter (Inactive) [ 28/Mar/18 ]

Is this related to LU-9728 ?

Comment by Wang Shilong (Inactive) [ 28/Mar/18 ]

Hello Chris,

  Our ES3 already includes that patch, but we still hit the problem, so this is a different issue.

 

 

Comment by Chris Hunter (Inactive) [ 29/Mar/18 ]

The LU-9728 patch uses GFP_HIGHUSER for allocations instead of GFP_NOFS.

The kernel_patch filename says "GPF_NOFS", but the allocation flag is GFP_NOFS:
https://www.kernel.org/doc/gorman/html/understand/understand009.html

 

Comment by Gerrit Updater [ 29/Mar/18 ]

Bob Glossman (bob.glossman@intel.com) uploaded a new patch: https://review.whamcloud.com/31825
Subject: LU-10859 ldiskfs: extend previous fix to SLES
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: c4956f2dff93c428b3040c1e03d08c14ec6232c8

Comment by Bob Glossman (Inactive) [ 29/Mar/18 ]

Bob, can you please add a follow-on patch for SLES, either using the same patch (if it applies cleanly) or new patches as needed, once this initial patch has passed review & testing.

done

Comment by Wang Shilong (Inactive) [ 30/Mar/18 ]

Hello chunteraa,

Thanks for the reminder; I refreshed the patch to fix that.

Comment by Gerrit Updater [ 09/Apr/18 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/31806/
Subject: LU-10859 ldiskfs: fix deadlock with heavy memory preassure
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 0506e1bd6a6d5fafe7fc5e558aa1b75e456c2642

Comment by Peter Jones [ 09/Apr/18 ]

Landed for 2.12

Comment by Wang Shilong (Inactive) [ 17/Apr/18 ]

FYI, we should include this fix in the b2_10 LTS branch.

Comment by Gerrit Updater [ 18/Apr/18 ]

Minh Diep (minh.diep@intel.com) uploaded a new patch: https://review.whamcloud.com/32058
Subject: LU-10859 ldiskfs: fix deadlock with heavy memory preassure
Project: fs/lustre-release
Branch: b2_10
Current Patch Set: 1
Commit: c5a9c83471aa5b6e0a593b7b99760e86c8311bee

Comment by Gerrit Updater [ 03/May/18 ]

John L. Hammond (john.hammond@intel.com) merged in patch https://review.whamcloud.com/32058/
Subject: LU-10859 ldiskfs: fix deadlock with heavy memory preassure
Project: fs/lustre-release
Branch: b2_10
Current Patch Set:
Commit: 0595d92ad03ab9d975d599aad204d746aff991b3

Comment by Chris Horn [ 31/May/18 ]

Bob Glossman (bob.glossman@intel.com) uploaded a new patch: https://review.whamcloud.com/31825
Subject: LU-10859 ldiskfs: extend previous fix to SLES
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: c4956f2dff93c428b3040c1e03d08c14ec6232c8


 

This patch was abandoned but its changes were never rolled into the primary patch for this ticket ( https://review.whamcloud.com/31806/). Should Bob's patch be revived?

 

Edit: Never mind, I misread the patches.
