[LU-13922] Client blocked in lstat() Created: 25/Aug/20  Updated: 22/Oct/20  Resolved: 20/Sep/20

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.5
Fix Version/s: Lustre 2.14.0, Lustre 2.12.6

Type: Bug Priority: Major
Reporter: Stephane Thiell Assignee: Lai Siyao
Resolution: Fixed Votes: 0
Labels: None
Environment:

CentOS 7.6 (3.10.0-957.27.2.el7_lustre.pl2.x86_64), Lustre 2.12.5 + 2 patches from LU-13599


Attachments: Text File fir-md1-s4_foreach-bt_2020-08-25-08-27-56.txt     Text File fir-md1-s4_vmcore-dmesg_2020-08-25-08-27-56.txt    
Issue Links:
Related
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Recently, with Lustre 2.12.5, we experienced a blocked Lustre client on the Robinhood server with the following task info: 

Aug 24 23:05:15 fir-rbh01 kernel: INFO: task robinhood:10783 blocked for more than 120 seconds.
Aug 24 23:05:15 fir-rbh01 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug 24 23:05:15 fir-rbh01 kernel: robinhood       D ffffa13de5779040     0 10783      1 0x00000080
Aug 24 23:05:15 fir-rbh01 kernel: Call Trace:
Aug 24 23:05:15 fir-rbh01 kernel:  [<ffffffffa84fa3d2>] ? security_inode_permission+0x22/0x30
Aug 24 23:05:15 fir-rbh01 kernel:  [<ffffffffa896be39>] schedule_preempt_disabled+0x29/0x70
Aug 24 23:05:15 fir-rbh01 kernel:  [<ffffffffa8969db7>] __mutex_lock_slowpath+0xc7/0x1d0
Aug 24 23:05:15 fir-rbh01 kernel:  [<ffffffffa896919f>] mutex_lock+0x1f/0x2f
Aug 24 23:05:15 fir-rbh01 kernel:  [<ffffffffa8961dd2>] lookup_slow+0x33/0xa7
Aug 24 23:05:15 fir-rbh01 kernel:  [<ffffffffa8453268>] path_lookupat+0x838/0x8b0
Aug 24 23:05:15 fir-rbh01 kernel:  [<ffffffffa84342b2>] ? __mem_cgroup_commit_charge+0xe2/0x2f0
Aug 24 23:05:15 fir-rbh01 kernel:  [<ffffffffa841d665>] ? kmem_cache_alloc+0x35/0x1f0
Aug 24 23:05:15 fir-rbh01 kernel:  [<ffffffffa845415f>] ? getname_flags+0x4f/0x1a0
Aug 24 23:05:15 fir-rbh01 kernel:  [<ffffffffa845330b>] filename_lookup+0x2b/0xc0
Aug 24 23:05:15 fir-rbh01 kernel:  [<ffffffffa84552f7>] user_path_at_empty+0x67/0xc0
Aug 24 23:05:15 fir-rbh01 kernel:  [<ffffffffa83ed3ad>] ? handle_mm_fault+0x39d/0x9b0
Aug 24 23:05:15 fir-rbh01 kernel:  [<ffffffffa8455361>] user_path_at+0x11/0x20
Aug 24 23:05:15 fir-rbh01 kernel:  [<ffffffffa8448223>] vfs_fstatat+0x63/0xc0
Aug 24 23:05:15 fir-rbh01 kernel:  [<ffffffffa8448641>] SYSC_newlstat+0x31/0x60
Aug 24 23:05:15 fir-rbh01 kernel:  [<ffffffffa8972925>] ? do_page_fault+0x35/0x90
Aug 24 23:05:15 fir-rbh01 kernel:  [<ffffffffa8448aae>] SyS_newlstat+0xe/0x10
Aug 24 23:05:15 fir-rbh01 kernel:  [<ffffffffa8977ddb>] system_call_fastpath+0x22/0x27
Aug 24 23:05:15 fir-rbh01 kernel: INFO: task robinhood:10784 blocked for more than 120 seconds.
Aug 24 23:05:15 fir-rbh01 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug 24 23:05:15 fir-rbh01 kernel: robinhood       D ffffa13de5778000     0 10784      1 0x00000080
Aug 24 23:05:15 fir-rbh01 kernel: Call Trace:
Aug 24 23:05:15 fir-rbh01 kernel:  [<ffffffffa84fa3d2>] ? security_inode_permission+0x22/0x30
Aug 24 23:05:15 fir-rbh01 kernel:  [<ffffffffa896be39>] schedule_preempt_disabled+0x29/0x70
Aug 24 23:05:15 fir-rbh01 kernel:  [<ffffffffa8969db7>] __mutex_lock_slowpath+0xc7/0x1d0
Aug 24 23:05:15 fir-rbh01 kernel:  [<ffffffffa896919f>] mutex_lock+0x1f/0x2f
Aug 24 23:05:15 fir-rbh01 kernel:  [<ffffffffa8961dd2>] lookup_slow+0x33/0xa7
Aug 24 23:05:15 fir-rbh01 kernel:  [<ffffffffa8453268>] path_lookupat+0x838/0x8b0
Aug 24 23:05:15 fir-rbh01 kernel:  [<ffffffffa836a57d>] ? tracing_record_cmdline+0x1d/0x120
Aug 24 23:05:15 fir-rbh01 kernel:  [<ffffffffa841d665>] ? kmem_cache_alloc+0x35/0x1f0
Aug 24 23:05:15 fir-rbh01 kernel:  [<ffffffffa845415f>] ? getname_flags+0x4f/0x1a0
Aug 24 23:05:15 fir-rbh01 kernel:  [<ffffffffa845330b>] filename_lookup+0x2b/0xc0
Aug 24 23:05:15 fir-rbh01 kernel:  [<ffffffffa84552f7>] user_path_at_empty+0x67/0xc0
Aug 24 23:05:15 fir-rbh01 kernel:  [<ffffffffa8455361>] user_path_at+0x11/0x20
Aug 24 23:05:15 fir-rbh01 kernel:  [<ffffffffa8448223>] vfs_fstatat+0x63/0xc0
Aug 24 23:05:15 fir-rbh01 kernel:  [<ffffffffa8448641>] SYSC_newlstat+0x31/0x60
Aug 24 23:05:15 fir-rbh01 kernel:  [<ffffffffa833a176>] ? __audit_syscall_exit+0x1e6/0x280
Aug 24 23:05:15 fir-rbh01 kernel:  [<ffffffffa8448aae>] SyS_newlstat+0xe/0x10
Aug 24 23:05:15 fir-rbh01 kernel:  [<ffffffffa8977ddb>] system_call_fastpath+0x22/0x27

The rest of the clients were OK, but I guess an inode was blocked. The only solution I found was to restart the MDS (here, fir-MDT0003).

I took a crash dump of the MDS when this happened; it is available on the FTP server as fir-md1-s4_vmcore_2020-08-25-08-27-56.

Also, I'm attaching the following files from the crash dump: fir-md1-s4_foreach-bt_2020-08-25-08-27-56.txt and fir-md1-s4_vmcore-dmesg_2020-08-25-08-27-56.txt.

Lately, our MDS has been heavily loaded with a lot of mdt_rdpg threads, but until now it hadn't blocked on any inode. This is different and, in my opinion, worth investigating.

I'm wondering if this could be due to DNE, as I'm seeing a lot of stacks like this one:

PID: 21603  TASK: ffff9cf0f86ec100  CPU: 41  COMMAND: "mdt_rdpg01_017"
 #0 [ffff9d00ff888e48] crash_nmi_callback at ffffffff87456027
 #1 [ffff9d00ff888e58] nmi_handle at ffffffff87b6f91c
 #2 [ffff9d00ff888eb0] do_nmi at ffffffff87b6fb3d
 #3 [ffff9d00ff888ef0] end_repeat_nmi at ffffffff87b6ed89
    [exception RIP: fld_cache_lookup+157]
    RIP: ffffffffc089d9dd  RSP: ffff9ceff320ba50  RFLAGS: 00000202
    RAX: 0000000000000003  RBX: ffff9d10f868cf00  RCX: ffff9d10f27ad6c0
    RDX: ffff9d103898cbb8  RSI: ffff9d10f27ad6c0  RDI: ffff9d10f868cf28
    RBP: ffff9ceff320ba68   R8: 00000002800432da   R9: 0000000000000007
    R10: ffff9ce79d0a6700  R11: ffff9ce79d0a6700  R12: 00000002800432e6
    R13: ffff9d103898cbb8  R14: 00000002800432e6  R15: ffff9d00456b7000
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
--- <NMI exception stack> ---
 #4 [ffff9ceff320ba50] fld_cache_lookup at ffffffffc089d9dd [fld]
 #5 [ffff9ceff320ba70] fld_local_lookup at ffffffffc089ef02 [fld]
 #6 [ffff9ceff320bab8] osd_fld_lookup at ffffffffc13ae0e8 [osd_ldiskfs]
 #7 [ffff9ceff320bac8] osd_remote_fid at ffffffffc13ae233 [osd_ldiskfs]
 #8 [ffff9ceff320bb08] osd_it_ea_rec at ffffffffc13b73eb [osd_ldiskfs]
 #9 [ffff9ceff320bb70] lod_it_rec at ffffffffc15f6fb7 [lod]
#10 [ffff9ceff320bb80] mdd_dir_page_build at ffffffffc167edd7 [mdd]
#11 [ffff9ceff320bbe8] dt_index_walk at ffffffffc0d082a0 [obdclass]
#12 [ffff9ceff320bc58] mdd_readpage at ffffffffc16805bf [mdd]
#13 [ffff9ceff320bc90] mdt_readpage at ffffffffc14dc82a [mdt]
#14 [ffff9ceff320bcd0] tgt_request_handle at ffffffffc0ff866a [ptlrpc]
#15 [ffff9ceff320bd58] ptlrpc_server_handle_request at ffffffffc0f9b44b [ptlrpc]
#16 [ffff9ceff320bdf8] ptlrpc_main at ffffffffc0f9edb4 [ptlrpc]
#17 [ffff9ceff320bec8] kthread at ffffffff874c2e81
#18 [ffff9ceff320bf50] ret_from_fork_nospec_begin at ffffffff87b77c24

Thanks!
Stephane
 



 Comments   
Comment by Peter Jones [ 26/Aug/20 ]

Lai

Could you please advise?

Thanks

Peter

Comment by Lai Siyao [ 01/Sep/20 ]

fld_cache_lookup() doesn't look to be the cause of this deadlock, because all of its callers take a read lock. However, it can be optimized: it doesn't appear necessary to check whether a FID is remote during readdir. I'll upload a patch for this.

Comment by Gerrit Updater [ 01/Sep/20 ]

Lai Siyao (lai.siyao@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/39782
Subject: LU-13922 osd-ldiskfs: no need to add OI cache in readdir
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 03120b1a1266fcbd77e00709623d11803a91f655

Comment by Stephane Thiell [ 01/Sep/20 ]

Lai, that sounds great, thanks! After migrating a lot of files to MDT0003, we've seen the overall MDS load increase, with mdt_rdpg03 threads taking a lot of CPU time. We will be happy to test your patch when it's ready.

Comment by Stephane Thiell [ 05/Sep/20 ]

Hi Lai,

Thanks so much! We're running with your patch on this MDS. It seems to help. MDS load went down from 200+ to 10-20 with the same workload. BTW, we have identified a user who is likely generating this load: she has 2500 running jobs and uses Python's glob on large directories (110k+ entries). Perfect for testing readdir scalability!
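For reference, here is a minimal load-generator sketch approximating the workload described above: many concurrent processes running Python's glob against one large directory, which drives sustained readdir traffic to the MDS. The target path, worker count, and iteration count are placeholders for illustration only, not values from this ticket, and this is not the user's actual job script.

#!/usr/bin/env python3
# Sketch of a readdir-heavy workload: N worker processes each glob a
# large directory repeatedly. Path and counts below are hypothetical.
import glob
import os
from multiprocessing import Pool

TARGET_DIR = "/lustre/some/large/dir"  # placeholder: directory with 110k+ entries
WORKERS = 32                           # placeholder: stand-in for the ~2500 jobs
ITERATIONS = 10                        # placeholder: repeats per worker

def scan(_):
    total = 0
    for _ in range(ITERATIONS):
        # Each glob() walks the whole directory, issuing readdir RPCs to the MDS.
        total += len(glob.glob(os.path.join(TARGET_DIR, "*")))
    return total

if __name__ == "__main__":
    with Pool(WORKERS) as pool:
        counts = pool.map(scan, range(WORKERS))
    print("entries seen per worker:", counts)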

Comment by Lai Siyao [ 05/Sep/20 ]

Great, it will be landed once the review has passed.

Comment by Stephane Thiell [ 08/Sep/20 ]

We have been using this patch on top of 2.12.5 for 3 full days now on an MDS in production, and the server load is definitely reduced. The MDS is also more responsive from clients, even under heavy readdir() load from specific jobs. No side effects noticed so far. Thanks Lai!

Comment by Gerrit Updater [ 20/Sep/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/39782/
Subject: LU-13922 osd-ldiskfs: no need to add OI cache in readdir
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: bc5934632df10aaa02b32b8254a473c14c6f8104

Comment by Peter Jones [ 20/Sep/20 ]

Landed for 2.14

Comment by Gerrit Updater [ 03/Oct/20 ]

Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/40135
Subject: LU-13922 osd-ldiskfs: no need to add OI cache in readdir
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: 71927c664d6cfe8b1d931394739a23e7ceec79bd

Comment by Gerrit Updater [ 22/Oct/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/40135/
Subject: LU-13922 osd-ldiskfs: no need to add OI cache in readdir
Project: fs/lustre-release
Branch: b2_12
Current Patch Set:
Commit: bdef16d60713743e832fbb9d14ecb5cd116398c7
