Loading...

Details

Type: Bug
Resolution: Fixed
Priority: Major
Fix Version/s: Lustre 2.14.0, Lustre 2.12.6
Affects Version/s: Lustre 2.12.5
Labels:
None
Environment:

Hide
CentOS 7.6 (3.10.0-957.27.2.el7_lustre.pl2.x86_64, Lustre 2.12.5 + 2 patches from ~~LU-13599~~

Show
CentOS 7.6 (3.10.0-957.27.2.el7_lustre.pl2.x86_64, Lustre 2.12.5 + 2 patches from LU-13599

Severity:
3
Rank (Obsolete):
9223372036854775807

Description

Recently, with Lustre 2.12.5, we experienced a blocked Lustre client on the Robinhood server with the following task info:

Aug 24 23:05:15 fir-rbh01 kernel: INFO: task robinhood:10783 blocked for more than 120 seconds.
Aug 24 23:05:15 fir-rbh01 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug 24 23:05:15 fir-rbh01 kernel: robinhood       D ffffa13de5779040     0 10783      1 0x00000080
Aug 24 23:05:15 fir-rbh01 kernel: Call Trace:
Aug 24 23:05:15 fir-rbh01 kernel:  [<ffffffffa84fa3d2>] ? security_inode_permission+0x22/0x30
Aug 24 23:05:15 fir-rbh01 kernel:  [<ffffffffa896be39>] schedule_preempt_disabled+0x29/0x70
Aug 24 23:05:15 fir-rbh01 kernel:  [<ffffffffa8969db7>] __mutex_lock_slowpath+0xc7/0x1d0
Aug 24 23:05:15 fir-rbh01 kernel:  [<ffffffffa896919f>] mutex_lock+0x1f/0x2f
Aug 24 23:05:15 fir-rbh01 kernel:  [<ffffffffa8961dd2>] lookup_slow+0x33/0xa7
Aug 24 23:05:15 fir-rbh01 kernel:  [<ffffffffa8453268>] path_lookupat+0x838/0x8b0
Aug 24 23:05:15 fir-rbh01 kernel:  [<ffffffffa84342b2>] ? __mem_cgroup_commit_charge+0xe2/0x2f0
Aug 24 23:05:15 fir-rbh01 kernel:  [<ffffffffa841d665>] ? kmem_cache_alloc+0x35/0x1f0
Aug 24 23:05:15 fir-rbh01 kernel:  [<ffffffffa845415f>] ? getname_flags+0x4f/0x1a0
Aug 24 23:05:15 fir-rbh01 kernel:  [<ffffffffa845330b>] filename_lookup+0x2b/0xc0
Aug 24 23:05:15 fir-rbh01 kernel:  [<ffffffffa84552f7>] user_path_at_empty+0x67/0xc0
Aug 24 23:05:15 fir-rbh01 kernel:  [<ffffffffa83ed3ad>] ? handle_mm_fault+0x39d/0x9b0
Aug 24 23:05:15 fir-rbh01 kernel:  [<ffffffffa8455361>] user_path_at+0x11/0x20
Aug 24 23:05:15 fir-rbh01 kernel:  [<ffffffffa8448223>] vfs_fstatat+0x63/0xc0
Aug 24 23:05:15 fir-rbh01 kernel:  [<ffffffffa8448641>] SYSC_newlstat+0x31/0x60
Aug 24 23:05:15 fir-rbh01 kernel:  [<ffffffffa8972925>] ? do_page_fault+0x35/0x90
Aug 24 23:05:15 fir-rbh01 kernel:  [<ffffffffa8448aae>] SyS_newlstat+0xe/0x10
Aug 24 23:05:15 fir-rbh01 kernel:  [<ffffffffa8977ddb>] system_call_fastpath+0x22/0x27
Aug 24 23:05:15 fir-rbh01 kernel: INFO: task robinhood:10784 blocked for more than 120 seconds.
Aug 24 23:05:15 fir-rbh01 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug 24 23:05:15 fir-rbh01 kernel: robinhood       D ffffa13de5778000     0 10784      1 0x00000080
Aug 24 23:05:15 fir-rbh01 kernel: Call Trace:
Aug 24 23:05:15 fir-rbh01 kernel:  [<ffffffffa84fa3d2>] ? security_inode_permission+0x22/0x30
Aug 24 23:05:15 fir-rbh01 kernel:  [<ffffffffa896be39>] schedule_preempt_disabled+0x29/0x70
Aug 24 23:05:15 fir-rbh01 kernel:  [<ffffffffa8969db7>] __mutex_lock_slowpath+0xc7/0x1d0
Aug 24 23:05:15 fir-rbh01 kernel:  [<ffffffffa896919f>] mutex_lock+0x1f/0x2f
Aug 24 23:05:15 fir-rbh01 kernel:  [<ffffffffa8961dd2>] lookup_slow+0x33/0xa7
Aug 24 23:05:15 fir-rbh01 kernel:  [<ffffffffa8453268>] path_lookupat+0x838/0x8b0
Aug 24 23:05:15 fir-rbh01 kernel:  [<ffffffffa836a57d>] ? tracing_record_cmdline+0x1d/0x120
Aug 24 23:05:15 fir-rbh01 kernel:  [<ffffffffa841d665>] ? kmem_cache_alloc+0x35/0x1f0
Aug 24 23:05:15 fir-rbh01 kernel:  [<ffffffffa845415f>] ? getname_flags+0x4f/0x1a0
Aug 24 23:05:15 fir-rbh01 kernel:  [<ffffffffa845330b>] filename_lookup+0x2b/0xc0
Aug 24 23:05:15 fir-rbh01 kernel:  [<ffffffffa84552f7>] user_path_at_empty+0x67/0xc0
Aug 24 23:05:15 fir-rbh01 kernel:  [<ffffffffa8455361>] user_path_at+0x11/0x20
Aug 24 23:05:15 fir-rbh01 kernel:  [<ffffffffa8448223>] vfs_fstatat+0x63/0xc0
Aug 24 23:05:15 fir-rbh01 kernel:  [<ffffffffa8448641>] SYSC_newlstat+0x31/0x60
Aug 24 23:05:15 fir-rbh01 kernel:  [<ffffffffa833a176>] ? __audit_syscall_exit+0x1e6/0x280
Aug 24 23:05:15 fir-rbh01 kernel:  [<ffffffffa8448aae>] SyS_newlstat+0xe/0x10
Aug 24 23:05:15 fir-rbh01 kernel:  [<ffffffffa8977ddb>] system_call_fastpath+0x22/0x27

The rest of the clients were ok but I guess a inode was blocked. The only solution I found was to restart the MDS (here, fir-MDT0003).

I took a crash dump of the MDS when this happened and is it available on the FTP server as fir-md1-s4_vmcore_2020-08-25-08-27-56.

Also, I'm providing the following files from the crash dump:

vmcore dmesg as fir-md1-s4_vmcore-dmesg_2020-08-25-08-27-56.txt
vmcore's foreach bt as fir-md1-s4_foreach-bt_2020-08-25-08-27-56.txt

Lately, our MDS has been highly loaded with a lot of mdt_rdpg threads, but until then didn't block on any inode. This is different and worth investigating in my opinion.

I'm wondering if this could be due to DNE as I'm seeing a lot of those:

PID: 21603  TASK: ffff9cf0f86ec100  CPU: 41  COMMAND: "mdt_rdpg01_017"
 #0 [ffff9d00ff888e48] crash_nmi_callback at ffffffff87456027
 #1 [ffff9d00ff888e58] nmi_handle at ffffffff87b6f91c
 #2 [ffff9d00ff888eb0] do_nmi at ffffffff87b6fb3d
 #3 [ffff9d00ff888ef0] end_repeat_nmi at ffffffff87b6ed89
    [exception RIP: fld_cache_lookup+157]
    RIP: ffffffffc089d9dd  RSP: ffff9ceff320ba50  RFLAGS: 00000202
    RAX: 0000000000000003  RBX: ffff9d10f868cf00  RCX: ffff9d10f27ad6c0
    RDX: ffff9d103898cbb8  RSI: ffff9d10f27ad6c0  RDI: ffff9d10f868cf28
    RBP: ffff9ceff320ba68   R8: 00000002800432da   R9: 0000000000000007
    R10: ffff9ce79d0a6700  R11: ffff9ce79d0a6700  R12: 00000002800432e6
    R13: ffff9d103898cbb8  R14: 00000002800432e6  R15: ffff9d00456b7000
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
--- <NMI exception stack> ---
 #4 [ffff9ceff320ba50] fld_cache_lookup at ffffffffc089d9dd [fld]
 #5 [ffff9ceff320ba70] fld_local_lookup at ffffffffc089ef02 [fld]
 #6 [ffff9ceff320bab8] osd_fld_lookup at ffffffffc13ae0e8 [osd_ldiskfs]
 #7 [ffff9ceff320bac8] osd_remote_fid at ffffffffc13ae233 [osd_ldiskfs]
 #8 [ffff9ceff320bb08] osd_it_ea_rec at ffffffffc13b73eb [osd_ldiskfs]
 #9 [ffff9ceff320bb70] lod_it_rec at ffffffffc15f6fb7 [lod]
#10 [ffff9ceff320bb80] mdd_dir_page_build at ffffffffc167edd7 [mdd]
#11 [ffff9ceff320bbe8] dt_index_walk at ffffffffc0d082a0 [obdclass]
#12 [ffff9ceff320bc58] mdd_readpage at ffffffffc16805bf [mdd]
#13 [ffff9ceff320bc90] mdt_readpage at ffffffffc14dc82a [mdt]
#14 [ffff9ceff320bcd0] tgt_request_handle at ffffffffc0ff866a [ptlrpc]
#15 [ffff9ceff320bd58] ptlrpc_server_handle_request at ffffffffc0f9b44b [ptlrpc]
#16 [ffff9ceff320bdf8] ptlrpc_main at ffffffffc0f9edb4 [ptlrpc]
#17 [ffff9ceff320bec8] kthread at ffffffff874c2e81
#18 [ffff9ceff320bf50] ret_from_fork_nospec_begin at ffffffff87b77c24

Thanks!
Stephane

Attachments

- Sort By Name
- Sort By Date
- Ascending
- Descending
- Thumbnails
- List
- Download All

fir-md1-s4_vmcore-dmesg_2020-08-25-08-27-56.txt
394 kB
25/Aug/20 5:55 PM
fir-md1-s4_foreach-bt_2020-08-25-08-27-56.txt
890 kB
25/Aug/20 5:55 PM

Client blocked in lstat()

Details

Description

Attachments

Attachments

Activity

People

Dates