[LU-11201] NMI watchdog: BUG: soft lockup in lfsck_namespace

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.12.0, Lustre 2.10.6
    • Affects Version/s: Lustre 2.10.4
    • Environment: lustre 2.10.4_1.chaos
      kernel 3.10.0-862.9.1.1chaos.ch6.x86_64
      RHEL 7.5 based
      MDT
    • Severity: 3

    Description

      About 4 minutes after recovery ended following MDS startup, the console started reporting soft lockups as shown below. The node started refusing connections from pdsh, and ltop running on the management node began reporting stale data (it was no longer getting updates from cerebro).

      The watchdog repeatedly reports a stack about every 40 seconds. It is usually the same stack, shown below, with the same PID on the same CPU (CPU#6). Running 'lfs check servers' on a compute node with the file system mounted still shows lquake-MDT0008 as active, and the node still responds to lctl ping. An ls of a directory stored on MDT0008 hangs in ptlrpc_set_wait() in ldlm_cli_enqueue() in mdc_enqueue().

      NMI watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [lfsck_namespace:26532]
      Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgc(OE) osd_zfs(OE) lquota(OE) fid(OE) fld(OE) ptlrpc(OE) obdclass(OE) ko2iblnd(OE) lnet(OE) libcfs(OE) nfsv3 nfs_acl ib_ucm sb_edac intel_powerclamp coretemp rpcrdma intel_rapl rdma_ucm iosf_mbi ib_umad ib_uverbs ib_ipoib ib_iser kvm rdma_cm iw_cm libiscsi irqbypass ib_cm iTCO_wdt iTCO_vendor_support mlx5_ib ib_core joydev mlx5_core pcspkr mlxfw devlink lpc_ich i2c_i801 ioatdma ses enclosure sch_fq_codel sg ipmi_si shpchp acpi_cpufreq acpi_power_meter zfs(POE) zunicode(POE) zavl(POE) icp(POE) zcommon(POE) znvpair(POE) spl(OE) binfmt_misc msr_safe(OE) ip_tables rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache dm_round_robin sd_mod crc_t10dif crct10dif_generic 8021q garp mrp stp llc mgag200 crct10dif_pclmul crct10dif_common i2c_algo_bit crc32_pclmul drm_kms_helper scsi_transport_iscsi crc32c_intel syscopyarea sysfillrect ghash_clmulni_intel sysimgblt dm_multipath ixgbe fb_sys_fops aesni_intel ttm lrw mxm_wmi ahci dca gf128mul libahci glue_helper ablk_helper drm mpt3sas cryptd ptp raid_class libata i2c_core pps_core scsi_transport_sas mdio ipmi_devintf ipmi_msghandler wmi sunrpc dm_mirror dm_region_hash dm_log dm_mod
      CPU: 3 PID: 26532 Comm: lfsck_namespace Kdump: loaded Tainted: P           OE  ------------   3.10.0-862.9.1.1chaos.ch6.x86_64 #1
      Hardware name: Intel Corporation S2600WTTR/S2600WTTR, BIOS SE5C610.86B.01.01.0016.033120161139 03/31/2016
      task: ffffa0df6ab94f10 ti: ffffa0df32650000 task.ti: ffffa0df32650000
      RIP: 0010:[<ffffffffc1402f1e>]  [<ffffffffc1402f1e>] lfsck_namespace_filter_linkea_entry.isra.64+0x8e/0x180 [lfsck]
      RSP: 0018:ffffa0df32653af0  EFLAGS: 00000202
      RAX: 0000000000000000 RBX: ffffa0df32653ab8 RCX: ffffa0df2e1d4971
      RDX: 0000000000000000 RSI: ffffa0df30ec8010 RDI: ffffa0df32653bc8
      RBP: ffffa0df32653b38 R08: 0000000000000000 R09: 0000000000000025
      R10: ffffa0df30ec8010 R11: 0000000000000000 R12: ffffa0df32653a70
      R13: ffffa0df32653acc R14: ffffffffc144d09c R15: ffffa0df33d5e240
      FS:  0000000000000000(0000) GS:ffffa0df7e6c0000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007ffff7ff8000 CR3: 0000005cd280e000 CR4: 00000000001607e0
      Call Trace:
       [<ffffffffc140d63d>] ? __lfsck_links_read+0x13d/0x2d0 [lfsck]
       [<ffffffffc14159af>] lfsck_namespace_double_scan_one+0x49f/0x14b0 [lfsck]
       [<ffffffffc087d50e>] ? dmu_buf_rele+0xe/0x10 [zfs]
       [<ffffffffc090225f>] ? zap_unlockdir+0x3f/0x60 [zfs]
       [<ffffffffc1416d82>] lfsck_namespace_double_scan_one_trace_file+0x3c2/0x7e0 [lfsck]
       [<ffffffffc141a7bd>] lfsck_namespace_assistant_handler_p2+0x79d/0xa80 [lfsck]
       [<ffffffffb2e03426>] ? kfree+0x136/0x180
       [<ffffffffc11a9ad8>] ? ptlrpc_set_destroy+0x208/0x4f0 [ptlrpc]
       [<ffffffffc13fe6a4>] lfsck_assistant_engine+0x13e4/0x21a0 [lfsck]
       [<ffffffffb2cd5c20>] ? wake_up_state+0x20/0x20
       [<ffffffffc13fd2c0>] ? lfsck_master_engine+0x1450/0x1450 [lfsck]
       [<ffffffffb2cc0ad1>] kthread+0xd1/0xe0
       [<ffffffffb2cc0a00>] ? insert_kthread_work+0x40/0x40
       [<ffffffffb3344837>] ret_from_fork_nospec_begin+0x21/0x21
       [<ffffffffb2cc0a00>] ? insert_kthread_work+0x40/0x40
      

    Attachments

    Issue Links

    Activity

            [LU-11201] NMI watchdog: BUG: soft lockup in lfsck_namespace
            laisiyao Lai Siyao added a comment -

            I'm working on a patch to check linkea entry validity on unpack; I will commit it later.
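
            As a rough illustration of that idea, a check like the following would be enough to reject records that cannot advance the cursor. This is only a sketch under assumptions: the helper name, its arguments, and where it would be called from are invented for illustration and are not the committed fix.

            /* Hypothetical sketch only -- not the actual patch.  Reject a
             * record whose length is zero/too short to advance the cursor,
             * or that runs past the end of the link xattr buffer. */
            static bool linkea_entry_is_sane(const struct link_ea_entry *lee,
                                             int reclen,
                                             const struct link_ea_header *leh)
            {
                /* The fixed part of a record is the 2-byte reclen field plus
                 * the packed parent FID; a valid record also carries a name. */
                if (reclen <= (int)sizeof(struct link_ea_entry))
                    return false;
                /* The record must end within leh_len bytes of the header. */
                if ((const char *)lee + reclen > (const char *)leh + leh->leh_len)
                    return false;
                return true;
            }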


            ofaaland Olaf Faaland added a comment -

            I confirmed the scan is not advancing because lfsck_namespace_filter_linkea_entry() is calling linkea_next_entry() while ldata->ld_lee != NULL and ldata->ld_reclen == 0.
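
            That observation matches the shape of the code: the filter walks the link xattr in a while loop that only terminates when the cursor becomes NULL, and the cursor is advanced by the record length just unpacked. A simplified paraphrase of the relevant logic (not verbatim from the tree; the match condition is a placeholder):

            /* Simplified paraphrase of lfsck_namespace_filter_linkea_entry(). */
            while (ldata->ld_lee != NULL) {
                linkea_entry_unpack(ldata->ld_lee, &ldata->ld_reclen,
                                    cname, pfid);
                if (/* entry matches the name being filtered */ 0)
                    linkea_del_buf(ldata, cname);
                else
                    linkea_next_entry(ldata);   /* the branch always taken here */
            }

            /* linkea_next_entry(), roughly: step forward by ld_reclen and
             * stop once the cursor passes the end of the buffer. */
            ldata->ld_lee = (struct link_ea_entry *)((char *)ldata->ld_lee +
                                                     ldata->ld_reclen);
            if ((char *)ldata->ld_lee >= (char *)ldata->ld_buf->lb_buf +
                                         ldata->ld_leh->leh_len)
                ldata->ld_lee = NULL;

            With ld_reclen == 0 the cursor never moves, so ld_lee never becomes NULL, the loop never exits, and the kernel thread never yields the CPU, which is exactly what the soft-lockup and rcu_sched stall messages are reporting.
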
            ofaaland Olaf Faaland added a comment - - edited

            I ran with a debug patch and confirmed:

            • lfsck is stuck in the while loop within lfsck_namespace_filter_linkea_entry().
            • It always takes the else path and calls linkea_next_entry(ldata)
            • ldata->ld_lee is non-NULL but never changes
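
            For the record, confirming this does not require anything elaborate; instrumentation along the following lines is enough to show that the cursor and reclen are not changing. This is a hypothetical illustration only, since the actual debug patch is not attached to this ticket.

            /* Hypothetical illustration; the real debug patch is not shown
             * here.  Rate-limited so the spinning loop does not flood the
             * console. */
            printk_ratelimited(KERN_WARNING
                               "lfsck filter: ld_lee=%p ld_reclen=%d leh_len=%llu\n",
                               ldata->ld_lee, ldata->ld_reclen,
                               (unsigned long long)ldata->ld_leh->leh_len);
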
            ofaaland Olaf Faaland added a comment -

            While the node was still responsive but the NMI watchdog warnings were appearing, I changed the debug and subsystem_debug masks and gathered debug logs. I'll attach them with a "responsive-" filename prefix.

            After several minutes of the node being responsive with the NMI watchdog warnings still appearing, I attempted to stop the MDT, which appeared to hang:

            [root@jeti:127.0.0.1-2018-08-06-11:18:08]# pdsh -w e9 umount -t lustre -a
            

            Within several seconds of the umount (maybe immediately, not sure) the node stopped responding to rsh and now seems to be in the same state I was reporting originally.

            So it seems as if attempting to stop the MDT while lfsck is stuck in the loop you mention creates the symptoms that got my attention originally.

            ofaaland Olaf Faaland added a comment -

            I also noticed that every 3 minutes I was getting rcu INFO messages. Here's a subset:

            2018-08-06 09:24:02 [314964.276905] INFO: rcu_sched self-detected stall on CPU[314964.277946] INFO: rcu_sched detected stalls on CPUs/tasks: { 6} (detected by 9, t=314348731 jiffies, g=13439, c=13438, q=148193)                                                                                                      
            2018-08-06 09:27:02 [315144.268862] INFO: rcu_sched self-detected stall on CPU[315144.269911] INFO: rcu_sched detected stalls on CPUs/tasks: { 6} (detected by 9, t=314528736 jiffies, g=13439, c=13438, q=148199)                                                                                                      
            2018-08-06 09:30:02 [315324.260816] INFO: rcu_sched self-detected stall on CPU[315324.261864] INFO: rcu_sched detected stalls on CPUs/tasks: { 6} (detected by 9, t=314708741 jiffies, g=13439, c=13438, q=148211)                                                                                                      
            2018-08-06 09:33:02 [315504.252765] INFO: rcu_sched self-detected stall on CPU[315504.253806] INFO: rcu_sched detected stalls on CPUs/tasks: { 6} (detected by 9, t=314888746 jiffies, g=13439, c=13438, q=148219)                                                                                                      
            2018-08-06 09:36:02 [315684.244711] INFO: rcu_sched self-detected stall on CPU[315684.245760] INFO: rcu_sched detected stalls on CPUs/tasks: { 6} (detected by 9, t=315068751 jiffies, g=13439, c=13438, q=148241)                                                                                                      
            2018-08-06 09:39:02 [315864.236657] INFO: rcu_sched self-detected stall on CPU[315864.237702] INFO: rcu_sched detected stalls on CPUs/tasks: { 6} (detected by 14, t=315248756 jiffies, g=13439, c=13438, q=148245)                                                                                                     
            2018-08-06 09:42:02 [316044.228615] INFO: rcu_sched self-detected stall on CPU[316044.229654] INFO: rcu_sched detected stalls on CPUs/tasks: { 6} (detected by 12, t=315428761 jiffies, g=13439, c=13438, q=148259)                                                                                                     
            2018-08-06 09:45:02 [316224.220568] INFO: rcu_sched self-detected stall on CPU[316224.221613] INFO: rcu_sched detected stalls on CPUs/tasks: { 6} (detected by 14, t=315608766 jiffies, g=13439, c=13438, q=148265)                                                                                                     
            

            After crashing the node I imported the pool and mounted lustre. lfsck started and I began seeing the NMI watchdog warnings and associated stacks on the console (same stacks as before). The node is still responsive to rsh, and running top I see one core is at 100% sys all the time.

            ofaaland Olaf Faaland added a comment -

            I've crashed the node. If there's anything useful I can get for you from the crash dump, let me know.

            laisiyao Lai Siyao added a comment -

            I see; it looks like the lfsck_namespace thread falls into an infinite loop in lfsck_namespace_filter_linkea_entry(). I'm reviewing the related code.

            ofaaland Olaf Faaland added a comment -

            > does the lfsck_namespace process consume 100% of cpu in command 'top'?

            I performed SysRq show-backtrace-all-active-cpus(l) 12 times in 16 seconds.  All 12 times, I got back the same results:

            • The lfsck_namespace process is on the same core in the same stack (CPU: 6 PID: 28738 Comm: lfsck_namespace)
            • One core is in the SysRq / serial console code, handling the request
            • The other 14 cores are reported as idle "NMI backtrace for cpu 9 skipped: idling at ..."

             The NMI watchdog continues to report very regularly. Here is a sample:

            2018-08-03 10:19:17 [59097.755662] NMI watchdog: BUG: soft lockup - CPU#6 stuck for 23s! [lfsck_namespace:28738]
            2018-08-03 10:19:45 [59125.753629] NMI watchdog: BUG: soft lockup - CPU#6 stuck for 23s! [lfsck_namespace:28738]
            2018-08-03 10:20:13 [59153.751597] NMI watchdog: BUG: soft lockup - CPU#6 stuck for 23s! [lfsck_namespace:28738]
            2018-08-03 10:20:41 [59181.749565] NMI watchdog: BUG: soft lockup - CPU#6 stuck for 23s! [lfsck_namespace:28738]
            2018-08-03 10:21:21 [59221.746662] NMI watchdog: BUG: soft lockup - CPU#6 stuck for 22s! [lfsck_namespace:28738]
            2018-08-03 10:21:49 [59249.744629] NMI watchdog: BUG: soft lockup - CPU#6 stuck for 22s! [lfsck_namespace:28738]
            2018-08-03 10:22:17 [59277.742596] NMI watchdog: BUG: soft lockup - CPU#6 stuck for 22s! [lfsck_namespace:28738]
            2018-08-03 10:22:45 [59305.740562] NMI watchdog: BUG: soft lockup - CPU#6 stuck for 22s! [lfsck_namespace:28738]
            2018-08-03 10:23:13 [59333.738528] NMI watchdog: BUG: soft lockup - CPU#6 stuck for 22s! [lfsck_namespace:28738]
            2018-08-03 10:23:41 [59361.736494] NMI watchdog: BUG: soft lockup - CPU#6 stuck for 22s! [lfsck_namespace:28738]
            2018-08-03 10:24:21 [59401.733593] NMI watchdog: BUG: soft lockup - CPU#6 stuck for 23s! [lfsck_namespace:28738]
            2018-08-03 10:24:49 [59429.731557] NMI watchdog: BUG: soft lockup - CPU#6 stuck for 23s! [lfsck_namespace:28738]
            2018-08-03 10:25:17 [59457.729523] NMI watchdog: BUG: soft lockup - CPU#6 stuck for 23s! [lfsck_namespace:28738]
            2018-08-03 10:25:45 [59485.727494] NMI watchdog: BUG: soft lockup - CPU#6 stuck for 22s! [lfsck_namespace:28738]
            2018-08-03 10:26:13 [59513.725459] NMI watchdog: BUG: soft lockup - CPU#6 stuck for 22s! [lfsck_namespace:28738]
            2018-08-03 10:26:41 [59541.723427] NMI watchdog: BUG: soft lockup - CPU#6 stuck for 22s! [lfsck_namespace:28738]
            2018-08-03 10:27:21 [59581.720524] NMI watchdog: BUG: soft lockup - CPU#6 stuck for 22s! [lfsck_namespace:28738]
            2018-08-03 10:27:49 [59609.718499] NMI watchdog: BUG: soft lockup - CPU#6 stuck for 22s! [lfsck_namespace:28738]
            2018-08-03 10:28:17 [59637.716461] NMI watchdog: BUG: soft lockup - CPU#6 stuck for 22s! [lfsck_namespace:28738]
            

            Looking at the last 29 instances, in 24 cases the interval between reports is 27.998 seconds, and the watchdog tells us that for at least 22 of those seconds the process was on the core continuously.

            So it is certainly consuming nearly 100% of a core, if not exactly 100%.

            ofaaland Olaf Faaland added a comment -

            Alternatively, I could crash the node and then examine the crash dump, for example to extract the debug log.

            ofaaland Olaf Faaland added a comment -

            I'm unable to log in to the node. Attempts to rsh or ssh to it report connection refused. I don't know why that would be, since the console log does not indicate the OOM killer ran, but whatever is going on is apparently affecting userspace processes on the node.

            The console responds to SysRq, but does not emit a login prompt in response to input.

            So SysRq is the only tool I know of right now to get information about the state of the node.

            laisiyao Lai Siyao added a comment -

            does the lfsck_namespace process consume 100% of cpu in command 'top'?


            People

              Assignee: laisiyao Lai Siyao
              Reporter: ofaaland Olaf Faaland
              Votes: 0
              Watchers: 5
