[LU-11201] NMI watchdog: BUG: soft lockup in lfsck_namespace Created: 03/Aug/18  Updated: 03/Jan/19  Resolved: 23/Aug/18

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.10.4
Fix Version/s: Lustre 2.12.0, Lustre 2.10.6

Type: Bug Priority: Major
Reporter: Olaf Faaland Assignee: Lai Siyao
Resolution: Fixed Votes: 0
Labels: llnl
Environment:

lustre 2.10.4_1.chaos
kernel 3.10.0-862.9.1.1chaos.ch6.x86_64
RHEL 7.5 based
MDT


Issue Links:
Related
is related to LU-11419 lfsck does not complete phase2 Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

About 4 minutes after recovery ended following MDS startup, the console started reporting soft lockups as shown below. The node began refusing connections from pdsh, and ltop running on the management node started reporting stale data (it was no longer getting updates from cerebro).

The watchdog is repeatedly reporting a stack about every 40 seconds. Usually it is the same stack, shown below, with the same PID and the same CPU (CPU#6). 'lfs check servers' on a compute node with the file system mounted still shows lquake-MDT0008 as active, and the node still responds to lctl ping. An 'ls' of a directory stored on MDT0008 hangs in ptlrpc_set_wait(), called from ldlm_cli_enqueue() in mdc_enqueue().

NMI watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [lfsck_namespace:26532]
Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgc(OE) osd_zfs(OE) lquota(OE) fid(OE) fld(OE) ptlrpc(OE) obdclass(OE) ko2iblnd(OE) lnet(OE) libcfs(OE) nfsv3 nfs_acl ib_ucm sb_edac intel_powerclamp coretemp rpcrdma intel_rapl rdma_ucm iosf_mbi ib_umad ib_uverbs ib_ipoib ib_iser kvm rdma_cm iw_cm libiscsi irqbypass ib_cm iTCO_wdt iTCO_vendor_support mlx5_ib ib_core joydev mlx5_core pcspkr mlxfw devlink lpc_ich i2c_i801 ioatdma ses enclosure sch_fq_codel sg ipmi_si shpchp acpi_cpufreq acpi_power_meter zfs(POE) zunicode(POE) zavl(POE) icp(POE) zcommon(POE) znvpair(POE) spl(OE) binfmt_misc msr_safe(OE) ip_tables rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache dm_round_robin sd_mod crc_t10dif crct10dif_generic 8021q garp mrp stp llc mgag200 crct10dif_pclmul crct10dif_common i2c_algo_bit crc32_pclmul drm_kms_helper scsi_transport_iscsi crc32c_intel syscopyarea sysfillrect ghash_clmulni_intel sysimgblt dm_multipath ixgbe fb_sys_fops aesni_intel ttm lrw mxm_wmi ahci dca gf128mul libahci glue_helper ablk_helper drm mpt3sas cryptd ptp raid_class libata i2c_core pps_core scsi_transport_sas mdio ipmi_devintf ipmi_msghandler wmi sunrpc dm_mirror dm_region_hash dm_log dm_mod
CPU: 3 PID: 26532 Comm: lfsck_namespace Kdump: loaded Tainted: P           OE  ------------   3.10.0-862.9.1.1chaos.ch6.x86_64 #1
Hardware name: Intel Corporation S2600WTTR/S2600WTTR, BIOS SE5C610.86B.01.01.0016.033120161139 03/31/2016
task: ffffa0df6ab94f10 ti: ffffa0df32650000 task.ti: ffffa0df32650000
RIP: 0010:[<ffffffffc1402f1e>]  [<ffffffffc1402f1e>] lfsck_namespace_filter_linkea_entry.isra.64+0x8e/0x180 [lfsck]
RSP: 0018:ffffa0df32653af0  EFLAGS: 00000202
RAX: 0000000000000000 RBX: ffffa0df32653ab8 RCX: ffffa0df2e1d4971
RDX: 0000000000000000 RSI: ffffa0df30ec8010 RDI: ffffa0df32653bc8
RBP: ffffa0df32653b38 R08: 0000000000000000 R09: 0000000000000025
R10: ffffa0df30ec8010 R11: 0000000000000000 R12: ffffa0df32653a70
R13: ffffa0df32653acc R14: ffffffffc144d09c R15: ffffa0df33d5e240
FS:  0000000000000000(0000) GS:ffffa0df7e6c0000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007ffff7ff8000 CR3: 0000005cd280e000 CR4: 00000000001607e0
Call Trace:
 [<ffffffffc140d63d>] ? __lfsck_links_read+0x13d/0x2d0 [lfsck]
 [<ffffffffc14159af>] lfsck_namespace_double_scan_one+0x49f/0x14b0 [lfsck]
 [<ffffffffc087d50e>] ? dmu_buf_rele+0xe/0x10 [zfs]
 [<ffffffffc090225f>] ? zap_unlockdir+0x3f/0x60 [zfs]
 [<ffffffffc1416d82>] lfsck_namespace_double_scan_one_trace_file+0x3c2/0x7e0 [lfsck]
 [<ffffffffc141a7bd>] lfsck_namespace_assistant_handler_p2+0x79d/0xa80 [lfsck]
 [<ffffffffb2e03426>] ? kfree+0x136/0x180
 [<ffffffffc11a9ad8>] ? ptlrpc_set_destroy+0x208/0x4f0 [ptlrpc]
 [<ffffffffc13fe6a4>] lfsck_assistant_engine+0x13e4/0x21a0 [lfsck]
 [<ffffffffb2cd5c20>] ? wake_up_state+0x20/0x20
 [<ffffffffc13fd2c0>] ? lfsck_master_engine+0x1450/0x1450 [lfsck]
 [<ffffffffb2cc0ad1>] kthread+0xd1/0xe0
 [<ffffffffb2cc0a00>] ? insert_kthread_work+0x40/0x40
 [<ffffffffb3344837>] ret_from_fork_nospec_begin+0x21/0x21
 [<ffffffffb2cc0a00>] ? insert_kthread_work+0x40/0x40


 Comments   
Comment by Olaf Faaland [ 03/Aug/18 ]

I've left the system in this state for now, in case there's more data I can usefully gather. It is a test system.

Comment by Olaf Faaland [ 03/Aug/18 ]

The later watchdog reports show a few different stacks. The one in lfsck_namespace_double_scan_one() appears 10 to 20 times as often as the other stacks.

==> /tmp/t5.lfsck_namespace_double_scan_one <==
Call Trace:
 [<ffffffffc140d63d>] ? __lfsck_links_read+0x13d/0x2d0 [lfsck]
 [<ffffffffc14159af>] lfsck_namespace_double_scan_one+0x49f/0x14b0 [lfsck]
 [<ffffffffc087d50e>] ? dmu_buf_rele+0xe/0x10 [zfs]
 [<ffffffffc090225f>] ? zap_unlockdir+0x3f/0x60 [zfs]
 [<ffffffffc1416d82>] lfsck_namespace_double_scan_one_trace_file+0x3c2/0x7e0 [lfsck]
 [<ffffffffc141a7bd>] lfsck_namespace_assistant_handler_p2+0x79d/0xa80 [lfsck]
 [<ffffffffb2e03426>] ? kfree+0x136/0x180

==> /tmp/t5.sched_show_task <==
Call Trace:
 <IRQ>  [<ffffffffb2cd4d9f>] sched_show_task+0xbf/0x120
 [<ffffffffb2cd8b99>] dump_cpu_task+0x39/0x70
 [<ffffffffb2d51e60>] rcu_dump_cpu_stacks+0x90/0xd0
 [<ffffffffb2d55952>] rcu_check_callbacks+0x482/0x770
 [<ffffffffb2d09130>] ? tick_sched_do_timer+0x50/0x50
 [<ffffffffb2d09130>] ? tick_sched_do_timer+0x50/0x50
 [<ffffffffb2ca95f6>] update_process_times+0x46/0x80

==> /tmp/t5.schedule <==
Call Trace:
 [<ffffffff99cd2148>] ? check_preempt_curr+0x78/0xa0
 [<ffffffff9a337859>] schedule+0x29/0x70
 [<ffffffff9a3350d9>] schedule_timeout+0x289/0x310
 [<ffffffff99f6dda3>] ? number.isra.2+0x323/0x360
 [<ffffffff99cd5b75>] ? wake_up_process+0x15/0x20
 [<ffffffff99cb5a94>] ? wake_up_worker+0x24/0x30
 [<ffffffff99cb62d2>] ? insert_work+0x62/0xa0

==> /tmp/t5.zone_statistics <==
Call Trace:
 [<ffffffff99dc0bf8>] ? zone_statistics+0x88/0xa0
 [<ffffffff99da8d92>] ? get_page_from_freelist+0x502/0xa00
 [<ffffffffc0646ac3>] ? dbuf_find+0x1f3/0x220 [zfs]
 [<ffffffff99da940f>] ? __alloc_pages_nodemask+0x17f/0x470
 [<ffffffff99df5ba8>] ? alloc_pages_current+0x98/0x110
 [<ffffffffc0530798>] ? nvpair_value_common.part.19+0x38/0x170 [znvpair]
 [<ffffffffc0532122>] ? nvlist_lookup_common.part.71+0xa2/0xb0 [znvpair]
Comment by Olaf Faaland [ 03/Aug/18 ]

SysRq show-backtrace-all-active-cpus(l) reports all CPUs as idle except the one running the lfsck_namespace process and the one processing the SysRq.

Comment by Peter Jones [ 03/Aug/18 ]

Lai

Could you please advise?

Thanks

Peter

Comment by Lai Siyao [ 03/Aug/18 ]

Does the lfsck_namespace process consume 100% of a CPU in 'top'?

Comment by Olaf Faaland [ 03/Aug/18 ]

I'm unable to log in to the node. Attempts to rsh or ssh to it report connection refused. I don't know why that would be, since the console log does not indicate the OOM killer ran. But whatever is going on is apparently affecting userspace processes on the node.

The console responds to SysRq, but does not emit a login prompt in response to input.

So SysRq is the only tool I know of right now to get information about the state of the node.

Comment by Olaf Faaland [ 03/Aug/18 ]

Alternatively, I could crash the node and then examine the crash dump, which would likely yield the debug log, for example.

Comment by Olaf Faaland [ 03/Aug/18 ]

"Does the lfsck_namespace process consume 100% of a CPU in 'top'?"

I performed SysRq show-backtrace-all-active-cpus(l) 12 times in 16 seconds.  All 12 times, I got back the same results:

  • The lfsck_namespace process is on the same core in the same stack (CPU: 6 PID: 28738 Comm: lfsck_namespace)
  • One core is in the SysRq / serial console code, handling the request
  • The other 14 cores are reported as idle: "NMI backtrace for cpu 9 skipped: idling at ..."

 The NMI watchdog continues to report very regularly. Here is a sample:

2018-08-03 10:19:17 [59097.755662] NMI watchdog: BUG: soft lockup - CPU#6 stuck for 23s! [lfsck_namespace:28738]
2018-08-03 10:19:45 [59125.753629] NMI watchdog: BUG: soft lockup - CPU#6 stuck for 23s! [lfsck_namespace:28738]
2018-08-03 10:20:13 [59153.751597] NMI watchdog: BUG: soft lockup - CPU#6 stuck for 23s! [lfsck_namespace:28738]
2018-08-03 10:20:41 [59181.749565] NMI watchdog: BUG: soft lockup - CPU#6 stuck for 23s! [lfsck_namespace:28738]
2018-08-03 10:21:21 [59221.746662] NMI watchdog: BUG: soft lockup - CPU#6 stuck for 22s! [lfsck_namespace:28738]
2018-08-03 10:21:49 [59249.744629] NMI watchdog: BUG: soft lockup - CPU#6 stuck for 22s! [lfsck_namespace:28738]
2018-08-03 10:22:17 [59277.742596] NMI watchdog: BUG: soft lockup - CPU#6 stuck for 22s! [lfsck_namespace:28738]
2018-08-03 10:22:45 [59305.740562] NMI watchdog: BUG: soft lockup - CPU#6 stuck for 22s! [lfsck_namespace:28738]
2018-08-03 10:23:13 [59333.738528] NMI watchdog: BUG: soft lockup - CPU#6 stuck for 22s! [lfsck_namespace:28738]
2018-08-03 10:23:41 [59361.736494] NMI watchdog: BUG: soft lockup - CPU#6 stuck for 22s! [lfsck_namespace:28738]
2018-08-03 10:24:21 [59401.733593] NMI watchdog: BUG: soft lockup - CPU#6 stuck for 23s! [lfsck_namespace:28738]
2018-08-03 10:24:49 [59429.731557] NMI watchdog: BUG: soft lockup - CPU#6 stuck for 23s! [lfsck_namespace:28738]
2018-08-03 10:25:17 [59457.729523] NMI watchdog: BUG: soft lockup - CPU#6 stuck for 23s! [lfsck_namespace:28738]
2018-08-03 10:25:45 [59485.727494] NMI watchdog: BUG: soft lockup - CPU#6 stuck for 22s! [lfsck_namespace:28738]
2018-08-03 10:26:13 [59513.725459] NMI watchdog: BUG: soft lockup - CPU#6 stuck for 22s! [lfsck_namespace:28738]
2018-08-03 10:26:41 [59541.723427] NMI watchdog: BUG: soft lockup - CPU#6 stuck for 22s! [lfsck_namespace:28738]
2018-08-03 10:27:21 [59581.720524] NMI watchdog: BUG: soft lockup - CPU#6 stuck for 22s! [lfsck_namespace:28738]
2018-08-03 10:27:49 [59609.718499] NMI watchdog: BUG: soft lockup - CPU#6 stuck for 22s! [lfsck_namespace:28738]
2018-08-03 10:28:17 [59637.716461] NMI watchdog: BUG: soft lockup - CPU#6 stuck for 22s! [lfsck_namespace:28738]

Looking at the last 29 instances, in 24 cases the interval between reports is 27.998 seconds, and the watchdog tells us the process was on the core continuously for at least 22 seconds of each ~28-second interval, a lower bound of roughly 79% CPU on that core.

So, combined with the SysRq samples always showing the same stack and every other core idle, it is certainly using nearly 100%, if not exactly.

Comment by Lai Siyao [ 04/Aug/18 ]

I see; it looks like the lfsck_namespace thread falls into an infinite loop in lfsck_namespace_filter_linkea_entry(). I'm reviewing the related code.

Comment by Olaf Faaland [ 06/Aug/18 ]

I've crashed the node. If there's anything useful I can get for you from the crash dump, let me know.

Comment by Olaf Faaland [ 06/Aug/18 ]

I also noticed that every 3 minutes I was getting rcu INFO messages. Here's a subset:

2018-08-06 09:24:02 [314964.276905] INFO: rcu_sched self-detected stall on CPU[314964.277946] INFO: rcu_sched detected stalls on CPUs/tasks: { 6} (detected by 9, t=314348731 jiffies, g=13439, c=13438, q=148193)                                                                                                      
2018-08-06 09:27:02 [315144.268862] INFO: rcu_sched self-detected stall on CPU[315144.269911] INFO: rcu_sched detected stalls on CPUs/tasks: { 6} (detected by 9, t=314528736 jiffies, g=13439, c=13438, q=148199)                                                                                                      
2018-08-06 09:30:02 [315324.260816] INFO: rcu_sched self-detected stall on CPU[315324.261864] INFO: rcu_sched detected stalls on CPUs/tasks: { 6} (detected by 9, t=314708741 jiffies, g=13439, c=13438, q=148211)                                                                                                      
2018-08-06 09:33:02 [315504.252765] INFO: rcu_sched self-detected stall on CPU[315504.253806] INFO: rcu_sched detected stalls on CPUs/tasks: { 6} (detected by 9, t=314888746 jiffies, g=13439, c=13438, q=148219)                                                                                                      
2018-08-06 09:36:02 [315684.244711] INFO: rcu_sched self-detected stall on CPU[315684.245760] INFO: rcu_sched detected stalls on CPUs/tasks: { 6} (detected by 9, t=315068751 jiffies, g=13439, c=13438, q=148241)                                                                                                      
2018-08-06 09:39:02 [315864.236657] INFO: rcu_sched self-detected stall on CPU[315864.237702] INFO: rcu_sched detected stalls on CPUs/tasks: { 6} (detected by 14, t=315248756 jiffies, g=13439, c=13438, q=148245)                                                                                                     
2018-08-06 09:42:02 [316044.228615] INFO: rcu_sched self-detected stall on CPU[316044.229654] INFO: rcu_sched detected stalls on CPUs/tasks: { 6} (detected by 12, t=315428761 jiffies, g=13439, c=13438, q=148259)                                                                                                     
2018-08-06 09:45:02 [316224.220568] INFO: rcu_sched self-detected stall on CPU[316224.221613] INFO: rcu_sched detected stalls on CPUs/tasks: { 6} (detected by 14, t=315608766 jiffies, g=13439, c=13438, q=148265)                                                                                                     

After crashing the node I imported the pool and mounted Lustre. lfsck started, and I began seeing the NMI watchdog warnings and associated stacks on the console (the same stacks as before). The node is still responsive to rsh, and in top I see one core at 100% sys the whole time.

Comment by Olaf Faaland [ 06/Aug/18 ]

While the node was still responsive but the NMI watchdog warnings were appearing, I changed the debug and subsystem_debug masks and gathered debug logs. I'll attach them with a "responsive-" filename prefix.

After several minutes of the node being responsive and the NMI watchdog warnings appearing, I attempted to stop the MDT, which appeared to hang:

[root@jeti:127.0.0.1-2018-08-06-11:18:08]# pdsh -w e9 umount -t lustre -a

Within several seconds of the umount (maybe immediately; I'm not sure), the node stopped responding to rsh, and it now seems to be in the same state I reported originally.

So it seems as if attempting to stop the MDT while lfsck is stuck in the loop you mention creates the symptoms that got my attention originally.

Comment by Olaf Faaland [ 06/Aug/18 ]

I ran with a debug patch and confirmed:

  • lfsck is stuck in the while loop within lfsck_namespace_filter_linkea_entry().
  • It always takes the else path and calls linkea_next_entry(ldata).
  • ldata->ld_lee is non-NULL but never changes.
Comment by Olaf Faaland [ 06/Aug/18 ]

I confirmed the scan is not advancing because lfsck_namespace_filter_linkea_entry() keeps calling linkea_next_entry() while ldata->ld_lee != NULL and ldata->ld_reclen == 0.
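
For illustration, here is a minimal standalone C sketch of the mechanism these findings describe (simplified; the struct and function names are hypothetical, not the actual Lustre code): the iteration advances the cursor by the current record length, so a corrupt entry with a record length of zero leaves the cursor non-NULL and unchanged forever unless the zero length is detected.

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

struct lee {                    /* hypothetical stand-in for link_ea_entry */
    uint16_t reclen;            /* total length of this entry; 0 if corrupt */
    char     name[14];
};

struct ldata {                  /* hypothetical stand-in for linkea_data */
    char       *buf;            /* packed entries */
    size_t      buf_len;
    struct lee *lee;            /* current entry, like ldata->ld_lee */
    uint16_t    reclen;         /* like ldata->ld_reclen */
};

/* Advance the cursor by the current record length, as the iteration does. */
static void next_entry(struct ldata *ld)
{
    char *next = (char *)ld->lee + ld->reclen;

    if (next >= ld->buf + ld->buf_len) {
        ld->lee = NULL;                 /* end of buffer reached */
    } else {
        ld->lee = (struct lee *)next;
        ld->reclen = ld->lee->reclen;   /* "unpack" the next entry */
    }
}

int main(void)
{
    /* Two packed entries: a valid 16-byte one, then a corrupt one with reclen 0. */
    static char buf[32];
    struct lee *e0 = (struct lee *)buf;
    struct lee *e1 = (struct lee *)(buf + 16);

    e0->reclen = 16;
    strcpy(e0->name, "good");
    e1->reclen = 0;
    strcpy(e1->name, "corrupt");

    struct ldata ld = { buf, sizeof(buf), e0, e0->reclen };

    while (ld.lee != NULL) {
        /*
         * Without this check, reclen == 0 means next_entry() advances the
         * cursor by zero bytes: ld.lee stays non-NULL and never changes,
         * matching the debug findings above.
         */
        if (ld.reclen == 0) {
            printf("corrupt entry at offset %td, stopping scan\n",
                   (char *)ld.lee - buf);
            return 1;
        }
        printf("entry \"%s\" (reclen %u)\n", ld.lee->name, ld.reclen);
        next_entry(&ld);
    }
    return 0;
}

Compiled and run, this prints the valid entry and then reports the corrupt one; without the reclen check, the while loop would spin forever on the same non-NULL cursor, which is what the soft lockup shows.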

Comment by Lai Siyao [ 07/Aug/18 ]

I'm working on a patch to check linkea entry validity on unpack; I will commit it later.
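
A hedged sketch of the kind of validity check that description implies (illustrative names only, not the actual Lustre API or the eventual patch): reject any entry whose record length is shorter than the minimum entry size or runs past the end of the buffer, returning an error so the caller can abort the scan instead of looping.

#include <errno.h>
#include <stddef.h>
#include <stdio.h>

/* Illustrative packed entry header; NOT the real struct link_ea_entry. */
struct ea_entry {
    unsigned char reclen_hi;    /* record length, high byte */
    unsigned char reclen_lo;    /* record length, low byte */
    /* name/FID payload would follow */
};

/*
 * Validate one entry while unpacking it.  Returns the record length on
 * success, or -EINVAL if the entry is corrupt: a length of zero (or
 * anything shorter than the header), or a length running past the end of
 * the buffer, would otherwise leave the iteration cursor stuck.
 */
static int unpack_entry(const struct ea_entry *lee, const char *buf_end)
{
    size_t reclen = ((size_t)lee->reclen_hi << 8) | lee->reclen_lo;

    if (reclen < sizeof(struct ea_entry) ||
        (const char *)lee + reclen > buf_end)
        return -EINVAL;

    return (int)reclen;
}

int main(void)
{
    /* A corrupt entry with record length 0, like the one that hung lfsck. */
    char buf[16] = { 0 };
    int rc = unpack_entry((const struct ea_entry *)buf, buf + sizeof(buf));

    printf("unpack_entry: %d (%s)\n", rc,
           rc < 0 ? "corrupt, abort the scan" : "ok");
    return 0;
}

Here unpack_entry() returns -EINVAL for the zero-length entry, the kind of failure the scan can then report and stop on rather than spin.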

Comment by Gerrit Updater [ 08/Aug/18 ]

Lai Siyao (lai.siyao@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/32958
Subject: LU-11201 lfsck: check linkea entry validity
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 487aff8854a59e897b1511f3389196573ca38c3f

Comment by Andreas Dilger [ 08/Aug/18 ]

The patch addresses the stuck thread doing the linkea iteration, but do we have any idea how/why the link xattr was broken in the first place? Is this a particularly old ZFS filesystem that might have some ancient bug that created bad link xattrs? Has LFSCK previously run successfully on this MDT with this version of Lustre?

Comment by Olaf Faaland [ 13/Aug/18 ]

I don't know how the link xattr was broken.

  • The pools were created and the targets formatted 9-12 months ago. I could go back and get specific version information. The ZFS version there at the time was a 0.7 stable release, after the parallel object allocation fix.
  • LFSCK has been run successfully on this MDT in the past, with a prior 2.8.x Lustre, but I'll have to look it up for more details. I do not think LFSCK has been run on this MDT with this exact version of Lustre.
  • This file system has been bounced back and forth between Lustre 2.8 and Lustre 2.10 a few times. I believe that was without using any 2.10-specific features (e.g. PFL), but it's possible I'm mistaken about that.
Comment by Gerrit Updater [ 23/Aug/18 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/32958/
Subject: LU-11201 lfsck: check linkea entry validity
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: a5441a717c3a97071494ff51cfb72a117d12d96e

Comment by Peter Jones [ 23/Aug/18 ]

Landed for 2.12

Comment by Gerrit Updater [ 25/Aug/18 ]

Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/33078
Subject: LU-11201 lfsck: check linkea entry validity
Project: fs/lustre-release
Branch: b2_10
Current Patch Set: 1
Commit: 919685c9a1dc33067f1d5ac21ccc00fd8a42642d

Comment by Gerrit Updater [ 11/Sep/18 ]

John L. Hammond (jhammond@whamcloud.com) merged in patch https://review.whamcloud.com/33078/
Subject: LU-11201 lfsck: check linkea entry validity
Project: fs/lustre-release
Branch: b2_10
Current Patch Set:
Commit: 4829fb05c6ca672775701de85bc495344ac619e9
