[LU-9488] soft lockup in osd_inode_iteration() Created: 10/May/17  Updated: 06/Sep/17  Resolved: 10/Jun/17

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.10.0
Fix Version/s: Lustre 2.10.0

Type: Bug Priority: Major
Reporter: Cliff White (Inactive) Assignee: nasf (Inactive)
Resolution: Fixed Votes: 0
Labels: None
Environment:

Soak test cluster


Attachments: Text File soak-8.console.log     Text File soak-8.vmcore-dmesg.txt    
Issue Links:
Duplicate
duplicates LU-9381 sanity-scrub test_9: CPU stuck in osd... Closed
Related
is related to LU-9952 soft lockup in osd_inode_iteration() ... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Restarted testing with latest master, 3577
Competed two failovers with lfsck turned off.
Restarted with lfsck turned on
soak-5 (OSS) completed failover:

2017-05-09 20:08:27
,327:fsmgmt.fsmgmt:INFO     oss_failover completed, running lfsck

MDS reported a single error:

May  9 20:08:39 soak-8 kernel: LustreError: 5550:0:(lfsck_lib.c:2680:lfsck_load_one_trace_file()) soaked-MDT0000-osd: unlink lfsck sub trace file lfsck_namespace_00: rc = 0

Before soak hits timeout, MDS has wedged:

May  9 20:12:41 soak-8 kernel: NMI watchdog: BUG: soft lockup - CPU#6 stuck for 23s! [OI_scrub:5551]
May  9 20:12:41 soak-8 kernel: Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgs(OE) mgc(OE) osd_ldiskfs(OE) ldiskfs(OE) lquota(OE) fid(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) zfs(POE) zunicode(POE) zavl(POE) zcommon(POE) znvpair(POE) spl(OE) zlib_deflate 8021q garp mrp stp llc rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm mlx4_ib ib_core intel_powerclamp coretemp intel_rapl iosf_mbi kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd dm_round_robin ipmi_ssif sb_edac ipmi_devintf ntb sg iTCO_wdt ioatdma shpchp edac_core mei_me iTCO_vendor_support mei lpc_ich ipmi_si pcspkr i2c_i801
May  9 20:12:41 soak-8 kernel: ipmi_msghandler wmi nfsd dm_multipath dm_mod nfs_acl lockd grace auth_rpcgss sunrpc ip_tables ext4 mbcache jbd2 sd_mod crc_t10dif crct10dif_generic mlx4_en mgag200 drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops isci igb ttm ahci crct10dif_pclmul crct10dif_common ptp libsas crc32c_intel libahci pps_core drm mlx4_core mpt2sas libata dca raid_class i2c_algo_bit scsi_transport_sas devlink i2c_core fjes
May  9 20:12:41 soak-8 kernel: CPU: 6 PID: 5551 Comm: OI_scrub Tainted: P           OE  ------------   3.10.0-514.16.1.el7_lustre.x86_64 #1
May  9 20:12:41 soak-8 kernel: Hardware name: Intel Corporation SandyBridge Platform/To be filled by O.E.M., BIOS SE5C600.86B.01.08.0003.022620131521 02/26/2013
May  9 20:12:41 soak-8 kernel: task: ffff88083fde6dd0 ti: ffff880703600000 task.ti: ffff880703600000
May  9 20:12:41 soak-8 kernel: RIP: 0010:[<ffffffffa121d1d9>]  [<ffffffffa121d1d9>] osd_inode_iteration+0x489/0xcc0 [osd_ldiskfs]
May  9 20:12:41 soak-8 kernel: RSP: 0018:ffff880703603d18  EFLAGS: 00000293
May  9 20:12:41 soak-8 kernel: RAX: 0000000000000004 RBX: 0000000023f30a01 RCX: 0000000000000000
May  9 20:12:41 soak-8 kernel: RDX: ffff880703603d78 RSI: ffff8800b2a36000 RDI: ffff8803162f6000
May  9 20:12:41 soak-8 kernel: RBP: ffff880703603df0 R08: ffff880703603d57 R09: 0000000000000004
May  9 20:12:41 soak-8 kernel: R10: 0000000023f30a01 R11: ffffea000c8fcc00 R12: 0000000023f30a01
May  9 20:12:41 soak-8 kernel: R13: ffff880703603d08 R14: 0000000023f30a01 R15: ffff880703603d08
May  9 20:12:41 soak-8 kernel: FS:  0000000000000000(0000) GS:ffff88042e180000(0000) knlGS:0000000000000000
May  9 20:12:41 soak-8 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
May  9 20:12:41 soak-8 kernel: CR2: 00007f64d55202e0 CR3: 00000000019be000 CR4: 00000000000407e0
May  9 20:12:41 soak-8 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
May  9 20:12:41 soak-8 kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
May  9 20:12:41 soak-8 kernel: Stack:
May  9 20:12:41 soak-8 kernel: ffffffffa121b990 ffffffffa1217a20 ffff8800b2a36000 00000000810d354f
May  9 20:12:41 soak-8 kernel: ffff8803162f6000 ffff8800b2a37468 0000000020000000 010000000000000c
May  9 20:12:41 soak-8 kernel: 0000000000000000 0000000000000000 ffff8800b2a36000 0000000000000000
May  9 20:12:41 soak-8 kernel: Call Trace:
May  9 20:12:41 soak-8 kernel: [<ffffffffa121b990>] ? osd_ios_ROOT_scan+0x300/0x300 [osd_ldiskfs]
May  9 20:12:41 soak-8 kernel: [<ffffffffa1217a20>] ? osd_preload_next+0xb0/0xb0 [osd_ldiskfs]
May  9 20:12:41 soak-8 kernel: [<ffffffffa121e370>] osd_scrub_main+0x960/0xf30 [osd_ldiskfs]
May  9 20:12:41 soak-8 kernel: [<ffffffff810c54c0>] ? wake_up_state+0x20/0x20
May  9 20:12:41 soak-8 kernel: [<ffffffffa121da10>] ? osd_inode_iteration+0xcc0/0xcc0 [osd_ldiskfs]
May  9 20:12:41 soak-8 kernel: [<ffffffff810b0a4f>] kthread+0xcf/0xe0
May  9 20:12:41 soak-8 kernel: [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
May  9 20:12:41 soak-8 kernel: [<ffffffff81697318>] ret_from_fork+0x58/0x90
May  9 20:12:41 soak-8 kernel: [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
May  9 20:12:41 soak-8 kernel: Code: 00 e8 7c eb 97 ff e9 0f fc ff ff 0f 1f 80 00 00 00 00 45 89 e9 4c 8d 85 67 ff ff ff 48 8b 4d a8 48 8d 55 88 48 8b b5 38 ff ff ff <48> 8b bd 48 ff ff ff 48 8b 85 28 ff ff ff ff d0 85 c0 41 89 c5

lfsck hit 600 second timeout, abort attempted:

2017-05-09 20:18:46,982:fsmgmt.fsmgmt:ERROR    lfsck still running after 600s, aborting
2017-05-09 20:18:46,983:fsmgmt.fsmgmt:INFO     executing cmd: lctl lfsck_stop -M soaked-MDT0000

One the MDS hits the lockup, it hits it over and over and within minutes is doing nothing other that hitting the lockup.
At this point, decided not to wait for the crash, dumped stacks, them forced a crash dump.
Crash dump is available on soak



 Comments   
Comment by Peter Jones [ 11/May/17 ]

Fan Yong

Could you please assist with this one?

Thanks

Peter

Comment by Gerrit Updater [ 22/May/17 ]

Fan Yong (fan.yong@intel.com) uploaded a new patch: https://review.whamcloud.com/27228
Subject: LU-9488 scrub: reset noslot for full speed scrub
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: e7b6dfcd1f83d35aaa82856e69863b4bc2695103

Comment by nasf (Inactive) [ 22/May/17 ]

Cliff,

I hope that the patch https://review.whamcloud.com/27228 can fix the OI scrub soft lockup trouble. But if was not, please dump the oi_scrub via proc interface. The patch adds more information the /proc/fs/lustre/osd-ldiskfs/xxx/oi_scrub output, they are helpful to understand the OI scrub status.

Thanks!

Comment by Gerrit Updater [ 10/Jun/17 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/27228/
Subject: LU-9488 scrub: reset noslot for full speed scrub
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: e2ad44de38a4ef42b975e71c01bbfce0ef7dd827

Comment by Peter Jones [ 10/Jun/17 ]

Landed for 2.10

Generated at Sat Feb 10 02:26:38 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.