Details
-
Bug
-
Resolution: Fixed
-
Critical
-
Lustre 2.9.0, Lustre 2.10.0
-
Soak test cluster. Lustre 2.9.0 GA Release
-
3
-
9223372036854775807
Description
This looks like a few bugs we've seen before, but not enough like any specific one.
- Started soak
- Started and completed router delay fault
- LFSCK triggered.
- LFSCK dones not complete, runs > 1905s
MGS/MDS0 has soft lockup on CPU.Jan 19 14:06:24 lola-8 kernel: BUG: soft lockup - CPU#9 stuck for 67s! [OI_scrub:19279] Jan 19 14:06:24 lola-8 kernel: Modules linked in: osp(U) mdd(U) lod(U) mdt(U) lfsck(U) mgs(U) mgc(U) osd_ldiskfs(U) lquota(U) lustre(U) lov(U) mdc(U) fid(U) lmv(U) fld(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) sha512_generic crc32c_intel libcfs(U) ldiskfs(U) jbd2 8021q garp stp llc nfsd exportfs nfs lockd fscache auth_rpcgss nfs_acl sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm scsi_dh_rdac dm_round_robin dm_multipath microcode iTCO_wdt iTCO_vendor_support zfs(P)(U) zcommon(P)(U) znvpair(P)(U) spl(U) zlib_deflate zavl(P)(U) zunicode(P)(U) joydev sb_edac edac_core lpc_ich mfd_core i2c_i801 ioatdma sg igb dca i2c_algo_bit i2c_core ext3 jbd mbcache sd_mod crc_t10dif ahci isci libsas wmi mpt2sas scsi_transport_sas raid_class mlx4_ib ib_sa ib_mad ib_core ib_addr ipv6 mlx4_en ptp pps_core mlx4_core dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan] Jan 19 14:06:24 lola-8 kernel: CPU 9 Jan 19 14:06:24 lola-8 kernel: Modules linked in: osp(U) mdd(U) lod(U) mdt(U) lfsck(U) mgs(U) mgc(U) osd_ldiskfs(U) lquota(U) lustre(U) lov(U) mdc(U) fid(U) lmv(U) fld(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) sha512_generic crc32c_intel libcfs(U) ldiskfs(U) jbd2 8021q garp stp llc nfsd exportfs nfs lockd fscache auth_rpcgss nfs_acl sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm scsi_dh_rdac dm_round_robin dm_multipath microcode iTCO_wdt iTCO_vendor_support zfs(P)(U) zcommon(P)(U) znvpair(P)(U) spl(U) zlib_deflate zavl(P)(U) zunicode(P)(U) joydev sb_edac edac_core lpc_ich mfd_core i2c_i801 ioatdma sg igb dca i2c_algo_bit i2c_core ext3 jbd mbcache sd_mod crc_t10dif ahci isci libsas wmi mpt2sas scsi_transport_sas raid_class mlx4_ib ib_sa ib_mad ib_core ib_addr ipv6 mlx4_en ptp pps_core mlx4_core dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan] Jan 19 14:06:24 lola-8 kernel: Jan 19 14:06:24 lola-8 kernel: Pid: 19279, comm: OI_scrub Tainted: P --L------------ 2.6.32-573.26.1.el6_lustre.x86_64 #1 Intel Corporation S2600GZ ........../S2600GZ Jan 19 14:06:24 lola-8 kernel: RIP: 0010:[<ffffffffa10d1e2f>] [<ffffffffa10d1e2f>] osd_scrub_exec+0xf/0x1dc0 [osd_ldiskfs] Jan 19 14:06:24 lola-8 kernel: RSP: 0018:ffff88049472bc20 EFLAGS: 00000282 Jan 19 14:06:24 lola-8 kernel: RAX: 0000000000000004 RBX: ffff88049472bd00 RCX: ffff8803ac48ced0 Jan 19 14:06:24 lola-8 kernel: RDX: ffff88049472bda0 RSI: ffff880216078000 RDI: ffff88038a74d000 Jan 19 14:06:24 lola-8 kernel: RBP: ffffffff8100bc0e R08: ffff88049472bdcf R09: 0000000000000004 Jan 19 14:06:24 lola-8 kernel: R10: ffff8803ac48cec0 R11: ffff8803ac48ced0 R12: ffffffff81fe6920 Jan 19 14:06:24 lola-8 kernel: R13: ffff8802a5151210 R14: ffff88038a74d000 R15: ffffffffa08549a8 Jan 19 14:06:24 lola-8 kernel: FS: 0000000000000000(0000) GS:ffff88044e420000(0000) knlGS:0000000000000000 Jan 19 14:06:24 lola-8 kernel: CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b Jan 19 14:06:24 lola-8 kernel: CR2: 0000003903673b90 CR3: 0000000001a8d000 CR4: 00000000000407e0 Jan 19 14:06:24 lola-8 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 Jan 19 14:06:24 lola-8 kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Jan 19 14:06:24 lola-8 kernel: Process OI_scrub (pid: 19279, threadinfo ffff880494728000, task ffff880802c30ab0) Jan 19 14:06:24 lola-8 kernel: Stack: Jan 19 14:06:24 lola-8 kernel: ffff880216078000 ffff88038a74d000 ffffffffffffff02 ffffffffa10d3c9a Jan 19 14:06:24 lola-8 kernel: <d> 0000000000000010 0000000000000202 ffff88049472bc60 0000000000000018 Jan 19 14:06:24 lola-8 kernel: <d> ffff8803ac48cee0 0000000000000001 ffff8802160790e0 ffff880216079000 Jan 19 14:06:24 lola-8 kernel: Call Trace: Jan 19 14:06:24 lola-8 kernel: [<ffffffffa10d3c9a>] ? osd_scrub_next+0xba/0x4b0 [osd_ldiskfs] Jan 19 14:06:24 lola-8 kernel: [<ffffffffa10d6ec9>] ? osd_inode_iteration+0x4c9/0xd80 [osd_ldiskfs] Jan 19 14:06:24 lola-8 kernel: [<ffffffff81539b0e>] ? thread_return+0x4e/0x7d0 Jan 19 14:06:24 lola-8 kernel: [<ffffffffa10d1e20>] ? osd_scrub_exec+0x0/0x1dc0 [osd_ldiskfs] Jan 19 14:06:24 lola-8 kernel: [<ffffffffa10d3be0>] ? osd_scrub_next+0x0/0x4b0 [osd_ldiskfs] Jan 19 14:06:24 lola-8 kernel: [<ffffffffa10d8905>] ? osd_scrub_main+0x885/0xec0 [osd_ldiskfs] Jan 19 14:06:24 lola-8 kernel: [<ffffffff81067650>] ? default_wake_function+0x0/0x20 Jan 19 14:06:24 lola-8 kernel: [<ffffffffa10d8080>] ? osd_scrub_main+0x0/0xec0 [osd_ldiskfs] Jan 19 14:06:24 lola-8 kernel: [<ffffffff810a138e>] ? kthread+0x9e/0xc0 Jan 19 14:06:24 lola-8 kernel: [<ffffffff8100c28a>] ? child_rip+0xa/0x20 Jan 19 14:06:24 lola-8 kernel: [<ffffffff810a12f0>] ? kthread+0x0/0xc0 Jan 19 14:06:24 lola-8 kernel: [<ffffffff8100c280>] ? child_rip+0x0/0x20 Jan 19 14:06:24 lola-8 kernel: Code: 00 00 00 e8 54 fe 7b ff e9 14 ff ff ff 66 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 48 81 ec e0 00 00 00 48 89 5d d8 <4c> 89 65 e0 4c 89 6d e8 4c 89 75 f0 4c 89 7d f8 0f 1f 44 00 00 an 19 14:06:24 lola-8 kernel: Code: 00 00 00 e8 54 fe 7b ff e9 14 ff ff ff 66 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 48 81 ec e0 00 00 00 48 89 5d d8 <4c> 89 65 e0 4c 89 6d e8 4c 89 75 f0 4c 89 7d f8 0f 1f 44 00 00 Jan 19 14:06:24 lola-8 kernel: Call Trace: Jan 19 14:06:24 lola-8 kernel: [<ffffffffa10d3c9a>] ? osd_scrub_next+0xba/0x4b0 [osd_ldiskfs] Jan 19 14:06:24 lola-8 kernel: [<ffffffffa10d6ec9>] ? osd_inode_iteration+0x4c9/0xd80 [osd_ldiskfs] Jan 19 14:06:24 lola-8 kernel: [<ffffffff81539b0e>] ? thread_return+0x4e/0x7d0 Jan 19 14:06:24 lola-8 kernel: [<ffffffffa10d1e20>] ? osd_scrub_exec+0x0/0x1dc0 [osd_ldiskfs] Jan 19 14:06:24 lola-8 kernel: [<ffffffffa10d3be0>] ? osd_scrub_next+0x0/0x4b0 [osd_ldiskfs] Jan 19 14:06:24 lola-8 kernel: [<ffffffffa10d8905>] ? osd_scrub_main+0x885/0xec0 [osd_ldiskfs] Jan 19 14:06:24 lola-8 kernel: [<ffffffff81067650>] ? default_wake_function+0x0/0x20 Jan 19 14:06:24 lola-8 kernel: [<ffffffffa10d8080>] ? osd_scrub_main+0x0/0xec0 [osd_ldiskfs] Jan 19 14:06:24 lola-8 kernel: [<ffffffff810a138e>] ? kthread+0x9e/0xc0 Jan 19 14:06:24 lola-8 kernel: [<ffffffff8100c28a>] ? child_rip+0xa/0x20 Jan 19 14:06:24 lola-8 kernel: [<ffffffff810a12f0>] ? kthread+0x0/0xc0 Jan 19 14:06:24 lola-8 kernel: [<ffffffff8100c280>] ? child_rip+0x0/0x20
Attempting to get a stack dump, but not going well. System is available is anyone wishes to have a look
Created a build with patches https://review.whamcloud.com/#/c/26751/ and https://review.whamcloud.com/#/c/26893/
soak-8 is currently spinning very hard, available for examination.
60.530576] CPU: 13 PID: 12279 Comm: OI_scrub Tainted: P OEL ------------ 3.10.0-514.16.1.el7_lustre.x86_64 #1^M [ 6560.543976] Hardware name: Intel Corporation S2600GZ ........../S2600GZ, BIOS SE5C600.86B.01.08.0003.022620131521 02/26/2013^M [ 6560.557569] task: ffff8803fca11f60 ti: ffff8803fca24000 task.ti: ffff8803fca24000^M [ 6560.566978] RIP: 0010:[<ffffffffa13a5be9>] [<ffffffffa13a5be9>] osd_scrub_next+0x9/0x3c0 [osd_ldiskfs]^M [ 6560.578543] RSP: 0018:ffff8803fca27d08 EFLAGS: 00000287^M [ 6560.585506] RAX: ffffffffa13a5be0 RBX: 0000000000000001 RCX: ffff8803fca27d98^M [ 6560.594492] RDX: ffff8803fca27d78 RSI: ffff88082a370000 RDI: ffff88082da75000^M [ 6560.603462] RBP: ffff8803fca27d08 R08: 0000000000000001 R09: 0000000000000004^M [ 6560.612418] R10: 000000002a2da701 R11: ffffea0020a8b680 R12: 000000002a2da701^M [ 6560.621356] R13: ffffea0020a8b680 R14: ffff8803fca27c70 R15: ffff8803fca27d08^M [ 6560.630283] FS: 0000000000000000(0000) GS:ffff88082d940000(0000) knlGS:0000000000000000^M [ 6560.640284] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033^M [ 6560.647665] CR2: 00007f6a3b9ab000 CR3: 00000000019be000 CR4: 00000000000407e0^M [ 6560.656614] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000^M [ 6560.665553] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400^M [ 6560.674488] Stack:^M [ 6560.677690] ffff8803fca27df0 ffffffffa13ab1c9 ffffffffa13a9b50 ffffffffa13a5be0^M [ 6560.686960] ffff88082a370000 00000000810d354f ffff88082da75000 ffff88082a371468^M [ 6560.696224] 0000000020000000 0100000000200163 0000000000000000 0000000000000000^M [ 6560.705485] Call Trace:^M [ 6560.709202] [<ffffffffa13ab1c9>] osd_inode_iteration+0x2a9/0xcc0 [osd_ldiskfs]^M [ 6560.718370] [<ffffffffa13a9b50>] ? osd_ios_ROOT_scan+0x300/0x300 [osd_ldiskfs]^M [ 6560.727525] [<ffffffffa13a5be0>] ? osd_preload_next+0xb0/0xb0 [osd_ldiskfs]^M [ 6560.736388] [<ffffffffa13ac540>] osd_scrub_main+0x960/0xf30 [osd_ldiskfs]^M [ 6560.745052] [<ffffffff810c54c0>] ? wake_up_state+0x20/0x20^M [ 6560.752248] [<ffffffffa13abbe0>] ? osd_inode_iteration+0xcc0/0xcc0 [osd_ldiskfs]^M [ 6560.761582] [<ffffffff810b0a4f>] kthread+0xcf/0xe0^M [ 6560.768004] [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140^M [ 6560.776269] [<ffffffff81697318>] ret_from_fork+0x58/0x90^M [ 6560.783271] [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140^M [ 6560.791526] Code: 48 8d 4c 02 10 48 01 c2 e8 25 fc ff ff 48 83 c4 08 5b 41 5c 41 5d 41 5e 5d c3 0f 1f 84 00 00 00 00 00 66 66 66 66 90 55 48 89 e5 <41> 57 49 89 f7 41 56 41 55 49 89 fd 41 54 49 89 d4 53 48 89 cb ^M [ 6588.383881] NMI watchdog: BUG: soft lockup - CPU#13 stuck for 22s! [OI_scrub:12279]^