Details
-
Bug
-
Resolution: Fixed
-
Critical
-
Lustre 2.9.0, Lustre 2.10.0
-
Soak test cluster. Lustre 2.9.0 GA Release
-
3
-
9223372036854775807
Description
This looks like a few bugs we've seen before, but not enough like any specific one.
- Started soak
- Started and completed router delay fault
- LFSCK triggered.
- LFSCK dones not complete, runs > 1905s
MGS/MDS0 has soft lockup on CPU.Jan 19 14:06:24 lola-8 kernel: BUG: soft lockup - CPU#9 stuck for 67s! [OI_scrub:19279] Jan 19 14:06:24 lola-8 kernel: Modules linked in: osp(U) mdd(U) lod(U) mdt(U) lfsck(U) mgs(U) mgc(U) osd_ldiskfs(U) lquota(U) lustre(U) lov(U) mdc(U) fid(U) lmv(U) fld(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) sha512_generic crc32c_intel libcfs(U) ldiskfs(U) jbd2 8021q garp stp llc nfsd exportfs nfs lockd fscache auth_rpcgss nfs_acl sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm scsi_dh_rdac dm_round_robin dm_multipath microcode iTCO_wdt iTCO_vendor_support zfs(P)(U) zcommon(P)(U) znvpair(P)(U) spl(U) zlib_deflate zavl(P)(U) zunicode(P)(U) joydev sb_edac edac_core lpc_ich mfd_core i2c_i801 ioatdma sg igb dca i2c_algo_bit i2c_core ext3 jbd mbcache sd_mod crc_t10dif ahci isci libsas wmi mpt2sas scsi_transport_sas raid_class mlx4_ib ib_sa ib_mad ib_core ib_addr ipv6 mlx4_en ptp pps_core mlx4_core dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan] Jan 19 14:06:24 lola-8 kernel: CPU 9 Jan 19 14:06:24 lola-8 kernel: Modules linked in: osp(U) mdd(U) lod(U) mdt(U) lfsck(U) mgs(U) mgc(U) osd_ldiskfs(U) lquota(U) lustre(U) lov(U) mdc(U) fid(U) lmv(U) fld(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) sha512_generic crc32c_intel libcfs(U) ldiskfs(U) jbd2 8021q garp stp llc nfsd exportfs nfs lockd fscache auth_rpcgss nfs_acl sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm scsi_dh_rdac dm_round_robin dm_multipath microcode iTCO_wdt iTCO_vendor_support zfs(P)(U) zcommon(P)(U) znvpair(P)(U) spl(U) zlib_deflate zavl(P)(U) zunicode(P)(U) joydev sb_edac edac_core lpc_ich mfd_core i2c_i801 ioatdma sg igb dca i2c_algo_bit i2c_core ext3 jbd mbcache sd_mod crc_t10dif ahci isci libsas wmi mpt2sas scsi_transport_sas raid_class mlx4_ib ib_sa ib_mad ib_core ib_addr ipv6 mlx4_en ptp pps_core mlx4_core dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan] Jan 19 14:06:24 lola-8 kernel: Jan 19 14:06:24 lola-8 kernel: Pid: 19279, comm: OI_scrub Tainted: P --L------------ 2.6.32-573.26.1.el6_lustre.x86_64 #1 Intel Corporation S2600GZ ........../S2600GZ Jan 19 14:06:24 lola-8 kernel: RIP: 0010:[<ffffffffa10d1e2f>] [<ffffffffa10d1e2f>] osd_scrub_exec+0xf/0x1dc0 [osd_ldiskfs] Jan 19 14:06:24 lola-8 kernel: RSP: 0018:ffff88049472bc20 EFLAGS: 00000282 Jan 19 14:06:24 lola-8 kernel: RAX: 0000000000000004 RBX: ffff88049472bd00 RCX: ffff8803ac48ced0 Jan 19 14:06:24 lola-8 kernel: RDX: ffff88049472bda0 RSI: ffff880216078000 RDI: ffff88038a74d000 Jan 19 14:06:24 lola-8 kernel: RBP: ffffffff8100bc0e R08: ffff88049472bdcf R09: 0000000000000004 Jan 19 14:06:24 lola-8 kernel: R10: ffff8803ac48cec0 R11: ffff8803ac48ced0 R12: ffffffff81fe6920 Jan 19 14:06:24 lola-8 kernel: R13: ffff8802a5151210 R14: ffff88038a74d000 R15: ffffffffa08549a8 Jan 19 14:06:24 lola-8 kernel: FS: 0000000000000000(0000) GS:ffff88044e420000(0000) knlGS:0000000000000000 Jan 19 14:06:24 lola-8 kernel: CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b Jan 19 14:06:24 lola-8 kernel: CR2: 0000003903673b90 CR3: 0000000001a8d000 CR4: 00000000000407e0 Jan 19 14:06:24 lola-8 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 Jan 19 14:06:24 lola-8 kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Jan 19 14:06:24 lola-8 kernel: Process OI_scrub (pid: 19279, threadinfo ffff880494728000, task ffff880802c30ab0) Jan 19 14:06:24 lola-8 kernel: Stack: Jan 19 14:06:24 lola-8 kernel: ffff880216078000 ffff88038a74d000 ffffffffffffff02 ffffffffa10d3c9a Jan 19 14:06:24 lola-8 kernel: <d> 0000000000000010 0000000000000202 ffff88049472bc60 0000000000000018 Jan 19 14:06:24 lola-8 kernel: <d> ffff8803ac48cee0 0000000000000001 ffff8802160790e0 ffff880216079000 Jan 19 14:06:24 lola-8 kernel: Call Trace: Jan 19 14:06:24 lola-8 kernel: [<ffffffffa10d3c9a>] ? osd_scrub_next+0xba/0x4b0 [osd_ldiskfs] Jan 19 14:06:24 lola-8 kernel: [<ffffffffa10d6ec9>] ? osd_inode_iteration+0x4c9/0xd80 [osd_ldiskfs] Jan 19 14:06:24 lola-8 kernel: [<ffffffff81539b0e>] ? thread_return+0x4e/0x7d0 Jan 19 14:06:24 lola-8 kernel: [<ffffffffa10d1e20>] ? osd_scrub_exec+0x0/0x1dc0 [osd_ldiskfs] Jan 19 14:06:24 lola-8 kernel: [<ffffffffa10d3be0>] ? osd_scrub_next+0x0/0x4b0 [osd_ldiskfs] Jan 19 14:06:24 lola-8 kernel: [<ffffffffa10d8905>] ? osd_scrub_main+0x885/0xec0 [osd_ldiskfs] Jan 19 14:06:24 lola-8 kernel: [<ffffffff81067650>] ? default_wake_function+0x0/0x20 Jan 19 14:06:24 lola-8 kernel: [<ffffffffa10d8080>] ? osd_scrub_main+0x0/0xec0 [osd_ldiskfs] Jan 19 14:06:24 lola-8 kernel: [<ffffffff810a138e>] ? kthread+0x9e/0xc0 Jan 19 14:06:24 lola-8 kernel: [<ffffffff8100c28a>] ? child_rip+0xa/0x20 Jan 19 14:06:24 lola-8 kernel: [<ffffffff810a12f0>] ? kthread+0x0/0xc0 Jan 19 14:06:24 lola-8 kernel: [<ffffffff8100c280>] ? child_rip+0x0/0x20 Jan 19 14:06:24 lola-8 kernel: Code: 00 00 00 e8 54 fe 7b ff e9 14 ff ff ff 66 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 48 81 ec e0 00 00 00 48 89 5d d8 <4c> 89 65 e0 4c 89 6d e8 4c 89 75 f0 4c 89 7d f8 0f 1f 44 00 00 an 19 14:06:24 lola-8 kernel: Code: 00 00 00 e8 54 fe 7b ff e9 14 ff ff ff 66 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 48 81 ec e0 00 00 00 48 89 5d d8 <4c> 89 65 e0 4c 89 6d e8 4c 89 75 f0 4c 89 7d f8 0f 1f 44 00 00 Jan 19 14:06:24 lola-8 kernel: Call Trace: Jan 19 14:06:24 lola-8 kernel: [<ffffffffa10d3c9a>] ? osd_scrub_next+0xba/0x4b0 [osd_ldiskfs] Jan 19 14:06:24 lola-8 kernel: [<ffffffffa10d6ec9>] ? osd_inode_iteration+0x4c9/0xd80 [osd_ldiskfs] Jan 19 14:06:24 lola-8 kernel: [<ffffffff81539b0e>] ? thread_return+0x4e/0x7d0 Jan 19 14:06:24 lola-8 kernel: [<ffffffffa10d1e20>] ? osd_scrub_exec+0x0/0x1dc0 [osd_ldiskfs] Jan 19 14:06:24 lola-8 kernel: [<ffffffffa10d3be0>] ? osd_scrub_next+0x0/0x4b0 [osd_ldiskfs] Jan 19 14:06:24 lola-8 kernel: [<ffffffffa10d8905>] ? osd_scrub_main+0x885/0xec0 [osd_ldiskfs] Jan 19 14:06:24 lola-8 kernel: [<ffffffff81067650>] ? default_wake_function+0x0/0x20 Jan 19 14:06:24 lola-8 kernel: [<ffffffffa10d8080>] ? osd_scrub_main+0x0/0xec0 [osd_ldiskfs] Jan 19 14:06:24 lola-8 kernel: [<ffffffff810a138e>] ? kthread+0x9e/0xc0 Jan 19 14:06:24 lola-8 kernel: [<ffffffff8100c28a>] ? child_rip+0xa/0x20 Jan 19 14:06:24 lola-8 kernel: [<ffffffff810a12f0>] ? kthread+0x0/0xc0 Jan 19 14:06:24 lola-8 kernel: [<ffffffff8100c280>] ? child_rip+0x0/0x20
Attempting to get a stack dump, but not going well. System is available is anyone wishes to have a look
Fan Yong,
There are some recent sanity-sec test 9 hangs that have a very similar stack trace to the one in this ticket, but have the patch for this ticket. Would you please review these failures are determine if they are indeed the same issue as stated in this ticket or if they are new failures?
If the below logs show that these failures are unrelated to this ticket, then I can open a new one.
Here are a few of the recent failures:
2017-04-09 - https://testing.hpdd.intel.com/test_sets/1fb1e996-1d1c-11e7-8920-5254006e85c2
2017-04-11 - https://testing.hpdd.intel.com/test_sets/da448a92-1ef3-11e7-9de9-5254006e85c2
2017-04-12 - https://testing.hpdd.intel.com/test_sets/d251faa0-1fab-11e7-b742-5254006e85c2
2017-04-17 - https://testing.hpdd.intel.com/test_sets/4ddd019c-2362-11e7-8920-5254006e85c2