
[LU-7270] sanity-scrub test_9: panic and reboot on mds

Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Minor

    Description

      This issue was created by maloo for Bob Glossman <bob.glossman@intel.com>

      This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/6a6b269a-6db5-11e5-bd44-5254006e85c2.

      The sub-test test_9 failed with the following error:

      test failed to respond and timed out
      

      syslog from the MDS strongly suggests the node rebooted, probably due to a panic. However, no console logs were captured, making it difficult, if not impossible, to determine the cause. TEI-4028 was filed about the missing console logs.

      This seems to reproduce frequently, maybe 100% of the time, on recent tests of sles11sp3 client/server on master. It has not been seen on other branches.

      Info required for matching: sanity-scrub 9


          Activity


            yong.fan nasf (Inactive) added a comment - The new failure is another instance of LU-9040.

            bogl Bob Glossman (Inactive) added a comment - Seen again.

            MDS panic trace looks like:

            09:02:47:[12636.065002] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 23s! [OI_scrub:20459]
            09:02:47:[12636.065002] Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgc(OE) osd_ldiskfs(OE) lquota(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) sha512_ssse3 sha512_generic crypto_null libcfs(OE) ldiskfs(OE) dm_mod rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod crc_t10dif crct10dif_generic ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_core iosf_mbi crc32_pclmul ghash_clmulni_intel ppdev aesni_intel lrw gf128mul glue_helper ablk_helper cryptd pcspkr virtio_balloon i2c_piix4 parport_pc parport nfsd nfs_acl lockd grace auth_rpcgss sunrpc ip_tables ata_generic pata_acpi ext4 mbcache jbd2 virtio_blk crct10dif_pclmul crct10dif_common ata_piix cirrus crc32c_intel drm_kms_helper syscopyarea sysfillrect serio_raw sysimgblt fb_sys_fops 8139too ttm floppy drm libata virtio_pci virtio_ring virtio 8139cp mii i2c_core
            09:02:47:[12636.065002] CPU: 1 PID: 20459 Comm: OI_scrub Tainted: G           OE  ------------   3.10.0-514.6.1.el7_lustre.x86_64 #1
            09:02:47:[12636.065002] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2007
            09:02:47:[12636.065002] task: ffff880042bf6dd0 ti: ffff88004db8c000 task.ti: ffff88004db8c000
            09:02:47:[12636.065002] RIP: 0010:[<ffffffffa0d0024f>]  [<ffffffffa0d0024f>] osd_inode_iteration+0x56f/0xd80 [osd_ldiskfs]
            09:02:47:[12636.065002] RSP: 0018:ffff88004db8fd20  EFLAGS: 00000293
            09:02:47:[12636.065002] RAX: ffffffffa0cfe930 RBX: ffff88004db8fd57 RCX: ffff880077d5f250
            09:02:47:[12636.065002] RDX: ffff88004db8fd78 RSI: ffff88003d216000 RDI: ffff88003c382000
            09:02:47:[12636.065002] RBP: ffff88004db8fdf0 R08: ffff88004db8fd57 R09: 0000000000000004
            09:02:47:[12636.065002] R10: 0000000077d5f901 R11: ffffea0001df57c0 R12: 0000000000000001
            09:02:47:[12636.065002] R13: 0000000000000004 R14: 0000000077d5f901 R15: ffffea0001df57c0
            09:02:47:[12636.065002] FS:  0000000000000000(0000) GS:ffff88007fd00000(0000) knlGS:0000000000000000
            09:02:47:[12636.065002] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
            09:02:47:[12636.065002] CR2: 00007fed9fd98990 CR3: 000000007b854000 CR4: 00000000000406e0
            09:02:47:[12636.065002] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
            09:02:47:[12636.065002] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
            09:02:47:[12636.065002] Stack:
            09:02:47:[12636.065002]  ffffffffa0cfe930 ffffffffa0cfa810 ffff88003d216000 ffff8800000186a0
            09:02:47:[12636.065002]  ffff88003d217468 ffff88003c382000 010000000000000c 0000000000000000
            09:02:47:[12636.065002]  0000000000000000 ffff88003d216000 0000000000000000 ffff88003da2e800
            09:02:47:[12636.065002] Call Trace:
            09:02:47:[12636.065002]  [<ffffffffa0cfe930>] ? osd_ios_ROOT_scan+0x300/0x300 [osd_ldiskfs]
            09:02:47:[12636.065002]  [<ffffffffa0cfa810>] ? osd_preload_next+0xc0/0xc0 [osd_ldiskfs]
            09:02:47:[12636.065002]  [<ffffffffa0d013c0>] osd_scrub_main+0x960/0xf30 [osd_ldiskfs]
            09:02:47:[12636.065002]  [<ffffffff810c4fd0>] ? wake_up_state+0x20/0x20
            09:02:47:[12636.065002]  [<ffffffffa0d00a60>] ? osd_inode_iteration+0xd80/0xd80 [osd_ldiskfs]
            09:02:47:[12636.065002]  [<ffffffff810b064f>] kthread+0xcf/0xe0
            09:02:47:[12636.065002]  [<ffffffff810b0580>] ? kthread_create_on_node+0x140/0x140
            09:02:47:[12636.065002]  [<ffffffff81696958>] ret_from_fork+0x58/0x90
            09:02:47:[12636.065002]  [<ffffffff810b0580>] ? kthread_create_on_node+0x140/0x140
            09:02:47:[12636.065002] Code: ff 0f 1f 00 41 89 c1 4c 8d 85 67 ff ff ff 48 8b 4d a8 48 8d 55 88 48 8b b5 40 ff ff ff 48 8b bd 58 ff ff ff 48 8b 85 30 ff ff ff <ff> d0 85 c0 0f 84 26 fd ff ff 48 8b 7d 90 48 85 ff 74 11 89 85 
            09:02:47:[12636.065002] Kernel panic - not syncing: softlockup: hung tasks
            09:02:47:[12636.065002] CPU: 1 PID: 20459 Comm: OI_scrub Tainted: G           OEL ------------   3.10.0-514.6.1.el7_lustre.x86_64 #1
            09:02:47:[12636.065002] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2007
            09:02:47:[12636.065002]  ffffffff818d97bd 00000000c06d7c0d ffff88007fd03e18 ffffffff816863f8
            09:02:47:[12636.065002]  ffff88007fd03e98 ffffffff8167f823 0000000000000008 ffff88007fd03ea8
            09:02:47:[12636.065002]  ffff88007fd03e48 00000000c06d7c0d ffff88007fd03e67 0000000000000000
            09:02:47:[12636.065002] Call Trace:
            09:02:47:[12636.065002]  <IRQ>  [<ffffffff816863f8>] dump_stack+0x19/0x1b
            09:02:47:[12636.065002]  [<ffffffff8167f823>] panic+0xe3/0x1f2
            09:02:47:[12636.065002]  [<ffffffff8112eecc>] watchdog_timer_fn+0x20c/0x220
            09:02:47:[12636.065002]  [<ffffffff8112ecc0>] ? watchdog+0x50/0x50
            09:02:47:[12636.065002]  [<ffffffff810b4982>] __hrtimer_run_queues+0xd2/0x260
            09:02:47:[12636.065002]  [<ffffffff810b4f20>] hrtimer_interrupt+0xb0/0x1e0
            09:02:47:[12636.065002]  [<ffffffff8169849c>] ? call_softirq+0x1c/0x30
            09:02:47:[12636.065002]  [<ffffffff81051077>] local_apic_timer_interrupt+0x37/0x60
            09:02:47:[12636.065002]  [<ffffffff8169910f>] smp_apic_timer_interrupt+0x3f/0x60
            09:02:47:[12636.065002]  [<ffffffff8169765d>] apic_timer_interrupt+0x6d/0x80
            09:02:47:[12636.065002]  <EOI>  [<ffffffffa0cfe930>] ? osd_ios_ROOT_scan+0x300/0x300 [osd_ldiskfs]
            09:02:47:[12636.065002]  [<ffffffffa0d0024f>] ? osd_inode_iteration+0x56f/0xd80 [osd_ldiskfs]
            09:02:47:[12636.065002]  [<ffffffffa0cfe930>] ? osd_ios_ROOT_scan+0x300/0x300 [osd_ldiskfs]
            09:02:47:[12636.065002]  [<ffffffffa0cfa810>] ? osd_preload_next+0xc0/0xc0 [osd_ldiskfs]
            09:02:47:[12636.065002]  [<ffffffffa0d013c0>] osd_scrub_main+0x960/0xf30 [osd_ldiskfs]
            09:02:47:[12636.065002]  [<ffffffff810c4fd0>] ? wake_up_state+0x20/0x20
            09:02:47:[12636.065002]  [<ffffffffa0d00a60>] ? osd_inode_iteration+0xd80/0xd80 [osd_ldiskfs]
            09:02:47:[12636.065002]  [<ffffffff810b064f>] kthread+0xcf/0xe0
            09:02:47:[12636.065002]  [<ffffffff810b0580>] ? kthread_create_on_node+0x140/0x140
            09:02:47:[12636.065002]  [<ffffffff81696958>] ret_from_fork+0x58/0x90
            09:02:47:[12636.065002]  [<ffffffff810b0580>] ? kthread_create_on_node+0x140/0x140
            

            https://testing.hpdd.intel.com/test_sets/c9714ef0-ea3a-11e6-9a08-5254006e85c2
            This may be a new bug, but it looks enough like the old one that I'm reopening it.
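
            For context on the failure mode: the trace shows the OI_scrub kthread spinning in osd_inode_iteration() without scheduling for more than 20 seconds, so the NMI watchdog declares a soft lockup; the "Kernel panic - not syncing: softlockup: hung tasks" line indicates the softlockup_panic sysctl is evidently set on these test nodes, so the MDS then panics and reboots, which matches the missing-console-log symptom in the description. Below is a minimal sketch of the generic pattern a long-running kernel scan thread uses to avoid this; scan_thread_fn() is hypothetical, and this is not the actual osd_inode_iteration() code or the fix from change 17403:

                /*
                 * Illustrative only: a long-running kernel scan loop that
                 * yields the CPU periodically so the soft-lockup watchdog
                 * never fires, and that honors kthread_should_stop().
                 */
                #include <linux/kthread.h>
                #include <linux/sched.h>

                static int scan_thread_fn(void *data)   /* hypothetical */
                {
                        unsigned long ino = 0;

                        while (!kthread_should_stop()) {
                                /* ... examine one inode/object per pass ... */
                                ino++;

                                /*
                                 * Without a scheduling point, a loop that never
                                 * blocks holds the CPU for >20s, the NMI watchdog
                                 * reports "soft lockup - CPU#N stuck", and with
                                 * kernel.softlockup_panic=1 the node panics and
                                 * reboots, as in the trace above.
                                 */
                                cond_resched();
                        }
                        return 0;
                }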


            yong.fan nasf (Inactive) added a comment - It may have been fixed by the patch http://review.whamcloud.com/#/c/17403/

            bogl Bob Glossman (Inactive) added a comment - I haven't seen this problem lately and have no way to reproduce it. The patch you refer to is already in master. If this problem isn't showing up in testing, it is probably no longer an issue.

            yong.fan nasf (Inactive) added a comment - Bob, have you hit the same trouble recently? The patch http://review.whamcloud.com/#/c/17403/ fixed some important issues related to OI scrub. Would you please try the latest master? Thanks!

            bogl Bob Glossman (Inactive) added a comment - While this has been seen a lot during testing of sles11sp3 client/server, there is at least one test run that didn't hit it: https://testing.hpdd.intel.com/test_sessions/a402810a-8ae8-11e5-8e06-5254006e85c2. This is evidence that either some recent fix in master made it go away, or the failure is not 100% deterministic on sles11sp3, or both.

            People

              Assignee: wc-triage WC Triage
              Reporter: maloo Maloo
              Votes: 0
              Watchers: 3
