[LU-6400] conf-sanity test_56: test failed to respond and timed out

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • Fix Version/s: Lustre 2.8.0
    • Affects Version/s: Lustre 2.7.0, Lustre 2.8.0
    • Labels: None
    • Severity: 3
    • Rank (Obsolete): 9223372036854775807

    Description

      This issue was created by maloo for Bob Glossman <bob.glossman@intel.com>

      This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/12b10500-d319-11e4-94cf-5254006e85c2.

      While Maloo reports a 0% failure rate, a history lookup shows a few of these failures in February and March.

      The sub-test test_56 failed with the following error:

      test failed to respond and timed out
      

      Please provide additional information about the failure here.

      Info required for matching: conf-sanity 56

        Activity

          jgmitter Joseph Gmitter (Inactive) added a comment - Patch has landed for 2.8
          ys Yang Sheng added a comment -

          Patch landed. Close ticket.

          gerrit Gerrit Updater added a comment -

          Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/16008/
          Subject: LU-6400 osd: initialize variable before use
          Project: fs/lustre-release
          Branch: master
          Current Patch Set:
          Commit: 2aa7bd74f6fae63692c65217b5dd38a709b0c0bd

          gerrit Gerrit Updater added a comment -

          Yang Sheng (yang.sheng@intel.com) uploaded a new patch: http://review.whamcloud.com/16008
          Subject: LU-6400 osd: initialize variable before use
          Project: fs/lustre-release
          Branch: master
          Current Patch Set: 1
          Commit: 4db4c4528fdeab9ce93161149177f857b05e2993

          pjones Peter Jones added a comment -

          Yang Sheng

          Could you please look into this issue?

          Thanks

          Peter

          green Oleg Drokin added a comment -

          OK, so the lack of logs is due to the timestamps in the logs.
          We can see the logs if we go to the session and examine the previous lustre-initialization-X logs.

          What we see there is:

          04:22:10:shadow-14vm4 login: [22423.103364] BUG: unable to handle kernel paging request at ffffffffffff8828
          04:22:10:[22423.105214] IP: [<ffffffffa08004da>] class_setup+0x63a/0xad0 [obdclass]
          04:22:10:[22423.105214] PGD 1a0b067 PUD 1a0c067 PMD 0 
          04:22:10:[22423.105214] Oops: 0002 [#1] SMP 
          04:22:10:[22423.105214] CPU 1 
          04:22:10:[22423.105214] Modules linked in: osp(EN) mdd(EN) lod(EN) mdt(EN) lfsck(EN) mgs(EN) mgc(EN) osd_ldiskfs(EN) lquota(EN) lustre(EN) lov(EN) mdc(EN) fid(EN) lmv(EN) fld(EN) ksocklnd(EN) ptlrpc(EN) obdclass(EN) lnet(EN) libcfs(EN) ldiskfs(EN) sha512_generic sha1_generic md5 crypto_null crc32c quota_v2 quota_tree jbd2 crc16 nfsd nfs lockd fscache auth_rpcgss nfs_acl sunrpc rdma_ucm rdma_cm iw_cm ib_addr ib_srp scsi_transport_srp scsi_tgt ib_ipoib ib_cm ib_uverbs ib_umad iw_cxgb3 cxgb3 mdio mlx4_en mlx4_ib ib_sa mlx4_core ib_mthca ib_mad ib_core mperf loop dm_mod ipv6 ipv6_lib 8139too floppy virtio_balloon rtc_cmos 8139cp i2c_piix4 mii pcspkr button ttm drm_kms_helper drm i2c_core sysimgblt sysfillrect syscopyarea uhci_hcd ehci_hcd usbcore usb_common intel_agp intel_gtt scsi_dh_emc scsi_dh_rdac scsi_dh_alua scsi_dh_hp_sw scsi_dh virtio_pci ata_generic virtio_blk virtio virtio_ring ata_piix edd ext3 mbcache jbd fan processor ahci libahci libata scsi_mod thermal thermal_sys hwmon [last unloaded: libcfs]
          04:22:10:[22423.121496] Supported: No, Unsupported modules are loaded
          04:22:10:[22423.121496] 
          04:22:10:[22423.121496] Pid: 31534, comm: llog_process_th Tainted: G           EN  3.0.101-0.47.50-default #1 Red Hat KVM
          04:22:10:[22423.121496] RIP: 0010:[<ffffffffa08004da>]  [<ffffffffa08004da>] class_setup+0x63a/0xad0 [obdclass]
          04:22:10:[22423.121496] RSP: 0018:ffff88005d8e9be0  EFLAGS: 00010287
          04:22:10:[22423.121496] RAX: 0000000000000007 RBX: ffff880061f14e80 RCX: ffff8800375e3c00
          04:22:10:[22423.121496] RDX: 0000000000000006 RSI: ffff8800375e3c00 RDI: 0000000000000286
          04:22:10:[22423.121496] RBP: ffff880061f14e80 R08: 000000000000000a R09: 0000000000000010
          04:22:10:[22423.121496] R10: 0000ffff0010ff10 R11: 0000000000000000 R12: 0000000000000000
          04:22:10:[22423.121496] R13: ffff880061f14fd8 R14: ffffffffffff8800 R15: ffff88005d8e9c10
          04:22:10:[22423.121496] FS:  0000000000000000(0000) GS:ffff88007fd00000(0000) knlGS:0000000000000000
          04:22:10:[22423.121496] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
          04:22:10:[22423.121496] CR2: ffffffffffff8828 CR3: 000000007a9a7000 CR4: 00000000000006e0
          04:22:10:[22423.121496] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
          04:22:10:[22423.121496] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
          04:22:10:[22423.121496] Process llog_process_th (pid: 31534, threadinfo ffff88005d8e8000, task ffff88006cf18280)
          04:22:10:[22423.121496] Stack:
          04:22:10:[22423.121496]  ffff880000000800 ffffffffa087c940 551388d200000284 00000000000b2a9a
          04:22:10:[22423.121496]  00007b2e00000000 ffff88006fbf4dc0 0000000410000003 0000000000000000
          04:22:10:[22423.121496]  0000000000000000 ffff88005d8e9c28 ffff88005d8e9c28 0000000000000082
          04:22:10:[22423.121496] Call Trace:
          04:22:10:[22423.121496]  [<ffffffffa0808605>] class_process_config+0xc95/0x18f0 [obdclass]
          04:22:10:[22423.121496]  [<ffffffffa080a448>] class_config_llog_handler+0x978/0x14d0 [obdclass]
          04:22:10:[22423.121496]  [<ffffffffa07ce3cd>] llog_process_thread+0x8bd/0xd10 [obdclass]
          04:22:10:[22423.121496]  [<ffffffffa07ce85a>] llog_process_thread_daemonize+0x3a/0x70 [obdclass]
          04:22:10:[22423.121496]  [<ffffffff81083fe6>] kthread+0x96/0xa0
          04:22:10:[22423.121496]  [<ffffffff8146dce4>] kernel_thread_helper+0x4/0x10
          04:22:10:[22423.121496] Code: ff 48 89 44 24 60 49 8b 46 10 ff 10 4c 89 ff 49 89 c6 e8 4a 6b 01 00 49 81 fe 00 f0 ff ff 0f 87 3d 03 00 00 4c 89 b5 b8 00 00 00 
          04:22:10:[22423.121496]  89 6e 28 e9 7b fe ff ff 0f 1f 44 00 00 c7 05 1e 39 0a 00 20 
          04:22:10:[22423.121496] RIP  [<ffffffffa08004da>] class_setup+0x63a/0xad0 [obdclass]
          04:22:10:[22423.121496]  RSP <ffff88005d8e9be0>
          04:22:10:[22423.121496] CR2: ffffffffffff8828
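
          As a reading aid (not part of the original comment), the register values in the oops above can be cross-checked with a little arithmetic. This is a minimal sketch in Python, assuming the usual Linux convention that the top 4095 values of the address space encode errno values (`MAX_ERRNO` in the kernel's err.h); the helper names `is_err_value` and `as_signed64` are hypothetical. The `Code:` bytes include `49 81 fe 00 f0 ff ff 0f 87`, which appears to be a compare of R14 against -4096 followed by a conditional jump, the shape of an IS_ERR()-style check. The sketch suggests that R14 is neither a valid kernel pointer nor an encoded errno, so such a check would let it through and the write at offset 0x28 would then fault. That would be consistent with a value being used before it was initialized, although the oops alone does not prove it.

```python
# Values copied from the oops above.
R14 = 0xFFFFFFFFFFFF8800   # register holding the suspect pointer
CR2 = 0xFFFFFFFFFFFF8828   # faulting address reported by the CPU

# Assumption: Linux's IS_ERR_VALUE() treats the top MAX_ERRNO (4095)
# values of the 64-bit address space as encoded errno values.
MAX_ERRNO = 4095
MASK64 = (1 << 64) - 1


def is_err_value(ptr: int) -> bool:
    """User-space model of an IS_ERR_VALUE()-style check on a 64-bit pointer."""
    return ptr >= ((-MAX_ERRNO) & MASK64)


def as_signed64(value: int) -> int:
    """Reinterpret an unsigned 64-bit value as a signed integer."""
    return value - (1 << 64) if value & (1 << 63) else value


offset = CR2 - R14
print(f"fault offset from R14: {offset:#x}")
print(f"R14 passes as an error pointer: {is_err_value(R14)}")
print(f"R14 as signed integer: {as_signed64(R14)}")
```

          The fault offset comes out as 0x28, matching the gap between CR2 and R14, and R14 read as a signed integer is -30720, well outside the errno range.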

          olaf Olaf Weber (Inactive) added a comment -

          I may be able to help a bit with the analysis, at least to the extent of answering the question whether my modifications appear to be to blame in this particular case, provided I can get a look at the dmesg output for the crash. (Looking at the actual core would be even better, but I don't know how feasible that would be.)

          green Oleg Drokin added a comment -

          So the issue at hand is that MDS1 crashed (dmesg is empty, with signs of a reboot).
          So we need to hunt down the crashdump and extract dmesg from it to see why it crashed, and analyze from there.

          bogl Bob Glossman (Inactive) added a comment - Another instance seen: https://testing.hpdd.intel.com/test_sets/1bd0824a-d3ce-11e4-8c98-5254006e85c2

          bogl Bob Glossman (Inactive) added a comment -

          I see that conf-sanity test_56 was once disabled in TEI, then restored in TEI-2738. Maybe this is related.

          People

            Assignee: ys Yang Sheng
            Reporter: maloo Maloo