Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-9465

Kernel NULL pointer: osd_object.c:427:osd_object_init()) soaked-OST0005: lookup [0x440000401:0x195026b:0x0]/0x920ea8 failed: rc = 17

Details

    • Bug
    • Resolution: Fixed
    • Critical
    • None
    • Lustre 2.10.0
    • Soak stress cluster
    • 3
    • 9,394
    • 9223372036854775807

    Description

      soak-7 survived several failovers, last failover at 2017-05-07 07:41:31
      The soak cluster failed over soak-10 at 2017-05-07 18:23:22
      Immediately after finishing recovery, soak-7 crashed.
      The OSS is reconnected to the recently failed-over MDT on soak-10/11

      May  7 18:22:39 soak-7 kernel: LustreError: 11-0: soaked-MDT0003-lwp-OST0011: operation obd_ping to node 192.168.1.110@o2ib10 failed: rc = -107
      May  7 18:22:39 soak-7 kernel: Lustre: soaked-MDT0003-lwp-OST0005: Connection to soaked-MDT0003 (at 192.168.1.110@o2ib10) was lost; in progress operations using this service will wait for recovery to complete
      May  7 18:22:39 soak-7 kernel: Lustre: Skipped 2 previous similar messages
      May  7 18:22:39 soak-7 kernel: LustreError: Skipped 3 previous similar messages
      May  7 18:23:21 soak-7 kernel: LNet: 228:0:(o2iblnd_cb.c:2421:kiblnd_passive_connect()) Conn stale 192.168.1.111@o2ib10 version 12/12 incarnation 1494181401470091/1494181401470091
      May  7 18:23:21 soak-7 kernel: Lustre: soaked-OST0005: Connection restored to  (at 192.168.1.111@o2ib10)
      May  7 18:23:21 soak-7 kernel: Lustre: Skipped 2 previous similar messages
      May  7 18:23:22 soak-7 kernel: LNet: 7422:0:(o2iblnd_cb.c:1377:kiblnd_reconnect_peer()) Abort reconnection of 192.168.1.111@o2ib10: connected
      May  7 18:23:29 soak-7 kernel: LustreError: 167-0: soaked-MDT0003-lwp-OST0011: This client was evicted by soaked-MDT0003; in progress operations using this service will fail.
      May  7 18:23:29 soak-7 kernel: LustreError: Skipped 1 previous similar message
      May  7 18:23:43 soak-7 kernel: Lustre: soaked-OST0005: deleting orphan objects from 0x440000401:26279429 to 0x440000401:26291121
      May  7 18:23:43 soak-7 kernel: Lustre: soaked-OST0011: deleting orphan objects from 0x780000401:26209136 to 0x780000401:26218273
      May  7 18:23:43 soak-7 kernel: Lustre: soaked-OST000b: deleting orphan objects from 0x5c0000400:26329949 to 0x5c0000400:26339745
      May  7 18:23:43 soak-7 kernel: Lustre: soaked-OST0017: deleting orphan objects from 0x8c0000401:26229632 to 0x8c0000401:26238017
      May  7 18:23:54 soak-7 kernel: LustreError: 167-0: soaked-MDT0003-lwp-OST000b: This client was evicted by soaked-MDT0003; in progress operations using this service will fail.
      May  7 18:23:54 soak-7 kernel: Lustre: soaked-MDT0003-lwp-OST0017: Connection restored to 192.168.1.111@o2ib10 (at 192.168.1.111@o2ib10)
      

      Then, a hard crash

      [38854.133273] Lustre: soaked-MDT0003-lwp-OST0017: Connection restored to 192.168.1.111@o2ib10 (at 192.168.1.111@o2ib10)
      [38854.147850] Lustre: Skipped 3 previous similar messages
      [55622.538966] perf: interrupt took too long (5010 > 5007), lowering kernel.perf_event_max_sample_rate to 39000
      [60371.183844] LustreError: 16407:0:(osd_object.c:427:osd_object_init()) soaked-OST0005: lookup [0x440000401:0x195026b:0x0]/0x920ea8 failed: rc = 17
      [60371.201275] BUG: unable to handle kernel NULL pointer dereference at 0000000000000011
      [60371.211442] IP: [<ffffffffa0a0d328>] lu_object_find_try+0x178/0x2b0 [obdclass]
      [60371.221570] PGD 0
      [60371.225825] Oops: 0000 [#1] SMP
      

      There is a crash dump available on the node, vmcore-dmesg attached.

      Attachments

        1. soak-8.console.log
          2.23 MB
        2. soak-8.vmcore-dmesg.txt
          1010 kB
        3. vmcore-dmesg.txt
          226 kB

        Issue Links

          Activity

            [LU-9465] Kernel NULL pointer: osd_object.c:427:osd_object_init()) soaked-OST0005: lookup [0x440000401:0x195026b:0x0]/0x920ea8 failed: rc = 17

            This issue should be fixed with the patch for LU-9394

            jamesanunez James Nunez (Inactive) added a comment - This issue should be fixed with the patch for LU-9394

            I have created LU-9488 and transferred the information there.

            cliffw Cliff White (Inactive) added a comment - I have created LU-9488 and transferred the information there.
            adilger Andreas Dilger added a comment - - edited

            The LU-9394 fix is for the original problem reported in this bug, "NULL pointer dereference at lu_object_find_try()". The stack trace added yesterday is for something completely different, "soft lockup in osd_inode_iteration()", which looks to be related to LFSCK/Scrub. If the testing with the latest master build hit the soft lockup in osd_inode_iteration() then it should go into a separate ticket.

            adilger Andreas Dilger added a comment - - edited The LU-9394 fix is for the original problem reported in this bug, " NULL pointer dereference at lu_object_find_try() ". The stack trace added yesterday is for something completely different, " soft lockup in osd_inode_iteration() ", which looks to be related to LFSCK/Scrub. If the testing with the latest master build hit the soft lockup in osd_inode_iteration() then it should go into a separate ticket.

            Tests of latest master build failed. That's not a fix

            cliffw Cliff White (Inactive) added a comment - Tests of latest master build failed. That's not a fix

            I tested master build 3577 and hit a very similar problem as reported above. I don't think it's fixed

            cliffw Cliff White (Inactive) added a comment - I tested master build 3577 and hit a very similar problem as reported above. I don't think it's fixed
            laisiyao Lai Siyao added a comment -

            sorry, I tried to mark it duplicate of LU-9394, but I can't fill 'LU-9394' in Bugzilla-Id field. Can someone help me fix it?

            laisiyao Lai Siyao added a comment - sorry, I tried to mark it duplicate of LU-9394 , but I can't fill ' LU-9394 ' in Bugzilla-Id field. Can someone help me fix it?
            laisiyao Lai Siyao added a comment - - edited

            ahh, just found it's fixed in LU-9394 already, and the fix is in latest master.

            laisiyao Lai Siyao added a comment - - edited ahh, just found it's fixed in LU-9394 already, and the fix is in latest master.
            laisiyao Lai Siyao added a comment -

            It looks to be a code error, I will make a patch soon.

            laisiyao Lai Siyao added a comment - It looks to be a code error, I will make a patch soon.

            Restarted testing with latest master, 3577

            • Competed two failovers with lfsck turned off.
            • Restarted with lfsck turned on
            • soak-5 (OSS) completed failover:
              2017-05-09 20:08:27
              ,327:fsmgmt.fsmgmt:INFO     oss_failover completed, running lfsck

            MDS reported a single error:

            May  9 20:08:39 soak-8 kernel: LustreError: 5550:0:(lfsck_lib.c:2680:lfsck_load_one_trace_file()) soaked-MDT0000-osd: unlink lfsck sub trace file lfsck_namespace_00: rc = 0
            

            Before soak hits timeout, MDS has wedged:

            May  9 20:12:41 soak-8 kernel: NMI watchdog: BUG: soft lockup - CPU#6 stuck for 23s! [OI_scrub:5551]
            May  9 20:12:41 soak-8 kernel: Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgs(OE) mgc(OE) osd_ldiskfs(OE) ldiskfs(OE) lquota(OE) fid(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) zfs(POE) zunicode(POE) zavl(POE) zcommon(POE) znvpair(POE) spl(OE) zlib_deflate 8021q garp mrp stp llc rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm mlx4_ib ib_core intel_powerclamp coretemp intel_rapl iosf_mbi kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd dm_round_robin ipmi_ssif sb_edac ipmi_devintf ntb sg iTCO_wdt ioatdma shpchp edac_core mei_me iTCO_vendor_support mei lpc_ich ipmi_si pcspkr i2c_i801
            May  9 20:12:41 soak-8 kernel: ipmi_msghandler wmi nfsd dm_multipath dm_mod nfs_acl lockd grace auth_rpcgss sunrpc ip_tables ext4 mbcache jbd2 sd_mod crc_t10dif crct10dif_generic mlx4_en mgag200 drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops isci igb ttm ahci crct10dif_pclmul crct10dif_common ptp libsas crc32c_intel libahci pps_core drm mlx4_core mpt2sas libata dca raid_class i2c_algo_bit scsi_transport_sas devlink i2c_core fjes
            May  9 20:12:41 soak-8 kernel: CPU: 6 PID: 5551 Comm: OI_scrub Tainted: P           OE  ------------   3.10.0-514.16.1.el7_lustre.x86_64 #1
            May  9 20:12:41 soak-8 kernel: Hardware name: Intel Corporation SandyBridge Platform/To be filled by O.E.M., BIOS SE5C600.86B.01.08.0003.022620131521 02/26/2013
            May  9 20:12:41 soak-8 kernel: task: ffff88083fde6dd0 ti: ffff880703600000 task.ti: ffff880703600000
            May  9 20:12:41 soak-8 kernel: RIP: 0010:[<ffffffffa121d1d9>]  [<ffffffffa121d1d9>] osd_inode_iteration+0x489/0xcc0 [osd_ldiskfs]
            May  9 20:12:41 soak-8 kernel: RSP: 0018:ffff880703603d18  EFLAGS: 00000293
            May  9 20:12:41 soak-8 kernel: RAX: 0000000000000004 RBX: 0000000023f30a01 RCX: 0000000000000000
            May  9 20:12:41 soak-8 kernel: RDX: ffff880703603d78 RSI: ffff8800b2a36000 RDI: ffff8803162f6000
            May  9 20:12:41 soak-8 kernel: RBP: ffff880703603df0 R08: ffff880703603d57 R09: 0000000000000004
            May  9 20:12:41 soak-8 kernel: R10: 0000000023f30a01 R11: ffffea000c8fcc00 R12: 0000000023f30a01
            May  9 20:12:41 soak-8 kernel: R13: ffff880703603d08 R14: 0000000023f30a01 R15: ffff880703603d08
            May  9 20:12:41 soak-8 kernel: FS:  0000000000000000(0000) GS:ffff88042e180000(0000) knlGS:0000000000000000
            May  9 20:12:41 soak-8 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
            May  9 20:12:41 soak-8 kernel: CR2: 00007f64d55202e0 CR3: 00000000019be000 CR4: 00000000000407e0
            May  9 20:12:41 soak-8 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
            May  9 20:12:41 soak-8 kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
            May  9 20:12:41 soak-8 kernel: Stack:
            May  9 20:12:41 soak-8 kernel: ffffffffa121b990 ffffffffa1217a20 ffff8800b2a36000 00000000810d354f
            May  9 20:12:41 soak-8 kernel: ffff8803162f6000 ffff8800b2a37468 0000000020000000 010000000000000c
            May  9 20:12:41 soak-8 kernel: 0000000000000000 0000000000000000 ffff8800b2a36000 0000000000000000
            May  9 20:12:41 soak-8 kernel: Call Trace:
            May  9 20:12:41 soak-8 kernel: [<ffffffffa121b990>] ? osd_ios_ROOT_scan+0x300/0x300 [osd_ldiskfs]
            May  9 20:12:41 soak-8 kernel: [<ffffffffa1217a20>] ? osd_preload_next+0xb0/0xb0 [osd_ldiskfs]
            May  9 20:12:41 soak-8 kernel: [<ffffffffa121e370>] osd_scrub_main+0x960/0xf30 [osd_ldiskfs]
            May  9 20:12:41 soak-8 kernel: [<ffffffff810c54c0>] ? wake_up_state+0x20/0x20
            May  9 20:12:41 soak-8 kernel: [<ffffffffa121da10>] ? osd_inode_iteration+0xcc0/0xcc0 [osd_ldiskfs]
            May  9 20:12:41 soak-8 kernel: [<ffffffff810b0a4f>] kthread+0xcf/0xe0
            May  9 20:12:41 soak-8 kernel: [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
            May  9 20:12:41 soak-8 kernel: [<ffffffff81697318>] ret_from_fork+0x58/0x90
            May  9 20:12:41 soak-8 kernel: [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
            May  9 20:12:41 soak-8 kernel: Code: 00 e8 7c eb 97 ff e9 0f fc ff ff 0f 1f 80 00 00 00 00 45 89 e9 4c 8d 85 67 ff ff ff 48 8b 4d a8 48 8d 55 88 48 8b b5 38 ff ff ff <48> 8b bd 48 ff ff ff 48 8b 85 28 ff ff ff ff d0 85 c0 41 89 c5
            

            lfsck hit 600 second timeout, abort attempted:

            2017-05-09 20:18:46,982:fsmgmt.fsmgmt:ERROR    lfsck still running after 600s, aborting
            2017-05-09 20:18:46,983:fsmgmt.fsmgmt:INFO     executing cmd: lctl lfsck_stop -M soaked-MDT0000
            

            One the MDS hits the lockup, it hits it over and over and within minutes is doing nothing other that hitting the lockup.
            At this point, decided not to wait for the crash, dumped stacks, them forced a crash dump.

            Crash dump is available on soak-8 in /var/crash.
            console log, vmcore-dmesg are attached.

            cliffw Cliff White (Inactive) added a comment - Restarted testing with latest master, 3577 Competed two failovers with lfsck turned off. Restarted with lfsck turned on soak-5 (OSS) completed failover: 2017-05-09 20:08:27 ,327:fsmgmt.fsmgmt:INFO oss_failover completed, running lfsck MDS reported a single error: May 9 20:08:39 soak-8 kernel: LustreError: 5550:0:(lfsck_lib.c:2680:lfsck_load_one_trace_file()) soaked-MDT0000-osd: unlink lfsck sub trace file lfsck_namespace_00: rc = 0 Before soak hits timeout, MDS has wedged: May 9 20:12:41 soak-8 kernel: NMI watchdog: BUG: soft lockup - CPU#6 stuck for 23s! [OI_scrub:5551] May 9 20:12:41 soak-8 kernel: Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgs(OE) mgc(OE) osd_ldiskfs(OE) ldiskfs(OE) lquota(OE) fid(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) zfs(POE) zunicode(POE) zavl(POE) zcommon(POE) znvpair(POE) spl(OE) zlib_deflate 8021q garp mrp stp llc rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm mlx4_ib ib_core intel_powerclamp coretemp intel_rapl iosf_mbi kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd dm_round_robin ipmi_ssif sb_edac ipmi_devintf ntb sg iTCO_wdt ioatdma shpchp edac_core mei_me iTCO_vendor_support mei lpc_ich ipmi_si pcspkr i2c_i801 May 9 20:12:41 soak-8 kernel: ipmi_msghandler wmi nfsd dm_multipath dm_mod nfs_acl lockd grace auth_rpcgss sunrpc ip_tables ext4 mbcache jbd2 sd_mod crc_t10dif crct10dif_generic mlx4_en mgag200 drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops isci igb ttm ahci crct10dif_pclmul crct10dif_common ptp libsas crc32c_intel libahci pps_core drm mlx4_core mpt2sas libata dca raid_class i2c_algo_bit scsi_transport_sas devlink i2c_core fjes May 9 20:12:41 soak-8 kernel: CPU: 6 PID: 5551 Comm: OI_scrub Tainted: P OE ------------ 3.10.0-514.16.1.el7_lustre.x86_64 #1 May 9 20:12:41 soak-8 kernel: Hardware name: Intel Corporation SandyBridge Platform/To be filled by O.E.M., BIOS SE5C600.86B.01.08.0003.022620131521 02/26/2013 May 9 20:12:41 soak-8 kernel: task: ffff88083fde6dd0 ti: ffff880703600000 task.ti: ffff880703600000 May 9 20:12:41 soak-8 kernel: RIP: 0010:[<ffffffffa121d1d9>] [<ffffffffa121d1d9>] osd_inode_iteration+0x489/0xcc0 [osd_ldiskfs] May 9 20:12:41 soak-8 kernel: RSP: 0018:ffff880703603d18 EFLAGS: 00000293 May 9 20:12:41 soak-8 kernel: RAX: 0000000000000004 RBX: 0000000023f30a01 RCX: 0000000000000000 May 9 20:12:41 soak-8 kernel: RDX: ffff880703603d78 RSI: ffff8800b2a36000 RDI: ffff8803162f6000 May 9 20:12:41 soak-8 kernel: RBP: ffff880703603df0 R08: ffff880703603d57 R09: 0000000000000004 May 9 20:12:41 soak-8 kernel: R10: 0000000023f30a01 R11: ffffea000c8fcc00 R12: 0000000023f30a01 May 9 20:12:41 soak-8 kernel: R13: ffff880703603d08 R14: 0000000023f30a01 R15: ffff880703603d08 May 9 20:12:41 soak-8 kernel: FS: 0000000000000000(0000) GS:ffff88042e180000(0000) knlGS:0000000000000000 May 9 20:12:41 soak-8 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 May 9 20:12:41 soak-8 kernel: CR2: 00007f64d55202e0 CR3: 00000000019be000 CR4: 00000000000407e0 May 9 20:12:41 soak-8 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 May 9 20:12:41 soak-8 kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 May 9 20:12:41 soak-8 kernel: Stack: May 9 20:12:41 soak-8 kernel: ffffffffa121b990 ffffffffa1217a20 ffff8800b2a36000 00000000810d354f May 9 20:12:41 soak-8 kernel: ffff8803162f6000 ffff8800b2a37468 0000000020000000 010000000000000c May 9 20:12:41 soak-8 kernel: 0000000000000000 0000000000000000 ffff8800b2a36000 0000000000000000 May 9 20:12:41 soak-8 kernel: Call Trace: May 9 20:12:41 soak-8 kernel: [<ffffffffa121b990>] ? osd_ios_ROOT_scan+0x300/0x300 [osd_ldiskfs] May 9 20:12:41 soak-8 kernel: [<ffffffffa1217a20>] ? osd_preload_next+0xb0/0xb0 [osd_ldiskfs] May 9 20:12:41 soak-8 kernel: [<ffffffffa121e370>] osd_scrub_main+0x960/0xf30 [osd_ldiskfs] May 9 20:12:41 soak-8 kernel: [<ffffffff810c54c0>] ? wake_up_state+0x20/0x20 May 9 20:12:41 soak-8 kernel: [<ffffffffa121da10>] ? osd_inode_iteration+0xcc0/0xcc0 [osd_ldiskfs] May 9 20:12:41 soak-8 kernel: [<ffffffff810b0a4f>] kthread+0xcf/0xe0 May 9 20:12:41 soak-8 kernel: [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140 May 9 20:12:41 soak-8 kernel: [<ffffffff81697318>] ret_from_fork+0x58/0x90 May 9 20:12:41 soak-8 kernel: [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140 May 9 20:12:41 soak-8 kernel: Code: 00 e8 7c eb 97 ff e9 0f fc ff ff 0f 1f 80 00 00 00 00 45 89 e9 4c 8d 85 67 ff ff ff 48 8b 4d a8 48 8d 55 88 48 8b b5 38 ff ff ff <48> 8b bd 48 ff ff ff 48 8b 85 28 ff ff ff ff d0 85 c0 41 89 c5 lfsck hit 600 second timeout, abort attempted: 2017-05-09 20:18:46,982:fsmgmt.fsmgmt:ERROR lfsck still running after 600s, aborting 2017-05-09 20:18:46,983:fsmgmt.fsmgmt:INFO executing cmd: lctl lfsck_stop -M soaked-MDT0000 One the MDS hits the lockup, it hits it over and over and within minutes is doing nothing other that hitting the lockup. At this point, decided not to wait for the crash, dumped stacks, them forced a crash dump. Crash dump is available on soak-8 in /var/crash. console log, vmcore-dmesg are attached.

            Just a couple of notes for this ticket:
            LFSCK was turned off
            Soak testing ran for approximately 48 hours before the crash
            Running master - build #3573 with no other patches applied

            jamesanunez James Nunez (Inactive) added a comment - Just a couple of notes for this ticket: LFSCK was turned off Soak testing ran for approximately 48 hours before the crash Running master - build #3573 with no other patches applied

            People

              laisiyao Lai Siyao
              cliffw Cliff White (Inactive)
              Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: