Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-3918

kernel NULL pointer dereference at lod_initialize_objects+0x327

Details

    • Bug
    • Resolution: Duplicate
    • Critical
    • None
    • Lustre 2.4.0
    • None
    • lustre-2.4.0-16chaos_2.6.32_358.14.1.2chaos.ch5.1.1.ch5.1.1.x86_64
    • 3
    • 10355

    Description

      Related to LU-3917, we were testing out the LU-3161 patch. Starting MDS after upgrading from 2.1 to 2.4 following a writeconf. We no longer hit the assertion from those bugs. Now we hit this following NULL pointer deref. Will try to obtain a full debug log and post here.

      (gdb) l *( lod_initialize_objects+0x327)
      0x8707 is in lod_initialize_objects (/usr/src/debug/lustre-2.4.0/lustre/lod/lod_lov.c:672).
      667             if (unlikely(!cfs_bitmap_check(md->lod_ost_bitmap, idx))) {
      668                     CERROR("%s: bad idx: %d\n", lod2obd(md)->obd_name, idx);
      669                     return -EINVAL;
      670             }
      671
      672             if (unlikely(OST_TGT(md, idx) == NULL)) {
      673                     CERROR("%s: bad lod_tgt_desc for idx: %d\n",
      674                            lod2obd(md)->obd_name, idx);
      675                     return -EINVAL;
      676             }
      
      2013-09-10 12:38:43 Lustre: Lustre: Build Version: 2.4.0-16chaos-16chaos--PRISTINE-2.6.32-358.14.1.2chaos.ch5.1.1.x86_64
      2013-09-10 12:38:43 LDISKFS-fs (sdb): mounted filesystem with ordered data mode. quota=off. Opts: 
      2013-09-10 12:38:44 Lustre: lsc-MDT0000: Not available for connect from 192.168.117.117@o2ib10 (not set up)
      2013-09-10 12:38:44 Lustre: lsc-MDT0000: Not available for connect from 192.168.116.198@o2ib5 (not set up)
      2013-09-10 12:38:45 Lustre: 4192:0:(mdt_handler.c:4947:mdt_process_config()) For interoperability, skip this mdd.quota_type. It is obsolete.
      2013-09-10 12:38:45 LustreError: 11-0: lsc-MDT0000-lwp-MDT0000: Communicating with 0@lo, operation mds_connect failed with -11.
      2013-09-10 12:38:45 LustreError: 3988:0:(lod_lov.c:668:validate_lod_and_idx()) lsc-MDT0000-mdtlov: bad idx: 459
      2013-09-10 12:38:45 LustreError: 3988:0:(mdd_orphans.c:400:orph_key_test_and_del()) lsc-MDD0000: error unlinking orphan [0x7e4cf25127a:0x207:0x0] from PENDING: rc = -22
      2013-09-10 12:38:45 LustreError: 3988:0:(lod_lov.c:668:validate_lod_and_idx()) lsc-MDT0000-mdtlov: bad idx: 425
      2013-09-10 12:38:45 LustreError: 3988:0:(mdd_orphans.c:400:orph_key_test_and_del()) lsc-MDD0000: error unlinking orphan [0x7e4cf35c78c:0xb7b:0x0] from PENDING: rc = -22
      2013-09-10 12:38:45 BUG: unable to handle kernel NULL pointer dereference at 00000000000006c0
      2013-09-10 12:38:45 IP: [<ffffffffa11406d7>] lod_initialize_objects+0x327/0xd00 [lod]
      2013-09-10 12:38:45 PGD 633810067 PUD 63448e067 PMD 0 
      2013-09-10 12:38:45 Oops: 0000 [#1] SMP 
      2013-09-10 12:38:45 last sysfs file: /sys/module/fid/initstate
      2013-09-10 12:38:45 CPU 8 
      2013-09-10 12:38:45 Modules linked in: osp(U) lod(U) mdt(U) mgs(U) mgc(U) fsfilt_ldiskfs(U) osd_ldiskfs(U) ldiskfs(U) mbcache lquota(U) jbd2 mdd(U) lustre(U) lov(U) osc(U) mdc(U) fid(U) fld(U) ptlrpc(U) obdclass(U) lvfs(U) zfs(P)(U) zcommon(P)(U) znvpair(P)(U) zavl(P)(U) zunicode(P)(U) spl(U) zlib_deflate powernow_k8 freq_table mperf ko2iblnd(U) lnet(U) sha512_generic sha256_generic libcfs(U) ipmi_devintf ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr mlx4_ib ib_sa ib_mad ib_core dm_mirror dm_region_hash dm_log dm_mod vhost_net macvtap macvlan tun kvm_amd kvm ppdev parport_pc parport sg sd_mod crc_t10dif radeon ttm drm_kms_helper drm i2c_algo_bit sr_mod cdrom serio_raw shpchp k10temp amd64_edac_mod edac_core edac_mce_amd 3w_9xxx sata_nv ata_generic pata_acpi pata_amd i2c_nforce2 i2c_core ipv6 nfs lockd fscache auth_rpcgss nfs_acl sunrpc e1000 mlx4_en mlx4_core [last unloaded: cpufreq_ondemand]
      2013-09-10 12:38:45 
      2013-09-10 12:38:45 Pid: 3988, comm: mount.lustre Tainted: P           ---------------    2.6.32-358.14.1.2chaos.ch5.1.1.x86_64 #1 Supermicro H8QM8/H8QM8
      2013-09-10 12:38:45 RIP: 0010:[<ffffffffa11406d7>]  [<ffffffffa11406d7>] lod_initialize_objects+0x327/0xd00 [lod]
      2013-09-10 12:38:45 RSP: 0018:ffff8804343198d8  EFLAGS: 00010213
      2013-09-10 12:38:45 RAX: 0000000000000000 RBX: ffff880632a65a00 RCX: 00000000ffffffff
      2013-09-10 12:38:45 RDX: 00000000000000d8 RSI: 000000000000005e RDI: ffff880831194900
      2013-09-10 12:38:45 RBP: ffff880434319958 R08: 00000000000001d8 R09: ffff880633e80ba0
      2013-09-10 12:38:45 R10: ffff880628ffd6c0 R11: 0000000000000000 R12: 0000000000000000
      2013-09-10 12:38:45 R13: 0000000000000000 R14: ffff880632a65a00 R15: ffff880831194000
      2013-09-10 12:38:45 FS:  00007ffff7fdb700(0000) GS:ffff88043f400000(0000) knlGS:0000000000000000
      2013-09-10 12:38:45 CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      2013-09-10 12:38:45 CR2: 00000000000006c0 CR3: 0000000633983000 CR4: 00000000000007e0
      2013-09-10 12:38:45 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      2013-09-10 12:38:45 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      2013-09-10 12:38:45 Process mount.lustre (pid: 3988, threadinfo ffff880434318000, task ffff8804339ccaa0)
      2013-09-10 12:38:45 Stack:
      2013-09-10 12:38:45  ffff8806000001d8 0000000000000002 0002880400000002 0000000000000010
      2013-09-10 12:38:45 <d> ffff880632a65a30 ffff880632943f60 ffff880434319b78 ffff880633e80ba0
      2013-09-10 12:38:45 <d> ffff8808311940e8 ffff880632916998 0000a30128ffd6c0 ffff880632916998
      2013-09-10 12:38:45 Call Trace:
      2013-09-10 12:38:45  [<ffffffffa11412b1>] lod_parse_striping+0x201/0x300 [lod]
      2013-09-10 12:38:45  [<ffffffffa1142d00>] lod_load_striping+0x290/0x440 [lod]
      2013-09-10 12:38:45  [<ffffffffa114e73b>] lod_declare_object_destroy+0x16b/0x390 [lod]
      2013-09-10 12:38:45  [<ffffffffa0e180a9>] __mdd_orphan_cleanup+0x7d9/0xca0 [mdd]
      2013-09-10 12:38:45  [<ffffffffa0e2733d>] mdd_recovery_complete+0xed/0x170 [mdd]
      2013-09-10 12:38:45  [<ffffffffa1086ab0>] ? mdt_recovery_handle+0x0/0x120 [mdt]
      2013-09-10 12:38:45  [<ffffffffa1078cb5>] mdt_postrecov+0x35/0xd0 [mdt]
      2013-09-10 12:38:45  [<ffffffffa107d48b>] mdt_prepare+0x3cb/0x490 [mdt]
      2013-09-10 12:38:45  [<ffffffffa0884b86>] server_start_targets+0x1266/0x1960 [obdclass]
      2013-09-10 12:38:45  [<ffffffffa0854c15>] ? lustre_start_mgc+0x4b5/0x1ee0 [obdclass]
      2013-09-10 12:38:45  [<ffffffffa084c950>] ? class_config_llog_handler+0x0/0x1830 [obdclass]
      2013-09-10 12:38:45  [<ffffffffa0886164>] server_fill_super+0xee4/0x1580 [obdclass]
      2013-09-10 12:38:45  [<ffffffffa0856818>] lustre_fill_super+0x1d8/0x530 [obdclass]
      2013-09-10 12:38:45  [<ffffffffa0856640>] ? lustre_fill_super+0x0/0x530 [obdclass]
      2013-09-10 12:38:45  [<ffffffff8118539f>] get_sb_nodev+0x5f/0xa0
      2013-09-10 12:38:45  [<ffffffffa084e285>] lustre_get_sb+0x25/0x30 [obdclass]
      2013-09-10 12:38:45  [<ffffffff811849bb>] vfs_kern_mount+0x7b/0x1b0
      2013-09-10 12:38:45  [<ffffffff81184b62>] do_kern_mount+0x52/0x130
      2013-09-10 12:38:45  [<ffffffff811a4bf2>] do_mount+0x2d2/0x8d0
      2013-09-10 12:38:45  [<ffffffff811a5280>] sys_mount+0x90/0xe0
      2013-09-10 12:38:45  [<ffffffff8100b0b2>] system_call_fastpath+0x16/0x1b
      2013-09-10 12:38:45 Code: f0 08 00 00 44 8b 45 80 44 0f a3 40 08 19 c0 85 c0 0f 84 48 06 00 00 44 89 c0 41 0f b6 d0 c1 f8 08 48 98 49 8b 84 c7 e8 00 00 00 <48> 8b 04 d0 48 85 c0 0f 84 64 07 00 00 4c 8b 30 4d 85 f6 0f 84 
      2013-09-10 12:38:45 RIP  [<ffffffffa11406d7>] lod_initialize_objects+0x327/0xd00 [lod]
      2013-09-10 12:38:45  RSP <ffff8804343198d8>
      2013-09-10 12:38:45 CR2: 00000000000006c0
      2013-09-10 12:38:45 REWRITING MCP55 CFG REG
      2013-09-10 12:38:45 CFG = c1
      

      Attachments

        Issue Links

          Activity

            [LU-3918] kernel NULL pointer dereference at lod_initialize_objects+0x327
            bobijam Zhenyu Xu added a comment -
            crash> struct lod_device 0xffff88083134e000
            ...
              lod_ost_descs = {
                ltd_tgt_idx = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, 
                ltd_tgts_size = 32, 
                ltd_tgtnr = 0, 
                ltd_tgt_bitmap = 0xffff8808312cd620, 
            ...
            
            crash> struct cfs_bitmap_t 0xffff8808312cd620
            struct cfs_bitmap_t {
              size = 32, 
              data = 0xffff8808312cd628
            }
            
            crash> x/x 0xffff8808312cd628
            0xffff8808312cd628:     0x00000000
            

            strangely !cfs_bitmap_check(md->lod_ost_bitmap, idx) should has caught that there's no target on that index, since bitmap is all 0, unless the idx is beyond 32.

            bobijam Zhenyu Xu added a comment - crash> struct lod_device 0xffff88083134e000 ... lod_ost_descs = { ltd_tgt_idx = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, ltd_tgts_size = 32, ltd_tgtnr = 0, ltd_tgt_bitmap = 0xffff8808312cd620, ... crash> struct cfs_bitmap_t 0xffff8808312cd620 struct cfs_bitmap_t { size = 32, data = 0xffff8808312cd628 } crash> x/x 0xffff8808312cd628 0xffff8808312cd628: 0x00000000 strangely !cfs_bitmap_check(md->lod_ost_bitmap, idx) should has caught that there's no target on that index, since bitmap is all 0, unless the idx is beyond 32.
            nedbass Ned Bass (Inactive) added a comment - - edited

            Uploaded vmcore to ftp.whamcloud.com. Email bass6 at llnl dot gov if you need the filename.

            nedbass Ned Bass (Inactive) added a comment - - edited Uploaded vmcore to ftp.whamcloud.com. Email bass6 at llnl dot gov if you need the filename.

            I'll upload a crash dump today. Sorry for the delay, I've been busy putting out other fires.

            nedbass Ned Bass (Inactive) added a comment - I'll upload a crash dump today. Sorry for the delay, I've been busy putting out other fires.
            bobijam Zhenyu Xu added a comment -

            Still needs to know where lod_initialize_objects+0x327 refers to. Do you have a rehit and collect crashdump/image?

            bobijam Zhenyu Xu added a comment - Still needs to know where lod_initialize_objects+0x327 refers to. Do you have a rehit and collect crashdump/image?
            pjones Peter Jones added a comment -

            This seems to be related to LU-3917.

            Ned, is there any further information available to assist with the investigation?

            pjones Peter Jones added a comment - This seems to be related to LU-3917 . Ned, is there any further information available to assist with the investigation?
            green Oleg Drokin added a comment -

            So do you still have the crashdump/image of problematic fs to help us move forward on this?

            green Oleg Drokin added a comment - So do you still have the crashdump/image of problematic fs to help us move forward on this?

            We need to find someone other than Oleg to assign this ticket to

            jlevi Jodi Levi (Inactive) added a comment - We need to find someone other than Oleg to assign this ticket to
            green Oleg Drokin added a comment -

            I see.

            Can you please trace the array content (in the crashdump) to make sure where the NULL pointer really is this time? that line is a big macro with several dereferences.

            green Oleg Drokin added a comment - I see. Can you please trace the array content (in the crashdump) to make sure where the NULL pointer really is this time? that line is a big macro with several dereferences.

            Oleg, I obtained this debug log by reproducing the crash with a -1 lustre debug level. It does not match the console output above (i.e. the LustreError messages) because those errors did not reproduce after the first crash. However, the first crash had default debug levels so the debug log from that one would probably not be helpful.

            nedbass Ned Bass (Inactive) added a comment - Oleg, I obtained this debug log by reproducing the crash with a -1 lustre debug level. It does not match the console output above (i.e. the LustreError messages) because those errors did not reproduce after the first crash. However, the first crash had default debug levels so the debug log from that one would probably not be helpful.
            green Oleg Drokin added a comment -

            this seems to be a wrong lustre log as it does not have any of the stuff that supposedly is happening in the crash.

            I tried this patch locally on my reproducer and it worked, so something more subtle might be going on, even though I am not sure how could you end up with a null pointer in ltd->ltd_tgt_idx[index / TGT_PTRS_PER_BLOCK] (the only place that could produce a crash like this, I think) since we did check the bitmap first and we do set the pointer to not null first before setting the bitmap.

            green Oleg Drokin added a comment - this seems to be a wrong lustre log as it does not have any of the stuff that supposedly is happening in the crash. I tried this patch locally on my reproducer and it worked, so something more subtle might be going on, even though I am not sure how could you end up with a null pointer in ltd->ltd_tgt_idx [index / TGT_PTRS_PER_BLOCK] (the only place that could produce a crash like this, I think) since we did check the bitmap first and we do set the pointer to not null first before setting the bitmap.
            nedbass Ned Bass (Inactive) added a comment - - edited

            Attached debug log lustre.log.LU-3918.gz

            nedbass Ned Bass (Inactive) added a comment - - edited Attached debug log lustre.log.LU-3918.gz

            People

              bobijam Zhenyu Xu
              nedbass Ned Bass (Inactive)
              Votes:
              0 Vote for this issue
              Watchers:
              6 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: