[LU-3918] kernel NULL pointer dereference at lod_initialize_objects+0x327 Created: 10/Sep/13 Updated: 16/Oct/13 Resolved: 23/Sep/13 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Critical |
| Reporter: | Ned Bass | Assignee: | Zhenyu Xu |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None | ||
| Environment: |
lustre-2.4.0-16chaos_2.6.32_358.14.1.2chaos.ch5.1.1.ch5.1.1.x86_64 |
||
| Attachments: |
|
||||||||||||
| Issue Links: |
|
||||||||||||
| Severity: | 3 | ||||||||||||
| Rank (Obsolete): | 10355 | ||||||||||||
| Description |
|
Related to (gdb) l *( lod_initialize_objects+0x327)
0x8707 is in lod_initialize_objects (/usr/src/debug/lustre-2.4.0/lustre/lod/lod_lov.c:672).
667 if (unlikely(!cfs_bitmap_check(md->lod_ost_bitmap, idx))) {
668 CERROR("%s: bad idx: %d\n", lod2obd(md)->obd_name, idx);
669 return -EINVAL;
670 }
671
672 if (unlikely(OST_TGT(md, idx) == NULL)) {
673 CERROR("%s: bad lod_tgt_desc for idx: %d\n",
674 lod2obd(md)->obd_name, idx);
675 return -EINVAL;
676 }
2013-09-10 12:38:43 Lustre: Lustre: Build Version: 2.4.0-16chaos-16chaos--PRISTINE-2.6.32-358.14.1.2chaos.ch5.1.1.x86_64 2013-09-10 12:38:43 LDISKFS-fs (sdb): mounted filesystem with ordered data mode. quota=off. Opts: 2013-09-10 12:38:44 Lustre: lsc-MDT0000: Not available for connect from 192.168.117.117@o2ib10 (not set up) 2013-09-10 12:38:44 Lustre: lsc-MDT0000: Not available for connect from 192.168.116.198@o2ib5 (not set up) 2013-09-10 12:38:45 Lustre: 4192:0:(mdt_handler.c:4947:mdt_process_config()) For interoperability, skip this mdd.quota_type. It is obsolete. 2013-09-10 12:38:45 LustreError: 11-0: lsc-MDT0000-lwp-MDT0000: Communicating with 0@lo, operation mds_connect failed with -11. 2013-09-10 12:38:45 LustreError: 3988:0:(lod_lov.c:668:validate_lod_and_idx()) lsc-MDT0000-mdtlov: bad idx: 459 2013-09-10 12:38:45 LustreError: 3988:0:(mdd_orphans.c:400:orph_key_test_and_del()) lsc-MDD0000: error unlinking orphan [0x7e4cf25127a:0x207:0x0] from PENDING: rc = -22 2013-09-10 12:38:45 LustreError: 3988:0:(lod_lov.c:668:validate_lod_and_idx()) lsc-MDT0000-mdtlov: bad idx: 425 2013-09-10 12:38:45 LustreError: 3988:0:(mdd_orphans.c:400:orph_key_test_and_del()) lsc-MDD0000: error unlinking orphan [0x7e4cf35c78c:0xb7b:0x0] from PENDING: rc = -22 2013-09-10 12:38:45 BUG: unable to handle kernel NULL pointer dereference at 00000000000006c0 2013-09-10 12:38:45 IP: [<ffffffffa11406d7>] lod_initialize_objects+0x327/0xd00 [lod] 2013-09-10 12:38:45 PGD 633810067 PUD 63448e067 PMD 0 2013-09-10 12:38:45 Oops: 0000 [#1] SMP 2013-09-10 12:38:45 last sysfs file: /sys/module/fid/initstate 2013-09-10 12:38:45 CPU 8 2013-09-10 12:38:45 Modules linked in: osp(U) lod(U) mdt(U) mgs(U) mgc(U) fsfilt_ldiskfs(U) osd_ldiskfs(U) ldiskfs(U) mbcache lquota(U) jbd2 mdd(U) lustre(U) lov(U) osc(U) mdc(U) fid(U) fld(U) ptlrpc(U) obdclass(U) lvfs(U) zfs(P)(U) zcommon(P)(U) znvpair(P)(U) zavl(P)(U) zunicode(P)(U) spl(U) zlib_deflate powernow_k8 freq_table mperf ko2iblnd(U) lnet(U) sha512_generic sha256_generic libcfs(U) ipmi_devintf ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr mlx4_ib ib_sa ib_mad ib_core dm_mirror dm_region_hash dm_log dm_mod vhost_net macvtap macvlan tun kvm_amd kvm ppdev parport_pc parport sg sd_mod crc_t10dif radeon ttm drm_kms_helper drm i2c_algo_bit sr_mod cdrom serio_raw shpchp k10temp amd64_edac_mod edac_core edac_mce_amd 3w_9xxx sata_nv ata_generic pata_acpi pata_amd i2c_nforce2 i2c_core ipv6 nfs lockd fscache auth_rpcgss nfs_acl sunrpc e1000 mlx4_en mlx4_core [last unloaded: cpufreq_ondemand] 2013-09-10 12:38:45 2013-09-10 12:38:45 Pid: 3988, comm: mount.lustre Tainted: P --------------- 2.6.32-358.14.1.2chaos.ch5.1.1.x86_64 #1 Supermicro H8QM8/H8QM8 2013-09-10 12:38:45 RIP: 0010:[<ffffffffa11406d7>] [<ffffffffa11406d7>] lod_initialize_objects+0x327/0xd00 [lod] 2013-09-10 12:38:45 RSP: 0018:ffff8804343198d8 EFLAGS: 00010213 2013-09-10 12:38:45 RAX: 0000000000000000 RBX: ffff880632a65a00 RCX: 00000000ffffffff 2013-09-10 12:38:45 RDX: 00000000000000d8 RSI: 000000000000005e RDI: ffff880831194900 2013-09-10 12:38:45 RBP: ffff880434319958 R08: 00000000000001d8 R09: ffff880633e80ba0 2013-09-10 12:38:45 R10: ffff880628ffd6c0 R11: 0000000000000000 R12: 0000000000000000 2013-09-10 12:38:45 R13: 0000000000000000 R14: ffff880632a65a00 R15: ffff880831194000 2013-09-10 12:38:45 FS: 00007ffff7fdb700(0000) GS:ffff88043f400000(0000) knlGS:0000000000000000 2013-09-10 12:38:45 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b 2013-09-10 12:38:45 CR2: 00000000000006c0 CR3: 0000000633983000 CR4: 00000000000007e0 2013-09-10 12:38:45 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 2013-09-10 12:38:45 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 2013-09-10 12:38:45 Process mount.lustre (pid: 3988, threadinfo ffff880434318000, task ffff8804339ccaa0) 2013-09-10 12:38:45 Stack: 2013-09-10 12:38:45 ffff8806000001d8 0000000000000002 0002880400000002 0000000000000010 2013-09-10 12:38:45 <d> ffff880632a65a30 ffff880632943f60 ffff880434319b78 ffff880633e80ba0 2013-09-10 12:38:45 <d> ffff8808311940e8 ffff880632916998 0000a30128ffd6c0 ffff880632916998 2013-09-10 12:38:45 Call Trace: 2013-09-10 12:38:45 [<ffffffffa11412b1>] lod_parse_striping+0x201/0x300 [lod] 2013-09-10 12:38:45 [<ffffffffa1142d00>] lod_load_striping+0x290/0x440 [lod] 2013-09-10 12:38:45 [<ffffffffa114e73b>] lod_declare_object_destroy+0x16b/0x390 [lod] 2013-09-10 12:38:45 [<ffffffffa0e180a9>] __mdd_orphan_cleanup+0x7d9/0xca0 [mdd] 2013-09-10 12:38:45 [<ffffffffa0e2733d>] mdd_recovery_complete+0xed/0x170 [mdd] 2013-09-10 12:38:45 [<ffffffffa1086ab0>] ? mdt_recovery_handle+0x0/0x120 [mdt] 2013-09-10 12:38:45 [<ffffffffa1078cb5>] mdt_postrecov+0x35/0xd0 [mdt] 2013-09-10 12:38:45 [<ffffffffa107d48b>] mdt_prepare+0x3cb/0x490 [mdt] 2013-09-10 12:38:45 [<ffffffffa0884b86>] server_start_targets+0x1266/0x1960 [obdclass] 2013-09-10 12:38:45 [<ffffffffa0854c15>] ? lustre_start_mgc+0x4b5/0x1ee0 [obdclass] 2013-09-10 12:38:45 [<ffffffffa084c950>] ? class_config_llog_handler+0x0/0x1830 [obdclass] 2013-09-10 12:38:45 [<ffffffffa0886164>] server_fill_super+0xee4/0x1580 [obdclass] 2013-09-10 12:38:45 [<ffffffffa0856818>] lustre_fill_super+0x1d8/0x530 [obdclass] 2013-09-10 12:38:45 [<ffffffffa0856640>] ? lustre_fill_super+0x0/0x530 [obdclass] 2013-09-10 12:38:45 [<ffffffff8118539f>] get_sb_nodev+0x5f/0xa0 2013-09-10 12:38:45 [<ffffffffa084e285>] lustre_get_sb+0x25/0x30 [obdclass] 2013-09-10 12:38:45 [<ffffffff811849bb>] vfs_kern_mount+0x7b/0x1b0 2013-09-10 12:38:45 [<ffffffff81184b62>] do_kern_mount+0x52/0x130 2013-09-10 12:38:45 [<ffffffff811a4bf2>] do_mount+0x2d2/0x8d0 2013-09-10 12:38:45 [<ffffffff811a5280>] sys_mount+0x90/0xe0 2013-09-10 12:38:45 [<ffffffff8100b0b2>] system_call_fastpath+0x16/0x1b 2013-09-10 12:38:45 Code: f0 08 00 00 44 8b 45 80 44 0f a3 40 08 19 c0 85 c0 0f 84 48 06 00 00 44 89 c0 41 0f b6 d0 c1 f8 08 48 98 49 8b 84 c7 e8 00 00 00 <48> 8b 04 d0 48 85 c0 0f 84 64 07 00 00 4c 8b 30 4d 85 f6 0f 84 2013-09-10 12:38:45 RIP [<ffffffffa11406d7>] lod_initialize_objects+0x327/0xd00 [lod] 2013-09-10 12:38:45 RSP <ffff8804343198d8> 2013-09-10 12:38:45 CR2: 00000000000006c0 2013-09-10 12:38:45 REWRITING MCP55 CFG REG 2013-09-10 12:38:45 CFG = c1 |
| Comments |
| Comment by Peter Jones [ 10/Sep/13 ] |
|
Ned Just to clarify - do you mean the Peter |
| Comment by Ned Bass [ 10/Sep/13 ] |
|
Yes, sorry. I'll edit my comments to avoid confusion. |
| Comment by Ned Bass [ 10/Sep/13 ] |
|
Attached debug log lustre.log.LU-3918.gz |
| Comment by Oleg Drokin [ 11/Sep/13 ] |
|
this seems to be a wrong lustre log as it does not have any of the stuff that supposedly is happening in the crash. I tried this patch locally on my reproducer and it worked, so something more subtle might be going on, even though I am not sure how could you end up with a null pointer in ltd->ltd_tgt_idx[index / TGT_PTRS_PER_BLOCK] (the only place that could produce a crash like this, I think) since we did check the bitmap first and we do set the pointer to not null first before setting the bitmap. |
| Comment by Ned Bass [ 11/Sep/13 ] |
|
Oleg, I obtained this debug log by reproducing the crash with a -1 lustre debug level. It does not match the console output above (i.e. the LustreError messages) because those errors did not reproduce after the first crash. However, the first crash had default debug levels so the debug log from that one would probably not be helpful. |
| Comment by Oleg Drokin [ 12/Sep/13 ] |
|
I see. Can you please trace the array content (in the crashdump) to make sure where the NULL pointer really is this time? that line is a big macro with several dereferences. |
| Comment by Jodi Levi (Inactive) [ 13/Sep/13 ] |
|
We need to find someone other than Oleg to assign this ticket to |
| Comment by Oleg Drokin [ 13/Sep/13 ] |
|
So do you still have the crashdump/image of problematic fs to help us move forward on this? |
| Comment by Peter Jones [ 13/Sep/13 ] |
|
This seems to be related to Ned, is there any further information available to assist with the investigation? |
| Comment by Zhenyu Xu [ 18/Sep/13 ] |
|
Still needs to know where lod_initialize_objects+0x327 refers to. Do you have a rehit and collect crashdump/image? |
| Comment by Ned Bass [ 19/Sep/13 ] |
|
I'll upload a crash dump today. Sorry for the delay, I've been busy putting out other fires. |
| Comment by Ned Bass [ 19/Sep/13 ] |
|
Uploaded vmcore to ftp.whamcloud.com. Email bass6 at llnl dot gov if you need the filename. |
| Comment by Zhenyu Xu [ 22/Sep/13 ] |
crash> struct lod_device 0xffff88083134e000
...
lod_ost_descs = {
ltd_tgt_idx = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0},
ltd_tgts_size = 32,
ltd_tgtnr = 0,
ltd_tgt_bitmap = 0xffff8808312cd620,
...
crash> struct cfs_bitmap_t 0xffff8808312cd620
struct cfs_bitmap_t {
size = 32,
data = 0xffff8808312cd628
}
crash> x/x 0xffff8808312cd628
0xffff8808312cd628: 0x00000000
strangely !cfs_bitmap_check(md->lod_ost_bitmap, idx) should has caught that there's no target on that index, since bitmap is all 0, unless the idx is beyond 32. |
| Comment by Zhenyu Xu [ 22/Sep/13 ] |
|
would you mind trying the patchset 7 of http://review.whamcloud.com/#/c/7234/ |
| Comment by Andreas Dilger [ 23/Sep/13 ] |
|
I believe the newer patch in |
| Comment by Andreas Dilger [ 23/Sep/13 ] |
|
Closing this as a duplicate of |