Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-3411

Encountered at NULL pointer exception for function osd_read_prep

Details

    • Bug
    • Resolution: Fixed
    • Major
    • Lustre 2.4.1, Lustre 2.5.0
    • Lustre 2.4.0
    • None
    • OSS running Lustre 2.4.0-RC2 on a RHEL6.4 system.
    • 3
    • 8429

    Description

      While running test application against the file system one of OSS crashed and rebooted. We managed to get a crash dump and form the early analysis determined the crash to occur in osd_read_prep. Here is what was dmesg ran in crash showed.

      [ 6304.888367] BUG: unable to handle kernel NULL pointer dereference at 0000000000000036
      [ 6304.889314] IP: [<ffffffffa0d31776>] osd_read_prep+0x326/0x3b0 [osd_ldiskfs]
      [ 6304.889314] PGD 3dd3bb067 PUD 3dd177067 PMD 0
      [ 6304.935543] Oops: 0000 1 SMP
      [ 6304.935543] last sysfs file: /sys/devices/pci0000:00/0000:00:04.0/0000:0a:00.0/infiniband_mad/umad0/port
      [ 6304.935543] CPU 5
      [ 6304.935543] Modules linked in: osp(U) ofd(U) ost(U) mgc(U) fsfilt_ldiskfs(U) osd_ldiskfs(U) ldiskfs(U) mbcache mdd(U) jbd2 lustre(U) lov(U) osc(U)
      mdc(U) lquota(U) fid(U) fld(U) ptlrpc(U) obdclass(U) lvfs(U) ko2iblnd(U) lnet(U) sha512_generic sha256_generic libcfs(U) autofs4 scsi_wait_scan dm_rou
      nd_robin sd_mod crc_t10dif ib_srp scsi_transport_srp scsi_tgt nf_conntrack_netbios_ns nf_conntrack_broadcast ipt_REJECT xt_comment nf_conntrack_ipv4 n
      f_defrag_ipv4 xt_state nf_conntrack iptable_filter ip_tables ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ib_sa dm_mirror dm
      _region_hash dm_log dm_multipath dm_mod sg radeon ttm drm_kms_helper drm i2c_algo_bit i2c_core sr_mod cdrom shpchp ib_mthca ib_mad ib_core microcode d
      cdbas serio_raw iTCO_wdt iTCO_vendor_support ata_generic pata_acpi ata_piix i5000_edac edac_core i5k_amb ioatdma dca ipv6 nfs lockd fscache auth_rpcgs
      s nfs_acl sunrpc bnx2 [last unloaded: scsi_wait_scan]
      [ 6304.935543]
      [ 6304.935543] Pid: 2910, comm: ll_ost_io01_003 Not tainted 2.6.32-358.6.1.el6.wc.x86_64 #1 Dell Inc. PowerEdge 1950/0TT740
      [ 6304.935543] RIP: 0010:[<ffffffffa0d31776>] [<ffffffffa0d31776>] osd_read_prep+0x326/0x3b0 [osd_ldiskfs]
      [ 6304.935543] RSP: 0018:ffff880385b3b8f0 EFLAGS: 00010246
      [ 6304.935543] RAX: fffffffffffffffe RBX: ffff8803a376a000 RCX: ffff880385ba2000
      [ 6304.935543] RDX: 0000000000000100 RSI: ffff880385a36000 RDI: ffff880378e4f5f0
      [ 6304.935543] RBP: ffff880385b3b960 R08: 0000000000000000 R09: 0000000000000000
      [ 6304.935543] R10: 0000000000000100 R11: 0000000000000000 R12: ffff880385b12fd0
      [ 6304.935543] R13: ffff880378e4f5f0 R14: ffff880385b13000 R15: 0000000000000100
      [ 6304.935543] FS: 00007f7ca8560700(0000) GS:ffff880028340000(0000) knlGS:0000000000000000
      [ 6304.935543] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      [ 6304.935543] CR2: 0000000000000036 CR3: 00000003dd429000 CR4: 00000000000007e0
      [ 6304.935543] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [ 6304.935543] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      [ 6304.935543] Process ll_ost_io01_003 (pid: 2910, threadinfo ffff880385b3a000, task ffff88038c4f8040)
      [ 6304.935543] Stack:
      [ 6304.935543] ffff8803a9b4c000 0000000092fb0380 ffff8803a9b4cbd0 0000010000000000
      [ 6304.935543] <d> 0000000051a4df4c 00000000000eef4c 0000000051a4df4c 00000000000eeef3
      [ 6304.935543] <d> ffff880385b3b960 ffffc900322a4d80 0000000000000100 0000000000000000
      [ 6304.935543] Call Trace:
      [ 6304.935543] [<ffffffffa08c1a43>] ofd_preprw_read+0x253/0x7f0 [ofd]
      [ 6304.935543] [<ffffffffa08c274a>] ofd_preprw+0x76a/0x13c0 [ofd]
      [ 6304.935543] [<ffffffffa0776fc1>] ? lustre_pack_reply_v2+0x1e1/0x280 [ptlrpc]
      [ 6304.935543] [<ffffffffa06d1e0c>] obd_preprw+0x12c/0x3d0 [ost]
      [ 6304.935543] [<ffffffffa06d9780>] ost_brw_read+0xd60/0x1340 [ost]
      [ 6304.935543] [<ffffffff81277779>] ? cpumask_next_and+0x29/0x50
      [ 6304.935543] [<ffffffff81059784>] ? find_busiest_group+0x244/0x9f0
      [ 6304.935543] [<ffffffffa0777e6c>] ? lustre_msg_get_version+0x8c/0x100 [ptlrpc]
      [ 6304.935543] [<ffffffffa0777fc8>] ? lustre_msg_check_version+0xe8/0x100 [ptlrpc]
      [ 6304.935543] [<ffffffffa06e0f28>] ost_handle+0x2ac8/0x48e0 [ost]
      [ 6304.935543] [<ffffffffa077ebab>] ? ptlrpc_update_export_timer+0x4b/0x560 [ptlrpc]
      [ 6304.935543] [<ffffffffa0787388>] ptlrpc_server_handle_request+0x398/0xc60 [ptlrpc]
      [ 6304.935543] [<ffffffffa046b5de>] ? cfs_timer_arm+0xe/0x10 [libcfs]
      [ 6304.935543] [<ffffffffa047cd8f>] ? lc_watchdog_touch+0x6f/0x170 [libcfs]
      [ 6304.935543] [<ffffffffa077e6e9>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc]
      [ 6304.935543] [<ffffffff81055ab3>] ? __wake_up+0x53/0x70
      [ 6304.935543] [<ffffffffa078871e>] ptlrpc_main+0xace/0x1700 [ptlrpc]
      [ 6304.935543] [<ffffffffa0787c50>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
      [ 6304.935543] [<ffffffff8100c0ca>] child_rip+0xa/0x20
      [ 6304.935543] [<ffffffffa0787c50>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
      [ 6304.935543] [<ffffffffa0787c50>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
      [ 6304.935543] [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
      [ 6304.935543] Code: 5c 41 5d 41 5e 41 5f c9 c3 0f 1f 40 00 48 8b 83 40 01 00 00 48 8b 8e 28 0c 00 00 45 31 c9 48 8b b6 10 0c 00 00 45 31 c0 4c 89 ef <ff> 50 38 48 8b 55 a0 4c 89 ee 48 89 df e8 68 d6 ff ff 89 45 9c
      [ 6304.935543] RIP [<ffffffffa0d31776>] osd_read_prep+0x326/0x3b0 [osd_ldiskfs]
      [ 6304.935543] RSP <ffff880385b3b8f0>
      [ 6304.935543] CR2: 0000000000000036

      Attachments

        Issue Links

          Activity

            [LU-3411] Encountered at NULL pointer exception for function osd_read_prep
            pjones Peter Jones added a comment -

            Landed for 2.5

            pjones Peter Jones added a comment - Landed for 2.5

            This patch was merged in the 2.4 branch. We can close this ticket now.

            simmonsja James A Simmons added a comment - This patch was merged in the 2.4 branch. We can close this ticket now.

            The second attempt we did not have this problem so fsfilt_ldiskfs.ko was there. Yes its very strange.

            simmonsja James A Simmons added a comment - The second attempt we did not have this problem so fsfilt_ldiskfs.ko was there. Yes its very strange.
            green Oleg Drokin added a comment - - edited

            Ok, so the patch fixes the crash. The pressing question I have - how come this path triggered at all? It's not like there was no fsfilt_ldiskfs.ko module present, right? So how come?

            green Oleg Drokin added a comment - - edited Ok, so the patch fixes the crash. The pressing question I have - how come this path triggered at all? It's not like there was no fsfilt_ldiskfs.ko module present, right? So how come?
            bzzz Alex Zhuravlev added a comment - http://review.whamcloud.com/6496

            just discussed with Oleg, seem to be a result of incorrect error handling in osd_mount():

            o->od_fsops = fsfilt_get_ops(mt_str(LDD_MT_LDISKFS));
            if (o->od_fsops == NULL) {

            while fsfilt_get_ops() can actually return -ENOENT, giving us that -2 offset to 0x38.

            bzzz Alex Zhuravlev added a comment - just discussed with Oleg, seem to be a result of incorrect error handling in osd_mount(): o->od_fsops = fsfilt_get_ops(mt_str(LDD_MT_LDISKFS)); if (o->od_fsops == NULL) { while fsfilt_get_ops() can actually return -ENOENT, giving us that -2 offset to 0x38.

            given iobuf has been used before in few places and proximity of 0x36 to:
            (gdb) p/x &((struct fsfilt_operations *)0)->fs_map_inode_pages
            $3 = 0x38

            I tend to think it's osd->od_fsops being NULL. though I'd expect struct fsfilt_operations to have exact offsets..

            bzzz Alex Zhuravlev added a comment - given iobuf has been used before in few places and proximity of 0x36 to: (gdb) p/x &((struct fsfilt_operations *)0)->fs_map_inode_pages $3 = 0x38 I tend to think it's osd->od_fsops being NULL. though I'd expect struct fsfilt_operations to have exact offsets..
            simmonsja James A Simmons added a comment - - edited
            crash> l *(osd_read_prep+0x326)
            0xffffffffa0d31776 is in osd_read_prep (/usr/src/debug/lustre-2.4.0/lustre/osd-ldiskfs/osd_io.c:930).
            925             cfs_gettimeofday(&end);
            926             timediff = cfs_timeval_sub(&end, &start, NULL);
            927             lprocfs_counter_add(osd->od_stats, LPROC_OSD_GET_PAGE, timediff);
            928     
            929             if (iobuf->dr_npages) {
            930                 rc = osd->od_fsops->fs_map_inode_pages(inode, iobuf->dr_pages,
            931                                                            iobuf->dr_npages,
            932                                                            iobuf->dr_blocks,
            933                                                            0, NULL);
            934                     rc = osd_do_bio(osd, inode, iobuf);
            
            simmonsja James A Simmons added a comment - - edited crash> l *(osd_read_prep+0x326) 0xffffffffa0d31776 is in osd_read_prep (/usr/src/debug/lustre-2.4.0/lustre/osd-ldiskfs/osd_io.c:930). 925 cfs_gettimeofday(&end); 926 timediff = cfs_timeval_sub(&end, &start, NULL); 927 lprocfs_counter_add(osd->od_stats, LPROC_OSD_GET_PAGE, timediff); 928 929 if (iobuf->dr_npages) { 930 rc = osd->od_fsops->fs_map_inode_pages(inode, iobuf->dr_pages, 931 iobuf->dr_npages, 932 iobuf->dr_blocks, 933 0, NULL); 934 rc = osd_do_bio(osd, inode, iobuf);

            Uploaded the vmcore and debuginfo rpms to ftp.whamcloud.com/uploads/LU-3411

            simmonsja James A Simmons added a comment - Uploaded the vmcore and debuginfo rpms to ftp.whamcloud.com/uploads/ LU-3411

            When you get a chance, can you please decode osd_read_prep+0x326/0x3b0 to a specific line number, and verify that 0x36 is a valid structure offset for a NULL pointer that is being accessed on that line.

            adilger Andreas Dilger added a comment - When you get a chance, can you please decode osd_read_prep+0x326/0x3b0 to a specific line number, and verify that 0x36 is a valid structure offset for a NULL pointer that is being accessed on that line.

            People

              bzzz Alex Zhuravlev
              simmonsja James A Simmons
              Votes:
              0 Vote for this issue
              Watchers:
              7 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: