Details

    • Technical task
    • Resolution: Fixed
    • Blocker
    • Lustre 2.4.1, Lustre 2.5.0
    • Lustre 2.5.0
    • HSM
    • 9379

    Description

      How to reproduce:
      1. Copy a file of about 85 MB to a Lustre file system.
      2. Archive and release the file.
      3. Calculate the md5sum of the file, which triggers a restore.

      The OSS nodes crash with the following message on the console:

      BUG: unable to handle kernel NULL pointer dereference at 0000000000000040
      IP: [<ffffffffa0b44fc5>] osd_xattr_get+0x155/0x2d0 [osd_ldiskfs]
      PGD 0 
      Oops: 0000 [#1] SMP 
      last sysfs file: /sys/devices/system/cpu/possible
      CPU 0 
      Modules linked in: lustre(U) ofd(U) osp(U) lod(U) ost(U) mdt(U) osd_ldiskfs(U) fsfilt_ldiskfs(U) ldiskfs(U) mdd(U) mgs(U) lquota(U) lfsck(U) jbd2 jbd obdecho(U) mgc(U) lov(U) osc(U) mdc(U) lmv(U) fid(U) fld(U) ptlrpc(U) obdclass(U) lvfs(U) ksocklnd(U) lnet(U) sha512_generic sha256_generic libcfs(U) mbcache exportfs virtio_balloon i2c_piix4 i2c_core sg virtio_blk virtio_net sr_mod cdrom virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod nfs lockd fscache auth_rpcgss nfs_acl sunrpc be2iscsi bnx2i cnic uio ipv6 cxgb4i cxgb4 cxgb3i libcxgbi cxgb3 mdio libiscsi_tcp qla4xxx iscsi_boot_sysfs libiscsi scsi_transport_iscsi [last unloaded: scsi_wait_scan]
      
      Pid: 2778, comm: ll_ost00_003 Not tainted 2.6.32-358.11.1.el6.x86_64 #1 Bochs Bochs
      RIP: 0010:[<ffffffffa0b44fc5>]  [<ffffffffa0b44fc5>] osd_xattr_get+0x155/0x2d0 [osd_ldiskfs]
      RSP: 0018:ffff8801032c1ad0  EFLAGS: 00010202
      RAX: 0000000000000000 RBX: ffffffffffffff30 RCX: 0000000000000000
      RDX: ffff880116f2c200 RSI: ffffffffa0b73540 RDI: ffffffffa0b882e0
      RBP: ffff8801032c1b10 R08: fffffffffffffffe R09: 00000000ffffffef
      R10: 000000000000000f R11: 000000000000000f R12: ffff8801032c1b38
      R13: ffff8801032c1b20 R14: 0000000000000000 R15: ffffffffa0500dbb
      FS:  0000000000000000(0000) GS:ffff880028200000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
      CR2: 0000000000000040 CR3: 0000000117b17000 CR4: 00000000000006f0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      Process ll_ost00_003 (pid: 2778, threadinfo ffff8801032c0000, task ffff880116ea0040)
      Stack:
       fffffffffffffffe ffff880101dae000 fffffffffffffffe ffff880119c2bc40
      <d> 00000000fffffffe ffff880102c55440 ffff880105bc2bc0 ffff880101fda1d0
      <d> ffff8801032c1b40 ffffffffa04c0284 ffff8801032c1b38 0000000000000008
      Call Trace:
       [<ffffffffa04c0284>] dt_version_get+0x54/0x170 [obdclass]
       [<ffffffffa0d4248e>] ofd_getattr+0x30e/0x610 [ofd]
       [<ffffffffa0c94efc>] ost_getattr+0x40c/0x950 [ost]
       [<ffffffffa0351fd1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
       [<ffffffffa0c9fa6b>] ost_handle+0x1ebb/0x40e0 [ost]
       [<ffffffffa034dd84>] ? libcfs_id2str+0x74/0xb0 [libcfs]
       [<ffffffffa0659158>] ptlrpc_server_handle_request+0x398/0xc60 [ptlrpc]
       [<ffffffffa034254e>] ? cfs_timer_arm+0xe/0x10 [libcfs]
       [<ffffffffa0353a8f>] ? lc_watchdog_touch+0x6f/0x170 [libcfs]
       [<ffffffffa0650569>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc]
       [<ffffffffa0351fd1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
       [<ffffffff81055af3>] ? __wake_up+0x53/0x70
       [<ffffffffa065a4dd>] ptlrpc_main+0xabd/0x1700 [ptlrpc]
       [<ffffffffa0659a20>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
       [<ffffffff810969e6>] kthread+0x96/0xa0
       [<ffffffff8100c0ca>] child_rip+0xa/0x20
       [<ffffffff81096950>] ? kthread+0x0/0xa0
       [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
      Code: 52 33 04 00 6e 0a 00 00 48 c7 c6 40 35 b7 a0 48 c7 05 4c 33 04 00 00 00 00 00 c7 05 3a 33 04 00 02 00 00 00 48 c7 c7 e0 82 b8 a0 <48> 8b 48 40 48 8b 93 10 04 00 00 31 c0 e8 b9 cf 80 ff 48 8b 83 
      RIP  [<ffffffffa0b44fc5>] osd_xattr_get+0x155/0x2d0 [osd_ldiskfs]
       RSP <ffff8801032c1ad0>
      CR2: 0000000000000040
      ---[ end trace 124bcb67ec32d88c ]---
      Kernel panic - not syncing: Fatal exception
      Pid: 2778, comm: ll_ost00_003 Tainted: G      D    ---------------    2.6.32-358.11.1.el6.x86_64 #1
      Call Trace:
       [<ffffffff8150e4ac>] ? panic+0xa7/0x16f
       [<ffffffff815126d4>] ? oops_end+0xe4/0x100
       [<ffffffff81046c3b>] ? no_context+0xfb/0x260
       [<ffffffff81046ec5>] ? __bad_area_nosemaphore+0x125/0x1e0
       [<ffffffff81281a16>] ? vsnprintf+0x336/0x5e0
       [<ffffffff81046f93>] ? bad_area_nosemaphore+0x13/0x20
       [<ffffffff810476f1>] ? __do_page_fault+0x321/0x480
       [<ffffffffa034127b>] ? cfs_set_ptldebug_header+0x2b/0xc0 [libcfs]
       [<ffffffffa03518eb>] ? libcfs_debug_vmsg2+0x50b/0xbb0 [libcfs]
       [<ffffffffa03518eb>] ? libcfs_debug_vmsg2+0x50b/0xbb0 [libcfs]
       [<ffffffff815145fe>] ? do_page_fault+0x3e/0xa0
       [<ffffffff815119b5>] ? page_fault+0x25/0x30
       [<ffffffffa0b44fc5>] ? osd_xattr_get+0x155/0x2d0 [osd_ldiskfs]
       [<ffffffffa04c0284>] ? dt_version_get+0x54/0x170 [obdclass]
       [<ffffffffa0d4248e>] ? ofd_getattr+0x30e/0x610 [ofd]
       [<ffffffffa0c94efc>] ? ost_getattr+0x40c/0x950 [ost]
       [<ffffffffa0351fd1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
       [<ffffffffa0c9fa6b>] ? ost_handle+0x1ebb/0x40e0 [ost]
       [<ffffffffa034dd84>] ? libcfs_id2str+0x74/0xb0 [libcfs]
       [<ffffffffa0659158>] ? ptlrpc_server_handle_request+0x398/0xc60 [ptlrpc]
       [<ffffffffa034254e>] ? cfs_timer_arm+0xe/0x10 [libcfs]
       [<ffffffffa0353a8f>] ? lc_watchdog_touch+0x6f/0x170 [libcfs]
       [<ffffffffa0650569>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc]
       [<ffffffffa0351fd1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
       [<ffffffff81055af3>] ? __wake_up+0x53/0x70
       [<ffffffffa065a4dd>] ? ptlrpc_main+0xabd/0x1700 [ptlrpc]
       [<ffffffffa0659a20>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
       [<ffffffff810969e6>] ? kthread+0x96/0xa0
       [<ffffffff8100c0ca>] ? child_rip+0xa/0x20
       [<ffffffff81096950>] ? kthread+0x0/0xa0
       [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
      

      It looks like the inode of the dt_object was NULL for some reason. I will take a further look.
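      The register dump is consistent with a NULL base pointer. RAX is 0, CR2 is 0x40, and the faulting bytes `<48> 8b 48 40` decode to `mov 0x40(%rax),%rcx`, so the CPU tried to read a field 0x40 bytes into a structure reached through a NULL pointer. The sketch below illustrates that arithmetic in user space. `struct mock_inode` and `mock_i_ino_fault_address` are hypothetical names, and the 0x40 offset is an assumption chosen to match CR2, not a statement about the real `struct inode` layout of this kernel build.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct inode. The 0x40 bytes of padding model
 * whatever fields precede i_ino; the real layout depends on the kernel
 * version and configuration. */
struct mock_inode {
    unsigned char pad[0x40];
    unsigned long i_ino;
};

/* Compile-time check that the mock really places i_ino at offset 0x40. */
static_assert(offsetof(struct mock_inode, i_ino) == 0x40,
              "mock layout must place i_ino at offset 0x40");

/* Address the CPU would try to load for base->i_ino. It is computed
 * arithmetically, without dereferencing base, so passing NULL is safe. */
static uintptr_t mock_i_ino_fault_address(const struct mock_inode *base)
{
    return (uintptr_t)base + offsetof(struct mock_inode, i_ino);
}
```

      Calling `mock_i_ino_fault_address(NULL)` yields 0x40 on the usual platforms where a null pointer converts to integer 0, which is the same value the oops reports in CR2.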

      jxiong@titan:osd-ldiskfs$ gdb osd_ldiskfs.ko
      GNU gdb (GDB) Fedora (7.3.50.20110722-16.fc16)
      Copyright (C) 2011 Free Software Foundation, Inc.
      License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
      This is free software: you are free to change and redistribute it.
      There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
      and "show warranty" for details.
      This GDB was configured as "x86_64-redhat-linux-gnu".
      For bug reporting instructions, please see:
      <http://www.gnu.org/software/gdb/bugs/>...
      Reading symbols from /exports/nfsroot/home/jxiong/srcs/lustre/lustre/osd-ldiskfs/osd_ldiskfs.ko...done.
      (gdb) l *(osd_xattr_get+0x155)
      0x9ff5 is in osd_xattr_get (/home/jxiong/srcs/lustre/lustre/osd-ldiskfs/osd_handler.c:2669).
      2664	static int osd_object_version_get(const struct lu_env *env,
      2665	                                  struct dt_object *dt, dt_obj_version_t *ver)
      2666	{
      2667	        struct inode *inode = osd_dt_obj(dt)->oo_inode;
      2668	
      2669	        CDEBUG(D_INODE, "Get version "LPX64" for inode %lu\n",
      2670	               LDISKFS_I(inode)->i_fs_version, inode->i_ino);
      2671	        *ver = LDISKFS_I(inode)->i_fs_version;
      2672	        return 0;
      2673	}
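      For illustration only, the listing above can be modelled in user space with a guard on the inode pointer. This is a minimal sketch under stated assumptions, not necessarily the change that resolved this ticket. All `mock_` types and functions are hypothetical stand-ins for the kernel structures, and returning `-ENOENT` for an object without an inode is an assumption.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical user-space stand-ins for the kernel types in the listing. */
typedef uint64_t mock_version_t;

struct mock_inode {
    unsigned long  i_ino;
    mock_version_t i_fs_version;
};

struct mock_osd_object {
    struct mock_inode *oo_inode;   /* NULL when no inode is attached */
};

/* Same shape as osd_object_version_get(), but the inode pointer is checked
 * before any field is read. On failure *ver is left untouched. */
static int mock_object_version_get(const struct mock_osd_object *obj,
                                   mock_version_t *ver)
{
    const struct mock_inode *inode;

    assert(obj != NULL && ver != NULL);   /* caller contract */

    inode = obj->oo_inode;
    if (inode == NULL)
        return -ENOENT;                   /* assumed error code */

    *ver = inode->i_fs_version;
    return 0;
}
```

      With this guard, a caller that passes an object whose `oo_inode` is NULL receives an error instead of faulting. Whether the right place for such a check is this function or one of its callers in the trace (for example `ofd_getattr`) is a design question this sketch does not settle.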
      

          People

            jay Jinshan Xiong (Inactive)
            jay Jinshan Xiong (Inactive)