
racer_1 test: mds crash : BUG: unable to handle kernel NULL pointer dereference at (null) Oops: 0010 [#1] SMP

Details

    • Bug
    • Resolution: Fixed
    • Blocker
    • Lustre 2.9.0
    • Lustre 2.8.0
    • Server 2.7.63 Client 2.7.63
      Configuration : 4 node_dne_singlemds - 1 MDS with DNE, 1 OSS, 2 clients
      Release
      2.6.32_431.29.2.el6_lustremaster_master__77_g2d11035 Build Date: Sat 05 Dec 2015 05:40:23 PM UTC

    Description

      <1>BUG: unable to handle kernel NULL pointer dereference at (null)
      <1>IP: [<(null)>] (null)
      <4>PGD 0 
      <4>Oops: 0010 [#1] SMP 
      <4>last sysfs file: /sys/devices/system/cpu/online
      <4>CPU 0 
      <4>Modules linked in: osp(U) mdd(U) lod(U) mdt(U) lfsck(U) mgs(U) mgc(U) osd_ldiskfs(U) lquota(U) lustre(U) lov(U) mdc(U) fid(U) lmv(U) fld(U) ksocklnd(U) ptlrpc(U) obdclass(U) lnet(U) sha512_generic sha256_generic libcfs(U) ldiskfs(U) nfs lockd fscache auth_rpcgss nfs_acl sunrpc ipt_REJECT ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables ipv6 virtio_balloon virtio_net i2c_piix4 i2c_core ext4 jbd2 mbcache virtio_blk virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: nf_defrag_ipv4]
      <4>
      <4>Pid: 7661, comm: mdt00_016 Not tainted 2.6.32-431.29.2.el6_lustremaster_master__77 #1 Red Hat KVM
      <4>RIP: 0010:[<0000000000000000>]  [<(null)>] (null)
      <4>RSP: 0018:ffff880056cefca8  EFLAGS: 00010246
      <4>RAX: 0000000000000009 RBX: ffff880056c0b800 RCX: ffff88007c2aea20
      <4>RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff880056c0b800
      <4>RBP: ffff880056cefcc0 R08: 0000000000000009 R09: 0000000000000000
      <4>R10: ffff880055063a40 R11: 0000000000000080 R12: 0000000000000000
      <4>R13: ffff8800558636c0 R14: 0000000000000000 R15: ffff880055863930
      <4>FS:  0000000000000000(0000) GS:ffff880002200000(0000) knlGS:0000000000000000
      <4>CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
      <4>CR2: 0000000000000000 CR3: 0000000079ad2000 CR4: 00000000000006f0
      <4>DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      <4>DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      <4>Process mdt00_016 (pid: 7661, threadinfo ffff880056cee000, task ffff880079413500)
      <4>Stack:
      <4> ffffffffa0d9dc85 0000000000000001 ffff880056c0b800 ffff880056cefd00
      <4><d> ffffffffa0d7cc79 ffff880056cefd00 ffff880057193bc0 ffff880056c0b800
      <4><d> 0000000000000009 0000000000000107 ffff880037bfe0b0 ffff880056cefd40
      <4>Call Trace:
      <4> [<ffffffffa0d9dc85>] ? mdt_reconstruct+0x45/0x120 [mdt]
      <4> [<ffffffffa0d7cc79>] mdt_reint_internal+0xa89/0xb80 [mdt]
      <4> [<ffffffffa0d7d20b>] mdt_reint+0x6b/0x120 [mdt]
      <4> [<ffffffffa073d6fc>] tgt_request_handle+0x8bc/0x12e0 [ptlrpc]
      <4> [<ffffffffa06e4a21>] ptlrpc_main+0xe41/0x1910 [ptlrpc]
      <4> [<ffffffffa06e3be0>] ? ptlrpc_main+0x0/0x1910 [ptlrpc]
      <4> [<ffffffff8109abf6>] kthread+0x96/0xa0
      <4> [<ffffffff8100c20a>] child_rip+0xa/0x20
      <4> [<ffffffff8109ab60>] ? kthread+0x0/0xa0
      <4> [<ffffffff8100c200>] ? child_rip+0x0/0x20
      <4>Code:  Bad RIP value.
      <1>RIP  [<(null)>] (null)
      <4> RSP <ffff880056cefca8>
      <4>CR2: 0000000000000000
      

          Activity

            [LU-7536] racer_1 test: mds crash : BUG: unable to handle kernel NULL pointer dereference at (null) Oops: 0010 [#1] SMP

            Patch has landed to master for 2.9.0

            Comment added by jgmitter Joseph Gmitter (Inactive).

            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/20395/
            Subject: LU-7536 mdt: handling NULL dereference in mdt_reconstruct
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: aa40787eee1835e4f84a40caa96fb232354bd799

            Comment added by gerrit Gerrit Updater.

            As per the analysis in the earlier comment, there is no mdt_reconstruct function for op-code REINT_MIGRATE, so the patch adds a check for an undefined reconstructor function for the given op-code. The patch http://review.whamcloud.com/#/c/20395/ was created to fix this.
            I have verified the fix by hard-coding op-codes.

            Comment added by jadhav.vikram VIKRAM BABASO JADHAV (Inactive).
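
The check described in the comment above can be sketched in a few lines of C. This is only an illustration under stated assumptions, not the landed patch: every `demo_` name, the `struct demo_info` type, and the choice of `-EOPNOTSUPP` as the error value are hypothetical, and only the op-code values 1, 2 and 9 are taken from this ticket.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Stand-in op-codes; only the values 1, 2 and 9 come from the ticket. */
enum demo_reint {
	DEMO_SETATTR = 1,
	DEMO_CREATE  = 2,
	DEMO_MIGRATE = 9,
	DEMO_MAX
};

/* Hypothetical per-request state; counts how many handlers actually ran. */
struct demo_info {
	int handled;
};

typedef int (*demo_reconstructor)(struct demo_info *info);

static int demo_reconstruct_generic(struct demo_info *info)
{
	assert(info != NULL);
	info->handled++;
	return 0;
}

/* Slots without a designated initializer, such as DEMO_MIGRATE, stay NULL. */
static demo_reconstructor demo_reconstructors[DEMO_MAX] = {
	[DEMO_SETATTR] = demo_reconstruct_generic,
	[DEMO_CREATE]  = demo_reconstruct_generic
};

/*
 * Guarded dispatch: refuse op-codes that are out of range or that have no
 * handler instead of calling through a NULL function pointer.
 */
static int demo_reconstruct(unsigned int opcode, struct demo_info *info)
{
	if (opcode >= DEMO_MAX || demo_reconstructors[opcode] == NULL)
		return -EOPNOTSUPP; /* assumed error value, not from the patch */

	return demo_reconstructors[opcode](info);
}
```

With this shape, a call such as `demo_reconstruct(DEMO_MIGRATE, &info)` returns an error to the caller rather than jumping to address 0. How the real mdt_reconstruct() reports the condition is not shown in this ticket.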

            It looks like there is no reconstructor function corresponding to op-code REINT_MIGRATE.

            crash>  mdt_thread_info ffff880056c0b800
            struct mdt_thread_info {
              mti_pill = 0xffff880055863930,
              mti_exp = 0xffff880079f5fc00,
              ....
              mti_rr = {
                rr_opcode = REINT_MIGRATE,
                rr_handle = 0x0,
                rr_fid1 = 0xffff88007c84bdd0,
                rr_fid2 = 0xffff88007c84bde0,
                rr_name = {
                  ln_name = 0xffff88007c84bf20 "5",
                  ln_namelen = 1
            
                    .....
            }
            
            crash> REINT_MIGRATE
            enum (unknown) = 9
            
            crash>  p/x reconstructors
            $1 = {0x6f6365725f74646d, 0x5f7463757274736e, 0x6d00657461657263, 0x63656a626f5f7464, 0x7475705f74, 0x0, 0x6f6365725f74646d, 0x5f7463757274736e, 0x72747461746573, 0x0}
            
            crash> p/x reconstructors[9]
            $2 = 0x0
            crash>
            
            typedef enum {
                    REINT_SETATTR  = 1,
                    REINT_CREATE   = 2,
                    ......
                    REINT_MIGRATE  = 9,
                    REINT_MAX
            } mds_reint_t, mdt_reint_t;
            
            static mdt_reconstructor reconstructors[REINT_MAX] = {
                    [REINT_SETATTR]  = mdt_reconstruct_setattr,
                    [REINT_CREATE]   = mdt_reconstruct_create,
                    [REINT_LINK]     = mdt_reconstruct_generic,
                    [REINT_UNLINK]   = mdt_reconstruct_generic,
                    [REINT_RENAME]   = mdt_reconstruct_generic,
                    [REINT_OPEN]     = mdt_reconstruct_open,
                    [REINT_SETXATTR] = mdt_reconstruct_generic
            };
            

            I got a NULL dereference when I purposely hard-coded an op-code value that has no reconstructor in mdt_reconstruct():

            mti->mti_rr.rr_opcode = 9;
            reconstructors[mti->mti_rr.rr_opcode](mti, lhc);
            
            Comment added by jadhav.vikram VIKRAM BABASO JADHAV (Inactive).
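
The gap found in the analysis above follows from how C initializes arrays: any element of a static array that has no designated initializer is zero-initialized, so its function pointer is NULL. A small scan over such a table, sketched below, reports the op-codes that have no handler. All `demo_` names are hypothetical, the values 3 to 7 are assumed to be sequential (the ticket elides them), and op-code 8 is left unnamed because the excerpt above does not show it.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in op-codes; the ticket shows only 1, 2 and 9, the rest is assumed. */
enum demo_reint {
	DEMO_SETATTR  = 1,
	DEMO_CREATE   = 2,
	DEMO_LINK     = 3,
	DEMO_UNLINK   = 4,
	DEMO_RENAME   = 5,
	DEMO_OPEN     = 6,
	DEMO_SETXATTR = 7,
	DEMO_MIGRATE  = 9,
	DEMO_MAX
};

typedef void (*demo_handler)(void);

static void demo_noop(void)
{
}

/* Same shape as the table in the comment: slots 0, 8 and 9 are never named,
 * so they are zero-initialized and hold NULL. */
static demo_handler demo_table[DEMO_MAX] = {
	[DEMO_SETATTR]  = demo_noop,
	[DEMO_CREATE]   = demo_noop,
	[DEMO_LINK]     = demo_noop,
	[DEMO_UNLINK]   = demo_noop,
	[DEMO_RENAME]   = demo_noop,
	[DEMO_OPEN]     = demo_noop,
	[DEMO_SETXATTR] = demo_noop
};

/*
 * Stores up to cap op-codes in [DEMO_SETATTR, DEMO_MAX) that have no handler
 * into out, and returns the total number of such op-codes.
 */
static size_t demo_missing_handlers(int *out, size_t cap)
{
	size_t found = 0;
	int op;

	assert(out != NULL || cap == 0);

	for (op = DEMO_SETATTR; op < DEMO_MAX; op++) {
		if (demo_table[op] != NULL)
			continue;
		if (found < cap)
			out[found] = op;
		found++;
	}
	return found;
}
```

A scan like this could run once at start-up or in a unit test, so that a newly added op-code without a handler is noticed before a request reaches the dispatch path.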

            jadhav.vikram (jadhav.vikram@seagate.com) uploaded a new patch: http://review.whamcloud.com/20395
            Subject: LU-7536 mdt: handling NULL dereference in mdt_reconstruct
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: a8600c93254140bf8e49f2d4d82b85cf882518f8

            Comment added by gerrit Gerrit Updater.

            Another instance found for Full tag 2.7.66 - EL6.7 Server/EL6.7 Client - DNE, build# 3314
            https://testing.hpdd.intel.com/test_sets/7ae45854-ca83-11e5-9215-5254006e85c2

            Another instance found for Full tag 2.7.66 - EL7.1 Server/EL7.1 Client - DNE, build# 3314
            https://testing.hpdd.intel.com/test_sets/9e3f6a90-cac5-11e5-9609-5254006e85c2

            Comment added by standan Saurabh Tandan (Inactive) (edited).

            Another instance failing with the same error as above for tag 2.7.66, FULL - EL6.7 Server/EL6.7 Client - DNE, master, build# 3314
            https://testing.hpdd.intel.com/test_sets/7ae45854-ca83-11e5-9215-5254006e85c2

            Another instance for FULL - EL7.1 Server/EL7.1 Client - DNE, master, build# 3314
            https://testing.hpdd.intel.com/test_sets/9e3f6a90-cac5-11e5-9609-5254006e85c2

            Comment added by standan Saurabh Tandan (Inactive) (edited).

            Looks similar to https://jira.hpdd.intel.com/browse/LU-5735

            Comment added by parinay parinay v kondekar (Inactive).

            People

              wc-triage WC Triage
              parinay parinay v kondekar (Inactive)