Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-5249

conf-sanity test_32a: NULL pointer in fld_local_lookup

Details

    • Bug
    • Resolution: Fixed
    • Blocker
    • Lustre 2.6.0
    • Lustre 2.6.0
    • 3
    • 14640

    Description

      This issue was created by maloo for wangdi <di.wang@intel.com>

      This issue relates to the following test suite run: http://maloo.whamcloud.com/test_sets/d6d84ce4-fbb1-11e3-a4bd-52540035b04c.

      The sub-test test_32a failed with the following error:

      05:25:05:LustreError: 11-0: t32fs-MDT0000-lwp-OST0000: Communicating with 0@lo, operation obd_ping failed with -107.
      05:25:05:Lustre: t32fs-MDT0000-lwp-OST0000: Connection to t32fs-MDT0000 (at 0@lo) was lost; in progress operations using this service will wait for recovery to complete
      05:25:05:LustreError: 25468:0:(client.c:1079:ptlrpc_import_delay_req()) @@@ IMP_CLOSED req@ffff88005ff58800 x1471790788379308/t0(0) o13->t32fs-OST0000-osc-MDT0000@0@lo:7/4 lens 224/368 e 0 to 0 dl 0 ref 1 fl Rpc:/0/ffffffff rc 0/-1
      05:25:05:Lustre: t32fs-MDT0000: Not available for connect from 0@lo (stopping)
      05:25:05:BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
      05:25:05:IP: [<ffffffffa0235ddb>] fld_local_lookup+0x5b/0x290 [fld]
      05:25:05:PGD 0
      05:25:05:Oops: 0000 1 SMP
      05:25:05:last sysfs file: /sys/devices/system/cpu/online
      05:25:05:CPU 1
      05:25:05:Modules linked in: osd_ldiskfs(U) ldiskfs(U) lustre(U) ofd(U) osp(U) lod(U) ost(U) mdt(U) mdd(U) mgs(U) nodemap(U) lquota(U) lfsck(U) obdecho(U) mgc(U) lov(U) osc(U) mdc(U) lmv(U) fid(U) fld(U) ptlrpc_gss(U) ptlrpc(U) obdclass(U) ksocklnd(U) lnet(U) libcfs(U) sha512_generic sha256_generic jbd2 nfsd exportfs nfs lockd fscache auth_rpcgss nfs_acl sunrpc ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 ib_sa ib_mad ib_core microcode virtio_balloon 8139too 8139cp mii i2c_piix4 i2c_core ext3 jbd mbcache virtio_blk virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: libcfs]
      05:25:05:
      05:25:05:Pid: 26850, comm: ll_mgs_0002 Tainted: G W --------------- 2.6.32-431.17.1.el6_lustre.g0eed638.x86_64 #1 Red Hat KVM
      05:25:05:RIP: 0010:[<ffffffffa0235ddb>] [<ffffffffa0235ddb>] fld_local_lookup+0x5b/0x290 [fld]
      05:25:05:RSP: 0018:ffff88005c3e3a50 EFLAGS: 00010282
      05:25:05:RAX: ffff880037be4640 RBX: ffff880037be4640 RCX: 0000000000000000
      05:25:05:RDX: ffff880065388400 RSI: ffffffffa023d200 RDI: ffff880063190dc0
      05:25:05:RBP: ffff88005c3e3a90 R08: 20737365636f7250 R09: 0a64657265746e65
      05:25:05:R10: 20737365636f7250 R11: 0a64657265746e65 R12: 0000000000000000
      05:25:05:R13: 000000000000000f R14: ffff88005c3e5078 R15: ffff88005c3e4000
      05:25:05:FS: 0000000000000000(0000) GS:ffff880002300000(0000) knlGS:0000000000000000
      05:25:05:CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
      05:25:05:CR2: 0000000000000018 CR3: 000000007c041000 CR4: 00000000000006e0
      05:25:05:DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      05:25:05:DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      05:25:05:Process ll_mgs_0002 (pid: 26850, threadinfo ffff88005c3e2000, task ffff88006d370ae0)
      05:25:05:Stack:
      05:25:05: ffff880000000010 ffff88005c3e3ab0 ffff88005c3e3a70 0000000000000000
      05:25:05:<d> 000000000000000f ffff880060474000 ffff880063190dc0 ffff88007ac93040
      05:25:05:<d> ffff88005c3e3aa0 ffffffffa0ff8849 ffff88005c3e3af0 ffffffffa0ff8c90
      05:25:05:Call Trace:
      05:25:05: [<ffffffffa0ff8849>] osd_fld_lookup+0x49/0xd0 [osd_ldiskfs]
      05:25:05: [<ffffffffa0ff8c90>] osd_remote_fid+0x100/0x3a0 [osd_ldiskfs]
      05:25:05: [<ffffffffa1009071>] osd_index_ea_lookup+0x521/0x8a0 [osd_ldiskfs]
      05:25:05: [<ffffffffa0a4a7ef>] dt_lookup_dir+0x6f/0x130 [obdclass]
      05:25:05: [<ffffffffa0a0aeb5>] llog_osd_open+0x475/0xbb0 [obdclass]
      05:25:05: [<ffffffffa09f91ba>] llog_open+0xba/0x2c0 [obdclass]
      05:25:05: [<ffffffffa0e129a2>] llog_origin_handle_open+0x242/0x670 [ptlrpc]
      05:25:05: [<ffffffffa0e57045>] tgt_llog_open+0x35/0xd0 [ptlrpc]
      05:25:05: [<ffffffffa07fd13c>] mgs_llog_open+0x10c/0x450 [mgs]
      05:25:05: [<ffffffffa075d4b8>] ? libcfs_log_return+0x28/0x40 [libcfs]
      05:25:05: [<ffffffffa0e5d7dc>] tgt_request_handle+0x23c/0xac0 [ptlrpc]
      05:25:05: [<ffffffffa0e0d29a>] ptlrpc_main+0xd1a/0x1980 [ptlrpc]
      05:25:05: [<ffffffffa0e0c580>] ? ptlrpc_main+0x0/0x1980 [ptlrpc]
      05:25:05: [<ffffffff8109ab56>] kthread+0x96/0xa0
      05:25:05: [<ffffffff8100c20a>] child_rip+0xa/0x20
      05:25:05: [<ffffffff8109aac0>] ? kthread+0x0/0xa0
      05:25:05: [<ffffffff8100c200>] ? child_rip+0x0/0x20
      test failed to respond and timed out

      Info required for matching: conf-sanity 32a

      Attachments

        Issue Links

          Activity

            [LU-5249] conf-sanity test_32a: NULL pointer in fld_local_lookup

            Li Xi (pkuelelixi@gmail.com) uploaded a new patch: http://review.whamcloud.com/13579
            Subject: LU-5249 osd: check if FLD is fully intialized
            Project: fs/lustre-release
            Branch: b2_5
            Current Patch Set: 1
            Commit: a2f48b5784c82c160a1100d1b08abc3018984deb

            gerrit Gerrit Updater added a comment - Li Xi (pkuelelixi@gmail.com) uploaded a new patch: http://review.whamcloud.com/13579 Subject: LU-5249 osd: check if FLD is fully intialized Project: fs/lustre-release Branch: b2_5 Current Patch Set: 1 Commit: a2f48b5784c82c160a1100d1b08abc3018984deb

            Thanks, Di. Sounds like I have a match then. BTW, by cosmetic, I just meant the name change between fld_server_lookup and fld_local_lookup. The code in these two functions looks the same at least up to the point where the uninitialized variable is accessed.

            amk Ann Koehler (Inactive) added a comment - Thanks, Di. Sounds like I have a match then. BTW, by cosmetic, I just meant the name change between fld_server_lookup and fld_local_lookup. The code in these two functions looks the same at least up to the point where the uninitialized variable is accessed.
            di.wang Di Wang added a comment -

            Yes, this exists in 2.5 as well, and usually happens during setup or recovery process. Hmm, this is more than a cosmetic change, and the key is checking if ss_server_fld is fully initialized before going further.

            di.wang Di Wang added a comment - Yes, this exists in 2.5 as well, and usually happens during setup or recovery process. Hmm, this is more than a cosmetic change, and the key is checking if ss_server_fld is fully initialized before going further.

            Does this bug exist in 2.5 as well as 2.6? I've got a node panic with almost the same stack trace from a server running 2.5. The null pointer dereference is in fld_server_lookup instead of fld_local_lookup, but the name change looks largely cosmetic. I don't have a dump or a reproducer so I'm trying to make an informed guess about what might have gone wrong. Thanks.

            amk Ann Koehler (Inactive) added a comment - Does this bug exist in 2.5 as well as 2.6? I've got a node panic with almost the same stack trace from a server running 2.5. The null pointer dereference is in fld_server_lookup instead of fld_local_lookup, but the name change looks largely cosmetic. I don't have a dump or a reproducer so I'm trying to make an informed guess about what might have gone wrong. Thanks.

            Patch landed to Master.

            jlevi Jodi Levi (Inactive) added a comment - Patch landed to Master.

            This patch solves the problem of the crash, which is good, but the other question is why this started happening, only a day ago. It is worthwhile to check into the patches that have landed recently to see what is actually causing this problem.

            adilger Andreas Dilger added a comment - This patch solves the problem of the crash, which is good, but the other question is why this started happening, only a day ago. It is worthwhile to check into the patches that have landed recently to see what is actually causing this problem.
            di.wang Di Wang added a comment - http://review.whamcloud.com/10806
            di.wang Di Wang added a comment -

            Ah, this is upgrade test, i.e. those mgs llog is IGIF, so they are in the FLDB. Hmm, then we need check whether FLD is fully initialized here.

            di.wang Di Wang added a comment - Ah, this is upgrade test, i.e. those mgs llog is IGIF, so they are in the FLDB. Hmm, then we need check whether FLD is fully initialized here.
            di.wang Di Wang added a comment -

            it seems point to this line

            (gdb) l *(fld_local_lookup+0x5b)
            0x5e0b is in fld_local_lookup (/home/work/lustre-release_new/lustre-release/lustre/fld/fld_handler.c:219).
            214		info = lu_context_key_get(&env->le_ctx, &fld_thread_key);
            215		LASSERT(info != NULL);
            216		erange = &info->fti_lrange;
            217	
            218		/* Lookup it in the cache. */
            219		rc = fld_cache_lookup(fld->lsf_cache, seq, erange);
            220		if (rc == 0) {
            221			if (unlikely(fld_range_type(erange) != fld_range_type(range) &&
            222				     !fld_range_is_any(range))) {
            223				CERROR("%s: FLD cache range "DRANGE" does not match"
            (gdb) q
            

            So fld is NULL here. it means FLD is not being initialized yet. Hmm, this should not get into fld_lookup, since log FID is not inside FLDB.

            di.wang Di Wang added a comment - it seems point to this line (gdb) l *(fld_local_lookup+0x5b) 0x5e0b is in fld_local_lookup (/home/work/lustre-release_new/lustre-release/lustre/fld/fld_handler.c:219). 214 info = lu_context_key_get(&env->le_ctx, &fld_thread_key); 215 LASSERT(info != NULL); 216 erange = &info->fti_lrange; 217 218 /* Lookup it in the cache. */ 219 rc = fld_cache_lookup(fld->lsf_cache, seq, erange); 220 if (rc == 0) { 221 if (unlikely(fld_range_type(erange) != fld_range_type(range) && 222 !fld_range_is_any(range))) { 223 CERROR("%s: FLD cache range "DRANGE" does not match" (gdb) q So fld is NULL here. it means FLD is not being initialized yet. Hmm, this should not get into fld_lookup, since log FID is not inside FLDB.

            It looks like the first time this failed was 2014-06-23, which was based on:

            79dd530f1352e6b57fcb870a1e0f2c2a05a0648d LU-5188 osp: Correctly check for invalid setattr record

            adilger Andreas Dilger added a comment - It looks like the first time this failed was 2014-06-23, which was based on: 79dd530f1352e6b57fcb870a1e0f2c2a05a0648d LU-5188 osp: Correctly check for invalid setattr record

            People

              tappro Mikhail Pershin
              maloo Maloo
              Votes:
              0 Vote for this issue
              Watchers:
              12 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: