[LU-5249] conf-sanity test_32a: NULL pointer in fld_local_lookup Created: 24/Jun/14  Updated: 28/Aug/15  Resolved: 25/Jun/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.6.0
Fix Version/s: Lustre 2.6.0

Type: Bug Priority: Blocker
Reporter: Maloo Assignee: Mikhail Pershin
Resolution: Fixed Votes: 0
Labels: mq115

Issue Links:
Duplicate
Related
is related to LU-5188 nbp6-OST002f-osc-MDT0000: invalid set... Resolved
is related to LU-6973 Null pointer access during umount MDT... Resolved
is related to LU-5244 conf-sanity test_32b: osp_sync_thread... Closed
Severity: 3
Rank (Obsolete): 14640

 Description   

This issue was created by maloo for wangdi <di.wang@intel.com>

This issue relates to the following test suite run: http://maloo.whamcloud.com/test_sets/d6d84ce4-fbb1-11e3-a4bd-52540035b04c.

The sub-test test_32a failed with the following error:

05:25:05:LustreError: 11-0: t32fs-MDT0000-lwp-OST0000: Communicating with 0@lo, operation obd_ping failed with -107.
05:25:05:Lustre: t32fs-MDT0000-lwp-OST0000: Connection to t32fs-MDT0000 (at 0@lo) was lost; in progress operations using this service will wait for recovery to complete
05:25:05:LustreError: 25468:0:(client.c:1079:ptlrpc_import_delay_req()) @@@ IMP_CLOSED req@ffff88005ff58800 x1471790788379308/t0(0) o13->t32fs-OST0000-osc-MDT0000@0@lo:7/4 lens 224/368 e 0 to 0 dl 0 ref 1 fl Rpc:/0/ffffffff rc 0/-1
05:25:05:Lustre: t32fs-MDT0000: Not available for connect from 0@lo (stopping)
05:25:05:BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
05:25:05:IP: [<ffffffffa0235ddb>] fld_local_lookup+0x5b/0x290 [fld]
05:25:05:PGD 0
05:25:05:Oops: 0000 [#1] SMP
05:25:05:last sysfs file: /sys/devices/system/cpu/online
05:25:05:CPU 1
05:25:05:Modules linked in: osd_ldiskfs(U) ldiskfs(U) lustre(U) ofd(U) osp(U) lod(U) ost(U) mdt(U) mdd(U) mgs(U) nodemap(U) lquota(U) lfsck(U) obdecho(U) mgc(U) lov(U) osc(U) mdc(U) lmv(U) fid(U) fld(U) ptlrpc_gss(U) ptlrpc(U) obdclass(U) ksocklnd(U) lnet(U) libcfs(U) sha512_generic sha256_generic jbd2 nfsd exportfs nfs lockd fscache auth_rpcgss nfs_acl sunrpc ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 ib_sa ib_mad ib_core microcode virtio_balloon 8139too 8139cp mii i2c_piix4 i2c_core ext3 jbd mbcache virtio_blk virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: libcfs]
05:25:05:
05:25:05:Pid: 26850, comm: ll_mgs_0002 Tainted: G W --------------- 2.6.32-431.17.1.el6_lustre.g0eed638.x86_64 #1 Red Hat KVM
05:25:05:RIP: 0010:[<ffffffffa0235ddb>] [<ffffffffa0235ddb>] fld_local_lookup+0x5b/0x290 [fld]
05:25:05:RSP: 0018:ffff88005c3e3a50 EFLAGS: 00010282
05:25:05:RAX: ffff880037be4640 RBX: ffff880037be4640 RCX: 0000000000000000
05:25:05:RDX: ffff880065388400 RSI: ffffffffa023d200 RDI: ffff880063190dc0
05:25:05:RBP: ffff88005c3e3a90 R08: 20737365636f7250 R09: 0a64657265746e65
05:25:05:R10: 20737365636f7250 R11: 0a64657265746e65 R12: 0000000000000000
05:25:05:R13: 000000000000000f R14: ffff88005c3e5078 R15: ffff88005c3e4000
05:25:05:FS: 0000000000000000(0000) GS:ffff880002300000(0000) knlGS:0000000000000000
05:25:05:CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
05:25:05:CR2: 0000000000000018 CR3: 000000007c041000 CR4: 00000000000006e0
05:25:05:DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
05:25:05:DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
05:25:05:Process ll_mgs_0002 (pid: 26850, threadinfo ffff88005c3e2000, task ffff88006d370ae0)
05:25:05:Stack:
05:25:05: ffff880000000010 ffff88005c3e3ab0 ffff88005c3e3a70 0000000000000000
05:25:05:<d> 000000000000000f ffff880060474000 ffff880063190dc0 ffff88007ac93040
05:25:05:<d> ffff88005c3e3aa0 ffffffffa0ff8849 ffff88005c3e3af0 ffffffffa0ff8c90
05:25:05:Call Trace:
05:25:05: [<ffffffffa0ff8849>] osd_fld_lookup+0x49/0xd0 [osd_ldiskfs]
05:25:05: [<ffffffffa0ff8c90>] osd_remote_fid+0x100/0x3a0 [osd_ldiskfs]
05:25:05: [<ffffffffa1009071>] osd_index_ea_lookup+0x521/0x8a0 [osd_ldiskfs]
05:25:05: [<ffffffffa0a4a7ef>] dt_lookup_dir+0x6f/0x130 [obdclass]
05:25:05: [<ffffffffa0a0aeb5>] llog_osd_open+0x475/0xbb0 [obdclass]
05:25:05: [<ffffffffa09f91ba>] llog_open+0xba/0x2c0 [obdclass]
05:25:05: [<ffffffffa0e129a2>] llog_origin_handle_open+0x242/0x670 [ptlrpc]
05:25:05: [<ffffffffa0e57045>] tgt_llog_open+0x35/0xd0 [ptlrpc]
05:25:05: [<ffffffffa07fd13c>] mgs_llog_open+0x10c/0x450 [mgs]
05:25:05: [<ffffffffa075d4b8>] ? libcfs_log_return+0x28/0x40 [libcfs]
05:25:05: [<ffffffffa0e5d7dc>] tgt_request_handle+0x23c/0xac0 [ptlrpc]
05:25:05: [<ffffffffa0e0d29a>] ptlrpc_main+0xd1a/0x1980 [ptlrpc]
05:25:05: [<ffffffffa0e0c580>] ? ptlrpc_main+0x0/0x1980 [ptlrpc]
05:25:05: [<ffffffff8109ab56>] kthread+0x96/0xa0
05:25:05: [<ffffffff8100c20a>] child_rip+0xa/0x20
05:25:05: [<ffffffff8109aac0>] ? kthread+0x0/0xa0
05:25:05: [<ffffffff8100c200>] ? child_rip+0x0/0x20
test failed to respond and timed out

Info required for matching: conf-sanity 32a



 Comments   
Comment by Andreas Dilger [ 24/Jun/14 ]

It looks like the first time this failed was 2014-06-23, which was based on:

79dd530f1352e6b57fcb870a1e0f2c2a05a0648d LU-5188 osp: Correctly check for invalid setattr record

Comment by Di Wang [ 24/Jun/14 ]

It seems to point to this line:

(gdb) l *(fld_local_lookup+0x5b)
0x5e0b is in fld_local_lookup (/home/work/lustre-release_new/lustre-release/lustre/fld/fld_handler.c:219).
214		info = lu_context_key_get(&env->le_ctx, &fld_thread_key);
215		LASSERT(info != NULL);
216		erange = &info->fti_lrange;
217	
218		/* Lookup it in the cache. */
219		rc = fld_cache_lookup(fld->lsf_cache, seq, erange);
220		if (rc == 0) {
221			if (unlikely(fld_range_type(erange) != fld_range_type(range) &&
222				     !fld_range_is_any(range))) {
223				CERROR("%s: FLD cache range "DRANGE" does not match"
(gdb) q

So fld is NULL here, which means the FLD has not been initialized yet. Hmm, this should not get into fld_lookup, since the llog FID is not inside the FLDB.
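
For reference, the faulting address 0000000000000018 in the oops above is consistent with reading a member a few words into struct lu_server_fld through a NULL fld pointer (the lsf_cache access on line 219). Below is a minimal standalone sketch of that failure mode; the structure layout and the members other than lsf_cache are simplified stand-ins for illustration, not the real Lustre definitions:

#include <stdio.h>
#include <stddef.h>

/* Simplified stand-ins for the Lustre types; the real struct
 * lu_server_fld layout may differ. */
struct fld_cache;

struct lu_server_fld {
	void			*lsf_pad[3];	/* placeholder members */
	struct fld_cache	*lsf_cache;	/* member read on line 219 */
};

int main(void)
{
	struct lu_server_fld *fld = NULL;	/* FLD not set up yet */

	/* With fld == NULL, "fld->lsf_cache" reads from the member's
	 * offset, i.e. a small address such as 0x18, which matches the
	 * CR2 value in the oops above. */
	printf("offsetof(lsf_cache) = %#zx\n",
	       offsetof(struct lu_server_fld, lsf_cache));

	/* Uncommenting the next line reproduces the NULL dereference:
	 * struct fld_cache *cache = fld->lsf_cache;
	 */
	(void)fld;
	return 0;
}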

Comment by Di Wang [ 24/Jun/14 ]

Ah, this is an upgrade test, i.e. those MGS llogs use IGIF FIDs, so they are in the FLDB. Hmm, then we need to check whether the FLD is fully initialized here.
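
A rough sketch of the kind of guard this implies in the OSD lookup path is below. osd_fld_lookup() and fld_local_lookup() are the functions in the crash stack, and ss_server_fld is the field discussed later in this ticket; the osd_seq_site() helper, the seq_server_site type, and the -EIO return are assumptions made for illustration. This is only an illustration of the idea, not the actual change (see the Gerrit patch in the next comment):

/* Sketch: refuse the local FLD lookup while the server FLD is not yet
 * (or no longer) initialized, instead of passing a NULL fld pointer
 * down into fld_local_lookup(). */
static int osd_fld_lookup(const struct lu_env *env, struct osd_device *osd,
			  u64 seq, struct lu_seq_range *range)
{
	struct seq_server_site *ss = osd_seq_site(osd);

	if (ss == NULL || ss->ss_server_fld == NULL)
		return -EIO;	/* FLD not fully initialized yet */

	fld_range_set_any(range);
	return fld_local_lookup(env, ss->ss_server_fld, seq, range);
}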

Comment by Di Wang [ 24/Jun/14 ]

http://review.whamcloud.com/10806

Comment by Andreas Dilger [ 24/Jun/14 ]

This patch solves the crash, which is good, but the other question is why this only started happening a day ago. It is worth checking the patches that have landed recently to see what is actually causing this problem.

Comment by Jodi Levi (Inactive) [ 25/Jun/14 ]

Patch landed to Master.

Comment by Ann Koehler (Inactive) [ 26/Aug/14 ]

Does this bug exist in 2.5 as well as 2.6? I've got a node panic with almost the same stack trace from a server running 2.5. The null pointer dereference is in fld_server_lookup instead of fld_local_lookup, but the name change looks largely cosmetic. I don't have a dump or a reproducer so I'm trying to make an informed guess about what might have gone wrong. Thanks.

Comment by Di Wang [ 26/Aug/14 ]

Yes, this exists in 2.5 as well, and it usually happens during the setup or recovery process. Hmm, this is more than a cosmetic change; the key is checking whether ss_server_fld is fully initialized before going further.

Comment by Ann Koehler (Inactive) [ 26/Aug/14 ]

Thanks, Di. Sounds like I have a match then. BTW, by cosmetic, I just meant the name change between fld_server_lookup and fld_local_lookup. The code in these two functions looks the same at least up to the point where the uninitialized variable is accessed.

Comment by Gerrit Updater [ 01/Feb/15 ]

Li Xi (pkuelelixi@gmail.com) uploaded a new patch: http://review.whamcloud.com/13579
Subject: LU-5249 osd: check if FLD is fully initialized
Project: fs/lustre-release
Branch: b2_5
Current Patch Set: 1
Commit: a2f48b5784c82c160a1100d1b08abc3018984deb
