[LU-5249] conf-sanity test_32a: NULL pointer in fld_local_lookup Created: 24/Jun/14 Updated: 28/Aug/15 Resolved: 25/Jun/14 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.6.0 |
| Fix Version/s: | Lustre 2.6.0 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Maloo | Assignee: | Mikhail Pershin |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | mq115 |
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 14640 |
| Description |
|
This issue was created by maloo for wangdi <di.wang@intel.com>. This issue relates to the following test suite run: http://maloo.whamcloud.com/test_sets/d6d84ce4-fbb1-11e3-a4bd-52540035b04c. The sub-test test_32a failed with the following error:
Info required for matching: conf-sanity 32a |
| Comments |
| Comment by Andreas Dilger [ 24/Jun/14 ] |
|
It looks like the first time this failed was 2014-06-23, which was based on: 79dd530f1352e6b57fcb870a1e0f2c2a05a0648d |
| Comment by Di Wang [ 24/Jun/14 ] |
|
It seems to point to this line:
(gdb) l *(fld_local_lookup+0x5b)
0x5e0b is in fld_local_lookup (/home/work/lustre-release_new/lustre-release/lustre/fld/fld_handler.c:219).
214 info = lu_context_key_get(&env->le_ctx, &fld_thread_key);
215 LASSERT(info != NULL);
216 erange = &info->fti_lrange;
217
218 /* Lookup it in the cache. */
219 rc = fld_cache_lookup(fld->lsf_cache, seq, erange);
220 if (rc == 0) {
221 if (unlikely(fld_range_type(erange) != fld_range_type(range) &&
222 !fld_range_is_any(range))) {
223 CERROR("%s: FLD cache range "DRANGE" does not match"
(gdb) q
So fld is NULL here, which means the FLD has not been initialized yet. Hmm, this should not get into fld_lookup, since the log FID is not inside the FLDB. |
| Comment by Di Wang [ 24/Jun/14 ] |
|
Ah, this is an upgrade test, i.e. those MGS llogs are IGIF, so they are in the FLDB. Hmm, then we need to check whether the FLD is fully initialized here. |
| Comment by Di Wang [ 24/Jun/14 ] |
| Comment by Andreas Dilger [ 24/Jun/14 ] |
|
This patch solves the problem of the crash, which is good, but the other question is why this started happening only a day ago. It is worthwhile to look into the patches that have landed recently to see what is actually causing this problem. |
| Comment by Jodi Levi (Inactive) [ 25/Jun/14 ] |
|
Patch landed to Master. |
| Comment by Ann Koehler (Inactive) [ 26/Aug/14 ] |
|
Does this bug exist in 2.5 as well as 2.6? I've got a node panic with almost the same stack trace from a server running 2.5. The null pointer dereference is in fld_server_lookup instead of fld_local_lookup, but the name change looks largely cosmetic. I don't have a dump or a reproducer so I'm trying to make an informed guess about what might have gone wrong. Thanks. |
| Comment by Di Wang [ 26/Aug/14 ] |
|
Yes, this exists in 2.5 as well, and usually happens during the setup or recovery process. Hmm, this is more than a cosmetic change; the key is checking whether ss_server_fld is fully initialized before going further. |
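For readers following the thread, the kind of guard described above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the landed patch: the struct definitions are simplified stand-ins that borrow only the field names mentioned in this ticket (ss_server_fld, lsf_cache), while guarded_fld_lookup, fld_cache_lookup_sketch, the range fields, and the -ENODEV return value are all hypothetical choices made for the example.

```c
#include <errno.h>
#include <stddef.h>

/* Simplified stand-ins for the real Lustre structures (assumption:
 * only the fields needed for this illustration are modelled). */
struct fld_cache {
	unsigned long long start;	/* first sequence covered (hypothetical) */
	unsigned long long end;		/* one past the last sequence (hypothetical) */
};

struct lu_server_fld {
	struct fld_cache *lsf_cache;
};

struct seq_server_site {
	struct lu_server_fld *ss_server_fld;
};

/* Hypothetical cache lookup: 0 if seq is in the cached range. */
static int fld_cache_lookup_sketch(const struct fld_cache *cache,
				   unsigned long long seq)
{
	return (seq >= cache->start && seq < cache->end) ? 0 : -ENOENT;
}

/* Refuse the lookup until the server FLD is fully set up, instead of
 * dereferencing a NULL pointer during setup or recovery.
 * The -ENODEV error code is an assumption made for this sketch. */
static int guarded_fld_lookup(const struct seq_server_site *ss,
			      unsigned long long seq)
{
	if (ss == NULL || ss->ss_server_fld == NULL ||
	    ss->ss_server_fld->lsf_cache == NULL)
		return -ENODEV;

	return fld_cache_lookup_sketch(ss->ss_server_fld->lsf_cache, seq);
}
```

With a guard of this shape, a caller that arrives before initialization finishes gets an error it can handle or retry, rather than triggering the NULL dereference seen at fld_handler.c:219.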
| Comment by Ann Koehler (Inactive) [ 26/Aug/14 ] |
|
Thanks, Di. Sounds like I have a match then. BTW, by cosmetic, I just meant the name change between fld_server_lookup and fld_local_lookup. The code in these two functions looks the same at least up to the point where the uninitialized variable is accessed. |
| Comment by Gerrit Updater [ 01/Feb/15 ] |
|
Li Xi (pkuelelixi@gmail.com) uploaded a new patch: http://review.whamcloud.com/13579 |