[LU-7084] sanityn test_77c: cfs_hash_find_or_add()) ASSERTION( hlist_unhashed(hnode) ) failed Created: 01/Sep/15  Updated: 26/Aug/16  Resolved: 23/Mar/16

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.8.0
Fix Version/s: Lustre 2.9.0

Type: Bug Priority: Major
Reporter: Maloo Assignee: Mikhail Pershin
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Related
is related to LU-4499 NRS ORR cfs_hash_find_or_add() LBUG Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This issue was created by maloo for Bob Glossman <bob.glossman@intel.com>

This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/6bace048-50dc-11e5-95a9-5254006e85c2.

The sub-test test_77c failed with the following error:

test failed to respond and timed out

This test failure looks like LU-4499, but that ticket has been marked Fixed. As far as I can tell, that fix landed before the change under test in this build.

16:23:54:LustreError: 24876:0:(hash.c:1253:cfs_hash_find_or_add()) ASSERTION( hlist_unhashed(hnode) ) failed: 
16:23:54:LustreError: 24876:0:(hash.c:1253:cfs_hash_find_or_add()) LBUG
16:23:54:Kernel panic - not syncing: LBUG in interrupt.
16:23:54:
16:23:54:Pid: 24876, comm: ll_ost00_006 Tainted: P           -- ------------    2.6.32-573.3.1.el6_lustre.g4276203.x86_64 #1
16:23:54:Call Trace:
16:23:54: [<ffffffff815384e4>] ? panic+0xa7/0x16f
16:23:54: [<ffffffffa0713ebd>] ? lbug_with_loc+0x8d/0xb0 [libcfs]
16:23:54: [<ffffffffa0727d10>] ? cfs_hash_findadd_unique+0x0/0x30 [libcfs]
16:23:54: [<ffffffffa0727d28>] ? cfs_hash_findadd_unique+0x18/0x30 [libcfs]
16:23:54: [<ffffffffa0adf390>] ? nrs_orr_res_get+0x430/0xb70 [ptlrpc]
16:23:54: [<ffffffffa0ad4f86>] ? nrs_resource_get+0x56/0x110 [ptlrpc]
16:23:54: [<ffffffffa0a91935>] ? lustre_msg_buf+0x55/0x60 [ptlrpc]
16:23:54: [<ffffffffa0ad5d0b>] ? nrs_resource_get_safe+0x8b/0x100 [ptlrpc]
16:23:54: [<ffffffffa0ad99eb>] ? ptlrpc_nrs_req_hp_move+0x6b/0x210 [ptlrpc]
16:23:54: [<ffffffffa0abb165>] ? req_capsule_client_get+0x15/0x20 [ptlrpc]
16:23:54: [<ffffffffa0a6fab8>] ? ldlm_server_blocking_ast+0x238/0x8c0 [ptlrpc]
16:23:54: [<ffffffffa0af7329>] ? tgt_blocking_ast+0x1b9/0x8c0 [ptlrpc]
16:23:54: [<ffffffff8129bc34>] ? snprintf+0x34/0x40
16:23:54: [<ffffffffa0a4139e>] ? ldlm_work_bl_ast_lock+0xde/0x290 [ptlrpc]
16:23:54: [<ffffffffa0a87064>] ? ptlrpc_set_wait+0x74/0xa20 [ptlrpc]
16:23:54: [<ffffffff8117900d>] ? kmem_cache_alloc_node_trace+0x1cd/0x200
16:23:54: [<ffffffffa0a7e23e>] ? ptlrpc_prep_set+0xbe/0x270 [ptlrpc]
16:23:54: [<ffffffffa0a412c0>] ? ldlm_work_bl_ast_lock+0x0/0x290 [ptlrpc]
16:23:54: [<ffffffffa0a3dfcf>] ? ldlm_run_ast_work+0xcf/0x4a0 [ptlrpc]
16:23:54: [<ffffffffa0a5ca45>] ? ldlm_process_extent_lock+0x155/0xab0 [ptlrpc]
16:23:54: [<ffffffffa0a445be>] ? ldlm_lock_enqueue+0x47e/0x8e0 [ptlrpc]
16:23:54: [<ffffffffa0a710d7>] ? ldlm_handle_enqueue0+0x807/0x15b0 [ptlrpc]
16:23:54: [<ffffffffa0afbea1>] ? tgt_enqueue+0x61/0x230 [ptlrpc]
16:23:54: [<ffffffffa0afcaec>] ? tgt_request_handle+0x8bc/0x12e0 [ptlrpc]
16:23:54: [<ffffffffa0aa4731>] ? ptlrpc_main+0xe41/0x1910 [ptlrpc]
16:23:54: [<ffffffffa0aa38f0>] ? ptlrpc_main+0x0/0x1910 [ptlrpc]
16:23:54: [<ffffffff810a101e>] ? kthread+0x9e/0xc0
16:23:54: [<ffffffff8100c28a>] ? child_rip+0xa/0x20
16:23:54: [<ffffffff810a0f80>] ? kthread+0x0/0xc0

Info required for matching: sanityn 77c



 Comments   
Comment by Andreas Dilger [ 02/Sep/15 ]
12:14:15:Lustre: DEBUG MARKER: lctl set_param ost.OSS.*.nrs_orr_offset_type=logical
12:14:15:LustreError: 23366:0:(hash.c:1253:cfs_hash_find_or_add()) ASSERTION( hlist_unhashed(hnode) ) failed: 
12:14:15:LustreError: 23366:0:(hash.c:1253:cfs_hash_find_or_add()) LBUG
12:14:15:Kernel panic - not syncing: LBUG in interrupt.
12:14:15:
12:14:15:Pid: 23366, comm: ll_ost00_017 Tainted: P           ---------------    2.6.32-504.30.3.el6_lustre.gc67434c.x86_64 #1
12:14:15:Call Trace:
12:14:15: [<ffffffff81529c9c>] ? panic+0xa7/0x16f
12:14:15: [<ffffffffa0709ebd>] ? lbug_with_loc+0x8d/0xb0 [libcfs]
12:14:15: [<ffffffffa071dd60>] ? cfs_hash_findadd_unique+0x0/0x30 [libcfs]
12:14:15: [<ffffffffa071dd78>] ? cfs_hash_findadd_unique+0x18/0x30 [libcfs]
12:14:15: [<ffffffffa0ad9070>] ? nrs_orr_res_get+0x430/0xb70 [ptlrpc]
12:14:15: [<ffffffffa0acec46>] ? nrs_resource_get+0x56/0x110 [ptlrpc]
12:14:15: [<ffffffffa0a8b7b5>] ? lustre_msg_buf+0x55/0x60 [ptlrpc]
12:14:15: [<ffffffffa0acf9eb>] ? nrs_resource_get_safe+0x8b/0x100 [ptlrpc]
12:14:15: [<ffffffffa0ad36cb>] ? ptlrpc_nrs_req_hp_move+0x6b/0x210 [ptlrpc]
12:14:15: [<ffffffffa0ab4eb5>] ? req_capsule_client_get+0x15/0x20 [ptlrpc]
12:14:15: [<ffffffffa0a69988>] ? ldlm_server_blocking_ast+0x238/0x8c0 [ptlrpc]
12:14:15: [<ffffffffa0af0ff9>] ? tgt_blocking_ast+0x1b9/0x8c0 [ptlrpc]
12:14:15: [<ffffffffa0a3b39e>] ? ldlm_work_bl_ast_lock+0xde/0x290 [ptlrpc]
12:14:15: [<ffffffffa0a80ee4>] ? ptlrpc_set_wait+0x74/0xa20 [ptlrpc]
12:14:15: [<ffffffff8117591d>] ? kmem_cache_alloc_node_trace+0x1cd/0x200
12:14:15: [<ffffffffa0a780ae>] ? ptlrpc_prep_set+0xbe/0x270 [ptlrpc]
12:14:15: [<ffffffffa0a3b2c0>] ? ldlm_work_bl_ast_lock+0x0/0x290 [ptlrpc]
12:14:15: [<ffffffffa0a37fcf>] ? ldlm_run_ast_work+0xcf/0x4a0 [ptlrpc]
12:14:15: [<ffffffffa0a56915>] ? ldlm_process_extent_lock+0x155/0xab0 [ptlrpc]
12:14:15: [<ffffffffa0a3e5be>] ? ldlm_lock_enqueue+0x47e/0x8e0 [ptlrpc]
12:14:15: [<ffffffffa0a6afa7>] ? ldlm_handle_enqueue0+0x807/0x15b0 [ptlrpc]
12:14:15: [<ffffffffa0af5b71>] ? tgt_enqueue+0x61/0x230 [ptlrpc]
12:14:15: [<ffffffffa0af694c>] ? tgt_request_handle+0xa4c/0x1290 [ptlrpc]
12:14:15: [<ffffffffa0a9e5b1>] ? ptlrpc_main+0xe41/0x1910 [ptlrpc]
12:14:15: [<ffffffff8109e78e>] ? kthread+0x9e/0xc0
Comment by Emoly Liu [ 16/Sep/15 ]

Another instance: https://testing.hpdd.intel.com/test_sets/64c3d51a-5c13-11e5-9dac-5254006e85c2

Comment by Joseph Gmitter (Inactive) [ 30/Sep/15 ]

Here is a failure on master within the past week:
https://testing.hpdd.intel.com/test_sessions/19297758-627c-11e5-a45a-5254006e85c2

Comment by Bob Glossman (Inactive) [ 11/Oct/15 ]

another on master:
https://testing.hpdd.intel.com/test_sets/f7e06f60-7064-11e5-a914-5254006e85c2

Comment by Bruno Faccini (Inactive) [ 15/Oct/15 ]

+1 at https://testing.hpdd.intel.com/test_sets/b442abee-72dd-11e5-b8fe-5254006e85c2

Comment by Bob Glossman (Inactive) [ 22/Oct/15 ]

another on master:
https://testing.hpdd.intel.com/test_sets/691d1330-78e0-11e5-9aa5-5254006e85c2

Comment by Mikhail Pershin [ 27/Oct/15 ]

another one:
https://testing.hpdd.intel.com/test_sets/ce9c606e-7c12-11e5-9851-5254006e85c2

Comment by Gerrit Updater [ 27/Oct/15 ]

Mike Pershin (mike.pershin@intel.com) uploaded a new patch: http://review.whamcloud.com/16952
Subject: LU-7084 ptlrpc: add request to the HP list properly
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 43b4181507f1e3e2c84188386959d9ee71b69968

Comment by James Nunez (Inactive) [ 27/Oct/15 ]

More failures at:
2015-10-26 22:44:58 - https://testing.hpdd.intel.com/test_sets/142ebe04-7c5b-11e5-9ca1-5254006e85c2
2015-10-26 20:29:26 - https://testing.hpdd.intel.com/test_sets/1cd1f35e-7c52-11e5-9851-5254006e85c2

Comment by Mikhail Pershin [ 28/Oct/15 ]

I was wrong about that patch; please ignore it.

Comment by Mikhail Pershin [ 28/Oct/15 ]

In fact, this ticket is just a duplicate of LU-4499, which it seems was never actually fixed. Though the patch for LU-4499 makes sense, it doesn't fix this issue.

Comment by Mikhail Pershin [ 28/Oct/15 ]

Looking at Bruno's analysis in LU-4499, it is clear that the hnode is not initialized in the ORR object right after allocation, while cfs_hash_findadd_unique() checks that it is uninitialized by comparing it with NULL, which of course is not true for 0x5a5a5a5a. I don't understand how that can happen: if oo_key and oo_ref are both set, then the assertion fired right after allocation, yet our allocations use __GFP_ZERO, so there can't be a 0x5a pattern. This can't be a use-after-free case either, because then all fields would be poisoned, and oo_ref and oo_key couldn't be re-initialized if the orro structure was found via a hash lookup. It looks like __GFP_ZERO is not working in some cases. Can that be true?
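
For reference, a minimal sketch of why a non-zeroed allocation would trip this assertion. The hlist definitions below are a simplified copy of the kernel's <linux/list.h> (not Lustre source), the oo_hnode field name is illustrative (following the oo_ prefix used above), and 0x5a is the slab POISON_INUSE byte pattern:

/* Simplified copy of the kernel's hlist node, for illustration. */
struct hlist_node {
        struct hlist_node *next, **pprev;
};

/* A node counts as "unhashed" when its back-pointer is NULL. */
static inline int hlist_unhashed(const struct hlist_node *h)
{
        return !h->pprev;
}

/*
 * With slab poisoning enabled, an allocation that was NOT zeroed
 * (no __GFP_ZERO and no memset) comes back filled with 0x5a bytes,
 * so the embedded hash node reads as:
 *
 *     orro->oo_hnode.pprev == (struct hlist_node **)0x5a5a5a5a5a5a5a5a
 *
 * hlist_unhashed() returns 0 for that value, and the
 * LASSERT(hlist_unhashed(hnode)) in cfs_hash_find_or_add() fires for
 * an object that was never inserted into any hash.
 */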

Comment by James Nunez (Inactive) [ 30/Oct/15 ]

Another failure on master: https://testing.hpdd.intel.com/test_sets/12eb8de2-7e8b-11e5-aa3e-5254006e85c2
2015-10-30 15:25:57 - https://testing.hpdd.intel.com/test_sets/bb3f4fc0-7f57-11e5-8550-5254006e85c2
2015-11-01 06:53:15 - https://testing.hpdd.intel.com/test_sets/6e62e3ae-808a-11e5-9307-5254006e85c2
2015-11-02 05:22:46 - https://testing.hpdd.intel.com/test_sets/4c42df08-814e-11e5-a535-5254006e85c2
2015-11-03 06:53:04 - https://testing.hpdd.intel.com/test_sets/58ca9030-821c-11e5-a6c5-5254006e85c2
2015-11-04 18:45:04 - https://testing.hpdd.intel.com/test_sets/3a5c8c64-8372-11e5-aa25-5254006e85c2
2015-11-04 21:44:38 - https://testing.hpdd.intel.com/test_sets/90791a1e-8367-11e5-93d9-5254006e85c2
2015-11-05 22:17:18 - https://testing.hpdd.intel.com/test_sets/665edf50-843b-11e5-a1b5-5254006e85c2
2015-11-05 20:37:50 - https://testing.hpdd.intel.com/test_sets/8799f51e-842d-11e5-8df7-5254006e85c2
2015-11-05 19:49:18 - https://testing.hpdd.intel.com/test_sets/4af05cea-8426-11e5-a390-5254006e85c2
2015-11-12 03:29:09 - https://testing.hpdd.intel.com/test_sets/ccbdcab2-891e-11e5-8174-5254006e85c2
2015-11-12 05:31:24 - https://testing.hpdd.intel.com/test_sets/84231ad6-8938-11e5-9de3-5254006e85c2
2015-11-12 12:16:13 - https://testing.hpdd.intel.com/test_sets/5a230980-8974-11e5-8174-5254006e85c2
2015-11-14 20:25:49 - https://testing.hpdd.intel.com/test_sets/2c8bdca6-8b3e-11e5-8e06-5254006e85c2
2015-11-15 16:03:32 - https://testing.hpdd.intel.com/test_sets/9b6b3fa2-8bdb-11e5-81ec-5254006e85c2

Comment by nasf (Inactive) [ 02/Nov/15 ]

Another failure instance on master:
https://testing.hpdd.intel.com/test_sets/fc899e1c-8147-11e5-9c69-5254006e85c2

Comment by Gerrit Updater [ 05/Nov/15 ]

Mike Pershin (mike.pershin@intel.com) uploaded a new patch: http://review.whamcloud.com/17051
Subject: LU-7084 debug: add debug to get more data
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 5028c7255d886fd6fc7dc6fa95f0c00ce99bb84a

Comment by Alex Zhuravlev [ 03/Dec/15 ]

https://testing.hpdd.intel.com/test_sets/d90368e4-9991-11e5-b944-5254006e85c2

Comment by Bob Glossman (Inactive) [ 03/Dec/15 ]

another on master:
https://testing.hpdd.intel.com/test_sets/6a00beaa-9929-11e5-9236-5254006e85c2

Comment by James Nunez (Inactive) [ 04/Dec/15 ]

More failures on master in review-zfs-part-1:
2015-12-03 17:50:06 - https://testing.hpdd.intel.com/test_sets/4a09f2c6-9a26-11e5-a313-5254006e85c2
2015-12-03 21:57:16 - https://testing.hpdd.intel.com/test_sets/c78bf85c-9a53-11e5-8b28-5254006e85c2
2015-12-04 00:29:27 - https://testing.hpdd.intel.com/test_sets/dceb8258-9a49-11e5-8b28-5254006e85c2
2015-12-08 23:29:35 - https://testing.hpdd.intel.com/test_sets/4f246306-9e34-11e5-98a4-5254006e85c2
2015-12-08 05:08:11 - https://testing.hpdd.intel.com/test_sets/e06968d2-9daf-11e5-91b0-5254006e85c2
2015-12-08 03:10:54 - https://testing.hpdd.intel.com/test_sets/be6523f2-9daa-11e5-a24e-5254006e85c2

Comment by Di Wang [ 11/Dec/15 ]

2015-12-11 06:50:30 https://testing.hpdd.intel.com/sub_tests/140fdcdc-a00f-11e5-8d69-5254006e85c2
2015-12-11 14:09:05 https://testing.hpdd.intel.com/test_sets/72f3afbe-a04e-11e5-90cc-5254006e85c2

Comment by Andreas Dilger [ 13/Dec/15 ]

Again: https://testing.hpdd.intel.com/test_sets/4f246306-9e34-11e5-98a4-5254006e85c2

Comment by James Nunez (Inactive) [ 15/Dec/15 ]

More failures on master:
2015-12-11 03:41:17 - https://testing.hpdd.intel.com/test_sets/9b9e814c-9fe4-11e5-ae0a-5254006e85c2
2015-12-11 06:36:27 - https://testing.hpdd.intel.com/test_sets/8760168c-a00c-11e5-a33d-5254006e85c2
2015-12-13 20:01:38 - https://testing.hpdd.intel.com/test_sets/504e086c-a1ff-11e5-8c8b-5254006e85c2
2015-12-15 00:54:27 - https://testing.hpdd.intel.com/test_sets/b9fff734-a2f4-11e5-9b3d-5254006e85c2
2015-12-15 08:06:47 - https://testing.hpdd.intel.com/test_sets/5d8fb048-a34e-11e5-867e-5254006e85c2
2015-12-16 14:25:59 - https://testing.hpdd.intel.com/test_sets/6d6be14a-a438-11e5-806a-5254006e85c2
2015-12-17 03:07:23 - https://testing.hpdd.intel.com/test_sets/b0c8dfbc-a49f-11e5-947b-5254006e85c2
2015-12-17 04:22:29 - https://testing.hpdd.intel.com/test_sets/59e6dc42-a4bf-11e5-947b-5254006e85c2
2015-12-17 20:57:45 - https://testing.hpdd.intel.com/test_sets/d31c4362-a529-11e5-947b-5254006e85c2
2015-12-18 06:12:50 - https://testing.hpdd.intel.com/test_sets/20560a20-a5c3-11e5-a028-5254006e85c2
2015-12-18 16:05:03 - https://testing.hpdd.intel.com/test_sets/19f29384-a5e5-11e5-9f01-5254006e85c2
2015-12-19 11:23:27 - https://testing.hpdd.intel.com/test_sets/1675bfe0-a678-11e5-860e-5254006e85c2
2015-12-19 18:05:15 - https://testing.hpdd.intel.com/test_sets/7c3a5770-a6c1-11e5-b558-5254006e85c2
2015-12-20 10:53:10 - https://testing.hpdd.intel.com/test_sets/a4955ac0-a73a-11e5-ab33-5254006e85c2
2015-12-22 10:45:13 - https://testing.hpdd.intel.com/test_sets/71d6f384-a8d2-11e5-99ca-5254006e85c2
2015-12-24 18:42:12 - https://testing.hpdd.intel.com/test_sets/00d2fc82-aa94-11e5-a386-5254006e85c2
2015-12-25 03:49:35 - https://testing.hpdd.intel.com/test_sets/a0a3dc3a-aae0-11e5-bed4-5254006e85c2
2015-12-28 10:39:50 - https://testing.hpdd.intel.com/test_sets/82d03ae2-ad7d-11e5-8114-5254006e85c2
2016-01-02 10:10:20 - https://testing.hpdd.intel.com/test_sets/2922f26a-b164-11e5-8114-5254006e85c2

Comment by Andreas Dilger [ 17/Dec/15 ]

Another failure: https://testing.hpdd.intel.com/test_sets/d658c48e-a3fc-11e5-8701-5254006e85c2

Comment by Gerrit Updater [ 18/Dec/15 ]

Yang Sheng (yang.sheng@intel.com) uploaded a new patch: http://review.whamcloud.com/17673
Subject: LU-7084 mem: cleanup & debug patch
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 587aa2138882a6f2e95a666a66b6dda47223e960

Comment by John Hammond [ 30/Dec/15 ]

This is due to a lack of protective parentheses in the definitions of the OBD_SLAB_ALLOC_GFP_*() macros. If moving_req is true (as it is in the path for the stack traces seen here), then __GFP_ZERO is not used for the allocation:

static int nrs_orr_res_get(struct ptlrpc_nrs_policy *policy,
                           struct ptlrpc_nrs_request *nrq,
                           const struct ptlrpc_nrs_resource *parent,
                           struct ptlrpc_nrs_resource **resp, bool moving_req)
{
        ...
        OBD_SLAB_CPT_ALLOC_PTR_GFP(orro, orrd->od_cache,
                                   nrs_pol2cptab(policy), nrs_pol2cptid(policy),
                                   moving_req ? GFP_ATOMIC : GFP_NOFS);
        ...
}

#define OBD_SLAB_CPT_ALLOC_PTR_GFP(ptr, slab, cptab, cpt, flags)                      \
        OBD_SLAB_CPT_ALLOC_GFP(ptr, slab, cptab, cpt, sizeof *(ptr), flags)

#define OBD_SLAB_CPT_ALLOC_GFP(ptr, slab, cptab, cpt, size, flags)            \
        __OBD_SLAB_ALLOC_VERBOSE(ptr, slab, cptab, cpt, size, flags)

#define __OBD_SLAB_ALLOC_VERBOSE(ptr, slab, cptab, cpt, size, type)           \
do {                                                                          \
        LASSERT(ergo((type) != GFP_ATOMIC, !in_interrupt()));         \
        (ptr) = (cptab) == NULL ?                                             \
                kmem_cache_alloc(slab, type | __GFP_ZERO) :                   \
                cfs_mem_cache_cpt_alloc(slab, cptab, cpt, type | __GFP_ZERO); \
        if (likely((ptr)))                                                    \
                OBD_ALLOC_POST(ptr, size, "slab-alloced");                    \
} while(0)
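
To make the expansion concrete: with the argument moving_req ? GFP_ATOMIC : GFP_NOFS, the unparenthesized macro body type | __GFP_ZERO expands to moving_req ? GFP_ATOMIC : (GFP_NOFS | __GFP_ZERO), because '|' binds tighter than '?:'; so whenever moving_req is true, the flags are plain GFP_ATOMIC and __GFP_ZERO is silently dropped. A minimal standalone sketch of the trap (hypothetical flag values for demonstration only, not the real GFP constants; the "fixed" form matches what the 17755 subject line describes):

#include <stdio.h>

/* Hypothetical flag values; the real constants live in <linux/gfp.h>. */
#define GFP_ATOMIC   0x1
#define GFP_NOFS     0x2
#define __GFP_ZERO   0x4

/* Unparenthesized parameter, as in __OBD_SLAB_ALLOC_VERBOSE above: */
#define FLAGS_BUGGY(type)  (type | __GFP_ZERO)
/* Parenthesized parameter, the conventional fix: */
#define FLAGS_FIXED(type)  ((type) | __GFP_ZERO)

int main(void)
{
        int moving_req = 1;

        /* Expands to: moving_req ? GFP_ATOMIC : (GFP_NOFS | __GFP_ZERO),
         * so __GFP_ZERO is lost on the GFP_ATOMIC branch. Prints 0x1. */
        printf("buggy: 0x%x\n",
               FLAGS_BUGGY(moving_req ? GFP_ATOMIC : GFP_NOFS));

        /* Expands to: (moving_req ? GFP_ATOMIC : GFP_NOFS) | __GFP_ZERO.
         * Prints 0x5. */
        printf("fixed: 0x%x\n",
               FLAGS_FIXED(moving_req ? GFP_ATOMIC : GFP_NOFS));

        return 0;
}
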
Comment by Gerrit Updater [ 30/Dec/15 ]

John L. Hammond (john.hammond@intel.com) uploaded a new patch: http://review.whamcloud.com/17755
Subject: LU-7084 obd: correct some OBD allocator macro defines
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: b8b6a98d7b37d1a4f196d2f7b17c42688ec88eda

Comment by Gerrit Updater [ 06/Jan/16 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/17755/
Subject: LU-7084 obd: correct some OBD allocator macro defines
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 08fcdbb95cd7ab3fc1246f03c3ef27c0b8a0d218

Comment by James Nunez (Inactive) [ 08/Jan/16 ]

It looks like we experienced this error on a master patch that contains http://review.whamcloud.com/17755/. Logs are at:
https://testing.hpdd.intel.com/test_sets/90f9959a-b459-11e5-9134-5254006e85c2

Comment by John Hammond [ 08/Jan/16 ]

> It looks like we experienced this error on a master patch that contains http://review.whamcloud.com/17755/. Logs are at:
> https://testing.hpdd.intel.com/test_sets/90f9959a-b459-11e5-9134-5254006e85c2

The revision that failed was 9d1cf5779235716d9801148aee4d06597ceaab6f. This is patch set 2 of http://review.whamcloud.com/#/c/17633/ which is based on eb6cd4804d65dda1b6ea4a1289cc01647d03a47a (LU-7223 tests: print more information when mmp.sh failed) and does not include 17755.

Comment by Mikhail Pershin [ 01/Feb/16 ]

The patch was landed.

Comment by Gerrit Updater [ 23/Mar/16 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/17673/
Subject: LU-7084 mem: code cleanup for memory allocation
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: e533c7c6d1eb2bbdc4818f688f76fab1d03bc2c8

Comment by Joseph Gmitter (Inactive) [ 23/Mar/16 ]

Landed for 2.9.0
