[LU-7084] sanityn test_77c: cfs_hash_find_or_add() ASSERTION( hlist_unhashed(hnode) ) failed Created: 01/Sep/15 Updated: 26/Aug/16 Resolved: 23/Mar/16 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.8.0 |
| Fix Version/s: | Lustre 2.9.0 |
| Type: | Bug | Priority: | Major |
| Reporter: | Maloo | Assignee: | Mikhail Pershin |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
This issue was created by maloo for Bob Glossman <bob.glossman@intel.com>

This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/6bace048-50dc-11e5-95a9-5254006e85c2.

The sub-test test_77c failed with the following error:

test failed to respond and timed out

The test failure looks like this:

```
16:23:54:LustreError: 24876:0:(hash.c:1253:cfs_hash_find_or_add()) ASSERTION( hlist_unhashed(hnode) ) failed:
16:23:54:LustreError: 24876:0:(hash.c:1253:cfs_hash_find_or_add()) LBUG
16:23:54:Kernel panic - not syncing: LBUG in interrupt.
16:23:54:
16:23:54:Pid: 24876, comm: ll_ost00_006 Tainted: P -- ------------ 2.6.32-573.3.1.el6_lustre.g4276203.x86_64 #1
16:23:54:Call Trace:
16:23:54: [<ffffffff815384e4>] ? panic+0xa7/0x16f
16:23:54: [<ffffffffa0713ebd>] ? lbug_with_loc+0x8d/0xb0 [libcfs]
16:23:54: [<ffffffffa0727d10>] ? cfs_hash_findadd_unique+0x0/0x30 [libcfs]
16:23:54: [<ffffffffa0727d28>] ? cfs_hash_findadd_unique+0x18/0x30 [libcfs]
16:23:54: [<ffffffffa0adf390>] ? nrs_orr_res_get+0x430/0xb70 [ptlrpc]
16:23:54: [<ffffffffa0ad4f86>] ? nrs_resource_get+0x56/0x110 [ptlrpc]
16:23:54: [<ffffffffa0a91935>] ? lustre_msg_buf+0x55/0x60 [ptlrpc]
16:23:54: [<ffffffffa0ad5d0b>] ? nrs_resource_get_safe+0x8b/0x100 [ptlrpc]
16:23:54: [<ffffffffa0ad99eb>] ? ptlrpc_nrs_req_hp_move+0x6b/0x210 [ptlrpc]
16:23:54: [<ffffffffa0abb165>] ? req_capsule_client_get+0x15/0x20 [ptlrpc]
16:23:54: [<ffffffffa0a6fab8>] ? ldlm_server_blocking_ast+0x238/0x8c0 [ptlrpc]
16:23:54: [<ffffffffa0af7329>] ? tgt_blocking_ast+0x1b9/0x8c0 [ptlrpc]
16:23:54: [<ffffffff8129bc34>] ? snprintf+0x34/0x40
16:23:54: [<ffffffffa0a4139e>] ? ldlm_work_bl_ast_lock+0xde/0x290 [ptlrpc]
16:23:54: [<ffffffffa0a87064>] ? ptlrpc_set_wait+0x74/0xa20 [ptlrpc]
16:23:54: [<ffffffff8117900d>] ? kmem_cache_alloc_node_trace+0x1cd/0x200
16:23:54: [<ffffffffa0a7e23e>] ? ptlrpc_prep_set+0xbe/0x270 [ptlrpc]
16:23:54: [<ffffffffa0a412c0>] ? ldlm_work_bl_ast_lock+0x0/0x290 [ptlrpc]
16:23:54: [<ffffffffa0a3dfcf>] ? ldlm_run_ast_work+0xcf/0x4a0 [ptlrpc]
16:23:54: [<ffffffffa0a5ca45>] ? ldlm_process_extent_lock+0x155/0xab0 [ptlrpc]
16:23:54: [<ffffffffa0a445be>] ? ldlm_lock_enqueue+0x47e/0x8e0 [ptlrpc]
16:23:54: [<ffffffffa0a710d7>] ? ldlm_handle_enqueue0+0x807/0x15b0 [ptlrpc]
16:23:54: [<ffffffffa0afbea1>] ? tgt_enqueue+0x61/0x230 [ptlrpc]
16:23:54: [<ffffffffa0afcaec>] ? tgt_request_handle+0x8bc/0x12e0 [ptlrpc]
16:23:54: [<ffffffffa0aa4731>] ? ptlrpc_main+0xe41/0x1910 [ptlrpc]
16:23:54: [<ffffffffa0aa38f0>] ? ptlrpc_main+0x0/0x1910 [ptlrpc]
16:23:54: [<ffffffff810a101e>] ? kthread+0x9e/0xc0
16:23:54: [<ffffffff8100c28a>] ? child_rip+0xa/0x20
16:23:54: [<ffffffff810a0f80>] ? kthread+0x0/0xc0
```

Info required for matching: sanityn 77c
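For context on the assertion itself, `hlist_unhashed()` from the Linux kernel's include/linux/list.h (abridged below) only inspects the node's `pprev` back-pointer:

```c
/* A node counts as "unhashed" iff its pprev back-pointer is NULL.
 * An hlist_node embedded in memory that was never zeroed or passed
 * through INIT_HLIST_NODE() carries stale bits in pprev, so this can
 * return 0 even for a node that was never added to any hash table. */
static inline int hlist_unhashed(const struct hlist_node *h)
{
	return !h->pprev;
}
```

So the LASSERT in cfs_hash_find_or_add() fires whenever the node handed to it does not look freshly initialized, whether or not it is actually in a hash.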
| Comments |
| Comment by Andreas Dilger [ 02/Sep/15 ] |
```
12:14:15:Lustre: DEBUG MARKER: lctl set_param ost.OSS.*.nrs_orr_offset_type=logical
12:14:15:LustreError: 23366:0:(hash.c:1253:cfs_hash_find_or_add()) ASSERTION( hlist_unhashed(hnode) ) failed:
12:14:15:LustreError: 23366:0:(hash.c:1253:cfs_hash_find_or_add()) LBUG
12:14:15:Kernel panic - not syncing: LBUG in interrupt.
12:14:15:
12:14:15:Pid: 23366, comm: ll_ost00_017 Tainted: P --------------- 2.6.32-504.30.3.el6_lustre.gc67434c.x86_64 #1
12:14:15:Call Trace:
12:14:15: [<ffffffff81529c9c>] ? panic+0xa7/0x16f
12:14:15: [<ffffffffa0709ebd>] ? lbug_with_loc+0x8d/0xb0 [libcfs]
12:14:15: [<ffffffffa071dd60>] ? cfs_hash_findadd_unique+0x0/0x30 [libcfs]
12:14:15: [<ffffffffa071dd78>] ? cfs_hash_findadd_unique+0x18/0x30 [libcfs]
12:14:15: [<ffffffffa0ad9070>] ? nrs_orr_res_get+0x430/0xb70 [ptlrpc]
12:14:15: [<ffffffffa0acec46>] ? nrs_resource_get+0x56/0x110 [ptlrpc]
12:14:15: [<ffffffffa0a8b7b5>] ? lustre_msg_buf+0x55/0x60 [ptlrpc]
12:14:15: [<ffffffffa0acf9eb>] ? nrs_resource_get_safe+0x8b/0x100 [ptlrpc]
12:14:15: [<ffffffffa0ad36cb>] ? ptlrpc_nrs_req_hp_move+0x6b/0x210 [ptlrpc]
12:14:15: [<ffffffffa0ab4eb5>] ? req_capsule_client_get+0x15/0x20 [ptlrpc]
12:14:15: [<ffffffffa0a69988>] ? ldlm_server_blocking_ast+0x238/0x8c0 [ptlrpc]
12:14:15: [<ffffffffa0af0ff9>] ? tgt_blocking_ast+0x1b9/0x8c0 [ptlrpc]
12:14:15: [<ffffffffa0a3b39e>] ? ldlm_work_bl_ast_lock+0xde/0x290 [ptlrpc]
12:14:15: [<ffffffffa0a80ee4>] ? ptlrpc_set_wait+0x74/0xa20 [ptlrpc]
12:14:15: [<ffffffff8117591d>] ? kmem_cache_alloc_node_trace+0x1cd/0x200
12:14:15: [<ffffffffa0a780ae>] ? ptlrpc_prep_set+0xbe/0x270 [ptlrpc]
12:14:15: [<ffffffffa0a3b2c0>] ? ldlm_work_bl_ast_lock+0x0/0x290 [ptlrpc]
12:14:15: [<ffffffffa0a37fcf>] ? ldlm_run_ast_work+0xcf/0x4a0 [ptlrpc]
12:14:15: [<ffffffffa0a56915>] ? ldlm_process_extent_lock+0x155/0xab0 [ptlrpc]
12:14:15: [<ffffffffa0a3e5be>] ? ldlm_lock_enqueue+0x47e/0x8e0 [ptlrpc]
12:14:15: [<ffffffffa0a6afa7>] ? ldlm_handle_enqueue0+0x807/0x15b0 [ptlrpc]
12:14:15: [<ffffffffa0af5b71>] ? tgt_enqueue+0x61/0x230 [ptlrpc]
12:14:15: [<ffffffffa0af694c>] ? tgt_request_handle+0xa4c/0x1290 [ptlrpc]
12:14:15: [<ffffffffa0a9e5b1>] ? ptlrpc_main+0xe41/0x1910 [ptlrpc]
12:14:15: [<ffffffff8109e78e>] ? kthread+0x9e/0xc0
```
| Comment by Emoly Liu [ 16/Sep/15 ] |
|
Another instance: https://testing.hpdd.intel.com/test_sets/64c3d51a-5c13-11e5-9dac-5254006e85c2 |
| Comment by Joseph Gmitter (Inactive) [ 30/Sep/15 ] |
|
Here is a failure on master within the past week: |
| Comment by Bob Glossman (Inactive) [ 11/Oct/15 ] |
|
another on master: |
| Comment by Bruno Faccini (Inactive) [ 15/Oct/15 ] |
|
+1 at https://testing.hpdd.intel.com/test_sets/b442abee-72dd-11e5-b8fe-5254006e85c2 |
| Comment by Bob Glossman (Inactive) [ 22/Oct/15 ] |
|
another on master: |
| Comment by Mikhail Pershin [ 27/Oct/15 ] |
|
another one: |
| Comment by Gerrit Updater [ 27/Oct/15 ] |
|
Mike Pershin (mike.pershin@intel.com) uploaded a new patch: http://review.whamcloud.com/16952 |
| Comment by James Nunez (Inactive) [ 27/Oct/15 ] |
|
More failures at: |
| Comment by Mikhail Pershin [ 28/Oct/15 ] |
|
I was wrong about that patch; ignore it.
| Comment by Mikhail Pershin [ 28/Oct/15 ] |
|
In fact, this ticket is just a duplicate of
| Comment by Mikhail Pershin [ 28/Oct/15 ] |
|
Looking at Bruno's analysis in
| Comment by James Nunez (Inactive) [ 30/Oct/15 ] |
|
Another failure on master: https://testing.hpdd.intel.com/test_sets/12eb8de2-7e8b-11e5-aa3e-5254006e85c2 |
| Comment by nasf (Inactive) [ 02/Nov/15 ] |
|
Another failure instance on master: |
| Comment by Gerrit Updater [ 05/Nov/15 ] |
|
Mike Pershin (mike.pershin@intel.com) uploaded a new patch: http://review.whamcloud.com/17051 |
| Comment by Alex Zhuravlev [ 03/Dec/15 ] |
|
https://testing.hpdd.intel.com/test_sets/d90368e4-9991-11e5-b944-5254006e85c2 |
| Comment by Bob Glossman (Inactive) [ 03/Dec/15 ] |
|
another on master: |
| Comment by James Nunez (Inactive) [ 04/Dec/15 ] |
|
More failures on master in review-zfs-part-1: |
| Comment by Di Wang [ 11/Dec/15 ] |
|
2015-12-11 06:50:30 https://testing.hpdd.intel.com/sub_tests/140fdcdc-a00f-11e5-8d69-5254006e85c2 |
| Comment by Andreas Dilger [ 13/Dec/15 ] |
|
Again: https://testing.hpdd.intel.com/test_sets/4f246306-9e34-11e5-98a4-5254006e85c2 |
| Comment by James Nunez (Inactive) [ 15/Dec/15 ] |
|
More failures on master: |
| Comment by Andreas Dilger [ 17/Dec/15 ] |
|
Another failure: https://testing.hpdd.intel.com/test_sets/d658c48e-a3fc-11e5-8701-5254006e85c2 |
| Comment by Gerrit Updater [ 18/Dec/15 ] |
|
Yang Sheng (yang.sheng@intel.com) uploaded a new patch: http://review.whamcloud.com/17673 |
| Comment by John Hammond [ 30/Dec/15 ] |
|
This is due to a lack of protective parens in the definitions of the OBD_SLAB_ALLOC_GFP_*() macros. If moving_req is true (as it is in the path for the stack traces seen here) then __GFP_ZERO is not used for the allocation:

```c
static int nrs_orr_res_get(struct ptlrpc_nrs_policy *policy,
			   struct ptlrpc_nrs_request *nrq,
			   const struct ptlrpc_nrs_resource *parent,
			   struct ptlrpc_nrs_resource **resp,
			   bool moving_req)
{
	...
	OBD_SLAB_CPT_ALLOC_PTR_GFP(orro, orrd->od_cache,
				   nrs_pol2cptab(policy),
				   nrs_pol2cptid(policy),
				   moving_req ? GFP_ATOMIC : GFP_NOFS);
	...
}

#define OBD_SLAB_CPT_ALLOC_PTR_GFP(ptr, slab, cptab, cpt, flags)	      \
	OBD_SLAB_CPT_ALLOC_GFP(ptr, slab, cptab, cpt, sizeof *(ptr), flags)

#define OBD_SLAB_CPT_ALLOC_GFP(ptr, slab, cptab, cpt, size, flags)	      \
	__OBD_SLAB_ALLOC_VERBOSE(ptr, slab, cptab, cpt, size, flags)

#define __OBD_SLAB_ALLOC_VERBOSE(ptr, slab, cptab, cpt, size, type)	      \
do {									      \
	LASSERT(ergo((type) != GFP_ATOMIC, !in_interrupt()));		      \
	(ptr) = (cptab) == NULL ?					      \
		kmem_cache_alloc(slab, type | __GFP_ZERO) :		      \
		cfs_mem_cache_cpt_alloc(slab, cptab, cpt, type | __GFP_ZERO); \
	if (likely((ptr)))						      \
		OBD_ALLOC_POST(ptr, size, "slab-alloced");		      \
} while(0)
```
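To make the precedence trap concrete, here is a minimal standalone sketch; the `FLAG_*` values and `ALLOC_*` macro names are hypothetical stand-ins, not the Lustre definitions. Because `|` binds tighter than the conditional operator `?:`, the unparenthesized `type` argument makes the zeroing flag apply only to the false branch of `moving_req ? GFP_ATOMIC : GFP_NOFS`:

```c
#include <stdio.h>

/* Hypothetical flag values standing in for GFP_ATOMIC, GFP_NOFS and
 * __GFP_ZERO; only their distinct bit patterns matter here. */
#define FLAG_ATOMIC 0x1
#define FLAG_NOFS   0x2
#define FLAG_ZERO   0x4

/* Buggy, like __OBD_SLAB_ALLOC_VERBOSE(): 'type' is used without
 * parentheses, so ALLOC_BUGGY(c ? FLAG_ATOMIC : FLAG_NOFS) expands to
 *   (c ? FLAG_ATOMIC : FLAG_NOFS | FLAG_ZERO)
 * and FLAG_ZERO lands only in the false branch. */
#define ALLOC_BUGGY(type) (type | FLAG_ZERO)

/* Fixed: (type) forces ((c ? FLAG_ATOMIC : FLAG_NOFS) | FLAG_ZERO). */
#define ALLOC_FIXED(type) ((type) | FLAG_ZERO)

int main(void)
{
	int moving_req = 1;

	printf("buggy: 0x%x\n",
	       ALLOC_BUGGY(moving_req ? FLAG_ATOMIC : FLAG_NOFS));
	/* prints 0x1: the zeroing flag is silently dropped */
	printf("fixed: 0x%x\n",
	       ALLOC_FIXED(moving_req ? FLAG_ATOMIC : FLAG_NOFS));
	/* prints 0x5: the zeroing flag is always OR'ed in */
	return 0;
}
```

This matches the traces: with moving_req true the orro object comes back unzeroed, its embedded hlist_node carries stale pprev bits, and hlist_unhashed() inside cfs_hash_find_or_add() can then fail even though the node was never inserted into any hash.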
| Comment by Gerrit Updater [ 30/Dec/15 ] |
|
John L. Hammond (john.hammond@intel.com) uploaded a new patch: http://review.whamcloud.com/17755 |
| Comment by Gerrit Updater [ 06/Jan/16 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/17755/ |
| Comment by James Nunez (Inactive) [ 08/Jan/16 ] |
|
It looks like we experienced this error on a master patch that contains http://review.whamcloud.com/17755/. Logs are at: |
| Comment by John Hammond [ 08/Jan/16 ] |
|
> It looks like we experienced this error on a master patch that contains http://review.whamcloud.com/17755/. Logs are at:

The revision that failed was 9d1cf5779235716d9801148aee4d06597ceaab6f. This is patch set 2 of http://review.whamcloud.com/#/c/17633/, which is based on eb6cd4804d65dda1b6ea4a1289cc01647d03a47a (
| Comment by Mikhail Pershin [ 01/Feb/16 ] |
|
patch was landed |
| Comment by Gerrit Updater [ 23/Mar/16 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/17673/ |
| Comment by Joseph Gmitter (Inactive) [ 23/Mar/16 ] |
|
Landed for 2.9.0 |