[LU-4499] NRS ORR cfs_hash_find_or_add() LBUG Created: 16/Jan/14  Updated: 19/Mar/19  Resolved: 26/Aug/15

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.5.0, Lustre 2.6.0, Lustre 2.8.0
Fix Version/s: Lustre 2.8.0

Type: Bug Priority: Critical
Reporter: Shuichi Ihara (Inactive) Assignee: Jinshan Xiong (Inactive)
Resolution: Fixed Votes: 0
Labels: cea
Environment:

Lustre-2.5.52 (server), Lustre-2.5.53 (client)


Issue Links:
Duplicate
is duplicated by LU-6633 sanityn test_77c: ASSERTION( hlist_un... Resolved
Related
is related to LU-7084 sanityn test_77c: cfs_hash_find_or_ad... Resolved
is related to LU-6688 sanityn 77a and 77b fail to set the N... Resolved
Severity: 3
Rank (Obsolete): 12304

 Description   

Hit an LBUG and crash on the OSS during an IOR SSF (single shared file) test with striping across all OSTs (lfs setstripe -c -1).

<0>LustreError: 28757:0:(hash.c:1252:cfs_hash_find_or_add()) ASSERTION( hlist_unhashed(hnode) ) failed: 
<0>LustreError: 28757:0:(hash.c:1252:cfs_hash_find_or_add()) LBUG
<0>Kernel panic - not syncing: LBUG in interrupt.
<0>
<4>Pid: 28757, comm: ll_ost01_008 Not tainted 2.6.32-358.23.2.el6_lustre.ge975b1c.x86_64 #1
<4>Call Trace:
<4> [<ffffffff8150deec>] ? panic+0xa7/0x16f
<4> [<ffffffffa065aedd>] ? lbug_with_loc+0x8d/0xb0 [libcfs]
<4> [<ffffffffa0672d80>] ? cfs_hash_findadd_unique+0x0/0x30 [libcfs]
<4> [<ffffffffa0672d98>] ? cfs_hash_findadd_unique+0x18/0x30 [libcfs]
<4> [<ffffffffa0c83c76>] ? nrs_orr_res_get+0x696/0xb90 [ptlrpc]
<4> [<ffffffff81055ad3>] ? __wake_up+0x53/0x70
<4> [<ffffffffa0c79e36>] ? nrs_resource_get+0x56/0x110 [ptlrpc]
<4> [<ffffffffa0c37d95>] ? lustre_msg_buf+0x55/0x60 [ptlrpc]
<4> [<ffffffffa0c7a7fb>] ? nrs_resource_get_safe+0x8b/0x100 [ptlrpc]
<4> [<ffffffffa0c7ce38>] ? ptlrpc_nrs_req_hp_move+0x68/0x210 [ptlrpc]
<4> [<ffffffffa0c5f845>] ? req_capsule_client_get+0x15/0x20 [ptlrpc]
<4> [<ffffffffa0c1a158>] ? ldlm_server_blocking_ast+0x228/0x880 [ptlrpc]
<4> [<ffffffffa0c8f65b>] ? tgt_blocking_ast+0x7b/0x5e0 [ptlrpc]
<4> [<ffffffffa0beb1ba>] ? ldlm_add_bl_work_item+0x8a/0x1e0 [ptlrpc]
<4> [<ffffffffa0bee405>] ? ldlm_add_ast_work_item+0x55/0x180 [ptlrpc]
<4> [<ffffffffa0bed38d>] ? ldlm_work_bl_ast_lock+0xdd/0x290 [ptlrpc]
<4> [<ffffffffa0c2e3bc>] ? ptlrpc_set_wait+0x6c/0x860 [ptlrpc]
<4> [<ffffffff811685ac>] ? __kmalloc+0x20c/0x220
<4> [<ffffffffa0c2b06a>] ? ptlrpc_prep_set+0xfa/0x2f0 [ptlrpc]
<4> [<ffffffffa0bed2b0>] ? ldlm_work_bl_ast_lock+0x0/0x290 [ptlrpc]
<4> [<ffffffffa0bf006b>] ? ldlm_run_ast_work+0x1bb/0x470 [ptlrpc]
<4> [<ffffffffa0c070ad>] ? ldlm_process_extent_lock+0x13d/0xa90 [ptlrpc]
<4> [<ffffffffa0bef5ab>] ? ldlm_lock_enqueue+0x3fb/0x920 [ptlrpc]
<4> [<ffffffffa0c18c4f>] ? ldlm_handle_enqueue0+0x4ef/0x10a0 [ptlrpc]
<4> [<ffffffffa0c92562>] ? tgt_enqueue+0x62/0x1d0 [ptlrpc]
<4> [<ffffffffa0c94f5a>] ? tgt_handle_request0+0x2ea/0x1490 [ptlrpc]
<4> [<ffffffffa065b4ce>] ? cfs_timer_arm+0xe/0x10 [libcfs]
<4> [<ffffffffa066c3af>] ? lc_watchdog_touch+0x6f/0x170 [libcfs]
<4> [<ffffffffa0c3792c>] ? lustre_msg_get_opc+0x9c/0x110 [ptlrpc]
<4> [<ffffffffa0c9653a>] ? tgt_request_handle+0x43a/0x980 [ptlrpc]
<4> [<ffffffffa0c4a295>] ? ptlrpc_main+0xd25/0x1970 [ptlrpc]
<4> [<ffffffff810096f0>] ? __switch_to+0xd0/0x320
<4> [<ffffffff8150e600>] ? thread_return+0x4e/0x76e
<4> [<ffffffffa0c49570>] ? ptlrpc_main+0x0/0x1970 [ptlrpc]
<4> [<ffffffff81096a36>] ? kthread+0x96/0xa0
<4> [<ffffffff8100c0ca>] ? child_rip+0xa/0x20
<4> [<ffffffff810969a0>] ? kthread+0x0/0xa0
<4> [<ffffffff8100c0c0>] ? child_rip+0x0/0x20


 Comments   
Comment by Shuichi Ihara (Inactive) [ 16/Jan/14 ]

ORR (Object-Based Round Robin) is turned on as the NRS policy in this testing.

Comment by Peter Jones [ 17/Jan/14 ]

Lai

could you please help with this one?

Thanks

Peter

Comment by Lai Siyao [ 29/Jan/14 ]

This doesn't look likely to happen: although orro->oo_hnode is not explicitly initialised before use, it is zeroed at allocation time by default. In any case I composed a patch to initialise it; could you apply it and test on your system?

Patch is on http://review.whamcloud.com/#/c/9046/
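
For context, a minimal sketch of the idea behind that patch (hypothetical helper name, not verbatim Lustre source): hlist_unhashed() is true only while the node's pprev pointer is NULL, which both a zeroed allocation and an explicit INIT_HLIST_NODE() guarantee, so the assertion in cfs_hash_find_or_add() should never see a freshly allocated node as "hashed":

/* Hedged sketch, not the actual patch: explicitly initialise the hash
 * node before the object can ever reach cfs_hash_find_or_add(). */
static struct nrs_orr_object *nrs_orr_object_alloc(struct kmem_cache *cache)
{
	struct nrs_orr_object *orro;

	OBD_SLAB_ALLOC_PTR_GFP(orro, cache, GFP_NOFS);	/* zeroed by default */
	if (orro == NULL)
		return NULL;

	INIT_HLIST_NODE(&orro->oo_hnode); /* pprev = NULL => hlist_unhashed() */
	return orro;
}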

Comment by Shuichi Ihara (Inactive) [ 03/Apr/14 ]

Patch http://review.whamcloud.com/#/c/9046/ doesn't help; we hit another crash.
Just change the NRS policy to ORR (lctl set_param ost.OSS.ost_io.nrs_policies=orr) and run IOR from a client, and the OSS crashes.

<4>------------[ cut here ]------------
<2>kernel BUG at mm/slab.c:2835!
<4>invalid opcode: 0000 [#1] SMP 
<4>last sysfs file: /sys/devices/pci0000:00/0000:00:03.0/0000:05:00.0/host7/target7:0:0/7:0:0:2/state
<4>CPU 3 
<4>Modules linked in: osp(U) ofd(U) ost(U) mgc(U) fsfilt_ldiskfs(U) osd_ldiskfs(U) ldiskfs(U) lquota(U) jbd2 mdd(U) fid(U) fld(U) ptlrpc(U) ko2iblnd(U) obdclass(U) lnet(U) lvfs(U) sha512_generic sha256_generic crc32c_intel libcfs(U) ib_srp(U) scsi_transport_srp(U) bridge stp llc ipmi_devintf dell_rbu nfs lockd fscache auth_rpcgss nfs_acl sunrpc rdma_ucm(U) ib_ucm(U) rdma_cm(U) iw_cm(U) ib_ipoib(U) ib_cm(U) ib_uverbs(U) ib_umad(U) mlx5_ib(U) mlx5_core(U) mlx4_en(U) mlx4_ib(U) ib_sa(U) ib_mad(U) ib_core(U) ib_addr(U) ipv6 mlx4_core(U) compat(U) dm_round_robin dm_multipath vhost_net macvtap macvlan tun kvm knem(U) power_meter ses enclosure sg shpchp tg3 dcdbas microcode iTCO_wdt iTCO_vendor_support ext3 jbd mbcache sr_mod cdrom sd_mod crc_t10dif ahci wmi megaraid_sas dm_mirror dm_region_hash dm_log dm_mod [last unloaded: speedstep_lib]
<4>
<4>Pid: 1868, comm: ll_ost_io01_002 Not tainted 2.6.32-358.23.2.el6_lustre.x86_64 #1 Dell Inc. PowerEdge R620/01W23F
<4>RIP: 0010:[<ffffffff81167473>]  [<ffffffff81167473>] cache_grow+0x313/0x320
<4>RSP: 0018:ffff881fdf96fc10  EFLAGS: 00010002
<4>RAX: ffff883f50d87c80 RBX: ffff881f879c1b80 RCX: 0000000000000000
<4>RDX: 0000000000000001 RSI: 0000000000041212 RDI: ffff881f879c1b80
<4>RBP: ffff881fdf96fc70 R08: 0000000000000000 R09: 000000000000dee6
<4>R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000041212
<4>R13: ffff883f50d87c40 R14: 0000000000000010 R15: 0000000000000000
<4>FS:  00007f7bccdf8700(0000) GS:ffff8820f0c20000(0000) knlGS:0000000000000000
<4>CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
<4>CR2: 0000000000481000 CR3: 00000040517c6000 CR4: 00000000001407e0
<4>DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
<4>DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
<4>Process ll_ost_io01_002 (pid: 1868, threadinfo ffff881fdf96e000, task ffff88203554e040)
<4>Stack:
<4> 00000000533d4823 0000000000000000 0000000000800000 0800000000000000
<4><d> 0000000000000000 0000000000002000 000000000000127e ffff881f879c1b80
<4><d> ffff883fd9162800 ffff883f50d87c40 0000000000000010 ffff883f50d87c60
<4>Call Trace:
<4> [<ffffffff81167682>] cache_alloc_refill+0x202/0x240
<4> [<ffffffff8116714e>] kmem_cache_alloc_node+0x1be/0x1d0
<4> [<ffffffffa049e8a1>] cfs_mem_cache_cpt_alloc+0x41/0x50 [libcfs]
<4> [<ffffffffa09ee769>] nrs_orr_res_get+0x5d9/0xba0 [ptlrpc]
<4> [<ffffffffa09e4b56>] nrs_resource_get+0x56/0x110 [ptlrpc]
<4> [<ffffffffa09a5860>] ? lustre_swab_niobuf_remote+0x0/0x30 [ptlrpc]
<4> [<ffffffffa09e551b>] nrs_resource_get_safe+0x8b/0x100 [ptlrpc]
<4> [<ffffffffa09e7a48>] ptlrpc_nrs_req_initialize+0x38/0x90 [ptlrpc]
<4> [<ffffffffa09b4e00>] ptlrpc_main+0x1180/0x1700 [ptlrpc]
<4> [<ffffffffa09b3c80>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
<4> [<ffffffff8100c0ca>] child_rip+0xa/0x20
<4> [<ffffffffa09b3c80>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
<4> [<ffffffffa09b3c80>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
<4> [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
<4>Code: 0f 1f 84 00 00 00 00 00 49 8d 54 24 30 48 c7 c0 fc ff ff ff 48 89 55 c8 e9 e1 fe ff ff 0f 0b eb fe ba 01 00 00 00 e9 2a fe ff ff <0f> 0b eb fe 66 0f 1f 84 00 00 00 00 00 55 48 89 e5 41 57 41 56 
<1>RIP  [<ffffffff81167473>] cache_grow+0x313/0x320
<4> RSP <ffff881fdf96fc10>
Comment by Lai Siyao [ 21/Apr/14 ]

This time it hits a kernel BUG at mm/slab.c:2835:

BUG_ON(flags & GFP_SLAB_BUG_MASK)

I don't see how this can happen yet: on entry, kmem_cache_alloc_node() has already adjusted "flags" with "gfp_allowed_mask", and the only flags the Lustre code will set are GFP_ATOMIC, GFP_NOFS and __GFP_ZERO, which are all valid.

I will do more testing to find out the cause.
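
To make the check concrete: BUG_ON(flags & GFP_SLAB_BUG_MASK) traps any flag bit outside the legal GFP set. Below is a standalone userspace illustration with made-up mask values (not kernel source); valid flags such as the ones listed above pass the mask, and only a corrupted "flags" value can trip it:

#include <assert.h>
#include <stdio.h>

#define GFP_BITS_MASK	0x01ffffffu		/* pretend: all legal GFP bits  */
#define SLAB_BUG_MASK	(~GFP_BITS_MASK)	/* anything outside is a bug    */

int main(void)
{
	unsigned int good = 0x000000d0u;	/* a GFP_NOFS-like value        */
	unsigned int bad  = 0xdeadbeefu;	/* garbage read back as "flags" */

	assert(!(good & SLAB_BUG_MASK));	/* valid Lustre flags pass      */
	printf("valid flags pass the mask check\n");
	assert(!(bad & SLAB_BUG_MASK));		/* aborts, like the BUG_ON here */
	return 0;
}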

Comment by Lai Siyao [ 06/Jan/15 ]

This may be a duplicate of LU-4362. Could you verify that http://review.whamcloud.com/#/c/8509/ is included in your branch? If not, you may apply it and test again.

Comment by James Nunez (Inactive) [ 12/May/15 ]

In the latest version of master, I'm seeing this LBUG in sanityn test 77c. Is this related to LU-6558?

Here are some recent test sessions that have failed with this LBUG:
review-zfs - https://testing.hpdd.intel.com/test_sets/63ca7db8-f830-11e4-a933-5254006e85c2
review-zfs - https://testing.hpdd.intel.com/test_sets/2d786c4c-f8b8-11e4-bb24-5254006e85c2

Comment by Jinshan Xiong (Inactive) [ 13/May/15 ]

hit again at: https://testing.hpdd.intel.com/test_logs/05f59e66-f989-11e4-939f-5254006e85c2/show_text

with the following backtrace:

00:49:01:LustreError: 12123:0:(hash.c:1256:cfs_hash_find_or_add()) ASSERTION( hlist_unhashed(hnode) ) failed: 
00:49:01:LustreError: 12123:0:(hash.c:1256:cfs_hash_find_or_add()) LBUG
00:49:01:Kernel panic - not syncing: LBUG in interrupt.
00:49:01:
00:49:01:Pid: 12123, comm: ll_ost00_004 Tainted: P           ---------------    2.6.32-504.16.2.el6_lustre.gd805a88.x86_64 #1
00:49:01:Call Trace:
00:49:01: [<ffffffff81529fbc>] ? panic+0xa7/0x16f
00:49:01: [<ffffffffa0709ebd>] ? lbug_with_loc+0x8d/0xb0 [libcfs]
00:49:01: [<ffffffffa071df20>] ? cfs_hash_findadd_unique+0x0/0x30 [libcfs]
00:49:01: [<ffffffffa071df38>] ? cfs_hash_findadd_unique+0x18/0x30 [libcfs]
00:49:01: [<ffffffffa0aceb4b>] ? nrs_orr_res_get+0x43b/0xc30 [ptlrpc]
00:49:01: [<ffffffffa0ac4fb6>] ? nrs_resource_get+0x56/0x110 [ptlrpc]
00:49:01: [<ffffffffa0a83665>] ? lustre_msg_buf+0x55/0x60 [ptlrpc]
00:49:01: [<ffffffffa0ac597b>] ? nrs_resource_get_safe+0x8b/0x100 [ptlrpc]
00:49:01: [<ffffffffa0ac807b>] ? ptlrpc_nrs_req_hp_move+0x6b/0x210 [ptlrpc]
00:49:01: [<ffffffffa0aab445>] ? req_capsule_client_get+0x15/0x20 [ptlrpc]
00:49:01: [<ffffffffa0a633f8>] ? ldlm_server_blocking_ast+0x228/0x8b0 [ptlrpc]
00:49:01: [<ffffffffa0ae1ba1>] ? tgt_blocking_ast+0x1b1/0x8b0 [ptlrpc]
00:49:01: [<ffffffff812975c4>] ? snprintf+0x34/0x40
00:49:01: [<ffffffffa0a36dbd>] ? ldlm_work_bl_ast_lock+0xdd/0x290 [ptlrpc]
00:49:01: [<ffffffffa0a791c4>] ? ptlrpc_set_wait+0x74/0x900 [ptlrpc]
00:49:01: [<ffffffff81174c13>] ? kmem_cache_alloc_trace+0x1b3/0x1c0
00:49:01: [<ffffffff81174f6c>] ? __kmalloc+0x21c/0x230
00:49:01: [<ffffffffa0a758d2>] ? ptlrpc_prep_set+0x112/0x2e0 [ptlrpc]
00:49:01: [<ffffffffa0a36ce0>] ? ldlm_work_bl_ast_lock+0x0/0x290 [ptlrpc]
00:49:01: [<ffffffffa0a38f7b>] ? ldlm_run_ast_work+0x1db/0x470 [ptlrpc]
00:49:01: [<ffffffffa0a50685>] ? ldlm_process_extent_lock+0x155/0xab0 [ptlrpc]
00:49:01: [<ffffffffa0a3883b>] ? ldlm_lock_enqueue+0x46b/0x9d0 [ptlrpc]
00:49:01: [<ffffffffa0a6472b>] ? ldlm_handle_enqueue0+0x51b/0x13f0 [ptlrpc]
00:49:01: [<ffffffffa0ae56b1>] ? tgt_enqueue+0x61/0x230 [ptlrpc]
00:49:01: [<ffffffffa0ae61ce>] ? tgt_request_handle+0x94e/0x10a0 [ptlrpc]
00:49:01: [<ffffffffa0a95bf1>] ? ptlrpc_main+0xe41/0x1970 [ptlrpc]
00:49:01: [<ffffffffa0a94db0>] ? ptlrpc_main+0x0/0x1970 [ptlrpc]
00:49:01: [<ffffffff8109e71e>] ? kthread+0x9e/0xc0
00:49:01: [<ffffffff8100c20a>] ? child_rip+0xa/0x20
00:49:01: [<ffffffff8109e680>] ? kthread+0x0/0xc0
00:49:01: [<ffffffff8100c200>] ? child_rip+0x0/0x20

Comment by Jinshan Xiong (Inactive) [ 13/May/15 ]

It seems like a memory corruption issue.

Comment by Jian Yu [ 21/May/15 ]

More instances on master branch:
https://testing.hpdd.intel.com/test_sets/d086aa90-fffa-11e4-a3db-5254006e85c2
https://testing.hpdd.intel.com/test_sets/2d9d9918-ff2f-11e4-be81-5254006e85c2
https://testing.hpdd.intel.com/test_sets/b8f327a0-fec9-11e4-a4ed-5254006e85c2
https://testing.hpdd.intel.com/test_sets/d505ade0-fe8e-11e4-919c-5254006e85c2

Comment by Bob Glossman (Inactive) [ 22/May/15 ]

another on master:
https://testing.hpdd.intel.com/test_sets/3916eec4-00c5-11e5-9650-5254006e85c2

Comment by Andreas Dilger [ 26/May/15 ]

This is failing about twice a day on average. It has been hitting regularly since 2015-05-04, so maybe some patch that landed within the previous day or two caused this problem to hit more often?

In hindsight, that is because the NRS testing was enabled via http://review.whamcloud.com/9286 "LU-3266 test: regression tests for nrs policies" on 2015-05-01, so the bug itself has probably been around a long time.

Comment by Sebastien Buisson (Inactive) [ 04/Jun/15 ]

Hi,

One more instance:
https://testing.hpdd.intel.com/test_sets/320f7778-09c5-11e5-8421-5254006e85c2

Comment by Bob Glossman (Inactive) [ 19/Jun/15 ]

another on master:
https://testing.hpdd.intel.com/test_sets/92cbbd1a-16d1-11e5-8436-5254006e85c2

Comment by James Nunez (Inactive) [ 30/Jun/15 ]

Several recent failures on master:
2015-06-25 04:16:38 - https://testing.hpdd.intel.com/test_sets/3777cfde-1b55-11e5-ac09-5254006e85c2
2015-06-25 10:17:54 - https://testing.hpdd.intel.com/test_sets/a76ea20e-1b9b-11e5-ac09-5254006e85c2
2015-06-26 15:02:25 - https://testing.hpdd.intel.com/test_sets/474e2b02-1c76-11e5-9e33-5254006e85c2
2015-06-27 09:20:38 - https://testing.hpdd.intel.com/test_sets/a148e48a-1d16-11e5-9df2-5254006e85c2
2015-06-29 10:47:41 - https://testing.hpdd.intel.com/test_sets/923d2e80-1eae-11e5-8f20-5254006e85c2

Comment by Alex Zhuravlev [ 15/Jul/15 ]

https://testing.hpdd.intel.com/test_sets/1bbf8b9a-2a6b-11e5-b04d-5254006e85c2

Comment by Bob Glossman (Inactive) [ 17/Jul/15 ]

another on master:
https://testing.hpdd.intel.com/test_sets/d2ef2ba8-2c15-11e5-8c67-5254006e85c2

Comment by Alex Zhuravlev [ 20/Jul/15 ]

https://testing.hpdd.intel.com/test_sets/f8aae4b0-2d58-11e5-831c-5254006e85c2

Comment by Henri Doreau (Inactive) [ 21/Jul/15 ]

This is not an ORR-only issue; we hit it with CRR-N too, on an MDS:

#0 [ffff880ffaa3f4c8] machine_kexec at ffffffff8103b5bb
#1 [ffff880ffaa3f528] crash_kexec at ffffffff810c9c82
#2 [ffff880ffaa3f5f8] panic at ffffffff81529b1e
#3 [ffff880ffaa3f678] lbug_with_loc at ffffffffa038aedd [libcfs]
#4 [ffff880ffaa3f6f8] cfs_hash_findadd_unique at ffffffffa03a2db8 [libcfs]
#5 [ffff880ffaa3f718] nrs_crrn_res_get at ffffffffa06a7a93 [ptlrpc]
#6 [ffff880ffaa3f758] nrs_resource_get at ffffffffa06a1116 [ptlrpc]
#7 [ffff880ffaa3f7b8] nrs_resource_get_safe at ffffffffa06a1adb [ptlrpc]
#8 [ffff880ffaa3f7f8] ptlrpc_nrs_req_hp_move at ffffffffa06a42b8 [ptlrpc]
#9 [ffff880ffaa3f848] ldlm_server_blocking_ast at ffffffffa0642018 [ptlrpc]
#10 [ffff880ffaa3f898] ldlm_work_bl_ast_lock at ffffffffa061538d [ptlrpc]
#11 [ffff880ffaa3f918] ptlrpc_set_wait at ffffffffa065648c [ptlrpc]
#12 [ffff880ffaa3f9b8] ldlm_run_ast_work at ffffffffa061800b [ptlrpc]
#13 [ffff880ffaa3f9e8] ldlm_process_inodebits_lock at ffffffffa0646507 [ptlrpc]
#14 [ffff880ffaa3fa68] ldlm_lock_enqueue at ffffffffa06175b5 [ptlrpc]
#15 [ffff880ffaa3fac8] ldlm_cli_enqueue_local at ffffffffa0636b53 [ptlrpc]
#16 [ffff880ffaa3fb48] mdt_object_lock0 at ffffffffa0c988f6 [mdt]
#17 [ffff880ffaa3fbf8] mdt_object_lock at ffffffffa0c99334 [mdt]
#18 [ffff880ffaa3fc08] mdt_reint_unlink at ffffffffa0cb1cce [mdt]
#19 [ffff880ffaa3fc88] mdt_reint_rec at ffffffffa0cae671 [mdt]
#20 [ffff880ffaa3fca8] mdt_reint_internal at ffffffffa0c93cb3 [mdt]
#21 [ffff880ffaa3fce8] mdt_reint at ffffffffa0c93fb4 [mdt]
#22 [ffff880ffaa3fd08] mdt_handle_common at ffffffffa0c96aba [mdt]
#23 [ffff880ffaa3fd58] mds_regular_handle at ffffffffa0cd3985 [mdt]
#24 [ffff880ffaa3fd68] ptlrpc_server_handle_request at ffffffffa0670cf5 [ptlrpc]
#25 [ffff880ffaa3fe48] ptlrpc_main at ffffffffa067205d [ptlrpc]
#26 [ffff880ffaa3fee8] kthread at ffffffff8109e66e
#27 [ffff880ffaa3ff48] kernel_thread at ffffffff8100c20a

Comment by Bruno Faccini (Inactive) [ 21/Jul/15 ]

Hello Henri, I think CRR-N needs the same kind of patch that Lai pushed for ORR!
Lai, why did you decide to abandon your original patch?

Comment by Jinshan Xiong (Inactive) [ 21/Jul/15 ]

Bruno,

That patch didn't address the problem. This looks like a memory corruption issue: most likely a piece of freed memory was accessed and written again, and later on this piece of memory was allocated and used by NRS.
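
As a standalone illustration of that failure mode (toy userspace code, not Lustre source): an object freed while still carrying its hash linkage, or handed back by the allocator full of poison, no longer looks "unhashed", which is exactly what the LASSERT in cfs_hash_find_or_add() checks:

#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct hnode { struct hnode *next, **pprev; };
#define UNHASHED(n) ((n)->pprev == NULL)	/* same test as hlist_unhashed() */

int main(void)
{
	struct hnode *a = calloc(1, sizeof(*a));	/* zeroed => unhashed    */
	assert(UNHASHED(a));				/* passes                */

	a->pprev = &a->next;	/* "hash insert": the node is now linked         */
	free(a);		/* freed while still linked in the hash          */

	/* A later allocation may get the same chunk back; slab debugging
	 * fills freed memory with 0x5a poison instead. Either way pprev
	 * is non-NULL garbage, so the new "object" no longer looks unhashed. */
	struct hnode *b = malloc(sizeof(*b));
	memset(b, 0x5a, sizeof(*b));			/* mimic slab poison     */
	assert(UNHASHED(b));				/* aborts, like the LBUG */
	return 0;
}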

Comment by Bruno Faccini (Inactive) [ 21/Jul/15 ]

Just to be complete, one of my auto-test sessions (https://testing.hpdd.intel.com/test_sessions/01e63cdc-2fa7-11e5-97d6-5254006e85c2) hit the original LBUG/problem for this ticket, and it looks like the whole struct nrs_orr_object that had just been (re?)allocated was poisoned, causing the LBUG:

The affected struct nrs_orr_object, at 0xffff88006e3c2fa0:
$4 = {
  oo_res = {
    res_parent = 0x5a5a5a5a5a5a5a5a, 
    res_policy = 0x5a5a5a5a5a5a5a5a
  }, 
  oo_hnode = {
    next = 0x5a5a5a5a5a5a5a5a, 
    pprev = 0x5a5a5a5a5a5a5a5a  <<<< causing the LBUG!
  }, 
  oo_round = 0x5a5a5a5a5a5a5a5a, 
  oo_sequence = 0x5a5a5a5a5a5a5a5a, 
  oo_key = {                <<<<<<<<  has just been initialized in nrs_orr_res_get()
    {
      ok_fid = {
        f_seq = 0x100000000, 
        f_oid = 0x794e, 
        f_ver = 0x0
      }, 
      ok_idx = 0x0
    }
  }, 
  oo_ref = 0x1,         <<<<<<<<  has just been initialized in nrs_orr_res_get()
  oo_quantum = 0x5a5a, 
  oo_active = 0x5a5a
}

Slab containing this struct:
CACHE            NAME                 OBJSIZE  ALLOCATED     TOTAL  SLABS  SSIZE
ffff880079bb4500 nrs_orr_hp_0              80          2        48      1     4k
SLAB              MEMORY            TOTAL  ALLOCATED  FREE
ffff88006e3c2000  ffff88006e3c20f0     48          2    46
FREE / [ALLOCATED]
  [ffff88006e3c2fa0]

      PAGE       PHYSICAL      MAPPING       INDEX CNT FLAGS
ffffea000181d270 6e3c2000                0 ffff88003757ad80  1 20000000000080

The only other poisoned memory in the same slab:
ffff88006e3c25a0:  5a5a5a5a5a5a5a5a 5a5a5a5a5a5a5a5a   ZZZZZZZZZZZZZZZZ
ffff88006e3c25b0:  5a5a5a5a5a5a5a5a 5a5a5a5a5a5a5a5a   ZZZZZZZZZZZZZZZZ
ffff88006e3c25c0:  5a5a5a5a5a5a5a5a 5a5a5a5a5a5a5a5a   ZZZZZZZZZZZZZZZZ
ffff88006e3c25d0:  5a5a5a5a5a5a5a5a 5a5a5a5a5a5a5a5a   ZZZZZZZZZZZZZZZZ
ffff88006e3c25e0:  5a5a5a5a5a5a5a5a 5a5a5a5a5a5a5a5a   ZZZZZZZZZZZZZZZZ
ffff88006e3c2fa0:  5a5a5a5a5a5a5a5a 5a5a5a5a5a5a5a5a   ZZZZZZZZZZZZZZZZ
ffff88006e3c2fb0:  5a5a5a5a5a5a5a5a 5a5a5a5a5a5a5a5a   ZZZZZZZZZZZZZZZZ
ffff88006e3c2fc0:  5a5a5a5a5a5a5a5a 5a5a5a5a5a5a5a5a   ZZZZZZZZZZZZZZZZ
ffff88006e3c2fe0:  0000000000000001 5a5a5a5a5a5a5a5a   ........ZZZZZZZZ

Where:
CACHE            NAME                 OBJSIZE  ALLOCATED     TOTAL  SLABS  SSIZE
ffff880079bb4500 nrs_orr_hp_0              80          2        48      1     4k
SLAB              MEMORY            TOTAL  ALLOCATED  FREE
ffff88006e3c2000  ffff88006e3c20f0     48          2    46
FREE / [ALLOCATED]
   ffff88006e3c25a0  (cpu 0 cache)

      PAGE       PHYSICAL      MAPPING       INDEX CNT FLAGS
ffffea000181d270 6e3c2000                0 ffff88003757ad80  1 20000000000080

So the only poisoned objects are another free nrs_orr_object in the same slab and the one just allocated that caused the LBUG!
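
Bruno's annotations line up with the lookup-or-create path visible in the traces. A heavily simplified sketch of that step (paraphrased from the dump and stack traces above, not verbatim Lustre source; "hash", "cache" and "tmp" are placeholder names) shows where the poisoned node meets the assertion:

	/* Sketch of the allocate-then-insert step in nrs_orr_res_get(). */
	orro = cfs_hash_lookup(hash, &key);
	if (orro == NULL) {
		OBD_SLAB_ALLOC_PTR_GFP(orro, cache, GFP_NOFS);
		orro->oo_key = key;	/* initialised, as seen in the dump */
		orro->oo_ref = 1;	/* initialised, as seen in the dump */

		/* cfs_hash_findadd_unique() -> cfs_hash_find_or_add()
		 * asserts hlist_unhashed(&orro->oo_hnode). Here oo_hnode
		 * still holds 0x5a poison although a fresh allocation
		 * should be zeroed -- consistent with the freed-then-
		 * rewritten memory theory above -- so pprev != NULL and
		 * the LASSERT fires. */
		tmp = cfs_hash_findadd_unique(hash, &key, &orro->oo_hnode);
	}
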
Comment by Bruno Faccini (Inactive) [ 21/Jul/15 ]

Hello Jinshan, I agree that this is confusing and looks like corruption, but if you have a look at my crash-dump extracts, it must have been a very precise one!! By the way, even the end of the slab, right after the affected nrs_orr_object, has been preserved:

ffff88006e3c2f60:  0000000000000000 0000000000000000   ................
ffff88006e3c2f70:  0000000000000000 0000000000000000   ................
ffff88006e3c2f80:  0000000000000000 0000000000000000   ................
ffff88006e3c2f90:  0000000000000000 0000000000000000   ................
ffff88006e3c2fa0:  5a5a5a5a5a5a5a5a 5a5a5a5a5a5a5a5a   ZZZZZZZZZZZZZZZZ
ffff88006e3c2fb0:  5a5a5a5a5a5a5a5a 5a5a5a5a5a5a5a5a   ZZZZZZZZZZZZZZZZ
ffff88006e3c2fc0:  5a5a5a5a5a5a5a5a 5a5a5a5a5a5a5a5a   ZZZZZZZZZZZZZZZZ
ffff88006e3c2fd0:  0000000100000000 000000000000794e   ........Ny......
ffff88006e3c2fe0:  0000000000000001 5a5a5a5a5a5a5a5a   ........ZZZZZZZZ
ffff88006e3c2ff0:  0000000000000000 0000000001a87067   ........gp......

Comment by James Nunez (Inactive) [ 23/Jul/15 ]

Another case:
2015-07-22 13:58:00 - https://testing.hpdd.intel.com/test_sets/e15f2588-30bd-11e5-ae23-5254006e85c2

Comment by Bob Glossman (Inactive) [ 27/Jul/15 ]

another on master:
https://testing.hpdd.intel.com/test_sets/ff5a31b8-308b-11e5-aa87-5254006e85c2

Comment by James Nunez (Inactive) [ 29/Jul/15 ]

More from master review-zfs-part-1:
2015-07-28 20:20:41 - https://testing.hpdd.intel.com/test_sets/7b300648-35a3-11e5-b949-5254006e85c2
2015-07-28 22:07:39 - https://testing.hpdd.intel.com/test_sets/8927e85c-35ac-11e5-bbc3-5254006e85c2
2015-07-29 04:24:48 - https://testing.hpdd.intel.com/test_sets/c3dfd67e-35e5-11e5-8c30-5254006e85c2

Comment by Bob Glossman (Inactive) [ 01/Aug/15 ]

another on master:
https://testing.hpdd.intel.com/test_sets/5bfd264e-3886-11e5-9969-5254006e85c2

Comment by Bob Glossman (Inactive) [ 02/Aug/15 ]

another on master:
https://testing.hpdd.intel.com/test_sets/b9aec8d4-38ee-11e5-8dec-5254006e85c2

Comment by Di Wang [ 05/Aug/15 ]

hit again
https://testing.hpdd.intel.com/test_sets/c6cd44f2-3b53-11e5-95fa-5254006e85c2

Comment by Jinshan Xiong (Inactive) [ 06/Aug/15 ]

I made a patch at http://review.whamcloud.com/15670 to reproduce it with the MALLOC debug flag set.

Comment by Jinshan Xiong (Inactive) [ 10/Aug/15 ]

I'm working on this issue.

Comment by Gerrit Updater [ 11/Aug/15 ]

Jinshan Xiong (jinshan.xiong@intel.com) uploaded a new patch: http://review.whamcloud.com/15943
Subject: LU-4499 nrs: adjust the order of REQ NRS initilization
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 263a722ce77cac0fe55f316c100e83b92b451064
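
Reading the patch subject together with the two traces above (one entering NRS via ptlrpc_nrs_req_initialize(), the other via ptlrpc_nrs_req_hp_move()), the fix appears to reorder things so a request's NRS state is fully initialised before the request becomes visible to the high-priority move path. A hedged before/after sketch (placeholder publish_request() name; not the actual diff):

	/* before: request visible first, NRS state initialised second */
	publish_request(req);				/* racy window opens */
	ptlrpc_nrs_req_initialize(svcpt, req, hp);

	/* after (per the subject line): initialise first, then publish,
	 * so ptlrpc_nrs_req_hp_move() never sees a half-set-up request */
	ptlrpc_nrs_req_initialize(svcpt, req, hp);
	publish_request(req);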

Comment by James Nunez (Inactive) [ 24/Aug/15 ]

We hit this bug again:
2015-08-21 10:39:47 - https://testing.hpdd.intel.com/test_sets/3147f89a-4829-11e5-8db5-5254006e85c2

Comment by Gerrit Updater [ 26/Aug/15 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/15943/
Subject: LU-4499 nrs: adjust the order of REQ NRS initilization
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: f904a6617b57eb8b4b90f5bc198bdec758133922

Comment by Joseph Gmitter (Inactive) [ 26/Aug/15 ]

Landed for 2.8.
