[LU-14752] LustreError: 3594004:0:(lu_object.c:2472:lu_object_assign_fid()) ASSERTION( rc == 0 ) failed: failed hashtable insertion: rc = -16 Created: 09/Jun/21 Updated: 09/Nov/21 Resolved: 19/Jul/21 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.14.0 |
| Fix Version/s: | Lustre 2.15.0 |
| Type: | Bug | Priority: | Major |
| Reporter: | Gian-Carlo Defazio | Assignee: | Lai Siyao |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | llnll | ||
| Environment: |
lustre-2.14.0_2.llnl-2.t4.x86_64 |
||
| Issue Links: |
|
||||||||
| Severity: | 3 | ||||||||
| Rank (Obsolete): | 9223372036854775807 | ||||||||
| Description |
|
While running a performance test that uses mdtest, the error LustreError: 3594004:0:(lu_object.c:2472:lu_object_assign_fid()) LBUG has been see several times. Here's an example: 2021-05-21 15:07:06 [1990667.397573] LustreError: 3594004:0:(lu_object.c:2472:lu_object_assign_fid()) ASSERTION( rc == 0 ) failed: failed hashtable insertion: rc = -16 2021-05-21 15:07:06 [1990667.410521] LustreError: 3594004:0:(lu_object.c:2472:lu_object_assign_fid()) LBUG 2021-05-21 15:07:06 [1990667.418250] Pid: 3594004, comm: mdt01_020 4.18.0-240.22.1.1toss.t4.x86_64 #1 SMP Tue Apr 13 17:18:40 PDT 2021 2021-05-21 15:07:06 [1990667.428331] Call Trace TBD: 2021-05-21 15:07:06 [1990667.431303] Kernel panic - not syncing: LBUG 2021-05-21 15:07:06 [1990667.435744] CPU: 49 PID: 3594004 Comm: mdt01_020 Kdump: loaded Tainted: P W OE --------- - - 4.18.0-240.22.1.1toss.t4.x86_64 #1 2021-05-21 15:07:06 [1990667.448421] Hardware name: Supermicro SSG-2029P-DN2R24N-LL013/X11DSN-TS, BIOS 3.4 11/04/2020 2021-05-21 15:07:06 [1990667.457027] Call Trace: 2021-05-21 15:07:06 [1990667.459662] dump_stack+0x5c/0x80 2021-05-21 15:07:06 [1990667.463150] panic+0xe7/0x2a9 2021-05-21 15:07:06 [1990667.466306] lbug_with_loc.cold.10+0x18/0x18 [libcfs] 2021-05-21 15:07:06 [1990667.471566] lu_object_assign_fid+0x3b8/0x3c0 [obdclass] 2021-05-21 15:07:06 [1990667.477056] osp_create+0x471/0xba0 [osp] 2021-05-21 15:07:06 [1990667.481257] ? lod_sub_create+0x244/0x4a0 [lod] 2021-05-21 15:07:06 [1990667.485963] lod_sub_create+0x244/0x4a0 [lod] 2021-05-21 15:07:06 [1990667.490497] lod_striped_create+0x174/0x4c0 [lod] 2021-05-21 15:07:06 [1990667.495377] lod_create+0x218/0x330 [lod] 2021-05-21 15:07:06 [1990667.499559] ? spl_kmem_alloc+0xd5/0x130 [spl] 2021-05-21 15:07:06 [1990667.504186] mdd_create_object_internal+0xa9/0x300 [mdd] 2021-05-21 15:07:06 [1990667.509673] mdd_create_object+0x92/0x870 [mdd] 2021-05-21 15:07:06 [1990667.514380] mdd_create+0xd45/0x1740 [mdd] 2021-05-21 15:07:06 [1990667.518653] ? mdd_links_rename+0x440/0x440 [mdd] 2021-05-21 15:07:06 [1990667.523549] mdt_reint_open+0x225e/0x2f20 [mdt] 2021-05-21 15:07:06 [1990667.528276] ? check_unlink_entry+0x14/0xb0 [obdclass] 2021-05-21 15:07:06 [1990667.533589] ? mdt_root_squash+0x1e/0x3e0 [mdt] 2021-05-21 15:07:06 [1990667.538307] mdt_reint_rec+0x11f/0x250 [mdt] 2021-05-21 15:07:06 [1990667.542759] mdt_reint_internal+0x498/0x780 [mdt] 2021-05-21 15:07:06 [1990667.547649] mdt_intent_open+0x124/0x3c0 [mdt] 2021-05-21 15:07:06 [1990667.552272] ? cfs_hash_bd_from_key.isra.41+0x36/0xb0 [libcfs] 2021-05-21 15:07:06 [1990667.558279] ? mdt_intent_fixup_resent+0x1f0/0x1f0 [mdt] 2021-05-21 15:07:06 [1990667.563766] mdt_intent_opc+0x44d/0xa80 [mdt] 2021-05-21 15:07:06 [1990667.568300] mdt_intent_policy+0x1f6/0x380 [mdt] 2021-05-21 15:07:06 [1990667.573153] ldlm_lock_enqueue+0x4c1/0x9f0 [ptlrpc] 2021-05-21 15:07:06 [1990667.578235] ldlm_handle_enqueue0+0x61a/0x16d0 [ptlrpc] 2021-05-21 15:07:06 [1990667.583659] tgt_enqueue+0xa4/0x1f0 [ptlrpc] 2021-05-21 15:07:06 [1990667.588131] tgt_request_handle+0xc8b/0x1930 [ptlrpc] 2021-05-21 15:07:06 [1990667.593381] ? ptlrpc_nrs_req_get_nolock0+0xfb/0x1f0 [ptlrpc] 2021-05-21 15:07:06 [1990667.599328] ptlrpc_server_handle_request+0x31a/0xba0 [ptlrpc] 2021-05-21 15:07:06 [1990667.605358] ptlrpc_main+0xba9/0x14a0 [ptlrpc] 2021-05-21 15:07:06 [1990667.609974] ? __schedule+0x2cf/0x720 2021-05-21 15:07:06 [1990667.613833] ? ptlrpc_register_service+0xfc0/0xfc0 [ptlrpc] 2021-05-21 15:07:06 [1990667.619574] kthread+0x11d/0x140 2021-05-21 15:07:06 [1990667.622977] ? kthread_flush_work_fn+0x10/0x10 2021-05-21 15:07:06 [1990667.627588] ret_from_fork+0x1f/0x40 The version of lustre used is at https://github.com/llnl/lustre/tree/2.14.0_2.llnl
|
| Comments |
| Comment by Alex Zhuravlev [ 10/Jun/21 ] |
|
anything in the logs before LBUG() ? |
| Comment by Peter Jones [ 10/Jun/21 ] |
|
Lai Could you please review the logs once supplied Thanks Peter |
| Comment by John Hammond [ 10/Jun/21 ] |
|
Could you try reverting aff14dbc522e140945f9207d4e9bbd27d0326819 (LU-8130 lu_object: convert lu_object cache to rhashtable)? |
| Comment by Gian-Carlo Defazio [ 10/Jun/21 ] |
|
I'm working on getting my test setup to match the ones that failed, I'll generate more logs and post them here when I do, and I'll run the test with aff14dbc522e140945f9207d4e9bbd27d0326819 as well. |
| Comment by James A Simmons [ 10/Jun/21 ] |
|
That is a EBUSY error which can happen when the hashtable being resized. BTW can you share your mdtest parameters and setup so people can duplicate it. |
| Comment by Gerrit Updater [ 10/Jun/21 ] |
|
James Simmons (jsimmons@infradead.org) uploaded a new patch: https://review.whamcloud.com/43968 |
| Comment by Gian-Carlo Defazio [ 10/Jun/21 ] |
|
I'm still working on getting the configuration for the tests that were running when the error occurred. Here are some logs from the hour before. 2021-05-21 14:01:26 [1986727.160802] LNet: 29533:0:(router.c:384:lnet_check_route_inconsistency()) route o2ib18->172.19.1.54@o2ib100 is detected to be multi-hop but hop count is set to -1 2021-05-21 14:01:26 [1986727.175487] LNet: 29533:0:(router.c:384:lnet_check_route_inconsistency()) Skipped 39 previous similar messages 2021-05-21 14:11:26 [1987327.224529] LNet: 29532:0:(router.c:384:lnet_check_route_inconsistency()) route o2ib18->172.19.1.54@o2ib100 is detected to be multi-hop but hop count is set to -1 2021-05-21 14:11:26 [1987327.239200] LNet: 29532:0:(router.c:384:lnet_check_route_inconsistency()) Skipped 39 previous similar messages 2021-05-21 14:21:27 [1987928.312076] LNet: 29532:0:(router.c:384:lnet_check_route_inconsistency()) route o2ib18->172.19.1.54@o2ib100 is detected to be multi-hop but hop count is set to -1 2021-05-21 14:21:27 [1987928.326754] LNet: 29532:0:(router.c:384:lnet_check_route_inconsistency()) Skipped 39 previous similar messages 2021-05-21 14:31:27 [1988528.439424] LNet: 29533:0:(router.c:384:lnet_check_route_inconsistency()) route o2ib18->172.19.1.54@o2ib100 is detected to be multi-hop but hop count is set to -1 2021-05-21 14:31:27 [1988528.454105] LNet: 29533:0:(router.c:384:lnet_check_route_inconsistency()) Skipped 39 previous similar messages 2021-05-21 14:41:27 [1989128.567097] LNet: 29532:0:(router.c:384:lnet_check_route_inconsistency()) route o2ib18->172.19.1.55@o2ib100 is detected to be multi-hop but hop count is set to -1 2021-05-21 14:41:27 [1989128.581778] LNet: 29532:0:(router.c:384:lnet_check_route_inconsistency()) Skipped 39 previous similar messages 2021-05-21 14:51:27 [1989728.630972] LNet: 29532:0:(router.c:384:lnet_check_route_inconsistency()) route o2ib18->172.19.1.54@o2ib100 is detected to be multi-hop but hop count is set to -1 2021-05-21 14:51:27 [1989728.645652] LNet: 29532:0:(router.c:384:lnet_check_route_inconsistency()) Skipped 39 previous similar messages<ConMan> Console [garter1] log at 2021-05-21 15:00:00 PDT. 2021-05-21 15:01:28 [1990329.206387] LNet: 29535:0:(router.c:384:lnet_check_route_inconsistency()) route o2ib18->172.19.1.54@o2ib100 is detected to be multi-hop but hop count is set to -1 2021-05-21 15:01:28 [1990329.221064] LNet: 29535:0:(router.c:384:lnet_check_route_inconsistency()) Skipped 39 previous similar messages 2021-05-21 15:07:06 [1990667.397573] LustreError: 3594004:0:(lu_object.c:2472:lu_object_assign_fid()) ASSERTION( rc == 0 ) failed: failed hashtable insertion: rc = -16 |
| Comment by Gian-Carlo Defazio [ 15/Jun/21 ] |
|
I've been having some trouble getting it to consistently reproduce, but I have gotten a few failures with this command in my script my_mdtest_command: ./mdtest '-n' '1000000' '-u' '-L' '-F' '-P' '-N' '1' '-d' '/mnt/lustre/defazio1/mdtest-temp' '-Y' '-W' '300' '-a' 'POSIX' I'm running it with srun -N8 -n32 my_mdtest_command I was getting around 29K open and close ops (all on a single MDT) on ltop at the time of the failures, the the failover node then finished the test. I'm doing larger tests to see if I can get more consistent failures.
|
| Comment by James A Simmons [ 16/Jun/21 ] |
|
Is this without my patch? |
| Comment by Gian-Carlo Defazio [ 16/Jun/21 ] |
|
Yes, this is the unpatched version. |
| Comment by Gerrit Updater [ 30/Jun/21 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/43968/ |
| Comment by Peter Jones [ 30/Jun/21 ] |
|
Patch has landed to master. Is anything else needed here other than porting the patch to b2_12? |
| Comment by Gian-Carlo Defazio [ 06/Jul/21 ] |
|
We don't need anything besides porting to b2_12. The patch should be on the same system where we've been seeing the LustreError soon and we'll see if it continues to happen. |
| Comment by Gerrit Updater [ 07/Jul/21 ] |
|
Gian-Carlo DeFazio (defazio1@llnl.gov) uploaded a new patch: https://review.whamcloud.com/44170 |
| Comment by Olaf Faaland [ 19/Jul/21 ] |
|
> We don't need anything besides porting to b2_12. b2_12 does not have commit "LU-8130 lu_object: convert lu_object cache to rhashtable", and so it still uses a cfs_hash not an rhashtable. So I don't believe that this fix is needed b2_12. |
| Comment by Gerrit Updater [ 09/Nov/21 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/44170/ |