[LU-14752] LustreError: 3594004:0:(lu_object.c:2472:lu_object_assign_fid()) ASSERTION( rc == 0 ) failed: failed hashtable insertion: rc = -16 Created: 09/Jun/21  Updated: 09/Nov/21  Resolved: 19/Jul/21

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.14.0
Fix Version/s: Lustre 2.15.0

Type: Bug Priority: Major
Reporter: Gian-Carlo Defazio Assignee: Lai Siyao
Resolution: Fixed Votes: 0
Labels: llnll
Environment:

lustre-2.14.0_2.llnl-2.t4.x86_64
zfs-2.0.52_2llnl-2.t4.x86_6
kernel 4.18.0-240.22.1.1toss.t4.x86_64


Issue Links:
Related
is related to LU-8130 Migrate from libcfs hash to rhashtable Open
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

While running a performance test that uses mdtest, the error

LustreError: 3594004:0:(lu_object.c:2472:lu_object_assign_fid()) LBUG

 has been see several times. Here's an example:

2021-05-21 15:07:06 [1990667.397573] LustreError: 3594004:0:(lu_object.c:2472:lu_object_assign_fid()) ASSERTION( rc == 0 ) failed: failed hashtable insertion: rc = -16
2021-05-21 15:07:06 [1990667.410521] LustreError: 3594004:0:(lu_object.c:2472:lu_object_assign_fid()) LBUG
2021-05-21 15:07:06 [1990667.418250] Pid: 3594004, comm: mdt01_020 4.18.0-240.22.1.1toss.t4.x86_64 #1 SMP Tue Apr 13 17:18:40 PDT 2021
2021-05-21 15:07:06 [1990667.428331] Call Trace TBD:
2021-05-21 15:07:06 [1990667.431303] Kernel panic - not syncing: LBUG
2021-05-21 15:07:06 [1990667.435744] CPU: 49 PID: 3594004 Comm: mdt01_020 Kdump: loaded Tainted: P        W  OE    --------- -  - 4.18.0-240.22.1.1toss.t4.x86_64 #1
2021-05-21 15:07:06 [1990667.448421] Hardware name: Supermicro SSG-2029P-DN2R24N-LL013/X11DSN-TS, BIOS 3.4 11/04/2020
2021-05-21 15:07:06 [1990667.457027] Call Trace:
2021-05-21 15:07:06 [1990667.459662]  dump_stack+0x5c/0x80
2021-05-21 15:07:06 [1990667.463150]  panic+0xe7/0x2a9
2021-05-21 15:07:06 [1990667.466306]  lbug_with_loc.cold.10+0x18/0x18 [libcfs]
2021-05-21 15:07:06 [1990667.471566]  lu_object_assign_fid+0x3b8/0x3c0 [obdclass]
2021-05-21 15:07:06 [1990667.477056]  osp_create+0x471/0xba0 [osp]
2021-05-21 15:07:06 [1990667.481257]  ? lod_sub_create+0x244/0x4a0 [lod]
2021-05-21 15:07:06 [1990667.485963]  lod_sub_create+0x244/0x4a0 [lod]
2021-05-21 15:07:06 [1990667.490497]  lod_striped_create+0x174/0x4c0 [lod]
2021-05-21 15:07:06 [1990667.495377]  lod_create+0x218/0x330 [lod]
2021-05-21 15:07:06 [1990667.499559]  ? spl_kmem_alloc+0xd5/0x130 [spl]
2021-05-21 15:07:06 [1990667.504186]  mdd_create_object_internal+0xa9/0x300 [mdd]
2021-05-21 15:07:06 [1990667.509673]  mdd_create_object+0x92/0x870 [mdd]
2021-05-21 15:07:06 [1990667.514380]  mdd_create+0xd45/0x1740 [mdd]
2021-05-21 15:07:06 [1990667.518653]  ? mdd_links_rename+0x440/0x440 [mdd]
2021-05-21 15:07:06 [1990667.523549]  mdt_reint_open+0x225e/0x2f20 [mdt]
2021-05-21 15:07:06 [1990667.528276]  ? check_unlink_entry+0x14/0xb0 [obdclass]
2021-05-21 15:07:06 [1990667.533589]  ? mdt_root_squash+0x1e/0x3e0 [mdt]
2021-05-21 15:07:06 [1990667.538307]  mdt_reint_rec+0x11f/0x250 [mdt]
2021-05-21 15:07:06 [1990667.542759]  mdt_reint_internal+0x498/0x780 [mdt]
2021-05-21 15:07:06 [1990667.547649]  mdt_intent_open+0x124/0x3c0 [mdt]
2021-05-21 15:07:06 [1990667.552272]  ? cfs_hash_bd_from_key.isra.41+0x36/0xb0 [libcfs]
2021-05-21 15:07:06 [1990667.558279]  ? mdt_intent_fixup_resent+0x1f0/0x1f0 [mdt]
2021-05-21 15:07:06 [1990667.563766]  mdt_intent_opc+0x44d/0xa80 [mdt]
2021-05-21 15:07:06 [1990667.568300]  mdt_intent_policy+0x1f6/0x380 [mdt]
2021-05-21 15:07:06 [1990667.573153]  ldlm_lock_enqueue+0x4c1/0x9f0 [ptlrpc]
2021-05-21 15:07:06 [1990667.578235]  ldlm_handle_enqueue0+0x61a/0x16d0 [ptlrpc]
2021-05-21 15:07:06 [1990667.583659]  tgt_enqueue+0xa4/0x1f0 [ptlrpc]
2021-05-21 15:07:06 [1990667.588131]  tgt_request_handle+0xc8b/0x1930 [ptlrpc]
2021-05-21 15:07:06 [1990667.593381]  ? ptlrpc_nrs_req_get_nolock0+0xfb/0x1f0 [ptlrpc]
2021-05-21 15:07:06 [1990667.599328]  ptlrpc_server_handle_request+0x31a/0xba0 [ptlrpc]
2021-05-21 15:07:06 [1990667.605358]  ptlrpc_main+0xba9/0x14a0 [ptlrpc]
2021-05-21 15:07:06 [1990667.609974]  ? __schedule+0x2cf/0x720
2021-05-21 15:07:06 [1990667.613833]  ? ptlrpc_register_service+0xfc0/0xfc0 [ptlrpc]
2021-05-21 15:07:06 [1990667.619574]  kthread+0x11d/0x140
2021-05-21 15:07:06 [1990667.622977]  ? kthread_flush_work_fn+0x10/0x10
2021-05-21 15:07:06 [1990667.627588]  ret_from_fork+0x1f/0x40 

The version of lustre used is at https://github.com/llnl/lustre/tree/2.14.0_2.llnl

 



 Comments   
Comment by Alex Zhuravlev [ 10/Jun/21 ]

anything in the logs before LBUG() ?

Comment by Peter Jones [ 10/Jun/21 ]

Lai

Could you please review the logs once supplied

Thanks

Peter

Comment by John Hammond [ 10/Jun/21 ]

Could you try reverting aff14dbc522e140945f9207d4e9bbd27d0326819 (LU-8130 lu_object: convert lu_object cache to rhashtable)?

Comment by Gian-Carlo Defazio [ 10/Jun/21 ]

I'm working on getting my test setup to match the ones that failed, I'll generate more logs and post them here when I do, and I'll run the test with aff14dbc522e140945f9207d4e9bbd27d0326819 as well.

Comment by James A Simmons [ 10/Jun/21 ]

That is a EBUSY error which can happen when the hashtable being resized. BTW can you share your mdtest parameters and setup so people can duplicate it.

Comment by Gerrit Updater [ 10/Jun/21 ]

James Simmons (jsimmons@infradead.org) uploaded a new patch: https://review.whamcloud.com/43968
Subject: LU-14752 obdclass: handle EBUSY returned for lu_object hashtable
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: f58d666acab032784d9cfaddf34d0cf587a6f4af

Comment by Gian-Carlo Defazio [ 10/Jun/21 ]

I'm still working on getting the configuration for the tests that were running when the error occurred.

Here are some logs from the hour before.

2021-05-21 14:01:26 [1986727.160802] LNet: 29533:0:(router.c:384:lnet_check_route_inconsistency()) route o2ib18->172.19.1.54@o2ib100 is detected to be multi-hop but hop count is set to -1
2021-05-21 14:01:26 [1986727.175487] LNet: 29533:0:(router.c:384:lnet_check_route_inconsistency()) Skipped 39 previous similar messages
2021-05-21 14:11:26 [1987327.224529] LNet: 29532:0:(router.c:384:lnet_check_route_inconsistency()) route o2ib18->172.19.1.54@o2ib100 is detected to be multi-hop but hop count is set to -1
2021-05-21 14:11:26 [1987327.239200] LNet: 29532:0:(router.c:384:lnet_check_route_inconsistency()) Skipped 39 previous similar messages
2021-05-21 14:21:27 [1987928.312076] LNet: 29532:0:(router.c:384:lnet_check_route_inconsistency()) route o2ib18->172.19.1.54@o2ib100 is detected to be multi-hop but hop count is set to -1
2021-05-21 14:21:27 [1987928.326754] LNet: 29532:0:(router.c:384:lnet_check_route_inconsistency()) Skipped 39 previous similar messages
2021-05-21 14:31:27 [1988528.439424] LNet: 29533:0:(router.c:384:lnet_check_route_inconsistency()) route o2ib18->172.19.1.54@o2ib100 is detected to be multi-hop but hop count is set to -1
2021-05-21 14:31:27 [1988528.454105] LNet: 29533:0:(router.c:384:lnet_check_route_inconsistency()) Skipped 39 previous similar messages
2021-05-21 14:41:27 [1989128.567097] LNet: 29532:0:(router.c:384:lnet_check_route_inconsistency()) route o2ib18->172.19.1.55@o2ib100 is detected to be multi-hop but hop count is set to -1
2021-05-21 14:41:27 [1989128.581778] LNet: 29532:0:(router.c:384:lnet_check_route_inconsistency()) Skipped 39 previous similar messages
2021-05-21 14:51:27 [1989728.630972] LNet: 29532:0:(router.c:384:lnet_check_route_inconsistency()) route o2ib18->172.19.1.54@o2ib100 is detected to be multi-hop but hop count is set to -1
2021-05-21 14:51:27 [1989728.645652] LNet: 29532:0:(router.c:384:lnet_check_route_inconsistency()) Skipped 39 previous similar messages<ConMan> Console [garter1] log at 2021-05-21 15:00:00 PDT.
2021-05-21 15:01:28 [1990329.206387] LNet: 29535:0:(router.c:384:lnet_check_route_inconsistency()) route o2ib18->172.19.1.54@o2ib100 is detected to be multi-hop but hop count is set to -1
2021-05-21 15:01:28 [1990329.221064] LNet: 29535:0:(router.c:384:lnet_check_route_inconsistency()) Skipped 39 previous similar messages
2021-05-21 15:07:06 [1990667.397573] LustreError: 3594004:0:(lu_object.c:2472:lu_object_assign_fid()) ASSERTION( rc == 0 ) failed: failed hashtable insertion: rc = -16

Comment by Gian-Carlo Defazio [ 15/Jun/21 ]

I've been having some trouble getting it to consistently reproduce, but I have gotten a few failures with this command in my script my_mdtest_command:

./mdtest '-n' '1000000' '-u' '-L' '-F' '-P' '-N' '1' '-d' '/mnt/lustre/defazio1/mdtest-temp' '-Y' '-W' '300' '-a' 'POSIX'

I'm running it with

srun -N8 -n32 my_mdtest_command

I was getting around 29K open and close ops (all on a single MDT) on ltop at the time of the failures, the the failover node then finished the test.

I'm doing larger tests to see if I can get more consistent failures.

 

 

Comment by James A Simmons [ 16/Jun/21 ]

Is this without my patch?

Comment by Gian-Carlo Defazio [ 16/Jun/21 ]

Yes, this is the unpatched version.

Comment by Gerrit Updater [ 30/Jun/21 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/43968/
Subject: LU-14752 obdclass: handle EBUSY returned for lu_object hashtable
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 285a29d3b5e47f63a94c0682040ddbf09614f130

Comment by Peter Jones [ 30/Jun/21 ]

Patch has landed to master. Is anything else needed here other than porting the patch to b2_12?

Comment by Gian-Carlo Defazio [ 06/Jul/21 ]

We don't need anything besides porting to b2_12.

The patch should be on the same system where we've been seeing the LustreError soon and we'll see if it continues to happen.

Comment by Gerrit Updater [ 07/Jul/21 ]

Gian-Carlo DeFazio (defazio1@llnl.gov) uploaded a new patch: https://review.whamcloud.com/44170
Subject: LU-14752 obdclass: handle EBUSY returned for lu_object hashtable
Project: fs/lustre-release
Branch: b2_14
Current Patch Set: 1
Commit: 34764e15e007db3215745afafadba57d1aae2ed1

Comment by Olaf Faaland [ 19/Jul/21 ]

> We don't need anything besides porting to b2_12.

b2_12 does not have commit "LU-8130 lu_object: convert lu_object cache to rhashtable", and so it still uses a cfs_hash not an rhashtable.  So I don't believe that this fix is needed b2_12.

Comment by Gerrit Updater [ 09/Nov/21 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/44170/
Subject: LU-14752 obdclass: handle EBUSY returned for lu_object hashtable
Project: fs/lustre-release
Branch: b2_14
Current Patch Set:
Commit: 17b174db287e450343e51051623e2fcaa792fef3

Generated at Sat Feb 10 03:12:28 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.