[LU-11900] MDT hit (ldlm_lib.c:1595:target_finish_recovery()) LBUG during recovery Created: 29/Jan/19  Updated: 16/Jan/22  Resolved: 16/Jan/22

Status: Closed
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.10.7
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Sarah Liu Assignee: Mikhail Pershin
Resolution: Cannot Reproduce Votes: 0
Labels: None
Environment:

server and client: 2.10.6_34_gb5ad8a0


Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

One MDT hit the following LBUG during recovery on soak after running for 15 hours. It looks like LU-6190, but that ticket is too old, so opening a new one to track it.

[ 1119.601893] sd 0:0:0:44: rdac: array soak-netapp5660-1, ctlr 0, queueing MODE_SELECT command
[ 1120.258577] sd 0:0:0:44: rdac: array soak-netapp5660-1, ctlr 0, MODE_SELECT completed
[ 1120.288628] Lustre: soaked-MDT0002: Recovery over after 3:19, of 28 clients 28 recovered and 57 were evicted.
[ 1120.354278] LustreError: 14264:0:(ldlm_lib.c:1593:target_finish_recovery()) soaked-MDT0002: Recovery queues ( lock ) are not empty
[ 1120.367393] LustreError: 14264:0:(ldlm_lib.c:1595:target_finish_recovery()) LBUG
[ 1120.375656] Pid: 14264, comm: tgt_recover_2 3.10.0-957.el7_lustre.x86_64 #1 SMP Mon Jan 7 20:06:41 UTC 2019
[ 1120.386533] Call Trace:
[ 1120.389272]  [<ffffffffc09e67cc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
[ 1120.396592]  [<ffffffffc09e687c>] lbug_with_loc+0x4c/0xa0 [libcfs]
[ 1120.403512]  [<ffffffffc0d35479>] target_recovery_thread+0x1359/0x1370 [ptlrpc]
[ 1120.411766]  [<ffffffffb00c1c31>] kthread+0xd1/0xe0
[ 1120.417229]  [<ffffffffb0774c37>] ret_from_fork_nospec_end+0x0/0x39
[ 1120.424239]  [<ffffffffffffffff>] 0xffffffffffffffff
[ 1120.429817] Kernel panic - not syncing: LBUG
[ 1120.434583] CPU: 27 PID: 14264 Comm: tgt_recover_2 Kdump: loaded Tainted: G           OE  ------------   3.10.0-957.el7_lustre.x86_64 #1
[ 1120.448260] Hardware name: Intel Corporation S2600GZ ........../S2600GZ, BIOS SE5C600.86B.01.08.0003.022620131521 02/26/2013
[ 1120.460781] Call Trace:
[ 1120.463512]  [<ffffffffb0761dc1>] dump_stack+0x19/0x1b
[ 1120.469239]  [<ffffffffb075b4d0>] panic+0xe8/0x21f
[ 1120.474590]  [<ffffffffc09e68cb>] lbug_with_loc+0x9b/0xa0 [libcfs]
[ 1120.481517]  [<ffffffffc0d35479>] target_recovery_thread+0x1359/0x1370 [ptlrpc]
[ 1120.489698]  [<ffffffffc0d34120>] ? replay_request_or_update.isra.21+0x8c0/0x8c0 [ptlrpc]
[ 1120.498846]  [<ffffffffb00c1c31>] kthread+0xd1/0xe0
[ 1120.504291]  [<ffffffffb00c1b60>] ? insert_kthread_work+0x40/0x40
[ 1120.511094]  [<ffffffffb0774c37>] ret_from_fork_nospec_begin+0x21/0x21
[ 1120.518380]  [<ffffffffb00c1b60>] ? insert_kthread_work+0x40/0x40
[    0.000000] Initializing cgroup subsys cpuset
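For context on where the assertion fires: target_finish_recovery() checks that all recovery queues have been drained before declaring recovery complete, and calls LBUG() if any are still populated. A rough paraphrase of that check, based on the ldlm_lib.c sources around Lustre 2.10 (exact field names and line numbers may differ by branch), with the "( lock )" in the console message indicating it was the lock replay queue that was non-empty:

    /* paraphrased sketch of the check in target_finish_recovery(), ldlm_lib.c */
    spin_lock(&obd->obd_recovery_task_lock);
    if (!list_empty(&obd->obd_req_replay_queue) ||
        !list_empty(&obd->obd_lock_replay_queue) ||
        !list_empty(&obd->obd_final_req_queue)) {
            /* reports which queue(s) still hold entries, e.g. "( lock )" */
            CERROR("%s: Recovery queues ( %s%s%s) are not empty\n",
                   obd->obd_name,
                   list_empty(&obd->obd_req_replay_queue)  ? "" : "req ",
                   list_empty(&obd->obd_lock_replay_queue) ? "" : "lock ",
                   list_empty(&obd->obd_final_req_queue)   ? "" : "final ");
            spin_unlock(&obd->obd_recovery_task_lock);
            LBUG();   /* line 1595 in the trace above; panics the node */
    }
    spin_unlock(&obd->obd_recovery_task_lock);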


 Comments   
Comment by Peter Jones [ 30/Jan/19 ]

Mike

Could you please advise?

Thanks

Peter

Comment by Oleg Drokin [ 31/Jan/19 ]

This is caused by https://review.whamcloud.com/#/c/33977/ landing.

I pushed a revert to https://review.whamcloud.com/#/c/34149/

Comment by Peter Jones [ 01/Feb/19 ]

hongchao.zhang, do you understand why this might affect only b2_10 and not master?

Comment by Hongchao Zhang [ 01/Feb/19 ]

The patch https://review.whamcloud.com/#/c/33977/ depends on the patch https://review.whamcloud.com/#/c/34027/.
I pushed the two patches at the same time (#33977 on top of #34027); does the landing process not check
the dependency? Sorry!

Comment by Hongchao Zhang [ 23/Mar/19 ]

The dependency patch https://review.whamcloud.com/#/c/34027/ has now landed on b2_10.

Comment by Peter Jones [ 23/Mar/19 ]

I think that this can now be closed as "Cannot Reproduce", right?
