[LU-13462] MDS deadlocks in osd_read_lock() Created: 18/Apr/20  Updated: 25/Jul/22  Resolved: 25/Jul/22

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.2
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Mahmoud Hanafi Assignee: Yang Sheng
Resolution: Fixed Votes: 0
Labels: None

Attachments: File s.600.Apr17.2020_crash.bt.all    
Issue Links:
Related
is related to LU-13073 Multiple MDS deadlocks (in lod_qos_pr... Resolved
Severity: 2
Rank (Obsolete): 9223372036854775807

 Description   

The MDS deadlocked.

The symptoms are similar to LU-13073.


[12287155.058187] LNet: Service thread pid 15312 was inactive for 550.48s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
[12287155.109703] LNet: Skipped 2 previous similar messages
[12287155.125609]  [<ffffffffa5f87398>] call_rwsem_down_read_failed+0x18/0x30
[12287155.130583]  [<ffffffffc144acfc>] osd_read_lock+0x5c/0xe0 [osd_ldiskfs]
[12287155.130612]  [<ffffffffc16f28ea>] lod_read_lock+0x3a/0xd0 [lod]
[12287155.130625]  [<ffffffffc17779aa>] mdd_read_lock+0x3a/0xd0 [mdd]
[12287155.130632]  [<ffffffffc177d730>] mdd_xattr_get+0x70/0x5c0 [mdd]
[12287155.130648]  [<ffffffffc15e6ea6>] mdt_stripe_get+0xd6/0x400 [mdt]
[12287155.130657]  [<ffffffffc15e7a2d>] mdt_attr_get_complex+0x46d/0x850 [mdt]
[12287155.130665]  [<ffffffffc15e800c>] mdt_getattr_internal+0x1fc/0xf60 [mdt]
[12287155.130673]  [<ffffffffc15ebd60>] mdt_getattr_name_lock+0x950/0x1c30 [mdt]
[12287155.130681]  [<ffffffffc15f3c05>] mdt_intent_getattr+0x2b5/0x480 [mdt]
[12287155.130691]  [<ffffffffc15f0a18>] mdt_intent_policy+0x2e8/0xd00 [mdt]
[12287155.130736]  [<ffffffffc0f2dd26>] ldlm_lock_enqueue+0x366/0xa60 [ptlrpc]
[12287155.130769]  [<ffffffffc0f56587>] ldlm_handle_enqueue0+0xa47/0x15a0 [ptlrpc]
[12287155.130815]  [<ffffffffc0fde882>] tgt_enqueue+0x62/0x210 [ptlrpc]
[12287155.130853]  [<ffffffffc0fe31da>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
[12287155.130887]  [<ffffffffc0f8880b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
[12287155.130921]  [<ffffffffc0f8c13c>] ptlrpc_main+0xafc/0x1fc0 [ptlrpc]
[12287155.130925]  [<ffffffffa5cc1da1>] kthread+0xd1/0xe0
[12287155.130929]  [<ffffffffa6375c37>] ret_from_fork_nospec_end+0x0/0x39
[12287155.130947]  [<ffffffffffffffff>] 0xffffffffffffffff
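For reference, call_rwsem_down_read_failed is the slow path taken when down_read() on a kernel rw-semaphore cannot be granted immediately, i.e. another thread already holds (or is queued for) the write side. The snippet below is only a userspace analogy using POSIX rwlocks, not Lustre code: the reader thread stays blocked for exactly as long as the writer holds the lock, which is the shape of the hang the service thread above is stuck in inside osd_read_lock.

/*
 * Analogy only: a POSIX rwlock stands in for the kernel rw_semaphore
 * taken in osd_read_lock(). The reader thread stays blocked until the
 * writer releases the lock, mirroring the stuck service thread above.
 */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_rwlock_t lock = PTHREAD_RWLOCK_INITIALIZER;

static void *reader(void *arg)
{
        (void)arg;
        printf("reader: calling rdlock (analogous to down_read)\n");
        pthread_rwlock_rdlock(&lock);   /* blocks while the writer holds the lock */
        printf("reader: acquired read lock\n");
        pthread_rwlock_unlock(&lock);
        return NULL;
}

int main(void)
{
        pthread_t t;

        pthread_rwlock_wrlock(&lock);   /* "writer" grabs the lock first */
        pthread_create(&t, NULL, reader, NULL);

        sleep(5);                       /* while the writer holds on, the reader waits */
        pthread_rwlock_unlock(&lock);   /* only now can the reader make progress */
        pthread_join(t, NULL);
        return 0;
}

Build with gcc -pthread. If the writer never released the lock, the reader would hang indefinitely, which is what the watchdog message above is reporting for the MDS service thread.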


 Comments   
Comment by Peter Jones [ 18/Apr/20 ]

Mahmoud

Could you please supply details of the kernel version that you are running?

Yang Sheng

Could you please advise?

Thanks

Peter

Comment by Yang Sheng [ 18/Apr/20 ]

Hi, Mahmoud,

Could you please provide more info? What do you mean by "similar to LU-13073"?

Thanks,
Yangsheng

Comment by Mahmoud Hanafi [ 20/Apr/20 ]

The stack traces for the hung threads are the same as in LU-13073, but in our case we didn't have an OSS crash.

Our kernel is: 3.10.0-957.21.3.el7_lustre212.x86_64


Comment by Yang Sheng [ 23/Apr/20 ]

Then is it possible to provide sysrq-t info? From the stack trace I don't think it is the same as LU-13073.
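(For reference, sysrq-t can be triggered without console access by writing 't' to /proc/sysrq-trigger as root, e.g. echo t > /proc/sysrq-trigger; the task dump appears in the kernel log. A minimal C equivalent of that write, purely for illustration:)

/* Illustration only: same effect as `echo t > /proc/sysrq-trigger`.
 * Must run as root; the resulting task/stack dump goes to the kernel
 * log (dmesg / console / serial), not to stdout. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
        int fd = open("/proc/sysrq-trigger", O_WRONLY);

        if (fd < 0) {
                perror("open /proc/sysrq-trigger");
                return 1;
        }
        if (write(fd, "t", 1) != 1) {
                perror("write");
                close(fd);
                return 1;
        }
        close(fd);
        return 0;
}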

Comment by Mahmoud Hanafi [ 23/Apr/20 ]

Attached the stack trace.

Comment by Yang Sheng [ 24/Apr/20 ]

Hi, Mahmoud,

The log you attached is indeed a duplicate of LU-13073, but it differs from the stack trace you pasted. The pasted log shows a thread stuck in osd_read_lock, which is most likely caused by a local filesystem issue. LU-13073 is not; it is a long-outstanding issue caused by the OSS.

Thanks,
YangSheng
