[LU-11091] MDS threads stuck in lod_qos_prep_create after OSS crash Created: 19/Jun/18  Updated: 10/Sep/20  Resolved: 27/Feb/20

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.7.0
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Mahmoud Hanafi Assignee: Hongchao Zhang
Resolution: Cannot Reproduce Votes: 1
Labels: None
Environment:

lustre2.7.3 fe


Attachments: File s600.crash.jun18.2018.bt.all    
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

An OST disk issue required a reboot of the OSS. This caused MDT threads to hang in lod_qos_prep_create. The MDT required a reboot about six hours after the OST recovered.

OST disk errors:

Jun 18 09:56:37 nbp2-oss5 kernel: sd 16:0:0:7: [sdcu]  Sense Key : Recovered Error [current]
Jun 18 09:56:37 nbp2-oss5 kernel: sd 16:0:0:7: [sdcu]  <<vendor>> ASC=0x95 ASCQ=0x1

OSS rebooted at Jun 18 14:30:00.

MDT errors around the time of the OSS failure:


Jun 18 12:31:12 nbp2-mds kernel: Call Trace:
Jun 18 12:31:12 nbp2-mds kernel: [<ffffffff811cb40c>] ? __getblk+0x2c/0x2a0
Jun 18 12:31:12 nbp2-mds kernel: [<ffffffff81584435>] rwsem_down_failed_common+0x95/0x1d0
Jun 18 12:31:12 nbp2-mds kernel: [<ffffffff81584593>] rwsem_down_write_failed+0x23/0x30
Jun 18 12:31:12 nbp2-mds kernel: [<ffffffff812c7fe3>] call_rwsem_down_write_failed+0x13/0x20
Jun 18 12:31:12 nbp2-mds kernel: [<ffffffffa11f07c0>] ? lod_declare_object_create+0x0/0x450 [lod]
Jun 18 12:31:12 nbp2-mds kernel: [<ffffffff81583a92>] ? down_write+0x32/0x40
Jun 18 12:31:12 nbp2-mds kernel: [<ffffffffa11f7065>] lod_qos_prep_create+0xc25/0x1aa0 [lod]
Jun 18 12:31:12 nbp2-mds kernel: [<ffffffffa0f41459>] ? osd_declare_qid+0x289/0x480 [osd_ldiskfs]
Jun 18 12:31:12 nbp2-mds kernel: [<ffffffffa11e8c02>] lod_declare_striped_object+0x162/0x980 [lod]
Jun 18 12:31:12 nbp2-mds kernel: [<ffffffffa0f1b735>] ? osd_declare_object_create+0x1c5/0x340 [osd_ldiskfs]
Jun 18 12:31:12 nbp2-mds kernel: [<ffffffffa11f0a7f>] lod_declare_object_create+0x2bf/0x450 [lod]
Jun 18 12:31:12 nbp2-mds kernel: [<ffffffffa125ad76>] mdd_declare_object_create_internal+0x116/0x340 [mdd]
Jun 18 12:31:12 nbp2-mds kernel: [<ffffffffa125670e>] mdd_create+0x69e/0x1740 [mdd]
Jun 18 12:31:12 nbp2-mds kernel: [<ffffffffa1118348>] mdo_create+0x18/0x50 [mdt]
Jun 18 12:31:12 nbp2-mds kernel: [<ffffffffa11224ff>] mdt_reint_open+0x1f8f/0x2c70 [mdt]
Jun 18 12:31:12 nbp2-mds kernel: [<ffffffffa05d491c>] ? upcall_cache_get_entry+0x29c/0x880 [libcfs]
Jun 18 12:31:12 nbp2-mds kernel: [<ffffffffa110928d>] mdt_reint_rec+0x5d/0x200 [mdt]
Jun 18 12:31:12 nbp2-mds kernel: [<ffffffffa10ece7b>] mdt_reint_internal+0x4cb/0x7a0 [mdt]
Jun 18 12:31:12 nbp2-mds kernel: [<ffffffffa10ed346>] mdt_intent_reint+0x1f6/0x440 [mdt]
Jun 18 12:31:12 nbp2-mds kernel: [<ffffffffa10eb92e>] mdt_intent_policy+0x4be/0xd10 [mdt]
Jun 18 12:31:12 nbp2-mds kernel: [<ffffffffa09047a7>] ldlm_lock_enqueue+0x127/0xa50 [ptlrpc]
Jun 18 12:31:12 nbp2-mds kernel: [<ffffffffa093055b>] ldlm_handle_enqueue0+0x51b/0x14d0 [ptlrpc]
Jun 18 12:31:12 nbp2-mds kernel: [<ffffffffa09b9eb1>] tgt_enqueue+0x61/0x230 [ptlrpc]
Jun 18 12:31:12 nbp2-mds kernel: [<ffffffffa09baece>] tgt_request_handle+0x8be/0x1020 [ptlrpc]
Jun 18 12:31:13 nbp2-mds kernel: [<ffffffffa0964ca1>] ptlrpc_main+0xf41/0x1a80 [ptlrpc]
Jun 18 12:31:13 nbp2-mds kernel: [<ffffffffa0963d60>] ? ptlrpc_main+0x0/0x1a80 [ptlrpc]
Jun 18 12:31:13 nbp2-mds kernel: [<ffffffff810a379e>] kthread+0x9e/0xc0
Jun 18 12:31:13 nbp2-mds kernel: [<ffffffff8100c28a>] child_rip+0xa/0x20
Jun 18 12:31:13 nbp2-mds kernel: [<ffffffff810a3700>] ? kthread+0x0/0xc0
Jun 18 12:31:13 nbp2-mds kernel: [<ffffffff8100c280>] ? child_rip+0x0/0x20
MDS rebooted at Jun 18 17:58:59.

A backtrace from the time of the MDS crash is attached (s600.crash.jun18.2018.bt.all).



 Comments   
Comment by Peter Jones [ 20/Jun/18 ]

Hongchao

Can you please assist with this issue?

Thanks

Peter

Comment by Hongchao Zhang [ 28/Jun/18 ]

Could you attach the logs (Lustre debug log, syslog, console log, etc.) from the OST and MDT?
Thanks!

Comment by Mahmoud Hanafi [ 27/Feb/20 ]

We can close this issue; we were unable to reproduce it.
