[LU-11091] MDS threads stuck in lod_qos_prep_create after OSS crash Created: 19/Jun/18 Updated: 10/Sep/20 Resolved: 27/Feb/20 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.7.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Mahmoud Hanafi | Assignee: | Hongchao Zhang |
| Resolution: | Cannot Reproduce | Votes: | 1 |
| Labels: | None | ||
| Environment: |
lustre2.7.3 fe |
||
| Attachments: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
OST disk issue required reboot of OSS. This caused MDT threads to hang in lod_qos_prep_create. The MDT required a reboot about 6 hours after the OST recovered.

OST Disk ERRORS
Jun 18 09:56:37 nbp2-oss5 kernel: sd 16:0:0:7: [sdcu] Sense Key : Recovered Error [current]
Jun 18 09:56:37 nbp2-oss5 kernel: sd 16:0:0:7: [sdcu] <<vendor>> ASC=0x95 ASCQ=0x1

OSS Rebooted at Jun 18 14:30:00

MDT Errors at OSS reboot time
Jun 18 12:31:12 nbp2-mds kernel: Call Trace:
Jun 18 12:31:12 nbp2-mds kernel: [<ffffffff811cb40c>] ? __getblk+0x2c/0x2a0
Jun 18 12:31:12 nbp2-mds kernel: [<ffffffff81584435>] rwsem_down_failed_common+0x95/0x1d0
Jun 18 12:31:12 nbp2-mds kernel: [<ffffffff81584593>] rwsem_down_write_failed+0x23/0x30
Jun 18 12:31:12 nbp2-mds kernel: [<ffffffff812c7fe3>] call_rwsem_down_write_failed+0x13/0x20
Jun 18 12:31:12 nbp2-mds kernel: [<ffffffffa11f07c0>] ? lod_declare_object_create+0x0/0x450 [lod]
Jun 18 12:31:12 nbp2-mds kernel: [<ffffffff81583a92>] ? down_write+0x32/0x40
Jun 18 12:31:12 nbp2-mds kernel: [<ffffffffa11f7065>] lod_qos_prep_create+0xc25/0x1aa0 [lod]
Jun 18 12:31:12 nbp2-mds kernel: [<ffffffffa0f41459>] ? osd_declare_qid+0x289/0x480 [osd_ldiskfs]
Jun 18 12:31:12 nbp2-mds kernel: [<ffffffffa11e8c02>] lod_declare_striped_object+0x162/0x980 [lod]
Jun 18 12:31:12 nbp2-mds kernel: [<ffffffffa0f1b735>] ? osd_declare_object_create+0x1c5/0x340 [osd_ldiskfs]
Jun 18 12:31:12 nbp2-mds kernel: [<ffffffffa11f0a7f>] lod_declare_object_create+0x2bf/0x450 [lod]
Jun 18 12:31:12 nbp2-mds kernel: [<ffffffffa125ad76>] mdd_declare_object_create_internal+0x116/0x340 [mdd]
Jun 18 12:31:12 nbp2-mds kernel: [<ffffffffa125670e>] mdd_create+0x69e/0x1740 [mdd]
Jun 18 12:31:12 nbp2-mds kernel: [<ffffffffa1118348>] mdo_create+0x18/0x50 [mdt]
Jun 18 12:31:12 nbp2-mds kernel: [<ffffffffa11224ff>] mdt_reint_open+0x1f8f/0x2c70 [mdt]
Jun 18 12:31:12 nbp2-mds kernel: [<ffffffffa05d491c>] ? upcall_cache_get_entry+0x29c/0x880 [libcfs]
Jun 18 12:31:12 nbp2-mds kernel: [<ffffffffa110928d>] mdt_reint_rec+0x5d/0x200 [mdt]
Jun 18 12:31:12 nbp2-mds kernel: [<ffffffffa10ece7b>] mdt_reint_internal+0x4cb/0x7a0 [mdt]
Jun 18 12:31:12 nbp2-mds kernel: [<ffffffffa10ed346>] mdt_intent_reint+0x1f6/0x440 [mdt]
Jun 18 12:31:12 nbp2-mds kernel: [<ffffffffa10eb92e>] mdt_intent_policy+0x4be/0xd10 [mdt]
Jun 18 12:31:12 nbp2-mds kernel: [<ffffffffa09047a7>] ldlm_lock_enqueue+0x127/0xa50 [ptlrpc]
Jun 18 12:31:12 nbp2-mds kernel: [<ffffffffa093055b>] ldlm_handle_enqueue0+0x51b/0x14d0 [ptlrpc]
Jun 18 12:31:12 nbp2-mds kernel: [<ffffffffa09b9eb1>] tgt_enqueue+0x61/0x230 [ptlrpc]
Jun 18 12:31:12 nbp2-mds kernel: [<ffffffffa09baece>] tgt_request_handle+0x8be/0x1020 [ptlrpc]
Jun 18 12:31:13 nbp2-mds kernel: [<ffffffffa0964ca1>] ptlrpc_main+0xf41/0x1a80 [ptlrpc]
Jun 18 12:31:13 nbp2-mds kernel: [<ffffffffa0963d60>] ? ptlrpc_main+0x0/0x1a80 [ptlrpc]
Jun 18 12:31:13 nbp2-mds kernel: [<ffffffff810a379e>] kthread+0x9e/0xc0
Jun 18 12:31:13 nbp2-mds kernel: [<ffffffff8100c28a>] child_rip+0xa/0x20
Jun 18 12:31:13 nbp2-mds kernel: [<ffffffff810a3700>] ? kthread+0x0/0xc0
Jun 18 12:31:13 nbp2-mds kernel: [<ffffffff8100c280>] ? child_rip+0x0/0x20

MDS rebooted at Jun 18 17:58:59
Backtrace at time of MDS crash is attached. |
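
For anyone triaging a similar hang, a trace like the one above can be reduced to its reliable frames (those not prefixed with "?") and grouped by kernel module, which makes the path from ptlrpc through mdt, mdd, and lod down to the rwsem wait easier to see. The following is a minimal sketch only, not a Lustre tool: the names FRAME_RE and parse_frames are hypothetical, and the regular expression assumes the syslog frame format shown in this ticket.

```python
import re
from collections import Counter

# Matches one kernel stack frame such as
#   [<ffffffffa11f7065>] lod_qos_prep_create+0xc25/0x1aa0 [lod]
# A "?" before the symbol marks a frame the unwinder could not confirm.
# Assumption: frames follow the format seen in this ticket's syslog output.
FRAME_RE = re.compile(
    r"\[<(?P<addr>[0-9a-f]+)>\]\s+(?P<unreliable>\?\s+)?"
    r"(?P<symbol>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?:\s+\[(?P<module>\w+)\])?"
)

def parse_frames(text):
    """Return a list of dicts, one per stack frame found in text."""
    frames = []
    for m in FRAME_RE.finditer(text):
        frames.append({
            "symbol": m.group("symbol"),
            # Frames without a [module] suffix belong to the core kernel.
            "module": m.group("module") or "kernel",
            "reliable": m.group("unreliable") is None,
        })
    return frames

# Four lines copied from the call trace in this ticket.
sample = """\
Jun 18 12:31:12 nbp2-mds kernel: [<ffffffff81584593>] rwsem_down_write_failed+0x23/0x30
Jun 18 12:31:12 nbp2-mds kernel: [<ffffffff81583a92>] ? down_write+0x32/0x40
Jun 18 12:31:12 nbp2-mds kernel: [<ffffffffa11f7065>] lod_qos_prep_create+0xc25/0x1aa0 [lod]
Jun 18 12:31:12 nbp2-mds kernel: [<ffffffffa11e8c02>] lod_declare_striped_object+0x162/0x980 [lod]
"""

frames = parse_frames(sample)
reliable = [f for f in frames if f["reliable"]]
by_module = Counter(f["module"] for f in reliable)

for f in reliable:
    print(f"{f['module']:>8}  {f['symbol']}")
```

Running the same function over the full trace and over the attached crash backtrace would allow a quick comparison of which modules the stuck threads were in at each point in time.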
| Comments |
| Comment by Peter Jones [ 20/Jun/18 ] |
|
Hongchao, can you please assist with this issue? Thanks, Peter |
| Comment by Hongchao Zhang [ 28/Jun/18 ] |
|
Could you attach the logs (Lustre debug log, syslog, console log, etc) at OST and MDT? |
| Comment by Mahmoud Hanafi [ 27/Feb/20 ] |
|
We can close this issue. Unable to reproduce. |