[LU-10222] DNE recovery is failed or stuck Created: 09/Nov/17 Updated: 20/Nov/17 Resolved: 20/Nov/17 |
|
| Status: | Closed |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Olaf Faaland | Assignee: | Peter Jones |
| Resolution: | Not a Bug | Votes: | 0 |
| Labels: | llnl | ||
| Environment: | fs/lustre-release-fe | ||
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
After an MDT is stopped on one node and brought up on another, recovery fails to complete and attempts to access the filesystem from clients hang.

In the console log of the affected MDT:

    Nov 8 13:49:41 jet15 kernel: [11181.278822] Lustre: lquake-MDT000e: Recovery already passed deadline 5:59. It is due to DNE recovery failed/stuck on the 1 MDT(s): 0001. Please wait until all MDTs recovered or abort the recovery by force.

In the recovery_status procfile of the affected MDT, lquake-MDT000E (on host jet15):

    status: WAITING
    non-ready MDTs: 0001
    recovery_start: 1510186333
    time_waited: 76934

In the recovery_status procfile of lquake-MDT0001 (on host jet2):

    status: COMPLETE
    recovery_start: 1509398176
    recovery_duration: 39
    completed_clients: 91/91
    replayed_requests: 0
    last_transno: 163209257851
    VBR: DISABLED
    IR: DISABLED

At 13:47 on jet2, the kernel watchdog reported several blocked threads, whose stacks all look like this:

    INFO: task z_wr_iss:15546 blocked for more than 120 seconds.
    "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    z_wr_iss        D ffff883f6eaeaf70     0 15546      2 0x00000080
    ffff883f6671bbc0 0000000000000046 ffff883f6671bfd8 ffff883f6671bfd8
    ffff883f6671bfd8 ffff883f6671bfd8 ffff882c14c70fd0 ffff883f6eaeaf70
    ffffffff00000001 ffff887f7471cac0 ffff883f6671bfd8 ffffffff00000000
    Call Trace:
    [<ffffffff816ca759>] schedule+0x29/0x70
    [<ffffffff816cc335>] rwsem_down_write_failed+0x285/0x3f0
    [<ffffffff813481d7>] call_rwsem_down_write_failed+0x17/0x30
    [<ffffffffc03fef85>] ? spl_kmem_free+0x35/0x40 [spl]
    [<ffffffff816c9b40>] down_write+0x40/0x50
    [<ffffffffc07beb27>] dbuf_write_ready+0x207/0x310 [zfs]
    [<ffffffffc07b8b26>] arc_write_ready+0xa6/0x310 [zfs]
    [<ffffffff816c88d5>] ? mutex_lock+0x25/0x42
    [<ffffffffc0885ec4>] zio_ready+0x94/0x420 [zfs]
    [<ffffffffc040783e>] ? tsd_get_by_thread+0x2e/0x50 [spl]
    [<ffffffffc04013c8>] ? taskq_member+0x18/0x30 [spl]
    [<ffffffffc087f7ac>] zio_execute+0x9c/0x100 [zfs]
    [<ffffffffc0402326>] taskq_thread+0x246/0x470 [spl]
    [<ffffffff810c9de0>] ? wake_up_state+0x20/0x20
    [<ffffffffc04020e0>] ? taskq_thread_spawn+0x60/0x60 [spl]
    [<ffffffff810b4eef>] kthread+0xcf/0xe0
    [<ffffffff810b4e20>] ? insert_kthread_work+0x40/0x40
    [<ffffffff816d6818>] ret_from_fork+0x58/0x90
    [<ffffffff810b4e20>] ? insert_kthread_work+0x40/0x40

I believe this means they were blocked in dbuf_write_ready() at the rw_enter() call:

    if (!BP_IS_EMBEDDED(bp))
        BP_SET_FILL(bp, fill);

    mutex_exit(&db->db_mtx);

    rw_enter(&dn->dn_struct_rwlock, RW_WRITER);
    *db->db_blkptr = *bp;
    rw_exit(&dn->dn_struct_rwlock);
|
| Comments |
| Comment by Olaf Faaland [ 10/Nov/17 ] |
|
Don't do any work on this yet. I'm attempting to find out what process is holding the lock. |
| Comment by Peter Jones [ 10/Nov/17 ] |
|
OK, Olaf. We'll hold tight for now. |
| Comment by Olaf Faaland [ 20/Nov/17 ] |
|
We concluded this was not caused by Lustre, as our 2.8.0-based Lustre build does not take this lock itself. Closing. |