[LU-3202] SWL Causes OST hang Created: 22/Apr/13 Updated: 26/Apr/13 Resolved: 26/Apr/13 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.0 |
| Fix Version/s: | Lustre 2.4.0 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Cliff White (Inactive) | Assignee: | nasf (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | HB | ||
| Environment: |
LLNL/Hyperion |
||
| Attachments: |
|
| Severity: | 3 |
| Rank (Obsolete): | 7822 |
| Description |
|
Started SWL. Immediate watchdog timeouts on the OSTs; clients are stuck in the 'comp' state, are evicted by the servers, and are unable to reconnect because the servers return EBUSY. Appears to be some sort of watchdog storm on the OSTs; stack dump attached. |
| Comments |
| Comment by Cliff White (Inactive) [ 22/Apr/13 ] |
|
sysrq -p from one OST |
| Comment by Cliff White (Inactive) [ 22/Apr/13 ] |
|
Reformatted the filesystem and restarted the test; the failure occurred immediately. Attached are consoles from all OSTs and the console from the first client to be evicted. |
| Comment by Peter Jones [ 22/Apr/13 ] |
|
Oleg is looking into this. |
| Comment by Oleg Drokin [ 22/Apr/13 ] |
|
This seems to be a pretty clear deadlock:

```
2013-04-22 11:21:58 ll_ost01_001   D 000000000000000a     0  5490      2 0x00000000
2013-04-22 11:21:58  ffff8803b0c4f9b0 0000000000000046 0000000000000000 ffff88032f4e8950
2013-04-22 11:21:58  ffff8803b0c4fa10 ffff8802ad507020 ffff8803b0c4f9a0 ffff88020a13c518
2013-04-22 11:21:58  ffff8803b0c49058 ffff8803b0c4ffd8 000000000000fb88 ffff8803b0c49058
2013-04-22 11:21:58 Call Trace:
2013-04-22 11:21:58  [<ffffffff8150f3be>] __mutex_lock_slowpath+0x13e/0x180
2013-04-22 11:21:58  [<ffffffff8150f25b>] mutex_lock+0x2b/0x50
2013-04-22 11:21:58  [<ffffffffa100265e>] osd_object_destroy+0x1be/0x580 [osd_ldiskfs]
2013-04-22 11:21:58  [<ffffffffa089ebcf>] ? ldiskfs_dirty_inode+0x4f/0x60 [ldiskfs]
2013-04-22 11:21:58  [<ffffffffa10dcd20>] ofd_object_destroy+0x380/0x680 [ofd]
2013-04-22 11:21:58  [<ffffffffa10d0056>] ofd_destroy_by_fid+0x266/0x610 [ofd]
2013-04-22 11:21:58  [<ffffffffa0aca7a0>] ? ldlm_blocking_ast+0x0/0x180 [ptlrpc]
2013-04-22 11:21:58  [<ffffffffa0acbe00>] ? ldlm_completion_ast+0x0/0x960 [ptlrpc]
2013-04-22 11:21:58  [<ffffffffa0af4345>] ? lustre_msg_buf+0x55/0x60 [ptlrpc]
2013-04-22 11:21:58  [<ffffffffa10d3bb7>] ofd_destroy+0x1a7/0x8b0 [ofd]
```

vs

```
2013-04-22 11:21:56  [<ffffffffa0873ce5>] jbd2_log_wait_commit+0xc5/0x140 [jbd2]
2013-04-22 11:21:56  [<ffffffff81096ca0>] ? autoremove_wake_function+0x0/0x40
2013-04-22 11:21:56  [<ffffffffa089714e>] ldiskfs_sync_file+0x15e/0x260 [ldiskfs]
2013-04-22 11:21:56  [<ffffffffa0fff854>] osd_object_sync+0x144/0x190 [osd_ldiskfs]
2013-04-22 11:21:56  [<ffffffffa10cfadb>] ofd_sync+0x36b/0x680 [ofd]
2013-04-22 11:21:56  [<ffffffffa109eb30>] ost_blocking_ast+0x640/0x10f0 [ost]
2013-04-22 11:21:56  [<ffffffffa0964baf>] ? lu_context_fini+0x2f/0xc0 [obdclass]
2013-04-22 11:21:56  [<ffffffffa0aab16c>] ldlm_cancel_callback+0x6c/0x1a0 [ptlrpc]
2013-04-22 11:21:56  [<ffffffffa0aab2fa>] ldlm_lock_cancel+0x5a/0x1e0 [ptlrpc]
2013-04-22 11:21:56  [<ffffffffa0aceee4>] ldlm_request_cancel+0x254/0x410 [ptlrpc]
2013-04-22 11:21:56  [<ffffffffa0acf1dd>] ldlm_handle_cancel+0x13d/0x240 [ptlrpc]
2013-04-22 11:21:56  [<ffffffffa0ad1218>] ldlm_cancel_handler+0x3f8/0x600 [ptlrpc]
2013-04-22 11:21:56  [<ffffffffa0b0539c>] ptlrpc_server_handle_request+0x40c/0xd90 [ptlrpc]
```

In fact, osd_object_destroy() taking i_mutex while it has a transaction open is a big no-no, as it can lead to exactly this lock inversion; usually i_mutex is taken before a transaction is opened. What is unclear is why we did not hit this earlier, since this code has apparently been in our tree since the middle of last year. |
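The inversion above reduces to two threads taking the same pair of locks in opposite order. The following is a minimal, hedged pthread sketch for illustration only, not Lustre code: `i_mutex` stands in for inode::i_mutex, and the `journal` mutex crudely models an open jbd2 transaction blocking the commit that the sync path waits on. Both thread names are placeholders for the two stacks shown above. |

```c
/* Toy model of the reported deadlock; all names are stand-ins,
 * not real kernel or Lustre APIs. Build: cc demo.c -lpthread */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t i_mutex = PTHREAD_MUTEX_INITIALIZER; /* inode::i_mutex     */
static pthread_mutex_t journal = PTHREAD_MUTEX_INITIALIZER; /* open transaction   */

/* Models the destroy stack: transaction already open, then i_mutex. */
static void *destroy_thread(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&journal);   /* transaction opened by the caller */
    usleep(1000);                   /* widen the race window */
    pthread_mutex_lock(&i_mutex);   /* blocks: sync_thread holds i_mutex */
    pthread_mutex_unlock(&i_mutex);
    pthread_mutex_unlock(&journal);
    return NULL;
}

/* Models the sync stack: i_mutex first, then a wait on the journal,
 * which cannot finish while destroy_thread keeps its transaction open. */
static void *sync_thread(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&i_mutex);
    usleep(1000);
    pthread_mutex_lock(&journal);   /* blocks: destroy_thread holds it */
    pthread_mutex_unlock(&journal);
    pthread_mutex_unlock(&i_mutex);
    return NULL;
}

int main(void)
{
    pthread_t a, b;
    pthread_create(&a, NULL, destroy_thread, NULL);
    pthread_create(&b, NULL, sync_thread, NULL);
    pthread_join(a, NULL);          /* hangs once both threads block */
    pthread_join(b, NULL);
    puts("no deadlock this run");
    return 0;
}
```

Run repeatedly and the two threads wedge against each other, mirroring the watchdog-stuck OST service threads. |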
| Comment by Oleg Drokin [ 23/Apr/13 ] |
|
Reassigned to FanYong, as he is going to take this forward with a fix. |
| Comment by nasf (Inactive) [ 23/Apr/13 ] |
|
Originally, to control the race between the OI scrub inserting an OI mapping for the object and the unlink removing it, osd_object_destroy() took inode::i_mutex. But the unlink thread which called osd_object_destroy() already had a transaction open, so taking i_mutex inside that transaction inverted the usual lock order and could deadlock, as described above. So now, osd_object_destroy() will not obtain inode::i_mutex, which removes its side of the inversion. This is the patch: |
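To make the lock-ordering idea concrete, here is the destroy side of the earlier toy model adjusted as this comment describes; it is a hedged sketch only, not the actual patch, and `destroy_thread_fixed` is a hypothetical name. With i_mutex no longer taken inside the open transaction, the destroy path holds only one of the two locks, so the cross-order with the sync path disappears. |

```c
/* Hedged sketch of the fix's effect, reusing the toy locks from the
 * model above; not the real patch code. */
#include <pthread.h>

static pthread_mutex_t i_mutex = PTHREAD_MUTEX_INITIALIZER; /* inode::i_mutex   */
static pthread_mutex_t journal = PTHREAD_MUTEX_INITIALIZER; /* open transaction */

static void *destroy_thread_fixed(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&journal);   /* transaction open, as before */
    /* i_mutex is no longer taken here; the OI scrub race it used to
     * guard is assumed to be serialized by other means in the patch. */
    pthread_mutex_unlock(&journal);
    return NULL;
}
```

Whichever thread runs first, the sync path can now always make progress, since no thread waits on the journal while also blocking i_mutex holders. |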
| Comment by Jodi Levi (Inactive) [ 26/Apr/13 ] |
|
Patch landed. Let me know if more work is needed and I will reopen the ticket. |