[LU-5877] lfs changelog hangs after lfs changelog_clear
| Created: | 05/Nov/14 | Updated: | 19/Jan/15 | Resolved: | 19/Jan/15 |
| Status: | Closed |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Robert Read (Inactive) | Assignee: | Bruno Faccini (Inactive) |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None |
| Issue Links: | |
| Severity: | 3 |
| Rank (Obsolete): | 16431 |
| Description |
To reproduce:

```
[root@test1 vagrant]# lctl --device lustre-MDT0000 changelog_register
lustre-MDT0000: Registered changelog userid 'cl1'
[root@test1 vagrant]# lfs changelog lustre-MDT0000
[root@test1 vagrant]# touch /mnt/lustre/foo
[root@test1 vagrant]# lfs changelog lustre-MDT0000
1 01CREAT 23:24:41.475662597 2014.11.05 0x0 t=[0x200000400:0x2c:0x0] j=touch.0 p=[0x200000007:0x1:0x0] foo
[root@test1 vagrant]# lfs changelog_clear lustre-MDT0000 cl1 1
[root@test1 vagrant]# lfs changelog lustre-MDT0000
```

This last command hangs, even when changes are made to the filesystem. I also tried a version that calls llapi_changelog_start() without CHANGELOG_FLAG_BLOCK, and it still blocks here.

The lfs process stack trace:

```
Nov 5 15:26:04 test1 kernel: lfs S 0000000000000000 0 16712 10059 0x00000080
Nov 5 15:26:04 test1 kernel: ffff88001ade9ca8 0000000000000086 ffff88001ade9c18 ffffffff8111fc67
Nov 5 15:26:04 test1 kernel: 000000000040c9b0 ffff88001dbc6890 ffff88001ade9ca8 ffffffff8114a669
Nov 5 15:26:04 test1 kernel: ffff88002aad8638 ffff88001ade9fd8 000000000000fbc8 ffff88002aad8638
Nov 5 15:26:04 test1 kernel: Call Trace:
Nov 5 15:26:04 test1 kernel: [<ffffffff8111fc67>] ? unlock_page+0x27/0x30
Nov 5 15:26:04 test1 kernel: [<ffffffff8114a669>] ? __do_fault+0x469/0x530
Nov 5 15:26:04 test1 kernel: [<ffffffff8109b2ce>] ? prepare_to_wait+0x4e/0x80
Nov 5 15:26:04 test1 kernel: [<ffffffff81193e1b>] pipe_wait+0x5b/0x80
Nov 5 15:26:04 test1 kernel: [<ffffffff8109afa0>] ? autoremove_wake_function+0x0/0x40
Nov 5 15:26:04 test1 kernel: [<ffffffff8152ad0e>] ? mutex_lock+0x1e/0x50
Nov 5 15:26:04 test1 kernel: [<ffffffff811948c6>] pipe_read+0x3e6/0x4e0
Nov 5 15:26:04 test1 kernel: [<ffffffffa10f0c50>] ? ll_dir_open+0x0/0xf0 [lustre]
Nov 5 15:26:04 test1 kernel: [<ffffffff811890da>] do_sync_read+0xfa/0x140
Nov 5 15:26:04 test1 kernel: [<ffffffff8109afa0>] ? autoremove_wake_function+0x0/0x40
Nov 5 15:26:04 test1 kernel: [<ffffffff811ab840>] ? mntput_no_expire+0x30/0x110
Nov 5 15:26:04 test1 kernel: [<ffffffff81227386>] ? security_file_permission+0x16/0x20
Nov 5 15:26:04 test1 kernel: [<ffffffff81189a95>] vfs_read+0xb5/0x1a0
Nov 5 15:26:04 test1 kernel: [<ffffffff81189bd1>] sys_read+0x51/0x90
Nov 5 15:26:04 test1 kernel: [<ffffffff8152c93e>] ? do_device_not_available+0xe/0x10
Nov 5 15:26:04 test1 kernel: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
```

I also noticed this stack for mdc_clg_send, and it seems the MDC has lost its connection to the MDT (which is on the same node):

```
Nov 5 15:26:04 test1 kernel: mdc_clg_send_ S 0000000000000000 0 16713 2 0x00000080
Nov 5 15:26:04 test1 kernel: ffff88001fc77ad0 0000000000000046 ffff88001c2f9138 ffff88001bbf04c8
Nov 5 15:26:04 test1 kernel: ffff88001fc77a50 ffffffffa06f83ea ffff88001bbf0400 00000000ffffffed
Nov 5 15:26:04 test1 kernel: ffff88002aad9af8 ffff88001fc77fd8 000000000000fbc8 ffff88002aad9af8
Nov 5 15:26:04 test1 kernel: Call Trace:
Nov 5 15:26:04 test1 kernel: [<ffffffffa06f83ea>] ? ptlrpc_request_handle_notconn+0x11a/0x290 [ptlrpc]
Nov 5 15:26:04 test1 kernel: [<ffffffff8152a532>] schedule_timeout+0x192/0x2e0
Nov 5 15:26:04 test1 kernel: [<ffffffff81083f30>] ? process_timeout+0x0/0x10
Nov 5 15:26:04 test1 kernel: [<ffffffffa06f73ea>] ptlrpc_set_wait+0x2da/0x860 [ptlrpc]
Nov 5 15:26:04 test1 kernel: [<ffffffff81061d00>] ? default_wake_function+0x0/0x20
Nov 5 15:26:04 test1 kernel: [<ffffffffa0702066>] ? lustre_msg_set_jobid+0xb6/0x140 [ptlrpc]
Nov 5 15:26:04 test1 kernel: [<ffffffffa06f79f7>] ptlrpc_queue_wait+0x87/0x220 [ptlrpc]
Nov 5 15:26:04 test1 kernel: [<ffffffffa071840c>] llog_client_destroy+0xcc/0x340 [ptlrpc]
Nov 5 15:26:04 test1 kernel: [<ffffffffa0483421>] llog_destroy+0x51/0x170 [obdclass]
Nov 5 15:26:04 test1 kernel: [<ffffffffa04859dd>] llog_cat_process_cb+0x39d/0x5e0 [obdclass]
Nov 5 15:26:04 test1 kernel: [<ffffffffa0480dc2>] llog_process_thread+0x9a2/0xde0 [obdclass]
Nov 5 15:26:04 test1 kernel: [<ffffffffa0485640>] ? llog_cat_process_cb+0x0/0x5e0 [obdclass]
Nov 5 15:26:04 test1 kernel: [<ffffffffa0482a1f>] llog_process_or_fork+0x13f/0x540 [obdclass]
Nov 5 15:26:04 test1 kernel: [<ffffffffa0483d5d>] llog_cat_process_or_fork+0x1ad/0x300 [obdclass]
Nov 5 15:26:04 test1 kernel: [<ffffffffa07175c3>] ? llog_client_read_header+0x193/0x5e0 [ptlrpc]
Nov 5 15:26:04 test1 kernel: [<ffffffffa0a27740>] ? changelog_kkuc_cb+0x0/0x490 [mdc]
Nov 5 15:26:04 test1 kernel: [<ffffffffa0483ec9>] llog_cat_process+0x19/0x20 [obdclass]
Nov 5 15:26:04 test1 kernel: [<ffffffffa0a2c42b>] mdc_changelog_send_thread+0x4bb/0x890 [mdc]
Nov 5 15:26:04 test1 kernel: [<ffffffffa0a2bf70>] ? mdc_changelog_send_thread+0x0/0x890 [mdc]
Nov 5 15:26:04 test1 kernel: [<ffffffff8109abf6>] kthread+0x96/0xa0
Nov 5 15:26:04 test1 kernel: [<ffffffff8100c20a>] child_rip+0xa/0x20
Nov 5 15:26:04 test1 kernel: [<ffffffff8109ab60>] ? kthread+0x0/0xa0
Nov 5 15:26:04 test1 kernel: [<ffffffff8100c200>] ? child_rip+0x0/0x20
```

Actually it appears the MDC is spinning on reconnecting to the local MDT:

```
Nov 5 15:25:09 test1 kernel: Lustre: lustre-MDT0000-mdc-ffff88001ccf2800: Connection to lustre-MDT0000 (at 0@lo) was lost; in progress operations using this service will wait for recovery to complete
Nov 5 15:25:09 test1 kernel: Lustre: Skipped 55176 previous similar messages
Nov 5 15:25:09 test1 kernel: Lustre: lustre-MDT0000: Client 43cec978-39b5-b4ea-ff11-d7e5611258e4 (at 0@lo) reconnecting
Nov 5 15:25:09 test1 kernel: Lustre: Skipped 55177 previous similar messages
Nov 5 15:25:09 test1 kernel: Lustre: lustre-MDT0000-mdc-ffff88001ccf2800: Connection restored to lustre-MDT0000 (at 0@lo)
Nov 5 15:25:09 test1 kernel: Lustre: Skipped 55177 previous similar messages
Nov 5 15:25:25 test1 kernel: Lustre: lustre-MDT0000-mdc-ffff88001ccf2800: Connection to lustre-MDT0000 (at 0@lo) was lost; in progress operations using this service will wait for recovery to complete
Nov 5 15:25:25 test1 kernel: Lustre: Skipped 117892 previous similar messages
Nov 5 15:25:25 test1 kernel: Lustre: lustre-MDT0000: Client 43cec978-39b5-b4ea-ff11-d7e5611258e4 (at 0@lo) reconnecting
Nov 5 15:25:25 test1 kernel: Lustre: Skipped 117892 previous similar messages
Nov 5 15:25:25 test1 kernel: Lustre: lustre-MDT0000-mdc-ffff88001ccf2800: Connection restored to lustre-MDT0000 (at 0@lo)
Nov 5 15:25:25 test1 kernel: Lustre: Skipped 117892 previous similar messages
Nov 5 15:25:57 test1 kernel: Lustre: lustre-MDT0000-mdc-ffff88001ccf2800: Connection restored to lustre-MDT0000 (at 0@lo)
Nov 5 15:25:57 test1 kernel: Lustre: Skipped 230776 previous similar messages
Nov 5 15:25:57 test1 kernel: Lustre: lustre-MDT0000-mdc-ffff88001ccf2800: Connection to lustre-MDT0000 (at 0@lo) was lost; in progress operations using this service will wait for recovery to complete
Nov 5 15:25:57 test1 kernel: Lustre: Skipped 230777 previous similar messages
```
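For context, the consumer that hangs here is the llapi changelog read loop: the stacks above show lfs blocked in pipe_read on the kernel-userspace (KUC) pipe while the mdc_changelog_send_thread producer is stuck in llog_client_destroy, retrying against the reconnecting MDT. Below is a minimal sketch of such a reader, assuming the 2.7-era llapi interface from <lustre/lustreapi.h>; it is an untested illustration, not the actual lfs implementation, and the exact struct changelog_rec fields vary by release.

```c
/* Minimal changelog reader sketch -- an illustration under the
 * assumptions above, not the lfs source. */
#include <stdio.h>
#include <lustre/lustreapi.h>

int main(int argc, char **argv)
{
	const char *mdt = argc > 1 ? argv[1] : "lustre-MDT0000";
	struct changelog_rec *rec;
	void *ctx = NULL;
	int rc;

	/* Flags = 0: no CHANGELOG_FLAG_BLOCK, so the loop should end
	 * once the current records are consumed.  Per this report it
	 * hangs after changelog_clear with or without the flag. */
	rc = llapi_changelog_start(&ctx, 0, mdt, 0 /* start record */);
	if (rc < 0) {
		fprintf(stderr, "llapi_changelog_start: %d\n", rc);
		return 1;
	}

	/* llapi_changelog_recv() returns 0 per record; the reported
	 * hang is inside this call, in pipe_read on the KUC pipe. */
	while ((rc = llapi_changelog_recv(ctx, &rec)) == 0) {
		printf("%llu %02d%s\n", (unsigned long long)rec->cr_index,
		       (int)rec->cr_type, changelog_type2str(rec->cr_type));
		llapi_changelog_free(&rec);
	}

	llapi_changelog_fini(&ctx);
	return rc < 0 ? 1 : 0;
}
```

Such a reader would be built against the Lustre userspace library, e.g. `gcc reader.c -llustreapi` (file name here is a placeholder).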
| Comments |
| Comment by Andreas Dilger [ 06/Nov/14 ] |
Robert, what Lustre version is this (commit hash as well, if it's not on a tag)? Is this new behaviour (i.e. it wasn't broken in a previous release), or is this something you haven't tested before and just found?
| Comment by Andreas Dilger [ 06/Nov/14 ] |
Also, what is the impact/severity of this problem? Should it be critical/blocker for 2.7.0, or a corner case that doesn't need to be fixed immediately?
| Comment by Robert Read (Inactive) [ 06/Nov/14 ] |
I installed from RPM on Monday; the commit hash is c071633. I'd consider this pretty severe, since it makes changelogs unusable if you can't clear them.
| Comment by Robert Read (Inactive) [ 06/Nov/14 ] |
This problem goes away if I revert
| Comment by Robert Read (Inactive) [ 14/Jan/15 ] |
Poke - this bug should be a candidate for 2.7, since it makes changelogs unusable. Note, reverting
| Comment by Henri Doreau (Inactive) [ 19/Jan/15 ] |
As expected, the patch for
| Comment by Jodi Levi (Inactive) [ 19/Jan/15 ] |
Should this be closed as a duplicate of
| Comment by Robert Read (Inactive) [ 19/Jan/15 ] |
Sure, they appear to have the same cause.