Details
- Type: Bug
- Priority: Minor
- Resolution: Duplicate
Description
To reproduce:
[root@test1 vagrant]# lctl --device lustre-MDT0000 changelog_register
lustre-MDT0000: Registered changelog userid 'cl1'
[root@test1 vagrant]# lfs changelog lustre-MDT0000
[root@test1 vagrant]# touch /mnt/lustre/foo
[root@test1 vagrant]# lfs changelog lustre-MDT0000
1 01CREAT 23:24:41.475662597 2014.11.05 0x0 t=[0x200000400:0x2c:0x0] j=touch.0 p=[0x200000007:0x1:0x0] foo
[root@test1 vagrant]# lfs changelog_clear lustre-MDT0000 cl1 1
[root@test1 vagrant]# lfs changelog lustre-MDT0000
This last command hangs, even when new changes are made to the filesystem. I also tried a version that calls llapi_changelog_start() without CHANGELOG_FLAG_BLOCK, and it still blocks here. The stack trace of the lfs process:
Nov 5 15:26:04 test1 kernel: lfs S 0000000000000000 0 16712 10059 0x00000080
Nov 5 15:26:04 test1 kernel: ffff88001ade9ca8 0000000000000086 ffff88001ade9c18 ffffffff8111fc67
Nov 5 15:26:04 test1 kernel: 000000000040c9b0 ffff88001dbc6890 ffff88001ade9ca8 ffffffff8114a669
Nov 5 15:26:04 test1 kernel: ffff88002aad8638 ffff88001ade9fd8 000000000000fbc8 ffff88002aad8638
Nov 5 15:26:04 test1 kernel: Call Trace:
Nov 5 15:26:04 test1 kernel: [<ffffffff8111fc67>] ? unlock_page+0x27/0x30
Nov 5 15:26:04 test1 kernel: [<ffffffff8114a669>] ? __do_fault+0x469/0x530
Nov 5 15:26:04 test1 kernel: [<ffffffff8109b2ce>] ? prepare_to_wait+0x4e/0x80
Nov 5 15:26:04 test1 kernel: [<ffffffff81193e1b>] pipe_wait+0x5b/0x80
Nov 5 15:26:04 test1 kernel: [<ffffffff8109afa0>] ? autoremove_wake_function+0x0/0x40
Nov 5 15:26:04 test1 kernel: [<ffffffff8152ad0e>] ? mutex_lock+0x1e/0x50
Nov 5 15:26:04 test1 kernel: [<ffffffff811948c6>] pipe_read+0x3e6/0x4e0
Nov 5 15:26:04 test1 kernel: [<ffffffffa10f0c50>] ? ll_dir_open+0x0/0xf0 [lustre]
Nov 5 15:26:04 test1 kernel: [<ffffffff811890da>] do_sync_read+0xfa/0x140
Nov 5 15:26:04 test1 kernel: [<ffffffff8109afa0>] ? autoremove_wake_function+0x0/0x40
Nov 5 15:26:04 test1 kernel: [<ffffffff811ab840>] ? mntput_no_expire+0x30/0x110
Nov 5 15:26:04 test1 kernel: [<ffffffff81227386>] ? security_file_permission+0x16/0x20
Nov 5 15:26:04 test1 kernel: [<ffffffff81189a95>] vfs_read+0xb5/0x1a0
Nov 5 15:26:04 test1 kernel: [<ffffffff81189bd1>] sys_read+0x51/0x90
Nov 5 15:26:04 test1 kernel: [<ffffffff8152c93e>] ? do_device_not_available+0xe/0x10
Nov 5 15:26:04 test1 kernel: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
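For context, a minimal changelog consumer over liblustreapi looks roughly like the sketch below. This is an illustration of the code path involved (it needs a Lustre client and -llustreapi to build), not the reporter's actual program, and the exact prototypes and record fields vary by Lustre version:

```c
/* Sketch of a changelog reader via liblustreapi, assuming a Lustre 2.x
 * lustre/lustreapi.h. Illustrative only; not the reporter's code. */
#include <stdio.h>
#include <lustre/lustreapi.h>

int main(void)
{
        void *priv;
        struct changelog_rec *rec;
        int rc;

        /* CHANGELOG_FLAG_BLOCK asks recv to wait for new records; per the
         * report, the hang occurs with or without it. */
        rc = llapi_changelog_start(&priv, CHANGELOG_FLAG_BLOCK,
                                   "lustre-MDT0000", 0 /* first record */);
        if (rc < 0)
                return 1;

        /* Records are delivered to userspace over a pipe; this receive is
         * where the lfs stack above is stuck in pipe_wait()/pipe_read(). */
        while (llapi_changelog_recv(priv, &rec) == 0) {
                printf("index %llu type %d\n",
                       (unsigned long long)rec->cr_index, rec->cr_type);
                llapi_changelog_free(&rec);
        }

        llapi_changelog_fini(&priv);
        return 0;
}
```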
I also noticed this stack for the mdc_clg_send thread; it seems the MDC has lost its connection to the MDT (which is on the same node):
Nov 5 15:26:04 test1 kernel: mdc_clg_send_ S 0000000000000000 0 16713 2 0x00000080
Nov 5 15:26:04 test1 kernel: ffff88001fc77ad0 0000000000000046 ffff88001c2f9138 ffff88001bbf04c8
Nov 5 15:26:04 test1 kernel: ffff88001fc77a50 ffffffffa06f83ea ffff88001bbf0400 00000000ffffffed
Nov 5 15:26:04 test1 kernel: ffff88002aad9af8 ffff88001fc77fd8 000000000000fbc8 ffff88002aad9af8
Nov 5 15:26:04 test1 kernel: Call Trace:
Nov 5 15:26:04 test1 kernel: [<ffffffffa06f83ea>] ? ptlrpc_request_handle_notconn+0x11a/0x290 [ptlrpc]
Nov 5 15:26:04 test1 kernel: [<ffffffff8152a532>] schedule_timeout+0x192/0x2e0
Nov 5 15:26:04 test1 kernel: [<ffffffff81083f30>] ? process_timeout+0x0/0x10
Nov 5 15:26:04 test1 kernel: [<ffffffffa06f73ea>] ptlrpc_set_wait+0x2da/0x860 [ptlrpc]
Nov 5 15:26:04 test1 kernel: [<ffffffff81061d00>] ? default_wake_function+0x0/0x20
Nov 5 15:26:04 test1 kernel: [<ffffffffa0702066>] ? lustre_msg_set_jobid+0xb6/0x140 [ptlrpc]
Nov 5 15:26:04 test1 kernel: [<ffffffffa06f79f7>] ptlrpc_queue_wait+0x87/0x220 [ptlrpc]
Nov 5 15:26:04 test1 kernel: [<ffffffffa071840c>] llog_client_destroy+0xcc/0x340 [ptlrpc]
Nov 5 15:26:04 test1 kernel: [<ffffffffa0483421>] llog_destroy+0x51/0x170 [obdclass]
Nov 5 15:26:04 test1 kernel: [<ffffffffa04859dd>] llog_cat_process_cb+0x39d/0x5e0 [obdclass]
Nov 5 15:26:04 test1 kernel: [<ffffffffa0480dc2>] llog_process_thread+0x9a2/0xde0 [obdclass]
Nov 5 15:26:04 test1 kernel: [<ffffffffa0485640>] ? llog_cat_process_cb+0x0/0x5e0 [obdclass]
Nov 5 15:26:04 test1 kernel: [<ffffffffa0482a1f>] llog_process_or_fork+0x13f/0x540 [obdclass]
Nov 5 15:26:04 test1 kernel: [<ffffffffa0483d5d>] llog_cat_process_or_fork+0x1ad/0x300 [obdclass]
Nov 5 15:26:04 test1 kernel: [<ffffffffa07175c3>] ? llog_client_read_header+0x193/0x5e0 [ptlrpc]
Nov 5 15:26:04 test1 kernel: [<ffffffffa0a27740>] ? changelog_kkuc_cb+0x0/0x490 [mdc]
Nov 5 15:26:04 test1 kernel: [<ffffffffa0483ec9>] llog_cat_process+0x19/0x20 [obdclass]
Nov 5 15:26:04 test1 kernel: [<ffffffffa0a2c42b>] mdc_changelog_send_thread+0x4bb/0x890 [mdc]
Nov 5 15:26:04 test1 kernel: [<ffffffffa0a2bf70>] ? mdc_changelog_send_thread+0x0/0x890 [mdc]
Nov 5 15:26:04 test1 kernel: [<ffffffff8109abf6>] kthread+0x96/0xa0
Nov 5 15:26:04 test1 kernel: [<ffffffff8100c20a>] child_rip+0xa/0x20
Nov 5 15:26:04 test1 kernel: [<ffffffff8109ab60>] ? kthread+0x0/0xa0
Nov 5 15:26:04 test1 kernel: [<ffffffff8100c200>] ? child_rip+0x0/0x20
In fact, it appears the MDC is spinning in a reconnect loop with the local MDT:
Nov 5 15:25:09 test1 kernel: Lustre: lustre-MDT0000-mdc-ffff88001ccf2800: Connection to lustre-MDT0000 (at 0@lo) was lost; in progress operations using this service will wait for recovery to complete
Nov 5 15:25:09 test1 kernel: Lustre: Skipped 55176 previous similar messages
Nov 5 15:25:09 test1 kernel: Lustre: lustre-MDT0000: Client 43cec978-39b5-b4ea-ff11-d7e5611258e4 (at 0@lo) reconnecting
Nov 5 15:25:09 test1 kernel: Lustre: Skipped 55177 previous similar messages
Nov 5 15:25:09 test1 kernel: Lustre: lustre-MDT0000-mdc-ffff88001ccf2800: Connection restored to lustre-MDT0000 (at 0@lo)
Nov 5 15:25:09 test1 kernel: Lustre: Skipped 55177 previous similar messages
Nov 5 15:25:25 test1 kernel: Lustre: lustre-MDT0000-mdc-ffff88001ccf2800: Connection to lustre-MDT0000 (at 0@lo) was lost; in progress operations using this service will wait for recovery to complete
Nov 5 15:25:25 test1 kernel: Lustre: Skipped 117892 previous similar messages
Nov 5 15:25:25 test1 kernel: Lustre: lustre-MDT0000: Client 43cec978-39b5-b4ea-ff11-d7e5611258e4 (at 0@lo) reconnecting
Nov 5 15:25:25 test1 kernel: Lustre: Skipped 117892 previous similar messages
Nov 5 15:25:25 test1 kernel: Lustre: lustre-MDT0000-mdc-ffff88001ccf2800: Connection restored to lustre-MDT0000 (at 0@lo)
Nov 5 15:25:25 test1 kernel: Lustre: Skipped 117892 previous similar messages
Nov 5 15:25:57 test1 kernel: Lustre: lustre-MDT0000-mdc-ffff88001ccf2800: Connection restored to lustre-MDT0000 (at 0@lo)
Nov 5 15:25:57 test1 kernel: Lustre: Skipped 230776 previous similar messages
Nov 5 15:25:57 test1 kernel: Lustre: lustre-MDT0000-mdc-ffff88001ccf2800: Connection to lustre-MDT0000 (at 0@lo) was lost; in progress operations using this service will wait for recovery to complete
Nov 5 15:25:57 test1 kernel: Lustre: Skipped 230777 previous similar messages