LU-5877: lfs changelog hangs after lfs changelog_clear


Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Minor
    • Fix Version/s: None
    • Component/s: None
    • Labels: None
    • Severity: 3
    • Rank (Obsolete): 16431

    Description

      To reproduce:

      [root@test1 vagrant]# lctl --device lustre-MDT0000 changelog_register
      lustre-MDT0000: Registered changelog userid 'cl1'
      [root@test1 vagrant]# lfs changelog lustre-MDT0000
      [root@test1 vagrant]# touch /mnt/lustre/foo
      [root@test1 vagrant]# lfs changelog lustre-MDT0000
      1 01CREAT 23:24:41.475662597 2014.11.05 0x0 t=[0x200000400:0x2c:0x0] j=touch.0 p=[0x200000007:0x1:0x0] foo
      [root@test1 vagrant]# lfs changelog_clear lustre-MDT0000 cl1 1
      [root@test1 vagrant]# lfs changelog lustre-MDT0000
      
      

      This last command hangs, even when further changes are made to the filesystem. I also tried a version that calls llapi_changelog_start() without CHANGELOG_FLAG_BLOCK, and it still blocks in the same place (a sketch of that kind of reader follows the trace below). The lfs process stack trace:

      Nov  5 15:26:04 test1 kernel: lfs           S 0000000000000000     0 16712  10059 0x00000080
      Nov  5 15:26:04 test1 kernel: ffff88001ade9ca8 0000000000000086 ffff88001ade9c18 ffffffff8111fc67
      Nov  5 15:26:04 test1 kernel: 000000000040c9b0 ffff88001dbc6890 ffff88001ade9ca8 ffffffff8114a669
      Nov  5 15:26:04 test1 kernel: ffff88002aad8638 ffff88001ade9fd8 000000000000fbc8 ffff88002aad8638
      Nov  5 15:26:04 test1 kernel: Call Trace:
      Nov  5 15:26:04 test1 kernel: [<ffffffff8111fc67>] ? unlock_page+0x27/0x30
      Nov  5 15:26:04 test1 kernel: [<ffffffff8114a669>] ? __do_fault+0x469/0x530
      Nov  5 15:26:04 test1 kernel: [<ffffffff8109b2ce>] ? prepare_to_wait+0x4e/0x80
      Nov  5 15:26:04 test1 kernel: [<ffffffff81193e1b>] pipe_wait+0x5b/0x80
      Nov  5 15:26:04 test1 kernel: [<ffffffff8109afa0>] ? autoremove_wake_function+0x0/0x40
      Nov  5 15:26:04 test1 kernel: [<ffffffff8152ad0e>] ? mutex_lock+0x1e/0x50
      Nov  5 15:26:04 test1 kernel: [<ffffffff811948c6>] pipe_read+0x3e6/0x4e0
      Nov  5 15:26:04 test1 kernel: [<ffffffffa10f0c50>] ? ll_dir_open+0x0/0xf0 [lustre]
      Nov  5 15:26:04 test1 kernel: [<ffffffff811890da>] do_sync_read+0xfa/0x140
      Nov  5 15:26:04 test1 kernel: [<ffffffff8109afa0>] ? autoremove_wake_function+0x0/0x40
      Nov  5 15:26:04 test1 kernel: [<ffffffff811ab840>] ? mntput_no_expire+0x30/0x110
      Nov  5 15:26:04 test1 kernel: [<ffffffff81227386>] ? security_file_permission+0x16/0x20
      Nov  5 15:26:04 test1 kernel: [<ffffffff81189a95>] vfs_read+0xb5/0x1a0
      Nov  5 15:26:04 test1 kernel: [<ffffffff81189bd1>] sys_read+0x51/0x90
      Nov  5 15:26:04 test1 kernel: [<ffffffff8152c93e>] ? do_device_not_available+0xe/0x10
      Nov  5 15:26:04 test1 kernel: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
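
      For reference, the kind of reader mentioned above boils down to the loop below (a minimal sketch, not the exact test program, assuming the liblustreapi API of this era; error handling trimmed). It blocks inside the first llapi_changelog_recv() after the clear, matching the pipe_read stack above:

      #include <stdio.h>
      #include <lustre/lustreapi.h>

      int main(void)
      {
              void *ctx = NULL;
              struct changelog_rec *rec;
              int rc;

              /* Same as "lfs changelog lustre-MDT0000": open a reader
               * starting at record 0. No CHANGELOG_FLAG_BLOCK, so the
               * loop below should drain pending records and return. */
              rc = llapi_changelog_start(&ctx, 0, "lustre-MDT0000", 0);
              if (rc < 0) {
                      fprintf(stderr, "llapi_changelog_start: %d\n", rc);
                      return 1;
              }

              /* After "lfs changelog_clear lustre-MDT0000 cl1 1" (or
               * llapi_changelog_clear("lustre-MDT0000", "cl1", 1)),
               * the first recv never returns. */
              while ((rc = llapi_changelog_recv(ctx, &rec)) == 0) {
                      printf("%llu %02d%s\n",
                             (unsigned long long)rec->cr_index,
                             (int)rec->cr_type,
                             changelog_type2str(rec->cr_type));
                      llapi_changelog_free(&rec);
              }

              llapi_changelog_fini(&ctx);
              return rc < 0 ? 1 : 0;
      }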
      

      I also noticed this stack for the mdc_clg_send thread, and it seems the MDC has lost its connection to the MDT (which is on the same node):

      Nov  5 15:26:04 test1 kernel: mdc_clg_send_ S 0000000000000000     0 16713      2 0x00000080
      Nov  5 15:26:04 test1 kernel: ffff88001fc77ad0 0000000000000046 ffff88001c2f9138 ffff88001bbf04c8
      Nov  5 15:26:04 test1 kernel: ffff88001fc77a50 ffffffffa06f83ea ffff88001bbf0400 00000000ffffffed
      Nov  5 15:26:04 test1 kernel: ffff88002aad9af8 ffff88001fc77fd8 000000000000fbc8 ffff88002aad9af8
      Nov  5 15:26:04 test1 kernel: Call Trace:
      Nov  5 15:26:04 test1 kernel: [<ffffffffa06f83ea>] ? ptlrpc_request_handle_notconn+0x11a/0x290 [ptlrpc]
      Nov  5 15:26:04 test1 kernel: [<ffffffff8152a532>] schedule_timeout+0x192/0x2e0
      Nov  5 15:26:04 test1 kernel: [<ffffffff81083f30>] ? process_timeout+0x0/0x10
      Nov  5 15:26:04 test1 kernel: [<ffffffffa06f73ea>] ptlrpc_set_wait+0x2da/0x860 [ptlrpc]
      Nov  5 15:26:04 test1 kernel: [<ffffffff81061d00>] ? default_wake_function+0x0/0x20
      Nov  5 15:26:04 test1 kernel: [<ffffffffa0702066>] ? lustre_msg_set_jobid+0xb6/0x140 [ptlrpc]
      Nov  5 15:26:04 test1 kernel: [<ffffffffa06f79f7>] ptlrpc_queue_wait+0x87/0x220 [ptlrpc]
      Nov  5 15:26:04 test1 kernel: [<ffffffffa071840c>] llog_client_destroy+0xcc/0x340 [ptlrpc]
      Nov  5 15:26:04 test1 kernel: [<ffffffffa0483421>] llog_destroy+0x51/0x170 [obdclass]
      Nov  5 15:26:04 test1 kernel: [<ffffffffa04859dd>] llog_cat_process_cb+0x39d/0x5e0 [obdclass]
      Nov  5 15:26:04 test1 kernel: [<ffffffffa0480dc2>] llog_process_thread+0x9a2/0xde0 [obdclass]
      Nov  5 15:26:04 test1 kernel: [<ffffffffa0485640>] ? llog_cat_process_cb+0x0/0x5e0 [obdclass]
      Nov  5 15:26:04 test1 kernel: [<ffffffffa0482a1f>] llog_process_or_fork+0x13f/0x540 [obdclass]
      Nov  5 15:26:04 test1 kernel: [<ffffffffa0483d5d>] llog_cat_process_or_fork+0x1ad/0x300 [obdclass]
      Nov  5 15:26:04 test1 kernel: [<ffffffffa07175c3>] ? llog_client_read_header+0x193/0x5e0 [ptlrpc]
      Nov  5 15:26:04 test1 kernel: [<ffffffffa0a27740>] ? changelog_kkuc_cb+0x0/0x490 [mdc]
      Nov  5 15:26:04 test1 kernel: [<ffffffffa0483ec9>] llog_cat_process+0x19/0x20 [obdclass]
      Nov  5 15:26:04 test1 kernel: [<ffffffffa0a2c42b>] mdc_changelog_send_thread+0x4bb/0x890 [mdc]
      Nov  5 15:26:04 test1 kernel: [<ffffffffa0a2bf70>] ? mdc_changelog_send_thread+0x0/0x890 [mdc]
      Nov  5 15:26:04 test1 kernel: [<ffffffff8109abf6>] kthread+0x96/0xa0
      Nov  5 15:26:04 test1 kernel: [<ffffffff8100c20a>] child_rip+0xa/0x20
      Nov  5 15:26:04 test1 kernel: [<ffffffff8109ab60>] ? kthread+0x0/0xa0
      Nov  5 15:26:04 test1 kernel: [<ffffffff8100c200>] ? child_rip+0x0/0x20
      

      Actually it appears the MDC is stuck rapidly reconnecting to the local MDT (a quick way to watch the import state follows the log excerpt):

      Nov  5 15:25:09 test1 kernel: Lustre: lustre-MDT0000-mdc-ffff88001ccf2800: Connection to lustre-MDT0000 (at 0@lo) was lost; in progress operations using this service will wait for recovery to complete
      Nov  5 15:25:09 test1 kernel: Lustre: Skipped 55176 previous similar messages
      Nov  5 15:25:09 test1 kernel: Lustre: lustre-MDT0000: Client 43cec978-39b5-b4ea-ff11-d7e5611258e4 (at 0@lo) reconnecting
      Nov  5 15:25:09 test1 kernel: Lustre: Skipped 55177 previous similar messages
      Nov  5 15:25:09 test1 kernel: Lustre: lustre-MDT0000-mdc-ffff88001ccf2800: Connection restored to lustre-MDT0000 (at 0@lo)
      Nov  5 15:25:09 test1 kernel: Lustre: Skipped 55177 previous similar messages
      Nov  5 15:25:25 test1 kernel: Lustre: lustre-MDT0000-mdc-ffff88001ccf2800: Connection to lustre-MDT0000 (at 0@lo) was lost; in progress operations using this service will wait for recovery to complete
      Nov  5 15:25:25 test1 kernel: Lustre: Skipped 117892 previous similar messages
      Nov  5 15:25:25 test1 kernel: Lustre: lustre-MDT0000: Client 43cec978-39b5-b4ea-ff11-d7e5611258e4 (at 0@lo) reconnecting
      Nov  5 15:25:25 test1 kernel: Lustre: Skipped 117892 previous similar messages
      Nov  5 15:25:25 test1 kernel: Lustre: lustre-MDT0000-mdc-ffff88001ccf2800: Connection restored to lustre-MDT0000 (at 0@lo)
      Nov  5 15:25:25 test1 kernel: Lustre: Skipped 117892 previous similar messages
      Nov  5 15:25:57 test1 kernel: Lustre: lustre-MDT0000-mdc-ffff88001ccf2800: Connection restored to lustre-MDT0000 (at 0@lo)
      Nov  5 15:25:57 test1 kernel: Lustre: Skipped 230776 previous similar messages
      Nov  5 15:25:57 test1 kernel: Lustre: lustre-MDT0000-mdc-ffff88001ccf2800: Connection to lustre-MDT0000 (at 0@lo) was lost; in progress operations using this service will wait for recovery to complete
      Nov  5 15:25:57 test1 kernel: Lustre: Skipped 230777 previous similar messages
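
      For what it's worth, the import state the MDC is cycling through can be watched from userspace; a quick sketch is below (the proc path is an assumption for this node, taken from the obd name in the console messages; "lctl get_param mdc.*.import" shows the same data):

      #include <stdio.h>
      #include <string.h>

      int main(void)
      {
              /* Hypothetical path; matches the obd device name in the
               * console messages above. */
              const char *path = "/proc/fs/lustre/mdc/"
                      "lustre-MDT0000-mdc-ffff88001ccf2800/import";
              FILE *f = fopen(path, "r");
              char line[256];

              if (f == NULL) {
                      perror(path);
                      return 1;
              }
              /* Print just the state and connection lines. */
              while (fgets(line, sizeof(line), f))
                      if (strstr(line, "state") || strstr(line, "connect"))
                              fputs(line, stdout);
              fclose(f);
              return 0;
      }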
      

            People

              Assignee: bfaccini (Bruno Faccini, Inactive)
              Reporter: rread (Robert Read, Inactive)
