Details

    • Bug
    • Resolution: Fixed
    • Minor
    • Lustre 2.9.0
    • None
    • None
    • 3

    Description

      2016-04-14T13:52:45.530794+00:00 c3-2c1s12n0 Lustre: 21681:0:(client.c:1944:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1460641949/real 1460641949] req@ffff881fe6245700 x1530142443559856/t0(0) o101->snx11209-OST0007-osc-ffff880fe6fbbc00@10.149.209.11@o2ib1302:28/4 lens 328/400 e 1 to 1 dl 1460641965 ref 1 fl Rpc:XU/40/ffffffff rc 0/-1
      2016-04-14T13:52:45.530830+00:00 c3-2c1s12n0 Lustre: 21681:0:(client.c:1944:ptlrpc_expire_one_request()) Skipped 224 previous similar messages
      2016-04-14T13:53:17.536750+00:00 c3-2c1s12n0 LustreError: 21681:0:(import.c:646:ptlrpc_connect_import()) already connecting
      2016-04-14T13:53:17.561952+00:00 c3-2c1s12n0 LustreError: 21681:0:(import.c:646:ptlrpc_connect_import()) Skipped 175 previous similar messages
      2016-04-14T13:55:25.623058+00:00 c3-2c1s12n0 Lustre: 21681:0:(client.c:1944:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1460642109/real 1460642109] req@ffff881fe6103180 x1530142443588744/t0(0) o101->snx11209-OST0007-osc-ffff880fe6fbbc00@10.149.209.11@o2ib1302:28/4 lens 328/400 e 1 to 1 dl 1460642125 ref 1 fl Rpc:XU/40/ffffffff rc 0/-1
      2016-04-14T13:55:25.623115+00:00 c3-2c1s12n0 Lustre: 21681:0:(client.c:1944:ptlrpc_expire_one_request()) Skipped 449 previous similar messages
      2016-04-14T13:55:25.648246+00:00 c3-2c1s12n0 LustreError: 21681:0:(import.c:646:ptlrpc_connect_import()) already connecting
      2016-04-14T13:55:25.648306+00:00 c3-2c1s12n0 LustreError: 21681:0:(import.c:646:ptlrpc_connect_import()) Skipped 351 previous similar messages
      2016-04-14T13:57:45.501013+00:00 c3-2c1s12n0 Lustre: snx11209-OST0007-osc-ffff880fe6fbbc00: Connection restored to snx11209-OST0007 (at 10.149.209.11@o2ib1302)
      2016-04-14T13:57:54.435474+00:00 c3-2c1s12n0 LustreError: 22135:0:(osc_cache.c:3107:discard_cb()) ASSERTION( (!(page->cp_type == CPT_CACHEABLE) || (!PageDirty(cl_page_vmpage(page)))) ) failed:
      2016-04-14T13:57:54.460638+00:00 c3-2c1s12n0 LustreError: 22135:0:(osc_cache.c:3107:discard_cb()) LBUG
      2016-04-14T13:57:54.460793+00:00 c3-2c1s12n0 Pid: 22135, comm: ldlm_bl_05
      2016-04-14T13:57:54.460821+00:00 c3-2c1s12n0 Call Trace:
      2016-04-14T13:57:54.460842+00:00 c3-2c1s12n0 [<ffffffff81006109>] try_stack_unwind+0x169/0x1b0
      2016-04-14T13:57:54.485882+00:00 c3-2c1s12n0 [<ffffffff81004b99>] dump_trace+0x89/0x440
      2016-04-14T13:57:54.485909+00:00 c3-2c1s12n0 [<ffffffffa02108c7>] libcfs_debug_dumpstack+0x57/0x80 [libcfs]
      2016-04-14T13:57:54.511077+00:00 c3-2c1s12n0 [<ffffffffa0210e27>] lbug_with_loc+0x47/0xc0 [libcfs]
      2016-04-14T13:57:54.511124+00:00 c3-2c1s12n0 [<ffffffffa06f264a>] discard_cb+0x19a/0x1d0 [osc]
      2016-04-14T13:57:54.511142+00:00 c3-2c1s12n0 [<ffffffffa06f29c8>] osc_page_gang_lookup+0x1b8/0x330 [osc]
      2016-04-14T13:57:54.536311+00:00 c3-2c1s12n0 [<ffffffffa06f2c6b>] osc_lock_discard_pages+0x12b/0x220 [osc]
      2016-04-14T13:57:54.536403+00:00 c3-2c1s12n0 [<ffffffffa06e9408>] osc_lock_flush+0xf8/0x260 [osc]
      2016-04-14T13:57:54.536433+00:00 c3-2c1s12n0 [<ffffffffa06e9651>] osc_lock_cancel+0xe1/0x1c0 [osc]
      2016-04-14T13:57:54.561488+00:00 c3-2c1s12n0 [<ffffffffa035e115>] cl_lock_cancel0+0x75/0x160 [obdclass]
      2016-04-14T13:57:54.561551+00:00 c3-2c1s12n0 [<ffffffffa035ee6b>] cl_lock_cancel+0x13b/0x140 [obdclass]
      2016-04-14T13:57:54.586775+00:00 c3-2c1s12n0 [<ffffffffa06eb3cc>] osc_ldlm_blocking_ast+0x20c/0x330 [osc]
      2016-04-14T13:57:54.586835+00:00 c3-2c1s12n0 [<ffffffffa046e3f4>] ldlm_handle_bl_callback+0xd4/0x430 [ptlrpc]
      2016-04-14T13:57:54.586915+00:00 c3-2c1s12n0 [<ffffffffa046e964>] ldlm_bl_thread_main+0x214/0x460 [ptlrpc]
      2016-04-14T13:57:54.611996+00:00 c3-2c1s12n0 [<ffffffff8107374e>] kthread+0x9e/0xb0
      2016-04-14T13:57:54.612081+00:00 c3-2c1s12n0 [<ffffffff81427bb4>] kernel_thread_helper+0x4/0x10
      2016-04-14T13:57:54.612204+00:00 c3-2c1s12n0 Kernel panic - not syncing: LBUG

          Activity

            [LU-8347] granting conflicting locks

            It's not integrated into the test framework, but the test I attached to LU-8202 (duped to here) does a great job of reproducing this problem during failover (hit rate of >50% in my testing). The LU-8347 patch (I believe together with LU-8175) makes that test pass consistently; without it, we see data corruption in that test.

            https://jira.hpdd.intel.com/secure/attachment/21600/mpi_test.c

            Running the test case is a little complicated:
            https://jira.hpdd.intel.com/browse/LU-8202?focusedCommentId=153406&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-153406

            Exactly what it's doing is described in this comment:
            https://jira.hpdd.intel.com/browse/LU-8202?focusedCommentId=153406&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-153406

            paf Patrick Farrell (Inactive) added a comment - edited
            jhammond John Hammond added a comment -

            Hi Andriy,

            Could you point us to that test or upload it to gerrit?


            Yes, I have a test, but it depends on another ticket.
            The patch disables reprocessing during lock resend, so a waiting lock can't be granted before an already granted lock has been resent.

            askulysh Andriy Skulysh added a comment
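
The idea in the comment above can be illustrated with a toy model. This is a minimal sketch and not Lustre code: `ToyLock`, `ToyResource`, `reprocess`, `resend_granted` and the `resend_pending` flag are all hypothetical names, and every lock is reduced to an exclusive byte range. It assumes that reprocessing means "grant every waiting lock that conflicts with nothing currently granted" and that a resent lock which was already granted goes straight back onto the granted list.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ToyLock:
    owner: str
    start: int
    end: int  # inclusive

    def conflicts(self, other: "ToyLock") -> bool:
        # Toy rule: every lock is exclusive, so overlapping ranges conflict.
        return self.start <= other.end and other.start <= self.end


@dataclass
class ToyResource:
    granted: list = field(default_factory=list)
    waiting: list = field(default_factory=list)
    resend_pending: bool = False  # hypothetical gate modelled on the comment

    def reprocess(self) -> None:
        """Grant waiters that conflict with nothing currently granted."""
        if self.resend_pending:
            return  # gate closed: skip reprocessing until resends are done
        still_waiting = []
        for lock in self.waiting:
            if any(lock.conflicts(g) for g in self.granted):
                still_waiting.append(lock)
            else:
                self.granted.append(lock)
        self.waiting = still_waiting

    def resend_granted(self, lock: ToyLock) -> None:
        """A resent, already granted lock returns to the granted list as-is."""
        self.granted.append(lock)


def simulate(gate_reprocess: bool) -> int:
    """Return how many conflicting pairs end up granted together."""
    res = ToyResource(resend_pending=gate_reprocess)
    res.waiting.append(ToyLock("B", 0, 99))   # the waiter is known first
    res.reprocess()                           # runs before A's resend arrives
    res.resend_granted(ToyLock("A", 0, 99))   # A held this lock before
    res.resend_pending = False                # resends finished
    res.reprocess()
    g = res.granted
    return sum(a.conflicts(b) for i, a in enumerate(g) for b in g[i + 1:])


print("ungated:", simulate(False), "conflicting pair(s)")
print("gated:  ", simulate(True), "conflicting pair(s)")
```

In this model the ungated run grants B before A's lock is back, so both end up granted on the same range; with the gate closed, B stays on the waiting list until A's lock has been restored.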

            I realize you must be talking about resent locks. However, I don't think this patch fixes the problem.

            jay Jinshan Xiong (Inactive) added a comment

            As far as I know, replayed locks are added to the lists directly; they are not processed with the lock enqueue policy at replay time.

            Do you have a reproducer?

            jay Jinshan Xiong (Inactive) added a comment

            The waiting lock's replay could reach the server first and be granted immediately because no conflict exists yet; the granted lock's replay that follows is just placed on the granted list, so conflicting locks can end up granted.

            askulysh Andriy Skulysh added a comment
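
The ordering dependence described in this comment can be shown by trying both arrival orders. This is only a sketch under stated assumptions, not the LDLM implementation: the tuple layout, `REPLAYS`, `replay`, `overlaps` and `has_conflict` are made up for illustration, all locks are treated as exclusive byte ranges, and the only behaviour modelled is what the comment states (a previously granted lock is placed on the granted list without a check, while a previously waiting lock is granted if nothing conflicts at the moment it arrives).

```python
from itertools import permutations

# Each replayed lock: (owner, start, end, was_granted_before_failover).
REPLAYS = [
    ("A", 0, 99, True),     # held by A before the failover
    ("B", 50, 149, False),  # B was still waiting behind A
]


def overlaps(a, b):
    """Inclusive byte ranges overlap (toy rule: all locks are exclusive)."""
    return a[1] <= b[2] and b[1] <= a[2]


def replay(order):
    """Apply replays in arrival order; return (granted, waiting) lists."""
    granted, waiting = [], []
    for lock in order:
        if lock[3]:
            # Previously granted: placed on the granted list, no policy check.
            granted.append(lock)
        elif any(overlaps(lock, g) for g in granted):
            waiting.append(lock)
        else:
            # Nothing conflicts yet, so the waiter is granted immediately.
            granted.append(lock)
    return granted, waiting


def has_conflict(granted):
    """True if any two granted locks overlap."""
    return any(
        overlaps(a, b) for i, a in enumerate(granted) for b in granted[i + 1:]
    )


for order in permutations(REPLAYS):
    granted, _ = replay(order)
    owners = [lock[0] for lock in order]
    print(owners, "conflict" if has_conflict(granted) else "ok")
```

Only the order in which B's replay arrives before A's leaves two overlapping locks granted; the other order leaves B waiting, as it was before the failover.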

            Hi Andriy, can you please describe the root cause of this problem in detail? The assert was removed because it usually fires while an OSC import is being evicted, so the error is not severe enough to justify stopping the client.

            jay Jinshan Xiong (Inactive) added a comment
            jhammond John Hammond added a comment -

            This assertion was removed by http://review.whamcloud.com/#/c/14989/ LU-6271 osc: handle osc eviction correctly.


            Andriy Skulysh (andriy.skulysh@seagate.com) uploaded a new patch: http://review.whamcloud.com/21059
            Subject: LU-8347 ldlm: granting conflicting locks
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 2abc35352d2a1e3be9ee5dd8b2979ccda115c2e1

            gerrit Gerrit Updater added a comment

            People

              wc-triage WC Triage
              askulysh Andriy Skulysh