Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.9.0
    • Components: None
    • Labels: None
    • Severity: 3

    Description

      2016-04-14T13:52:45.530794+00:00 c3-2c1s12n0 Lustre: 21681:0:(client.c:1944:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1460641949/real 1460641949] req@ffff881fe6245700 x1530142443559856/t0(0) o101->snx11209-OST0007-osc-ffff880fe6fbbc00@10.149.209.11@o2ib1302:28/4 lens 328/400 e 1 to 1 dl 1460641965 ref 1 fl Rpc:XU/40/ffffffff rc 0/-1
      2016-04-14T13:52:45.530830+00:00 c3-2c1s12n0 Lustre: 21681:0:(client.c:1944:ptlrpc_expire_one_request()) Skipped 224 previous similar messages
      2016-04-14T13:53:17.536750+00:00 c3-2c1s12n0 LustreError: 21681:0:(import.c:646:ptlrpc_connect_import()) already connecting
      2016-04-14T13:53:17.561952+00:00 c3-2c1s12n0 LustreError: 21681:0:(import.c:646:ptlrpc_connect_import()) Skipped 175 previous similar messages
      2016-04-14T13:55:25.623058+00:00 c3-2c1s12n0 Lustre: 21681:0:(client.c:1944:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1460642109/real 1460642109] req@ffff881fe6103180 x1530142443588744/t0(0) o101->snx11209-OST0007-osc-ffff880fe6fbbc00@10.149.209.11@o2ib1302:28/4 lens 328/400 e 1 to 1 dl 1460642125 ref 1 fl Rpc:XU/40/ffffffff rc 0/-1
      2016-04-14T13:55:25.623115+00:00 c3-2c1s12n0 Lustre: 21681:0:(client.c:1944:ptlrpc_expire_one_request()) Skipped 449 previous similar messages
      2016-04-14T13:55:25.648246+00:00 c3-2c1s12n0 LustreError: 21681:0:(import.c:646:ptlrpc_connect_import()) already connecting
      2016-04-14T13:55:25.648306+00:00 c3-2c1s12n0 LustreError: 21681:0:(import.c:646:ptlrpc_connect_import()) Skipped 351 previous similar messages
      2016-04-14T13:57:45.501013+00:00 c3-2c1s12n0 Lustre: snx11209-OST0007-osc-ffff880fe6fbbc00: Connection restored to snx11209-OST0007 (at 10.149.209.11@o2ib1302)
      2016-04-14T13:57:54.435474+00:00 c3-2c1s12n0 LustreError: 22135:0:(osc_cache.c:3107:discard_cb()) ASSERTION( (!(page->cp_type == CPT_CACHEABLE) || (!PageDirty(cl_page_vmpage(page)))) ) failed:
      2016-04-14T13:57:54.460638+00:00 c3-2c1s12n0 LustreError: 22135:0:(osc_cache.c:3107:discard_cb()) LBUG
      2016-04-14T13:57:54.460793+00:00 c3-2c1s12n0 Pid: 22135, comm: ldlm_bl_05
      2016-04-14T13:57:54.460821+00:00 c3-2c1s12n0 Call Trace:
      2016-04-14T13:57:54.460842+00:00 c3-2c1s12n0 [<ffffffff81006109>] try_stack_unwind+0x169/0x1b0
      2016-04-14T13:57:54.485882+00:00 c3-2c1s12n0 [<ffffffff81004b99>] dump_trace+0x89/0x440
      2016-04-14T13:57:54.485909+00:00 c3-2c1s12n0 [<ffffffffa02108c7>] libcfs_debug_dumpstack+0x57/0x80 [libcfs]
      2016-04-14T13:57:54.511077+00:00 c3-2c1s12n0 [<ffffffffa0210e27>] lbug_with_loc+0x47/0xc0 [libcfs]
      2016-04-14T13:57:54.511124+00:00 c3-2c1s12n0 [<ffffffffa06f264a>] discard_cb+0x19a/0x1d0 [osc]
      2016-04-14T13:57:54.511142+00:00 c3-2c1s12n0 [<ffffffffa06f29c8>] osc_page_gang_lookup+0x1b8/0x330 [osc]
      2016-04-14T13:57:54.536311+00:00 c3-2c1s12n0 [<ffffffffa06f2c6b>] osc_lock_discard_pages+0x12b/0x220 [osc]
      2016-04-14T13:57:54.536403+00:00 c3-2c1s12n0 [<ffffffffa06e9408>] osc_lock_flush+0xf8/0x260 [osc]
      2016-04-14T13:57:54.536433+00:00 c3-2c1s12n0 [<ffffffffa06e9651>] osc_lock_cancel+0xe1/0x1c0 [osc]
      2016-04-14T13:57:54.561488+00:00 c3-2c1s12n0 [<ffffffffa035e115>] cl_lock_cancel0+0x75/0x160 [obdclass]
      2016-04-14T13:57:54.561551+00:00 c3-2c1s12n0 [<ffffffffa035ee6b>] cl_lock_cancel+0x13b/0x140 [obdclass]
      2016-04-14T13:57:54.586775+00:00 c3-2c1s12n0 [<ffffffffa06eb3cc>] osc_ldlm_blocking_ast+0x20c/0x330 [osc]
      2016-04-14T13:57:54.586835+00:00 c3-2c1s12n0 [<ffffffffa046e3f4>] ldlm_handle_bl_callback+0xd4/0x430 [ptlrpc]
      2016-04-14T13:57:54.586915+00:00 c3-2c1s12n0 [<ffffffffa046e964>] ldlm_bl_thread_main+0x214/0x460 [ptlrpc]
      2016-04-14T13:57:54.611996+00:00 c3-2c1s12n0 [<ffffffff8107374e>] kthread+0x9e/0xb0
      2016-04-14T13:57:54.612081+00:00 c3-2c1s12n0 [<ffffffff81427bb4>] kernel_thread_helper+0x4/0x10
      2016-04-14T13:57:54.612204+00:00 c3-2c1s12n0 Kernel panic - not syncing: LBUG

      Attachments

        Issue Links

          Activity

            [LU-8347] granting conflicting locks
            pjones Peter Jones added a comment -

            Landed for 2.9


            gerrit Gerrit Updater added a comment -

            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/21059/
            Subject: LU-8347 ldlm: granting conflicting locks
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: a42a91c783903ea15ad902032166c7e312dad7ee


            askulysh Andriy Skulysh added a comment -

            I've added the test to the patch.

            jhammond John Hammond added a comment -

            Andriy, can you add a test to http://review.whamcloud.com/#/c/21059/?


            paf Patrick Farrell (Inactive) added a comment - edited

            It's not integrated into the test framework, but the test I attached to LU-8202 (duped to here) does a great job of reproducing this problem during failover. (Hit rate of >50% in my testing.) The LU-8347 patch (I believe with LU-8175 as well) makes that test pass consistently; without it, we see data corruption in that test.

            https://jira.hpdd.intel.com/secure/attachment/21600/mpi_test.c

            Running the test case is a little complicated:
            https://jira.hpdd.intel.com/browse/LU-8202?focusedCommentId=153406&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-153406

            Exactly what it's doing is described in this comment:
            https://jira.hpdd.intel.com/browse/LU-8202?focusedCommentId=153406&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-153406

            jhammond John Hammond added a comment -

            Hi Andriy,

            Could you point us to that test or upload it to gerrit?


            askulysh Andriy Skulysh added a comment -

            Yes, I have a test, but it depends on another ticket.
            The patch disables reprocessing during lock resend, so a waiting lock can't be granted before an already granted lock is resent.

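            A minimal standalone C sketch of that idea (simplified, made-up types; not the actual LU-8347 patch or the Lustre API): a resent, previously granted lock goes straight back onto the granted list, but the reprocessing step that could promote a conflicting waiting lock is skipped while the resend is being handled.

            /*
             * Illustrative sketch only: toy stand-ins, not Lustre structures
             * or functions. The 'resent' flag stands in for the information
             * the server derives from the RPC (MSG_RESENT in real Lustre).
             */
            #include <stdbool.h>
            #include <stdio.h>

            struct toy_request {
                bool resent;            /* lock is being resent after reconnect */
                const char *desc;
            };

            static void grant(const struct toy_request *req)
            {
                printf("granting lock: %s\n", req->desc);
            }

            static void reprocess_waiting_locks(void)
            {
                printf("reprocessing waiting list (may promote waiting locks)\n");
            }

            /*
             * The idea of the fix: a resent (previously granted) lock is put
             * back on the granted list, but the waiting list is not reprocessed,
             * so no waiting lock can be granted ahead of locks that are still
             * being resent.
             */
            static void handle_enqueue(const struct toy_request *req)
            {
                grant(req);
                if (!req->resent)
                    reprocess_waiting_locks();
            }

            int main(void)
            {
                struct toy_request resend = { true, "resent, previously granted lock" };
                struct toy_request fresh  = { false, "new enqueue" };

                handle_enqueue(&resend);   /* no reprocessing during resend */
                handle_enqueue(&fresh);    /* normal path reprocesses */
                return 0;
            }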

            jay Jinshan Xiong (Inactive) added a comment -

            I realize you must be talking about resent locks. However, I don't think this patch fixes the problem.


            jay Jinshan Xiong (Inactive) added a comment -

            As far as I know, replayed locks are added to the lists directly, and they won't be processed with the lock enqueue policy at replay time.

            Do you have a reproducer?


            askulysh Andriy Skulysh added a comment -

            The waiting lock replay could reach the server first and get granted immediately, as no conflict exists yet; the next granted lock's replay is then simply placed on the granted list, so conflicting locks can end up granted.

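            A toy model of that replay ordering (simplified, made-up types; not Lustre code): client B's waiting PW lock is replayed first and granted because client A's PW lock is not back on the server yet, then A's replay of the already granted lock is placed straight onto the granted list, leaving two conflicting PW locks granted.

            /*
             * Toy model of the replay race described above; simplified
             * stand-ins only, not actual Lustre structures or functions.
             */
            #include <stdbool.h>
            #include <stdio.h>

            enum mode { MODE_PR, MODE_PW };

            struct toy_lock {
                enum mode mode;
                const char *owner;
                bool granted;
            };

            /* In this toy model only PR/PR is compatible. */
            static bool compatible(enum mode a, enum mode b)
            {
                return a == MODE_PR && b == MODE_PR;
            }

            static bool conflicts_with_granted(struct toy_lock *locks, int n, int idx)
            {
                for (int i = 0; i < n; i++)
                    if (i != idx && locks[i].granted &&
                        !compatible(locks[i].mode, locks[idx].mode))
                        return true;
                return false;
            }

            int main(void)
            {
                /* Before the failure: client A held a granted PW lock,
                 * client B had a conflicting PW lock on the waiting list. */
                struct toy_lock locks[] = {
                    { MODE_PW, "client B (was waiting)", false },
                    { MODE_PW, "client A (was granted)", false },
                };

                /* Replay after restart: B's replay arrives first and is
                 * granted because A's lock is not back yet. */
                if (!conflicts_with_granted(locks, 2, 0))
                    locks[0].granted = true;

                /* A's replay of an already granted lock is then placed
                 * directly onto the granted list, with no check against
                 * what has been granted in the meantime. */
                locks[1].granted = true;

                if (locks[0].granted && locks[1].granted)
                    printf("conflicting PW locks granted to both '%s' and '%s'\n",
                           locks[0].owner, locks[1].owner);
                return 0;
            }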

            People

              Assignee: wc-triage WC Triage
              Reporter: askulysh Andriy Skulysh
              Votes: 0
              Watchers: 9

              Dates

                Created:
                Updated:
                Resolved: