Details

    • Bug
    • Resolution: Fixed
    • Blocker
    • Lustre 2.11.0
    • Lustre 2.11.0
    • None
    • 3

    Description

      https://testing.hpdd.intel.com/test_sets/713fb70e-119d-11e8-a6ad-52540065bddc

      It fails very often:

      Error: 'Timeout occurred after 227 mins, last suite running was sanity-flr, restarting cluster to continue tests' 
      Failure Rate: 41.18% of most recent 17 runs, 22 skipped (all branches)
      

      On a client:

      [10077.749514] Lustre: DEBUG MARKER: == sanity-flr test 43: mirror pick on write ========================================================== 12:14:55 (1518610495)
      [10320.098013] INFO: task dd:23892 blocked for more than 120 seconds.
      [10320.114074] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [10320.116709] dd              D ffff88007b96dee0     0 23892  23675 0x00000080
      [10320.119330] Call Trace:
      [10320.125475]  [<ffffffff810c6632>] ? default_wake_function+0x12/0x20
      [10320.150782]  [<ffffffff810bc2d8>] ? __wake_up_common+0x58/0x90
      [10320.154162]  [<ffffffff816ab8a9>] schedule+0x29/0x70
      [10320.170306]  [<ffffffff816a92b9>] schedule_timeout+0x239/0x2c0
      [10320.176336]  [<ffffffffc09f5e88>] ? ptlrpc_set_add_new_req+0xd8/0x150 [ptlrpc]
      [10320.178829]  [<ffffffffc0bd50c0>] ? osc_io_ladvise_end+0x50/0x50 [osc]
      [10320.181237]  [<ffffffffc0a25ffb>] ? ptlrpcd_add_req+0x22b/0x300 [ptlrpc]
      [10320.183701]  [<ffffffffc09fbe99>] ? ptlrpc_request_bufs_pack+0x1d9/0x480 [ptlrpc]
      [10320.186106]  [<ffffffff816abc5d>] wait_for_completion+0xfd/0x140
      [10320.188437]  [<ffffffff810c6620>] ? wake_up_state+0x20/0x20
      [10320.190651]  [<ffffffffc0bd5284>] osc_io_setattr_end+0xc4/0x180 [osc]
      [10320.192955]  [<ffffffffc0bd63d0>] ? osc_io_setattr_start+0x260/0x700 [osc]
      [10320.195231]  [<ffffffffc0c28490>] ? lov_io_iter_fini_wrapper+0x50/0x50 [lov]
      [10320.197659]  [<ffffffffc0832e8d>] cl_io_end+0x5d/0x150 [obdclass]
      [10320.199802]  [<ffffffffc0c2856b>] lov_io_end_wrapper+0xdb/0xe0 [lov]
      [10320.202033]  [<ffffffffc0c28bc5>] lov_io_call.isra.5+0x85/0x140 [lov]
      [10320.204170]  [<ffffffffc0c28cb6>] lov_io_end+0x36/0xb0 [lov]
      [10320.206291]  [<ffffffffc0832e8d>] cl_io_end+0x5d/0x150 [obdclass]
      [10320.208353]  [<ffffffffc083551f>] cl_io_loop+0x13f/0xc70 [obdclass]
      [10320.210509]  [<ffffffffc0cd1460>] cl_setattr_ost+0x250/0x3c0 [lustre]
      [10320.212550]  [<ffffffffc0cab495>] ll_setattr_raw+0x1165/0x1270 [lustre]
      [10320.214631]  [<ffffffffc0cab60c>] ll_setattr+0x6c/0xd0 [lustre]
      [10320.217542]  [<ffffffff81220fc1>] notify_change+0x2c1/0x420
      [10320.228621]  [<ffffffff812b45b6>] ? security_inode_need_killpriv+0x16/0x20
      [10320.230605]  [<ffffffff81200ad5>] do_truncate+0x75/0xc0
      [10320.232485]  [<ffffffff81211d97>] do_last+0x627/0x12c0
      [10320.234244]  [<ffffffff81212af2>] path_openat+0xc2/0x490
      [10320.236065]  [<ffffffff811af746>] ? do_read_fault.isra.44+0xe6/0x130
      [10320.237871]  [<ffffffff8121508b>] do_filp_open+0x4b/0xb0
      [10320.239642]  [<ffffffff8122233a>] ? __alloc_fd+0x8a/0x130
      [10320.241313]  [<ffffffff81201bc3>] do_sys_open+0xf3/0x1f0
      [10320.243068]  [<ffffffff816b8945>] ? system_call_after_swapgs+0x172/0x214
      [10320.244820]  [<ffffffff81201cde>] SyS_open+0x1e/0x20
      [10320.246469]  [<ffffffff816b89fd>] system_call_fastpath+0x16/0x1b
      [10320.248096]  [<ffffffff816b889d>] ? system_call_after_swapgs+0xca/0x214
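
      A note on reading the trace above: frames prefixed with `?` are addresses the kernel found on the stack but could not confirm as part of the call chain, so the frames without `?` give the more trustworthy path (here, `SyS_open` down to `wait_for_completion` via `osc_io_setattr_end`). As a minimal sketch, assuming the line layout shown in this log, the unmarked frames could be filtered out like this. The function name `reliable_frames` and the `"kernel"` placeholder for frames with no module tag are illustrative choices, not part of any Lustre or kernel tooling.

```python
import re

# Matches one kernel stack-frame line in the layout seen in this ticket, e.g.
#   [10320.190651]  [<ffffffffc0bd5284>] osc_io_setattr_end+0xc4/0x180 [osc]
# Group 1 captures an optional "?" marker, group 2 the symbol name,
# group 3 the optional module name in trailing square brackets.
FRAME_RE = re.compile(
    r"\[\s*\d+\.\d+\]\s+\[<[0-9a-f]+>\]\s+(\?\s+)?([\w.]+)"
    r"\+0x[0-9a-f]+/0x[0-9a-f]+(?:\s+\[(\w+)\])?"
)


def reliable_frames(trace_text):
    """Return (symbol, module) pairs for frames that are not marked with '?'."""
    frames = []
    for line in trace_text.splitlines():
        match = FRAME_RE.search(line)
        if match and match.group(1) is None:
            # Frames without a module tag belong to the core kernel image;
            # "kernel" is just a placeholder label chosen for this sketch.
            frames.append((match.group(2), match.group(3) or "kernel"))
    return frames


# Four lines copied from the trace in this ticket.
sample = """\
[10320.186106]  [<ffffffff816abc5d>] wait_for_completion+0xfd/0x140
[10320.188437]  [<ffffffff810c6620>] ? wake_up_state+0x20/0x20
[10320.190651]  [<ffffffffc0bd5284>] osc_io_setattr_end+0xc4/0x180 [osc]
[10320.192955]  [<ffffffffc0bd63d0>] ? osc_io_setattr_start+0x260/0x700 [osc]
"""

for symbol, module in reliable_frames(sample):
    print(f"{symbol} ({module})")
```

      Run on the full trace, this leaves the chain from `schedule` through `osc_io_setattr_end`, `cl_setattr_ost`, `ll_setattr_raw` and `do_truncate` up to `SyS_open`, which matches a `dd` that opened the file with truncation and is waiting for the OST setattr to complete.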
      

          Activity

            [LU-10670] sanity-flr test 43 timeout

            paf Patrick Farrell (Inactive) added a comment - Ah, OK, I will ask him to rebase.
            bobijam Zhenyu Xu added a comment - The build's parent is e528677e1630093362394ae36d725c321d0da4f2, which does not have this fix.
            paf Patrick Farrell (Inactive) added a comment - Looks like a hit with this fix in: https://testing.hpdd.intel.com/test_sessions/0ecb0a78-c3ff-423c-8fdd-0255a7b3a203
            pjones Peter Jones added a comment - Landed for 2.11

            gerrit Gerrit Updater added a comment -

            Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/31315/
            Subject: LU-10670 test: make sanity-flr test_43 more reliable
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 9af57d0bdce9949dc3fe91817263758b57efbe9b

            paf Patrick Farrell (Inactive) added a comment - Another on master: https://testing.hpdd.intel.com/test_sessions/c96f08e7-6ef5-4ac4-bc20-dd5f9ef8010a
            bogl Bob Glossman (Inactive) added a comment - another on master: https://testing.hpdd.intel.com/test_sets/4ff42be0-16cd-11e8-bd00-52540065bddc

            gerrit Gerrit Updater added a comment -

            Bobi Jam (bobijam@hotmail.com) uploaded a new patch: https://review.whamcloud.com/31315
            Subject: LU-10670 test: make sanity-flr test_43 more reliable
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: cc39d8716708fdec6656aa46bb9a36fe8e770c98

            yujian Jian Yu added a comment - This is a regression introduced by patch https://review.whamcloud.com/30711 for LU-10448. The failure occurred 8 times in one day and is now affecting patch review testing on the master branch.

            People

              bobijam Zhenyu Xu
              tappro Mikhail Pershin