[LU-8680] replay-single test_20b: BUG: soft lockup - osc_makes_rpc()

Details

    • Bug
    • Resolution: Fixed
    • Blocker
    • Lustre 2.9.0
    • Lustre 2.9.0
    • None
    • 3
    • 9223372036854775807

    Description

      This issue was created by maloo for Niu Yawei <yawei.niu@intel.com>

      This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/a265b3dc-8b49-11e6-ad53-5254006e85c2.

      The sub-test test_20b failed with the following error:

      test failed to respond and timed out
      

      Please provide additional information about the failure here.

      Info required for matching: replay-single 20b

        Activity

          pjones Peter Jones added a comment -

          Landed for 2.9


          gerrit Gerrit Updater added a comment -

          Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/23326/
          Subject: LU-8680 osc: soft lock - osc_makes_rpc()
          Project: fs/lustre-release
          Branch: master
          Current Patch Set:
          Commit: a687000d2400fee88f122526444700502cb57fe4

          gerrit Gerrit Updater added a comment -

          Bobi Jam (bobijam@hotmail.com) uploaded a new patch: http://review.whamcloud.com/23326
          Subject: LU-8680 osc: soft lock - osc_makes_rpc()
          Project: fs/lustre-release
          Branch: master
          Current Patch Set: 1
          Commit: 4b891b4752273c0852bec39e188a6aecfec800de

          jay Jinshan Xiong (Inactive) added a comment - - edited

          Indeed. It occurred to me why I would like to have

          osc_max_write_chunks()
          {
                  return PTLRPC_MAX_BRW_SIZE >> cli->cl_chunkbits;
          }
          

          instead of the current implementation.

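          The shift in the suggested osc_max_write_chunks() can be worked through with small numbers. The following is a minimal sketch, not Lustre code: TOY_MAX_BRW_SIZE, struct toy_client and toy_max_write_chunks() are hypothetical names, and the 4 MiB size is an assumed value chosen only to make the arithmetic easy to follow.

```c
/* Assumed value for illustration only; not taken from the Lustre headers. */
#define TOY_MAX_BRW_SIZE (4U << 20)     /* 4 MiB */

/* Hypothetical stand-in for the client state that holds cl_chunkbits. */
struct toy_client {
        unsigned int chunkbits;         /* log2 of the chunk size in bytes */
};

/* Number of whole chunks that fit in one maximum-sized bulk RPC. */
static inline unsigned int toy_max_write_chunks(const struct toy_client *cli)
{
        return TOY_MAX_BRW_SIZE >> cli->chunkbits;
}
```

          With 64 KiB chunks (chunkbits = 16) this sketch yields 2^22 >> 16 = 64 chunks per RPC; with 1 MiB chunks (chunkbits = 20) it yields 4.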

          niu Niu Yawei (Inactive) added a comment -

          One thing I can't see from the patch for LU-8135: when will a large extent (one exceeding the chunk limit) be flushed back?

          niu Niu Yawei (Inactive) added a comment -

          This could be a regression caused by LU-8135. That patch limited the number of chunks in a write RPC, so for an extent with a large number of chunks, osc_check_rpcs() will run into a loop and never break.

          See osc_check_rpcs():

                          if (osc_makes_rpc(cli, osc, OBD_BRW_WRITE)) {
                                  rc = osc_send_write_rpc(env, cli, osc);
                                  if (rc < 0) {
                                          CERROR("Write request failed with %d\n", rc);

                                          /* osc_send_write_rpc failed, mostly because of
                                           * memory pressure.
                                           *
                                           * It can't break here, because if:
                                           *  - a page was submitted by osc_io_submit, so
                                           *    page locked;
                                           *  - no request in flight
                                           *  - no subsequent request
                                           * The system will be in live-lock state,
                                           * because there is no chance to call
                                           * osc_io_unplug() and osc_check_rpcs() any
                                           * more. pdflush can't help in this case,
                                           * because it might be blocked at grabbing
                                           * the page lock as we mentioned.
                                           *
                                           * Anyway, continue to drain pages. */
                                          /* break; */
                                  }
                          }

          With the fix for LU-8135, osc_send_write_rpc() does nothing when the extent is too large, and osc_check_rpcs() does not break out of the loop but keeps retrying the same object.
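          The loop described above can be modeled in a few lines. This is a simplified sketch under stated assumptions, not the Lustre implementation: every toy_* name and TOY_MAX_CHUNKS_PER_RPC are hypothetical, the real loop walks objects rather than a single counter, and the max_iters guard exists only so the model terminates.

```c
#include <stdbool.h>

/* Hypothetical per-RPC chunk limit, standing in for the LU-8135 restriction. */
#define TOY_MAX_CHUNKS_PER_RPC 4U

/* Toy stand-in for an object with one dirty extent; not a Lustre type. */
struct toy_object {
        unsigned int pending_chunks;    /* chunks waiting to be written */
};

/* Plays the role of osc_makes_rpc(): true while dirty data is pending. */
static bool toy_makes_rpc(const struct toy_object *obj)
{
        return obj->pending_chunks > 0;
}

/* Plays the role of osc_send_write_rpc() as described in the comment:
 * an oversized extent is skipped, yet no error is reported. */
static int toy_send_write_rpc(struct toy_object *obj)
{
        if (obj->pending_chunks > TOY_MAX_CHUNKS_PER_RPC)
                return 0;               /* nothing sent, nothing drained */
        obj->pending_chunks = 0;        /* extent written out */
        return 0;
}

/* Plays the role of the osc_check_rpcs() loop. Returns the number of
 * iterations executed; hitting max_iters means no progress was made. */
static unsigned int toy_check_rpcs(struct toy_object *obj,
                                   unsigned int max_iters)
{
        unsigned int iters = 0;

        while (toy_makes_rpc(obj) && iters < max_iters) {
                (void)toy_send_write_rpc(obj);
                iters++;
        }
        return iters;
}
```

          In this model an extent of 3 chunks drains in one iteration, while an extent of 64 chunks exhausts the guard with pending_chunks unchanged. Without the guard, that second case is the spin that would show up as a soft lockup.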

          jgmitter Joseph Gmitter (Inactive) added a comment -

          Hi Bobijam,

          Could you please look into this issue?

          Thanks.
          Joe

          niu Niu Yawei (Inactive) added a comment -

          It looks like all of these failures happened in interop testing. Sarah, did you ever observe such a failure during interop testing?

          niu Niu Yawei (Inactive) added a comment -

          I searched maloo, and it seems to have first appeared around Sep 24 (I'm not sure why some results don't have console logs, so it's hard to determine the exact date it first appeared). It happens quite often.

          jay Jinshan Xiong (Inactive) added a comment -

          Not to my knowledge.

          People

            Assignee: bobijam Zhenyu Xu
            Reporter: maloo Maloo
            Votes: 0
            Watchers: 7
