replay-single test_20b: BUG: soft lockup - osc_makes_rpc()

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • Lustre 2.9.0
    • Lustre 2.9.0
    • None
    • 3

    Description

      This issue was created by maloo for Niu Yawei <yawei.niu@intel.com>

      This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/a265b3dc-8b49-11e6-ad53-5254006e85c2.

      The sub-test test_20b failed with the following error:

      test failed to respond and timed out
      


      Info required for matching: replay-single 20b

        Activity

          [LU-8680] replay-single test_20b: BUG: soft lockup - osc_makes_rpc()
          pjones Peter Jones made changes -
          Link New: This issue is related to DDN-993 [ DDN-993 ]
          pjones Peter Jones made changes -
          Resolution New: Fixed [ 1 ]
          Status Original: Open [ 1 ] New: Resolved [ 5 ]
          pjones Peter Jones added a comment -

          Landed for 2.9


          gerrit Gerrit Updater added a comment -

          Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/23326/
          Subject: LU-8680 osc: soft lock - osc_makes_rpc()
          Project: fs/lustre-release
          Branch: master
          Current Patch Set:
          Commit: a687000d2400fee88f122526444700502cb57fe4

          gerrit Gerrit Updater added a comment -

          Bobi Jam (bobijam@hotmail.com) uploaded a new patch: http://review.whamcloud.com/23326
          Subject: LU-8680 osc: soft lock - osc_makes_rpc()
          Project: fs/lustre-release
          Branch: master
          Current Patch Set: 1
          Commit: 4b891b4752273c0852bec39e188a6aecfec800de

          jay Jinshan Xiong (Inactive) added a comment - - edited

          Indeed. It occurred to me why I would like to have

          osc_max_write_chunks()
          {
                  return PTLRPC_MAX_BRW_SIZE >> cli->cl_chunkbits;
          }
          

          instead of the current implementation.
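
Editorial sketch of the arithmetic behind the proposal above. It only illustrates what the shift computes and is not Lustre source. MODEL_MAX_BRW_SIZE (16 MiB), struct model_client, and model_max_write_chunks() are hypothetical stand-ins, and the chunk sizes used are assumed values chosen for illustration.

```c
/* Hypothetical stand-in for PTLRPC_MAX_BRW_SIZE; 16 MiB is an assumed value. */
#define MODEL_MAX_BRW_SIZE (16u << 20)

/* Hypothetical stand-in for the client structure; only the chunk shift is modeled. */
struct model_client {
	unsigned int cl_chunkbits;	/* log2 of the chunk size in bytes */
};

/*
 * Number of whole chunks that fit into one maximum-size bulk RPC:
 * dividing by the chunk size is a right shift by cl_chunkbits.
 */
static unsigned int model_max_write_chunks(const struct model_client *cli)
{
	return MODEL_MAX_BRW_SIZE >> cli->cl_chunkbits;
}
```

Under these assumptions a 64 KiB chunk (cl_chunkbits = 16) gives 256 chunks per RPC and a 1 MiB chunk (cl_chunkbits = 20) gives 16, so the limit scales with the chunk size rather than being a fixed count.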

          jay Jinshan Xiong (Inactive) made changes -
          Priority Original: Critical [ 2 ] New: Blocker [ 1 ]

          niu Niu Yawei (Inactive) added a comment -

          One thing I can't see from the LU-8135 patch: when will a large extent (one exceeding the chunk limit) be flushed back?

          niu Niu Yawei (Inactive) added a comment -

          This could be a regression caused by LU-8135. That patch limited the number of chunks in a write RPC, so for an extent with a large number of chunks, osc_check_rpcs() will run into a loop and never break.

          See osc_check_rpcs():

                          if (osc_makes_rpc(cli, osc, OBD_BRW_WRITE)) {
                                  rc = osc_send_write_rpc(env, cli, osc);
                                  if (rc < 0) {
                                          CERROR("Write request failed with %d\n", rc);
          
                                          /* osc_send_write_rpc failed, mostly because of
                                           * memory pressure.
                                           *
                                           * It can't break here, because if:
                                           *  - a page was submitted by osc_io_submit, so
                                           *    page locked;
                                           *  - no request in flight
                                           *  - no subsequent request
                                           * The system will be in live-lock state,
                                           * because there is no chance to call
                                           * osc_io_unplug() and osc_check_rpcs() any
                                           * more. pdflush can't help in this case,
                                           * because it might be blocked at grabbing
                                           * the page lock as we mentioned.
                                           *
                                           * Anyway, continue to drain pages. */
                                          /* break; */
                                  }
                          }
          

          With the LU-8135 fix, osc_send_write_rpc() does nothing when the extent is too large, and osc_check_rpcs() does not break out of the loop but keeps retrying the same object.
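
Editorial sketch of the failure mode described in this comment. It is a simplified user-space model, not the Lustre code and not the landed fix. All names (model_object, model_makes_rpc, model_send_write_rpc, model_check_rpcs) and the max_iters cap are hypothetical. The cap exists only so the model terminates; the loop described above has no such bound.

```c
#include <stdbool.h>

/* Hypothetical model: one object holding a single dirty extent. */
struct model_object {
	unsigned int extent_chunks;	/* chunks in the pending extent */
	bool dirty;			/* extent is still waiting to be written */
};

/* Stand-in for osc_makes_rpc(): true while the object has data to write. */
static bool model_makes_rpc(const struct model_object *obj)
{
	return obj->dirty;
}

/*
 * Stand-in for osc_send_write_rpc(): when the extent exceeds the chunk
 * limit it sends nothing and still returns 0, so the caller sees no error.
 */
static int model_send_write_rpc(struct model_object *obj,
				unsigned int max_chunks, unsigned int *sent)
{
	if (obj->extent_chunks > max_chunks)
		return 0;	/* nothing sent, nothing reported */
	obj->dirty = false;
	(*sent)++;
	return 0;
}

/*
 * Stand-in for the retry loop: keeps calling the send routine while the
 * object still wants an RPC. Returns the number of iterations performed.
 */
static unsigned int model_check_rpcs(struct model_object *obj,
				     unsigned int max_chunks,
				     unsigned int max_iters,
				     unsigned int *sent)
{
	unsigned int iters = 0;

	while (iters < max_iters && model_makes_rpc(obj)) {
		model_send_write_rpc(obj, max_chunks, sent);
		iters++;
	}
	return iters;
}
```

In this model an extent of 64 chunks with a limit of 16 uses up every allowed iteration without sending anything, while an extent of 8 chunks is sent on the first pass and the loop exits. Without the artificial cap the first case would spin indefinitely, which matches the soft lockup reported here.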

          jgmitter Joseph Gmitter (Inactive) made changes -
          Priority Original: Major [ 3 ] New: Critical [ 2 ]

          People

            Assignee: bobijam Zhenyu Xu
            Reporter: maloo Maloo