Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-15086

replay-dual: test_10: @@@@@@ FAIL: test_10 failed with 2

Details

    • Bug
    • Resolution: Fixed
    • Minor
    • Lustre 2.15.0
    • None
    • None
    • 3
    • 9223372036854775807

    Description

      This issue was created by maloo for Vladimir Saveliev <vlaidimir.saveliev@hpe.com>

      This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/2f1edc30-7ddf-4a6c-bc7a-fffbf58435b8

      This is relatively regular failure:

      Error: 'test_10 failed with 2' 
      Failure Rate: 6.00% of most recent 100 runs, 0 skipped (all branches)
      

      Attachments

        Activity

          [LU-15086] replay-dual: test_10: @@@@@@ FAIL: test_10 failed with 2
          pjones Peter Jones added a comment -

          Landed for 2.15

          pjones Peter Jones added a comment - Landed for 2.15

          "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/45308/
          Subject: LU-15086 ptlrpc: fix timeout after spurious wakeup
          Project: fs/lustre-release
          Branch: master
          Current Patch Set:
          Commit: b8383035406a4b7bee2e6d8674eaef480b3e3b35

          gerrit Gerrit Updater added a comment - "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/45308/ Subject: LU-15086 ptlrpc: fix timeout after spurious wakeup Project: fs/lustre-release Branch: master Current Patch Set: Commit: b8383035406a4b7bee2e6d8674eaef480b3e3b35

          with the patch above I can't reproduce the problem. would like to hear Neil's opinion.

          bzzz Alex Zhuravlev added a comment - with the patch above I can't reproduce the problem. would like to hear Neil's opinion.
          pjones Peter Jones added a comment -

          Would we be better off reverting LU-12362 from master?

          pjones Peter Jones added a comment - Would we be better off reverting LU-12362 from master?

          "Alex Zhuravlev <bzzz@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/45308
          Subject: LU-15086 ptlrpc: fix timeout after spurious wakeup
          Project: fs/lustre-release
          Branch: master
          Current Patch Set: 1
          Commit: a495c00874105caa299fdf4dfaa482df9b24ad2e

          gerrit Gerrit Updater added a comment - "Alex Zhuravlev <bzzz@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/45308 Subject: LU-15086 ptlrpc: fix timeout after spurious wakeup Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: a495c00874105caa299fdf4dfaa482df9b24ad2e
          bzzz Alex Zhuravlev added a comment - - edited

          added a simple check to pltrpcd to catch too long waiting:

          LustreError: 5019:0:(ptlrpcd.c:515:ptlrpcd()) ASSERTION( timeout == 0 || end - start < timeout + 4 ) failed: timeout 10, end 97, start 83, diff 14, count 3

          i.e. 10 seconds wait was requested, but actually spent 14 seconds waiting, wait_woken() was called 3 times.

          yet another interesting example from failed replay-dual:

          LustreError: 5009:0:(ptlrpcd.c:513:ptlrpcd()) timeout 19, end 711, start 620, diff 91, count 9
          bzzz Alex Zhuravlev added a comment - - edited added a simple check to pltrpcd to catch too long waiting: LustreError: 5019:0:(ptlrpcd.c:515:ptlrpcd()) ASSERTION( timeout == 0 || end - start < timeout + 4 ) failed: timeout 10, end 97, start 83, diff 14, count 3 i.e. 10 seconds wait was requested, but actually spent 14 seconds waiting, wait_woken() was called 3 times. yet another interesting example from failed replay-dual: LustreError: 5009:0:(ptlrpcd.c:513:ptlrpcd()) timeout 19, end 711, start 620, diff 91, count 9

          my bisection points to one patch:

          COMMIT		TESTED	PASSED	 FAILED		COMMIT DESCRIPTION
          8a770616a5	5	4	1	BAD	 LU-14797 sec: add projid to nodemap
          bbfdc7c167	17	15	2	BAD	 LU-14739 quota: fix quota with root squash enabled
          da1d93513f	7	6	1	BAD	 LU-14475 log: Rewrite some log messages
          885b494632	18	17	1	BAD	 LU-12362 ptlrpc: use wait_woken() in ptlrpcd()
          9ec5e2329b	90	90	0		 LU-6142 lod: return pools_hash_params to being static.
          2a24b6ec67	90	90	0		 LU-14734 ldiskfs: improve message for large_dir
          f7f31f8f96	90	90	0		 LU-12567 ptlrpc: handle reply and resend reorder
          1a409a3e6a	90	90	0		 LU-14711 osc: Do not attempt sending empty pages
          

          I think there are two potential issues in LU-12362:

          -               if (timeout == 0)
          -                       wait_event_idle(set->set_waitq,
          -                                       ptlrpcd_check(&env, pc));
          -               else if (wait_event_idle_timeout(set->set_waitq,
          -                                                ptlrpcd_check(&env, pc),
          -                                                cfs_time_seconds(timeout))
          -                        == 0)
          +
          +               add_wait_queue(&set->set_waitq, &wait);
          +               while (!ptlrpcd_check(&env, pc)) {
          +                       int ret;
          +
          +                       if (timeout == 0)
          +                               ret = wait_woken(&wait, TASK_IDLE,
          +                                                MAX_SCHEDULE_TIMEOUT);
          +                       else
          +                               ret = wait_woken(&wait, TASK_IDLE,
          +                                                cfs_time_seconds(timeout));
          +                       if (ret != 0)
          +                               continue;
          +                       /* Timed out */
                                  ptlrpc_expired_set(set);
          +                       break;
          +               }
          +               remove_wait_queue(&set->set_waitq, &wait);
          +
          

          1) timeout is reused after spurious wakeup, so this can last longer than expected
          2) ptlrpc_expired_set() can be called for the case when timeout=0

          bzzz Alex Zhuravlev added a comment - my bisection points to one patch: COMMIT TESTED PASSED FAILED COMMIT DESCRIPTION 8a770616a5 5 4 1 BAD LU-14797 sec: add projid to nodemap bbfdc7c167 17 15 2 BAD LU-14739 quota: fix quota with root squash enabled da1d93513f 7 6 1 BAD LU-14475 log: Rewrite some log messages 885b494632 18 17 1 BAD LU-12362 ptlrpc: use wait_woken() in ptlrpcd() 9ec5e2329b 90 90 0 LU-6142 lod: return pools_hash_params to being static . 2a24b6ec67 90 90 0 LU-14734 ldiskfs: improve message for large_dir f7f31f8f96 90 90 0 LU-12567 ptlrpc: handle reply and resend reorder 1a409a3e6a 90 90 0 LU-14711 osc: Do not attempt sending empty pages I think there are two potential issues in LU-12362 : - if (timeout == 0) - wait_event_idle(set->set_waitq, - ptlrpcd_check(&env, pc)); - else if (wait_event_idle_timeout(set->set_waitq, - ptlrpcd_check(&env, pc), - cfs_time_seconds(timeout)) - == 0) + + add_wait_queue(&set->set_waitq, &wait); + while (!ptlrpcd_check(&env, pc)) { + int ret; + + if (timeout == 0) + ret = wait_woken(&wait, TASK_IDLE, + MAX_SCHEDULE_TIMEOUT); + else + ret = wait_woken(&wait, TASK_IDLE, + cfs_time_seconds(timeout)); + if (ret != 0) + continue ; + /* Timed out */ ptlrpc_expired_set(set); + break ; + } + remove_wait_queue(&set->set_waitq, &wait); + 1) timeout is reused after spurious wakeup, so this can last longer than expected 2) ptlrpc_expired_set() can be called for the case when timeout=0

          It looks like this has been failing intermittently (once per 4-5 days on full sessions only) since at least 2021-01-01 (the earliest I checked). Since 2021-10-10 it has also failed on 4 review test sessions, and as such is failing more frequently (6x in 2 days). It would be worthwhile to check what changes landed on 10-10 that might have triggered this.

          adilger Andreas Dilger added a comment - It looks like this has been failing intermittently (once per 4-5 days on full sessions only) since at least 2021-01-01 (the earliest I checked). Since 2021-10-10 it has also failed on 4 review test sessions, and as such is failing more frequently (6x in 2 days). It would be worthwhile to check what changes landed on 10-10 that might have triggered this.

          People

            bzzz Alex Zhuravlev
            maloo Maloo
            Votes:
            0 Vote for this issue
            Watchers:
            6 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: