
sanity test_118c fails with 'No page in writeback, writeback=0'

Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Minor
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.9.0, Lustre 2.13.0, Lustre 2.10.7, Lustre 2.15.0
    • Labels: None
    • Environment: autotest review-zfs-part-1
    • Severity: 3

    Description

      sanity test 118c fails in review-zfs-part-1 with

      'No page in writeback, writeback=0' 
      

      In the OST console logs, when sanity test 118c passes there is only one call to cfs_fail_timeout_set(). When test 118c fails, the OST console log contains multiple calls to cfs_fail_timeout_set() (a hedged reproduction sketch follows the excerpt below):

      10:48:43:Lustre: DEBUG MARKER: == sanity test 118c: Fsync blocks on EROFS until dirty pages are flushed ============ 10:48:38 (1462297718)
      10:48:43:Lustre: DEBUG MARKER: lctl set_param fail_val=0 fail_loc=0x216
      10:48:43:LustreError: 4699:0:(fail.c:133:__cfs_fail_timeout_set()) cfs_fail_timeout id 216 sleeping for 1000ms
      10:48:43:Lustre: DEBUG MARKER: /usr/sbin/lctl mark  sanity test_118c: @@@@@@ FAIL: No page in writeback, writeback=0 
      10:48:43:Lustre: DEBUG MARKER: sanity test_118c: @@@@@@ FAIL: No page in writeback, writeback=0
      10:48:43:Lustre: DEBUG MARKER: /usr/sbin/lctl dk > /logdir/test_logs/2016-05-03/lustre-reviews-el6_7-x86_64--review-zfs-part-1--1_8_1__38579__-70130514263840-074815/sanity.test_118c.debug_log.$(hostname -s).1462297721.log;
      10:48:43:         dmesg > /logdir/test_logs/2016-05-03/lustre-reviews-el
      10:48:43:LustreError: 4699:0:(fail.c:137:__cfs_fail_timeout_set()) cfs_fail_timeout id 216 awake
      10:50:24:LustreError: 4699:0:(fail.c:133:__cfs_fail_timeout_set()) cfs_fail_timeout id 216 sleeping for 1000ms
      10:50:24:LustreError: 4699:0:(fail.c:137:__cfs_fail_timeout_set()) cfs_fail_timeout id 216 awake
      10:50:24:LustreError: 4699:0:(fail.c:133:__cfs_fail_timeout_set()) cfs_fail_timeout id 216 sleeping for 1000ms
      10:50:24:LustreError: 4699:0:(fail.c:137:__cfs_fail_timeout_set()) cfs_fail_timeout id 216 awake
      10:50:24:Lustre: *** cfs_fail_loc=216, val=0***
      10:50:24:Lustre: Skipped 2 previous similar messages
      10:50:24:Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0 	    fail_val=0 2>/dev/null || true
      10:50:24:Lustre: DEBUG MARKER: for i in $(find /proc/spl -type f); do
      10:50:24:				echo ====${i}=======================;
      10:50:24:				cat ${i};
      10:50:24:				done
      10:50:24:Lustre: DEBUG MARKER: echo ====zpool events -v=======================;
      10:50:24:				zpool events -v
      10:50:24:Lustre: DEBUG MARKER: /usr/sbin/lctl mark == sanity test 118d: Fsync validation inject a delay of the bulk ============ 10:48:54 \(1462297734\)
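
      Below is a minimal reproduction sketch of what this test exercises, assuming a client mount at /mnt/lustre and an OSS host reachable over ssh (both hypothetical names). It is not the actual sanity.sh test_118c code, and the /proc/meminfo check is only a generic stand-in for the test's own writeback counter; the lctl fail_loc commands, however, match the DEBUG MARKER lines in the log above, and fail_loc 0x216 is the OST-side injection behind the EROFS (rc = -30) write failures shown in the client log further down.

      # Hedged reproduction sketch only -- NOT the actual sanity.sh test_118c code.
      # Hypothetical names: $OSS (OSS host), $MNT (client mount), f118c (test file).
      # Needs a coreutils 'sync' that accepts a file argument for the fsync step.
      OSS=${OSS:-oss1}
      MNT=${MNT:-/mnt/lustre}

      # Arm the failure on the OST, as the DEBUG MARKER in the log above shows.
      ssh "$OSS" lctl set_param fail_val=0 fail_loc=0x216

      # Dirty some pages, then fsync in the background; fsync is expected to
      # block until the OST stops failing writes and the dirty pages are flushed.
      dd if=/dev/zero of="$MNT/f118c" bs=1M count=4
      sync "$MNT/f118c" &
      sync_pid=$!

      # While fsync is blocked, some pages should be under writeback. The real
      # test checks its own counters; /proc/meminfo is only a generic stand-in.
      grep Writeback /proc/meminfo

      wait "$sync_pid"

      # Clear the fail_loc afterwards, matching the teardown seen in the log.
      ssh "$OSS" lctl set_param -n fail_loc=0 fail_val=0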
      

      Similarly, on the client, there are multiple calls to osc_brw_redo_request() in the console logs for the failed test (a log-triage sketch follows the excerpt below):

      10:49:59:Lustre: DEBUG MARKER: == sanity test 118c: Fsync blocks on EROFS until dirty pages are flushed ============ 10:48:38 (1462297718)
      10:49:59:Lustre: DEBUG MARKER: /usr/sbin/lctl mark  sanity test_118c: @@@@@@ FAIL: No page in writeback, writeback=0 
      10:49:59:Lustre: DEBUG MARKER: sanity test_118c: @@@@@@ FAIL: No page in writeback, writeback=0
      10:49:59:Lustre: DEBUG MARKER: /usr/sbin/lctl dk > /logdir/test_logs/2016-05-03/lustre-reviews-el6_7-x86_64--review-zfs-part-1--1_8_1__38579__-70130514263840-074815/sanity.test_118c.debug_log.$(hostname -s).1462297721.log;
      10:49:59:         dmesg > /logdir/test_logs/2016-05-03/lustre-reviews-el
      10:49:59:LustreError: 11-0: lustre-OST0001-osc-ffff8800623bc800: operation ost_write to node 10.2.5.40@tcp failed: rc = -30
      10:49:59:LustreError: Skipped 1 previous similar message
      10:49:59:LustreError: 4786:0:(osc_request.c:1527:osc_brw_redo_request()) @@@ redo for recoverable error -30  req@ffff880077ddac80 x1533326821172416/t0(0) o4->lustre-OST0001-osc-ffff8800623bc800@10.2.5.40@tcp:6/4 lens 608/192 e 0 to 0 dl 1462297765 ref 2 fl Interpret:RM/0/0 rc -30/-30
      10:49:59:LustreError: 4786:0:(osc_request.c:1527:osc_brw_redo_request()) Skipped 1 previous similar message
      10:49:59:LustreError: 4786:0:(osc_request.c:1527:osc_brw_redo_request()) @@@ redo for recoverable error -30  req@ffff88007d2f1cc0 x1533326821172448/t0(0) o4->lustre-OST0001-osc-ffff8800623bc800@10.2.5.40@tcp:6/4 lens 608/192 e 0 to 0 dl 1462297767 ref 2 fl Interpret:RM/0/0 rc -30/-30
      10:49:59:LustreError: 4786:0:(osc_request.c:1527:osc_brw_redo_request()) @@@ redo for recoverable error -30  req@ffff880077ddac80 x1533326821172464/t0(0) o4->lustre-OST0001-osc-ffff8800623bc800@10.2.5.40@tcp:6/4 lens 608/192 e 0 to 0 dl 1462297770 ref 2 fl Interpret:RM/0/0 rc -30/-30
      10:49:59:Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0 	    fail_val=0 2>/dev/null || true
      10:49:59:Lustre: DEBUG MARKER: /usr/sbin/lctl mark == sanity test 118d: Fsync validation inject a delay of the bulk ============ 10:48:54 \(1462297734\)
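
      Below is a small log-triage sketch (not part of the test suite) that counts the messages discussed above in saved console logs; the log file names are hypothetical placeholders. A single __cfs_fail_timeout_set() "sleeping" line on the OST matches the passing pattern described in this ticket, while several of them plus repeated osc_brw_redo_request() retries on the client matches the failing pattern.

      # Hedged log-triage sketch; ost-console.log and client-console.log are
      # hypothetical placeholders for saved OST and client console logs.
      ost_log=${1:-ost-console.log}
      client_log=${2:-client-console.log}

      # Passing runs are described as having a single cfs_fail_timeout sleep on
      # the OST; the failing runs above show several, plus client-side redos.
      ost_sleeps=$(grep -c 'cfs_fail_timeout id 216 sleeping' "$ost_log")
      client_redos=$(grep -c 'osc_brw_redo_request' "$client_log")

      echo "OST cfs_fail_timeout sleeps:       $ost_sleeps"
      echo "client osc_brw_redo_request redos: $client_redos"

      # Multiple occurrences of either message match the failure pattern above.
      if [ "$ost_sleeps" -gt 1 ] || [ "$client_redos" -gt 1 ]; then
          echo "pattern matches the test_118c failure described in this ticket"
      fi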
      

      Logs for this failure are at:
      https://testing.hpdd.intel.com/test_sets/f12f9b3e-1184-11e6-b5f1-5254006e85c2
      https://testing.hpdd.intel.com/test_sets/721631e0-f399-11e5-b32f-5254006e85c2
      https://testing.hpdd.intel.com/test_sets/f0c2aa46-c7f2-11e5-80b3-5254006e85c2

    Attachments

    Issue Links

    Activity

            [LU-8101] sanity test_118c fails with 'No page in writeback, writeback=0'
            adilger Andreas Dilger added a comment - - edited

            Track this under LU-16515 since this 7-year-old ticket likely was fixed and the new failures have a different cause.

            arshad512 Arshad Hussain added a comment - +1 on Master ( https://testing.whamcloud.com/test_sets/74277446-d066-41aa-9bc8-b9f7a20f8172 )
            ssmirnov Serguei Smirnov added a comment - +1 on master: https://testing.whamcloud.com/test_sets/b890d117-9879-43bb-aa3e-2ba79c5f2066
            hornc Chris Horn added a comment - +1 on master https://testing.whamcloud.com/test_sets/4ecfbfd1-4b0c-4659-91d1-2c5526930839
            qian_wc Qian Yingjin added a comment - +1 on master: https://testing.whamcloud.com/test_sets/004d8e6c-7e7f-4692-a4ee-be276f3879d0
            yujian Jian Yu added a comment - +1 on master: https://testing.whamcloud.com/test_sets/169b0755-6281-4add-bbb5-38717d9807de
            jamesanunez James Nunez (Inactive) added a comment - A couple more recent failures for DNE/ZFS: https://testing.whamcloud.com/test_sets/0651d843-48ed-4de9-8795-61ff90b775d5 https://testing.whamcloud.com/test_sets/b7e33a0e-80c1-4a3d-9f15-80efd98c7be3
            hornc Chris Horn added a comment - - edited +1 on master https://testing.whamcloud.com/test_sets/9e1b9298-f4db-11e9-a0ba-52540065bddc
            jamesanunez James Nunez (Inactive) added a comment - - edited

            We see a very similar issue for sanity test 118d for 2.10.7 RC1 testing, but no errors in the client logs. Logs are at https://testing.whamcloud.com/test_sets/389bf13a-4363-11e9-9646-52540065bddc .

            Note: this is an ldiskfs test session, not ZFS.

            mdiep Minh Diep added a comment - seen on b2_10 https://testing.hpdd.intel.com/test_sessions/f1b86689-3404-4a90-af08-5efd68c6bceb

    People

      Assignee: wc-triage WC Triage
      Reporter: jamesanunez James Nunez (Inactive)
      Votes: 0
      Watchers: 8

    Dates

      Created:
      Updated:
      Resolved: