sanity-lfsck test_18e:FAIL: (8) .lustre/lost+found/MDT0000/ should not be empty

Details

    • Bug
    • Resolution: Fixed
    • Minor
    • Lustre 2.9.0
    • Lustre 2.8.0

    Description

      This issue was created by maloo for wangdi <di.wang@intel.com>

      This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/35e75e88-012d-11e5-9d1f-5254006e85c2.

      The sub-test test_18e failed with the following error:

      (8) .lustre/lost+found/MDT0000/ should not be empty
      
      CMD: shadow-20vm12 /usr/sbin/lctl get_param -n mdd.lustre-MDT0000.lfsck_layout
      There should be stub file under .lustre/lost+found/MDT0000/
       sanity-lfsck test_18e: @@@@@@ FAIL: (8) .lustre/lost+found/MDT0000/ should not be empty 
        Trace dump:
        = /usr/lib64/lustre/tests/test-framework.sh:4727:error_noexit()
        = /usr/lib64/lustre/tests/test-framework.sh:4758:error()
        = /usr/lib64/lustre/tests/sanity-lfsck.sh:2261:test_18e()
        = /usr/lib64/lustre/tests/test-framework.sh:5020:run_one()
        = /usr/lib64/lustre/tests/test-framework.sh:5057:run_one_logged()
        = /usr/lib64/lustre/tests/test-framework.sh:4907:run_test()
        = /usr/lib64/lustre/tests/sanity-lfsck.sh:2277:main()
      Dumping lctl log to /logdir/test_logs/2015-05-22/lustre-reviews-el6_6-x86_64--review-dne-part-2--2_9_1__32432__-70061688987660-225248/sanity-lfsck.test_18e.*.1432352325.log
      CMD: shadow-20vm10.shadow.whamcloud.com,shadow-20vm11,shadow-20vm12,shadow-20vm8,shadow-20vm9 /usr/sbin/lctl dk > /logdir/test_logs/2015-05-22/lustre-reviews-el6_6-x86_64--review-dne-part-2--2_9_1__32432__-70061688987660-225248/sanity-lfsck.test_18e.debug_log.\$(hostname -s).1432352325.log;
               dmesg > /logdir/test_logs/2015-05-22/lustre-reviews-el6_6-x86_64--review-dne-part-2--2_9_1__32432__-70061688987660-225248/sanity-lfsck.test_18e.dmesg.\$(hostname -s).1432352325.log
      

        Activity

          [LU-6635] sanity-lfsck test_18e:FAIL: (8) .lustre/lost+found/MDT0000/ should not be empty
          pjones Peter Jones added a comment -

          Landed for 2.9


          gerrit Gerrit Updater added a comment -

          Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/18146/
          Subject: LU-6635 lfsck: block replacing the OST-object for test
          Project: fs/lustre-release
          Branch: master
          Current Patch Set:
          Commit: 7a814e94e065551ab79e2ba75df9626e4940efc5

          yong.fan nasf (Inactive) added a comment -

          The failure occurs because the client-side write happened after the OST had already replaced the newly created OST-object with the old orphan. The fix is to hold the replacement until the write has happened. Patch http://review.whamcloud.com/18146 does that.
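          The ordering described in this comment can be illustrated with a toy model. This is only a sketch and is not taken from the patch: ordinary files stand in for OST-objects, and the names `new_obj`, `gate`, and `decide` are hypothetical. It also assumes that an object which has already been modified is treated as a conflict rather than replaced, which is what the test appears to expect when it looks for a stub file under .lustre/lost+found/MDT0000/.

```shell
# Toy model only: plain files stand in for OST-objects. Not Lustre code.
workdir=$(mktemp -d)
new_obj="$workdir/new_object"   # newly created OST-object, starts empty
gate="$workdir/write_done"      # hypothetical marker: the client write landed
: > "$new_obj"

# Decision step, held behind the gate until the write has happened.
# Assumption: a modified (non-empty) object counts as a conflict and is
# kept; an untouched (empty) one would simply be replaced.
decide() {
    while [ ! -e "$gate" ]; do sleep 1; done
    if [ -s "$new_obj" ]; then echo "conflict"; else echo "replaced"; fi
}

decide > "$workdir/outcome" &
decider=$!

echo "dummy" >> "$new_obj"      # the client write from the test
: > "$gate"                     # release the held step
wait "$decider"

outcome=$(cat "$workdir/outcome")
echo "$outcome"
rm -rf "$workdir"
```

          Without the gate, the decision step could run before the write and see an empty object, which corresponds to the failing order reported in this ticket.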

          gerrit Gerrit Updater added a comment -

          Fan Yong (fan.yong@intel.com) uploaded a new patch: http://review.whamcloud.com/18146
          Subject: LU-6635 lfsck: block repalcing the OST-object for test
          Project: fs/lustre-release
          Branch: master
          Current Patch Set: 1
          Commit: 183abfd4cb2186c1170cd1dfaac31d02df9ddeda

          jamesanunez James Nunez (Inactive) added a comment -

          More failures on master:
          2015-12-12 07:31:16 - https://testing.hpdd.intel.com/test_sets/b1b6505e-a0cf-11e5-9d88-5254006e85c2
          2015-12-15 04:11:00 - https://testing.hpdd.intel.com/test_sets/89178d04-a2f3-11e5-9b3d-5254006e85c2
          2015-12-16 10:58:27 - https://testing.hpdd.intel.com/test_sets/bf7d399a-a413-11e5-b715-5254006e85c2
          2015-12-16 22:16:33 - https://testing.hpdd.intel.com/test_sets/0b63ca3a-a451-11e5-8701-5254006e85c2

          gerrit Gerrit Updater added a comment -

          Fan Yong (fan.yong@intel.com) uploaded a new patch: http://review.whamcloud.com/17025
          Subject: LU-6635 tests: more log message for wait_update
          Project: fs/lustre-release
          Branch: master
          Current Patch Set: 1
          Commit: 68b20adc367826650b1c48a464b4fb500deee788

          jamesanunez James Nunez (Inactive) added a comment - Another failure on master at https://testing.hpdd.intel.com/test_sets/b122edfe-7b2d-11e5-9650-5254006e85c2

          yong.fan nasf (Inactive) added a comment -

          According to the log https://testing.hpdd.intel.com/test_sets/7368a004-4b0c-11e5-bc8b-5254006e85c2, the client-side time is consumed inside the wait_update_facet call in the following part of the test script:

                  $START_LAYOUT -r -o -c || error "(2) Fail to start LFSCK for layout!"

                  wait_update_facet mds1 "$LCTL get_param -n \
                          mdd.$(facet_svc mds1).lfsck_layout |
                          awk '/^status/ { print \\\$2 }'" "scanning-phase2" $LTIME ||
                          error "(3) MDS1 is not the expected 'scanning-phase2'"

                  # to guarantee all updates are synced.
                  sync
                  sleep 2

                  echo "Write new data to f2 to modify the new created OST-object."
                  echo "dummy" >> $DIR/$tdir/a1/f2

          00000001:00000001:1.0:1440478103.384505:0:4206:0:(debug.c:334:libcfs_debug_mark_buffer()) ***************************************************
          00000001:02000400:1.0:1440478103.384506:0:4206:0:(debug.c:335:libcfs_debug_mark_buffer()) DEBUG MARKER: /usr/sbin/lctl get_param -n             mdd.lustre-MDT0000.lfsck_layout |
                          awk '/^status/ { print $2 }'
          00000001:00000001:1.0:1440478103.385590:0:4206:0:(debug.c:336:libcfs_debug_mark_buffer()) ***************************************************
          ...
          00000001:00000001:1.0:1440478124.343827:0:4318:0:(debug.c:334:libcfs_debug_mark_buffer()) ***************************************************
          00000001:02000400:1.0:1440478124.343828:0:4318:0:(debug.c:335:libcfs_debug_mark_buffer()) DEBUG MARKER: /usr/sbin/lctl get_param -n                     mdd.lustre-MDT0000.lfsck_layout |
                                  awk '/^status/ { print $2 }'
          00000001:00000001:1.0:1440478124.344946:0:4318:0:(debug.c:336:libcfs_debug_mark_buffer()) ***************************************************

          Detecting the status is expected to take about 1 second, but in this run the two status queries are about 21 seconds apart. That long interval postponed the subsequent write operation until after LFSCK had replaced the newly created OST-object.

          The client does NOT appear to have been under heavy load, so please check your test scripts to make sure that wait_update_facet() works properly.
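          To make the timing argument in this comment easier to reproduce, here is a minimal sketch of a polling loop that logs how long each status check took, together with a helper that computes the gap between two debug-log timestamps. Both `poll_until` and `marker_interval` are hypothetical helpers written for illustration; they are not the wait_update_facet implementation from test-framework.sh.

```shell
# Hypothetical helper (not test-framework.sh's wait_update_facet):
# poll a command until its output equals the expected value or the
# timeout (in seconds) expires, logging the elapsed time on every poll.
poll_until() {
    local cmd="$1"
    local expect="$2"
    local max_wait="$3"
    local start result elapsed

    start=$(date +%s)
    while :; do
        result=$(eval "$cmd")
        elapsed=$(( $(date +%s) - start ))
        echo "poll: got '$result' after ${elapsed}s" >&2
        [ "$result" = "$expect" ] && return 0
        [ "$elapsed" -ge "$max_wait" ] && return 1
        sleep 1
    done
}

# Seconds between two debug-log timestamps (epoch seconds with fraction).
marker_interval() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", b - a }'
}

# The two DEBUG MARKER timestamps quoted above.
marker_interval 1440478103.384506 1440478124.343828   # prints 20.96
```

          Logging the elapsed time on each poll would show whether the delay comes from the status command itself or from the gaps between polls.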
          di.wang Di Wang added a comment -

          Fan Yong: No, I do not have the logs. I have only seen these failures in Maloo test runs. Here is another failure from yesterday; please check, thanks.
          https://testing.hpdd.intel.com/test_sets/7368a004-4b0c-11e5-bc8b-5254006e85c2

          People

            yong.fan nasf (Inactive)
            maloo Maloo