Lustre / LU-6635

sanity-lfsck test_18e:FAIL: (8) .lustre/lost+found/MDT0000/ should not be empty

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.9.0
    • Affects Version/s: Lustre 2.8.0
    • Labels: None
    • Severity: 3
    • Rank: 9223372036854775807

    Description

      This issue was created by maloo for wangdi <di.wang@intel.com>

      This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/35e75e88-012d-11e5-9d1f-5254006e85c2.

      The sub-test test_18e failed with the following error:

      (8) .lustre/lost+found/MDT0000/ should not be empty
      
      CMD: shadow-20vm12 /usr/sbin/lctl get_param -n mdd.lustre-MDT0000.lfsck_layout
      There should be stub file under .lustre/lost+found/MDT0000/
       sanity-lfsck test_18e: @@@@@@ FAIL: (8) .lustre/lost+found/MDT0000/ should not be empty 
        Trace dump:
        = /usr/lib64/lustre/tests/test-framework.sh:4727:error_noexit()
        = /usr/lib64/lustre/tests/test-framework.sh:4758:error()
        = /usr/lib64/lustre/tests/sanity-lfsck.sh:2261:test_18e()
        = /usr/lib64/lustre/tests/test-framework.sh:5020:run_one()
        = /usr/lib64/lustre/tests/test-framework.sh:5057:run_one_logged()
        = /usr/lib64/lustre/tests/test-framework.sh:4907:run_test()
        = /usr/lib64/lustre/tests/sanity-lfsck.sh:2277:main()
      Dumping lctl log to /logdir/test_logs/2015-05-22/lustre-reviews-el6_6-x86_64--review-dne-part-2--2_9_1__32432__-70061688987660-225248/sanity-lfsck.test_18e.*.1432352325.log
      CMD: shadow-20vm10.shadow.whamcloud.com,shadow-20vm11,shadow-20vm12,shadow-20vm8,shadow-20vm9 /usr/sbin/lctl dk > /logdir/test_logs/2015-05-22/lustre-reviews-el6_6-x86_64--review-dne-part-2--2_9_1__32432__-70061688987660-225248/sanity-lfsck.test_18e.debug_log.\$(hostname -s).1432352325.log;
               dmesg > /logdir/test_logs/2015-05-22/lustre-reviews-el6_6-x86_64--review-dne-part-2--2_9_1__32432__-70061688987660-225248/sanity-lfsck.test_18e.dmesg.\$(hostname -s).1432352325.log
      

      Attachments

        Activity

          pjones Peter Jones added a comment -

          Landed for 2.9


          gerrit Gerrit Updater added a comment -

          Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/18146/
          Subject: LU-6635 lfsck: block replacing the OST-object for test
          Project: fs/lustre-release
          Branch: master
          Current Patch Set:
          Commit: 7a814e94e065551ab79e2ba75df9626e4940efc5

          yong.fan nasf (Inactive) added a comment -

          The test fails because the client-side write happened after the OST had replaced the newly created OST-object with the old orphan. The fix is to hold the replacement until the write has happened. The patch http://review.whamcloud.com/18146 does that.
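
          The ordering described above can be illustrated with a toy model. This is only a sketch of the idea of gating one step on another; it is not Lustre code and not what the patch contains. The names writer, replacer and the marker file are all hypothetical.

```shell
#!/bin/bash
# Toy model of the ordering fix; none of this is Lustre code.
# "writer" stands in for the client write, "replacer" for the step that
# swaps the OST-object. The marker file is a hypothetical gate.
workdir=$(mktemp -d)
gate=$workdir/write_done
log=$workdir/order.log

replacer() {
	# Hold the replacement until the writer has signalled completion.
	while [ ! -e "$gate" ]; do
		sleep 0.1
	done
	echo "replace" >> "$log"
}

writer() {
	sleep 0.5                 # simulate a delayed client write
	echo "write" >> "$log"
	touch "$gate"             # release the replacer
}

replacer &
writer
wait

# Collapse the log into one line to show the order of events.
order=$(tr '\n' ' ' < "$log")
echo "$order"
rm -rf "$workdir"
```

          Without the gate, the replacer would run immediately and the delayed write would land after the replacement, which is the order seen in the failing runs.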

          gerrit Gerrit Updater added a comment -

          Fan Yong (fan.yong@intel.com) uploaded a new patch: http://review.whamcloud.com/18146
          Subject: LU-6635 lfsck: block repalcing the OST-object for test
          Project: fs/lustre-release
          Branch: master
          Current Patch Set: 1
          Commit: 183abfd4cb2186c1170cd1dfaac31d02df9ddeda

          jamesanunez James Nunez (Inactive) added a comment -

          More failures on master:
          2015-12-12 07:31:16 - https://testing.hpdd.intel.com/test_sets/b1b6505e-a0cf-11e5-9d88-5254006e85c2
          2015-12-15 04:11:00 - https://testing.hpdd.intel.com/test_sets/89178d04-a2f3-11e5-9b3d-5254006e85c2
          2015-12-16 10:58:27 - https://testing.hpdd.intel.com/test_sets/bf7d399a-a413-11e5-b715-5254006e85c2
          2015-12-16 22:16:33 - https://testing.hpdd.intel.com/test_sets/0b63ca3a-a451-11e5-8701-5254006e85c2

          gerrit Gerrit Updater added a comment -

          Fan Yong (fan.yong@intel.com) uploaded a new patch: http://review.whamcloud.com/17025
          Subject: LU-6635 tests: more log message for wait_update
          Project: fs/lustre-release
          Branch: master
          Current Patch Set: 1
          Commit: 68b20adc367826650b1c48a464b4fb500deee788
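
          As a rough illustration of what extra logging around a polling wait could look like, here is a minimal sketch. It is not the content of patch 17025 and not the wait_update implementation in test-framework.sh; poll_until, its arguments and its messages are hypothetical.

```shell
#!/bin/bash
# poll_until: hypothetical stand-in for a wait_update-style helper. It is
# NOT the test-framework.sh implementation and not the content of the patch.
# Usage: poll_until "<command>" "<expected output>" <max seconds>
poll_until() {
	local cmd=$1 expect=$2 max=$3
	local start=$SECONDS result

	while :; do
		result=$(eval "$cmd" 2>/dev/null)
		if [ "$result" = "$expect" ]; then
			# Report how long the wait really took.
			echo "poll_until: got '$expect' after $((SECONDS - start)) sec"
			return 0
		fi
		if [ $((SECONDS - start)) -ge "$max" ]; then
			echo "poll_until: timed out after $((SECONDS - start)) sec, last result '$result'"
			return 1
		fi
		sleep 1
	done
}

# Example call with a command that succeeds on the first probe.
poll_until "echo scanning-phase2" "scanning-phase2" 5
```

          Printing the elapsed time on every exit path makes it visible in the test output whether the wait itself, rather than the operation being waited for, consumed the time.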
          jamesanunez James Nunez (Inactive) added a comment - Another failure on master at https://testing.hpdd.intel.com/test_sets/b122edfe-7b2d-11e5-9650-5254006e85c2

          yong.fan nasf (Inactive) added a comment -

          According to the log https://testing.hpdd.intel.com/test_sets/7368a004-4b0c-11e5-bc8b-5254006e85c2, the client-side time was spent in the wait_update_facet call in the following part of the test script:

                  $START_LAYOUT -r -o -c || error "(2) Fail to start LFSCK for layout!"
          
                  wait_update_facet mds1 "$LCTL get_param -n \
                          mdd.$(facet_svc mds1).lfsck_layout |
                          awk '/^status/ { print \\\$2 }'" "scanning-phase2" $LTIME ||
                          error "(3) MDS1 is not the expected 'scanning-phase2'"
          
                  # to guarantee all updates are synced.
                  sync
                  sleep 2
                  
                  echo "Write new data to f2 to modify the new created OST-object."
                  echo "dummy" >> $DIR/$tdir/a1/f2
          
          00000001:00000001:1.0:1440478103.384505:0:4206:0:(debug.c:334:libcfs_debug_mark_buffer()) ***************************************************
          00000001:02000400:1.0:1440478103.384506:0:4206:0:(debug.c:335:libcfs_debug_mark_buffer()) DEBUG MARKER: /usr/sbin/lctl get_param -n             mdd.lustre-MDT0000.lfsck_layout |
                          awk '/^status/ { print $2 }'
          00000001:00000001:1.0:1440478103.385590:0:4206:0:(debug.c:336:libcfs_debug_mark_buffer()) ***************************************************
          ...
          00000001:00000001:1.0:1440478124.343827:0:4318:0:(debug.c:334:libcfs_debug_mark_buffer()) ***************************************************
          00000001:02000400:1.0:1440478124.343828:0:4318:0:(debug.c:335:libcfs_debug_mark_buffer()) DEBUG MARKER: /usr/sbin/lctl get_param -n                     mdd.lustre-MDT0000.lfsck_layout |
                                  awk '/^status/ { print $2 }'
          00000001:00000001:1.0:1440478124.344946:0:4318:0:(debug.c:336:libcfs_debug_mark_buffer()) ***************************************************
          

          Detecting the expected status should take about 1 second, but in this run it took about 21 seconds (from 1440478103 to 1440478124). This long interval delayed the subsequent write operation until after LFSCK had replaced the newly created OST-object.

          The client does not appear to have been under heavy load, so please check your test scripts to make sure that wait_update_facet() works correctly.
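
          The roughly 21-second gap can be read off the two DEBUG MARKER timestamps quoted above. A minimal sketch, assuming the debug log lines use the colon-separated layout shown there, with the epoch timestamp in the fourth field; marker_gap is a hypothetical helper, not part of the Lustre test framework:

```shell
#!/bin/bash
# marker_gap: hypothetical helper, not part of the Lustre test framework.
# Reads debug log lines on stdin and prints the seconds elapsed between
# the first and the last "DEBUG MARKER" line.
# Assumes the colon-separated layout quoted above, where the fourth
# field is the epoch timestamp (seconds.microseconds).
marker_gap() {
	awk -F: '
		/DEBUG MARKER/ {
			if (first == "") first = $4
			last = $4
		}
		END {
			if (first != "") printf "%.1f\n", last - first
		}'
}

# The two marker lines quoted in the comment above (whitespace collapsed).
marker_gap <<'EOF'
00000001:02000400:1.0:1440478103.384506:0:4206:0:(debug.c:335:libcfs_debug_mark_buffer()) DEBUG MARKER: /usr/sbin/lctl get_param -n mdd.lustre-MDT0000.lfsck_layout |
00000001:02000400:1.0:1440478124.343828:0:4318:0:(debug.c:335:libcfs_debug_mark_buffer()) DEBUG MARKER: /usr/sbin/lctl get_param -n mdd.lustre-MDT0000.lfsck_layout |
EOF
```

          For the two quoted lines this prints 21.0, matching the interval stated in the comment.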
          di.wang Di Wang added a comment -

          Fan Yong: No, I do not have the logs; I have only seen these failures in Maloo tests. Here is another failure from yesterday. Please check, thanks.
          https://testing.hpdd.intel.com/test_sets/7368a004-4b0c-11e5-bc8b-5254006e85c2

          yong.fan nasf (Inactive) added a comment - edited

          Di, do you have the logs with "sync" time measured?
          Or can we close the ticket if it is not valid any longer?


          People

            Assignee: yong.fan nasf (Inactive)
            Reporter: maloo Maloo
            Votes: 0
            Watchers: 9
