LU-6635

sanity-lfsck test_18e:FAIL: (8) .lustre/lost+found/MDT0000/ should not be empty

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.9.0
    • Affects Version/s: Lustre 2.8.0
    • Labels: None
    • Severity: 3

    Description

      This issue was created by maloo for wangdi <di.wang@intel.com>

      This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/35e75e88-012d-11e5-9d1f-5254006e85c2.

      The sub-test test_18e failed with the following error:

      (8) .lustre/lost+found/MDT0000/ should not be empty
      
      CMD: shadow-20vm12 /usr/sbin/lctl get_param -n mdd.lustre-MDT0000.lfsck_layout
      There should be stub file under .lustre/lost+found/MDT0000/
       sanity-lfsck test_18e: @@@@@@ FAIL: (8) .lustre/lost+found/MDT0000/ should not be empty 
        Trace dump:
        = /usr/lib64/lustre/tests/test-framework.sh:4727:error_noexit()
        = /usr/lib64/lustre/tests/test-framework.sh:4758:error()
        = /usr/lib64/lustre/tests/sanity-lfsck.sh:2261:test_18e()
        = /usr/lib64/lustre/tests/test-framework.sh:5020:run_one()
        = /usr/lib64/lustre/tests/test-framework.sh:5057:run_one_logged()
        = /usr/lib64/lustre/tests/test-framework.sh:4907:run_test()
        = /usr/lib64/lustre/tests/sanity-lfsck.sh:2277:main()
      Dumping lctl log to /logdir/test_logs/2015-05-22/lustre-reviews-el6_6-x86_64--review-dne-part-2--2_9_1__32432__-70061688987660-225248/sanity-lfsck.test_18e.*.1432352325.log
      CMD: shadow-20vm10.shadow.whamcloud.com,shadow-20vm11,shadow-20vm12,shadow-20vm8,shadow-20vm9 /usr/sbin/lctl dk > /logdir/test_logs/2015-05-22/lustre-reviews-el6_6-x86_64--review-dne-part-2--2_9_1__32432__-70061688987660-225248/sanity-lfsck.test_18e.debug_log.\$(hostname -s).1432352325.log;
               dmesg > /logdir/test_logs/2015-05-22/lustre-reviews-el6_6-x86_64--review-dne-part-2--2_9_1__32432__-70061688987660-225248/sanity-lfsck.test_18e.dmesg.\$(hostname -s).1432352325.log
      

        Activity

          yong.fan nasf (Inactive) added a comment -
                  wait_update_facet mds1 "$LCTL get_param -n \
                          mdd.$(facet_svc mds1).lfsck_layout |
                          awk '/^status/ { print \\\$2 }'" "scanning-phase2" $LTIME ||
                          error "(3) MDS1 is not the expected 'scanning-phase2'"
          
                  # to guarantee all updates are synced.
                  sync
                  sleep 2
          
                  echo "Write new data to f2 to modify the new created OST-object."
                  echo "dummy" >> $DIR/$tdir/a1/f2
          

          The reason is that some processing was very slow, so before 'echo "dummy" >> $DIR/$tdir/a1/f2' was performed, the layout LFSCK had already woken up from the OBD_FAIL_LFSCK_DELAY3 and handled the target object "f2". The suspect point is the "sync". You can add a timer before and after the "sync" to measure how long it takes in your test.
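
          A minimal sketch of that instrumentation, placed inside the test function ("sync_start" is an illustrative name, not part of the test):

                  # Time the sync to see whether it overruns the
                  # OBD_FAIL_LFSCK_DELAY3 window; if it does, the LFSCK
                  # thread may handle "f2" before the write below lands.
                  local sync_start=$SECONDS
                  sync
                  echo "sync took $((SECONDS - sync_start)) seconds"
                  sleep 2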

          di.wang Di Wang (Inactive) added a comment -

          "would you please check whether there is some logic in your patches that may affect the data being written back to the OST? Thanks!"

          I do not remember doing anything like this.
          di.wang Di Wang (Inactive) added a comment -

          Nasf: yes, this still happens: https://testing.hpdd.intel.com/test_sessions/3f3d32ea-12a4-11e5-b9a9-5254006e85c2

          yong.fan nasf (Inactive) added a comment -

          Di, is this still a problem for you? Or did you not hit it again because it was fixed as a side effect of some patch?

          yong.fan nasf (Inactive) added a comment -

          The main purpose of sanity-lfsck test_18e is the following:

          During the first-stage scanning of the layout LFSCK, if some MDT-object holds a dangling reference to some OST-object, the layout LFSCK cannot yet know whether that dangling reference is correct, so it re-creates the referenced (lost) OST-object. If that decision turns out to be wrong, meaning that during the second-stage scanning the layout LFSCK finds some orphan OST-object that back-references the MDT-object with the dangling reference, then the layout LFSCK will try to replace the re-created OST-object with the orphan OST-object. But before such replacement, it needs to check whether the re-created OST-object has been modified since it was re-created. If not, it replaces it; otherwise there is a conflict, and the layout LFSCK links the orphan OST-object to a new MDT-object under .lustre/lost+found/MDTxxxx/. test_18e is for the latter case.
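
          For reference, the expectation that failed can be checked from a client with a minimal sketch like the following (assuming $MOUNT is the client mount point, as in the test framework):

                  # test_18e expects the conflict path: the orphan OST-object
                  # is linked to a new MDT-object under
                  # .lustre/lost+found/MDT0000/, so the directory must not
                  # be empty after the LFSCK run.
                  ls -A $MOUNT/.lustre/lost+found/MDT0000/ | grep -q . ||
                          echo "(8) .lustre/lost+found/MDT0000/ should not be empty"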

          The LFSCK logic may not be perfect. But the failure here is this: when the layout LFSCK on the OST checked whether the re-created OST-object had been modified since it was re-created, it found that the object was empty! That is strange, because according to the test script we modify the re-created OST-object as follows:

                  echo "Trigger layout LFSCK on all devices to find out orphan OST-object"
                  $START_LAYOUT -r -o -c || error "(2) Fail to start LFSCK for layout!"
          
                  wait_update_facet mds1 "$LCTL get_param -n \
                          mdd.$(facet_svc mds1).lfsck_layout |
                          awk '/^status/ { print \\\$2 }'" "scanning-phase2" $LTIME ||
                          error "(3) MDS1 is not the expected 'scanning-phase2'"
          
                  # to guarantee all updates are synced.
                  sync
                  sleep 2
          
                  echo "Write new data to f2 to modify the new created OST-object."
                  echo "dummy" >> $DIR/$tdir/a1/f2
          
                  do_facet $SINGLEMDS $LCTL set_param fail_val=0 fail_loc=0
          

          And the logs show that the fail_loc was triggered. On the other hand, when the layout LFSCK checks whether the re-created OST-object has been modified since it was re-created, it first takes the LDLM lock; the log shows that it took the lock, but the OST-object was empty.

          00010000:00010000:1.0:1432342702.953829:0:15109:0:(ldlm_request.c:487:ldlm_cli_enqueue_local()) ### client-side local enqueue handler, new lock created ns: filter-lustre-OST0000_UUID lock: ffff880070e43080/0xbd508038ad6121c7 lrc: 3/0,1 mode: EX/EX res: [0x7ac:0x0:0x0].0 rrc: 1 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x40010000000000 nid: local remote: 0x0 expref: -99 pid: 15109 timeout: 0 lvb_type: 0
          00100000:10000000:1.0:1432342702.953874:0:15109:0:(lfsck_layout.c:2037:lfsck_layout_slave_conditional_destroy()) lustre-OST0000-osd: layout LFSCK destroyed the empty OST-object [0x100000000:0x7ac:0x0] that was created for reparing dangling referenced case. But the original missing OST-object is found now.
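
          For reference, the same evidence can be pulled from the collected debug logs with a grep such as the following (the file name follows the pattern dumped by the test above and is illustrative):

                  # If this message appears, the LFSCK found the re-created
                  # OST-object empty and destroyed it rather than treating
                  # it as a conflict.
                  grep 'lfsck_layout_slave_conditional_destroy' \
                          sanity-lfsck.test_18e.debug_log.*.log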
          

          Since the re-created OST-object had not been modified, the layout LFSCK replaced it with the orphan OST-object, which is NOT what test_18e expects.

          Di, would you please check whether there is some logic in your patches that may affect the data being written back to the OST? Thanks!

          pjones Peter Jones added a comment -

          Fan Yong

          Any ideas on this one?

          Peter

          di.wang Di Wang (Inactive) added a comment -

          Another failure: https://testing.hpdd.intel.com/test_sets/71b23c06-0116-11e5-8699-5254006e85c2

          People

            Assignee: yong.fan nasf (Inactive)
            Reporter: maloo Maloo
            Votes: 0
            Watchers: 9
