Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-7550

sanity test_27C: FAIL: Can not find 5 in obdidx 0 1 2 3 4 6

Details

    • Bug
    • Resolution: Fixed
    • Major
    • Lustre 2.8.0
    • Lustre 2.8.0
    • None
    • 3
    • 9223372036854775807

    Description

      sanity test 27C failed as follows:

      == sanity test 27C: check full striping across all OSTs == 06:13:56 (1450016036)
      obdidx 0 1 2 3 4 6
      /usr/lib64/lustre/tests/sanity.sh: line 2017: [: obdidx: integer expression expected
      /usr/lib64/lustre/tests/sanity.sh: line 2017: [: obdidx: integer expression expected
      /usr/lib64/lustre/tests/sanity.sh: line 2017: [: obdidx: integer expression expected
      /usr/lib64/lustre/tests/sanity.sh: line 2017: [: obdidx: integer expression expected
      /usr/lib64/lustre/tests/sanity.sh: line 2017: [: obdidx: integer expression expected
      /usr/lib64/lustre/tests/sanity.sh: line 2017: [: obdidx: integer expression expected
       sanity test_27C: @@@@@@ FAIL: Can not find 5 in obdidx 0 1 2 3 4 6 
      

      Maloo report:
      https://testing.hpdd.intel.com/test_sets/86dc5690-a1b9-11e5-b83c-5254006e85c2

      Attachments

        Issue Links

          Activity

            [LU-7550] sanity test_27C: FAIL: Can not find 5 in obdidx 0 1 2 3 4 6
            yujian Jian Yu added a comment -

            Patch landed to master branch for Lustre 2.8.0.

            yujian Jian Yu added a comment - Patch landed to master branch for Lustre 2.8.0.

            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/17691/
            Subject: LU-7550 tests: wait OSTs up in check_seq_oid()
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: ef46d32bc5aa713ab55179bd836d750def0022d7

            gerrit Gerrit Updater added a comment - Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/17691/ Subject: LU-7550 tests: wait OSTs up in check_seq_oid() Project: fs/lustre-release Branch: master Current Patch Set: Commit: ef46d32bc5aa713ab55179bd836d750def0022d7

            Jian Yu (jian.yu@intel.com) uploaded a new patch: http://review.whamcloud.com/17691
            Subject: LU-7550 tests: wait OSTs up in check_seq_oid()
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: eeb8ff65ff75ad902b1702b7b75d63a1ec09c6da

            gerrit Gerrit Updater added a comment - Jian Yu (jian.yu@intel.com) uploaded a new patch: http://review.whamcloud.com/17691 Subject: LU-7550 tests: wait OSTs up in check_seq_oid() Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: eeb8ff65ff75ad902b1702b7b75d63a1ec09c6da

            Reopening because we are seeing some patches fail due to this error with LU-6910 reverted. Reducing priority since the frequency of this failure is one to two times per day.

            jamesanunez James Nunez (Inactive) added a comment - Reopening because we are seeing some patches fail due to this error with LU-6910 reverted. Reducing priority since the frequency of this failure is one to two times per day.

            >So, the failure is a regression introduced by patch http://review.whamcloud.com/15731 for LU-6910.
            Looks like it was wrong, a last fails was not based at LU-6910.

            aboyko Alexander Boyko added a comment - >So, the failure is a regression introduced by patch http://review.whamcloud.com/15731 for LU-6910 . Looks like it was wrong, a last fails was not based at LU-6910 .
            jamesanunez James Nunez (Inactive) added a comment - - edited More failures on master: 2015-12-17 18:15:27 - https://testing.hpdd.intel.com/test_sets/2ffb7bf0-a504-11e5-b68c-5254006e85c2 2015-12-17 21:23:15 - https://testing.hpdd.intel.com/test_sets/21bd6288-a54a-11e5-9f01-5254006e85c2 2015-12-17 21:47:01 - https://testing.hpdd.intel.com/test_sets/e9ac1898-a53b-11e5-b50c-5254006e85c2 2015-12-19 06:39:50 - https://testing.hpdd.intel.com/test_sets/394798f4-a641-11e5-924d-5254006e85c2 2015-12-20 07:49:15 - https://testing.hpdd.intel.com/test_sets/9a4e3e2e-a730-11e5-b560-5254006e85c2
            yujian Jian Yu added a comment -

            The easiest fix might be just wait OSTs up after restart OST in test_27z.

            Yes, Di. After making the following change, sanity test 27C passed regularly against master build https://build.hpdd.intel.com/job/lustre-master/3274/ (which failed before):

            diff --git a/lustre/tests/sanity.sh b/lustre/tests/sanity.sh
            index 7a25e4b..19e8521 100755
            --- a/lustre/tests/sanity.sh
            +++ b/lustre/tests/sanity.sh
            @@ -1885,6 +1885,7 @@ check_seq_oid()
                                            $(facet_mntpt ost$ost)/$obj_file)
                                    unmount_fstype ost$ost
                                    start ost$ost $dev $OST_MOUNT_OPTS
            +                       clients_up
                            fi
            

            https://testing.hpdd.intel.com/test_sessions/918784fc-a36c-11e5-a3ed-5254006e85c2

            yujian Jian Yu added a comment - The easiest fix might be just wait OSTs up after restart OST in test_27z. Yes, Di. After making the following change, sanity test 27C passed regularly against master build https://build.hpdd.intel.com/job/lustre-master/3274/ (which failed before): diff --git a/lustre/tests/sanity.sh b/lustre/tests/sanity.sh index 7a25e4b..19e8521 100755 --- a/lustre/tests/sanity.sh +++ b/lustre/tests/sanity.sh @@ -1885,6 +1885,7 @@ check_seq_oid() $(facet_mntpt ost$ost)/$obj_file) unmount_fstype ost$ost start ost$ost $dev $OST_MOUNT_OPTS + clients_up fi https://testing.hpdd.intel.com/test_sessions/918784fc-a36c-11e5-a3ed-5254006e85c2
            pjones Peter Jones added a comment -

            LU-6910 has been reverted

            pjones Peter Jones added a comment - LU-6910 has been reverted
            di.wang Di Wang added a comment - - edited

            I checked the failure a little bit. and it seems because of OST0005 is not started yet, because check_seq_oid() in test_27z will restart OSTs.

            log in test_27C

            98ad88ca34f2+331:11157:x1520450549370228:12345-10.2.4.165@tcp:103 Request procesed in 63us (226us total) trans 0 rc 0/0
            00000100:00100000:0.0:1450016037.008128:0:28196:0:(nrs_fifo.c:241:nrs_fifo_req_stop()) NRS stop fifo request from 12345-10.2.4.165@tcp, seq: 143
            00020000:00020000:1.0:1450016037.008151:0:21232:0:(lod_qos.c:205:lod_statfs_and_check()) lustre-MDT0000-mdtlov: statfs: rc = -19
            00000004:00080000:1.0:1450016037.013441:0:21232:0:(osp_internal.h:551:osp_update_last_fid()) Gap in objids: start=[0x100000000:0xf6:0x0], count =76
            00000004:00080000:1.0:1450016037.013446:0:21232:0:(osp_object.c:1486:osp_object_create()) Writing gap [0x100000000:0xf6:0x0]+76 in llog
            00000004:00080000:1.0:1450016037.013451:0:21232:0:(osp_object.c:1503:osp_object_create()) lustre-OST0000-osc-MDT0000: Wrote last used FID: [0x100000000:0x142:0x0], index 0: 0
            00000004:00080000:1.0:1450016037.013453:0:21232:0:(osp_internal.h:551:osp_update_last_fid()) Gap in objids: start=[0x100010000:0x124:0x0], count =30
            00000004:00080000:1.0:1450016037.013455:0:21232:0:(osp_object.c:1486:osp_object_create()) Writing gap [0x100010000:0x124:0x0]+30 in llog
            00000004:00080000:1.0:1450016037.013456:0:21232:0:(osp_object.c:1503:osp_object_create()) lustre-OST0001-osc-MDT0000: Wrote last used FID: [0x100010000:0x142:0x0], index 1: 0
            00000004:00080000:1.0:1450016037.013459:0:21232:0:(osp_object.c:1503:osp_object_create()) lustre-OST0002-osc-MDT0000: Wrote last used FID: [0x100020000:0x143:0x0], index 2: 0
            00000004:00080000:1.0:1450016037.013460:0:21232:0:(osp_internal.h:551:osp_update_last_fid()) Gap in objids: start=[0x100030000:0x124:0x0], count =30
            00000004:00080000:1.0:1450016037.013460:0:21232:0:(osp_object.c:1486:osp_object_create()) Writing gap [0x100030000:0x124:0x0]+30 in llog
            00000004:00080000:1.0:1450016037.013461:0:21232:0:(osp_object.c:1503:osp_object_create()) lustre-OST0003-osc-MDT0000: Wrote last used FID: [0x100030000:0x142:0x0], index 3: 0
            00000004:00080000:1.0:1450016037.013462:0:21232:0:(osp_internal.h:551:osp_update_last_fid()) Gap in objids: start=[0x100040000:0x124:0x0], count =30
            00000004:00080000:1.0:1450016037.013463:0:21232:0:(osp_object.c:1486:osp_object_create()) Writing gap [0x100040000:0x124:0x0]+30 in llog
            00000004:00080000:1.0:1450016037.013464:0:21232:0:(osp_object.c:1503:osp_object_create()) lustre-OST0004-osc-MDT0000: Wrote last used FID: [0x100040000:0x142:0x0], index 4: 0
            00000004:00080000:1.0:1450016037.013465:0:21232:0:(osp_internal.h:551:osp_update_last_fid()) Gap in objids: start=[0x100060000:0x124:0x0], count =30
            00000004:00080000:1.0:1450016037.013466:0:21232:0:(osp_object.c:1486:osp_object_create()) Writing gap [0x100060000:0x124:0x0]+30 in llog
            00000004:00080000:1.0:1450016037.013467:0:21232:0:(osp_object.c:1503:osp_object_create()) lustre-OST0006-osc-MDT0000: Wrote last used FID: [0x100060000:0x142:0x0], index 6: 0
            

            return -19 (NODEV) when stat OST0005.

            And OST0005 is just restarted in test_27z

            00000020:01200004:0.0:1450016031.465300:0:17414:0:(obd_mount.c:1276:lustre_fill_super()) VFS Op: sb ffff880064b67000
            00000020:01000004:0.0:1450016031.465318:0:17414:0:(obd_mount.c:843:lmd_print())   mount data:
            00000020:01000004:0.0:1450016031.465319:0:17414:0:(obd_mount.c:846:lmd_print()) device:  /dev/mapper/lvm--Role_OSS-P6
            00000020:01000004:0.0:1450016031.465320:0:17414:0:(obd_mount.c:847:lmd_print()) flags:   0
            00000020:01000004:0.0:1450016031.465321:0:17414:0:(obd_mount.c:850:lmd_print()) options: errors=remount-ro
            00000020:01000004:0.0:1450016031.465323:0:17414:0:(obd_mount.c:1323:lustre_fill_super()) Mounting server from /dev/mapper/lvm--Role_OSS-P6
            00000020:01000004:0.0:1450016031.465326:0:17414:0:(obd_mount_server.c:1686:osd_start()) Attempting to start lustre-OST0005, type=osd-ldiskfs, lsifl=200002, mountfl=0
            00000020:01000004:0.0:1450016031.465367:0:17414:0:(obd_mount.c:194:lustre_start_simple()) Starting obd lustre-OST0005-osd (typ=osd-ldiskfs)
            00000020:00000080:0.0:1450016031.465370:0:17414:0:(obd_config.c:1145:class_process_config()) processing cmd: cf001
            00000020:00000080:0.0:1450016031.465373:0:17414:0:(obd_config.c:359:class_attach()) attach type osd-ldiskfs name: lustre-OST0005-osd uuid: lustre-OST0005-osd_UUID
            00000020:00000080:0.0:1450016031.465429:0:17414:0:(genops.c:371:class_newdev()) Adding new device lustre-OST0005-osd (ffff88005f728078)
            00000020:00000080:0.0:1450016031.465432:0:17414:0:(obd_config.c:429:class_attach()) OBD: dev 17 attached type osd-ldiskfs with refcount 1
            00000020:00000080:0.0:1450016031.465435:0:17414:0:(obd_config.c:1145:class_process_config()) processing cmd: cf003
            00000020:00000080:0.0:1450016031.502031:0:17414:0:(obd_config.c:549:class_setup()) finished setup of obd lustre-OST0005-osd (uuid lustre-OST0005-osd_UUID)
            

            And it probably related with this change http://review.whamcloud.com/15731

            diff --git a/lustre/osp/osp_dev.c b/lustre/osp/osp_dev.c
            index e4af6b6..2a9d30f 100644
            --- a/lustre/osp/osp_dev.c
            +++ b/lustre/osp/osp_dev.c
            @@ -752,7 +752,17 @@ static int osp_statfs(const struct lu_env *env, struct dt_device *dev,
                           LPU64" files, "LPU64" free files\n", d->opd_obd->obd_name,
                           sfs->os_blocks, sfs->os_bfree, sfs->os_bavail,
                           sfs->os_files, sfs->os_ffree);
            -       RETURN(0);
            +
            +       /* ENOSPC could be for two reasons,
            +        * 1) not enough inodes 2) not enough blocks
            +        * for 1) lod should use preallocated objects
            +        * and for 2) shouldn`t. So, here for ENOSPC
            +        * different values is returned to spend preallocated.
            +        */
            +       if (d->opd_pre_status == -ENOSPC && sfs->os_ffree < 32)
            +               RETURN(0);
            +
            +       RETURN(d->opd_pre_status);
             }
            

            The easiest fix might be just wait OSTs up after restart OST in test_27z.

            di.wang Di Wang added a comment - - edited I checked the failure a little bit. and it seems because of OST0005 is not started yet, because check_seq_oid() in test_27z will restart OSTs. log in test_27C 98ad88ca34f2+331:11157:x1520450549370228:12345-10.2.4.165@tcp:103 Request procesed in 63us (226us total) trans 0 rc 0/0 00000100:00100000:0.0:1450016037.008128:0:28196:0:(nrs_fifo.c:241:nrs_fifo_req_stop()) NRS stop fifo request from 12345-10.2.4.165@tcp, seq: 143 00020000:00020000:1.0:1450016037.008151:0:21232:0:(lod_qos.c:205:lod_statfs_and_check()) lustre-MDT0000-mdtlov: statfs: rc = -19 00000004:00080000:1.0:1450016037.013441:0:21232:0:(osp_internal.h:551:osp_update_last_fid()) Gap in objids: start=[0x100000000:0xf6:0x0], count =76 00000004:00080000:1.0:1450016037.013446:0:21232:0:(osp_object.c:1486:osp_object_create()) Writing gap [0x100000000:0xf6:0x0]+76 in llog 00000004:00080000:1.0:1450016037.013451:0:21232:0:(osp_object.c:1503:osp_object_create()) lustre-OST0000-osc-MDT0000: Wrote last used FID: [0x100000000:0x142:0x0], index 0: 0 00000004:00080000:1.0:1450016037.013453:0:21232:0:(osp_internal.h:551:osp_update_last_fid()) Gap in objids: start=[0x100010000:0x124:0x0], count =30 00000004:00080000:1.0:1450016037.013455:0:21232:0:(osp_object.c:1486:osp_object_create()) Writing gap [0x100010000:0x124:0x0]+30 in llog 00000004:00080000:1.0:1450016037.013456:0:21232:0:(osp_object.c:1503:osp_object_create()) lustre-OST0001-osc-MDT0000: Wrote last used FID: [0x100010000:0x142:0x0], index 1: 0 00000004:00080000:1.0:1450016037.013459:0:21232:0:(osp_object.c:1503:osp_object_create()) lustre-OST0002-osc-MDT0000: Wrote last used FID: [0x100020000:0x143:0x0], index 2: 0 00000004:00080000:1.0:1450016037.013460:0:21232:0:(osp_internal.h:551:osp_update_last_fid()) Gap in objids: start=[0x100030000:0x124:0x0], count =30 00000004:00080000:1.0:1450016037.013460:0:21232:0:(osp_object.c:1486:osp_object_create()) Writing gap [0x100030000:0x124:0x0]+30 in llog 00000004:00080000:1.0:1450016037.013461:0:21232:0:(osp_object.c:1503:osp_object_create()) lustre-OST0003-osc-MDT0000: Wrote last used FID: [0x100030000:0x142:0x0], index 3: 0 00000004:00080000:1.0:1450016037.013462:0:21232:0:(osp_internal.h:551:osp_update_last_fid()) Gap in objids: start=[0x100040000:0x124:0x0], count =30 00000004:00080000:1.0:1450016037.013463:0:21232:0:(osp_object.c:1486:osp_object_create()) Writing gap [0x100040000:0x124:0x0]+30 in llog 00000004:00080000:1.0:1450016037.013464:0:21232:0:(osp_object.c:1503:osp_object_create()) lustre-OST0004-osc-MDT0000: Wrote last used FID: [0x100040000:0x142:0x0], index 4: 0 00000004:00080000:1.0:1450016037.013465:0:21232:0:(osp_internal.h:551:osp_update_last_fid()) Gap in objids: start=[0x100060000:0x124:0x0], count =30 00000004:00080000:1.0:1450016037.013466:0:21232:0:(osp_object.c:1486:osp_object_create()) Writing gap [0x100060000:0x124:0x0]+30 in llog 00000004:00080000:1.0:1450016037.013467:0:21232:0:(osp_object.c:1503:osp_object_create()) lustre-OST0006-osc-MDT0000: Wrote last used FID: [0x100060000:0x142:0x0], index 6: 0 return -19 (NODEV) when stat OST0005. And OST0005 is just restarted in test_27z 00000020:01200004:0.0:1450016031.465300:0:17414:0:(obd_mount.c:1276:lustre_fill_super()) VFS Op: sb ffff880064b67000 00000020:01000004:0.0:1450016031.465318:0:17414:0:(obd_mount.c:843:lmd_print()) mount data: 00000020:01000004:0.0:1450016031.465319:0:17414:0:(obd_mount.c:846:lmd_print()) device: /dev/mapper/lvm--Role_OSS-P6 00000020:01000004:0.0:1450016031.465320:0:17414:0:(obd_mount.c:847:lmd_print()) flags: 0 00000020:01000004:0.0:1450016031.465321:0:17414:0:(obd_mount.c:850:lmd_print()) options: errors=remount-ro 00000020:01000004:0.0:1450016031.465323:0:17414:0:(obd_mount.c:1323:lustre_fill_super()) Mounting server from /dev/mapper/lvm--Role_OSS-P6 00000020:01000004:0.0:1450016031.465326:0:17414:0:(obd_mount_server.c:1686:osd_start()) Attempting to start lustre-OST0005, type=osd-ldiskfs, lsifl=200002, mountfl=0 00000020:01000004:0.0:1450016031.465367:0:17414:0:(obd_mount.c:194:lustre_start_simple()) Starting obd lustre-OST0005-osd (typ=osd-ldiskfs) 00000020:00000080:0.0:1450016031.465370:0:17414:0:(obd_config.c:1145:class_process_config()) processing cmd: cf001 00000020:00000080:0.0:1450016031.465373:0:17414:0:(obd_config.c:359:class_attach()) attach type osd-ldiskfs name: lustre-OST0005-osd uuid: lustre-OST0005-osd_UUID 00000020:00000080:0.0:1450016031.465429:0:17414:0:(genops.c:371:class_newdev()) Adding new device lustre-OST0005-osd (ffff88005f728078) 00000020:00000080:0.0:1450016031.465432:0:17414:0:(obd_config.c:429:class_attach()) OBD: dev 17 attached type osd-ldiskfs with refcount 1 00000020:00000080:0.0:1450016031.465435:0:17414:0:(obd_config.c:1145:class_process_config()) processing cmd: cf003 00000020:00000080:0.0:1450016031.502031:0:17414:0:(obd_config.c:549:class_setup()) finished setup of obd lustre-OST0005-osd (uuid lustre-OST0005-osd_UUID) And it probably related with this change http://review.whamcloud.com/15731 diff --git a/lustre/osp/osp_dev.c b/lustre/osp/osp_dev.c index e4af6b6..2a9d30f 100644 --- a/lustre/osp/osp_dev.c +++ b/lustre/osp/osp_dev.c @@ -752,7 +752,17 @@ static int osp_statfs(const struct lu_env *env, struct dt_device *dev, LPU64" files, "LPU64" free files\n", d->opd_obd->obd_name, sfs->os_blocks, sfs->os_bfree, sfs->os_bavail, sfs->os_files, sfs->os_ffree); - RETURN(0); + + /* ENOSPC could be for two reasons, + * 1) not enough inodes 2) not enough blocks + * for 1) lod should use preallocated objects + * and for 2) shouldn`t. So, here for ENOSPC + * different values is returned to spend preallocated. + */ + if (d->opd_pre_status == -ENOSPC && sfs->os_ffree < 32) + RETURN(0); + + RETURN(d->opd_pre_status); } The easiest fix might be just wait OSTs up after restart OST in test_27z.
            yujian Jian Yu added a comment -

            Lustre Build: https://build.hpdd.intel.com/job/lustre-master/3274/

            sanity test 27C failed:
            https://testing.hpdd.intel.com/test_sets/e8fd04bc-a350-11e5-867e-5254006e85c2

            After reverting commit 0585b0fb5895a24f07ca32e830d1fa72b75f4f2b for LU-6910, sanity tests passed regularly:
            https://testing.hpdd.intel.com/test_sessions/5a533ec4-a30b-11e5-a3ed-5254006e85c2

            So, the failure is a regression introduced by patch http://review.whamcloud.com/15731 for LU-6910.

            yujian Jian Yu added a comment - Lustre Build: https://build.hpdd.intel.com/job/lustre-master/3274/ sanity test 27C failed: https://testing.hpdd.intel.com/test_sets/e8fd04bc-a350-11e5-867e-5254006e85c2 After reverting commit 0585b0fb5895a24f07ca32e830d1fa72b75f4f2b for LU-6910 , sanity tests passed regularly: https://testing.hpdd.intel.com/test_sessions/5a533ec4-a30b-11e5-a3ed-5254006e85c2 So, the failure is a regression introduced by patch http://review.whamcloud.com/15731 for LU-6910 .

            People

              yujian Jian Yu
              yujian Jian Yu
              Votes:
              0 Vote for this issue
              Watchers:
              10 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: