[LU-7550] sanity test_27C: FAIL: Can not find 5 in obdidx 0 1 2 3 4 6 Created: 14/Dec/15  Updated: 05/Jan/16  Resolved: 05/Jan/16

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.8.0
Fix Version/s: Lustre 2.8.0

Type: Bug Priority: Major
Reporter: Jian Yu Assignee: Jian Yu
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Duplicate
is duplicated by LU-3001 sanity 27C: error: getstripe failed f... Resolved
Related
is related to LU-6910 Configurable values for OST reserved ... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

sanity test 27C failed as follows:

== sanity test 27C: check full striping across all OSTs == 06:13:56 (1450016036)
obdidx 0 1 2 3 4 6
/usr/lib64/lustre/tests/sanity.sh: line 2017: [: obdidx: integer expression expected
/usr/lib64/lustre/tests/sanity.sh: line 2017: [: obdidx: integer expression expected
/usr/lib64/lustre/tests/sanity.sh: line 2017: [: obdidx: integer expression expected
/usr/lib64/lustre/tests/sanity.sh: line 2017: [: obdidx: integer expression expected
/usr/lib64/lustre/tests/sanity.sh: line 2017: [: obdidx: integer expression expected
/usr/lib64/lustre/tests/sanity.sh: line 2017: [: obdidx: integer expression expected
 sanity test_27C: @@@@@@ FAIL: Can not find 5 in obdidx 0 1 2 3 4 6 

Maloo report:
https://testing.hpdd.intel.com/test_sets/86dc5690-a1b9-11e5-b83c-5254006e85c2
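
The "integer expression expected" lines in the output above suggest that the header token "obdidx" reached a numeric comparison along with the OST indexes. A minimal sketch of that kind of check follows, assuming the test looks for every index from 0 to OSTCOUNT-1 in the token list. The helper name missing_ost_indexes is hypothetical and this is not the actual sanity.sh code.

```shell
# Hypothetical helper, not the real sanity.sh logic: print every OST index
# in 0..(ost_count - 1) that does not appear among the remaining arguments.
missing_ost_indexes() {
	local ost_count=$1 idx=0 tok found
	shift
	while [ "$idx" -lt "$ost_count" ]; do
		found=no
		for tok in "$@"; do
			# A non-numeric token such as "obdidx" makes the -eq test
			# fail with "integer expression expected"; the message is
			# silenced here and the token is simply skipped.
			if [ "$tok" -eq "$idx" ] 2>/dev/null; then
				found=yes
				break
			fi
		done
		[ "$found" = yes ] || printf '%s\n' "$idx"
		idx=$((idx + 1))
	done
}

# Token list taken from the failure message, with 7 OSTs assumed.
missing_ost_indexes 7 obdidx 0 1 2 3 4 6
```

With the token list from the failure message, the only index reported as missing is 5, which matches "Can not find 5 in obdidx 0 1 2 3 4 6".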



 Comments   
Comment by Jian Yu [ 14/Dec/15 ]

This is affecting patch review testing on master branch:
https://testing.hpdd.intel.com/test_sets/86dc5690-a1b9-11e5-b83c-5254006e85c2
https://testing.hpdd.intel.com/test_sets/68f586d8-a1e1-11e5-a4da-5254006e85c2
https://testing.hpdd.intel.com/test_sets/a45e6640-a1aa-11e5-8f60-5254006e85c2
https://testing.hpdd.intel.com/test_sets/4a39b5d0-a154-11e5-a4da-5254006e85c2
https://testing.hpdd.intel.com/test_sets/743384e6-a132-11e5-8bbb-5254006e85c2
https://testing.hpdd.intel.com/test_sets/d5d4d5f2-a0ff-11e5-8bbb-5254006e85c2
https://testing.hpdd.intel.com/test_sets/65e48472-a104-11e5-85ed-5254006e85c2

Comment by Peter Jones [ 14/Dec/15 ]

Jian is looking into this

Comment by James Nunez (Inactive) [ 15/Dec/15 ]

Many failures on master:
2015-12-14 02:42:44 - https://testing.hpdd.intel.com/test_sets/1a76d214-a240-11e5-bdef-5254006e85c2
2015-12-14 02:56:37 - https://testing.hpdd.intel.com/test_sets/e31761fa-a229-11e5-bdef-5254006e85c2
2015-12-14 03:02:29 - https://testing.hpdd.intel.com/test_sets/64424a2a-a242-11e5-afd0-5254006e85c2
2015-12-14 03:26:29 - https://testing.hpdd.intel.com/test_sets/29eb7544-a252-11e5-952f-5254006e85c2
2015-12-14 03:38:27 - https://testing.hpdd.intel.com/test_sets/f2ec045e-a23e-11e5-952f-5254006e85c2
2015-12-14 06:57:17 - https://testing.hpdd.intel.com/test_sets/bfc35114-a25b-11e5-afd0-5254006e85c2
2015-12-14 17:01:49 - https://testing.hpdd.intel.com/test_sets/30c0eb58-a2a0-11e5-afd0-5254006e85c2
2015-12-14 17:07:13 - https://testing.hpdd.intel.com/test_sets/d35e351a-a2ae-11e5-bdef-5254006e85c2
2015-12-14 19:49:45 - https://testing.hpdd.intel.com/test_sets/fdc65064-a2c5-11e5-adf1-5254006e85c2
2015-12-14 20:51:12 - https://testing.hpdd.intel.com/test_sets/2cf571de-a2ef-11e5-9b3d-5254006e85c2
2015-12-14 21:22:09 - https://testing.hpdd.intel.com/test_sets/e81e9656-a2f0-11e5-b94e-5254006e85c2
2015-12-14 21:26:06 - https://testing.hpdd.intel.com/test_sets/d3053d92-a2dc-11e5-952f-5254006e85c2
2015-12-14 21:34:36 - https://testing.hpdd.intel.com/test_sets/0d10b2f4-a2f2-11e5-9b3d-5254006e85c2
2015-12-14 21:34:44 - https://testing.hpdd.intel.com/test_sets/6afac616-a2c5-11e5-adf1-5254006e85c2
2015-12-14 23:42:22 - https://testing.hpdd.intel.com/test_sets/4b2781ac-a2da-11e5-bdef-5254006e85c2

Comment by Jian Yu [ 15/Dec/15 ]

Lustre Build: https://build.hpdd.intel.com/job/lustre-master/3274/

sanity test 27C failed:
https://testing.hpdd.intel.com/test_sets/e8fd04bc-a350-11e5-867e-5254006e85c2

After reverting commit 0585b0fb5895a24f07ca32e830d1fa72b75f4f2b for LU-6910, sanity tests passed regularly:
https://testing.hpdd.intel.com/test_sessions/5a533ec4-a30b-11e5-a3ed-5254006e85c2

So, the failure is a regression introduced by patch http://review.whamcloud.com/15731 for LU-6910.

Comment by Di Wang [ 15/Dec/15 ]

I looked into the failure a bit. It seems OST0005 had not finished starting yet, because check_seq_oid() in test_27z restarts the OSTs.

Log from test_27C:

98ad88ca34f2+331:11157:x1520450549370228:12345-10.2.4.165@tcp:103 Request procesed in 63us (226us total) trans 0 rc 0/0
00000100:00100000:0.0:1450016037.008128:0:28196:0:(nrs_fifo.c:241:nrs_fifo_req_stop()) NRS stop fifo request from 12345-10.2.4.165@tcp, seq: 143
00020000:00020000:1.0:1450016037.008151:0:21232:0:(lod_qos.c:205:lod_statfs_and_check()) lustre-MDT0000-mdtlov: statfs: rc = -19
00000004:00080000:1.0:1450016037.013441:0:21232:0:(osp_internal.h:551:osp_update_last_fid()) Gap in objids: start=[0x100000000:0xf6:0x0], count =76
00000004:00080000:1.0:1450016037.013446:0:21232:0:(osp_object.c:1486:osp_object_create()) Writing gap [0x100000000:0xf6:0x0]+76 in llog
00000004:00080000:1.0:1450016037.013451:0:21232:0:(osp_object.c:1503:osp_object_create()) lustre-OST0000-osc-MDT0000: Wrote last used FID: [0x100000000:0x142:0x0], index 0: 0
00000004:00080000:1.0:1450016037.013453:0:21232:0:(osp_internal.h:551:osp_update_last_fid()) Gap in objids: start=[0x100010000:0x124:0x0], count =30
00000004:00080000:1.0:1450016037.013455:0:21232:0:(osp_object.c:1486:osp_object_create()) Writing gap [0x100010000:0x124:0x0]+30 in llog
00000004:00080000:1.0:1450016037.013456:0:21232:0:(osp_object.c:1503:osp_object_create()) lustre-OST0001-osc-MDT0000: Wrote last used FID: [0x100010000:0x142:0x0], index 1: 0
00000004:00080000:1.0:1450016037.013459:0:21232:0:(osp_object.c:1503:osp_object_create()) lustre-OST0002-osc-MDT0000: Wrote last used FID: [0x100020000:0x143:0x0], index 2: 0
00000004:00080000:1.0:1450016037.013460:0:21232:0:(osp_internal.h:551:osp_update_last_fid()) Gap in objids: start=[0x100030000:0x124:0x0], count =30
00000004:00080000:1.0:1450016037.013460:0:21232:0:(osp_object.c:1486:osp_object_create()) Writing gap [0x100030000:0x124:0x0]+30 in llog
00000004:00080000:1.0:1450016037.013461:0:21232:0:(osp_object.c:1503:osp_object_create()) lustre-OST0003-osc-MDT0000: Wrote last used FID: [0x100030000:0x142:0x0], index 3: 0
00000004:00080000:1.0:1450016037.013462:0:21232:0:(osp_internal.h:551:osp_update_last_fid()) Gap in objids: start=[0x100040000:0x124:0x0], count =30
00000004:00080000:1.0:1450016037.013463:0:21232:0:(osp_object.c:1486:osp_object_create()) Writing gap [0x100040000:0x124:0x0]+30 in llog
00000004:00080000:1.0:1450016037.013464:0:21232:0:(osp_object.c:1503:osp_object_create()) lustre-OST0004-osc-MDT0000: Wrote last used FID: [0x100040000:0x142:0x0], index 4: 0
00000004:00080000:1.0:1450016037.013465:0:21232:0:(osp_internal.h:551:osp_update_last_fid()) Gap in objids: start=[0x100060000:0x124:0x0], count =30
00000004:00080000:1.0:1450016037.013466:0:21232:0:(osp_object.c:1486:osp_object_create()) Writing gap [0x100060000:0x124:0x0]+30 in llog
00000004:00080000:1.0:1450016037.013467:0:21232:0:(osp_object.c:1503:osp_object_create()) lustre-OST0006-osc-MDT0000: Wrote last used FID: [0x100060000:0x142:0x0], index 6: 0

The statfs call returns -19 (ENODEV) for OST0005.

And OST0005 had just been restarted in test_27z:

00000020:01200004:0.0:1450016031.465300:0:17414:0:(obd_mount.c:1276:lustre_fill_super()) VFS Op: sb ffff880064b67000
00000020:01000004:0.0:1450016031.465318:0:17414:0:(obd_mount.c:843:lmd_print())   mount data:
00000020:01000004:0.0:1450016031.465319:0:17414:0:(obd_mount.c:846:lmd_print()) device:  /dev/mapper/lvm--Role_OSS-P6
00000020:01000004:0.0:1450016031.465320:0:17414:0:(obd_mount.c:847:lmd_print()) flags:   0
00000020:01000004:0.0:1450016031.465321:0:17414:0:(obd_mount.c:850:lmd_print()) options: errors=remount-ro
00000020:01000004:0.0:1450016031.465323:0:17414:0:(obd_mount.c:1323:lustre_fill_super()) Mounting server from /dev/mapper/lvm--Role_OSS-P6
00000020:01000004:0.0:1450016031.465326:0:17414:0:(obd_mount_server.c:1686:osd_start()) Attempting to start lustre-OST0005, type=osd-ldiskfs, lsifl=200002, mountfl=0
00000020:01000004:0.0:1450016031.465367:0:17414:0:(obd_mount.c:194:lustre_start_simple()) Starting obd lustre-OST0005-osd (typ=osd-ldiskfs)
00000020:00000080:0.0:1450016031.465370:0:17414:0:(obd_config.c:1145:class_process_config()) processing cmd: cf001
00000020:00000080:0.0:1450016031.465373:0:17414:0:(obd_config.c:359:class_attach()) attach type osd-ldiskfs name: lustre-OST0005-osd uuid: lustre-OST0005-osd_UUID
00000020:00000080:0.0:1450016031.465429:0:17414:0:(genops.c:371:class_newdev()) Adding new device lustre-OST0005-osd (ffff88005f728078)
00000020:00000080:0.0:1450016031.465432:0:17414:0:(obd_config.c:429:class_attach()) OBD: dev 17 attached type osd-ldiskfs with refcount 1
00000020:00000080:0.0:1450016031.465435:0:17414:0:(obd_config.c:1145:class_process_config()) processing cmd: cf003
00000020:00000080:0.0:1450016031.502031:0:17414:0:(obd_config.c:549:class_setup()) finished setup of obd lustre-OST0005-osd (uuid lustre-OST0005-osd_UUID)

And it is probably related to this change: http://review.whamcloud.com/15731

diff --git a/lustre/osp/osp_dev.c b/lustre/osp/osp_dev.c
index e4af6b6..2a9d30f 100644
--- a/lustre/osp/osp_dev.c
+++ b/lustre/osp/osp_dev.c
@@ -752,7 +752,17 @@ static int osp_statfs(const struct lu_env *env, struct dt_device *dev,
               LPU64" files, "LPU64" free files\n", d->opd_obd->obd_name,
               sfs->os_blocks, sfs->os_bfree, sfs->os_bavail,
               sfs->os_files, sfs->os_ffree);
-       RETURN(0);
+
+       /* ENOSPC could be for two reasons,
+        * 1) not enough inodes 2) not enough blocks
+        * for 1) lod should use preallocated objects
+        * and for 2) shouldn`t. So, here for ENOSPC
+        * different values is returned to spend preallocated.
+        */
+       if (d->opd_pre_status == -ENOSPC && sfs->os_ffree < 32)
+               RETURN(0);
+
+       RETURN(d->opd_pre_status);
 }
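
To make the effect of the hunk above concrete, here is a small model of the new return value only. It assumes that opd_pre_status holds -19 (ENODEV) while the OST is still unavailable, which the statfs rc = -19 log line hints at but does not prove. The function name osp_statfs_rc is hypothetical, and 28 is the Linux value of ENOSPC.

```shell
# Hypothetical model of the patched osp_statfs() return value.
# $1 = opd_pre_status, $2 = os_ffree (free inodes reported by the OST).
osp_statfs_rc() {
	local pre_status=$1 ffree=$2
	# ENOSPC is 28 on Linux, so -ENOSPC is -28.
	if [ "$pre_status" -eq -28 ] && [ "$ffree" -lt 32 ]; then
		printf '%s\n' 0
	else
		printf '%s\n' "$pre_status"
	fi
}

# Assumed state right after an OST restart: status -19, plenty of inodes.
osp_statfs_rc -19 1000
```

Before the patch the function ended in RETURN(0) unconditionally, so under this assumption the caller would now see -19 where it previously saw 0.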

The easiest fix might be to just wait for the OSTs to come up after restarting them in test_27z.

Comment by Peter Jones [ 15/Dec/15 ]

LU-6910 has been reverted

Comment by Jian Yu [ 15/Dec/15 ]

>The easiest fix might be to just wait for the OSTs to come up after restarting them in test_27z.

Yes, Di. After making the following change, sanity test 27C passed regularly against master build https://build.hpdd.intel.com/job/lustre-master/3274/ (which failed before):

diff --git a/lustre/tests/sanity.sh b/lustre/tests/sanity.sh
index 7a25e4b..19e8521 100755
--- a/lustre/tests/sanity.sh
+++ b/lustre/tests/sanity.sh
@@ -1885,6 +1885,7 @@ check_seq_oid()
                                $(facet_mntpt ost$ost)/$obj_file)
                        unmount_fstype ost$ost
                        start ost$ost $dev $OST_MOUNT_OPTS
+                       clients_up
                fi

https://testing.hpdd.intel.com/test_sessions/918784fc-a36c-11e5-a3ed-5254006e85c2
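
The general idea behind waiting after the restart can be sketched as a poll-with-timeout loop. This is only an illustration: wait_until is a hypothetical helper, not part of the Lustre test framework, and it says nothing about how clients_up is implemented.

```shell
# Hypothetical helper: run a command once per second until it succeeds.
# $1 = timeout in seconds, remaining arguments = command to retry.
# Returns 0 as soon as the command succeeds, 1 if the timeout expires.
wait_until() {
	local timeout=$1 elapsed=0
	shift
	until "$@"; do
		if [ "$elapsed" -ge "$timeout" ]; then
			return 1
		fi
		sleep 1
		elapsed=$((elapsed + 1))
	done
}

# Trivial demonstration with a command that succeeds immediately.
wait_until 5 true && echo "ready"
```

In a test, the retried command would be whatever probe shows that the restarted OST is usable again; the next test then starts only after that probe has succeeded or the timeout has been reported.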

Comment by James Nunez (Inactive) [ 18/Dec/15 ]

More failures on master:
2015-12-17 18:15:27 - https://testing.hpdd.intel.com/test_sets/2ffb7bf0-a504-11e5-b68c-5254006e85c2
2015-12-17 21:23:15 - https://testing.hpdd.intel.com/test_sets/21bd6288-a54a-11e5-9f01-5254006e85c2
2015-12-17 21:47:01 - https://testing.hpdd.intel.com/test_sets/e9ac1898-a53b-11e5-b50c-5254006e85c2
2015-12-19 06:39:50 - https://testing.hpdd.intel.com/test_sets/394798f4-a641-11e5-924d-5254006e85c2
2015-12-20 07:49:15 - https://testing.hpdd.intel.com/test_sets/9a4e3e2e-a730-11e5-b560-5254006e85c2

Comment by Alexander Boyko [ 21/Dec/15 ]

>So, the failure is a regression introduced by patch http://review.whamcloud.com/15731 for LU-6910.
It looks like that was wrong; the latest failures were not based on LU-6910.

Comment by James Nunez (Inactive) [ 21/Dec/15 ]

Reopening because we are seeing some patches fail due to this error with LU-6910 reverted. Reducing priority since the frequency of this failure is one to two times per day.

Comment by Gerrit Updater [ 21/Dec/15 ]

Jian Yu (jian.yu@intel.com) uploaded a new patch: http://review.whamcloud.com/17691
Subject: LU-7550 tests: wait OSTs up in check_seq_oid()
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: eeb8ff65ff75ad902b1702b7b75d63a1ec09c6da

Comment by Gerrit Updater [ 05/Jan/16 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/17691/
Subject: LU-7550 tests: wait OSTs up in check_seq_oid()
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: ef46d32bc5aa713ab55179bd836d750def0022d7

Comment by Jian Yu [ 05/Jan/16 ]

Patch landed to master branch for Lustre 2.8.0.
