[LU-7118] sanity-scrub: No sub tests failed in this test set Created: 08/Sep/15  Updated: 22/Sep/15  Resolved: 22/Sep/15

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.8.0
Fix Version/s: Lustre 2.8.0

Type: Bug Priority: Blocker
Reporter: Maloo Assignee: Yang Sheng
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Duplicate
is duplicated by LU-7123 sanity-scrub: OST shows unable to mount Resolved
Related
is related to LU-5645 Fail to start sanity-scrub Resolved
is related to LU-6827 sanity-scrub: Failed mount OST on ldi... Resolved
is related to LU-7193 sanity-scrub: No sub tests failed in ... Closed
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This issue was created by maloo for Bob Glossman <bob.glossman@intel.com>

I've seen a lot of sanity-scrub instances fail entirely lately. It looks like some kind of TEI issue to me, as it shows up in test runs on lots of different and unrelated mods. No logs are collected, and the summary always says:

Failed subtests

No sub tests failed in this test set.

All subtests

This test set does not have any sub tests.

Maybe something really bad landed that blocks any sanity-scrub from running.

This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/aaf64806-5682-11e5-a9bc-5254006e85c2.



 Comments   
Comment by Andreas Dilger [ 09/Sep/15 ]

This can often happen if there was a problem during cleanup/unmount/module unload after all the subtests have finished.

Comment by James Nunez (Inactive) [ 09/Sep/15 ]

In the suite_stdout logs from the failure for this ticket:

23:31:38:shadow-40vm8: mount.lustre: mount /dev/mapper/lvm--Role_OSS-P1 at /mnt/ost1 failed: Cannot send after transport endpoint shutdown

There are no other logs to see what the actual problem with the OSS might be. It looks like the previous test suite, sanity-quota, completed with no errors.

This failure matches the one in LU-5645. So, this is probably a duplicate of LU-5645.

There are many instances of sanity-scrub failing; here are a few on review-dne-part-2:
2015-09-09 03:06:45 - https://testing.hpdd.intel.com/test_sets/5bd4b8c2-56cb-11e5-84d0-5254006e85c2
2015-09-09 07:25:42 - https://testing.hpdd.intel.com/test_sets/9d6af5e0-56ed-11e5-84d0-5254006e85c2
2015-09-09 09:13:53 - https://testing.hpdd.intel.com/test_sets/40ac0b3a-56fe-11e5-8947-5254006e85c2

Comment by Bob Glossman (Inactive) [ 09/Sep/15 ]

another on master:
https://testing.hpdd.intel.com/test_sets/78b06ad8-5732-11e5-a2e1-5254006e85c2

Comment by Sarah Liu [ 09/Sep/15 ]

I have seen a similar issue on the tip of master with RHEL 7.1 servers. Not sure if they are the same: LU-7123.

Comment by Bob Glossman (Inactive) [ 09/Sep/15 ]

Sarah, I think your LU-7123 may very well be the same thing, and maybe our only clue to what's going on. As I understand it, running on el7 gets timestamps prepended to all console log lines. This in turn causes maloo to misplace the console logs onto lustre-init instead of putting them where they should be. I'm guessing the same info may be in the console logs on el6 runs, but those either aren't captured or are thrown away.

Having the console logs misplaced may be a good thing in this case: it preserves them and lets us see info not captured during failures on el6.

Comment by Joseph Gmitter (Inactive) [ 10/Sep/15 ]

Can you please have a look at this issue? Can you dig into it further to get more info and see what may be happening?
Thanks
Joe

Comment by Andreas Dilger [ 10/Sep/15 ]

There are no console logs available to show what is happening. The MDT and OST were just reformatted and mounted so it is strange that there is an error.

Comment by Jian Yu [ 10/Sep/15 ]

The first occurrence of this failure is in report https://testing.hpdd.intel.com/test_sets/aaf64806-5682-11e5-a9bc-5254006e85c2 of patch http://review.whamcloud.com/16315, which is based on 01ca899324738343279c1d63823b7fab937197dc (LU-7079 ptlrpc: imp_peer_committed_transno should increase), which landed 6 days ago. So, it seems this is not a regression introduced by the recent landings.

Comment by Peter Jones [ 10/Sep/15 ]

Yang Sheng

Could you please investigate?

Thanks

Peter

Comment by Jian Yu [ 10/Sep/15 ]

This report https://testing.hpdd.intel.com/test_sets/5bd4b8c2-56cb-11e5-84d0-5254006e85c2 (failed on 2015-09-09) of http://review.whamcloud.com/16129 patch set 4 is based on commit 1f4d68d334c85a8106f5939351991b80449e5713 (LU-6947 ptlrpc: Remove stray comment in ptlrpc_start_pinger), which was landed on 2015-08-26. So, it's not related to the recent Lustre landings.

Is the issue related to the recent RHEL 6.7 change in the autotest system?

Comment by Sarah Liu [ 10/Sep/15 ]

If this is a dup of LU-7123, then we can find the MDS and OSS console logs there.

Comment by Bob Glossman (Inactive) [ 10/Sep/15 ]

The misplaced console logs in the LU-7123 report are no help, I think. They are there in lustre-init, but only show sanity-scrub taking 0 time. For example, in the MDS console log:

17:23:00:[ 9529.925038] Lustre: DEBUG MARKER: /usr/sbin/lctl mark -----============= acceptance-small: sanity-scrub ============----- Fri Sep  4 17:22:53 UTC 2015
17:23:23:[ 9530.121771] Lustre: DEBUG MARKER: -----============= acceptance-small: sanity-scrub ============----- Fri Sep 4 17:22:53 UTC 2015
17:23:23:[ 9530.645738] Lustre: DEBUG MARKER: /usr/sbin/lctl mark == sanity-scrub test complete, duration -o sec ======================================================= 17:22:53 \(1441387373\)
17:23:23:[ 9530.809134] Lustre: DEBUG MARKER: == sanity-scrub test complete, duration -o sec ======================================================= 17:22:53 (1441387373)

Nothing useful there that I can see.

Comment by Di Wang [ 10/Sep/15 ]

Hmm, I thought the OSS console logs on LU-7123 are the key here:

17:23:31:[ 9568.305881] LDISKFS-fs (dm-0): mounted filesystem with ordered data mode. Opts: errors=remount-ro
17:23:31:[ 9568.461183] LDISKFS-fs (dm-0): mounted filesystem with ordered data mode. Opts: errors=remount-ro,no_mbcache
17:23:31:[ 9568.480543] Lustre: Evicted from MGS (at 10.1.4.162@tcp) after server handle changed from 0xd1a6ba6be92fc62 to 0xd1a6ba6be9adf3a
17:23:31:[ 9568.482736] LustreError: 15f-b: lustre-OST0000: cannot register this server with the MGS: rc = -108. Is the MGS running?
17:23:31:[ 9568.482991] Lustre: MGC10.1.4.162@tcp: Connection restored to MGS (at 10.1.4.162@tcp)
17:23:31:[ 9568.482993] Lustre: Skipped 6 previous similar messages
17:23:31:[ 9568.485396] LustreError: 14130:0:(obd_mount_server.c:1794:server_fill_super()) Unable to start targets: -108
17:23:31:[ 9568.486579] LustreError: 14130:0:(obd_mount_server.c:1509:server_put_super()) no obd lustre-OST0000
17:23:31:[ 9568.487556] LustreError: 14130:0:(obd_mount_server.c:137:server_deregister_mount()) lustre-OST0000 not registered
17:23:31:[ 9568.620431] Lustre: server umount lustre-OST0000 complete

It looks like the MGC somehow survives the umount/remount of the OST. Since MDT0 is reformatted, this surviving MGC caused the mount to fail. (Or there might be some race here; sigh, there are no further logs here.)

17:23:31:[ 9568.480543] Lustre: Evicted from MGS (at 10.1.4.162@tcp) after server handle changed from 0xd1a6ba6be92fc62 to 0xd1a6ba6be9adf3a

Sigh, nothing on the MDS console log.

Comment by Di Wang [ 10/Sep/15 ]

Note that the MGC even starts to reconnect before the first OST is mounted. If the timestamps in the console log are correct (are they?), that means the MGC is left over from the last umount, which is a bit strange.

17:23:31:[ 9567.529080] Lustre: 19295:0:(client.c:2039:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1441387392/real 1441387392]  req@ffff88005deb1e00 x1511395716989360/t0(0) o250->MGC10.1.4.162@tcp@10.1.4.162@tcp:26/25 lens 520/544 e 0 to 1 dl 1441387403 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
17:23:31:[ 9567.531521] Lustre: 19295:0:(client.c:2039:ptlrpc_expire_one_request()) Skipped 3 previous similar messages
17:23:31:[ 9567.570845] Lustre: DEBUG MARKER: test -b /dev/lvm-Role_OSS/P1
17:23:31:[ 9567.913313] Lustre: DEBUG MARKER: mkdir -p /mnt/ost1; mount -t lustre   		                   /dev/lvm-Role_OSS/P1 /mnt/ost1
17:23:31:[ 9568.305881] LDISKFS-fs (dm-0): mounted filesystem with ordered data mode. Opts: errors=remount-ro
17:23:31:[ 9568.461183] LDISKFS-fs (dm-0): mounted filesystem with ordered data mode. Opts: errors=remount-ro,no_mbcache
17:23:31:[ 9568.480543] Lustre: Evicted from MGS (at 10.1.4.162@tcp) after server handle changed from 0xd1a6ba6be92fc62 to 0xd1a6ba6be9adf3a
17:23:31:[ 9568.482736] LustreError: 15f-b: lustre-OST0000: cannot register this server with the MGS: rc = -108. Is the MGS running?
17:23:31:[ 9568.482991] Lustre: MGC10.1.4.162@tcp: Connection restored to MGS (at 10.1.4.162@tcp)
Comment by Di Wang [ 10/Sep/15 ]

Because this blocks almost all patches on master, I changed it to a blocker.

Comment by Di Wang [ 10/Sep/15 ]

Ah, there are 7 OSTs in the test config, but it seems sanity-scrub.sh only stops 4 OSTs before doing the reformat & restart. That is why the MGC is left over and causes all this trouble.

# no need too much OSTs, to reduce the format/start/stop overhead
[ $OSTCOUNT -gt 4 ] && OSTCOUNT=4

MOUNT_2=""

# build up a clean test environment.
formatall
setupall

Probably we only need to fix the test script here.
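A minimal sketch of that kind of fix (an illustration of the "stop all OSTs before reformat" approach named in the patch below, not necessarily the exact code that landed):

# Tear down the full configuration first, so every configured OST (and its
# MGC) is stopped before OSTCOUNT is reduced for the reformat.
stopall

# no need too much OSTs, to reduce the format/start/stop overhead
[ $OSTCOUNT -gt 4 ] && OSTCOUNT=4
MOUNT_2=""

# build up a clean test environment.
formatall
setupall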

Comment by Gerrit Updater [ 10/Sep/15 ]

wangdi (di.wang@intel.com) uploaded a new patch: http://review.whamcloud.com/16366
Subject: LU-7118 tests: stop all OSTs before reformat
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 802a0e14359af52a7265f9a0441c1704517d8b94

Comment by Bob Glossman (Inactive) [ 10/Sep/15 ]

I like your theory, but it looks like that line reducing OSTCOUNT to 4 has been in there for over a year. Why only a problem recently?

Comment by Andreas Dilger [ 10/Sep/15 ]

I was going to ask the same question... I don't see any changes to sanity-scrub.sh or test-framework.sh recently.

Comment by Di Wang [ 10/Sep/15 ]

Well, this only causes a problem when the OST remount (after reformat) happens after the leftover MGC has been evicted; otherwise we do not see the problem. (Hmm, it seems we need an extra check for the MGC, or do we allow the OST to use the old MGC in this case? It does not seem to cause any problem as long as it is not evicted.)

I guess the recent changes just prolong the time cost of reformat & restart? Apart from the Lustre changes, did we change the test environment recently?
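A rough way to check for this (just a sketch, not something shown in the logs above): on the OSS, see whether an MGC device is still registered before the OST is remounted, e.g.:

# Hypothetical check on the OSS node: a leftover MGC shows up as an 'mgc'
# device in the local device list even though all local targets are stopped.
lctl dl | grep ' mgc '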

Comment by Yang Sheng [ 11/Sep/15 ]

I think we need someone from the TEL team to investigate why the test logs are absent and why sanity-scrub doesn't run at all, just returning immediately. I have verified that the test script runs well on the shadow cluster. Do we have any documentation about how the autotest system works?

Comment by Minh Diep [ 11/Sep/15 ]

This is what I see.
The reason there aren't any logs is that the first subtest hasn't even started yet; it failed during setup while mounting OST1:

Starting ost1: /dev/lvm-Role_OSS/P1 /mnt/ost1
CMD: onyx-35vm4 mkdir -p /mnt/ost1; mount -t lustre /dev/lvm-Role_OSS/P1 /mnt/ost1
onyx-35vm4: mount.lustre: mount /dev/mapper/lvm--Role_OSS-P1 at /mnt/ost1 failed: Cannot send after transport endpoint shutdown.

The question here is why we can't mount the OST; why does the OST disconnect due to transport endpoint shutdown?
sanity-scrub does formatall and setupall at the beginning of the test. Perhaps the cleanup was not done properly.

Comment by Gerrit Updater [ 11/Sep/15 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/16366/
Subject: LU-7118 tests: stop all OSTs before reformat
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 61b4d4ff8e9a6e7539aa3a7dcc4dd1aea6b4f927

Comment by Andreas Dilger [ 11/Sep/15 ]

This definitely seems like a regression that landed recently on master, since sanity-scrub has had the [ $OSTCOUNT -gt 4 ] && OSTCOUNT=4 line since commit v2_5_58_0-41-g1dbba32 and there are no failures on b2_7 testing. It might be useful to run a series of tests with different commits, going back once per day, running review-ldiskfs multiple times (if this is possible):

Test-Parameters: fortestonly testgroup=review-ldiskfs
Test-Parameters: fortestonly testgroup=review-ldiskfs
Test-Parameters: fortestonly testgroup=review-ldiskfs
Test-Parameters: fortestonly testgroup=review-ldiskfs

Since sanity-scrub only fails about 50% of the time, four runs would give us roughly a 94% chance (1 - 0.5^4 ≈ 94%) of catching the regression patch at each stage. The earliest failure I see outside RHEL 7.1 testing is on 2015-09-08 (http://review.whamcloud.com/16315 based on commit 01ca8993247383 "LU-7079 ptlrpc: imp_peer_committed_transno should increase"), and then it starts hitting hard on 2015-09-09, so I suspect it was something landed on 2015-09-08 that caused it.

Comment by Joseph Gmitter (Inactive) [ 15/Sep/15 ]

Landed for 2.8.

Comment by Yang Sheng [ 17/Sep/15 ]

https://testing.hpdd.intel.com/test_sets/bdf2ffa6-5cfb-11e5-945a-5254006e85c2
https://testing.hpdd.intel.com/test_sets/9cdbc984-5c91-11e5-b8a8-5254006e85c2
https://testing.hpdd.intel.com/test_sets/94b10788-5c19-11e5-9dac-5254006e85c2
https://testing.hpdd.intel.com/test_sets/e7fd2a3e-5bd8-11e5-96c9-5254006e85c2

Looks like the last patch still hasn't resolved this issue, but it is really not happening as frequently now.

Comment by Di Wang [ 17/Sep/15 ]

Looks like it is still the same problem: there are 7 OSTs, and only 4 OSTs are stopped.

CMD: shadow-49vm3 grep -c /mnt/mds1' ' /proc/mounts
Stopping /mnt/mds1 (opts:-f) on shadow-49vm3
CMD: shadow-49vm3 umount -d -f /mnt/mds1
CMD: shadow-49vm3 lsmod | grep lnet > /dev/null && lctl dl | grep ' ST '
CMD: shadow-49vm7 grep -c /mnt/mds2' ' /proc/mounts
Stopping /mnt/mds2 (opts:-f) on shadow-49vm7
CMD: shadow-49vm7 umount -d -f /mnt/mds2
CMD: shadow-49vm7 lsmod | grep lnet > /dev/null && lctl dl | grep ' ST '
CMD: shadow-49vm7 grep -c /mnt/mds3' ' /proc/mounts
Stopping /mnt/mds3 (opts:-f) on shadow-49vm7
CMD: shadow-49vm7 umount -d -f /mnt/mds3
CMD: shadow-49vm7 lsmod | grep lnet > /dev/null && lctl dl | grep ' ST '
CMD: shadow-49vm7 grep -c /mnt/mds4' ' /proc/mounts
Stopping /mnt/mds4 (opts:-f) on shadow-49vm7
CMD: shadow-49vm7 umount -d -f /mnt/mds4
CMD: shadow-49vm7 lsmod | grep lnet > /dev/null && lctl dl | grep ' ST '
CMD: shadow-49vm4 grep -c /mnt/ost1' ' /proc/mounts
Stopping /mnt/ost1 (opts:-f) on shadow-49vm4
CMD: shadow-49vm4 umount -d -f /mnt/ost1
CMD: shadow-49vm4 lsmod | grep lnet > /dev/null && lctl dl | grep ' ST '
CMD: shadow-49vm4 grep -c /mnt/ost2' ' /proc/mounts
Stopping /mnt/ost2 (opts:-f) on shadow-49vm4
CMD: shadow-49vm4 umount -d -f /mnt/ost2
CMD: shadow-49vm4 lsmod | grep lnet > /dev/null && lctl dl | grep ' ST '
CMD: shadow-49vm4 grep -c /mnt/ost3' ' /proc/mounts
Stopping /mnt/ost3 (opts:-f) on shadow-49vm4
CMD: shadow-49vm4 umount -d -f /mnt/ost3
CMD: shadow-49vm4 lsmod | grep lnet > /dev/null && lctl dl | grep ' ST '
CMD: shadow-49vm4 grep -c /mnt/ost4' ' /proc/mounts
Stopping /mnt/ost4 (opts:-f) on shadow-49vm4
CMD: shadow-49vm4 umount -d -f /mnt/ost4
CMD: shadow-49vm4 lsmod | grep lnet > /dev/null && lctl dl | grep ' ST '

I guess we need to find out where OSTCOUNT has been changed? Or probably stopall should use "lov.xxxxx.numobd" instead of the environment variable?
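A rough sketch of that second idea (hypothetical; the real stopall lives in test-framework.sh and uses per-facet helpers, so an actual fix may look quite different):

# Derive the OST count from the running configuration (lov.*.numobd) instead
# of the possibly-reduced OSTCOUNT environment variable, then stop that many
# OST facets.
numobd=$(lctl get_param -n lov.*.numobd 2>/dev/null | head -n1)
for ((i = 1; i <= ${numobd:-$OSTCOUNT}; i++)); do
        stop ost$i -f
done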

Comment by Gerrit Updater [ 18/Sep/15 ]

Yang Sheng (yang.sheng@intel.com) uploaded a new patch: http://review.whamcloud.com/16483
Subject: LU-7118 tests: debug patch
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: ba6612cda149851fff0eea6ee548a99afb95dacf

Comment by Peter Jones [ 22/Sep/15 ]

The original regularly occurring failure is fixed. The less frequent occasional failure is being dealt with under LU-7193.
