[LU-7776] lustre-single lnet-selftest test failed Created: 15/Feb/16  Updated: 15/Jun/16  Resolved: 15/Jun/16

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.8.0, Lustre 2.9.0
Fix Version/s: Lustre 2.9.0

Type: Bug Priority: Minor
Reporter: Abrar-ahmed Assignee: WC Triage
Resolution: Fixed Votes: 0
Labels: None
Environment:

Solo setup


Epic/Theme: test
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

lnet-selftest test fails in test setup

stdout.log
  1 UP mgs MGS MGS 5
  2 UP mgc MGC192.168.108.18@tcp c0ab2420-8f51-ad18-f779-591cad596879 5
  3 UP mds MDS MDS_uuid 3
  4 UP lod lustre-MDT0000-mdtlov lustre-MDT0000-mdtlov_UUID 4
  5 UP mdt lustre-MDT0000 lustre-MDT0000_UUID 11
  6 UP mdd lustre-MDD0000 lustre-MDD0000_UUID 4
  7 UP qmt lustre-QMT0000 lustre-QMT0000_UUID 4
  8 UP lwp lustre-MDT0000-lwp-MDT0000 lustre-MDT0000-lwp-MDT0000_UUID 5
  9 UP osd-ldiskfs lustre-OST0000-osd lustre-OST0000-osd_UUID 5
 10 UP ost OSS OSS_uuid 3
 11 UP obdfilter lustre-OST0000 lustre-OST0000_UUID 7
 12 UP lwp lustre-MDT0000-lwp-OST0000 lustre-MDT0000-lwp-OST0000_UUID 5
 13 UP osd-ldiskfs lustre-OST0001-osd lustre-OST0001-osd_UUID 5
 14 UP obdfilter lustre-OST0001 lustre-OST0001_UUID 7
 15 UP lwp lustre-MDT0000-lwp-OST0001 lustre-MDT0000-lwp-OST0001_UUID 5
 21 UP osp lustre-OST0000-osc-MDT0000 lustre-MDT0000-mdtlov_UUID 5
 22 UP osp lustre-OST0001-osc-MDT0000 lustre-MDT0000-mdtlov_UUID 5
Modules still loaded: 
lustre/osp/osp.o lustre/lod/lod.o lustre/ost/ost.o lustre/mdt/mdt.o lustre/mdd/mdd.o lustre/mgs/mgs.o ldiskfs/ldiskfs.o lustre/quota/lquota.o lustre/lfsck/lfsck.o lustre/mgc/mgc.o lustre/fid/fid.o lustre/fld/fld.o lustre/ptlrpc/ptlrpc.o lustre/obdclass/obdclass.o lnet/klnds/socklnd/ksocklnd.o lnet/lnet/lnet.o libcfs/libcfs/libcfs.o



 Comments   
Comment by Abrar-ahmed [ 15/Feb/16 ]

The lnet-selftest.sh script fails while executing cleanupall() during test setup. cleanupall() in turn fails because it tries to remove modules that are still in use. This happens on a solo setup, where local_mode returns true and the variable CLIENTONLY is set. cleanupall() internally calls stopall(), which checks CLIENTONLY and, if it is set, returns midway without cleaning up the mgs, mds and ost. cleanupall() then fails at a later stage when it tries to remove the still-loaded modules.

stopall() {
...
	[ "$CLIENTONLY" ] && return	# early return: mgs, mds and ost are not cleaned up
...
}
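
To make the failure path concrete, here is a minimal toy model of the interaction described above. It is only a sketch: the function bodies are stand-ins and not the real test-framework.sh code, and SERVERS_RUNNING and unload_modules are hypothetical names introduced for illustration.

```shell
#!/bin/bash
# Toy model only: stand-ins for the real test framework functions.

SERVERS_RUNNING=yes	# hypothetical flag: mgs/mds/ost are still up
CLIENTONLY=

stopall() {
	echo "stopall: unmounting clients"
	# Mirrors the early return quoted above.
	[ "$CLIENTONLY" ] && return
	echo "stopall: stopping servers"
	SERVERS_RUNNING=
}

# Hypothetical stand-in for the module removal step.
unload_modules() {
	if [ "$SERVERS_RUNNING" ]; then
		echo "unload_modules: modules still in use"
		return 1
	fi
	echo "unload_modules: ok"
}

cleanupall() {
	stopall
	unload_modules
}

# Solo setup: CLIENTONLY is set, so stopall returns before stopping servers
# and the module removal step fails.
CLIENTONLY=yes
cleanupall || echo "cleanupall failed"
```

With CLIENTONLY left empty, the same cleanupall call in this model stops the servers first and the module removal step succeeds.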

The change history shows that this regression was introduced by a debug patch, http://review.whamcloud.com/12469 (LU-4181 tests: cleanup lustre before starting lnet-selftest.sh).
Since the discussion on LU-4181 points out that the change was for debugging purposes and that removing the modules was not a necessity, I propose reverting this change to resolve the bug.

Comment by Andreas Dilger [ 18/Feb/16 ]

I don't think that reverting the patch is a good idea, since I believe this will cause lnet-selftest to begin failing again in our test configuration.

Instead, I think it should be enough to change the "cleanupall" to "stopall" so that it doesn't try to unload the modules, which isn't necessary. The goal of the LU-4181 patch was to stop the clients so that they would not interfere with the testing, or become disconnected when lnet-selftest was saturating the network.

Comment by Abrar-ahmed [ 31/Mar/16 ]

@Andreas Dilger

Here is my understanding of the debug patch submitted via commit <a8ba5c645f91faf86a84c99dd2cc049bc54e12b1>.
The debug patch replaced stopall with cleanupall. In addition to unmounting clients and stopping servers, cleanupall also unloads modules, which I believe was the intended purpose of the debug patch. Please correct me if my understanding is wrong.
The relevant section of the debug patch is quoted below:

-    local_mode && CLIENTONLY=yes
-    stopall
-    RESTORE_MOUNT=yes
+	local_mode && CLIENTONLY=yes
+	RESTORE_MOUNT=yes
+	LOAD_MODULES_REMOTE=true
+	cleanupall

So changing cleanupall to stopall would functionally revert the debug patch. Would this not cause your test setup to fail again?

Comment by Abrar-ahmed [ 31/Mar/16 ]

@Andreas Dilger

An alternative solution that keeps the debug patch functionality would be to avoid calling cleanupall on local_mode setups, something like the following:

-	local_mode && CLIENTONLY=yes
+	if local_mode; then
+		CLIENTONLY=yes
+		stopall
+	else
+		LOAD_MODULES_REMOTE=true
+		cleanupall
+	fi

Let me know which solution works for you, or if you want to suggest alternatives. I can upload a patch for it.
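
The branching proposed above can be exercised in isolation with stub functions. The following is a sketch under stated assumptions, not the patch itself: stopall and cleanupall are stubs that only record that they were called, and SOLO, CALLS and prepare_selftest are hypothetical names that do not appear in lnet-selftest.sh.

```shell
#!/bin/bash
# Stubs only: each one records that it was called.
CALLS=""
stopall()    { CALLS="$CALLS stopall"; }
cleanupall() { CALLS="$CALLS cleanupall"; }

# Hypothetical toggle standing in for the real local_mode check.
local_mode() { [ "$SOLO" = yes ]; }

# Hypothetical wrapper around the proposed branch.
prepare_selftest() {
	CALLS=""
	CLIENTONLY=""
	LOAD_MODULES_REMOTE=""
	if local_mode; then
		# Solo setup: stop clients only, never try to unload modules.
		CLIENTONLY=yes
		stopall
	else
		# Multi-node setup: keep the full cleanup from the debug patch.
		LOAD_MODULES_REMOTE=true
		cleanupall
	fi
}

SOLO=yes; prepare_selftest
echo "solo:$CALLS"
SOLO=no; prepare_selftest
echo "multi-node:$CALLS"
```

The point of the split is that the module unload path is only reached when the servers are not running on the local node.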

Comment by Andreas Dilger [ 01/Apr/16 ]

Looks reasonable, and if this patch works for you then you can submit it and it can be tested.

Comment by Gerrit Updater [ 02/Apr/16 ]

Abrarahmed Momin (kais_abrar@yahoo.co.in) uploaded a new patch: http://review.whamcloud.com/19308
Subject: LU-7776 tests: lnet-selftest.sh local_mode failure
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 4efad14f8c85d5346aed97a047d9e5681c1792e5

Comment by Abrar-ahmed [ 14/Apr/16 ]

@Andreas Dilger: Hi Andreas, I have uploaded the discussed patch and the test run results were fine. Could you and others kindly review the patch?

Comment by Gerrit Updater [ 14/Jun/16 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/19308/
Subject: LU-7776 tests: lnet-selftest.sh local_mode failure
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 84030bf26c1763edf9ac17a8cd2765e9163294bf

Comment by Joseph Gmitter (Inactive) [ 15/Jun/16 ]

Patch has landed to master for 2.9.0.
