[LU-15604] sanity-lnet test_226: failed to unload modules Created: 28/Feb/22  Updated: 24/Mar/22  Resolved: 24/Mar/22

Status: Closed
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Maloo Assignee: Chris Horn
Resolution: Won't Fix Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This issue was created by maloo for Chris Horn <chris.horn@hpe.com>

This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/b3098e8b-70ca-4980-a955-159ee6537597

test_226 failed with the following error:

failed to unload modules

This is a new test case added by the patch. I don't see anything obviously wrong, so I'm not sure why it failed in this way. I repeated the test 100 times in my VM environment and I did not see this issue.

VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
sanity-lnet test_226 - failed to unload modules



 Comments   
Comment by Chris Horn [ 22/Mar/22 ]

I was testing https://review.whamcloud.com/#/c/46727/ by starting/stopping LNet in a loop while pinging the test node from 3 other peers:

[root@s-lmo-gaz38a ~]# while true; do for i in {1..20}; do lctl ping 172.18.2.6@o2ib10; done; echo sleep 5; sleep 5; done
...

[root@s-lmo-gaz38b ~]# while true; do for i in {1..2}; do lnetctl discover --force 172.18.2.6@o2ib10; done; lnetctl peer del --prim 172.18.2.6@o2ib; echo sleep 5; sleep 5; done
...

cassini-hosta:~ # while true; do for i in {1..20}; do lnetctl discover --force 172.18.2.6@o2ib; done; lnetctl peer del --prim 172.18.2.6@o2ib; echo sleep 5; sleep 5; done
...

cassini-hostb:~ # while true; do /bin/start.sh2 ; lustre_rmmod ; echo sleep 5; sleep 5; done
sleep 5
sleep 5
sleep 5
sleep 5
sleep 5
sleep 5
...

After a while, I opened some additional terminals on the test node and ran some lctl and lnetctl commands in a loop:

cassini-hostb:~ # while true; do lctl list_nids 2>/dev/null; done
...
cassini-hostb:~ # while true; do lnetctl peer show 2>/dev/null; done
...
cassini-hostb:~ # while true; do lnetctl net show 2>/dev/null; done
...

At that point, I started to see some rmmod failures like in this ticket:

sleep 5
sleep 5
rmmod: ERROR: Module libcfs is in use

So I think this test failure is most likely caused by some other lctl or lnetctl process running at the same time as rmmod. We execute unload_modules_local() via do_rpc_nodes(), which invokes "lctl mark" on all the test nodes, and that is most likely what causes the occasional rmmod failure. So it seems that it is not safe to call unload_modules_local() via do_rpc_nodes().
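
For illustration only: if the "Module libcfs is in use" error really comes from a short-lived lctl or lnetctl process holding a module reference, then retrying the unload after a brief pause should usually succeed. The sketch below shows that idea. The helper name retry_unload and its arguments are hypothetical and not part of test-framework.sh, and this is not the change that was made to the test.

```shell
#!/bin/bash
# Hypothetical helper (not part of test-framework.sh): run a command up to
# <attempts> times, sleeping <delay> seconds after each failed attempt.
# Returns 0 as soon as the command succeeds, 1 if every attempt fails.
retry_unload() {
    local attempts=$1
    local delay=$2
    shift 2
    local i
    for ((i = 1; i <= attempts; i++)); do
        # "$@" is the unload command, e.g. lustre_rmmod
        "$@" && return 0
        sleep "$delay"
    done
    return 1
}

# Example invocation (assumes lustre_rmmod is in PATH on the node):
#   retry_unload 5 1 lustre_rmmod
```

A retry only hides the race. Avoiding the concurrent lctl invocation removes it.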

Comment by Chris Horn [ 24/Mar/22 ]

The tests were modified to call lustre_rmmod via do_nodes() instead of using unload_modules_local() via do_rpc_nodes(). With that change, the test now passes consistently (100/100): https://testing.whamcloud.com/test_sessions/81932feb-6028-40fb-868d-f6a21485811c
