[LU-15604] sanity-lnet test_226: failed to unload modules Created: 28/Feb/22 Updated: 24/Mar/22 Resolved: 24/Mar/22 |
|
| Status: | Closed |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Maloo | Assignee: | Chris Horn |
| Resolution: | Won't Fix | Votes: | 0 |
| Labels: | None | ||
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
This issue was created by maloo for Chris Horn <chris.horn@hpe.com>
This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/b3098e8b-70ca-4980-a955-159ee6537597
test_226 failed with the following error:
failed to unload modules
This is a new test case added by the patch. I don't see anything obviously wrong, so I'm not sure why it failed in this way. I repeated the test 100 times in my VM environment and I did not see this issue.
VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV |
| Comments |
| Comment by Chris Horn [ 22/Mar/22 ] |
|
I was testing https://review.whamcloud.com/#/c/46727/ by starting/stopping LNet in a loop while pinging the test node from 3 other peers:
[root@s-lmo-gaz38a ~]# while true; do for i in {1..20}; do lctl ping 172.18.2.6@o2ib10; done; echo sleep 5; sleep 5; done
...
[root@s-lmo-gaz38b ~]# while true; do for i in {1..2}; do lnetctl discover --force 172.18.2.6@o2ib10; done; lnetctl peer del --prim 172.18.2.6@o2ib; echo sleep 5; sleep 5; done
...
cassini-hosta:~ # while true; do for i in {1..20}; do lnetctl discover --force 172.18.2.6@o2ib; done; lnetctl peer del --prim 172.18.2.6@o2ib; echo sleep 5; sleep 5; done
...
cassini-hostb:~ # while true; do /bin/start.sh2 ; lustre_rmmod ; echo sleep 5; sleep 5; done
sleep 5
sleep 5
sleep 5
sleep 5
sleep 5
sleep 5
...
After a while, I opened some additional terminals on the test node and ran some lctl and lnetctl commands in a loop:
cassini-hostb:~ # while true; do lctl list_nids 2>/dev/null; done
...
cassini-hostb:~ # while true; do lnetctl peer show 2>/dev/null; done
...
cassini-hostb:~ # while true; do lnetctl net show 2>/dev/null; done
...
At that point, I started to see some rmmod failures like the one in this ticket:
sleep 5
sleep 5
rmmod: ERROR: Module libcfs is in use
So I think that this test failure is most likely caused by some other lctl or lnetctl process running at the same time as rmmod. We're executing unload_modules_local() via do_rpc_nodes(), which invokes "lctl mark" on all the test nodes. This is most likely what is causing the occasional rmmod failure. So it seems that it is not safe to call unload_modules_local() via do_rpc_nodes(). |
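For illustration only: if the "Module libcfs is in use" error really is a transient race with a short-lived lctl or lnetctl process, one generic way to tolerate it is to retry the unload a few times. This is a minimal sketch, not the change that was made for this ticket. The helper name retry_cmd and its arguments are hypothetical and are not part of the Lustre test framework.

```sh
#!/bin/sh
# retry_cmd: hypothetical helper (not from the Lustre test framework).
# Usage: retry_cmd <max_tries> <delay_seconds> <command> [args...]
# Runs the command until it succeeds or max_tries attempts have failed.
retry_cmd() {
	_rc_max=$1
	_rc_delay=$2
	shift 2
	_rc_try=1
	while [ "$_rc_try" -le "$_rc_max" ]; do
		# Stop as soon as the command succeeds.
		if "$@"; then
			return 0
		fi
		echo "attempt $_rc_try/$_rc_max failed: $*" >&2
		# Only sleep if another attempt will follow.
		if [ "$_rc_try" -lt "$_rc_max" ]; then
			sleep "$_rc_delay"
		fi
		_rc_try=$((_rc_try + 1))
	done
	return 1
}
```

On a test node this could be invoked as `retry_cmd 5 1 lustre_rmmod`, where the counts are arbitrary. A retry only hides the race, so removing the concurrent lctl invocation is still the cleaner fix.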
| Comment by Chris Horn [ 24/Mar/22 ] |
|
The tests were modified to call lustre_rmmod via do_nodes() instead of using unload_modules_local() via do_rpc_nodes(). With that change, the test now passes consistently: https://testing.whamcloud.com/test_sessions/81932feb-6028-40fb-868d-f6a21485811c 100/100 |