[LU-17294] sanity-lnet test_219: timeout on Ubuntu since 2.15.59 Created: 16/Nov/23  Updated: 15/Jan/24

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.16.0
Fix Version/s: Lustre 2.16.0

Type: Bug Priority: Critical
Reporter: Maloo Assignee: James A Simmons
Resolution: Unresolved Votes: 0
Labels: None

Issue Links:
Related
is related to LU-10391 LNET: Support IPv6 Reopened
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This issue was created by maloo for Andreas Dilger <adilger@whamcloud.com>

This issue relates to the following test suite run:
https://testing.whamcloud.com/test_sets/f17b0778-f4c2-4789-9e34-f5eaf3425b08

test_219 failed with the following error:

Timeout occurred after 429 minutes, last suite running was sanity-lnet

Test session details:
clients: https://build.whamcloud.com/job/lustre-reviews/100335 - 5.15.0-71-generic
servers: https://build.whamcloud.com/job/lustre-reviews/100335 - 4.18.0-477.27.1.el8_lustre.x86_64

It looks like all failures are on top of the 2.15.59 tag, so it is likely to be caused by one of the recent patch landings:

6521c313f7 New tag 2.15.59
6a6e4ee20f LU-17184 mgc: remove damaged local configs
21295b169b LU-17213 llite: check sdio before freeing it
87ca3cffe6 LU-17259 lnet: kgnilnd_nl_get should return 0 on success
5afe3b0538 LU-17258 socklnd: ensure connection type established upon race
c4c9a8eea3 LU-17256 debian: allow building client dkms on arm64
982eca73a9 LU-17000 coverity: Fix Logically dead code under liblnetconfig.c
b83156304d LU-17254 lnet: Fix ofed detection with specific kernel version
9ba375983d LU-17249 ptlrpc: protect scp_rqbd_idle list operations
ee56161ea0 LU-17000 coverity: Fix Dereference after null under client.c
edb968d04f LU-17242 debug: remove CFS_CHECK_STACK
4b290188fb LU-16868 tests: skip conf-sanity/32 in interop
37a50f74da LU-16796 lfsck: Change lfsck_assistant_object to use kref
d7e3e7c104 LU-16796 target: Change struct job_stat to use kref
a12c352a3d LU-17205 utils: add lctl get_param -H option
8fa3532b1e LU-17204 lod: don't panic on short LOVEA
f2f8b6deaf LU-17203 libcfs: ignore remaining items
1759ae751a LU-16796 ofd: Change struct ofd_seq to use refcount_t
e420e92ac9 LU-17196 tests: sanity-lnet test_310 MR support
6aede12548 LU-16518 rsync: fix new clang error in lustre_rsync.c
36b14a23a6 LU-17207 lnet: race b/w monitor thr stop and discovery push
1b694ad04f LU-16896 flr: resync could mess mirror content
d1fadf8e8a LU-17132 kernel: update RHEL 8.8 [4.18.0-477.27.1.el8_8]
67e0d9e40a LU-17191 osc: only call xa_insert for new entries
ee0e9447e7 LU-17115 quota: fix race of acquiring locks in qmt
ecea24d843 LU-17071 o2iblnd: IBLND_REJECT_EARLY condition causes Oops
05c97b1096 LU-17232 build: fix ext4-misc for el7.6 server
24d515367f LU-9859 libcfs: migrate libcfs_mem.c to lnet/lib-mem.c
b0cc96a1ff LU-17131 ldiskfs: el9.2 encdata and filename-encode
57ac32a223 LU-16097 quota: release preacquired quota when over limits
6fbffd9c09 LU-14361 statahead: add tunable for fname pattern statahead
753e058b4c LU-4974 lod: Change pool_desc to "[lod|lov]_pool_desc"
cb5f92c0e3 LU-10391 ksocklnd: use ksocknal_protocol v4 for IPv6
02b22df643 LU-17235 o2iblnd: adding alias ib interface causes crash
6ad9ef1fec LU-17225 utils: check_iam print records
b5fde4d6c0 LU-17197 obdclass: preserve fairness when waiting for rpc slot
68254c484a LU-10391 lnet: handle discovery with Netlink
4512347d6c LU-16356 hsm: add running ref to the coordinator
a772e90243 LU-16032 osd: move unlink of large objects to separate thread
2c97684db9 LU-17181 misc: don't block reclaim threads
a8e66b899a LU-17103 lnet: use workqueue for lnd ping buffer updates

VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
sanity-lnet test_219 - Timeout occurred after 429 minutes, last suite running was sanity-lnet



 Comments   
Comment by Andreas Dilger [ 16/Nov/23 ]

simmonsja, ssmirnov, could you please take a look? There are about 15 failures, but only since 2023-11-09, so it is a very recent regression.

Comment by Gerrit Updater [ 20/Nov/23 ]

"Serguei Smirnov <ssmirnov@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/53185
Subject: LU-17294 lnet: bisect sanity-lnet timeout
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 6d723c3f58dbc643c02acee1c0ff01c6ec1d1247

Comment by Gerrit Updater [ 20/Nov/23 ]

"Serguei Smirnov <ssmirnov@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/53186
Subject: LU-17294 lnet: bisect sanity-lnet timeout
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 8fcba6f7aae458304bf35e9934f0dffeec69531f

Comment by Serguei Smirnov [ 21/Nov/23 ]

According to the bisection results (bisect patch Gerrit change ID 53185), the issue first shows up in 68254c484a, "LU-10391 lnet: handle discovery with Netlink".

Two commits prior to this point passed the test: 

9938228dc7e708fd, "LU-17198 tests: running_in_vm to recognize qemu"
4512347d6cda68fc, "LU-16356 hsm: add running ref to the coordinator" 

One commit after this point also failed the test (bisect patch Gerrit change ID 53186):

57ac32a22372065, "LU-16097 quota: release preacquired quota when over limits"

The next step is to understand how to fix it.
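
For reference, the reasoning behind the bisection result can be sketched as below. This is only an illustration, not the script used for the bisect patches: the function name `first_bad_commit` is hypothetical, the history is abbreviated to the four commits mentioned in this comment, and the relative order of the two passing commits is an assumption.

```python
from typing import Mapping, Optional, Sequence


def first_bad_commit(history: Sequence[str],
                     passed: Mapping[str, bool]) -> Optional[str]:
    """Return the oldest failing commit in `history` (ordered oldest first).

    Raises ValueError if a passing commit follows a failing one, since
    the results would then not point at a single regression.
    """
    first_bad = None
    for commit in history:
        ok = passed[commit]
        if not ok and first_bad is None:
            first_bad = commit
        elif ok and first_bad is not None:
            raise ValueError(f"{commit} passed after {first_bad} failed")
    return first_bad


# Abbreviated history, oldest first (ordering of the first two is assumed).
HISTORY = ["9938228dc7", "4512347d6c", "68254c484a", "57ac32a223"]

# sanity-lnet test_219 outcome per commit, as reported in this comment.
OBSERVED = {
    "9938228dc7": True,   # passed
    "4512347d6c": True,   # passed
    "68254c484a": False,  # timed out
    "57ac32a223": False,  # timed out
}

if __name__ == "__main__":
    print(first_bad_commit(HISTORY, OBSERVED))
```

With the results reported above, the sketch points at 68254c484a, which matches the conclusion of the bisection.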

 

Comment by James A Simmons [ 22/Nov/23 ]

It is strange that it's only Ubuntu. Also, looking at the logs, it seems not to get beyond lnetctl lnet configure.

Comment by James A Simmons [ 28/Nov/23 ]

I noticed in the sanity-lnet logs that it looks like Lustre is mounted. The LU-17311 patch also does a run with sanity-lnet 219, but it's doing so against a Lustre patch revert. Perhaps that bug is exposing an issue?

Comment by Gerrit Updater [ 15/Jan/24 ]

"James Simmons <jsimmons@infradead.org>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/53677
Subject: LU-17294 tests: verify sanity-lnet on Ubuntu
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 57c210f504a266fd43a9f9bc4e2cff366e47bec4

Generated at Sat Feb 10 03:34:14 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.