[LU-17294] sanity-lnet test_219: timeout on Ubuntu since 2.15.59 Created: 16/Nov/23 Updated: 15/Jan/24 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.16.0 |
| Fix Version/s: | Lustre 2.16.0 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Maloo | Assignee: | James A Simmons |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None | ||
| Issue Links: |
|
||||||||
| Severity: | 3 | ||||||||
| Rank (Obsolete): | 9223372036854775807 | ||||||||
| Description |
|
This issue was created by maloo for Andreas Dilger <adilger@whamcloud.com>
This issue relates to the following test suite run:
test_219 failed with the following error:
Timeout occurred after 429 minutes, last suite running was sanity-lnet
Test session details:
It looks like all failures are on top of the 2.15.59 tag, so it is likely to be caused by one of the recent patch landings:
6521c313f7 New tag 2.15.59
6a6e4ee20f LU-17184 mgc: remove damaged local configs
21295b169b LU-17213 llite: check sdio before freeing it
87ca3cffe6 LU-17259 lnet: kgnilnd_nl_get should return 0 on success
5afe3b0538 LU-17258 socklnd: ensure connection type established upon race
c4c9a8eea3 LU-17256 debian: allow building client dkms on arm64
982eca73a9 LU-17000 coverity: Fix Logically dead code under liblnetconfig.c
b83156304d LU-17254 lnet: Fix ofed detection with specific kernel version
9ba375983d LU-17249 ptlrpc: protect scp_rqbd_idle list operations
ee56161ea0 LU-17000 coverity: Fix Dereference after null under client.c
edb968d04f LU-17242 debug: remove CFS_CHECK_STACK
4b290188fb LU-16868 tests: skip conf-sanity/32 in interop
37a50f74da LU-16796 lfsck: Change lfsck_assistant_object to use kref
d7e3e7c104 LU-16796 target: Change struct job_stat to use kref
a12c352a3d LU-17205 utils: add lctl get_param -H option
8fa3532b1e LU-17204 lod: don't panic on short LOVEA
f2f8b6deaf LU-17203 libcfs: ignore remaining items
1759ae751a LU-16796 ofd: Change struct ofd_seq to use refcount_t
e420e92ac9 LU-17196 tests: sanity-lnet test_310 MR support
6aede12548 LU-16518 rsync: fix new clang error in lustre_rsync.c
36b14a23a6 LU-17207 lnet: race b/w monitor thr stop and discovery push
1b694ad04f LU-16896 flr: resync could mess mirror content
d1fadf8e8a LU-17132 kernel: update RHEL 8.8 [4.18.0-477.27.1.el8_8]
67e0d9e40a LU-17191 osc: only call xa_insert for new entries
ee0e9447e7 LU-17115 quota: fix race of acquiring locks in qmt
ecea24d843 LU-17071 o2iblnd: IBLND_REJECT_EARLY condition causes Oops
05c97b1096 LU-17232 build: fix ext4-misc for el7.6 server
24d515367f LU-9859 libcfs: migrate libcfs_mem.c to lnet/lib-mem.c
b0cc96a1ff LU-17131 ldiskfs: el9.2 encdata and filename-encode
57ac32a223 LU-16097 quota: release preacquired quota when over limits
6fbffd9c09 LU-14361 statahead: add tunable for fname pattern statahead
753e058b4c LU-4974 lod: Change pool_desc to "[lod|lov]_pool_desc"
cb5f92c0e3 LU-10391 ksocklnd: use ksocknal_protocol v4 for IPv6
02b22df643 LU-17235 o2iblnd: adding alias ib interface causes crash
6ad9ef1fec LU-17225 utils: check_iam print records
b5fde4d6c0 LU-17197 obdclass: preserve fairness when waiting for rpc slot
68254c484a LU-10391 lnet: handle discovery with Netlink
4512347d6c LU-16356 hsm: add running ref to the coordinator
a772e90243 LU-16032 osd: move unlink of large objects to separate thread
2c97684db9 LU-17181 misc: don't block reclaim threads
a8e66b899a LU-17103 lnet: use workqueue for lnd ping buffer updates
VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV |
| Comments |
| Comment by Andreas Dilger [ 16/Nov/23 ] |
|
simmonsja, ssmirnov, could you please take a look? There are about 15 failures, but only since 2023-11-09, so it is a very recent regression. |
| Comment by Gerrit Updater [ 20/Nov/23 ] |
|
"Serguei Smirnov <ssmirnov@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/53185 |
| Comment by Gerrit Updater [ 20/Nov/23 ] |
|
"Serguei Smirnov <ssmirnov@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/53186 |
| Comment by Serguei Smirnov [ 21/Nov/23 ] |
|
According to the bisection results (bisect patch gerrit change id 53185), the issue shows up in 68254c484a, "LU-10391 lnet: handle discovery with Netlink".
Two commits prior to this point passed the test:
9938228dc7e708fd, "LU-17198 tests: running_in_vm to recognize qemu"
4512347d6cda68fc, "LU-16356 hsm: add running ref to the coordinator"
One commit after this point also failed the test (bisect patch gerrit change id 53186):
57ac32a22372065, "LU-16097 quota: release preacquired quota when over limits"
The next step is to understand how to fix it.
|
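For readers unfamiliar with the approach, the bisection reported in the previous comment can be sketched as a binary search over an ordered commit list. This is a minimal illustration, not the procedure actually used through the Gerrit bisect patches. It assumes the commit list in the description is in newest-first order, that there is a single pass-to-fail transition, and that a predicate is available to report whether a given commit fails. The names `first_bad_commit`, `COMMITS`, and `simulated_is_bad` are hypothetical, and the predicate simply replays the reported outcome instead of running sanity-lnet test_219.

```python
from typing import Callable, Sequence


def first_bad_commit(commits: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Return the earliest commit for which is_bad() is true.

    Assumes `commits` is ordered oldest to newest, that the newest commit
    is bad, and that every commit after the first bad one is also bad.
    """
    if not commits or not is_bad(commits[-1]):
        raise ValueError("the newest commit must be a failing one")
    lo, hi = 0, len(commits) - 1  # invariant: commits[hi] is bad
    while lo < hi:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid       # first bad commit is at mid or earlier
        else:
            lo = mid + 1   # first bad commit is after mid
    return commits[lo]


# A subset of the commits from the description, oldest first
# (assumption: the description lists them newest first).
COMMITS = [
    "a8e66b899a", "2c97684db9", "a772e90243", "4512347d6c",
    "68254c484a", "b5fde4d6c0", "6ad9ef1fec", "57ac32a223",
]


def simulated_is_bad(commit: str) -> bool:
    # Hypothetical stand-in for "build this commit on Ubuntu and run test_219".
    # It replays the reported result: 68254c484a and everything after it fails.
    return COMMITS.index(commit) >= COMMITS.index("68254c484a")


print(first_bad_commit(COMMITS, simulated_is_bad))
```

With eight candidate commits the search needs only three test runs after the initial check of the newest commit, which matters when a single failing run can time out after several hours.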
| Comment by James A Simmons [ 22/Nov/23 ] |
|
It is strange that it's only Ubuntu. Also, looking at the logs, it seems not to get beyond lnetctl lnet configure. |
| Comment by James A Simmons [ 28/Nov/23 ] |
|
I noticed in the sanity-lnet logs that it looks like Lustre is mounted. The LU-17311 patch also does a run of sanity-lnet test_219, but it is doing so against a Lustre patch revert. Perhaps that bug is exposing an issue? |
| Comment by Gerrit Updater [ 15/Jan/24 ] |
|
"James Simmons <jsimmons@infradead.org>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/53677 |