[LU-12874] Ubuntu client testing instability Created: 18/Oct/19  Updated: 21/Dec/22

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.13.0
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Sebastien Buisson Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 9223372036854775807

Description

Lustre client testing on Ubuntu appears to be quite unstable on master. I triggered all the usual test groups with clients running Ubuntu 18.04, and the vast majority of test sessions failed.

https://review.whamcloud.com/36470

Various tests failed, always because of a timeout. Some of them (e.g. https://testing.whamcloud.com/test_sets/508af68a-f0fe-11e9-be86-52540065bddc) show the following error message:

LNetError: 21037:0:(lib-socket.c:225:lnet_sock_create()) Error trying to bind to port 1023: -99

This resembles the issue reported under LU-11805. In other cases, however, this message does not appear. For instance, this failure (https://testing.whamcloud.com/test_sets/e204e6a8-f0fd-11e9-a0ba-52540065bddc) shows errors while trying to send requests. The stack trace of the stalled process (dd) is as follows:

[14017.507205] Call Trace:
[14017.507537]  __schedule+0x24e/0x880
[14017.507978]  schedule+0x2c/0x80
[14017.508405]  schedule_timeout+0x1cf/0x350
[14017.509155]  ? ptlrpcd_add_req+0x10e/0x2c0 [ptlrpc]
[14017.509750]  wait_for_completion+0xba/0x140
[14017.510257]  ? wake_up_q+0x80/0x80
[14017.510694]  osc_io_setattr_end+0x189/0x200 [osc]
[14017.511258]  ? lov_io_iter_fini_wrapper+0x40/0x40 [lov]
[14017.511902]  cl_io_end+0x57/0x130 [obdclass]
[14017.512503]  lov_io_end_wrapper+0xcf/0xe0 [lov]
[14017.513193]  lov_io_call.isra.9+0x86/0x140 [lov]
[14017.513756]  lov_io_end+0x36/0xd0 [lov]
[14017.514241]  cl_io_end+0x57/0x130 [obdclass]
[14017.514779]  cl_io_loop+0xd8/0x1c0 [obdclass]
[14017.515327]  cl_setattr_ost+0x247/0x300 [lustre]
[14017.515894]  ll_setattr_raw+0xd6a/0xec0 [lustre]
[14017.516473]  ll_setattr+0x5f/0xa0 [lustre]
[14017.517106]  notify_change+0x2eb/0x440
[14017.517580]  do_truncate+0x73/0xc0
[14017.518002]  ? __inode_permission+0x5b/0x160
[14017.518518]  path_openat+0x1192/0x1960
[14017.518982]  do_filp_open+0x9b/0x110
[14017.519421]  ? __check_object_size+0xc3/0x1a0
[14017.519947]  ? __alloc_fd+0xb2/0x170
[14017.520404]  do_sys_open+0x1bb/0x2c0
[14017.520981]  ? do_sys_open+0x1bb/0x2c0
[14017.521439]  SyS_openat+0x14/0x20
[14017.521858]  do_syscall_64+0x73/0x130
[14017.522304]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[14017.522894] RIP: 0033:0x7ff30cb66c8e
[14017.523330] RSP: 002b:00007ffd0e1e6180 EFLAGS: 00000246 ORIG_RAX: 0000000000000101
[14017.524217] RAX: ffffffffffffffda RBX: 00005555a742f160 RCX: 00007ff30cb66c8e
[14017.525154] RDX: 0000000000000241 RSI: 00007ffd0e1e6ff4 RDI: 00000000ffffff9c
[14017.525956] RBP: 0000000000000001 R08: 00007ffd0e1e701b R09: 0000000000000000
[14017.526758] R10: 00000000000001b6 R11: 0000000000000246 R12: 0000000000000241
[14017.527551] R13: 00007ffd0e1e6ff4 R14: 0000000000000001 R15: 0000000000100000
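To make the trace above easier to read, the following is a minimal sketch (not part of the Lustre test tooling) that reduces a kernel call trace to its module:function chain. Frames prefixed with "?" are the ones the kernel unwinder marks as unreliable, so the sketch drops them. The names FRAME_RE, parse_frames and reliable_chain are hypothetical helpers introduced for illustration, and SAMPLE is only an excerpt of the trace.

```python
import re

# Matches one frame of a Linux kernel call trace, for example:
#   "[14017.510694]  osc_io_setattr_end+0x189/0x200 [osc]"
# A leading "?" marks a frame the unwinder could not verify.
FRAME_RE = re.compile(
    r"^\[\s*\d+\.\d+\]\s+"
    r"(?P<unreliable>\?\s+)?"
    r"(?P<func>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?:\s+\[(?P<module>\w+)\])?\s*$"
)


def parse_frames(trace_text):
    """Return one dict per stack frame; non-frame lines are skipped."""
    frames = []
    for line in trace_text.splitlines():
        m = FRAME_RE.match(line.strip())
        if m is None:
            continue  # "Call Trace:", register dumps, blank lines
        frames.append({
            "func": m.group("func"),
            # Frames without a [module] suffix belong to the core kernel.
            "module": m.group("module") or "kernel",
            "reliable": m.group("unreliable") is None,
        })
    return frames


def reliable_chain(frames):
    """List 'module:function' for frames not marked with '?'."""
    return [f"{f['module']}:{f['func']}" for f in frames if f["reliable"]]


# Excerpt of the trace quoted in this report.
SAMPLE = """\
[14017.507205] Call Trace:
[14017.507537]  __schedule+0x24e/0x880
[14017.509155]  ? ptlrpcd_add_req+0x10e/0x2c0 [ptlrpc]
[14017.509750]  wait_for_completion+0xba/0x140
[14017.510694]  osc_io_setattr_end+0x189/0x200 [osc]
[14017.512503]  lov_io_end_wrapper+0xcf/0xe0 [lov]
[14017.513193]  lov_io_call.isra.9+0x86/0x140 [lov]
[14017.515327]  cl_setattr_ost+0x247/0x300 [lustre]
[14017.518002]  do_truncate+0x73/0xc0
"""

if __name__ == "__main__":
    for entry in reliable_chain(parse_frames(SAMPLE)):
        print(entry)
```

Read from the bottom up, the reduced chain shows the open-with-truncate path (do_truncate, cl_setattr_ost) ending in osc_io_setattr_end, which blocks in wait_for_completion.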

In another timeout failure (https://testing.whamcloud.com/test_sets/768fc648-f0ff-11e9-be86-52540065bddc, conf-sanity test 78), the symptom is different, with LNet reporting the following:

[21354.716125] LNetError: 6607:0:(peer.c:3724:lnet_peer_ni_add_to_recoveryq_locked()) lpni 0.0.0.0@tcp added to recovery queue. Health = 0
[21354.718346] LNetError: 6607:0:(peer.c:3724:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 6 previous similar messages
[21364.956272] LNetError: 120-3: Refusing connection from 127.0.0.1 for 0.0.0.0@tcp: No matching NI
[21364.957975] LNetError: 6598:0:(socklnd_cb.c:1817:ksocknal_recv_hello()) Error -104 reading HELLO from 127.0.0.1
[21364.959822] LNetError: 11b-b: Connection to 0.0.0.0@tcp at host 0.0.0.0 on port 7988 was reset: is it running a compatible version of Lustre and is 0.0.0.0@tcp one of its NIDs?
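The negative numbers in the LNetError messages quoted in this report are kernel errno values: on Linux, -99 is EADDRNOTAVAIL ("Cannot assign requested address") and -104 is ECONNRESET ("Connection reset by peer"). As a minimal sketch for triaging such logs, the snippet below extracts the reporting function and decodes the code. The regular expressions and the decode_lnet_error helper are assumptions made for illustration, not an existing Lustre utility, and the errno table is hardcoded with Linux values so the result does not depend on the platform running the script.

```python
import re

# Linux errno values (asm-generic); only a few network-related entries.
LINUX_ERRNO = {
    98: "EADDRINUSE",
    99: "EADDRNOTAVAIL",
    104: "ECONNRESET",
    110: "ETIMEDOUT",
    111: "ECONNREFUSED",
}

# Matches debug-style messages such as:
#   LNetError: 21037:0:(lib-socket.c:225:lnet_sock_create()) Error ...
# Console messages like "LNetError: 120-3: ..." use another layout
# and are intentionally not matched.
LNET_RE = re.compile(
    r"LNetError: (?P<pid>\d+):\d+:"
    r"\((?P<file>[\w.-]+):(?P<line>\d+):(?P<func>\w+)\(\)\)\s+(?P<msg>.*)"
)

# A negative integer that is not part of a word or dotted address.
NEG_CODE_RE = re.compile(r"(?<![\w.])-(\d+)\b")


def decode_lnet_error(log_line):
    """Return func, msg and symbolic errno (or None) for one log line."""
    m = LNET_RE.search(log_line)
    if m is None:
        return None
    info = {"func": m.group("func"), "msg": m.group("msg"), "errno": None}
    code = NEG_CODE_RE.search(m.group("msg"))
    if code:
        num = int(code.group(1))
        # Unknown codes are reported numerically rather than guessed.
        info["errno"] = LINUX_ERRNO.get(num, str(num))
    return info


if __name__ == "__main__":
    line = ("LNetError: 21037:0:(lib-socket.c:225:lnet_sock_create()) "
            "Error trying to bind to port 1023: -99")
    print(decode_lnet_error(line))
```

Under this reading, the bind failure on port 1023 means the local address was not available to the socket, while the HELLO failure means the peer reset the connection.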

In all cases, the test timeouts tend to be related to a network or communication problem.

These test failures are quite disruptive for the client-side encryption work, because the encryption feature needs an encryption-capable kernel and can therefore only be tested on Ubuntu 18.

