[LU-12874] Ubuntu client testing instability Created: 18/Oct/19 Updated: 21/Dec/22 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.13.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Sebastien Buisson | Assignee: | WC Triage |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None | ||
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
It seems the Lustre Ubuntu client testing is quite unstable on master. I triggered all the usual test groups but with clients running on Ubuntu 18.04, and a vast majority of test sessions failed.

https://review.whamcloud.com/36470

Various tests failed, but always because of a timeout. In some of them (e.g. https://testing.whamcloud.com/test_sets/508af68a-f0fe-11e9-be86-52540065bddc), we have the following error message:

LNetError: 21037:0:(lib-socket.c:225:lnet_sock_create()) Error trying to bind to port 1023: -99

So it has some similarities with the issue reported under

[14017.507205] Call Trace:
[14017.507537] __schedule+0x24e/0x880
[14017.507978] schedule+0x2c/0x80
[14017.508405] schedule_timeout+0x1cf/0x350
[14017.509155] ? ptlrpcd_add_req+0x10e/0x2c0 [ptlrpc]
[14017.509750] wait_for_completion+0xba/0x140
[14017.510257] ? wake_up_q+0x80/0x80
[14017.510694] osc_io_setattr_end+0x189/0x200 [osc]
[14017.511258] ? lov_io_iter_fini_wrapper+0x40/0x40 [lov]
[14017.511902] cl_io_end+0x57/0x130 [obdclass]
[14017.512503] lov_io_end_wrapper+0xcf/0xe0 [lov]
[14017.513193] lov_io_call.isra.9+0x86/0x140 [lov]
[14017.513756] lov_io_end+0x36/0xd0 [lov]
[14017.514241] cl_io_end+0x57/0x130 [obdclass]
[14017.514779] cl_io_loop+0xd8/0x1c0 [obdclass]
[14017.515327] cl_setattr_ost+0x247/0x300 [lustre]
[14017.515894] ll_setattr_raw+0xd6a/0xec0 [lustre]
[14017.516473] ll_setattr+0x5f/0xa0 [lustre]
[14017.517106] notify_change+0x2eb/0x440
[14017.517580] do_truncate+0x73/0xc0
[14017.518002] ? __inode_permission+0x5b/0x160
[14017.518518] path_openat+0x1192/0x1960
[14017.518982] do_filp_open+0x9b/0x110
[14017.519421] ? __check_object_size+0xc3/0x1a0
[14017.519947] ? __alloc_fd+0xb2/0x170
[14017.520404] do_sys_open+0x1bb/0x2c0
[14017.520981] ? do_sys_open+0x1bb/0x2c0
[14017.521439] SyS_openat+0x14/0x20
[14017.521858] do_syscall_64+0x73/0x130
[14017.522304] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[14017.522894] RIP: 0033:0x7ff30cb66c8e
[14017.523330] RSP: 002b:00007ffd0e1e6180 EFLAGS: 00000246 ORIG_RAX: 0000000000000101
[14017.524217] RAX: ffffffffffffffda RBX: 00005555a742f160 RCX: 00007ff30cb66c8e
[14017.525154] RDX: 0000000000000241 RSI: 00007ffd0e1e6ff4 RDI: 00000000ffffff9c
[14017.525956] RBP: 0000000000000001 R08: 00007ffd0e1e701b R09: 0000000000000000
[14017.526758] R10: 00000000000001b6 R11: 0000000000000246 R12: 0000000000000241
[14017.527551] R13: 00007ffd0e1e6ff4 R14: 0000000000000001 R15: 0000000000100000

In this other timeout failure, https://testing.whamcloud.com/test_sets/768fc648-f0ff-11e9-be86-52540065bddc (conf-sanity test 78), the symptom is different, with LNet complaining like this:

[21354.716125] LNetError: 6607:0:(peer.c:3724:lnet_peer_ni_add_to_recoveryq_locked()) lpni 0.0.0.0@tcp added to recovery queue. Health = 0
[21354.718346] LNetError: 6607:0:(peer.c:3724:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 6 previous similar messages
[21364.956272] LNetError: 120-3: Refusing connection from 127.0.0.1 for 0.0.0.0@tcp: No matching NI
[21364.957975] LNetError: 6598:0:(socklnd_cb.c:1817:ksocknal_recv_hello()) Error -104 reading HELLO from 127.0.0.1
[21364.959822] LNetError: 11b-b: Connection to 0.0.0.0@tcp at host 0.0.0.0 on port 7988 was reset: is it running a compatible version of Lustre and is 0.0.0.0@tcp one of its NIDs?

In all cases, the test timeouts appear to be related to a network or communication problem. These test failures are quite disruptive for the client-side encryption work, because the encryption feature can only be tested on Ubuntu 18, as it needs an encryption-capable kernel. |
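For reference when reading the LNetError lines above, the negative numbers are kernel-style error returns. The sketch below is only an illustration and not part of Lustre or its test framework: the helper name `decode_lnet_errors` and the regular expression are hypothetical, and the table assumes standard Linux errno numbering, covering only the two codes that appear in this report (-99 and -104).

```python
import re

# Linux errno numbering (assumption: the clients run a Linux kernel).
# Only the two codes quoted in this ticket are listed.
LINUX_ERRNO = {
    99: ("EADDRNOTAVAIL", "Cannot assign requested address"),
    104: ("ECONNRESET", "Connection reset by peer"),
}

# Hypothetical pattern: a negative code directly after ": " or "Error ".
_CODE_RE = re.compile(r"(?:: |Error )-(\d+)\b")


def decode_lnet_errors(line):
    """Return (code, name, description) tuples for negative codes in a log line."""
    found = []
    for match in _CODE_RE.finditer(line):
        code = int(match.group(1))
        name, desc = LINUX_ERRNO.get(code, ("UNKNOWN", "not in the table above"))
        found.append((-code, name, desc))
    return found


if __name__ == "__main__":
    samples = [
        "LNetError: 21037:0:(lib-socket.c:225:lnet_sock_create()) "
        "Error trying to bind to port 1023: -99",
        "LNetError: 6598:0:(socklnd_cb.c:1817:ksocknal_recv_hello()) "
        "Error -104 reading HELLO from 127.0.0.1",
    ]
    for sample in samples:
        for code, name, desc in decode_lnet_errors(sample):
            print(f"{code}: {name} ({desc})")
```

Read this way, the bind failure on port 1023 is EADDRNOTAVAIL and the HELLO read failure is ECONNRESET, which fits the observation that the timeouts look network related.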