Lustre / LU-12874

Ubuntu client testing instability

Details

    • Bug
    • Resolution: Unresolved
    • Major
    • None
    • Lustre 2.13.0
    • None
    • 3

    Description

      Lustre client testing on Ubuntu appears to be quite unstable on master. I triggered all the usual test groups with clients running Ubuntu 18.04, and the vast majority of test sessions failed.

      https://review.whamcloud.com/36470

      Various tests failed, but always because of a timeout. In some of them (e.g. https://testing.whamcloud.com/test_sets/508af68a-f0fe-11e9-be86-52540065bddc), we have the following error message:

      LNetError: 21037:0:(lib-socket.c:225:lnet_sock_create()) Error trying to bind to port 1023: -99
      

      This resembles the issue reported under LU-11805. In other cases, however, this message does not appear. For instance, in this failure (https://testing.whamcloud.com/test_sets/e204e6a8-f0fd-11e9-a0ba-52540065bddc), we can see errors while trying to send requests. The stack trace of the stalled process (dd) is as follows:

      [14017.507205] Call Trace:
      [14017.507537]  __schedule+0x24e/0x880
      [14017.507978]  schedule+0x2c/0x80
      [14017.508405]  schedule_timeout+0x1cf/0x350
      [14017.509155]  ? ptlrpcd_add_req+0x10e/0x2c0 [ptlrpc]
      [14017.509750]  wait_for_completion+0xba/0x140
      [14017.510257]  ? wake_up_q+0x80/0x80
      [14017.510694]  osc_io_setattr_end+0x189/0x200 [osc]
      [14017.511258]  ? lov_io_iter_fini_wrapper+0x40/0x40 [lov]
      [14017.511902]  cl_io_end+0x57/0x130 [obdclass]
      [14017.512503]  lov_io_end_wrapper+0xcf/0xe0 [lov]
      [14017.513193]  lov_io_call.isra.9+0x86/0x140 [lov]
      [14017.513756]  lov_io_end+0x36/0xd0 [lov]
      [14017.514241]  cl_io_end+0x57/0x130 [obdclass]
      [14017.514779]  cl_io_loop+0xd8/0x1c0 [obdclass]
      [14017.515327]  cl_setattr_ost+0x247/0x300 [lustre]
      [14017.515894]  ll_setattr_raw+0xd6a/0xec0 [lustre]
      [14017.516473]  ll_setattr+0x5f/0xa0 [lustre]
      [14017.517106]  notify_change+0x2eb/0x440
      [14017.517580]  do_truncate+0x73/0xc0
      [14017.518002]  ? __inode_permission+0x5b/0x160
      [14017.518518]  path_openat+0x1192/0x1960
      [14017.518982]  do_filp_open+0x9b/0x110
      [14017.519421]  ? __check_object_size+0xc3/0x1a0
      [14017.519947]  ? __alloc_fd+0xb2/0x170
      [14017.520404]  do_sys_open+0x1bb/0x2c0
      [14017.520981]  ? do_sys_open+0x1bb/0x2c0
      [14017.521439]  SyS_openat+0x14/0x20
      [14017.521858]  do_syscall_64+0x73/0x130
      [14017.522304]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
      [14017.522894] RIP: 0033:0x7ff30cb66c8e
      [14017.523330] RSP: 002b:00007ffd0e1e6180 EFLAGS: 00000246 ORIG_RAX: 0000000000000101
      [14017.524217] RAX: ffffffffffffffda RBX: 00005555a742f160 RCX: 00007ff30cb66c8e
      [14017.525154] RDX: 0000000000000241 RSI: 00007ffd0e1e6ff4 RDI: 00000000ffffff9c
      [14017.525956] RBP: 0000000000000001 R08: 00007ffd0e1e701b R09: 0000000000000000
      [14017.526758] R10: 00000000000001b6 R11: 0000000000000246 R12: 0000000000000241
      [14017.527551] R13: 00007ffd0e1e6ff4 R14: 0000000000000001 R15: 0000000000100000
      

      In another timeout failure, https://testing.whamcloud.com/test_sets/768fc648-f0ff-11e9-be86-52540065bddc (conf-sanity test 78), the symptom is different, with LNet complaining as follows:

      [21354.716125] LNetError: 6607:0:(peer.c:3724:lnet_peer_ni_add_to_recoveryq_locked()) lpni 0.0.0.0@tcp added to recovery queue. Health = 0
      [21354.718346] LNetError: 6607:0:(peer.c:3724:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 6 previous similar messages
      [21364.956272] LNetError: 120-3: Refusing connection from 127.0.0.1 for 0.0.0.0@tcp: No matching NI
      [21364.957975] LNetError: 6598:0:(socklnd_cb.c:1817:ksocknal_recv_hello()) Error -104 reading HELLO from 127.0.0.1
      [21364.959822] LNetError: 11b-b: Connection to 0.0.0.0@tcp at host 0.0.0.0 on port 7988 was reset: is it running a compatible version of Lustre and is 0.0.0.0@tcp one of its NIDs?
      

      In all cases, the test timeouts appear to be related to a network or communication problem.
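
      For triage, the negative numbers in the LNet messages above (-99 and -104) can be read as kernel-style negated errno values. The sketch below is only an illustration: the helper name decode_lnet_errors and the small lookup table are hypothetical, and the table assumes the standard Linux errno numbering (99 = EADDRNOTAVAIL, 104 = ECONNRESET).

```python
import re

# Linux errno values; assumed here because the failing clients run Linux.
# Only the two codes that appear in this report are listed.
LINUX_ERRNO = {
    99: ("EADDRNOTAVAIL", "Cannot assign requested address"),
    104: ("ECONNRESET", "Connection reset by peer"),
}

# A standalone negative integer such as ": -99" or "Error -104".
# The lookbehind skips tokens like "120-3" or "lib-socket".
_NEG_CODE = re.compile(r"(?<![\w.-])-(\d+)\b")


def decode_lnet_errors(line):
    """Return (code, name, description) for each known negated errno in a log line."""
    found = []
    for match in _NEG_CODE.finditer(line):
        code = int(match.group(1))
        if code in LINUX_ERRNO:
            name, desc = LINUX_ERRNO[code]
            found.append((-code, name, desc))
    return found


if __name__ == "__main__":
    samples = [
        "LNetError: 21037:0:(lib-socket.c:225:lnet_sock_create()) "
        "Error trying to bind to port 1023: -99",
        "LNetError: 6598:0:(socklnd_cb.c:1817:ksocknal_recv_hello()) "
        "Error -104 reading HELLO from 127.0.0.1",
    ]
    for sample in samples:
        for code, name, desc in decode_lnet_errors(sample):
            print(f"{code}: {name} ({desc})")
```

      Read this way, the bind failure on port 1023 is "cannot assign requested address" and the HELLO failure is "connection reset by peer", which is consistent with a network or communication problem rather than a failure in the tests themselves.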

      These test failures are quite disruptive for the client-side encryption work, because the encryption feature can only be tested on Ubuntu 18.04, as it needs an encryption-capable kernel.

            People

              wc-triage WC Triage
              sebastien Sebastien Buisson