[LU-17294] sanity-lnet test_219: timeout on Ubuntu since 2.15.59

Details

    • Bug
    • Resolution: Fixed
    • Critical
    • None
    • Lustre 2.16.0
    • None
    • 3

    Description

      This issue was created by maloo for Andreas Dilger <adilger@whamcloud.com>

      This issue relates to the following test suite run:
      https://testing.whamcloud.com/test_sets/f17b0778-f4c2-4789-9e34-f5eaf3425b08

      test_219 failed with the following error:

      Timeout occurred after 429 minutes, last suite running was sanity-lnet
      

      Test session details:
      clients: https://build.whamcloud.com/job/lustre-reviews/100335 - 5.15.0-71-generic
      servers: https://build.whamcloud.com/job/lustre-reviews/100335 - 4.18.0-477.27.1.el8_lustre.x86_64

      It looks like all failures are on top of the 2.15.59 tag, so the regression is likely caused by one of the recent patch landings:

      6521c313f7 New tag 2.15.59
      6a6e4ee20f LU-17184 mgc: remove damaged local configs
      21295b169b LU-17213 llite: check sdio before freeing it
      87ca3cffe6 LU-17259 lnet: kgnilnd_nl_get should return 0 on success
      5afe3b0538 LU-17258 socklnd: ensure connection type established upon race
      c4c9a8eea3 LU-17256 debian: allow building client dkms on arm64
      982eca73a9 LU-17000 coverity: Fix Logically dead code under liblnetconfig.c
      b83156304d LU-17254 lnet: Fix ofed detection with specific kernel version
      9ba375983d LU-17249 ptlrpc: protect scp_rqbd_idle list operations
      ee56161ea0 LU-17000 coverity: Fix Dereference after null under client.c
      edb968d04f LU-17242 debug: remove CFS_CHECK_STACK
      4b290188fb LU-16868 tests: skip conf-sanity/32 in interop
      37a50f74da LU-16796 lfsck: Change lfsck_assistant_object to use kref
      d7e3e7c104 LU-16796 target: Change struct job_stat to use kref
      a12c352a3d LU-17205 utils: add lctl get_param -H option
      8fa3532b1e LU-17204 lod: don't panic on short LOVEA
      f2f8b6deaf LU-17203 libcfs: ignore remaining items
      1759ae751a LU-16796 ofd: Change struct ofd_seq to use refcount_t
      e420e92ac9 LU-17196 tests: sanity-lnet test_310 MR support
      6aede12548 LU-16518 rsync: fix new clang error in lustre_rsync.c
      36b14a23a6 LU-17207 lnet: race b/w monitor thr stop and discovery push
      1b694ad04f LU-16896 flr: resync could mess mirror content
      d1fadf8e8a LU-17132 kernel: update RHEL 8.8 [4.18.0-477.27.1.el8_8]
      67e0d9e40a LU-17191 osc: only call xa_insert for new entries
      ee0e9447e7 LU-17115 quota: fix race of acquiring locks in qmt
      ecea24d843 LU-17071 o2iblnd: IBLND_REJECT_EARLY condition causes Oops
      05c97b1096 LU-17232 build: fix ext4-misc for el7.6 server
      24d515367f LU-9859 libcfs: migrate libcfs_mem.c to lnet/lib-mem.c
      b0cc96a1ff LU-17131 ldiskfs: el9.2 encdata and filename-encode
      57ac32a223 LU-16097 quota: release preacquired quota when over limits
      6fbffd9c09 LU-14361 statahead: add tunable for fname pattern statahead
      753e058b4c LU-4974 lod: Change pool_desc to "[lod|lov]_pool_desc"
      cb5f92c0e3 LU-10391 ksocklnd: use ksocknal_protocol v4 for IPv6
      02b22df643 LU-17235 o2iblnd: adding alias ib interface causes crash
      6ad9ef1fec LU-17225 utils: check_iam print records
      b5fde4d6c0 LU-17197 obdclass: preserve fairness when waiting for rpc slot
      68254c484a LU-10391 lnet: handle discovery with Netlink
      4512347d6c LU-16356 hsm: add running ref to the coordinator
      a772e90243 LU-16032 osd: move unlink of large objects to separate thread
      2c97684db9 LU-17181 misc: don't block reclaim threads
      a8e66b899a LU-17103 lnet: use workqueue for lnd ping buffer updates
      

      VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
      sanity-lnet test_219 - Timeout occurred after 429 minutes, last suite running was sanity-lnet

          Activity

            simmonsja James A Simmons added a comment - Patch for LU-17700 fixed this issue.

            gerrit Gerrit Updater added a comment - "James Simmons <jsimmons@infradead.org>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/53677
            Subject: LU-17294 tests: verify sanity-lnet on Ubuntu
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 57c210f504a266fd43a9f9bc4e2cff366e47bec4

            simmonsja James A Simmons added a comment - I noticed in the sanity-lnet logs that it looks like Lustre is mounted. The LU-17311 patch also runs sanity-lnet test_219, but it does so against a revert of a Lustre patch. Perhaps that bug is exposing an issue?

            simmonsja James A Simmons added a comment - It is strange that this happens only on Ubuntu. Also, looking at the logs, the test does not seem to get beyond lnetctl lnet configure.
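
When a single command appears to hang, as lnetctl lnet configure seems to here, bounding it with a timeout can surface the hang in seconds instead of waiting for the suite-level timeout. The following is only a minimal sketch: the helper name run_with_timeout is hypothetical and not part of the Lustre test framework.

```python
import subprocess
import sys
from typing import Optional, Sequence, Tuple


def run_with_timeout(cmd: Sequence[str], timeout_s: float) -> Tuple[bool, Optional[int]]:
    """Run cmd and return (finished, returncode).

    finished is False and returncode is None when cmd exceeds timeout_s;
    in that case subprocess.run() kills the child before raising.
    """
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return False, None
    return True, proc.returncode


if __name__ == "__main__":
    # Stand-in for a hanging command: a Python child that sleeps for 30 s.
    hang = [sys.executable, "-c", "import time; time.sleep(30)"]
    print(run_with_timeout(hang, 1.0))
```

On a test node the call might look like run_with_timeout(["lnetctl", "lnet", "configure"], 60.0); the 60-second limit is an arbitrary assumption, not a value taken from the test suite.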

            ssmirnov Serguei Smirnov added a comment - According to the bisection results (bisect patch Gerrit change ID 53185), the issue shows up in 68254c484a, "LU-10391 lnet: handle discovery with Netlink".

            Two commits prior to this point passed the test:

            9938228dc7e708fd, "LU-17198 tests: running_in_vm to recognize qemu"
            4512347d6cda68fc, "LU-16356 hsm: add running ref to the coordinator"

            One commit after this point also failed the test (bisect patch Gerrit change ID 53186):

            57ac32a22372065, "LU-16097 quota: release preacquired quota when over limits"

            We now need to understand how to fix it.

            gerrit Gerrit Updater added a comment - "Serguei Smirnov <ssmirnov@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/53186
            Subject: LU-17294 lnet: bisect sanity-lnet timeout
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 8fcba6f7aae458304bf35e9934f0dffeec69531f

            gerrit Gerrit Updater added a comment - "Serguei Smirnov <ssmirnov@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/53185
            Subject: LU-17294 lnet: bisect sanity-lnet timeout
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 6d723c3f58dbc643c02acee1c0ff01c6ec1d1247
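
The bisection reported in this thread can be illustrated with a small binary search over the commit list from the description. This is a sketch under stated assumptions, not the procedure actually used: it assumes the list in the description is in landing order (newest first, reversed here), that every commit after the first failing one also fails, and it replaces a real sanity-lnet test_219 run on an Ubuntu client with a simulated predicate. The names first_bad and simulated_is_bad are hypothetical.

```python
from typing import Callable, Sequence

# Subset of the commit list from the description, reordered oldest to newest
# (assumption: the description lists commits newest first, in landing order).
COMMITS = [
    "a8e66b899a", "2c97684db9", "a772e90243", "4512347d6c",
    "68254c484a", "b5fde4d6c0", "6ad9ef1fec", "02b22df643",
    "cb5f92c0e3", "753e058b4c", "6fbffd9c09", "57ac32a223",
]


def first_bad(commits: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Return the oldest commit for which is_bad() is true.

    Requires that the last commit is bad and that every commit after the
    first bad one is also bad.
    """
    lo, hi = 0, len(commits) - 1  # invariant: commits[hi] is bad
    while lo < hi:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid        # first bad commit is at mid or earlier
        else:
            lo = mid + 1    # first bad commit is after mid
    return commits[lo]


def simulated_is_bad(commit: str) -> bool:
    """Stand-in for a test run: fail from 68254c484a onward, as reported."""
    return COMMITS.index(commit) >= COMMITS.index("68254c484a")


if __name__ == "__main__":
    print(first_bad(COMMITS, simulated_is_bad))
```

With twelve candidate commits the search needs at most four test runs, which matters when each failing run can take hours to time out.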

            adilger Andreas Dilger added a comment - simmonsja, ssmirnov, could you please take a look? There are about 15 failures, but only since 2023-11-09, so this is a very recent regression.

            People

              Assignee: simmonsja James A Simmons
              Reporter: maloo Maloo