Lustre / LU-14809

Intermittent failure of sanity test_398g (we should see 8 RPCs in flight)

Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Minor
    • Labels: None
    • Affects Version: Lustre 2.15.7
    • Fix Version: None
    • Severity: 3

    Description

      Lustre Gerrit Janitor
      Patch Set 15:

      Testing has completed with errors! IMPORTANT: these tests appear to be new failures unique to this patch

      • sanity2@ldiskfs+DNE:test_398g(Seen in reviews: 39447 39443 39440 39438 39436)

      > sanity2@ldiskfs+DNE Failure(7638s)

      • 398g (we should see 8 RPCs in flight)

      Succeeded:

      • conf-sanity1@ldiskfs+DNE conf-sanity2@ldiskfs+DNE conf-sanity3@ldiskfs+DNE conf-sanity-slow@ldiskfs+DNE conf-sanity-slow-106@ldiskfs+DNE conf-sanity1@zfs conf-sanity2@zfs conf-sanity3@zfs conf-sanity-slow@zfs conf-sanity-slow-106@zfs insanity@ldiskfs+DNE insanity@zfs lnet-selftest@zfs lustre-rsync-test@ldiskfs+DNE lustre-rsync-test@zfs ost-pools@ldiskfs+DNE ost-pools@zfs pjdfstest@ldiskfs+DNE pjdfstest@zfs racer@ldiskfs+DNE racer@zfs recovery-small@ldiskfs+DNE recovery-small@zfs replay-dual@ldiskfs+DNE replay-dual@zfs replay-ost-single@ldiskfs+DNE replay-ost-single@zfs replay-single@ldiskfs+DNE replay-single@zfs replay-vbr@ldiskfs+DNE replay-vbr@zfs sanity1@ldiskfs+DNE sanity-slow@ldiskfs+DNE sanity1@zfs sanity2@zfs sanity-slow@zfs sanity-benchmark@ldiskfs+DNE sanity-benchmark@zfs sanity-dom@ldiskfs+DNE sanity-dom@zfs sanity-flr@ldiskfs+DNE sanity-flr@zfs sanity-hsm@ldiskfs+DNE sanity-hsm@zfs sanity-lfsck@ldiskfs+DNE sanity-lfsck@zfs sanity-lnet@zfs sanity-pcc@ldiskfs+DNE sanity-pcc@zfs sanity-pfl@ldiskfs+DNE sanity-pfl@zfs sanity-quota@ldiskfs+DNE sanity-quota@zfs sanity-scrub@ldiskfs+DNE sanity-scrub@zfs sanity-sec@ldiskfs+DNE sanity-sec@zfs sanityn@ldiskfs+DNE sanityn@zfs

      (centos7) All results and logs: https://testing-archive.whamcloud.com/gerrit-janitor/16844/results.html
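For reference, the counters the test checks can be inspected by hand on any Lustre client. A minimal sketch, assuming a client mounted at /mnt/lustre (the mount point and file name are illustrative, not taken from the test script); writing any value to rpc_stats resets its counters:

```shell
# Reset the per-OSC RPC statistics before the I/O being measured
# (writing to rpc_stats clears the histograms).
lctl set_param osc.*.rpc_stats=clear

# Drive an 8 MiB direct-I/O write, roughly what test_398g does.
dd if=/dev/zero of=/mnt/lustre/f398g bs=1M count=8 oflag=direct

# Dump the histograms; the write column of "rpcs in flight" shows the
# concurrency observed -- the test expects a maximum bucket of 8.
lctl get_param osc.*.rpc_stats
```

These commands require a live Lustre client, so they are a manual-debugging aid rather than something runnable outside a test cluster.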

      Attachments

        Issue Links

          Activity

            yujian Jian Yu added a comment -

            Test session details:
            clients: https://build.whamcloud.com/job/lustre-b2_15/101 - 4.18.0-553.27.1.el8_10.x86_64
            servers: https://build.whamcloud.com/job/lustre-b2_12/170 - 3.10.0-1160.81.1.el7_lustre.x86_64
            https://testing.whamcloud.com/test_sets/3b105e35-168a-46e9-82c8-a0d6a6faa4a5
            adilger Andreas Dilger added a comment - Also seeing failures on aarch64 clients: https://testing.whamcloud.com/test_sets/0b569f56-1b60-4dec-877e-f0b400058a49

            adilger Andreas Dilger added a comment -

            This failed 10x in the past 4 weeks. Nothing critical, but is it worthwhile to just accept the "9 RPCs in flight" answer and move on?

            adilger Andreas Dilger added a comment -

            Still seeing this intermittent failure, with 9 RPCs in flight.

            Could it just be an extra OBD_PING or similar coincident with the start of the write? There should be enough info to tie the timestamp when the stats are collected to the Lustre debug log to see which RPCs were in flight at that time.
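The timestamp correlation suggested above can be done with the standard lctl debug facilities. A sketch, assuming the failure can be reproduced locally (file names are illustrative); rpctrace must be in the debug mask before the write so RPC sends are logged:

```shell
# Ensure RPC tracing is in the debug mask and start from an empty
# debug buffer.
lctl set_param debug=+rpctrace
lctl clear

# ... reproduce the test_398g write here ...

# Dump the kernel debug buffer to a file, then look at the RPCs sent
# around the rpc_stats snapshot_time, e.g. an OBD_PING coincident
# with the start of the write.
lctl dk /tmp/lustre-debug.log
grep "Sending RPC" /tmp/lustre-debug.log
```

As above, this is a command fragment for a live client, not a portable script.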
            adilger Andreas Dilger added a comment - +1 on master: https://testing.whamcloud.com/test_sets/d338f55b-171e-4a2d-aca2-1cac8becb602
            adilger Andreas Dilger added a comment - +1 on master: https://testing.whamcloud.com/test_sets/50aacb7f-d807-42ea-aa36-28cc7e8c9a31

            adilger Andreas Dilger added a comment -

            11 failures in the past 4 weeks, out of 2000+ runs, so not super frequent.
            adilger Andreas Dilger added a comment - +1 on master: https://testing.whamcloud.com/test_sets/65370343-bb92-428b-b736-b1641382af47
            adilger Andreas Dilger added a comment - +1 on master: https://testing.whamcloud.com/test_sets/17cec08c-d1ed-48cf-bf54-b0e9b0de387b

            paf0186 Patrick Farrell added a comment -

            This test was added by the recent parallel DIO work; this specific failure is one where we got one extra RPC of ~1 page for some reason:

            8388608 bytes (8.4 MB) copied, 2.17939 s, 3.8 MB/s
            osc.lustre-OST0000-osc-ffff8800d6b67800.rpc_stats=
            snapshot_time:         1625256875.834467231 (secs.nsecs)
            read RPCs in flight:  0
            write RPCs in flight: 1
            pending write pages:  0
            pending read pages:   0

                                    read                     write
            pages per rpc         rpcs   % cum % |       rpcs   % cum %
            1:                       0   0   0   |          1  11  11
            2:                       0   0   0   |          0   0  11
            4:                       0   0   0   |          0   0  11
            8:                       0   0   0   |          0   0  11
            16:                      0   0   0   |          0   0  11
            32:                      0   0   0   |          0   0  11
            64:                      0   0   0   |          0   0  11
            128:                     0   0   0   |          0   0  11
            256:                     0   0   0   |          8  88 100

                                    read                     write
            rpcs in flight        rpcs   % cum % |       rpcs   % cum %
            1:                       0   0   0   |          1  11  11
            2:                       0   0   0   |          1  11  22
            3:                       0   0   0   |          1  11  33
            4:                       0   0   0   |          1  11  44
            5:                       0   0   0   |          1  11  55
            6:                       0   0   0   |          1  11  66
            7:                       0   0   0   |          1  11  77
            8:                       0   0   0   |          1  11  88
            9:                       0   0   0   |          1  11 100

            This doesn't indicate the feature is broken, but it's weird and we can't have the test failing.
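The histogram above can be reduced to the single number the test compares against 8. A sketch in awk of that reduction; `parse_max_flight` is a hypothetical helper name for illustration, not part of sanity.sh:

```shell
# Extract the deepest write "rpcs in flight" bucket from an rpc_stats
# dump read on stdin.
parse_max_flight() {
    # Histogram rows look like "N:  r  %  cum% | w  %  cum%"; after
    # the "rpcs in flight" header, remember the largest bucket N with
    # a nonzero write-RPC count (awk field 6, past the "|").
    awk '/rpcs in flight/ { hist = 1; next }
         hist && /^[0-9]+:/ {
             if ($6 + 0 > 0) { b = $1; sub(":", "", b); max = b }
         }
         END { print max + 0 }'
}

# Abbreviated sample from the failing run above: buckets 1..9 each saw
# one write RPC, so the deepest write concurrency observed was 9.
parse_max_flight <<'EOF'
pages per rpc         rpcs   % cum % |       rpcs   % cum %
256:       0   0   0   |          8  88 100
rpcs in flight        rpcs   % cum % |       rpcs   % cum %
1:         0   0   0   |          1  11  11
8:         0   0   0   |          1  11  88
9:         0   0   0   |          1  11 100
EOF
# Prints 9 for the sample above; the test fails because it expects 8.
```

The "pages per rpc" rows are skipped because the counter only starts after the "rpcs in flight" header line, so the 256-page bucket does not affect the result.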

            People

              Assignee: paf0186 Patrick Farrell
              Reporter: paf0186 Patrick Farrell
              Votes: 0
              Watchers: 5

              Dates

                Created:
                Updated: