Lustre / LU-11787

sanityn test 71a fails with ‘data is not flushed from client’

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.16.0, Lustre 2.15.4
    • Affects Version/s: Lustre 2.12.0, Lustre 2.12.1, Lustre 2.12.2, Lustre 2.12.3, Lustre 2.12.4, Lustre 2.12.5, Lustre 2.12.6, Lustre 2.13.0, Lustre 2.14.0, Lustre 2.15.0, Lustre 2.15.3
    • Severity: 3

    Description

      sanityn test_71a fails with ‘data is not flushed from client’ for ARM clients. These failures started on October 26, 2018.

      Looking at a recent failure, https://testing.whamcloud.com/test_sets/f1121dd4-fdef-11e8-b837-52540065bddc, we see the following in the client test_log:

      == sanityn test 71a: correct file map just after write operation is finished ========================= 00:36:26 (1544488586)
      CMD: trevis-24vm4 /usr/sbin/lctl get_param -n version 2>/dev/null ||
      				/usr/sbin/lctl lustre_build_version 2>/dev/null ||
      				/usr/sbin/lctl --version 2>/dev/null | cut -d' ' -f2
      1+0 records in
      1+0 records out
      40960 bytes (41 kB) copied, 0.00437335 s, 9.4 MB/s
      1+0 records in
      1+0 records out
      40960 bytes (41 kB) copied, 0.00155542 s, 26.3 MB/s
        File: '/mnt/lustre2/f71a.sanityn'
        Size: 163840    	Blocks: 1          IO Block: 4194304 regular file
      Device: 2c54f966h/743766374d	Inode: 144115205322894149  Links: 1
      Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
      Access: 2018-12-11 00:36:27.000000000 +0000
      Modify: 2018-12-11 00:36:27.000000000 +0000
      Change: 2018-12-11 00:36:27.000000000 +0000
       Birth: -
      209708
      fd: 3
      extent in offset 0, length 122880
      flags: 80000000
      extent in offset 122880, length 40960
      flags: 80000001
      No unwritten extents, extents number 2, file size 163840, original size 81920
       sanityn test_71a: @@@@@@ FAIL: data is not flushed from client 
      

      There are no errors in any of the console logs.
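
      For context, the "flags" values in the listing above are FIEMAP extent flags printed by the test's checkfiemap helper: 0x80000000 matches Lustre's FIEMAP_EXTENT_NET (the extent's data lives on the OST rather than locally), and the low bit in 0x80000001 is FIEMAP_EXTENT_LAST. The same write pattern and extent map can be inspected by hand with filefrag(8), which issues the same FIEMAP ioctl; a minimal sketch, assuming a Lustre client mounted at the hypothetical path /mnt/lustre:

      # reproduce the test's hole-data-hole-data write pattern
      dd if=/dev/urandom of=/mnt/lustre/f71a.demo bs=40K seek=1 count=1
      dd if=/dev/urandom of=/mnt/lustre/f71a.demo bs=40K seek=3 count=1
      # print one line per extent (logical offset, length, flags);
      # holes appear as gaps between the reported extents
      filefrag -v /mnt/lustre/f71a.demo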

      Logs for more of these test failures are at
      https://testing.whamcloud.com/test_sets/8b2f4282-d9d0-11e8-b46b-52540065bddc
      https://testing.whamcloud.com/test_sets/83526f8c-df5f-11e8-a871-52540065bddc
      https://testing.whamcloud.com/test_sets/1de368ba-fa38-11e8-bb6b-52540065bddc

          Activity


            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/51287/
            Subject: LU-11787 test: Fix checkfilemap tests for 64K page
            Project: fs/lustre-release
            Branch: b2_15
            Current Patch Set:
            Commit: 66d9916dc947064b598f52476fb6482c0bbaff10

            gerrit Gerrit Updater added a comment - "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/51287/ Subject: LU-11787 test: Fix checkfilemap tests for 64K page Project: fs/lustre-release Branch: b2_15 Current Patch Set: Commit: 66d9916dc947064b598f52476fb6482c0bbaff10

            "Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/51287
            Subject: LU-11787 test: Fix checkfilemap tests for 64K page
            Project: fs/lustre-release
            Branch: b2_15
            Current Patch Set: 1
            Commit: 5755b9c9148cfe235fb2f4dcb066eb0c83fab274

            gerrit Gerrit Updater added a comment - "Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/51287 Subject: LU-11787 test: Fix checkfilemap tests for 64K page Project: fs/lustre-release Branch: b2_15 Current Patch Set: 1 Commit: 5755b9c9148cfe235fb2f4dcb066eb0c83fab274
            pjones Peter Jones added a comment -

            Landed for 2.16

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/45629/
            Subject: LU-11787 test: Fix checkfilemap tests for 64K page
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 7c88dfd28b5cc6114a85f187ecb2473657d42c9d

            gerrit Gerrit Updater added a comment - "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/45629/ Subject: LU-11787 test: Fix checkfilemap tests for 64K page Project: fs/lustre-release Branch: master Current Patch Set: Commit: 7c88dfd28b5cc6114a85f187ecb2473657d42c9d

            adilger Andreas Dilger added a comment -

            Looking at the test itself, it seems clear that it won't work properly on a 64KB PAGE_SIZE system:

                     # write data this way: hole - data - hole - data
                    dd if=/dev/urandom of=$DIR1/$tfile bs=40K seek=1 count=1
                    dd if=/dev/urandom of=$DIR1/$tfile bs=40K seek=3 count=1
            

            because the "hole" is from 80KB-120KB, and with 64KB PAGE_SIZE the 40-80KB write will actually dirty the 64KB-128KB page and no hole is created...

            The test needs to be updated to use seek=0 for the first dd and seek=4 for the second dd, and then it should pass (a sketch follows below). It may need some adjustments to the later checks, but at least there will be a hole from 64KB-128KB for 64KB PAGE_SIZE clients, and from 40KB-160KB for 4KB PAGE_SIZE clients. If we want to keep the file layout consistent across page sizes, it should use bs=64K.
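
            A minimal sketch of that adjustment (hypothetical; the offsets in the patch that eventually landed may differ):

                    # data at 0-40KB and 160-200KB leaves a hole at 40-160KB, so a
                    # hole at 64KB-128KB survives even after rounding writes up to
                    # 64KB pages
                    dd if=/dev/urandom of=$DIR1/$tfile bs=40K seek=0 count=1
                    dd if=/dev/urandom of=$DIR1/$tfile bs=40K seek=4 count=1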

            xinliang Xinliang Liu added a comment -

            Currently, ldiskfs block allocation is PAGE_SIZE-aligned (see function osd_ldiskfs_map_inode_pages()), so the test case needs to be modified to use PAGE_SIZE-aligned writes for Arm; a sketch follows below.
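
            One way to do that, as a rough sketch (the landed patch may size the writes differently), is to derive the write size from the client's page size:

                    # hypothetical: scale each write to the client's page size so
                    # the holes between writes stay page-aligned on x86 and Arm alike
                    PAGESZ=$(getconf PAGE_SIZE)
                    dd if=/dev/urandom of=$DIR1/$tfile bs=$PAGESZ seek=1 count=1
                    dd if=/dev/urandom of=$DIR1/$tfile bs=$PAGESZ seek=3 count=1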


            "James Simmons <jsimmons@infradead.org>" uploaded a new patch: https://review.whamcloud.com/45629
            Subject: LU-11787 test: re-enable sanityn 71a for 64K page systems
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 49b892c53b16a7d02c29b4a041175d5303022cbb

            gerrit Gerrit Updater added a comment - "James Simmons <jsimmons@infradead.org>" uploaded a new patch: https://review.whamcloud.com/45629 Subject: LU-11787 test: re-enable sanityn 71a for 64K page systems Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: 49b892c53b16a7d02c29b4a041175d5303022cbb
            xinliang Xinliang Liu added a comment (edited) -

            Note that the crash in the comment below is a separate issue, tracked in LU-15122.
            xinliang Xinliang Liu added a comment -

            Arm clients crash when running sanityn test 71a on the master branch:

            [ 1330.973746] Lustre: DEBUG MARKER: Using TIMEOUT=20
            [ 1333.254497] Lustre: Modifying parameter general.*.*.lbug_on_grant_miscount in log params
            [ 1338.681588] Lustre: DEBUG MARKER: beegfs-test-01.novalocal: executing check_logdir /tmp/test_logs/2021-10-11/020444
            [ 1339.086489] Lustre: DEBUG MARKER: beegfs-test-01.novalocal: executing yml_node
            [ 1339.419862] Lustre: DEBUG MARKER: Client: 2.14.52.80
            [ 1339.495670] Lustre: DEBUG MARKER: MDS: 2.14.52.80
            [ 1339.567308] Lustre: DEBUG MARKER: OSS: 2.14.52.80
            [ 1339.627097] Lustre: DEBUG MARKER: -----============= acceptance-small: sanityn ============----- Mon Oct 11 02:05:23 UTC 2021
            [ 1339.980338] Lustre: DEBUG MARKER: excepting tests: 28
            [ 1340.091948] Lustre: Mounted lustre-client
            [ 1340.468327] Lustre: DEBUG MARKER: beegfs-test-01.novalocal: executing check_config_client /mnt/lustre
            [ 1341.262854] Lustre: DEBUG MARKER: Using TIMEOUT=20
            [ 1341.443750] Lustre: Modifying parameter general.*.*.lbug_on_grant_miscount in log params
            [ 1342.212768] Lustre: DEBUG MARKER: == sanityn test 71a: correct file map just after write operation is finished ========================================================== 02:05:25 (1633917925)
            [ 1342.335774] LustreError: 4391:0:(osd_io.c:538:osd_do_bio()) ASSERTION( iobuf->dr_rw == 0 ) failed: page_idx 1, block_idx 16, i 14,start_blocks: 0, count: 30, npages: 3
            [ 1342.338924] LustreError: 4391:0:(osd_io.c:538:osd_do_bio()) LBUG
            [ 1342.340168] Pid: 4391, comm: ll_ost_io06_002 4.18.0-305.7.1.el8_lustre.aarch64 #1 SMP Mon Jul 19 08:24:26 UTC 2021
            [ 1342.342344] Call Trace:
            [ 1342.342858] [<0>] libcfs_call_trace+0xb8/0x118 [libcfs]
            [ 1342.343908] [<0>] lbug_with_loc+0x60/0xa0 [libcfs]
            [ 1342.344888] [<0>] osd_do_bio.constprop.21+0x6c4/0xcf8 [osd_ldiskfs]
            [ 1342.346130] [<0>] osd_ldiskfs_map_write.constprop.20+0xb4/0xf0 [osd_ldiskfs]
            [ 1342.347535] [<0>] osd_ldiskfs_map_inode_pages+0x66c/0x958 [osd_ldiskfs]
            [ 1342.348841] [<0>] osd_write_commit+0x4fc/0xbb8 [osd_ldiskfs]
            [ 1342.349971] [<0>] ofd_commitrw_write+0x658/0x1e68 [ofd]
            [ 1342.351002] [<0>] ofd_commitrw+0x360/0xa18 [ofd]
            [ 1342.352134] [<0>] tgt_brw_write+0x1624/0x2d90 [ptlrpc]
            [ 1342.353229] [<0>] tgt_handle_request0+0xd0/0x978 [ptlrpc]
            [ 1342.354356] [<0>] tgt_request_handle+0x7c0/0x1a38 [ptlrpc]
            [ 1342.355513] [<0>] ptlrpc_server_handle_request+0x3bc/0x11e8 [ptlrpc]
            [ 1342.356824] [<0>] ptlrpc_main+0xd28/0x15f0 [ptlrpc]
            [ 1342.357786] [<0>] kthread+0x130/0x138
            [ 1342.358508] [<0>] ret_from_fork+0x10/0x18
            [ 1342.359295] Kernel panic - not syncing: LBUG
            [ 1342.360146] CPU: 12 PID: 4391 Comm: ll_ost_io06_002 Kdump: loaded Tainted: P           OE    --------- -  - 4.18.0-305.7.1.el8_lustre.aarch64 #1
            [ 1342.362770] Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015
            [ 1342.364231] Call trace:
            [ 1342.364751]  dump_backtrace+0x0/0x188
            [ 1342.365526]  show_stack+0x24/0x30
            [ 1342.366234]  dump_stack+0x9c/0xbc
            [ 1342.366937]  panic+0x130/0x2f8
            [ 1342.367608]  param_set_delay_minmax.isra.1+0x0/0xd0 [libcfs]
            [ 1342.368816]  osd_do_bio.constprop.21+0x6c4/0xcf8 [osd_ldiskfs]
            [ 1342.370058]  osd_ldiskfs_map_write.constprop.20+0xb4/0xf0 [osd_ldiskfs]
            [ 1342.371477]  osd_ldiskfs_map_inode_pages+0x66c/0x958 [osd_ldiskfs]
            [ 1342.372837]  osd_write_commit+0x4fc/0xbb8 [osd_ldiskfs]
            [ 1342.373951]  ofd_commitrw_write+0x658/0x1e68 [ofd]
            [ 1342.374970]  ofd_commitrw+0x360/0xa18 [ofd]
            [ 1342.375934]  tgt_brw_write+0x1624/0x2d90 [ptlrpc]
            [ 1342.376999]  tgt_handle_request0+0xd0/0x978 [ptlrpc]
            [ 1342.378123]  tgt_request_handle+0x7c0/0x1a38 [ptlrpc]
            [ 1342.379258]  ptlrpc_server_handle_request+0x3bc/0x11e8 [ptlrpc]
            [ 1342.380587]  ptlrpc_main+0xd28/0x15f0 [ptlrpc]
            [ 1342.381530]  kthread+0x130/0x138
            [ 1342.382242]  ret_from_fork+0x10/0x18
            [ 1342.383021] SMP: stopping secondary CPUs
            [ 1342.389567] Starting crashdump kernel...
            [ 1342.390356] Bye!
            
            

            jamesanunez James Nunez (Inactive) added a comment -

            The patch that landed, https://review.whamcloud.com/37561/, puts sanityn tests 16a and 71a on the ALWAYS_EXCEPT list for PPC client testing. This ticket should remain open until those tests are fixed and removed from that list.

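            For reference, ALWAYS_EXCEPT is the space-separated list of test numbers that the Lustre test framework skips unconditionally. A rough sketch of the kind of guard such a patch adds (the exact form in patch 37561 may differ):

                    # in lustre/tests/sanityn.sh: skip FIEMAP-sensitive tests on PPC clients
                    if [[ $(uname -m) == ppc64* ]]; then
                            ALWAYS_EXCEPT+=" 16a 71a"
                    fi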

            People

              Assignee: xinliang Xinliang Liu
              Reporter: jamesanunez James Nunez (Inactive)
              Votes: 0
              Watchers: 6

              Dates

                Created:
                Updated:
                Resolved: