Lustre / LU-18091

sanity-lfsck test_18h: FAIL: (6) Data in /mnt/lustre/d18h.sanity-lfsck/f0 is broken

Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Minor
    • Affects Version/s: Lustre 2.16.0
    • Environment: Ubuntu 24.04 client
    • Severity: 3

    Description

      This issue was created by maloo for jianyu <yujian@whamcloud.com>

      This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/76b849f1-717c-4798-a705-f16312ffa149

      test_18h failed with the following error:

      == sanity-lfsck test 18h: LFSCK can repair crashed PFL extent range ========================================================== 06:35:47 (1721975747)
      #####
      The PFL extent crashed. During the first cycle LFSCK scanning,
      the layout LFSCK will keep the bad PFL file(s) there without
      scanning its OST-object(s). Then in the second stage scanning,
      the OST will return related OST-object(s) to the MDT as orphan.
      And then the LFSCK on the MDT can rebuild the PFL extent with
      the 'orphan(s)' stripe information.
      #####
      0+1 records in
      0+1 records out
      295280 bytes (295 kB, 288 KiB) copied, 0.00202335 s, 146 MB/s
      cp: error copying '/mnt/lustre/d18h.sanity-lfsck/f0' to '/mnt/lustre/d18h.sanity-lfsck/guard': No data available
      Inject failure stub to simulate bad PFL extent range
      CMD: onyx-35vm6 /usr/sbin/lctl set_param fail_loc=0x162f
      fail_loc=0x162f
      chown: warning: '.' should be ':': '1.1'
      CMD: onyx-35vm6 /usr/sbin/lctl set_param fail_loc=0
      fail_loc=0
      dd: error writing '/mnt/lustre/d18h.sanity-lfsck/f0': No data available
      1+0 records in
      0+0 records out
      0 bytes copied, 0.00104513 s, 0.0 kB/s
      Trigger layout LFSCK to find out the bad lmm_oi and fix them
      CMD: onyx-35vm6 /usr/sbin/lctl lfsck_start -M lustre-MDT0000 -t layout -r -o
      Started LFSCK on the device lustre-MDT0000: scrub layout
      CMD: onyx-35vm6 /usr/sbin/lctl get_param -n 			mdd.lustre-MDT0000.lfsck_layout |
      			awk '/^status/ { print \$2 }'
      CMD: onyx-35vm3 /usr/sbin/lctl get_param -n obdfilter.lustre-OST0000.lfsck_layout
      CMD: onyx-35vm3 /usr/sbin/lctl get_param -n obdfilter.lustre-OST0001.lfsck_layout
      CMD: onyx-35vm3 /usr/sbin/lctl get_param -n obdfilter.lustre-OST0002.lfsck_layout
      CMD: onyx-35vm3 /usr/sbin/lctl get_param -n obdfilter.lustre-OST0003.lfsck_layout
      CMD: onyx-35vm6 /usr/sbin/lctl get_param -n mdd.lustre-MDT0000.lfsck_layout
      Data in /mnt/lustre/d18h.sanity-lfsck/f0 should not be broken
      Binary files /mnt/lustre/d18h.sanity-lfsck/f0 and /mnt/lustre/d18h.sanity-lfsck/guard differ
       sanity-lfsck test_18h: @@@@@@ FAIL: (6) Data in /mnt/lustre/d18h.sanity-lfsck/f0 is broken
      

      Test session details:
      clients: https://build.whamcloud.com/job/lustre-master/4558 - 6.8.0-35-generic
      servers: https://build.whamcloud.com/job/lustre-master/4558 - 5.14.0-427.24.1_lustre.el9.x86_64


      VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
      sanity-lfsck test_18h - (6) Data in /mnt/lustre/d18h.sanity-lfsck/f0 is broken

    Activity

            pjones Peter Jones added a comment -

            ok then let's close out the ticket as a duplicate of LU-18103 and reopen it if it still occurs once that patch has landed...


            hongchao.zhang Hongchao Zhang added a comment -

            The issue appears to be caused by the splice-read I/O path on Ubuntu 24.04: there is no
            corresponding CLIO state (cl_io, cl_page) set up when ll_readpage() is called, so it
            returns -ENODATA:

              ll_readpage+0x11a2/0x1290 [lustre]
              ? __pfx_ll_read_folio+0x10/0x10 [lustre]
              ll_read_folio+0xe/0x20 [lustre]
              filemap_read_folio+0x46/0xf0
              filemap_get_pages+0x27e/0x3b0
              filemap_splice_read+0x153/0x390
              do_splice_read+0x6d/0xf0
              splice_direct_to_actor+0xb2/0x270
              ? __pfx_direct_splice_actor+0x10/0x10
              do_splice_direct+0x6f/0xc0
              ? __pfx_direct_file_splice_eof+0x10/0x10
              vfs_copy_file_range+0x1b4/0x5e0
              __do_sys_copy_file_range+0xf1/0x230
              __x64_sys_copy_file_range+0x24/0x40
              x64_sys_call+0xc28/0x25c0
              do_syscall_64+0x7f/0x180
            

            It can be fixed by patch https://review.whamcloud.com/c/fs/lustre-release/+/56093 from LU-18103;
            see the test session: https://testing.whamcloud.com/test_sessions/41fe7925-22e4-4aa0-9bb1-21ed5551233f
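            For context, the cp failure reaches this path through copy_file_range(2), as the trace above
            shows. A minimal userspace sketch that should exercise the same path (not from the ticket;
            the file names are assumptions matching the reproduction later in this ticket):

            #define _GNU_SOURCE
            #include <fcntl.h>
            #include <stdio.h>
            #include <unistd.h>

            int main(int argc, char **argv)
            {
                    /* Assumed paths: a PFL file on a Lustre mount and a copy target. */
                    const char *src = argc > 1 ? argv[1] : "/mnt/lustre/f0";
                    const char *dst = argc > 2 ? argv[2] : "/mnt/lustre/f1";
                    int in = open(src, O_RDONLY);
                    int out = open(dst, O_WRONLY | O_CREAT | O_TRUNC, 0644);
                    ssize_t rc;

                    if (in < 0 || out < 0) {
                            perror("open");
                            return 1;
                    }

                    /* Same syscall cp uses; on the affected client this is expected to
                     * fail with ENODATA ("No data available") via the splice path in
                     * the stack trace above. */
                    rc = copy_file_range(in, NULL, out, NULL, 1 << 20, 0);
                    if (rc < 0)
                            perror("copy_file_range");
                    else
                            printf("copied %zd bytes\n", rc);

                    close(in);
                    close(out);
                    return rc < 0 ? 1 : 0;
            }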

            yujian Jian Yu added a comment -

            Comments from Andreas:

            It looks like the error is returned from the fast read path, which means the read is <= 4096
            bytes. I don't think the -ENODATA should be returned to userspace in this case; rather, a
            "slow" read (or similar) should be retried.
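            For illustration only, that retry idea has a userspace analogue (a hypothetical workaround
            sketch, not the kernel fix under discussion): try copy_file_range(2) first and, if it fails
            with ENODATA, fall back to a plain read(2)/write(2) loop, which goes through the regular
            read path rather than the splice path:

            #define _GNU_SOURCE
            #include <errno.h>
            #include <fcntl.h>
            #include <unistd.h>

            /* copy_with_fallback(): hypothetical helper, not from any patch in this
             * ticket.  Prefer copy_file_range(2); if the kernel surfaces ENODATA
             * (the failure discussed here), retry with ordinary reads and writes
             * from the current file offsets. */
            static int copy_with_fallback(int in, int out)
            {
                    char buf[65536];
                    ssize_t rc;

                    for (;;) {
                            rc = copy_file_range(in, NULL, out, NULL, sizeof(buf), 0);
                            if (rc == 0)
                                    return 0;               /* EOF, done */
                            if (rc > 0)
                                    continue;               /* keep copying */
                            if (errno != ENODATA)
                                    return -1;              /* real error */
                            break;                          /* fall back below */
                    }

                    /* Slow path: buffered reads do not go through the splice path
                     * and are not affected by the fast-read -ENODATA return. */
                    for (;;) {
                            ssize_t n = read(in, buf, sizeof(buf));

                            if (n == 0)
                                    return 0;
                            if (n < 0 || write(out, buf, (size_t)n) != n)
                                    return -1;
                    }
            }

            int main(int argc, char **argv)
            {
                    if (argc != 3)
                            return 2;

                    int in = open(argv[1], O_RDONLY);
                    int out = open(argv[2], O_WRONLY | O_CREAT | O_TRUNC, 0644);

                    if (in < 0 || out < 0)
                            return 1;
                    return copy_with_fallback(in, out) ? 1 : 0;
            }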

            yujian Jian Yu added a comment -

            In lustre/llite/rw.c:

            ll_readpage()
                            CDEBUG(D_VFSTRACE, "fast read pgno: %ld\n", vmpage->index);
            
                            result = -ENODATA;
            
                            /* TODO: need to verify the layout version to make sure
                             * the page is not invalid due to layout change.
                             */
                            page = cl_vmpage_page(vmpage, clob);
                            if (page == NULL) {
                                    unlock_page(vmpage);
                                    CDEBUG(D_READA, "fast read: failed to find page %ld\n",
                                            vmpage->index);
                                    ll_ra_stats_inc_sbi(sbi, RA_STAT_FAILED_FAST_READ);
                                    RETURN(result);
                            }
            
            yujian Jian Yu added a comment -

            The failure can be easily reproduced on an Ubuntu 24.04 client:

            # lfs setstripe -E 2M -S 1M -c 1 -E -1 /mnt/lustre/f0
            # cat /etc/hosts > /mnt/lustre/f0
            # dd if=/etc/hosts of=/mnt/lustre/f0 bs=1M seek=2
            0+1 records in
            0+1 records out
            303 bytes copied, 0.00146371 s, 207 kB/s
            
            # cp /mnt/lustre/f0 /mnt/lustre/f1
            cp: error copying '/mnt/lustre/f0' to '/mnt/lustre/f1': No data available
            

            Lustre client debug log:

            00000020:00000001:0.0:1725913382.264477:0:84469:0:(cl_object.c:435:cl_object_maxbytes()) Process entered
            00020000:00000002:0.0:1725913382.264478:0:84469:0:(lov_object.c:1243:lov_conf_freeze()) To take share lov(00000000c946e6cf) owner 0000000000000000/000000009c172d67
            00020000:00000002:0.0:1725913382.264479:0:84469:0:(lov_object.c:2435:lov_lsm_addref()) lsm 00000000cec182bf addref 2/0 by 000000009c172d67.
            00020000:00000002:0.0:1725913382.264480:0:84469:0:(lov_object.c:1251:lov_conf_thaw()) To release share lov(00000000c946e6cf) owner 0000000000000000/000000009c172d67
            00000020:00000001:0.0:1725913382.264481:0:84469:0:(cl_object.c:443:cl_object_maxbytes()) Process leaving (rc=17592188137472 : 17592188137472 : 1000001ff000)
            00000080:00000001:0.0:1725913382.264481:0:84469:0:(file.c:4893:ll_file_seek()) Process entered
            00000080:00200000:0.0:1725913382.264482:0:84469:0:(file.c:4895:ll_file_seek()) VFS Op:inode=[0x200000404:0xc:0x0](00000000ae4a5dd4), to=0=0x0(0)
            00000080:00000001:0.0:1725913382.264483:0:84469:0:(file.c:4932:ll_file_seek()) Process leaving (rc=0 : 0 : 0)
            00000080:00000001:0.0:1725913382.264494:0:84469:0:(rw.c:1931:ll_readpage()) Process entered
            00000080:00200000:0.0:1725913382.264495:0:84469:0:(rw.c:1994:ll_readpage()) fast read pgno: 0
            00000020:00000001:0.0:1725913382.264496:0:84469:0:(cl_page.c:542:cl_vmpage_page()) Process entered
            00000020:00000001:0.0:1725913382.264496:0:84469:0:(cl_page.c:556:cl_vmpage_page()) Process leaving (rc=0 : 0 : 0)
            00000080:00400000:0.0:1725913382.264497:0:84469:0:(rw.c:2003:ll_readpage()) fast read: failed to find page 0
            00000080:00000001:0.0:1725913382.264498:0:84469:0:(rw.c:2006:ll_readpage()) Process leaving (rc=18446744073709551555 : -61 : ffffffffffffffc3)
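            The rc in the last line matches the userspace error: on Linux, -61 is -ENODATA, which
            strerror() renders as "No data available", exactly what cp reports. A quick check
            (illustrative only):

            #include <errno.h>
            #include <stdio.h>
            #include <string.h>

            int main(void)
            {
                    /* Prints "ENODATA = 61: No data available" on Linux, matching both
                     * the rc=-61 in the debug log and the cp error message. */
                    printf("ENODATA = %d: %s\n", ENODATA, strerror(ENODATA));
                    return 0;
            }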


            "Hongchao Zhang <hongchao@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/56263
            Subject: LU-18091 test: debug patch
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 8d4234c7953229541f429d45fb8f1b2923536605

            lixi_wc Li Xi added a comment -

            Hongchao, would you please take a look?


            People

              Assignee: Hongchao Zhang (hongchao.zhang)
              Reporter: Maloo (maloo)
              Votes: 0
              Watchers: 9
