Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-18284

interop sanity test_119e test_119f: UDIO files differ, bsize 1048575, 2.12 servers crash

Details

    • Bug
    • Resolution: Fixed
    • Major
    • Lustre 2.16.0
    • Lustre 2.16.0, Lustre 2.12.9
    • None
    • 3
    • 9223372036854775807

    Description

      This issue was created by maloo for Andreas Dilger <adilger@whamcloud.com>

      This issue relates to the following test suite run with 2.12.9 servers, 2.15.91 master client:
      https://testing.whamcloud.com/test_sets/a2ba69c8-7050-435a-9042-b3b074d98626

      test_119e failed with the following error:

      files differ, bsize 1048575
      

      Test session details:
      clients: https://build.whamcloud.com/job/lustre-master/4544 - 5.14.0-284.30.1.el9_2.x86_64
      servers: https://build.whamcloud.com/job/lustre-b2_12/164 - 3.10.0-1160.49.1.el7_lustre.x86_64

      This test started failing on 2024-06-25 and is still failing.

      This causes list corruption in test_119e:

      WARNING: CPU: 1 PID: 18212 at lib/list_debug.c:62 __list_del_entry+0x82/0xd0
      list_del corruption. next->prev should be ffff9516e8be88e8, but was f2756b16e8be88e8
      CPU: 1 PID: 18212 Comm: ll_ost_io00_015   3.10.0-1160.49.1.el7_lustre.x86_64 #1
       Call Trace:
       dump_stack+0x19/0x1b
       __warn+0xd8/0x100
       warn_slowpath_fmt+0x5f/0x80
       __list_del_entry+0x82/0xd0
       ptlrpc_server_hpreq_fini+0x70/0x170 [ptlrpc]
       ptlrpc_server_finish_active_request+0x8a/0x140 [ptlrpc]
       ptlrpc_server_handle_request+0x401/0xab0 [ptlrpc]
       ptlrpc_main+0xb34/0x1470 [ptlrpc]
       kthread+0xd1/0xe0
      

      Then the crash in test_119f again due to list corruption:

      general protection fault: 0000 [#1] SMP 
      CPU: 1 PID: 16242 Comm: ll_ost00_023  3.10.0-1160.49.1.el7_lustre.x86_64 #1
      Call Trace:
       ldlm_add_blocked_lock+0xb2/0xc0 [ptlrpc]
       ldlm_add_waiting_lock+0x1ac/0x300 [ptlrpc]
       ldlm_server_blocking_ast+0x66c/0xa40 [ptlrpc]
       tgt_blocking_ast+0x159/0x630 [ptlrpc]
       ldlm_work_bl_ast_lock+0x11c/0x300 [ptlrpc]
       ptlrpc_set_wait+0x72/0x790 [ptlrpc]
       ldlm_run_ast_work+0xd5/0x3a0 [ptlrpc]
       ldlm_handle_conflict_lock+0x70/0x290 [ptlrpc]
       ldlm_lock_enqueue+0x470/0x9b0 [ptlrpc]
       ldlm_cli_enqueue_local+0x367/0x830 [ptlrpc]
       ofd_destroy_by_fid+0x1d1/0x510 [ofd]
       ofd_destroy_hdl+0x257/0x9d0 [ofd]
       tgt_request_handle+0xada/0x1570 [ptlrpc]
       ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
       ptlrpc_main+0xb34/0x1470 [ptlrpc]
       kthread+0xd1/0xe0
      

      VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
      sanity test_119e - files differ, bsize 1048575

      Attachments

        Issue Links

          Activity

            [LU-18284] interop sanity test_119e test_119f: UDIO files differ, bsize 1048575, 2.12 servers crash
            pjones Peter Jones added a comment -

            Merged for 2.16

            pjones Peter Jones added a comment - Merged for 2.16

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/56571/
            Subject: LU-18284 llite: disallow udio exceptions
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: ff018bb77a371415a3973a58a70dfcc431862535

            gerrit Gerrit Updater added a comment - "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/56571/ Subject: LU-18284 llite: disallow udio exceptions Project: fs/lustre-release Branch: master Current Patch Set: Commit: ff018bb77a371415a3973a58a70dfcc431862535

            Will try the looser of the two options with the major/minor version check.

            It this looks okay will need to add some "version < 2.15 && skip" logic to several tests.

            stancheff Shaun Tancheff added a comment - Will try the looser of the two options with the major/minor version check. It this looks okay will need to add some "version < 2.15 && skip" logic to several tests.

            "Shaun Tancheff <shaun.tancheff@hpe.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/56571
            Subject: LU-18284 llite: disallow udio exceptions prior to 2.15
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: a2fca7c347639101e9c7aa8c22f4a876f90cb2c7

            gerrit Gerrit Updater added a comment - "Shaun Tancheff <shaun.tancheff@hpe.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/56571 Subject: LU-18284 llite: disallow udio exceptions prior to 2.15 Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: a2fca7c347639101e9c7aa8c22f4a876f90cb2c7

            I had a look at removing the client-side UNALIGNED_DIO flag override, but it isn't as obvious as I thought it might be. The flag itself is not set on connections to servers that "look OK", but rather it is checked on a per-write basis in various parts of the code, and it isn't clear to me where the right place is to turn it off. It looks like part of the important code is:

            void ll_io_init(struct cl_io *io, struct file *file, enum cl_io_type iot,
                            struct vvp_io_args *args)
            {
                    :
                    io->ci_target_is_zfs = 0;
                    /* unaligned DIO has compat issues with some older servers, but we find
                     * out if there are such servers while setting up the IO, so it starts
                     * out allowed
                     */
                    io->ci_allow_unaligned_dio = true;
            

            One
            and another part is:

            ll_direct_IO_impl(struct kiocb *iocb, struct iov_iter *iter, int rw)
            {
                    /* Unpatched older servers which cannot safely support unaligned DIO
                     * (osd-zfs) or i/o with page size interop issues should abort here
                     */                                                                    
                    if (unaligned && !cl_io_top(io)->ci_allow_unaligned_dio) {            
                            unsigned int md0_offset;
                       
                            if (cl_io_top(io)->ci_target_is_zfs)
                                    RETURN(-EINVAL);
                                                                               
                            /* unpatched ldiskfs is fine, unless MD0 does not align/fit */
                            md0_offset = file_offset & (MD_MAX_INTEROP_PAGE_SIZE - 1);
                            if ((count + md0_offset) >= LNET_MTU) {
                                    u64 iomax, iomin;
                                                                             
                                    iomax = cl_io_nob_aligned(file_offset, count,
                                                              MD_MAX_INTEROP_PAGE_SIZE);
                                    iomin = cl_io_nob_aligned(file_offset, count,
                                                              MD_MIN_INTEROP_PAGE_SIZE);
                                    if (iomax != iomin)
                                            RETURN(-EINVAL);
                            }                            
                    }
            

            Probably the easiest thing to do is change ll_direct_IO_impl() to always disable UDIO on unsupported servers:

                    /* Unpatched older servers which cannot safely support unaligned DIO
                     * (osd-zfs) or i/o with page size interop issues should abort here
                     */                                                                    
                    if (unaligned && !cl_io_top(io)->ci_allow_unaligned_dio)           
                            RETURN(-EINVAL);
            

            Slightly more forgiving (but also somewhat ugly) would be to change ll_io_init() (or somewhere) to do a version check for < 2.14.0 servers and disable UDIO in that case?

            adilger Andreas Dilger added a comment - I had a look at removing the client-side UNALIGNED_DIO flag override, but it isn't as obvious as I thought it might be. The flag itself is not set on connections to servers that "look OK", but rather it is checked on a per-write basis in various parts of the code, and it isn't clear to me where the right place is to turn it off. It looks like part of the important code is: void ll_io_init(struct cl_io *io, struct file *file, enum cl_io_type iot, struct vvp_io_args *args) { : io->ci_target_is_zfs = 0; /* unaligned DIO has compat issues with some older servers, but we find * out if there are such servers while setting up the IO, so it starts * out allowed */ io->ci_allow_unaligned_dio = true ; One and another part is: ll_direct_IO_impl(struct kiocb *iocb, struct iov_iter *iter, int rw) { /* Unpatched older servers which cannot safely support unaligned DIO * (osd-zfs) or i/o with page size interop issues should abort here */ if (unaligned && !cl_io_top(io)->ci_allow_unaligned_dio) { unsigned int md0_offset; if (cl_io_top(io)->ci_target_is_zfs) RETURN(-EINVAL); /* unpatched ldiskfs is fine, unless MD0 does not align/fit */ md0_offset = file_offset & (MD_MAX_INTEROP_PAGE_SIZE - 1); if ((count + md0_offset) >= LNET_MTU) { u64 iomax, iomin; iomax = cl_io_nob_aligned(file_offset, count, MD_MAX_INTEROP_PAGE_SIZE); iomin = cl_io_nob_aligned(file_offset, count, MD_MIN_INTEROP_PAGE_SIZE); if (iomax != iomin) RETURN(-EINVAL); } } Probably the easiest thing to do is change ll_direct_IO_impl() to always disable UDIO on unsupported servers: /* Unpatched older servers which cannot safely support unaligned DIO * (osd-zfs) or i/o with page size interop issues should abort here */ if (unaligned && !cl_io_top(io)->ci_allow_unaligned_dio) RETURN(-EINVAL); Slightly more forgiving (but also somewhat ugly) would be to change ll_io_init() (or somewhere) to do a version check for < 2.14.0 servers and disable UDIO in that case?

            I would suggest that we remove the client-side UNALIGNED_DIO override.

            In the meantime I will try to understand why the override logic fails here.

            stancheff Shaun Tancheff added a comment - I would suggest that we remove the client-side UNALIGNED_DIO override. In the meantime I will try to understand why the override logic fails here.

            stancheff and/or paf could you please comment. Should we remove the client-side UNALIGNED_DIO override on the client, and only allow UDIO on servers that actually report support for that feature?

            adilger Andreas Dilger added a comment - stancheff and/or paf could you please comment. Should we remove the client-side UNALIGNED_DIO override on the client, and only allow UDIO on servers that actually report support for that feature?

            It looks like UDIO with 1048575-byte (1MiB -1) writes is causing 2.12.9 servers to crash.

            It isn't clear yet whether this will be elevated to a blocker for the 2.16 release, but it definitely isn't great that regular user IO from a 2.16 client could cause the OSS to crash.

            While there is an OBD_CONNECT2_UNALIGNED_DIO feature flag in place, the client is overriding this flag locally to allow UDIO with what it thinks is OK (4KiB page servers & clients and ldiskfs). Perhaps this client-side override of the UDIO feature flag should be removed, and we just depend on 2.16/2.15 servers exposing this flag if they can handle the UDIO.

            adilger Andreas Dilger added a comment - It looks like UDIO with 1048575-byte (1MiB -1) writes is causing 2.12.9 servers to crash. It isn't clear yet whether this will be elevated to a blocker for the 2.16 release, but it definitely isn't great that regular user IO from a 2.16 client could cause the OSS to crash. While there is an OBD_CONNECT2_UNALIGNED_DIO feature flag in place, the client is overriding this flag locally to allow UDIO with what it thinks is OK (4KiB page servers & clients and ldiskfs). Perhaps this client-side override of the UDIO feature flag should be removed, and we just depend on 2.16/2.15 servers exposing this flag if they can handle the UDIO.

            People

              stancheff Shaun Tancheff
              maloo Maloo
              Votes:
              0 Vote for this issue
              Watchers:
              7 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: