Lustre / LU-16907

sanity test_123f: crashed MDS with Max IOV exceeded: 257 should be < 256

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.17.0
    • None
    • None
    • 3

    Description

      This issue was created by maloo for Andreas Dilger <adilger@whamcloud.com>

      This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/85009820-d429-41a1-8948-90b2f66d7f02

      test_123f failed with the following error:

      onyx-37vm9 crashed during sanity test_123f
      
      LNetError: 191018:0:(socklnd_cb.c:1036:ksocknal_send()) ASSERTION( tx->tx_nkiov <= 256 ) failed: 
      LNetError: 191018:0:(socklnd_cb.c:1036:ksocknal_send()) LBUG
      Pid: 191018, comm: mdt_out00_001 4.18.0-425.10.1.el8_lustre.x86_64 #1 SMP Wed May 3 16:22:26 UTC 2023
      Call Trace TBD:
       libcfs_call_trace+0x6f/0xa0 [libcfs]
       lbug_with_loc+0x3f/0x70 [libcfs]
       ksocknal_send+0x27a/0x320 [ksocklnd]
       lnet_ni_send+0x4c/0xe0 [lnet]
       lnet_send+0xae/0x1e0 [lnet]
       LNetPut+0x318/0x940 [lnet]
       ptl_send_buf+0x208/0x5a0 [ptlrpc]
       ptlrpc_send_reply+0x2ad/0x8d0 [ptlrpc]
       target_send_reply+0x328/0x7d0 [ptlrpc]
       tgt_request_handle+0xe85/0x1920 [ptlrpc]
       ptlrpc_server_handle_request+0x31d/0xbc0 [ptlrpc]
       ptlrpc_main+0xc52/0x1510
      

      Test session details:
      clients: https://build.whamcloud.com/job/lustre-reviews/94706 - 5.4.0-131-generic
      servers: https://build.whamcloud.com/job/lustre-reviews/94706 - 4.18.0-425.10.1.el8_lustre.x86_64

      This has been seen about 10 times since 2023-05-09, after patch https://review.whamcloud.com/46540 "LU-15550 ptlrpc: retry mechanism for overflowed batched RPCs" landed, but I'm not sure whether it is directly related.

      VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
      sanity test_123f - onyx-37vm9 crashed during sanity test_123f

          Activity

            pjones Peter Jones added a comment - Merged for 2.17

            gerrit Gerrit Updater added a comment -

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/56645/
            Subject: LU-16907 ptlrpc: correct the reply buffer size for batch RPC
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 8a7703eec9bb77a0dd85047a04910d30eb8843aa

            bfaccini-nvda Bruno Faccini added a comment - +1 on master: https://testing.whamcloud.com/test_sets/3e01b6f3-aff3-4369-8042-a11261bd006d

            gerrit Gerrit Updater added a comment -

            "Qian Yingjin <qian@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/56645
            Subject: LU-16907 ptlrpc: correct the reply buffer size for batch RPC
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 6702e06dfdcaec919672e49595f4c4a7687ace95

            adilger Andreas Dilger added a comment - - edited

            Also crashing the server during Janitor testing with the following stack:
            https://testing.whamcloud.com/gerrit-janitor/46254/testresults/sanity2-zfs-centos7_x86_64-centos7_x86_64/

            WARNING: CPU: 2 PID: 20861 at /home/green/git/lustre-release/lnet/lnet/lib-md.c:209 lnet_md_build.part.5+0x5b5/0x780 [lnet]
            Max IOV exceeded: 257 should be < 256
            
            CPU: 2 PID: 20861 Comm: mdt_out00_004 3.10.0-7.9-debug #1
            Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-1.fc38 04/01/2014
            Call Trace:
              dump_stack+0x19/0x1b
              __warn+0xd8/0x100
              warn_slowpath_fmt+0x5f/0x80
              lnet_md_build.part.5+0x5b5/0x780 [lnet]
              LNetMDBind+0x4e/0x370 [lnet]
              ptl_send_buf+0xd0/0x5b0 [ptlrpc]
              ptlrpc_send_reply+0x2f3/0x9d0 [ptlrpc]
              target_send_reply_msg+0x63/0x1e0 [ptlrpc]
              target_send_reply+0x31e/0x780 [ptlrpc]
              tgt_request_handle+0x36d/0x1a60 [ptlrpc]
              ptlrpc_server_handle_request+0x281/0xce0 [ptlrpc]
              ptlrpc_main+0xc7e/0x1690 [ptlrpc]
              kthread+0xe4/0xf0
              ret_from_fork_nospec_begin+0x7/0x21
            
            LNetError: 20861:0:(lib-md.c:394:LNetMDBind()) Invalid length: too big transfer size 1048608, 1048576 max
            LustreError: 20861:0:(niobuf.c:81:ptl_send_buf()) LNetMDBind failed: -22
            LustreError: 20861:0:(niobuf.c:82:ptl_send_buf()) ASSERTION( rc == -12 ) failed: 
            LustreError: 20861:0:(niobuf.c:82:ptl_send_buf()) LBUG
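
The numbers in the log above can be checked with a little arithmetic. The sketch below is only an illustration, not Lustre code: it assumes 4 KiB pages and that the reply buffer is split into page-sized fragments, and the helper `pages_needed` is hypothetical. Under those assumptions, a reply that is 32 bytes over the 1 MiB limit needs one more fragment than the 256 allowed, which is consistent with the "257" in the warning. It also decodes the two return codes: `LNetMDBind` returned -22, while the assertion in `ptl_send_buf()` expected -12.

```python
import errno

PAGE_SIZE = 4096        # assumption: 4 KiB pages, not stated in the logs
MAX_TRANSFER = 1048576  # "1048576 max" from the LNetMDBind() error message
REPLY_SIZE = 1048608    # "too big transfer size 1048608" from the same message


def pages_needed(nbytes, page_size=PAGE_SIZE):
    """Hypothetical helper: page-sized fragments needed to cover nbytes."""
    return (nbytes + page_size - 1) // page_size  # ceiling division


overshoot = REPLY_SIZE - MAX_TRANSFER   # bytes beyond the 1 MiB limit
max_frags = pages_needed(MAX_TRANSFER)  # fragments that fit within the limit
frags = pages_needed(REPLY_SIZE)        # fragments the oversized reply needs

print(f"overshoot: {overshoot} bytes")
print(f"fragments: {frags} (limit {max_frags})")

# Symbolic names for the return codes seen in the log (sign dropped).
print(f"rc -22 -> {errno.errorcode[22]}, rc -12 -> {errno.errorcode[12]}")
```

The small overshoot fits the subject of the merged patch, "correct the reply buffer size for batch RPC". The exact source of the extra 32 bytes is not stated in this ticket.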
            
            qian_wc Qian Yingjin added a comment - +1 on master: https://testing.whamcloud.com/test_sets/59a8515d-6312-47af-8266-870e897d0d20
            qian_wc Qian Yingjin added a comment - +1 on master: https://testing.whamcloud.com/test_sets/083e7577-ac60-4e89-9211-f3d5f4db6d66
            hornc Chris Horn added a comment - +1 on master https://testing.whamcloud.com/test_sets/5b1eb7c5-506c-4f17-8910-4dcff04e76ca
            paf Patrick Farrell (Inactive) added a comment - +1 on master https://testing.whamcloud.com/test_sets/29d39dca-1e35-430f-bdcb-44883a7940e2
            ssmirnov Serguei Smirnov added a comment - +1 on master: https://testing.whamcloud.com/test_sets/1be4ce5b-c46b-4585-8570-d05a20b76034

            People

              Assignee: qian_wc Qian Yingjin
              Reporter: maloo Maloo
              Votes: 0
              Watchers: 9
