Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-10068

OST fails to mount:LustreError: 14558:0:(pack_generic.c:588:__lustre_unpack_msg()) message length 0 too small for magic/version check

Details

    • Bug
    • Resolution: Duplicate
    • Major
    • None
    • Lustre 2.11.0
    • Soak cluster - latest lustre-master build (3650) version=2.10.53_32_g20ffe21
    • 3
    • 9223372036854775807

    Description

      Reformatted, created complete new filesystem.
      MDS/MGS mounts.
      OSS mount fails
      OSS errors

      [  584.585335] Lustre: 13317:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1507069853/real 1507069853]  req@ffff88083f758300 x1580277278179360/t0(0) o253->MGC192.168.1.108@o2ib@192.168.1.108@o2ib:26/25 lens 4768/4768 e 0 to 1 dl 1507069860 ref 2 fl Rpc:eX/0/ffffffff rc 0/-1
      [  584.680525] LustreError: 166-1: MGC192.168.1.108@o2ib: Connection to MGS (at 192.168.1.108@o2ib) was lost; in progress operations using this service will fail
      [  584.727353] LustreError: 15f-b: soaked-OST0000: cannot register this server with the MGS: rc = -5. Is the MGS running?
      [  584.755152] Lustre: MGC192.168.1.108@o2ib: Connection restored to MGC192.168.1.108@o2ib_0 (at 192.168.1.108@o2ib)
      [  584.796681] LustreError: 13317:0:(obd_mount_server.c:1863:server_fill_super()) Unable to start targets: -5
      [  584.828634] LustreError: 13317:0:(obd_mount_server.c:1573:server_put_super()) no obd soaked-OST0000
      [  584.858501] LustreError: 13317:0:(obd_mount_server.c:132:server_deregister_mount()) soaked-OST0000 not registered
      [  585.118058] Lustre: server umount soaked-OST0000 complete
      [  585.135868] LustreError: 13317:0:(obd_mount.c:1504:lustre_fill_super()) Unable to mount  (-5)
      

      Errors on MDS

      [17524.112385] Lustre: soaked-MDT0000: new disk, initializing
      [17524.138680] Lustre: soaked-MDT0000: Imperative Recovery not enabled, recovery window 300-900
      [17524.153739] Lustre: ctl-soaked-MDT0000: super-sequence allocation rc = 0 [0x0000000200000400-0x0000000240000400]:0:mdt
      [17528.212947] Lustre: MGS: Connection restored to aa8414d2-f089-bf2e-b8c3-406370f048cc (at 192.168.1.102@o2ib)
      [17528.273221] LustreError: 12811:0:(events.c:304:request_in_callback()) event type 2, status -103, service mgs
      [17528.287356] LustreError: 14558:0:(pack_generic.c:588:__lustre_unpack_msg()) message length 0 too small for magic/version check
      [17528.305726] LustreError: 14558:0:(sec.c:2069:sptlrpc_svc_unwrap_request()) error unpacking request from 12345-192.168.1.102@o2ib x1580277278179360
      [17528.418152] Lustre: MGS: Received new LWP connection from 192.168.1.102@o2ib, removing former export from same NID
      

      Not sure where to go from here.

      Attachments

        Issue Links

          Activity

            [LU-10068] OST fails to mount:LustreError: 14558:0:(pack_generic.c:588:__lustre_unpack_msg()) message length 0 too small for magic/version check
            jhammond John Hammond added a comment -

            Fails on OPA as well.

            jhammond John Hammond added a comment - Fails on OPA as well.

            so on MLX5 I'm able to mount, but I'm getting the following error:

            [ 1157.516112] LNetError: 34447:0:(o2iblnd.c:1940:kiblnd_fmr_pool_map()) Failed to map mr 10/11 elements
            [ 1157.516122] LNetError: 34528:0:(o2iblnd_cb.c:560:kiblnd_fmr_map_tx()) Can't map 41033 pages: -22
            

            That's again due to the fastreg code.

            ashehata Amir Shehata (Inactive) added a comment - so on MLX5 I'm able to mount, but I'm getting the following error: [ 1157.516112] LNetError: 34447:0:(o2iblnd.c:1940:kiblnd_fmr_pool_map()) Failed to map mr 10/11 elements [ 1157.516122] LNetError: 34528:0:(o2iblnd_cb.c:560:kiblnd_fmr_map_tx()) Can't map 41033 pages: -22 That's again due to the fastreg code.

            James Nunez (james.a.nunez@intel.com) uploaded a new patch: https://review.whamcloud.com/29341
            Subject: LU-10068 lnet: Revert prefer Fast Reg
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 004295627f99c906f360acc0624285b4631782a1

            gerrit Gerrit Updater added a comment - James Nunez (james.a.nunez@intel.com) uploaded a new patch: https://review.whamcloud.com/29341 Subject: LU-10068 lnet: Revert prefer Fast Reg Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: 004295627f99c906f360acc0624285b4631782a1

            Just to be double-certain,this morning I re-loaded 3637 (last good build) did a complete re-format to obtain a clean FS, verified that everything mounted. Then, I installed build 3638, power-cycled all systems. Verified that LNET was up on all systems, and that all systems could lctl ping the MGS/MDS (soak-8) - attempted to mount the filesystem. Initial MGS/MDS mount fails.

            Oct  5 18:16:28 soak-8 kernel: ZFS: Loaded module v0.7.1-1, ZFS pool version 5000, ZFS filesystem version 5
            Oct  5 18:16:29 soak-8 kernel: LDISKFS-fs (dm-1): mounted filesystem with ordered data mode. Opts: user_xattr,errors=remount-ro,user_xattr,no_mbcache,nodelalloc
            Oct  5 18:16:30 soak-8 kernel: Lustre: MGS: Connection restored to 0e7fadb0-8a70-1717-c708-3785eb9ea4ec (at 192.168.1.108@o2ib)
            Oct  5 18:16:31 soak-8 kernel: Lustre: 2365:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1507227390/real 1507227390]  req@ffff8808142d0000 x1580442241204256/t0(0) o253->MGC192.168.1.108@o2ib@192.168.1.108@o2ib:26/25 lens 4768/4768 e 0 to 1 dl 1507227397 ref 2 fl Rpc:eX/0/ffffffff rc 0/-1
            Oct  5 18:16:31 soak-8 kernel: LustreError: 2216:0:(events.c:304:request_in_callback()) event type 2, status -103, service mgs
            Oct  5 18:16:31 soak-8 kernel: LustreError: 2417:0:(pack_generic.c:588:__lustre_unpack_msg()) message length 0 too small for magic/version check
            Oct  5 18:16:31 soak-8 kernel: LustreError: 2417:0:(sec.c:2069:sptlrpc_svc_unwrap_request()) error unpacking request from 12345-192.168.1.108@o2ib x1580442241204256
            Oct  5 18:16:31 soak-8 kernel: LustreError: 166-1: MGC192.168.1.108@o2ib: Connection to MGS (at 192.168.1.108@o2ib) was lost; in progress operations using this service will fail
            Oct  5 18:16:31 soak-8 kernel: Lustre: MGS: Received new LWP connection from 192.168.1.108@o2ib, removing former export from same NID
            Oct  5 18:16:31 soak-8 kernel: Lustre: MGS: Connection restored to 0e7fadb0-8a70-1717-c708-3785eb9ea4ec (at 192.168.1.108@o2ib)
            Oct  5 18:16:38 soak-8 kernel: Lustre: 2365:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1507227391/real 1507227391]  req@ffff88080ef08000 x1580442241204384/t0(0) o503->MGC192.168.1.108@o2ib@192.168.1.108@o2ib:26/25 lens 272/8416 e 0 to 1 dl 1507227398 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
            Oct  5 18:16:38 soak-8 kernel: LustreError: 166-1: MGC192.168.1.108@o2ib: Connection to MGS (at 192.168.1.108@o2ib) was lost; in progress operations using this service will fail
            Oct  5 18:16:38 soak-8 kernel: LustreError: 15c-8: MGC192.168.1.108@o2ib: The configuration from log 'soaked-MDT0000' failed (-5). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
            Oct  5 18:16:39 soak-8 kernel: Lustre: MGS: Received new LWP connection from 192.168.1.108@o2ib, removing former export from same NID
            Oct  5 18:16:39 soak-8 kernel: Lustre: MGS: Connection restored to 0e7fadb0-8a70-1717-c708-3785eb9ea4ec (at 192.168.1.108@o2ib)
            Oct  5 18:16:39 soak-8 kernel: Lustre: Skipped 1 previous similar message
            Oct  5 18:16:39 soak-8 kernel: LustreError: 2365:0:(obd_mount_server.c:1370:server_start_targets()) failed to start server soaked-MDT0000: -5
            Oct  5 18:16:39 soak-8 kernel: LustreError: 2365:0:(obd_mount_server.c:1863:server_fill_super()) Unable to start targets: -5
            Oct  5 18:16:39 soak-8 kernel: LustreError: 2365:0:(obd_mount_server.c:1573:server_put_super()) no obd soaked-MDT0000
            Oct  5 18:16:46 soak-8 kernel: Lustre: 2365:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1507227399/real 1507227399]  req@ffff88080ef08000 x1580442241204416/t0(0) o251->MGC192.168.1.108@o2ib@192.168.1.108@o2ib:26/25 lens 224/224 e 0 to 1 dl 1507227405 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
            Oct  5 18:16:46 soak-8 kernel: Lustre: server umount soaked-MDT0000 complete
            Oct  5 18:16:46 soak-8 kernel: LustreError: 2365:0:(obd_mount.c:1504:lustre_fill_super()) Unable to mount  (-5)
            Oct  5 18:16:46 soak-8 sshd[2342]: Received disconnect from 192.168.1.116 port 35786:11: disconnected by user
            Oct  5 18:16:46 soak-8 sshd[2342]: Disconnected from 192.168.1.116 port 35786
            
            cliffw Cliff White (Inactive) added a comment - Just to be double-certain,this morning I re-loaded 3637 (last good build) did a complete re-format to obtain a clean FS, verified that everything mounted. Then, I installed build 3638, power-cycled all systems. Verified that LNET was up on all systems, and that all systems could lctl ping the MGS/MDS (soak-8) - attempted to mount the filesystem. Initial MGS/MDS mount fails. Oct 5 18:16:28 soak-8 kernel: ZFS: Loaded module v0.7.1-1, ZFS pool version 5000, ZFS filesystem version 5 Oct 5 18:16:29 soak-8 kernel: LDISKFS-fs (dm-1): mounted filesystem with ordered data mode. Opts: user_xattr,errors=remount-ro,user_xattr,no_mbcache,nodelalloc Oct 5 18:16:30 soak-8 kernel: Lustre: MGS: Connection restored to 0e7fadb0-8a70-1717-c708-3785eb9ea4ec (at 192.168.1.108@o2ib) Oct 5 18:16:31 soak-8 kernel: Lustre: 2365:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1507227390/real 1507227390] req@ffff8808142d0000 x1580442241204256/t0(0) o253->MGC192.168.1.108@o2ib@192.168.1.108@o2ib:26/25 lens 4768/4768 e 0 to 1 dl 1507227397 ref 2 fl Rpc:eX/0/ffffffff rc 0/-1 Oct 5 18:16:31 soak-8 kernel: LustreError: 2216:0:(events.c:304:request_in_callback()) event type 2, status -103, service mgs Oct 5 18:16:31 soak-8 kernel: LustreError: 2417:0:(pack_generic.c:588:__lustre_unpack_msg()) message length 0 too small for magic/version check Oct 5 18:16:31 soak-8 kernel: LustreError: 2417:0:(sec.c:2069:sptlrpc_svc_unwrap_request()) error unpacking request from 12345-192.168.1.108@o2ib x1580442241204256 Oct 5 18:16:31 soak-8 kernel: LustreError: 166-1: MGC192.168.1.108@o2ib: Connection to MGS (at 192.168.1.108@o2ib) was lost; in progress operations using this service will fail Oct 5 18:16:31 soak-8 kernel: Lustre: MGS: Received new LWP connection from 192.168.1.108@o2ib, removing former export from same NID Oct 5 18:16:31 soak-8 kernel: Lustre: MGS: Connection restored to 0e7fadb0-8a70-1717-c708-3785eb9ea4ec (at 192.168.1.108@o2ib) Oct 5 18:16:38 soak-8 kernel: Lustre: 2365:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1507227391/real 1507227391] req@ffff88080ef08000 x1580442241204384/t0(0) o503->MGC192.168.1.108@o2ib@192.168.1.108@o2ib:26/25 lens 272/8416 e 0 to 1 dl 1507227398 ref 2 fl Rpc:X/0/ffffffff rc 0/-1 Oct 5 18:16:38 soak-8 kernel: LustreError: 166-1: MGC192.168.1.108@o2ib: Connection to MGS (at 192.168.1.108@o2ib) was lost; in progress operations using this service will fail Oct 5 18:16:38 soak-8 kernel: LustreError: 15c-8: MGC192.168.1.108@o2ib: The configuration from log 'soaked-MDT0000' failed (-5). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information. Oct 5 18:16:39 soak-8 kernel: Lustre: MGS: Received new LWP connection from 192.168.1.108@o2ib, removing former export from same NID Oct 5 18:16:39 soak-8 kernel: Lustre: MGS: Connection restored to 0e7fadb0-8a70-1717-c708-3785eb9ea4ec (at 192.168.1.108@o2ib) Oct 5 18:16:39 soak-8 kernel: Lustre: Skipped 1 previous similar message Oct 5 18:16:39 soak-8 kernel: LustreError: 2365:0:(obd_mount_server.c:1370:server_start_targets()) failed to start server soaked-MDT0000: -5 Oct 5 18:16:39 soak-8 kernel: LustreError: 2365:0:(obd_mount_server.c:1863:server_fill_super()) Unable to start targets: -5 Oct 5 18:16:39 soak-8 kernel: LustreError: 2365:0:(obd_mount_server.c:1573:server_put_super()) no obd soaked-MDT0000 Oct 5 18:16:46 soak-8 kernel: Lustre: 2365:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1507227399/real 1507227399] req@ffff88080ef08000 x1580442241204416/t0(0) o251->MGC192.168.1.108@o2ib@192.168.1.108@o2ib:26/25 lens 224/224 e 0 to 1 dl 1507227405 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1 Oct 5 18:16:46 soak-8 kernel: Lustre: server umount soaked-MDT0000 complete Oct 5 18:16:46 soak-8 kernel: LustreError: 2365:0:(obd_mount.c:1504:lustre_fill_super()) Unable to mount (-5) Oct 5 18:16:46 soak-8 sshd[2342]: Received disconnect from 192.168.1.116 port 35786:11: disconnected by user Oct 5 18:16:46 soak-8 sshd[2342]: Disconnected from 192.168.1.116 port 35786

            We sure do.

            /scratch/logs/syslog/soak-8.log:Oct  4 23:19:28 soak-8 kernel: LNet: Using FastReg for registration
            /scratch/logs/syslog/soak-8.log:Oct  5 00:06:13 soak-8 kernel: LNet: Using FastReg for registration
            /scratch/logs/syslog/soak-8.log:Oct  5 00:30:19 soak-8 kernel: LNet: Using FastReg for registration
            /scratch/logs/syslog/soak-8.log:Oct  5 18:12:54 soak-8 kernel: LNet: Using FastReg for registration
            /scratch/logs/syslog/soak-9.log:Oct  3 18:03:02 soak-9 kernel: LNet: Using FastReg for registration
            /scratch/logs/syslog/soak-9.log:Oct  4 15:55:02 soak-9 kernel: LNet: Using FastReg for registration
            /scratch/logs/syslog/soak-9.log:Oct  4 18:36:43 soak-9 kernel: LNet: Using FastReg for registration
            /scratch/logs/syslog/soak-9.log:Oct  4 23:19:30 soak-9 kernel: LNet: Using FastReg for registration
            /scratch/logs/syslog/soak-9.log:Oct  5 00:30:19 soak-9 kernel: LNet: Using FastReg for registration
            /scratch/logs/syslog/soak-9.log:Oct  5 18:12:55 soak-9 kernel: LNet: Using FastReg for registration
            
            cliffw Cliff White (Inactive) added a comment - We sure do. /scratch/logs/syslog/soak-8.log:Oct 4 23:19:28 soak-8 kernel: LNet: Using FastReg for registration /scratch/logs/syslog/soak-8.log:Oct 5 00:06:13 soak-8 kernel: LNet: Using FastReg for registration /scratch/logs/syslog/soak-8.log:Oct 5 00:30:19 soak-8 kernel: LNet: Using FastReg for registration /scratch/logs/syslog/soak-8.log:Oct 5 18:12:54 soak-8 kernel: LNet: Using FastReg for registration /scratch/logs/syslog/soak-9.log:Oct 3 18:03:02 soak-9 kernel: LNet: Using FastReg for registration /scratch/logs/syslog/soak-9.log:Oct 4 15:55:02 soak-9 kernel: LNet: Using FastReg for registration /scratch/logs/syslog/soak-9.log:Oct 4 18:36:43 soak-9 kernel: LNet: Using FastReg for registration /scratch/logs/syslog/soak-9.log:Oct 4 23:19:30 soak-9 kernel: LNet: Using FastReg for registration /scratch/logs/syslog/soak-9.log:Oct 5 00:30:19 soak-9 kernel: LNet: Using FastReg for registration /scratch/logs/syslog/soak-9.log:Oct 5 18:12:55 soak-9 kernel: LNet: Using FastReg for registration

            Do we see this on startup?

            LCONSOLE_INFO("Using FastReg for registration\n");
            
            ashehata Amir Shehata (Inactive) added a comment - Do we see this on startup? LCONSOLE_INFO( "Using FastReg for registration\n" );

            That's true. OPA supports FMR in software but not FastReg.

            Was this run on OPA?

            ashehata Amir Shehata (Inactive) added a comment - That's true. OPA supports FMR in software but not FastReg. Was this run on OPA?
            jhammond John Hammond added a comment -

            It's likely that LU-9810 is broken for OPA. Since I seem to recall that OPA supports FMR but not FastReg. Amir?

            jhammond John Hammond added a comment - It's likely that LU-9810 is broken for OPA. Since I seem to recall that OPA supports FMR but not FastReg. Amir?

            Based on the above patch descriptions, my first guess would be LU-7001.

            Alternately, if this is only being seen on IB then LU-9810 seems likely.

            adilger Andreas Dilger added a comment - Based on the above patch descriptions, my first guess would be LU-7001 . Alternately, if this is only being seen on IB then LU-9810 seems likely.

            Having some problems also with lustre-review builds, so may also be soak, will re-load 3637 tomorrow to confirm.

            cliffw Cliff White (Inactive) added a comment - Having some problems also with lustre-review builds, so may also be soak, will re-load 3637 tomorrow to confirm.

            Between builds 3637 and 3638, the following patches landed:
            LU-9907 build: add patchless server for lbuild — oleg.drokin / gitweb
            LU-7988 hsm: update many cookie status at once — oleg.drokin / gitweb
            LU-7001 osp: fix llog processing — oleg.drokin / gitweb
            LU-9810 lnet: fix build with M-OFED 4.1 — oleg.drokin / gitweb
            LU-9810 lnet: prefer Fast Reg — oleg.drokin / gitweb
            LU-9260 test: Use the correct mount device when test against lustre — oleg.drokin / gitweb
            LU-9941 lov: lsm_is_composite isn't right — oleg.drokin / gitweb
            LU-8342 utils: Set dnodesize/recordsize at zfs dataset create — oleg.drokin / gitweb
            LU-9044 test: remove conf-sanity tests from ALWAYS_EXCEPT — oleg.drokin / gitweb
            LU-9870 build: handle SNMP missing on build box — oleg.drokin / gitweb
            LU-9891 tests: Increase space not released for ZFS — oleg.drokin / gitweb
            LU-7746 tests: skip tests for older (upstream) client — oleg.drokin / gitweb
            LU-9921 lnet: resolve unsafe list access — oleg.drokin / gitweb
            LU-9950 build: add support for Ubuntu(debian) arm64 — oleg.drokin / gitweb

            jamesanunez James Nunez (Inactive) added a comment - Between builds 3637 and 3638, the following patches landed: LU-9907 build: add patchless server for lbuild — oleg.drokin / gitweb LU-7988 hsm: update many cookie status at once — oleg.drokin / gitweb LU-7001 osp: fix llog processing — oleg.drokin / gitweb LU-9810 lnet: fix build with M-OFED 4.1 — oleg.drokin / gitweb LU-9810 lnet: prefer Fast Reg — oleg.drokin / gitweb LU-9260 test: Use the correct mount device when test against lustre — oleg.drokin / gitweb LU-9941 lov: lsm_is_composite isn't right — oleg.drokin / gitweb LU-8342 utils: Set dnodesize/recordsize at zfs dataset create — oleg.drokin / gitweb LU-9044 test: remove conf-sanity tests from ALWAYS_EXCEPT — oleg.drokin / gitweb LU-9870 build: handle SNMP missing on build box — oleg.drokin / gitweb LU-9891 tests: Increase space not released for ZFS — oleg.drokin / gitweb LU-7746 tests: skip tests for older (upstream) client — oleg.drokin / gitweb LU-9921 lnet: resolve unsafe list access — oleg.drokin / gitweb LU-9950 build: add support for Ubuntu(debian) arm64 — oleg.drokin / gitweb

            People

              ashehata Amir Shehata (Inactive)
              cliffw Cliff White (Inactive)
              Votes:
              0 Vote for this issue
              Watchers:
              9 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: