Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-10068

OST fails to mount:LustreError: 14558:0:(pack_generic.c:588:__lustre_unpack_msg()) message length 0 too small for magic/version check

Details

    • Bug
    • Resolution: Duplicate
    • Major
    • None
    • Lustre 2.11.0
    • Soak cluster - latest lustre-master build (3650) version=2.10.53_32_g20ffe21
    • 3
    • 9223372036854775807

    Description

      Reformatted, created complete new filesystem.
      MDS/MGS mounts.
      OSS mount fails
      OSS errors

      [  584.585335] Lustre: 13317:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1507069853/real 1507069853]  req@ffff88083f758300 x1580277278179360/t0(0) o253->MGC192.168.1.108@o2ib@192.168.1.108@o2ib:26/25 lens 4768/4768 e 0 to 1 dl 1507069860 ref 2 fl Rpc:eX/0/ffffffff rc 0/-1
      [  584.680525] LustreError: 166-1: MGC192.168.1.108@o2ib: Connection to MGS (at 192.168.1.108@o2ib) was lost; in progress operations using this service will fail
      [  584.727353] LustreError: 15f-b: soaked-OST0000: cannot register this server with the MGS: rc = -5. Is the MGS running?
      [  584.755152] Lustre: MGC192.168.1.108@o2ib: Connection restored to MGC192.168.1.108@o2ib_0 (at 192.168.1.108@o2ib)
      [  584.796681] LustreError: 13317:0:(obd_mount_server.c:1863:server_fill_super()) Unable to start targets: -5
      [  584.828634] LustreError: 13317:0:(obd_mount_server.c:1573:server_put_super()) no obd soaked-OST0000
      [  584.858501] LustreError: 13317:0:(obd_mount_server.c:132:server_deregister_mount()) soaked-OST0000 not registered
      [  585.118058] Lustre: server umount soaked-OST0000 complete
      [  585.135868] LustreError: 13317:0:(obd_mount.c:1504:lustre_fill_super()) Unable to mount  (-5)
      

      Errors on MDS

      [17524.112385] Lustre: soaked-MDT0000: new disk, initializing
      [17524.138680] Lustre: soaked-MDT0000: Imperative Recovery not enabled, recovery window 300-900
      [17524.153739] Lustre: ctl-soaked-MDT0000: super-sequence allocation rc = 0 [0x0000000200000400-0x0000000240000400]:0:mdt
      [17528.212947] Lustre: MGS: Connection restored to aa8414d2-f089-bf2e-b8c3-406370f048cc (at 192.168.1.102@o2ib)
      [17528.273221] LustreError: 12811:0:(events.c:304:request_in_callback()) event type 2, status -103, service mgs
      [17528.287356] LustreError: 14558:0:(pack_generic.c:588:__lustre_unpack_msg()) message length 0 too small for magic/version check
      [17528.305726] LustreError: 14558:0:(sec.c:2069:sptlrpc_svc_unwrap_request()) error unpacking request from 12345-192.168.1.102@o2ib x1580277278179360
      [17528.418152] Lustre: MGS: Received new LWP connection from 192.168.1.102@o2ib, removing former export from same NID
      

      Not sure where to go from here.

      Attachments

        Issue Links

          Activity

            [LU-10068] OST fails to mount:LustreError: 14558:0:(pack_generic.c:588:__lustre_unpack_msg()) message length 0 too small for magic/version check

            Tested James patch reverting Fast Reg. https://review.whamcloud.com/29341. Soak is running okay, but router have hit LBUG (LU-10103)

            cliffw Cliff White (Inactive) added a comment - Tested James patch reverting Fast Reg. https://review.whamcloud.com/29341 . Soak is running okay, but router have hit LBUG ( LU-10103 )
            ashehata Amir Shehata (Inactive) added a comment - - edited

            there appears to be 3 separate issues:
            1. Fastreg is not supported on OPA so reverting LU-9810 works (as a side note, I also tried John's patch for LU-9983 and that resolves the problem we were seeing there).
            2. Fastreg broken on MLX-5. I've been debugging this problem yesterday, and we already have a few bugs that are all related. For this particular problem ib_map_mr_sg() is called to map 11 fragments but ends up mapping 10. I'm trying to look at the mlx5 driver code and understand why it stops before mapping all fragments. The fragments look like:

            length = 4096, offset = 0, addr = 105c43c000, page offset = 0, page_addr = 105c43c000
            length = 4096, offset = 0, addr = 84cfb0000, page offset = 0, page_addr = 84cfb0000
            length = 4096, offset = 0, addr = 84cfb1000, page offset = 0, page_addr = 84cfb1000
            length = 4096, offset = 0, addr = 84cfb2000, page offset = 0, page_addr = 84cfb2000
            length = 4096, offset = 0, addr = 84cfb3000, page offset = 0, page_addr = 84cfb3000
            length = 4096, offset = 0, addr = 84cfb4000, page offset = 0, page_addr = 84cfb4000
            length = 4096, offset = 0, addr = 84cfb5000, page offset = 0, page_addr = 84cfb5000
            length = 4096, offset = 0, addr = 84cfb6000, page offset = 0, page_addr = 84cfb6000
            length = 4096, offset = 0, addr = 84cfb7000, page offset = 0, page_addr = 84cfb7000
            length = 73, offset = 0, addr = 84cfb8000, page offset = 0, page_addr = 84cfb8000
            length = 4096, offset = 0, addr = 105c43d000, page offset = 0, page_addr = 105c43d000
            

            So it looks like it stops on the fragment of length 73.

            3. MLX-4 failure. I still need to investigate further, because it could be different than both of the above.

            ashehata Amir Shehata (Inactive) added a comment - - edited there appears to be 3 separate issues: 1. Fastreg is not supported on OPA so reverting LU-9810 works (as a side note, I also tried John's patch for LU-9983 and that resolves the problem we were seeing there). 2. Fastreg broken on MLX-5. I've been debugging this problem yesterday, and we already have a few bugs that are all related. For this particular problem ib_map_mr_sg() is called to map 11 fragments but ends up mapping 10. I'm trying to look at the mlx5 driver code and understand why it stops before mapping all fragments. The fragments look like: length = 4096, offset = 0, addr = 105c43c000, page offset = 0, page_addr = 105c43c000 length = 4096, offset = 0, addr = 84cfb0000, page offset = 0, page_addr = 84cfb0000 length = 4096, offset = 0, addr = 84cfb1000, page offset = 0, page_addr = 84cfb1000 length = 4096, offset = 0, addr = 84cfb2000, page offset = 0, page_addr = 84cfb2000 length = 4096, offset = 0, addr = 84cfb3000, page offset = 0, page_addr = 84cfb3000 length = 4096, offset = 0, addr = 84cfb4000, page offset = 0, page_addr = 84cfb4000 length = 4096, offset = 0, addr = 84cfb5000, page offset = 0, page_addr = 84cfb5000 length = 4096, offset = 0, addr = 84cfb6000, page offset = 0, page_addr = 84cfb6000 length = 4096, offset = 0, addr = 84cfb7000, page offset = 0, page_addr = 84cfb7000 length = 73, offset = 0, addr = 84cfb8000, page offset = 0, page_addr = 84cfb8000 length = 4096, offset = 0, addr = 105c43d000, page offset = 0, page_addr = 105c43d000 So it looks like it stops on the fragment of length 73. 3. MLX-4 failure. I still need to investigate further, because it could be different than both of the above.

            Trying https://review.whamcloud.com/29341 on soak - servers mount okay, some clients are having timeouts

            soak-20 login: [ 3363.772469] INFO: task mount.lustre:2191 blocked for more than 120 seconds.^M
            [ 3363.782165] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.^M
            [ 3363.792785] mount.lustre    D 0000000000000001     0  2191   2190 0x00000080^M
            [ 3363.802418]  ffff88103fad3940 0000000000000082 ffff881013fedee0 ffff88103fad3fd8^M
            [ 3363.812340]  ffff88103fad3fd8 ffff88103fad3fd8 ffff881013fedee0 ffff881013b87270^M
            [ 3363.822136]  7fffffffffffffff ffff881013b87268 ffff881013fedee0 0000000000000001^M
            [ 3363.831817] Call Trace:^M
            [ 3363.835841]  [<ffffffff816a94e9>] schedule+0x29/0x70^M
            [ 3363.842720]  [<ffffffff816a6ff9>] schedule_timeout+0x239/0x2c0^M
            [ 3363.850481]  [<ffffffff810c76f5>] ? sched_clock_cpu+0x85/0xc0^M
            [ 3363.858146]  [<ffffffff810c12c8>] ? check_preempt_curr+0x78/0xa0^M
            [ 3363.866025]  [<ffffffff810c1309>] ? ttwu_do_wakeup+0x19/0xd0^M
            [ 3363.873501]  [<ffffffff816a989d>] wait_for_completion+0xfd/0x140^M
            [ 3363.881287]  [<ffffffff810c4810>] ? wake_up_state+0x20/0x20^M
            [ 3363.888596]  [<ffffffffc09de6b4>] llog_process_or_fork+0x244/0x450 [obdclass]^M
            [ 3363.897603]  [<ffffffffc09de8d4>] llog_process+0x14/0x20 [obdclass]^M
            [ 3363.905633]  [<ffffffffc0a0ed65>] class_config_parse_llog+0x125/0x350 [obdclass]^M
            [ 3363.914874]  [<ffffffffc091c368>] mgc_process_cfg_log+0x788/0xc40 [mgc]^M
            [ 3363.923229]  [<ffffffffc091f1e3>] mgc_process_log+0x3d3/0x890 [mgc]^M
            [ 3363.931153]  [<ffffffffc0a17480>] ? class_config_dump_handler+0x7e0/0x7e0 [obdclass]^M
            [ 3363.940736]  [<ffffffffc091f8e8>] ? do_config_log_add+0x248/0x580 [mgc]^M
            [ 3363.949020]  [<ffffffffc09207c0>] mgc_process_config+0x890/0x13f0 [mgc]^M
            [ 3363.957326]  [<ffffffffc0a1ad7e>] lustre_process_log+0x2de/0xaf0 [obdclass]^M
            [ 3363.965969]  [<ffffffff816aba0e>] ? _raw_spin_unlock_bh+0x1e/0x20^M
            [ 3363.973647]  [<ffffffff8132318b>] ? fprop_local_init_percpu+0x1b/0x30^M
            [ 3363.981681]  [<ffffffffc0e20608>] ll_fill_super+0xaf8/0x1220 [lustre]^M
            [ 3363.989721]  [<ffffffffc0a20ce6>] lustre_fill_super+0x286/0x910 [obdclass]^M
            [ 3363.998214]  [<ffffffffc0a20a60>] ? lustre_common_put_super+0x270/0x270 [obdclass]^M
            [ 3364.007482]  [<ffffffff81204b0d>] mount_nodev+0x4d/0xb0^M
            [ 3364.014099]  [<ffffffffc0a18ba8>] lustre_mount+0x38/0x60 [obdclass]^M
            [ 3364.021874]  [<ffffffff81205599>] mount_fs+0x39/0x1b0^M
            [ 3364.028291]  [<ffffffff81222067>] vfs_kern_mount+0x67/0x110^M
            [ 3364.035288]  [<ffffffff81224573>] do_mount+0x233/0xaf0^M
            [ 3364.041786]  [<ffffffff8118760e>] ? __get_free_pages+0xe/0x40^M
            [ 3364.048971]  [<ffffffff812251b6>] SyS_mount+0x96/0xf0^M
            [ 3364.055364]  [<ffffffff816b5009>] system_call_fastpath+0x16/0x1b^M
            

            Re-trying clients after power-cycling does work

            cliffw Cliff White (Inactive) added a comment - Trying https://review.whamcloud.com/29341 on soak - servers mount okay, some clients are having timeouts soak-20 login: [ 3363.772469] INFO: task mount.lustre:2191 blocked for more than 120 seconds.^M [ 3363.782165] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.^M [ 3363.792785] mount.lustre D 0000000000000001 0 2191 2190 0x00000080^M [ 3363.802418] ffff88103fad3940 0000000000000082 ffff881013fedee0 ffff88103fad3fd8^M [ 3363.812340] ffff88103fad3fd8 ffff88103fad3fd8 ffff881013fedee0 ffff881013b87270^M [ 3363.822136] 7fffffffffffffff ffff881013b87268 ffff881013fedee0 0000000000000001^M [ 3363.831817] Call Trace:^M [ 3363.835841] [<ffffffff816a94e9>] schedule+0x29/0x70^M [ 3363.842720] [<ffffffff816a6ff9>] schedule_timeout+0x239/0x2c0^M [ 3363.850481] [<ffffffff810c76f5>] ? sched_clock_cpu+0x85/0xc0^M [ 3363.858146] [<ffffffff810c12c8>] ? check_preempt_curr+0x78/0xa0^M [ 3363.866025] [<ffffffff810c1309>] ? ttwu_do_wakeup+0x19/0xd0^M [ 3363.873501] [<ffffffff816a989d>] wait_for_completion+0xfd/0x140^M [ 3363.881287] [<ffffffff810c4810>] ? wake_up_state+0x20/0x20^M [ 3363.888596] [<ffffffffc09de6b4>] llog_process_or_fork+0x244/0x450 [obdclass]^M [ 3363.897603] [<ffffffffc09de8d4>] llog_process+0x14/0x20 [obdclass]^M [ 3363.905633] [<ffffffffc0a0ed65>] class_config_parse_llog+0x125/0x350 [obdclass]^M [ 3363.914874] [<ffffffffc091c368>] mgc_process_cfg_log+0x788/0xc40 [mgc]^M [ 3363.923229] [<ffffffffc091f1e3>] mgc_process_log+0x3d3/0x890 [mgc]^M [ 3363.931153] [<ffffffffc0a17480>] ? class_config_dump_handler+0x7e0/0x7e0 [obdclass]^M [ 3363.940736] [<ffffffffc091f8e8>] ? do_config_log_add+0x248/0x580 [mgc]^M [ 3363.949020] [<ffffffffc09207c0>] mgc_process_config+0x890/0x13f0 [mgc]^M [ 3363.957326] [<ffffffffc0a1ad7e>] lustre_process_log+0x2de/0xaf0 [obdclass]^M [ 3363.965969] [<ffffffff816aba0e>] ? _raw_spin_unlock_bh+0x1e/0x20^M [ 3363.973647] [<ffffffff8132318b>] ? fprop_local_init_percpu+0x1b/0x30^M [ 3363.981681] [<ffffffffc0e20608>] ll_fill_super+0xaf8/0x1220 [lustre]^M [ 3363.989721] [<ffffffffc0a20ce6>] lustre_fill_super+0x286/0x910 [obdclass]^M [ 3363.998214] [<ffffffffc0a20a60>] ? lustre_common_put_super+0x270/0x270 [obdclass]^M [ 3364.007482] [<ffffffff81204b0d>] mount_nodev+0x4d/0xb0^M [ 3364.014099] [<ffffffffc0a18ba8>] lustre_mount+0x38/0x60 [obdclass]^M [ 3364.021874] [<ffffffff81205599>] mount_fs+0x39/0x1b0^M [ 3364.028291] [<ffffffff81222067>] vfs_kern_mount+0x67/0x110^M [ 3364.035288] [<ffffffff81224573>] do_mount+0x233/0xaf0^M [ 3364.041786] [<ffffffff8118760e>] ? __get_free_pages+0xe/0x40^M [ 3364.048971] [<ffffffff812251b6>] SyS_mount+0x96/0xf0^M [ 3364.055364] [<ffffffff816b5009>] system_call_fastpath+0x16/0x1b^M Re-trying clients after power-cycling does work

            I tried the patch for LU-9983 to see it would address this but it didn't

            simmonsja James A Simmons added a comment - I tried the patch for LU-9983 to see it would address this but it didn't
            jhammond John Hammond added a comment -

            Fails on OPA as well.

            jhammond John Hammond added a comment - Fails on OPA as well.

            so on MLX5 I'm able to mount, but I'm getting the following error:

            [ 1157.516112] LNetError: 34447:0:(o2iblnd.c:1940:kiblnd_fmr_pool_map()) Failed to map mr 10/11 elements
            [ 1157.516122] LNetError: 34528:0:(o2iblnd_cb.c:560:kiblnd_fmr_map_tx()) Can't map 41033 pages: -22
            

            That's again due to the fastreg code.

            ashehata Amir Shehata (Inactive) added a comment - so on MLX5 I'm able to mount, but I'm getting the following error: [ 1157.516112] LNetError: 34447:0:(o2iblnd.c:1940:kiblnd_fmr_pool_map()) Failed to map mr 10/11 elements [ 1157.516122] LNetError: 34528:0:(o2iblnd_cb.c:560:kiblnd_fmr_map_tx()) Can't map 41033 pages: -22 That's again due to the fastreg code.

            James Nunez (james.a.nunez@intel.com) uploaded a new patch: https://review.whamcloud.com/29341
            Subject: LU-10068 lnet: Revert prefer Fast Reg
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 004295627f99c906f360acc0624285b4631782a1

            gerrit Gerrit Updater added a comment - James Nunez (james.a.nunez@intel.com) uploaded a new patch: https://review.whamcloud.com/29341 Subject: LU-10068 lnet: Revert prefer Fast Reg Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: 004295627f99c906f360acc0624285b4631782a1

            Just to be double-certain,this morning I re-loaded 3637 (last good build) did a complete re-format to obtain a clean FS, verified that everything mounted. Then, I installed build 3638, power-cycled all systems. Verified that LNET was up on all systems, and that all systems could lctl ping the MGS/MDS (soak-8) - attempted to mount the filesystem. Initial MGS/MDS mount fails.

            Oct  5 18:16:28 soak-8 kernel: ZFS: Loaded module v0.7.1-1, ZFS pool version 5000, ZFS filesystem version 5
            Oct  5 18:16:29 soak-8 kernel: LDISKFS-fs (dm-1): mounted filesystem with ordered data mode. Opts: user_xattr,errors=remount-ro,user_xattr,no_mbcache,nodelalloc
            Oct  5 18:16:30 soak-8 kernel: Lustre: MGS: Connection restored to 0e7fadb0-8a70-1717-c708-3785eb9ea4ec (at 192.168.1.108@o2ib)
            Oct  5 18:16:31 soak-8 kernel: Lustre: 2365:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1507227390/real 1507227390]  req@ffff8808142d0000 x1580442241204256/t0(0) o253->MGC192.168.1.108@o2ib@192.168.1.108@o2ib:26/25 lens 4768/4768 e 0 to 1 dl 1507227397 ref 2 fl Rpc:eX/0/ffffffff rc 0/-1
            Oct  5 18:16:31 soak-8 kernel: LustreError: 2216:0:(events.c:304:request_in_callback()) event type 2, status -103, service mgs
            Oct  5 18:16:31 soak-8 kernel: LustreError: 2417:0:(pack_generic.c:588:__lustre_unpack_msg()) message length 0 too small for magic/version check
            Oct  5 18:16:31 soak-8 kernel: LustreError: 2417:0:(sec.c:2069:sptlrpc_svc_unwrap_request()) error unpacking request from 12345-192.168.1.108@o2ib x1580442241204256
            Oct  5 18:16:31 soak-8 kernel: LustreError: 166-1: MGC192.168.1.108@o2ib: Connection to MGS (at 192.168.1.108@o2ib) was lost; in progress operations using this service will fail
            Oct  5 18:16:31 soak-8 kernel: Lustre: MGS: Received new LWP connection from 192.168.1.108@o2ib, removing former export from same NID
            Oct  5 18:16:31 soak-8 kernel: Lustre: MGS: Connection restored to 0e7fadb0-8a70-1717-c708-3785eb9ea4ec (at 192.168.1.108@o2ib)
            Oct  5 18:16:38 soak-8 kernel: Lustre: 2365:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1507227391/real 1507227391]  req@ffff88080ef08000 x1580442241204384/t0(0) o503->MGC192.168.1.108@o2ib@192.168.1.108@o2ib:26/25 lens 272/8416 e 0 to 1 dl 1507227398 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
            Oct  5 18:16:38 soak-8 kernel: LustreError: 166-1: MGC192.168.1.108@o2ib: Connection to MGS (at 192.168.1.108@o2ib) was lost; in progress operations using this service will fail
            Oct  5 18:16:38 soak-8 kernel: LustreError: 15c-8: MGC192.168.1.108@o2ib: The configuration from log 'soaked-MDT0000' failed (-5). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
            Oct  5 18:16:39 soak-8 kernel: Lustre: MGS: Received new LWP connection from 192.168.1.108@o2ib, removing former export from same NID
            Oct  5 18:16:39 soak-8 kernel: Lustre: MGS: Connection restored to 0e7fadb0-8a70-1717-c708-3785eb9ea4ec (at 192.168.1.108@o2ib)
            Oct  5 18:16:39 soak-8 kernel: Lustre: Skipped 1 previous similar message
            Oct  5 18:16:39 soak-8 kernel: LustreError: 2365:0:(obd_mount_server.c:1370:server_start_targets()) failed to start server soaked-MDT0000: -5
            Oct  5 18:16:39 soak-8 kernel: LustreError: 2365:0:(obd_mount_server.c:1863:server_fill_super()) Unable to start targets: -5
            Oct  5 18:16:39 soak-8 kernel: LustreError: 2365:0:(obd_mount_server.c:1573:server_put_super()) no obd soaked-MDT0000
            Oct  5 18:16:46 soak-8 kernel: Lustre: 2365:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1507227399/real 1507227399]  req@ffff88080ef08000 x1580442241204416/t0(0) o251->MGC192.168.1.108@o2ib@192.168.1.108@o2ib:26/25 lens 224/224 e 0 to 1 dl 1507227405 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
            Oct  5 18:16:46 soak-8 kernel: Lustre: server umount soaked-MDT0000 complete
            Oct  5 18:16:46 soak-8 kernel: LustreError: 2365:0:(obd_mount.c:1504:lustre_fill_super()) Unable to mount  (-5)
            Oct  5 18:16:46 soak-8 sshd[2342]: Received disconnect from 192.168.1.116 port 35786:11: disconnected by user
            Oct  5 18:16:46 soak-8 sshd[2342]: Disconnected from 192.168.1.116 port 35786
            
            cliffw Cliff White (Inactive) added a comment - Just to be double-certain,this morning I re-loaded 3637 (last good build) did a complete re-format to obtain a clean FS, verified that everything mounted. Then, I installed build 3638, power-cycled all systems. Verified that LNET was up on all systems, and that all systems could lctl ping the MGS/MDS (soak-8) - attempted to mount the filesystem. Initial MGS/MDS mount fails. Oct 5 18:16:28 soak-8 kernel: ZFS: Loaded module v0.7.1-1, ZFS pool version 5000, ZFS filesystem version 5 Oct 5 18:16:29 soak-8 kernel: LDISKFS-fs (dm-1): mounted filesystem with ordered data mode. Opts: user_xattr,errors=remount-ro,user_xattr,no_mbcache,nodelalloc Oct 5 18:16:30 soak-8 kernel: Lustre: MGS: Connection restored to 0e7fadb0-8a70-1717-c708-3785eb9ea4ec (at 192.168.1.108@o2ib) Oct 5 18:16:31 soak-8 kernel: Lustre: 2365:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1507227390/real 1507227390] req@ffff8808142d0000 x1580442241204256/t0(0) o253->MGC192.168.1.108@o2ib@192.168.1.108@o2ib:26/25 lens 4768/4768 e 0 to 1 dl 1507227397 ref 2 fl Rpc:eX/0/ffffffff rc 0/-1 Oct 5 18:16:31 soak-8 kernel: LustreError: 2216:0:(events.c:304:request_in_callback()) event type 2, status -103, service mgs Oct 5 18:16:31 soak-8 kernel: LustreError: 2417:0:(pack_generic.c:588:__lustre_unpack_msg()) message length 0 too small for magic/version check Oct 5 18:16:31 soak-8 kernel: LustreError: 2417:0:(sec.c:2069:sptlrpc_svc_unwrap_request()) error unpacking request from 12345-192.168.1.108@o2ib x1580442241204256 Oct 5 18:16:31 soak-8 kernel: LustreError: 166-1: MGC192.168.1.108@o2ib: Connection to MGS (at 192.168.1.108@o2ib) was lost; in progress operations using this service will fail Oct 5 18:16:31 soak-8 kernel: Lustre: MGS: Received new LWP connection from 192.168.1.108@o2ib, removing former export from same NID Oct 5 18:16:31 soak-8 kernel: Lustre: MGS: Connection restored to 0e7fadb0-8a70-1717-c708-3785eb9ea4ec (at 192.168.1.108@o2ib) Oct 5 18:16:38 soak-8 kernel: Lustre: 2365:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1507227391/real 1507227391] req@ffff88080ef08000 x1580442241204384/t0(0) o503->MGC192.168.1.108@o2ib@192.168.1.108@o2ib:26/25 lens 272/8416 e 0 to 1 dl 1507227398 ref 2 fl Rpc:X/0/ffffffff rc 0/-1 Oct 5 18:16:38 soak-8 kernel: LustreError: 166-1: MGC192.168.1.108@o2ib: Connection to MGS (at 192.168.1.108@o2ib) was lost; in progress operations using this service will fail Oct 5 18:16:38 soak-8 kernel: LustreError: 15c-8: MGC192.168.1.108@o2ib: The configuration from log 'soaked-MDT0000' failed (-5). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information. Oct 5 18:16:39 soak-8 kernel: Lustre: MGS: Received new LWP connection from 192.168.1.108@o2ib, removing former export from same NID Oct 5 18:16:39 soak-8 kernel: Lustre: MGS: Connection restored to 0e7fadb0-8a70-1717-c708-3785eb9ea4ec (at 192.168.1.108@o2ib) Oct 5 18:16:39 soak-8 kernel: Lustre: Skipped 1 previous similar message Oct 5 18:16:39 soak-8 kernel: LustreError: 2365:0:(obd_mount_server.c:1370:server_start_targets()) failed to start server soaked-MDT0000: -5 Oct 5 18:16:39 soak-8 kernel: LustreError: 2365:0:(obd_mount_server.c:1863:server_fill_super()) Unable to start targets: -5 Oct 5 18:16:39 soak-8 kernel: LustreError: 2365:0:(obd_mount_server.c:1573:server_put_super()) no obd soaked-MDT0000 Oct 5 18:16:46 soak-8 kernel: Lustre: 2365:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1507227399/real 1507227399] req@ffff88080ef08000 x1580442241204416/t0(0) o251->MGC192.168.1.108@o2ib@192.168.1.108@o2ib:26/25 lens 224/224 e 0 to 1 dl 1507227405 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1 Oct 5 18:16:46 soak-8 kernel: Lustre: server umount soaked-MDT0000 complete Oct 5 18:16:46 soak-8 kernel: LustreError: 2365:0:(obd_mount.c:1504:lustre_fill_super()) Unable to mount (-5) Oct 5 18:16:46 soak-8 sshd[2342]: Received disconnect from 192.168.1.116 port 35786:11: disconnected by user Oct 5 18:16:46 soak-8 sshd[2342]: Disconnected from 192.168.1.116 port 35786

            We sure do.

            /scratch/logs/syslog/soak-8.log:Oct  4 23:19:28 soak-8 kernel: LNet: Using FastReg for registration
            /scratch/logs/syslog/soak-8.log:Oct  5 00:06:13 soak-8 kernel: LNet: Using FastReg for registration
            /scratch/logs/syslog/soak-8.log:Oct  5 00:30:19 soak-8 kernel: LNet: Using FastReg for registration
            /scratch/logs/syslog/soak-8.log:Oct  5 18:12:54 soak-8 kernel: LNet: Using FastReg for registration
            /scratch/logs/syslog/soak-9.log:Oct  3 18:03:02 soak-9 kernel: LNet: Using FastReg for registration
            /scratch/logs/syslog/soak-9.log:Oct  4 15:55:02 soak-9 kernel: LNet: Using FastReg for registration
            /scratch/logs/syslog/soak-9.log:Oct  4 18:36:43 soak-9 kernel: LNet: Using FastReg for registration
            /scratch/logs/syslog/soak-9.log:Oct  4 23:19:30 soak-9 kernel: LNet: Using FastReg for registration
            /scratch/logs/syslog/soak-9.log:Oct  5 00:30:19 soak-9 kernel: LNet: Using FastReg for registration
            /scratch/logs/syslog/soak-9.log:Oct  5 18:12:55 soak-9 kernel: LNet: Using FastReg for registration
            
            cliffw Cliff White (Inactive) added a comment - We sure do. /scratch/logs/syslog/soak-8.log:Oct 4 23:19:28 soak-8 kernel: LNet: Using FastReg for registration /scratch/logs/syslog/soak-8.log:Oct 5 00:06:13 soak-8 kernel: LNet: Using FastReg for registration /scratch/logs/syslog/soak-8.log:Oct 5 00:30:19 soak-8 kernel: LNet: Using FastReg for registration /scratch/logs/syslog/soak-8.log:Oct 5 18:12:54 soak-8 kernel: LNet: Using FastReg for registration /scratch/logs/syslog/soak-9.log:Oct 3 18:03:02 soak-9 kernel: LNet: Using FastReg for registration /scratch/logs/syslog/soak-9.log:Oct 4 15:55:02 soak-9 kernel: LNet: Using FastReg for registration /scratch/logs/syslog/soak-9.log:Oct 4 18:36:43 soak-9 kernel: LNet: Using FastReg for registration /scratch/logs/syslog/soak-9.log:Oct 4 23:19:30 soak-9 kernel: LNet: Using FastReg for registration /scratch/logs/syslog/soak-9.log:Oct 5 00:30:19 soak-9 kernel: LNet: Using FastReg for registration /scratch/logs/syslog/soak-9.log:Oct 5 18:12:55 soak-9 kernel: LNet: Using FastReg for registration

            Do we see this on startup?

            LCONSOLE_INFO("Using FastReg for registration\n");
            
            ashehata Amir Shehata (Inactive) added a comment - Do we see this on startup? LCONSOLE_INFO( "Using FastReg for registration\n" );

            That's true. OPA supports FMR in software but not FastReg.

            Was this run on OPA?

            ashehata Amir Shehata (Inactive) added a comment - That's true. OPA supports FMR in software but not FastReg. Was this run on OPA?

            People

              ashehata Amir Shehata (Inactive)
              cliffw Cliff White (Inactive)
              Votes:
              0 Vote for this issue
              Watchers:
              9 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: