LU-10068: OST fails to mount: LustreError: 14558:0:(pack_generic.c:588:__lustre_unpack_msg()) message length 0 too small for magic/version check

Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Major
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.11.0
    • Environment: Soak cluster - latest lustre-master build (3650) version=2.10.53_32_g20ffe21
    • Severity: 3

    Description

      Reformatted and created a complete new filesystem.
      The MDS/MGS mounts, but the OSS mount fails.
      OSS errors:

      [  584.585335] Lustre: 13317:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1507069853/real 1507069853]  req@ffff88083f758300 x1580277278179360/t0(0) o253->MGC192.168.1.108@o2ib@192.168.1.108@o2ib:26/25 lens 4768/4768 e 0 to 1 dl 1507069860 ref 2 fl Rpc:eX/0/ffffffff rc 0/-1
      [  584.680525] LustreError: 166-1: MGC192.168.1.108@o2ib: Connection to MGS (at 192.168.1.108@o2ib) was lost; in progress operations using this service will fail
      [  584.727353] LustreError: 15f-b: soaked-OST0000: cannot register this server with the MGS: rc = -5. Is the MGS running?
      [  584.755152] Lustre: MGC192.168.1.108@o2ib: Connection restored to MGC192.168.1.108@o2ib_0 (at 192.168.1.108@o2ib)
      [  584.796681] LustreError: 13317:0:(obd_mount_server.c:1863:server_fill_super()) Unable to start targets: -5
      [  584.828634] LustreError: 13317:0:(obd_mount_server.c:1573:server_put_super()) no obd soaked-OST0000
      [  584.858501] LustreError: 13317:0:(obd_mount_server.c:132:server_deregister_mount()) soaked-OST0000 not registered
      [  585.118058] Lustre: server umount soaked-OST0000 complete
      [  585.135868] LustreError: 13317:0:(obd_mount.c:1504:lustre_fill_super()) Unable to mount  (-5)
      

      Errors on MDS

      [17524.112385] Lustre: soaked-MDT0000: new disk, initializing
      [17524.138680] Lustre: soaked-MDT0000: Imperative Recovery not enabled, recovery window 300-900
      [17524.153739] Lustre: ctl-soaked-MDT0000: super-sequence allocation rc = 0 [0x0000000200000400-0x0000000240000400]:0:mdt
      [17528.212947] Lustre: MGS: Connection restored to aa8414d2-f089-bf2e-b8c3-406370f048cc (at 192.168.1.102@o2ib)
      [17528.273221] LustreError: 12811:0:(events.c:304:request_in_callback()) event type 2, status -103, service mgs
      [17528.287356] LustreError: 14558:0:(pack_generic.c:588:__lustre_unpack_msg()) message length 0 too small for magic/version check
      [17528.305726] LustreError: 14558:0:(sec.c:2069:sptlrpc_svc_unwrap_request()) error unpacking request from 12345-192.168.1.102@o2ib x1580277278179360
      [17528.418152] Lustre: MGS: Received new LWP connection from 192.168.1.102@o2ib, removing former export from same NID
      

      Not sure where to go from here.

          Activity


            This was fixed by the LU-9983 patch.

            adilger Andreas Dilger added a comment

            I am ok with closing it.

            cliffw Cliff White (Inactive) added a comment

            The combined patch is no longer needed (part reverted, part landed), can this ticket be closed?

            adilger Andreas Dilger added a comment

            Amir Shehata (amir.shehata@intel.com) uploaded a new patch: https://review.whamcloud.com/29547
            Subject: LU-10068 lnet: Combined patch for testing on soak
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: f4009ebfecc1cfc4066a4b83808b224eae2ab852

            gerrit Gerrit Updater added a comment

            Tested James's patch reverting Fast Reg: https://review.whamcloud.com/29341. Soak is running okay, but a router has hit an LBUG (LU-10103).

            cliffw Cliff White (Inactive) added a comment
            ashehata Amir Shehata (Inactive) added a comment (edited)

            There appear to be three separate issues:
            1. Fastreg is not supported on OPA, so reverting LU-9810 works (as a side note, I also tried John's patch for LU-9983, and that resolves the problem we were seeing there).
            2. Fastreg is broken on MLX-5. I've been debugging this problem since yesterday, and we already have a few bugs that are all related. For this particular problem, ib_map_mr_sg() is called to map 11 fragments but ends up mapping only 10. I'm trying to look at the mlx5 driver code to understand why it stops before mapping all fragments. The fragments look like:

            length = 4096, offset = 0, addr = 105c43c000, page offset = 0, page_addr = 105c43c000
            length = 4096, offset = 0, addr = 84cfb0000, page offset = 0, page_addr = 84cfb0000
            length = 4096, offset = 0, addr = 84cfb1000, page offset = 0, page_addr = 84cfb1000
            length = 4096, offset = 0, addr = 84cfb2000, page offset = 0, page_addr = 84cfb2000
            length = 4096, offset = 0, addr = 84cfb3000, page offset = 0, page_addr = 84cfb3000
            length = 4096, offset = 0, addr = 84cfb4000, page offset = 0, page_addr = 84cfb4000
            length = 4096, offset = 0, addr = 84cfb5000, page offset = 0, page_addr = 84cfb5000
            length = 4096, offset = 0, addr = 84cfb6000, page offset = 0, page_addr = 84cfb6000
            length = 4096, offset = 0, addr = 84cfb7000, page offset = 0, page_addr = 84cfb7000
            length = 73, offset = 0, addr = 84cfb8000, page offset = 0, page_addr = 84cfb8000
            length = 4096, offset = 0, addr = 105c43d000, page offset = 0, page_addr = 105c43d000
            

            So it looks like it stops on the fragment of length 73.

            3. MLX-4 failure. I still need to investigate further, because it could be different from both of the above.
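
            The "stops on the fragment of length 73" behavior is consistent with the contiguity rule fast-reg MRs enforce: every fragment after the first must start on a page boundary, and every fragment before the last must end on one. The 73-byte fragment is the 10th of 11 and does not end on a page boundary, so mapping cannot continue past it. A minimal userspace sketch of that rule (not the actual kernel or mlx5 code; `count_mappable` and the struct are illustrative):

            ```c
            /* Userspace sketch of the page-contiguity rule that a fast-reg
             * MR page list imposes on a fragment list; the helper
             * count_mappable() is hypothetical, not from the kernel tree. */
            #include <stdio.h>
            #include <stddef.h>

            #define PAGE_SIZE 4096UL

            struct frag { unsigned long addr; size_t len; };

            /* Return how many leading fragments can be merged into one MR:
             * stop if a later fragment does not start page-aligned, or if a
             * non-final fragment does not end on a page boundary. */
            static int count_mappable(const struct frag *f, int n)
            {
                int i;
                for (i = 0; i < n; i++) {
                    if (i > 0 && (f[i].addr & (PAGE_SIZE - 1)))
                        break;              /* misaligned start */
                    if (i < n - 1 && ((f[i].addr + f[i].len) & (PAGE_SIZE - 1)))
                        return i + 1;       /* short middle fragment ends the map */
                }
                return i;
            }

            int main(void)
            {
                /* The 11 fragments from the log above: nine full pages,
                 * one 73-byte tail, then one more full page. */
                struct frag frags[11] = {
                    { 0x105c43c000UL, 4096 }, { 0x84cfb0000UL, 4096 },
                    { 0x84cfb1000UL, 4096 },  { 0x84cfb2000UL, 4096 },
                    { 0x84cfb3000UL, 4096 },  { 0x84cfb4000UL, 4096 },
                    { 0x84cfb5000UL, 4096 },  { 0x84cfb6000UL, 4096 },
                    { 0x84cfb7000UL, 4096 },  { 0x84cfb8000UL, 73 },
                    { 0x105c43d000UL, 4096 },
                };

                /* prints: mapped 10/11 fragments */
                printf("mapped %d/11 fragments\n", count_mappable(frags, 11));
                return 0;
            }
            ```

            Under this rule the mapped count matches the 10/11 seen in the logs, which is why only the first/last fragments of an RDMA buffer may be partial pages.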


            Trying https://review.whamcloud.com/29341 on soak - servers mount okay, but some clients are hitting timeouts:

            soak-20 login: [ 3363.772469] INFO: task mount.lustre:2191 blocked for more than 120 seconds.^M
            [ 3363.782165] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.^M
            [ 3363.792785] mount.lustre    D 0000000000000001     0  2191   2190 0x00000080^M
            [ 3363.802418]  ffff88103fad3940 0000000000000082 ffff881013fedee0 ffff88103fad3fd8^M
            [ 3363.812340]  ffff88103fad3fd8 ffff88103fad3fd8 ffff881013fedee0 ffff881013b87270^M
            [ 3363.822136]  7fffffffffffffff ffff881013b87268 ffff881013fedee0 0000000000000001^M
            [ 3363.831817] Call Trace:^M
            [ 3363.835841]  [<ffffffff816a94e9>] schedule+0x29/0x70^M
            [ 3363.842720]  [<ffffffff816a6ff9>] schedule_timeout+0x239/0x2c0^M
            [ 3363.850481]  [<ffffffff810c76f5>] ? sched_clock_cpu+0x85/0xc0^M
            [ 3363.858146]  [<ffffffff810c12c8>] ? check_preempt_curr+0x78/0xa0^M
            [ 3363.866025]  [<ffffffff810c1309>] ? ttwu_do_wakeup+0x19/0xd0^M
            [ 3363.873501]  [<ffffffff816a989d>] wait_for_completion+0xfd/0x140^M
            [ 3363.881287]  [<ffffffff810c4810>] ? wake_up_state+0x20/0x20^M
            [ 3363.888596]  [<ffffffffc09de6b4>] llog_process_or_fork+0x244/0x450 [obdclass]^M
            [ 3363.897603]  [<ffffffffc09de8d4>] llog_process+0x14/0x20 [obdclass]^M
            [ 3363.905633]  [<ffffffffc0a0ed65>] class_config_parse_llog+0x125/0x350 [obdclass]^M
            [ 3363.914874]  [<ffffffffc091c368>] mgc_process_cfg_log+0x788/0xc40 [mgc]^M
            [ 3363.923229]  [<ffffffffc091f1e3>] mgc_process_log+0x3d3/0x890 [mgc]^M
            [ 3363.931153]  [<ffffffffc0a17480>] ? class_config_dump_handler+0x7e0/0x7e0 [obdclass]^M
            [ 3363.940736]  [<ffffffffc091f8e8>] ? do_config_log_add+0x248/0x580 [mgc]^M
            [ 3363.949020]  [<ffffffffc09207c0>] mgc_process_config+0x890/0x13f0 [mgc]^M
            [ 3363.957326]  [<ffffffffc0a1ad7e>] lustre_process_log+0x2de/0xaf0 [obdclass]^M
            [ 3363.965969]  [<ffffffff816aba0e>] ? _raw_spin_unlock_bh+0x1e/0x20^M
            [ 3363.973647]  [<ffffffff8132318b>] ? fprop_local_init_percpu+0x1b/0x30^M
            [ 3363.981681]  [<ffffffffc0e20608>] ll_fill_super+0xaf8/0x1220 [lustre]^M
            [ 3363.989721]  [<ffffffffc0a20ce6>] lustre_fill_super+0x286/0x910 [obdclass]^M
            [ 3363.998214]  [<ffffffffc0a20a60>] ? lustre_common_put_super+0x270/0x270 [obdclass]^M
            [ 3364.007482]  [<ffffffff81204b0d>] mount_nodev+0x4d/0xb0^M
            [ 3364.014099]  [<ffffffffc0a18ba8>] lustre_mount+0x38/0x60 [obdclass]^M
            [ 3364.021874]  [<ffffffff81205599>] mount_fs+0x39/0x1b0^M
            [ 3364.028291]  [<ffffffff81222067>] vfs_kern_mount+0x67/0x110^M
            [ 3364.035288]  [<ffffffff81224573>] do_mount+0x233/0xaf0^M
            [ 3364.041786]  [<ffffffff8118760e>] ? __get_free_pages+0xe/0x40^M
            [ 3364.048971]  [<ffffffff812251b6>] SyS_mount+0x96/0xf0^M
            [ 3364.055364]  [<ffffffff816b5009>] system_call_fastpath+0x16/0x1b^M
            

            Re-trying clients after power-cycling does work.

            cliffw Cliff White (Inactive) added a comment

            I tried the patch for LU-9983 to see if it would address this, but it didn't.

            simmonsja James A Simmons added a comment
            jhammond John Hammond added a comment

            Fails on OPA as well.


            So on MLX5 I'm able to mount, but I'm getting the following errors:

            [ 1157.516112] LNetError: 34447:0:(o2iblnd.c:1940:kiblnd_fmr_pool_map()) Failed to map mr 10/11 elements
            [ 1157.516122] LNetError: 34528:0:(o2iblnd_cb.c:560:kiblnd_fmr_map_tx()) Can't map 41033 pages: -22
            

            That's again due to the fastreg code.

            ashehata Amir Shehata (Inactive) added a comment

            People

              ashehata Amir Shehata (Inactive)
              cliffw Cliff White (Inactive)