Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-10068

OST fails to mount:LustreError: 14558:0:(pack_generic.c:588:__lustre_unpack_msg()) message length 0 too small for magic/version check

Details

    • Bug
    • Resolution: Duplicate
    • Major
    • None
    • Lustre 2.11.0
    • Soak cluster - latest lustre-master build (3650) version=2.10.53_32_g20ffe21
    • 3
    • 9223372036854775807

    Description

      Reformatted, created complete new filesystem.
      MDS/MGS mounts.
      OSS mount fails
      OSS errors

      [  584.585335] Lustre: 13317:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1507069853/real 1507069853]  req@ffff88083f758300 x1580277278179360/t0(0) o253->MGC192.168.1.108@o2ib@192.168.1.108@o2ib:26/25 lens 4768/4768 e 0 to 1 dl 1507069860 ref 2 fl Rpc:eX/0/ffffffff rc 0/-1
      [  584.680525] LustreError: 166-1: MGC192.168.1.108@o2ib: Connection to MGS (at 192.168.1.108@o2ib) was lost; in progress operations using this service will fail
      [  584.727353] LustreError: 15f-b: soaked-OST0000: cannot register this server with the MGS: rc = -5. Is the MGS running?
      [  584.755152] Lustre: MGC192.168.1.108@o2ib: Connection restored to MGC192.168.1.108@o2ib_0 (at 192.168.1.108@o2ib)
      [  584.796681] LustreError: 13317:0:(obd_mount_server.c:1863:server_fill_super()) Unable to start targets: -5
      [  584.828634] LustreError: 13317:0:(obd_mount_server.c:1573:server_put_super()) no obd soaked-OST0000
      [  584.858501] LustreError: 13317:0:(obd_mount_server.c:132:server_deregister_mount()) soaked-OST0000 not registered
      [  585.118058] Lustre: server umount soaked-OST0000 complete
      [  585.135868] LustreError: 13317:0:(obd_mount.c:1504:lustre_fill_super()) Unable to mount  (-5)
      

      Errors on MDS

      [17524.112385] Lustre: soaked-MDT0000: new disk, initializing
      [17524.138680] Lustre: soaked-MDT0000: Imperative Recovery not enabled, recovery window 300-900
      [17524.153739] Lustre: ctl-soaked-MDT0000: super-sequence allocation rc = 0 [0x0000000200000400-0x0000000240000400]:0:mdt
      [17528.212947] Lustre: MGS: Connection restored to aa8414d2-f089-bf2e-b8c3-406370f048cc (at 192.168.1.102@o2ib)
      [17528.273221] LustreError: 12811:0:(events.c:304:request_in_callback()) event type 2, status -103, service mgs
      [17528.287356] LustreError: 14558:0:(pack_generic.c:588:__lustre_unpack_msg()) message length 0 too small for magic/version check
      [17528.305726] LustreError: 14558:0:(sec.c:2069:sptlrpc_svc_unwrap_request()) error unpacking request from 12345-192.168.1.102@o2ib x1580277278179360
      [17528.418152] Lustre: MGS: Received new LWP connection from 192.168.1.102@o2ib, removing former export from same NID
      

      Not sure where to go from here.

      Attachments

        Issue Links

          Activity

            [LU-10068] OST fails to mount:LustreError: 14558:0:(pack_generic.c:588:__lustre_unpack_msg()) message length 0 too small for magic/version check

            Based on the above patch descriptions, my first guess would be LU-7001.

            Alternately, if this is only being seen on IB then LU-9810 seems likely.

            adilger Andreas Dilger added a comment - Based on the above patch descriptions, my first guess would be LU-7001 . Alternately, if this is only being seen on IB then LU-9810 seems likely.

            Having some problems also with lustre-review builds, so may also be soak, will re-load 3637 tomorrow to confirm.

            cliffw Cliff White (Inactive) added a comment - Having some problems also with lustre-review builds, so may also be soak, will re-load 3637 tomorrow to confirm.

            Between builds 3637 and 3638, the following patches landed:
            LU-9907 build: add patchless server for lbuild — oleg.drokin / gitweb
            LU-7988 hsm: update many cookie status at once — oleg.drokin / gitweb
            LU-7001 osp: fix llog processing — oleg.drokin / gitweb
            LU-9810 lnet: fix build with M-OFED 4.1 — oleg.drokin / gitweb
            LU-9810 lnet: prefer Fast Reg — oleg.drokin / gitweb
            LU-9260 test: Use the correct mount device when test against lustre — oleg.drokin / gitweb
            LU-9941 lov: lsm_is_composite isn't right — oleg.drokin / gitweb
            LU-8342 utils: Set dnodesize/recordsize at zfs dataset create — oleg.drokin / gitweb
            LU-9044 test: remove conf-sanity tests from ALWAYS_EXCEPT — oleg.drokin / gitweb
            LU-9870 build: handle SNMP missing on build box — oleg.drokin / gitweb
            LU-9891 tests: Increase space not released for ZFS — oleg.drokin / gitweb
            LU-7746 tests: skip tests for older (upstream) client — oleg.drokin / gitweb
            LU-9921 lnet: resolve unsafe list access — oleg.drokin / gitweb
            LU-9950 build: add support for Ubuntu(debian) arm64 — oleg.drokin / gitweb

            jamesanunez James Nunez (Inactive) added a comment - Between builds 3637 and 3638, the following patches landed: LU-9907 build: add patchless server for lbuild — oleg.drokin / gitweb LU-7988 hsm: update many cookie status at once — oleg.drokin / gitweb LU-7001 osp: fix llog processing — oleg.drokin / gitweb LU-9810 lnet: fix build with M-OFED 4.1 — oleg.drokin / gitweb LU-9810 lnet: prefer Fast Reg — oleg.drokin / gitweb LU-9260 test: Use the correct mount device when test against lustre — oleg.drokin / gitweb LU-9941 lov: lsm_is_composite isn't right — oleg.drokin / gitweb LU-8342 utils: Set dnodesize/recordsize at zfs dataset create — oleg.drokin / gitweb LU-9044 test: remove conf-sanity tests from ALWAYS_EXCEPT — oleg.drokin / gitweb LU-9870 build: handle SNMP missing on build box — oleg.drokin / gitweb LU-9891 tests: Increase space not released for ZFS — oleg.drokin / gitweb LU-7746 tests: skip tests for older (upstream) client — oleg.drokin / gitweb LU-9921 lnet: resolve unsafe list access — oleg.drokin / gitweb LU-9950 build: add support for Ubuntu(debian) arm64 — oleg.drokin / gitweb

            What patches were landed between these two builds?

            adilger Andreas Dilger added a comment - What patches were landed between these two builds?

            Tested build 3637, worked fine.
            Build 3638 fails immedately

            [  230.588325] LustreError: 166-1: MGC192.168.1.108@o2ib: Connection to MGS (at 192.168.1.108@o2ib) was lost; in progress operations using this service will fail
            [  230.635151] LustreError: 15c-8: MGC192.168.1.108@o2ib: The configuration from log 'soaked-MDT0000' failed (-5). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
            [  230.638598] Lustre: MGS: Received new LWP connection from 192.168.1.108@o2ib, removing former export from same NID
            [  230.638625] Lustre: MGS: Connection restored to 8c9d0ffb-0dac-9c76-1262-80c392d2d0ce (at 192.168.1.108@o2ib)
            [  230.638628] Lustre: Skipped 2 previous similar messages
            [  230.796182] LustreError: 2039:0:(obd_mount_server.c:1370:server_start_targets()) failed to start server soaked-MDT0000: -5
            [  230.873594] LustreError: 2039:0:(obd_mount_server.c:1863:server_fill_super()) Unable to start targets: -5
            [  230.905264] LustreError: 2039:0:(obd_mount_server.c:1573:server_put_super()) no obd soaked-MDT0000
            
            cliffw Cliff White (Inactive) added a comment - Tested build 3637, worked fine. Build 3638 fails immedately [ 230.588325] LustreError: 166-1: MGC192.168.1.108@o2ib: Connection to MGS (at 192.168.1.108@o2ib) was lost; in progress operations using this service will fail [ 230.635151] LustreError: 15c-8: MGC192.168.1.108@o2ib: The configuration from log 'soaked-MDT0000' failed (-5). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information. [ 230.638598] Lustre: MGS: Received new LWP connection from 192.168.1.108@o2ib, removing former export from same NID [ 230.638625] Lustre: MGS: Connection restored to 8c9d0ffb-0dac-9c76-1262-80c392d2d0ce (at 192.168.1.108@o2ib) [ 230.638628] Lustre: Skipped 2 previous similar messages [ 230.796182] LustreError: 2039:0:(obd_mount_server.c:1370:server_start_targets()) failed to start server soaked-MDT0000: -5 [ 230.873594] LustreError: 2039:0:(obd_mount_server.c:1863:server_fill_super()) Unable to start targets: -5 [ 230.905264] LustreError: 2039:0:(obd_mount_server.c:1573:server_put_super()) no obd soaked-MDT0000

            Did a debug=-1 dump of soak-8 (the MDS) while the mount was hanging. Attached.

            cliffw Cliff White (Inactive) added a comment - Did a debug=-1 dump of soak-8 (the MDS) while the mount was hanging. Attached.

            Reverted back to build 3649, and it's even worse! The initial MGT/MDT mount hangs, big time.

            Oct  4 18:40:43 soak-8 kernel: INFO: task mount.lustre:2489 blocked for more than 120 seconds.
            Oct  4 18:40:43 soak-8 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
            Oct  4 18:40:43 soak-8 kernel: mount.lustre    D 0000000000000001     0  2489   2488 0x00000082
            Oct  4 18:40:44 soak-8 kernel: ffff8804102d77b0 0000000000000082 ffff88041140dee0 ffff8804102d7fd8
            Oct  4 18:40:44 soak-8 kernel: ffff8804102d7fd8 ffff8804102d7fd8 ffff88041140dee0 ffff88082c4c4810
            Oct  4 18:40:44 soak-8 kernel: 7fffffffffffffff ffff88082c4c4808 ffff88041140dee0 0000000000000001
            Oct  4 18:40:44 soak-8 kernel: Call Trace:
            Oct  4 18:40:44 soak-8 kernel: [<ffffffff816a94a9>] schedule+0x29/0x70
            Oct  4 18:40:44 soak-8 kernel: [<ffffffff816a6fb9>] schedule_timeout+0x239/0x2c0
            Oct  4 18:40:44 soak-8 kernel: [<ffffffff81050b5c>] ? native_smp_send_reschedule+0x4c/0x70
            Oct  4 18:40:44 soak-8 kernel: [<ffffffff810c0548>] ? resched_curr+0xa8/0xc0
            Oct  4 18:40:44 soak-8 kernel: [<ffffffff810c12c8>] ? check_preempt_curr+0x78/0xa0
            Oct  4 18:40:44 soak-8 kernel: [<ffffffff810c1309>] ? ttwu_do_wakeup+0x19/0xd0
            Oct  4 18:40:44 soak-8 kernel: [<ffffffff816a985d>] wait_for_completion+0xfd/0x140
            Oct  4 18:40:44 soak-8 kernel: [<ffffffff810c4810>] ? wake_up_state+0x20/0x20
            Oct  4 18:40:44 soak-8 kernel: [<ffffffffc0a496b4>] llog_process_or_fork+0x244/0x450 [obdclass]
            Oct  4 18:40:44 soak-8 kernel: [<ffffffffc0a498d4>] llog_process+0x14/0x20 [obdclass]
            Oct  4 18:40:44 soak-8 kernel: [<ffffffffc0a79e95>] class_config_parse_llog+0x125/0x350 [obdclass]
            Oct  4 18:40:44 soak-8 kernel: [<ffffffffc14465a8>] mgc_process_cfg_log+0x788/0xc40 [mgc]
            Oct  4 18:40:44 soak-8 kernel: [<ffffffffc1449413>] mgc_process_log+0x3d3/0x890 [mgc]
            Oct  4 18:40:44 soak-8 kernel: [<ffffffffc0a826b0>] ? class_config_dump_handler+0x7e0/0x7e0 [obdclass]
            Oct  4 18:40:44 soak-8 kernel: [<ffffffffc1449b18>] ? do_config_log_add+0x248/0x580 [mgc]
            Oct  4 18:40:44 soak-8 kernel: [<ffffffffc144a9f0>] mgc_process_config+0x890/0x13f0 [mgc]
            Oct  4 18:40:44 soak-8 kernel: [<ffffffffc0a861ea>] lustre_process_log+0x2da/0xae0 [obdclass]
            Oct  4 18:40:44 soak-8 kernel: [<ffffffffc0927ba7>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
            Oct  4 18:40:44 soak-8 kernel: [<ffffffffc0a713e9>] ? lprocfs_counter_add+0xf9/0x160 [obdclass]
            Oct  4 18:40:44 soak-8 kernel: [<ffffffffc0ab2da2>] server_start_targets+0x1352/0x2a70 [obdclass]
            Oct  4 18:40:44 soak-8 kernel: [<ffffffffc0cd0b48>] ? ptlrpc_pinger_wake_up+0x28/0x30 [ptlrpc]
            Oct  4 18:40:44 soak-8 kernel: [<ffffffffc0a71511>] ? lprocfs_counter_sub+0xc1/0x130 [obdclass]
            Oct  4 18:40:44 soak-8 kernel: [<ffffffffc0a826b0>] ? class_config_dump_handler+0x7e0/0x7e0 [obdclass]
            Oct  4 18:40:44 soak-8 kernel: [<ffffffffc0ab554d>] server_fill_super+0x108d/0x184c [obdclass]
            Oct  4 18:40:44 soak-8 kernel: [<ffffffffc0a8c168>] lustre_fill_super+0x328/0x950 [obdclass]
            Oct  4 18:40:44 soak-8 kernel: [<ffffffffc0a8be40>] ? lustre_common_put_super+0x270/0x270 [obdclass]
            Oct  4 18:40:44 soak-8 kernel: [<ffffffff81204afd>] mount_nodev+0x4d/0xb0
            Oct  4 18:40:44 soak-8 kernel: [<ffffffffc0a83f78>] lustre_mount+0x38/0x60 [obdclass]
            Oct  4 18:40:44 soak-8 kernel: [<ffffffff81205589>] mount_fs+0x39/0x1b0
            Oct  4 18:40:44 soak-8 kernel: [<ffffffff81222067>] vfs_kern_mount+0x67/0x110
            Oct  4 18:40:44 soak-8 kernel: [<ffffffff81224573>] do_mount+0x233/0xaf0
            Oct  4 18:40:44 soak-8 kernel: [<ffffffff8118760e>] ? __get_free_pages+0xe/0x40
            Oct  4 18:40:44 soak-8 kernel: [<ffffffff812251b6>] SyS_mount+0x96/0xf0
            Oct  4 18:40:44 soak-8 kernel: [<ffffffff816b4fc9>] system_call_fastpath+0x16/0x1b
            Oct  4 18:42:44 soak-8 kernel: INFO: task mount.lustre:2489 blocked for more than 120 seconds.
            
            cliffw Cliff White (Inactive) added a comment - Reverted back to build 3649, and it's even worse! The initial MGT/MDT mount hangs, big time. Oct 4 18:40:43 soak-8 kernel: INFO: task mount.lustre:2489 blocked for more than 120 seconds. Oct 4 18:40:43 soak-8 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Oct 4 18:40:43 soak-8 kernel: mount.lustre D 0000000000000001 0 2489 2488 0x00000082 Oct 4 18:40:44 soak-8 kernel: ffff8804102d77b0 0000000000000082 ffff88041140dee0 ffff8804102d7fd8 Oct 4 18:40:44 soak-8 kernel: ffff8804102d7fd8 ffff8804102d7fd8 ffff88041140dee0 ffff88082c4c4810 Oct 4 18:40:44 soak-8 kernel: 7fffffffffffffff ffff88082c4c4808 ffff88041140dee0 0000000000000001 Oct 4 18:40:44 soak-8 kernel: Call Trace: Oct 4 18:40:44 soak-8 kernel: [<ffffffff816a94a9>] schedule+0x29/0x70 Oct 4 18:40:44 soak-8 kernel: [<ffffffff816a6fb9>] schedule_timeout+0x239/0x2c0 Oct 4 18:40:44 soak-8 kernel: [<ffffffff81050b5c>] ? native_smp_send_reschedule+0x4c/0x70 Oct 4 18:40:44 soak-8 kernel: [<ffffffff810c0548>] ? resched_curr+0xa8/0xc0 Oct 4 18:40:44 soak-8 kernel: [<ffffffff810c12c8>] ? check_preempt_curr+0x78/0xa0 Oct 4 18:40:44 soak-8 kernel: [<ffffffff810c1309>] ? ttwu_do_wakeup+0x19/0xd0 Oct 4 18:40:44 soak-8 kernel: [<ffffffff816a985d>] wait_for_completion+0xfd/0x140 Oct 4 18:40:44 soak-8 kernel: [<ffffffff810c4810>] ? wake_up_state+0x20/0x20 Oct 4 18:40:44 soak-8 kernel: [<ffffffffc0a496b4>] llog_process_or_fork+0x244/0x450 [obdclass] Oct 4 18:40:44 soak-8 kernel: [<ffffffffc0a498d4>] llog_process+0x14/0x20 [obdclass] Oct 4 18:40:44 soak-8 kernel: [<ffffffffc0a79e95>] class_config_parse_llog+0x125/0x350 [obdclass] Oct 4 18:40:44 soak-8 kernel: [<ffffffffc14465a8>] mgc_process_cfg_log+0x788/0xc40 [mgc] Oct 4 18:40:44 soak-8 kernel: [<ffffffffc1449413>] mgc_process_log+0x3d3/0x890 [mgc] Oct 4 18:40:44 soak-8 kernel: [<ffffffffc0a826b0>] ? class_config_dump_handler+0x7e0/0x7e0 [obdclass] Oct 4 18:40:44 soak-8 kernel: [<ffffffffc1449b18>] ? do_config_log_add+0x248/0x580 [mgc] Oct 4 18:40:44 soak-8 kernel: [<ffffffffc144a9f0>] mgc_process_config+0x890/0x13f0 [mgc] Oct 4 18:40:44 soak-8 kernel: [<ffffffffc0a861ea>] lustre_process_log+0x2da/0xae0 [obdclass] Oct 4 18:40:44 soak-8 kernel: [<ffffffffc0927ba7>] ? libcfs_debug_msg+0x57/0x80 [libcfs] Oct 4 18:40:44 soak-8 kernel: [<ffffffffc0a713e9>] ? lprocfs_counter_add+0xf9/0x160 [obdclass] Oct 4 18:40:44 soak-8 kernel: [<ffffffffc0ab2da2>] server_start_targets+0x1352/0x2a70 [obdclass] Oct 4 18:40:44 soak-8 kernel: [<ffffffffc0cd0b48>] ? ptlrpc_pinger_wake_up+0x28/0x30 [ptlrpc] Oct 4 18:40:44 soak-8 kernel: [<ffffffffc0a71511>] ? lprocfs_counter_sub+0xc1/0x130 [obdclass] Oct 4 18:40:44 soak-8 kernel: [<ffffffffc0a826b0>] ? class_config_dump_handler+0x7e0/0x7e0 [obdclass] Oct 4 18:40:44 soak-8 kernel: [<ffffffffc0ab554d>] server_fill_super+0x108d/0x184c [obdclass] Oct 4 18:40:44 soak-8 kernel: [<ffffffffc0a8c168>] lustre_fill_super+0x328/0x950 [obdclass] Oct 4 18:40:44 soak-8 kernel: [<ffffffffc0a8be40>] ? lustre_common_put_super+0x270/0x270 [obdclass] Oct 4 18:40:44 soak-8 kernel: [<ffffffff81204afd>] mount_nodev+0x4d/0xb0 Oct 4 18:40:44 soak-8 kernel: [<ffffffffc0a83f78>] lustre_mount+0x38/0x60 [obdclass] Oct 4 18:40:44 soak-8 kernel: [<ffffffff81205589>] mount_fs+0x39/0x1b0 Oct 4 18:40:44 soak-8 kernel: [<ffffffff81222067>] vfs_kern_mount+0x67/0x110 Oct 4 18:40:44 soak-8 kernel: [<ffffffff81224573>] do_mount+0x233/0xaf0 Oct 4 18:40:44 soak-8 kernel: [<ffffffff8118760e>] ? __get_free_pages+0xe/0x40 Oct 4 18:40:44 soak-8 kernel: [<ffffffff812251b6>] SyS_mount+0x96/0xf0 Oct 4 18:40:44 soak-8 kernel: [<ffffffff816b4fc9>] system_call_fastpath+0x16/0x1b Oct 4 18:42:44 soak-8 kernel: INFO: task mount.lustre:2489 blocked for more than 120 seconds.

            The last master build that ran successfully on soak was build #3637.

            jamesanunez James Nunez (Inactive) added a comment - The last master build that ran successfully on soak was build #3637.

            However, now all client mount attempts fail:

            [ 1891.859459] Lustre: 9069:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1507133982/real 1507133982]  req@ffff880813ea0300 x1580342921134176/t0(0) o503->MGC192.168.1.108@o2ib@192.168.1.108@o2ib:26/25 lens 272/8416 e 0 to 1 dl 1507133989 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
            [ 1891.952822] LustreError: 166-1: MGC192.168.1.108@o2ib: Connection to MGS (at 192.168.1.108@o2ib) was lost; in progress operations using this service will fail
            [ 1891.999529] LustreError: 15c-8: MGC192.168.1.108@o2ib: The configuration from log 'soaked-client' failed (-5). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
            [ 1892.002601] Lustre: MGC192.168.1.108@o2ib: Connection restored to MGC192.168.1.108@o2ib_0 (at 192.168.1.108@o2ib)
            [ 1892.110166] Lustre: Unmounted soaked-client
            
            cliffw Cliff White (Inactive) added a comment - However, now all client mount attempts fail: [ 1891.859459] Lustre: 9069:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1507133982/real 1507133982] req@ffff880813ea0300 x1580342921134176/t0(0) o503->MGC192.168.1.108@o2ib@192.168.1.108@o2ib:26/25 lens 272/8416 e 0 to 1 dl 1507133989 ref 2 fl Rpc:X/0/ffffffff rc 0/-1 [ 1891.952822] LustreError: 166-1: MGC192.168.1.108@o2ib: Connection to MGS (at 192.168.1.108@o2ib) was lost; in progress operations using this service will fail [ 1891.999529] LustreError: 15c-8: MGC192.168.1.108@o2ib: The configuration from log 'soaked-client' failed (-5). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information. [ 1892.002601] Lustre: MGC192.168.1.108@o2ib: Connection restored to MGC192.168.1.108@o2ib_0 (at 192.168.1.108@o2ib) [ 1892.110166] Lustre: Unmounted soaked-client

            In addition, the OST mounts are very, very slow

            cliffw Cliff White (Inactive) added a comment - In addition, the OST mounts are very, very slow

            I re-loaded b2.10 GA - filesystem formatted and mounted with not issues.
            I then loaded lustre-master build 3650, and attempted to mount the previously formatted filesystem. This filesystem was formatted and mounted once, no work other than mount/umount performed. The OST mounts do complete, but there are many errors

            Oct  4 16:01:24 soak-2 kernel: Lustre: 24621:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1507132883/real 1507132883]  req@ffff8800b2960300 x1580342920085536/t0(0) o253->MGC192.168.1.108@o2ib@192.168.1.108@o2ib:26/25 lens 4768/4768 e 0 to 1 dl 1507132890 ref 2 fl Rpc:eX/0/ffffffff rc 0/-1
            Oct  4 16:01:24 soak-2 kernel: LustreError: 166-1: MGC192.168.1.108@o2ib: Connection to MGS (at 192.168.1.108@o2ib) was lost; in progress operations using this service will fail
            Oct  4 16:01:24 soak-2 kernel: Lustre: MGC192.168.1.108@o2ib: Connection restored to MGC192.168.1.108@o2ib_0 (at 192.168.1.108@o2ib)
            Oct  4 16:01:24 soak-2 kernel: LustreError: 23440:0:(events.c:199:client_bulk_callback()) event type 2, status -103, desc ffff88082acd7e00
            Oct  4 16:01:24 soak-2 kernel: Lustre: 24621:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1507132884/real 1507132884]  req@ffff8803ff218000 x1580342920085632/t0(0) o256->MGC192.168.1.108@o2ib@192.168.1.108@o2ib:26/25 lens 304/240 e 0 to 1 dl 1507132891 ref 2 fl Rpc:eX/0/ffffffff rc 0/-1
            Oct  4 16:01:24 soak-2 kernel: LustreError: 166-1: MGC192.168.1.108@o2ib: Connection to MGS (at 192.168.1.108@o2ib) was lost; in progress operations using this service will fail
            Oct  4 16:01:25 soak-2 kernel: LustreError: 24621:0:(mgc_request.c:251:do_config_log_add()) MGC192.168.1.108@o2ib: failed processing log, type 4: rc = -5
            Oct  4 16:01:25 soak-2 kernel: Lustre: MGC192.168.1.108@o2ib: Connection restored to MGC192.168.1.108@o2ib_0 (at 192.168.1.108@o2ib)
            Oct  4 16:01:32 soak-2 kernel: Lustre: 24621:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1507132885/real 1507132885]  req@ffff8803ff218000 x1580342920085696/t0(0) o503->MGC192.168.1.108@o2ib@192.168.1.108@o2ib:26/25 lens 272/8416 e 0 to 1 dl 1507132892 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
            Oct  4 16:01:32 soak-2 kernel: LustreError: 166-1: MGC192.168.1.108@o2ib: Connection to MGS (at 192.168.1.108@o2ib) was lost; in progress operations using this service will fail
            Oct  4 16:01:33 soak-2 kernel: LustreError: 24621:0:(mgc_request.c:1885:mgc_llog_local_copy()) MGC192.168.1.108@o2ib: failed to copy remote log soaked-OST0000: rc = -5
            Oct  4 16:01:33 soak-2 kernel: LustreError: 24937:0:(ldlm_resource.c:1100:ldlm_resource_complain()) MGC192.168.1.108@o2ib: namespace resource [0x64656b616f73:0x0:0x0].0x0 (ffff8804184e1200) refcount nonzero (1) after lock cleanup; forcing cleanup.
            Oct  4 16:01:33 soak-2 kernel: LustreError: 24937:0:(ldlm_resource.c:1682:ldlm_resource_dump()) --- Resource: [0x64656b616f73:0x0:0x0].0x0 (ffff8804184e1200) refcount = 2
            Oct  4 16:01:33 soak-2 kernel: LustreError: 24937:0:(ldlm_resource.c:1685:ldlm_resource_dump()) Granted locks (in reverse order):
            Oct  4 16:01:33 soak-2 kernel: LustreError: 24937:0:(ldlm_resource.c:1688:ldlm_resource_dump()) ### ### ns: MGC192.168.1.108@o2ib lock: ffff88040eb20400/0xc3f9d16c2df9943e lrc: 2/1,0 mode: CR/CR res: [0x64656b616f73:0x0:0x0].0x0 rrc: 3 type: PLN flags: 0x1106400000000 nid: local remote: 0xc8dccd9f1ede04d5 expref: -99 pid: 24621 timeout: 0 lvb_type: 0
            Oct  4 16:01:33 soak-2 kernel: Lustre: MGC192.168.1.108@o2ib: Connection restored to MGC192.168.1.108@o2ib_0 (at 192.168.1.108@o2ib)
            Oct  4 16:01:34 soak-2 kernel: LustreError: 23440:0:(events.c:199:client_bulk_callback()) event type 2, status -103, desc ffff88082cc40800
            Oct  4 16:01:34 soak-2 kernel: LustreError: 24665:0:(mgc_request.c:603:do_requeue()) failed processing log: -5
            Oct  4 16:01:45 soak-2 kernel: LustreError: 137-5: soaked-OST0001_UUID: not available for connect from 192.168.1.108@o2ib (no target). If you are running an HA pair check that the target is mounted on the other server.
            
            cliffw Cliff White (Inactive) added a comment - I re-loaded b2.10 GA - filesystem formatted and mounted with not issues. I then loaded lustre-master build 3650, and attempted to mount the previously formatted filesystem. This filesystem was formatted and mounted once, no work other than mount/umount performed. The OST mounts do complete, but there are many errors Oct 4 16:01:24 soak-2 kernel: Lustre: 24621:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1507132883/real 1507132883] req@ffff8800b2960300 x1580342920085536/t0(0) o253->MGC192.168.1.108@o2ib@192.168.1.108@o2ib:26/25 lens 4768/4768 e 0 to 1 dl 1507132890 ref 2 fl Rpc:eX/0/ffffffff rc 0/-1 Oct 4 16:01:24 soak-2 kernel: LustreError: 166-1: MGC192.168.1.108@o2ib: Connection to MGS (at 192.168.1.108@o2ib) was lost; in progress operations using this service will fail Oct 4 16:01:24 soak-2 kernel: Lustre: MGC192.168.1.108@o2ib: Connection restored to MGC192.168.1.108@o2ib_0 (at 192.168.1.108@o2ib) Oct 4 16:01:24 soak-2 kernel: LustreError: 23440:0:(events.c:199:client_bulk_callback()) event type 2, status -103, desc ffff88082acd7e00 Oct 4 16:01:24 soak-2 kernel: Lustre: 24621:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1507132884/real 1507132884] req@ffff8803ff218000 x1580342920085632/t0(0) o256->MGC192.168.1.108@o2ib@192.168.1.108@o2ib:26/25 lens 304/240 e 0 to 1 dl 1507132891 ref 2 fl Rpc:eX/0/ffffffff rc 0/-1 Oct 4 16:01:24 soak-2 kernel: LustreError: 166-1: MGC192.168.1.108@o2ib: Connection to MGS (at 192.168.1.108@o2ib) was lost; in progress operations using this service will fail Oct 4 16:01:25 soak-2 kernel: LustreError: 24621:0:(mgc_request.c:251:do_config_log_add()) MGC192.168.1.108@o2ib: failed processing log, type 4: rc = -5 Oct 4 16:01:25 soak-2 kernel: Lustre: MGC192.168.1.108@o2ib: Connection restored to MGC192.168.1.108@o2ib_0 (at 192.168.1.108@o2ib) Oct 4 16:01:32 soak-2 kernel: Lustre: 24621:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1507132885/real 1507132885] req@ffff8803ff218000 x1580342920085696/t0(0) o503->MGC192.168.1.108@o2ib@192.168.1.108@o2ib:26/25 lens 272/8416 e 0 to 1 dl 1507132892 ref 2 fl Rpc:X/0/ffffffff rc 0/-1 Oct 4 16:01:32 soak-2 kernel: LustreError: 166-1: MGC192.168.1.108@o2ib: Connection to MGS (at 192.168.1.108@o2ib) was lost; in progress operations using this service will fail Oct 4 16:01:33 soak-2 kernel: LustreError: 24621:0:(mgc_request.c:1885:mgc_llog_local_copy()) MGC192.168.1.108@o2ib: failed to copy remote log soaked-OST0000: rc = -5 Oct 4 16:01:33 soak-2 kernel: LustreError: 24937:0:(ldlm_resource.c:1100:ldlm_resource_complain()) MGC192.168.1.108@o2ib: namespace resource [0x64656b616f73:0x0:0x0].0x0 (ffff8804184e1200) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 4 16:01:33 soak-2 kernel: LustreError: 24937:0:(ldlm_resource.c:1682:ldlm_resource_dump()) --- Resource: [0x64656b616f73:0x0:0x0].0x0 (ffff8804184e1200) refcount = 2 Oct 4 16:01:33 soak-2 kernel: LustreError: 24937:0:(ldlm_resource.c:1685:ldlm_resource_dump()) Granted locks (in reverse order): Oct 4 16:01:33 soak-2 kernel: LustreError: 24937:0:(ldlm_resource.c:1688:ldlm_resource_dump()) ### ### ns: MGC192.168.1.108@o2ib lock: ffff88040eb20400/0xc3f9d16c2df9943e lrc: 2/1,0 mode: CR/CR res: [0x64656b616f73:0x0:0x0].0x0 rrc: 3 type: PLN flags: 0x1106400000000 nid: local remote: 0xc8dccd9f1ede04d5 expref: -99 pid: 24621 timeout: 0 lvb_type: 0 Oct 4 16:01:33 soak-2 kernel: Lustre: MGC192.168.1.108@o2ib: Connection restored to MGC192.168.1.108@o2ib_0 (at 192.168.1.108@o2ib) Oct 4 16:01:34 soak-2 kernel: LustreError: 23440:0:(events.c:199:client_bulk_callback()) event type 2, status -103, desc ffff88082cc40800 Oct 4 16:01:34 soak-2 kernel: LustreError: 24665:0:(mgc_request.c:603:do_requeue()) failed processing log: -5 Oct 4 16:01:45 soak-2 kernel: LustreError: 137-5: soaked-OST0001_UUID: not available for connect from 192.168.1.108@o2ib (no target). If you are running an HA pair check that the target is mounted on the other server.

            People

              ashehata Amir Shehata (Inactive)
              cliffw Cliff White (Inactive)
              Votes:
              0 Vote for this issue
              Watchers:
              9 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: