[LU-10068] OST fails to mount:LustreError: 14558:0:(pack_generic.c:588:__lustre_unpack_msg()) message length 0 too small for magic/version check - Whamcloud Community JIRA

Details

Type: Bug
Resolution: Duplicate
Priority: Major
Fix Version/s: None
Affects Version/s: Lustre 2.11.0
Labels:
- soak
Environment:
Soak cluster - latest lustre-master build (3650) version=2.10.53_32_g20ffe21

Severity:
3
Rank (Obsolete):
9223372036854775807

Description

Reformatted, created complete new filesystem.
MDS/MGS mounts.
OSS mount fails
OSS errors

[  584.585335] Lustre: 13317:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1507069853/real 1507069853]  req@ffff88083f758300 x1580277278179360/t0(0) o253->MGC192.168.1.108@o2ib@192.168.1.108@o2ib:26/25 lens 4768/4768 e 0 to 1 dl 1507069860 ref 2 fl Rpc:eX/0/ffffffff rc 0/-1
[  584.680525] LustreError: 166-1: MGC192.168.1.108@o2ib: Connection to MGS (at 192.168.1.108@o2ib) was lost; in progress operations using this service will fail
[  584.727353] LustreError: 15f-b: soaked-OST0000: cannot register this server with the MGS: rc = -5. Is the MGS running?
[  584.755152] Lustre: MGC192.168.1.108@o2ib: Connection restored to MGC192.168.1.108@o2ib_0 (at 192.168.1.108@o2ib)
[  584.796681] LustreError: 13317:0:(obd_mount_server.c:1863:server_fill_super()) Unable to start targets: -5
[  584.828634] LustreError: 13317:0:(obd_mount_server.c:1573:server_put_super()) no obd soaked-OST0000
[  584.858501] LustreError: 13317:0:(obd_mount_server.c:132:server_deregister_mount()) soaked-OST0000 not registered
[  585.118058] Lustre: server umount soaked-OST0000 complete
[  585.135868] LustreError: 13317:0:(obd_mount.c:1504:lustre_fill_super()) Unable to mount  (-5)

Errors on MDS

[17524.112385] Lustre: soaked-MDT0000: new disk, initializing
[17524.138680] Lustre: soaked-MDT0000: Imperative Recovery not enabled, recovery window 300-900
[17524.153739] Lustre: ctl-soaked-MDT0000: super-sequence allocation rc = 0 [0x0000000200000400-0x0000000240000400]:0:mdt
[17528.212947] Lustre: MGS: Connection restored to aa8414d2-f089-bf2e-b8c3-406370f048cc (at 192.168.1.102@o2ib)
[17528.273221] LustreError: 12811:0:(events.c:304:request_in_callback()) event type 2, status -103, service mgs
[17528.287356] LustreError: 14558:0:(pack_generic.c:588:__lustre_unpack_msg()) message length 0 too small for magic/version check
[17528.305726] LustreError: 14558:0:(sec.c:2069:sptlrpc_svc_unwrap_request()) error unpacking request from 12345-192.168.1.102@o2ib x1580277278179360
[17528.418152] Lustre: MGS: Received new LWP connection from 192.168.1.102@o2ib, removing former export from same NID

Not sure where to go from here.

Attachments

- Sort By Name
- Sort By Date
- Ascending
- Descending
- Thumbnails
- List
- Download All

soak-8.lustre.log.txt
1.40 MB
04/Oct/17 6:54 PM

Issue Links

is related to

LU-9983 LBUG llog_osd.c:327:llog_osd_declare_write_rec() - all DNE MDS

Resolved

mentioned in: Page Loading...; Page Loading...

Activity

[LU-10068] OST fails to mount:LustreError: 14558:0:(pack_generic.c:588:__lustre_unpack_msg()) message length 0 too small for magic/version check

Cliff White (Inactive) added a comment - 06/Oct/17 11:42 PM

Tested James patch reverting Fast Reg. https://review.whamcloud.com/29341. Soak is running okay, but router have hit LBUG (LU-10103)

Cliff White (Inactive) added a comment - 06/Oct/17 11:42 PM Tested James patch reverting Fast Reg. https://review.whamcloud.com/29341 . Soak is running okay, but router have hit LBUG ( LU-10103 )

Amir Shehata (Inactive) added a comment - 06/Oct/17 4:43 PM - edited

there appears to be 3 separate issues:
1. Fastreg is not supported on OPA so reverting ~~LU-9810~~ works (as a side note, I also tried John's patch for ~~LU-9983~~ and that resolves the problem we were seeing there).
2. Fastreg broken on MLX-5. I've been debugging this problem yesterday, and we already have a few bugs that are all related. For this particular problem ib_map_mr_sg() is called to map 11 fragments but ends up mapping 10. I'm trying to look at the mlx5 driver code and understand why it stops before mapping all fragments. The fragments look like:

length = 4096, offset = 0, addr = 105c43c000, page offset = 0, page_addr = 105c43c000
length = 4096, offset = 0, addr = 84cfb0000, page offset = 0, page_addr = 84cfb0000
length = 4096, offset = 0, addr = 84cfb1000, page offset = 0, page_addr = 84cfb1000
length = 4096, offset = 0, addr = 84cfb2000, page offset = 0, page_addr = 84cfb2000
length = 4096, offset = 0, addr = 84cfb3000, page offset = 0, page_addr = 84cfb3000
length = 4096, offset = 0, addr = 84cfb4000, page offset = 0, page_addr = 84cfb4000
length = 4096, offset = 0, addr = 84cfb5000, page offset = 0, page_addr = 84cfb5000
length = 4096, offset = 0, addr = 84cfb6000, page offset = 0, page_addr = 84cfb6000
length = 4096, offset = 0, addr = 84cfb7000, page offset = 0, page_addr = 84cfb7000
length = 73, offset = 0, addr = 84cfb8000, page offset = 0, page_addr = 84cfb8000
length = 4096, offset = 0, addr = 105c43d000, page offset = 0, page_addr = 105c43d000

So it looks like it stops on the fragment of length 73.

3. MLX-4 failure. I still need to investigate further, because it could be different than both of the above.

Amir Shehata (Inactive) added a comment - 06/Oct/17 4:43 PM - edited there appears to be 3 separate issues: 1. Fastreg is not supported on OPA so reverting LU-9810 works (as a side note, I also tried John's patch for LU-9983 and that resolves the problem we were seeing there). 2. Fastreg broken on MLX-5. I've been debugging this problem yesterday, and we already have a few bugs that are all related. For this particular problem ib_map_mr_sg() is called to map 11 fragments but ends up mapping 10. I'm trying to look at the mlx5 driver code and understand why it stops before mapping all fragments. The fragments look like: length = 4096, offset = 0, addr = 105c43c000, page offset = 0, page_addr = 105c43c000 length = 4096, offset = 0, addr = 84cfb0000, page offset = 0, page_addr = 84cfb0000 length = 4096, offset = 0, addr = 84cfb1000, page offset = 0, page_addr = 84cfb1000 length = 4096, offset = 0, addr = 84cfb2000, page offset = 0, page_addr = 84cfb2000 length = 4096, offset = 0, addr = 84cfb3000, page offset = 0, page_addr = 84cfb3000 length = 4096, offset = 0, addr = 84cfb4000, page offset = 0, page_addr = 84cfb4000 length = 4096, offset = 0, addr = 84cfb5000, page offset = 0, page_addr = 84cfb5000 length = 4096, offset = 0, addr = 84cfb6000, page offset = 0, page_addr = 84cfb6000 length = 4096, offset = 0, addr = 84cfb7000, page offset = 0, page_addr = 84cfb7000 length = 73, offset = 0, addr = 84cfb8000, page offset = 0, page_addr = 84cfb8000 length = 4096, offset = 0, addr = 105c43d000, page offset = 0, page_addr = 105c43d000 So it looks like it stops on the fragment of length 73. 3. MLX-4 failure. I still need to investigate further, because it could be different than both of the above.

Cliff White (Inactive) added a comment - 06/Oct/17 4:16 PM

Trying https://review.whamcloud.com/29341 on soak - servers mount okay, some clients are having timeouts

soak-20 login: [ 3363.772469] INFO: task mount.lustre:2191 blocked for more than 120 seconds.^M
[ 3363.782165] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.^M
[ 3363.792785] mount.lustre    D 0000000000000001     0  2191   2190 0x00000080^M
[ 3363.802418]  ffff88103fad3940 0000000000000082 ffff881013fedee0 ffff88103fad3fd8^M
[ 3363.812340]  ffff88103fad3fd8 ffff88103fad3fd8 ffff881013fedee0 ffff881013b87270^M
[ 3363.822136]  7fffffffffffffff ffff881013b87268 ffff881013fedee0 0000000000000001^M
[ 3363.831817] Call Trace:^M
[ 3363.835841]  [<ffffffff816a94e9>] schedule+0x29/0x70^M
[ 3363.842720]  [<ffffffff816a6ff9>] schedule_timeout+0x239/0x2c0^M
[ 3363.850481]  [<ffffffff810c76f5>] ? sched_clock_cpu+0x85/0xc0^M
[ 3363.858146]  [<ffffffff810c12c8>] ? check_preempt_curr+0x78/0xa0^M
[ 3363.866025]  [<ffffffff810c1309>] ? ttwu_do_wakeup+0x19/0xd0^M
[ 3363.873501]  [<ffffffff816a989d>] wait_for_completion+0xfd/0x140^M
[ 3363.881287]  [<ffffffff810c4810>] ? wake_up_state+0x20/0x20^M
[ 3363.888596]  [<ffffffffc09de6b4>] llog_process_or_fork+0x244/0x450 [obdclass]^M
[ 3363.897603]  [<ffffffffc09de8d4>] llog_process+0x14/0x20 [obdclass]^M
[ 3363.905633]  [<ffffffffc0a0ed65>] class_config_parse_llog+0x125/0x350 [obdclass]^M
[ 3363.914874]  [<ffffffffc091c368>] mgc_process_cfg_log+0x788/0xc40 [mgc]^M
[ 3363.923229]  [<ffffffffc091f1e3>] mgc_process_log+0x3d3/0x890 [mgc]^M
[ 3363.931153]  [<ffffffffc0a17480>] ? class_config_dump_handler+0x7e0/0x7e0 [obdclass]^M
[ 3363.940736]  [<ffffffffc091f8e8>] ? do_config_log_add+0x248/0x580 [mgc]^M
[ 3363.949020]  [<ffffffffc09207c0>] mgc_process_config+0x890/0x13f0 [mgc]^M
[ 3363.957326]  [<ffffffffc0a1ad7e>] lustre_process_log+0x2de/0xaf0 [obdclass]^M
[ 3363.965969]  [<ffffffff816aba0e>] ? _raw_spin_unlock_bh+0x1e/0x20^M
[ 3363.973647]  [<ffffffff8132318b>] ? fprop_local_init_percpu+0x1b/0x30^M
[ 3363.981681]  [<ffffffffc0e20608>] ll_fill_super+0xaf8/0x1220 [lustre]^M
[ 3363.989721]  [<ffffffffc0a20ce6>] lustre_fill_super+0x286/0x910 [obdclass]^M
[ 3363.998214]  [<ffffffffc0a20a60>] ? lustre_common_put_super+0x270/0x270 [obdclass]^M
[ 3364.007482]  [<ffffffff81204b0d>] mount_nodev+0x4d/0xb0^M
[ 3364.014099]  [<ffffffffc0a18ba8>] lustre_mount+0x38/0x60 [obdclass]^M
[ 3364.021874]  [<ffffffff81205599>] mount_fs+0x39/0x1b0^M
[ 3364.028291]  [<ffffffff81222067>] vfs_kern_mount+0x67/0x110^M
[ 3364.035288]  [<ffffffff81224573>] do_mount+0x233/0xaf0^M
[ 3364.041786]  [<ffffffff8118760e>] ? __get_free_pages+0xe/0x40^M
[ 3364.048971]  [<ffffffff812251b6>] SyS_mount+0x96/0xf0^M
[ 3364.055364]  [<ffffffff816b5009>] system_call_fastpath+0x16/0x1b^M

Re-trying clients after power-cycling does work

Cliff White (Inactive) added a comment - 06/Oct/17 4:16 PM Trying https://review.whamcloud.com/29341 on soak - servers mount okay, some clients are having timeouts soak-20 login: [ 3363.772469] INFO: task mount.lustre:2191 blocked for more than 120 seconds.^M [ 3363.782165] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.^M [ 3363.792785] mount.lustre D 0000000000000001 0 2191 2190 0x00000080^M [ 3363.802418] ffff88103fad3940 0000000000000082 ffff881013fedee0 ffff88103fad3fd8^M [ 3363.812340] ffff88103fad3fd8 ffff88103fad3fd8 ffff881013fedee0 ffff881013b87270^M [ 3363.822136] 7fffffffffffffff ffff881013b87268 ffff881013fedee0 0000000000000001^M [ 3363.831817] Call Trace:^M [ 3363.835841] [<ffffffff816a94e9>] schedule+0x29/0x70^M [ 3363.842720] [<ffffffff816a6ff9>] schedule_timeout+0x239/0x2c0^M [ 3363.850481] [<ffffffff810c76f5>] ? sched_clock_cpu+0x85/0xc0^M [ 3363.858146] [<ffffffff810c12c8>] ? check_preempt_curr+0x78/0xa0^M [ 3363.866025] [<ffffffff810c1309>] ? ttwu_do_wakeup+0x19/0xd0^M [ 3363.873501] [<ffffffff816a989d>] wait_for_completion+0xfd/0x140^M [ 3363.881287] [<ffffffff810c4810>] ? wake_up_state+0x20/0x20^M [ 3363.888596] [<ffffffffc09de6b4>] llog_process_or_fork+0x244/0x450 [obdclass]^M [ 3363.897603] [<ffffffffc09de8d4>] llog_process+0x14/0x20 [obdclass]^M [ 3363.905633] [<ffffffffc0a0ed65>] class_config_parse_llog+0x125/0x350 [obdclass]^M [ 3363.914874] [<ffffffffc091c368>] mgc_process_cfg_log+0x788/0xc40 [mgc]^M [ 3363.923229] [<ffffffffc091f1e3>] mgc_process_log+0x3d3/0x890 [mgc]^M [ 3363.931153] [<ffffffffc0a17480>] ? class_config_dump_handler+0x7e0/0x7e0 [obdclass]^M [ 3363.940736] [<ffffffffc091f8e8>] ? do_config_log_add+0x248/0x580 [mgc]^M [ 3363.949020] [<ffffffffc09207c0>] mgc_process_config+0x890/0x13f0 [mgc]^M [ 3363.957326] [<ffffffffc0a1ad7e>] lustre_process_log+0x2de/0xaf0 [obdclass]^M [ 3363.965969] [<ffffffff816aba0e>] ? _raw_spin_unlock_bh+0x1e/0x20^M [ 3363.973647] [<ffffffff8132318b>] ? fprop_local_init_percpu+0x1b/0x30^M [ 3363.981681] [<ffffffffc0e20608>] ll_fill_super+0xaf8/0x1220 [lustre]^M [ 3363.989721] [<ffffffffc0a20ce6>] lustre_fill_super+0x286/0x910 [obdclass]^M [ 3363.998214] [<ffffffffc0a20a60>] ? lustre_common_put_super+0x270/0x270 [obdclass]^M [ 3364.007482] [<ffffffff81204b0d>] mount_nodev+0x4d/0xb0^M [ 3364.014099] [<ffffffffc0a18ba8>] lustre_mount+0x38/0x60 [obdclass]^M [ 3364.021874] [<ffffffff81205599>] mount_fs+0x39/0x1b0^M [ 3364.028291] [<ffffffff81222067>] vfs_kern_mount+0x67/0x110^M [ 3364.035288] [<ffffffff81224573>] do_mount+0x233/0xaf0^M [ 3364.041786] [<ffffffff8118760e>] ? __get_free_pages+0xe/0x40^M [ 3364.048971] [<ffffffff812251b6>] SyS_mount+0x96/0xf0^M [ 3364.055364] [<ffffffff816b5009>] system_call_fastpath+0x16/0x1b^M Re-trying clients after power-cycling does work

James A Simmons added a comment - 06/Oct/17 3:37 PM

I tried the patch for ~~LU-9983~~ to see it would address this but it didn't

James A Simmons added a comment - 06/Oct/17 3:37 PM I tried the patch for LU-9983 to see it would address this but it didn't

John Hammond added a comment - 06/Oct/17 2:18 PM

Fails on OPA as well.

John Hammond added a comment - 06/Oct/17 2:18 PM Fails on OPA as well.

Amir Shehata (Inactive) added a comment - 05/Oct/17 11:46 PM

so on MLX5 I'm able to mount, but I'm getting the following error:

[ 1157.516112] LNetError: 34447:0:(o2iblnd.c:1940:kiblnd_fmr_pool_map()) Failed to map mr 10/11 elements
[ 1157.516122] LNetError: 34528:0:(o2iblnd_cb.c:560:kiblnd_fmr_map_tx()) Can't map 41033 pages: -22

That's again due to the fastreg code.

Amir Shehata (Inactive) added a comment - 05/Oct/17 11:46 PM so on MLX5 I'm able to mount, but I'm getting the following error: [ 1157.516112] LNetError: 34447:0:(o2iblnd.c:1940:kiblnd_fmr_pool_map()) Failed to map mr 10/11 elements [ 1157.516122] LNetError: 34528:0:(o2iblnd_cb.c:560:kiblnd_fmr_map_tx()) Can't map 41033 pages: -22 That's again due to the fastreg code.

Gerrit Updater added a comment - 05/Oct/17 9:44 PM

James Nunez (james.a.nunez@intel.com) uploaded a new patch: https://review.whamcloud.com/29341
Subject: ~~LU-10068~~ lnet: Revert prefer Fast Reg
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 004295627f99c906f360acc0624285b4631782a1

Gerrit Updater added a comment - 05/Oct/17 9:44 PM James Nunez (james.a.nunez@intel.com) uploaded a new patch: https://review.whamcloud.com/29341 Subject: LU-10068 lnet: Revert prefer Fast Reg Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: 004295627f99c906f360acc0624285b4631782a1

Cliff White (Inactive) added a comment - 05/Oct/17 6:20 PM

Just to be double-certain,this morning I re-loaded 3637 (last good build) did a complete re-format to obtain a clean FS, verified that everything mounted. Then, I installed build 3638, power-cycled all systems. Verified that LNET was up on all systems, and that all systems could lctl ping the MGS/MDS (soak-8) - attempted to mount the filesystem. Initial MGS/MDS mount fails.

Oct  5 18:16:28 soak-8 kernel: ZFS: Loaded module v0.7.1-1, ZFS pool version 5000, ZFS filesystem version 5
Oct  5 18:16:29 soak-8 kernel: LDISKFS-fs (dm-1): mounted filesystem with ordered data mode. Opts: user_xattr,errors=remount-ro,user_xattr,no_mbcache,nodelalloc
Oct  5 18:16:30 soak-8 kernel: Lustre: MGS: Connection restored to 0e7fadb0-8a70-1717-c708-3785eb9ea4ec (at 192.168.1.108@o2ib)
Oct  5 18:16:31 soak-8 kernel: Lustre: 2365:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1507227390/real 1507227390]  req@ffff8808142d0000 x1580442241204256/t0(0) o253->MGC192.168.1.108@o2ib@192.168.1.108@o2ib:26/25 lens 4768/4768 e 0 to 1 dl 1507227397 ref 2 fl Rpc:eX/0/ffffffff rc 0/-1
Oct  5 18:16:31 soak-8 kernel: LustreError: 2216:0:(events.c:304:request_in_callback()) event type 2, status -103, service mgs
Oct  5 18:16:31 soak-8 kernel: LustreError: 2417:0:(pack_generic.c:588:__lustre_unpack_msg()) message length 0 too small for magic/version check
Oct  5 18:16:31 soak-8 kernel: LustreError: 2417:0:(sec.c:2069:sptlrpc_svc_unwrap_request()) error unpacking request from 12345-192.168.1.108@o2ib x1580442241204256
Oct  5 18:16:31 soak-8 kernel: LustreError: 166-1: MGC192.168.1.108@o2ib: Connection to MGS (at 192.168.1.108@o2ib) was lost; in progress operations using this service will fail
Oct  5 18:16:31 soak-8 kernel: Lustre: MGS: Received new LWP connection from 192.168.1.108@o2ib, removing former export from same NID
Oct  5 18:16:31 soak-8 kernel: Lustre: MGS: Connection restored to 0e7fadb0-8a70-1717-c708-3785eb9ea4ec (at 192.168.1.108@o2ib)
Oct  5 18:16:38 soak-8 kernel: Lustre: 2365:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1507227391/real 1507227391]  req@ffff88080ef08000 x1580442241204384/t0(0) o503->MGC192.168.1.108@o2ib@192.168.1.108@o2ib:26/25 lens 272/8416 e 0 to 1 dl 1507227398 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
Oct  5 18:16:38 soak-8 kernel: LustreError: 166-1: MGC192.168.1.108@o2ib: Connection to MGS (at 192.168.1.108@o2ib) was lost; in progress operations using this service will fail
Oct  5 18:16:38 soak-8 kernel: LustreError: 15c-8: MGC192.168.1.108@o2ib: The configuration from log 'soaked-MDT0000' failed (-5). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
Oct  5 18:16:39 soak-8 kernel: Lustre: MGS: Received new LWP connection from 192.168.1.108@o2ib, removing former export from same NID
Oct  5 18:16:39 soak-8 kernel: Lustre: MGS: Connection restored to 0e7fadb0-8a70-1717-c708-3785eb9ea4ec (at 192.168.1.108@o2ib)
Oct  5 18:16:39 soak-8 kernel: Lustre: Skipped 1 previous similar message
Oct  5 18:16:39 soak-8 kernel: LustreError: 2365:0:(obd_mount_server.c:1370:server_start_targets()) failed to start server soaked-MDT0000: -5
Oct  5 18:16:39 soak-8 kernel: LustreError: 2365:0:(obd_mount_server.c:1863:server_fill_super()) Unable to start targets: -5
Oct  5 18:16:39 soak-8 kernel: LustreError: 2365:0:(obd_mount_server.c:1573:server_put_super()) no obd soaked-MDT0000
Oct  5 18:16:46 soak-8 kernel: Lustre: 2365:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1507227399/real 1507227399]  req@ffff88080ef08000 x1580442241204416/t0(0) o251->MGC192.168.1.108@o2ib@192.168.1.108@o2ib:26/25 lens 224/224 e 0 to 1 dl 1507227405 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
Oct  5 18:16:46 soak-8 kernel: Lustre: server umount soaked-MDT0000 complete
Oct  5 18:16:46 soak-8 kernel: LustreError: 2365:0:(obd_mount.c:1504:lustre_fill_super()) Unable to mount  (-5)
Oct  5 18:16:46 soak-8 sshd[2342]: Received disconnect from 192.168.1.116 port 35786:11: disconnected by user
Oct  5 18:16:46 soak-8 sshd[2342]: Disconnected from 192.168.1.116 port 35786

Cliff White (Inactive) added a comment - 05/Oct/17 6:20 PM Just to be double-certain,this morning I re-loaded 3637 (last good build) did a complete re-format to obtain a clean FS, verified that everything mounted. Then, I installed build 3638, power-cycled all systems. Verified that LNET was up on all systems, and that all systems could lctl ping the MGS/MDS (soak-8) - attempted to mount the filesystem. Initial MGS/MDS mount fails. Oct 5 18:16:28 soak-8 kernel: ZFS: Loaded module v0.7.1-1, ZFS pool version 5000, ZFS filesystem version 5 Oct 5 18:16:29 soak-8 kernel: LDISKFS-fs (dm-1): mounted filesystem with ordered data mode. Opts: user_xattr,errors=remount-ro,user_xattr,no_mbcache,nodelalloc Oct 5 18:16:30 soak-8 kernel: Lustre: MGS: Connection restored to 0e7fadb0-8a70-1717-c708-3785eb9ea4ec (at 192.168.1.108@o2ib) Oct 5 18:16:31 soak-8 kernel: Lustre: 2365:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1507227390/real 1507227390] req@ffff8808142d0000 x1580442241204256/t0(0) o253->MGC192.168.1.108@o2ib@192.168.1.108@o2ib:26/25 lens 4768/4768 e 0 to 1 dl 1507227397 ref 2 fl Rpc:eX/0/ffffffff rc 0/-1 Oct 5 18:16:31 soak-8 kernel: LustreError: 2216:0:(events.c:304:request_in_callback()) event type 2, status -103, service mgs Oct 5 18:16:31 soak-8 kernel: LustreError: 2417:0:(pack_generic.c:588:__lustre_unpack_msg()) message length 0 too small for magic/version check Oct 5 18:16:31 soak-8 kernel: LustreError: 2417:0:(sec.c:2069:sptlrpc_svc_unwrap_request()) error unpacking request from 12345-192.168.1.108@o2ib x1580442241204256 Oct 5 18:16:31 soak-8 kernel: LustreError: 166-1: MGC192.168.1.108@o2ib: Connection to MGS (at 192.168.1.108@o2ib) was lost; in progress operations using this service will fail Oct 5 18:16:31 soak-8 kernel: Lustre: MGS: Received new LWP connection from 192.168.1.108@o2ib, removing former export from same NID Oct 5 18:16:31 soak-8 kernel: Lustre: MGS: Connection restored to 0e7fadb0-8a70-1717-c708-3785eb9ea4ec (at 192.168.1.108@o2ib) Oct 5 18:16:38 soak-8 kernel: Lustre: 2365:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1507227391/real 1507227391] req@ffff88080ef08000 x1580442241204384/t0(0) o503->MGC192.168.1.108@o2ib@192.168.1.108@o2ib:26/25 lens 272/8416 e 0 to 1 dl 1507227398 ref 2 fl Rpc:X/0/ffffffff rc 0/-1 Oct 5 18:16:38 soak-8 kernel: LustreError: 166-1: MGC192.168.1.108@o2ib: Connection to MGS (at 192.168.1.108@o2ib) was lost; in progress operations using this service will fail Oct 5 18:16:38 soak-8 kernel: LustreError: 15c-8: MGC192.168.1.108@o2ib: The configuration from log 'soaked-MDT0000' failed (-5). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information. Oct 5 18:16:39 soak-8 kernel: Lustre: MGS: Received new LWP connection from 192.168.1.108@o2ib, removing former export from same NID Oct 5 18:16:39 soak-8 kernel: Lustre: MGS: Connection restored to 0e7fadb0-8a70-1717-c708-3785eb9ea4ec (at 192.168.1.108@o2ib) Oct 5 18:16:39 soak-8 kernel: Lustre: Skipped 1 previous similar message Oct 5 18:16:39 soak-8 kernel: LustreError: 2365:0:(obd_mount_server.c:1370:server_start_targets()) failed to start server soaked-MDT0000: -5 Oct 5 18:16:39 soak-8 kernel: LustreError: 2365:0:(obd_mount_server.c:1863:server_fill_super()) Unable to start targets: -5 Oct 5 18:16:39 soak-8 kernel: LustreError: 2365:0:(obd_mount_server.c:1573:server_put_super()) no obd soaked-MDT0000 Oct 5 18:16:46 soak-8 kernel: Lustre: 2365:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1507227399/real 1507227399] req@ffff88080ef08000 x1580442241204416/t0(0) o251->MGC192.168.1.108@o2ib@192.168.1.108@o2ib:26/25 lens 224/224 e 0 to 1 dl 1507227405 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1 Oct 5 18:16:46 soak-8 kernel: Lustre: server umount soaked-MDT0000 complete Oct 5 18:16:46 soak-8 kernel: LustreError: 2365:0:(obd_mount.c:1504:lustre_fill_super()) Unable to mount (-5) Oct 5 18:16:46 soak-8 sshd[2342]: Received disconnect from 192.168.1.116 port 35786:11: disconnected by user Oct 5 18:16:46 soak-8 sshd[2342]: Disconnected from 192.168.1.116 port 35786

Cliff White (Inactive) added a comment - 05/Oct/17 6:14 PM

We sure do.

/scratch/logs/syslog/soak-8.log:Oct  4 23:19:28 soak-8 kernel: LNet: Using FastReg for registration
/scratch/logs/syslog/soak-8.log:Oct  5 00:06:13 soak-8 kernel: LNet: Using FastReg for registration
/scratch/logs/syslog/soak-8.log:Oct  5 00:30:19 soak-8 kernel: LNet: Using FastReg for registration
/scratch/logs/syslog/soak-8.log:Oct  5 18:12:54 soak-8 kernel: LNet: Using FastReg for registration
/scratch/logs/syslog/soak-9.log:Oct  3 18:03:02 soak-9 kernel: LNet: Using FastReg for registration
/scratch/logs/syslog/soak-9.log:Oct  4 15:55:02 soak-9 kernel: LNet: Using FastReg for registration
/scratch/logs/syslog/soak-9.log:Oct  4 18:36:43 soak-9 kernel: LNet: Using FastReg for registration
/scratch/logs/syslog/soak-9.log:Oct  4 23:19:30 soak-9 kernel: LNet: Using FastReg for registration
/scratch/logs/syslog/soak-9.log:Oct  5 00:30:19 soak-9 kernel: LNet: Using FastReg for registration
/scratch/logs/syslog/soak-9.log:Oct  5 18:12:55 soak-9 kernel: LNet: Using FastReg for registration

Cliff White (Inactive) added a comment - 05/Oct/17 6:14 PM We sure do. /scratch/logs/syslog/soak-8.log:Oct 4 23:19:28 soak-8 kernel: LNet: Using FastReg for registration /scratch/logs/syslog/soak-8.log:Oct 5 00:06:13 soak-8 kernel: LNet: Using FastReg for registration /scratch/logs/syslog/soak-8.log:Oct 5 00:30:19 soak-8 kernel: LNet: Using FastReg for registration /scratch/logs/syslog/soak-8.log:Oct 5 18:12:54 soak-8 kernel: LNet: Using FastReg for registration /scratch/logs/syslog/soak-9.log:Oct 3 18:03:02 soak-9 kernel: LNet: Using FastReg for registration /scratch/logs/syslog/soak-9.log:Oct 4 15:55:02 soak-9 kernel: LNet: Using FastReg for registration /scratch/logs/syslog/soak-9.log:Oct 4 18:36:43 soak-9 kernel: LNet: Using FastReg for registration /scratch/logs/syslog/soak-9.log:Oct 4 23:19:30 soak-9 kernel: LNet: Using FastReg for registration /scratch/logs/syslog/soak-9.log:Oct 5 00:30:19 soak-9 kernel: LNet: Using FastReg for registration /scratch/logs/syslog/soak-9.log:Oct 5 18:12:55 soak-9 kernel: LNet: Using FastReg for registration

Amir Shehata (Inactive) added a comment - 05/Oct/17 4:51 PM

Do we see this on startup?

LCONSOLE_INFO("Using FastReg for registration\n");

Amir Shehata (Inactive) added a comment - 05/Oct/17 4:51 PM Do we see this on startup? LCONSOLE_INFO( "Using FastReg for registration\n" );

Amir Shehata (Inactive) added a comment - 05/Oct/17 4:30 PM

That's true. OPA supports FMR in software but not FastReg.

Was this run on OPA?

Amir Shehata (Inactive) added a comment - 05/Oct/17 4:30 PM That's true. OPA supports FMR in software but not FastReg. Was this run on OPA?

People

Assignee:: Amir Shehata (Inactive)

Reporter:: Cliff White (Inactive)

Votes:: 0 Vote for this issue

Watchers:: 9 Start watching this issue

Dates

Created:: 03/Oct/17 10:40 PM

Updated:: 12/Jan/18 7:37 PM

Resolved:: 19/Dec/17 10:12 AM