LU-10068: OST fails to mount: LustreError: 14558:0:(pack_generic.c:588:__lustre_unpack_msg()) message length 0 too small for magic/version check

Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Major
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.11.0
    • Environment: Soak cluster - latest lustre-master build (3650) version=2.10.53_32_g20ffe21
    • Severity: 3

    Description

      Reformatted and created a complete new filesystem.
      The MDS/MGS mounts, but the OSS mount fails.
      OSS errors:

      [  584.585335] Lustre: 13317:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1507069853/real 1507069853]  req@ffff88083f758300 x1580277278179360/t0(0) o253->MGC192.168.1.108@o2ib@192.168.1.108@o2ib:26/25 lens 4768/4768 e 0 to 1 dl 1507069860 ref 2 fl Rpc:eX/0/ffffffff rc 0/-1
      [  584.680525] LustreError: 166-1: MGC192.168.1.108@o2ib: Connection to MGS (at 192.168.1.108@o2ib) was lost; in progress operations using this service will fail
      [  584.727353] LustreError: 15f-b: soaked-OST0000: cannot register this server with the MGS: rc = -5. Is the MGS running?
      [  584.755152] Lustre: MGC192.168.1.108@o2ib: Connection restored to MGC192.168.1.108@o2ib_0 (at 192.168.1.108@o2ib)
      [  584.796681] LustreError: 13317:0:(obd_mount_server.c:1863:server_fill_super()) Unable to start targets: -5
      [  584.828634] LustreError: 13317:0:(obd_mount_server.c:1573:server_put_super()) no obd soaked-OST0000
      [  584.858501] LustreError: 13317:0:(obd_mount_server.c:132:server_deregister_mount()) soaked-OST0000 not registered
      [  585.118058] Lustre: server umount soaked-OST0000 complete
      [  585.135868] LustreError: 13317:0:(obd_mount.c:1504:lustre_fill_super()) Unable to mount  (-5)
      

      Errors on MDS

      [17524.112385] Lustre: soaked-MDT0000: new disk, initializing
      [17524.138680] Lustre: soaked-MDT0000: Imperative Recovery not enabled, recovery window 300-900
      [17524.153739] Lustre: ctl-soaked-MDT0000: super-sequence allocation rc = 0 [0x0000000200000400-0x0000000240000400]:0:mdt
      [17528.212947] Lustre: MGS: Connection restored to aa8414d2-f089-bf2e-b8c3-406370f048cc (at 192.168.1.102@o2ib)
      [17528.273221] LustreError: 12811:0:(events.c:304:request_in_callback()) event type 2, status -103, service mgs
      [17528.287356] LustreError: 14558:0:(pack_generic.c:588:__lustre_unpack_msg()) message length 0 too small for magic/version check
      [17528.305726] LustreError: 14558:0:(sec.c:2069:sptlrpc_svc_unwrap_request()) error unpacking request from 12345-192.168.1.102@o2ib x1580277278179360
      [17528.418152] Lustre: MGS: Received new LWP connection from 192.168.1.102@o2ib, removing former export from same NID
      

      Not sure where to go from here.

          Activity


            This was fixed by the LU-9983 patch.

            adilger Andreas Dilger added a comment

            I am ok with closing it.

            cliffw Cliff White (Inactive) added a comment

            The combined patch is no longer needed (part reverted, part landed), can this ticket be closed?

            adilger Andreas Dilger added a comment

            Amir Shehata (amir.shehata@intel.com) uploaded a new patch: https://review.whamcloud.com/29547
            Subject: LU-10068 lnet: Combined patch for testing on soak
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: f4009ebfecc1cfc4066a4b83808b224eae2ab852

            gerrit Gerrit Updater added a comment

            Tested James's patch reverting Fast Reg: https://review.whamcloud.com/29341. Soak is running okay, but a router has hit an LBUG (LU-10103).

            cliffw Cliff White (Inactive) added a comment
            ashehata Amir Shehata (Inactive) added a comment (edited)

            There appear to be three separate issues:
            1. Fastreg is not supported on OPA, so reverting LU-9810 works (as a side note, I also tried John's patch for LU-9983, and that resolves the problem we were seeing there).
            2. Fastreg is broken on MLX-5. I've been debugging this problem since yesterday, and we already have a few bugs that are all related. For this particular problem, ib_map_mr_sg() is called to map 11 fragments but ends up mapping only 10. I'm trying to look at the mlx5 driver code to understand why it stops before mapping all fragments. The fragments look like:

            length = 4096, offset = 0, addr = 105c43c000, page offset = 0, page_addr = 105c43c000
            length = 4096, offset = 0, addr = 84cfb0000, page offset = 0, page_addr = 84cfb0000
            length = 4096, offset = 0, addr = 84cfb1000, page offset = 0, page_addr = 84cfb1000
            length = 4096, offset = 0, addr = 84cfb2000, page offset = 0, page_addr = 84cfb2000
            length = 4096, offset = 0, addr = 84cfb3000, page offset = 0, page_addr = 84cfb3000
            length = 4096, offset = 0, addr = 84cfb4000, page offset = 0, page_addr = 84cfb4000
            length = 4096, offset = 0, addr = 84cfb5000, page offset = 0, page_addr = 84cfb5000
            length = 4096, offset = 0, addr = 84cfb6000, page offset = 0, page_addr = 84cfb6000
            length = 4096, offset = 0, addr = 84cfb7000, page offset = 0, page_addr = 84cfb7000
            length = 73, offset = 0, addr = 84cfb8000, page offset = 0, page_addr = 84cfb8000
            length = 4096, offset = 0, addr = 105c43d000, page offset = 0, page_addr = 105c43d000
            

            So it looks like it stops on the fragment of length 73.

            3. MLX-4 failure. I still need to investigate further, because it could be different from both of the above.
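
            The "stops on the fragment of length 73" behavior is consistent with the contiguity rule fast-reg MRs enforce: every fragment after the first must start on a page boundary, and every fragment before the last must end on one. The 73-byte fragment is the 10th of 11 and does not end on a page boundary, so mapping cannot continue past it. A minimal userspace sketch of that rule (not the actual kernel or mlx5 code; `count_mappable` and the struct are illustrative):

            ```c
            /* Userspace sketch of the page-contiguity rule that a fast-reg
             * MR page list imposes on a fragment list; the helper
             * count_mappable() is hypothetical, not from the kernel tree. */
            #include <stdio.h>
            #include <stddef.h>

            #define PAGE_SIZE 4096UL

            struct frag { unsigned long addr; size_t len; };

            /* Return how many leading fragments can be merged into one MR:
             * stop if a later fragment does not start page-aligned, or if a
             * non-final fragment does not end on a page boundary. */
            static int count_mappable(const struct frag *f, int n)
            {
                int i;
                for (i = 0; i < n; i++) {
                    if (i > 0 && (f[i].addr & (PAGE_SIZE - 1)))
                        break;              /* misaligned start */
                    if (i < n - 1 && ((f[i].addr + f[i].len) & (PAGE_SIZE - 1)))
                        return i + 1;       /* short middle fragment ends the map */
                }
                return i;
            }

            int main(void)
            {
                /* The 11 fragments from the log above: nine full pages,
                 * one 73-byte tail, then one more full page. */
                struct frag frags[11] = {
                    { 0x105c43c000UL, 4096 }, { 0x84cfb0000UL, 4096 },
                    { 0x84cfb1000UL, 4096 },  { 0x84cfb2000UL, 4096 },
                    { 0x84cfb3000UL, 4096 },  { 0x84cfb4000UL, 4096 },
                    { 0x84cfb5000UL, 4096 },  { 0x84cfb6000UL, 4096 },
                    { 0x84cfb7000UL, 4096 },  { 0x84cfb8000UL, 73 },
                    { 0x105c43d000UL, 4096 },
                };

                /* prints: mapped 10/11 fragments */
                printf("mapped %d/11 fragments\n", count_mappable(frags, 11));
                return 0;
            }
            ```

            Under this rule the mapped count matches the 10/11 seen in the logs, which is why only the first/last fragments of an RDMA buffer may be partial pages.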


            Trying https://review.whamcloud.com/29341 on soak - servers mount okay, but some clients are hitting timeouts:

            soak-20 login: [ 3363.772469] INFO: task mount.lustre:2191 blocked for more than 120 seconds.^M
            [ 3363.782165] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.^M
            [ 3363.792785] mount.lustre    D 0000000000000001     0  2191   2190 0x00000080^M
            [ 3363.802418]  ffff88103fad3940 0000000000000082 ffff881013fedee0 ffff88103fad3fd8^M
            [ 3363.812340]  ffff88103fad3fd8 ffff88103fad3fd8 ffff881013fedee0 ffff881013b87270^M
            [ 3363.822136]  7fffffffffffffff ffff881013b87268 ffff881013fedee0 0000000000000001^M
            [ 3363.831817] Call Trace:^M
            [ 3363.835841]  [<ffffffff816a94e9>] schedule+0x29/0x70^M
            [ 3363.842720]  [<ffffffff816a6ff9>] schedule_timeout+0x239/0x2c0^M
            [ 3363.850481]  [<ffffffff810c76f5>] ? sched_clock_cpu+0x85/0xc0^M
            [ 3363.858146]  [<ffffffff810c12c8>] ? check_preempt_curr+0x78/0xa0^M
            [ 3363.866025]  [<ffffffff810c1309>] ? ttwu_do_wakeup+0x19/0xd0^M
            [ 3363.873501]  [<ffffffff816a989d>] wait_for_completion+0xfd/0x140^M
            [ 3363.881287]  [<ffffffff810c4810>] ? wake_up_state+0x20/0x20^M
            [ 3363.888596]  [<ffffffffc09de6b4>] llog_process_or_fork+0x244/0x450 [obdclass]^M
            [ 3363.897603]  [<ffffffffc09de8d4>] llog_process+0x14/0x20 [obdclass]^M
            [ 3363.905633]  [<ffffffffc0a0ed65>] class_config_parse_llog+0x125/0x350 [obdclass]^M
            [ 3363.914874]  [<ffffffffc091c368>] mgc_process_cfg_log+0x788/0xc40 [mgc]^M
            [ 3363.923229]  [<ffffffffc091f1e3>] mgc_process_log+0x3d3/0x890 [mgc]^M
            [ 3363.931153]  [<ffffffffc0a17480>] ? class_config_dump_handler+0x7e0/0x7e0 [obdclass]^M
            [ 3363.940736]  [<ffffffffc091f8e8>] ? do_config_log_add+0x248/0x580 [mgc]^M
            [ 3363.949020]  [<ffffffffc09207c0>] mgc_process_config+0x890/0x13f0 [mgc]^M
            [ 3363.957326]  [<ffffffffc0a1ad7e>] lustre_process_log+0x2de/0xaf0 [obdclass]^M
            [ 3363.965969]  [<ffffffff816aba0e>] ? _raw_spin_unlock_bh+0x1e/0x20^M
            [ 3363.973647]  [<ffffffff8132318b>] ? fprop_local_init_percpu+0x1b/0x30^M
            [ 3363.981681]  [<ffffffffc0e20608>] ll_fill_super+0xaf8/0x1220 [lustre]^M
            [ 3363.989721]  [<ffffffffc0a20ce6>] lustre_fill_super+0x286/0x910 [obdclass]^M
            [ 3363.998214]  [<ffffffffc0a20a60>] ? lustre_common_put_super+0x270/0x270 [obdclass]^M
            [ 3364.007482]  [<ffffffff81204b0d>] mount_nodev+0x4d/0xb0^M
            [ 3364.014099]  [<ffffffffc0a18ba8>] lustre_mount+0x38/0x60 [obdclass]^M
            [ 3364.021874]  [<ffffffff81205599>] mount_fs+0x39/0x1b0^M
            [ 3364.028291]  [<ffffffff81222067>] vfs_kern_mount+0x67/0x110^M
            [ 3364.035288]  [<ffffffff81224573>] do_mount+0x233/0xaf0^M
            [ 3364.041786]  [<ffffffff8118760e>] ? __get_free_pages+0xe/0x40^M
            [ 3364.048971]  [<ffffffff812251b6>] SyS_mount+0x96/0xf0^M
            [ 3364.055364]  [<ffffffff816b5009>] system_call_fastpath+0x16/0x1b^M
            

            Re-trying clients after power-cycling does work.

            cliffw Cliff White (Inactive) added a comment

            I tried the patch for LU-9983 to see if it would address this, but it didn't.

            simmonsja James A Simmons added a comment
            jhammond John Hammond added a comment

            Fails on OPA as well.


            So on MLX5 I'm able to mount, but I'm getting the following errors:

            [ 1157.516112] LNetError: 34447:0:(o2iblnd.c:1940:kiblnd_fmr_pool_map()) Failed to map mr 10/11 elements
            [ 1157.516122] LNetError: 34528:0:(o2iblnd_cb.c:560:kiblnd_fmr_map_tx()) Can't map 41033 pages: -22
            

            That's again due to the fastreg code.

            ashehata Amir Shehata (Inactive) added a comment

            People

              ashehata Amir Shehata (Inactive)
              cliffw Cliff White (Inactive)