[LU-10068] OST fails to mount:LustreError: 14558:0:(pack_generic.c:588:__lustre_unpack_msg()) message length 0 too small for magic/version check Created: 03/Oct/17  Updated: 12/Jan/18  Resolved: 19/Dec/17

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.11.0
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Cliff White (Inactive) Assignee: Amir Shehata (Inactive)
Resolution: Duplicate Votes: 0
Labels: soak
Environment:

Soak cluster - latest lustre-master build (3650) version=2.10.53_32_g20ffe21


Attachments: Text File soak-8.lustre.log.txt    
Issue Links:
Related
is related to LU-9983 LBUG llog_osd.c:327:llog_osd_declare_... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Reformatted, created complete new filesystem.
MDS/MGS mounts.
OSS mount fails
OSS errors

[  584.585335] Lustre: 13317:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1507069853/real 1507069853]  req@ffff88083f758300 x1580277278179360/t0(0) o253->MGC192.168.1.108@o2ib@192.168.1.108@o2ib:26/25 lens 4768/4768 e 0 to 1 dl 1507069860 ref 2 fl Rpc:eX/0/ffffffff rc 0/-1
[  584.680525] LustreError: 166-1: MGC192.168.1.108@o2ib: Connection to MGS (at 192.168.1.108@o2ib) was lost; in progress operations using this service will fail
[  584.727353] LustreError: 15f-b: soaked-OST0000: cannot register this server with the MGS: rc = -5. Is the MGS running?
[  584.755152] Lustre: MGC192.168.1.108@o2ib: Connection restored to MGC192.168.1.108@o2ib_0 (at 192.168.1.108@o2ib)
[  584.796681] LustreError: 13317:0:(obd_mount_server.c:1863:server_fill_super()) Unable to start targets: -5
[  584.828634] LustreError: 13317:0:(obd_mount_server.c:1573:server_put_super()) no obd soaked-OST0000
[  584.858501] LustreError: 13317:0:(obd_mount_server.c:132:server_deregister_mount()) soaked-OST0000 not registered
[  585.118058] Lustre: server umount soaked-OST0000 complete
[  585.135868] LustreError: 13317:0:(obd_mount.c:1504:lustre_fill_super()) Unable to mount  (-5)

Errors on MDS

[17524.112385] Lustre: soaked-MDT0000: new disk, initializing
[17524.138680] Lustre: soaked-MDT0000: Imperative Recovery not enabled, recovery window 300-900
[17524.153739] Lustre: ctl-soaked-MDT0000: super-sequence allocation rc = 0 [0x0000000200000400-0x0000000240000400]:0:mdt
[17528.212947] Lustre: MGS: Connection restored to aa8414d2-f089-bf2e-b8c3-406370f048cc (at 192.168.1.102@o2ib)
[17528.273221] LustreError: 12811:0:(events.c:304:request_in_callback()) event type 2, status -103, service mgs
[17528.287356] LustreError: 14558:0:(pack_generic.c:588:__lustre_unpack_msg()) message length 0 too small for magic/version check
[17528.305726] LustreError: 14558:0:(sec.c:2069:sptlrpc_svc_unwrap_request()) error unpacking request from 12345-192.168.1.102@o2ib x1580277278179360
[17528.418152] Lustre: MGS: Received new LWP connection from 192.168.1.102@o2ib, removing former export from same NID

Not sure where to go from here.



 Comments   
Comment by Cliff White (Inactive) [ 04/Oct/17 ]

I re-loaded b2.10 GA - filesystem formatted and mounted with not issues.
I then loaded lustre-master build 3650, and attempted to mount the previously formatted filesystem. This filesystem was formatted and mounted once, no work other than mount/umount performed. The OST mounts do complete, but there are many errors

Oct  4 16:01:24 soak-2 kernel: Lustre: 24621:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1507132883/real 1507132883]  req@ffff8800b2960300 x1580342920085536/t0(0) o253->MGC192.168.1.108@o2ib@192.168.1.108@o2ib:26/25 lens 4768/4768 e 0 to 1 dl 1507132890 ref 2 fl Rpc:eX/0/ffffffff rc 0/-1
Oct  4 16:01:24 soak-2 kernel: LustreError: 166-1: MGC192.168.1.108@o2ib: Connection to MGS (at 192.168.1.108@o2ib) was lost; in progress operations using this service will fail
Oct  4 16:01:24 soak-2 kernel: Lustre: MGC192.168.1.108@o2ib: Connection restored to MGC192.168.1.108@o2ib_0 (at 192.168.1.108@o2ib)
Oct  4 16:01:24 soak-2 kernel: LustreError: 23440:0:(events.c:199:client_bulk_callback()) event type 2, status -103, desc ffff88082acd7e00
Oct  4 16:01:24 soak-2 kernel: Lustre: 24621:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1507132884/real 1507132884]  req@ffff8803ff218000 x1580342920085632/t0(0) o256->MGC192.168.1.108@o2ib@192.168.1.108@o2ib:26/25 lens 304/240 e 0 to 1 dl 1507132891 ref 2 fl Rpc:eX/0/ffffffff rc 0/-1
Oct  4 16:01:24 soak-2 kernel: LustreError: 166-1: MGC192.168.1.108@o2ib: Connection to MGS (at 192.168.1.108@o2ib) was lost; in progress operations using this service will fail
Oct  4 16:01:25 soak-2 kernel: LustreError: 24621:0:(mgc_request.c:251:do_config_log_add()) MGC192.168.1.108@o2ib: failed processing log, type 4: rc = -5
Oct  4 16:01:25 soak-2 kernel: Lustre: MGC192.168.1.108@o2ib: Connection restored to MGC192.168.1.108@o2ib_0 (at 192.168.1.108@o2ib)
Oct  4 16:01:32 soak-2 kernel: Lustre: 24621:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1507132885/real 1507132885]  req@ffff8803ff218000 x1580342920085696/t0(0) o503->MGC192.168.1.108@o2ib@192.168.1.108@o2ib:26/25 lens 272/8416 e 0 to 1 dl 1507132892 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
Oct  4 16:01:32 soak-2 kernel: LustreError: 166-1: MGC192.168.1.108@o2ib: Connection to MGS (at 192.168.1.108@o2ib) was lost; in progress operations using this service will fail
Oct  4 16:01:33 soak-2 kernel: LustreError: 24621:0:(mgc_request.c:1885:mgc_llog_local_copy()) MGC192.168.1.108@o2ib: failed to copy remote log soaked-OST0000: rc = -5
Oct  4 16:01:33 soak-2 kernel: LustreError: 24937:0:(ldlm_resource.c:1100:ldlm_resource_complain()) MGC192.168.1.108@o2ib: namespace resource [0x64656b616f73:0x0:0x0].0x0 (ffff8804184e1200) refcount nonzero (1) after lock cleanup; forcing cleanup.
Oct  4 16:01:33 soak-2 kernel: LustreError: 24937:0:(ldlm_resource.c:1682:ldlm_resource_dump()) --- Resource: [0x64656b616f73:0x0:0x0].0x0 (ffff8804184e1200) refcount = 2
Oct  4 16:01:33 soak-2 kernel: LustreError: 24937:0:(ldlm_resource.c:1685:ldlm_resource_dump()) Granted locks (in reverse order):
Oct  4 16:01:33 soak-2 kernel: LustreError: 24937:0:(ldlm_resource.c:1688:ldlm_resource_dump()) ### ### ns: MGC192.168.1.108@o2ib lock: ffff88040eb20400/0xc3f9d16c2df9943e lrc: 2/1,0 mode: CR/CR res: [0x64656b616f73:0x0:0x0].0x0 rrc: 3 type: PLN flags: 0x1106400000000 nid: local remote: 0xc8dccd9f1ede04d5 expref: -99 pid: 24621 timeout: 0 lvb_type: 0
Oct  4 16:01:33 soak-2 kernel: Lustre: MGC192.168.1.108@o2ib: Connection restored to MGC192.168.1.108@o2ib_0 (at 192.168.1.108@o2ib)
Oct  4 16:01:34 soak-2 kernel: LustreError: 23440:0:(events.c:199:client_bulk_callback()) event type 2, status -103, desc ffff88082cc40800
Oct  4 16:01:34 soak-2 kernel: LustreError: 24665:0:(mgc_request.c:603:do_requeue()) failed processing log: -5
Oct  4 16:01:45 soak-2 kernel: LustreError: 137-5: soaked-OST0001_UUID: not available for connect from 192.168.1.108@o2ib (no target). If you are running an HA pair check that the target is mounted on the other server.
Comment by Cliff White (Inactive) [ 04/Oct/17 ]

In addition, the OST mounts are very, very slow

Comment by Cliff White (Inactive) [ 04/Oct/17 ]

However, now all client mount attempts fail:

[ 1891.859459] Lustre: 9069:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1507133982/real 1507133982]  req@ffff880813ea0300 x1580342921134176/t0(0) o503->MGC192.168.1.108@o2ib@192.168.1.108@o2ib:26/25 lens 272/8416 e 0 to 1 dl 1507133989 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
[ 1891.952822] LustreError: 166-1: MGC192.168.1.108@o2ib: Connection to MGS (at 192.168.1.108@o2ib) was lost; in progress operations using this service will fail
[ 1891.999529] LustreError: 15c-8: MGC192.168.1.108@o2ib: The configuration from log 'soaked-client' failed (-5). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
[ 1892.002601] Lustre: MGC192.168.1.108@o2ib: Connection restored to MGC192.168.1.108@o2ib_0 (at 192.168.1.108@o2ib)
[ 1892.110166] Lustre: Unmounted soaked-client
Comment by James Nunez (Inactive) [ 04/Oct/17 ]

The last master build that ran successfully on soak was build #3637.

Comment by Cliff White (Inactive) [ 04/Oct/17 ]

Reverted back to build 3649, and it's even worse! The initial MGT/MDT mount hangs, big time.

Oct  4 18:40:43 soak-8 kernel: INFO: task mount.lustre:2489 blocked for more than 120 seconds.
Oct  4 18:40:43 soak-8 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct  4 18:40:43 soak-8 kernel: mount.lustre    D 0000000000000001     0  2489   2488 0x00000082
Oct  4 18:40:44 soak-8 kernel: ffff8804102d77b0 0000000000000082 ffff88041140dee0 ffff8804102d7fd8
Oct  4 18:40:44 soak-8 kernel: ffff8804102d7fd8 ffff8804102d7fd8 ffff88041140dee0 ffff88082c4c4810
Oct  4 18:40:44 soak-8 kernel: 7fffffffffffffff ffff88082c4c4808 ffff88041140dee0 0000000000000001
Oct  4 18:40:44 soak-8 kernel: Call Trace:
Oct  4 18:40:44 soak-8 kernel: [<ffffffff816a94a9>] schedule+0x29/0x70
Oct  4 18:40:44 soak-8 kernel: [<ffffffff816a6fb9>] schedule_timeout+0x239/0x2c0
Oct  4 18:40:44 soak-8 kernel: [<ffffffff81050b5c>] ? native_smp_send_reschedule+0x4c/0x70
Oct  4 18:40:44 soak-8 kernel: [<ffffffff810c0548>] ? resched_curr+0xa8/0xc0
Oct  4 18:40:44 soak-8 kernel: [<ffffffff810c12c8>] ? check_preempt_curr+0x78/0xa0
Oct  4 18:40:44 soak-8 kernel: [<ffffffff810c1309>] ? ttwu_do_wakeup+0x19/0xd0
Oct  4 18:40:44 soak-8 kernel: [<ffffffff816a985d>] wait_for_completion+0xfd/0x140
Oct  4 18:40:44 soak-8 kernel: [<ffffffff810c4810>] ? wake_up_state+0x20/0x20
Oct  4 18:40:44 soak-8 kernel: [<ffffffffc0a496b4>] llog_process_or_fork+0x244/0x450 [obdclass]
Oct  4 18:40:44 soak-8 kernel: [<ffffffffc0a498d4>] llog_process+0x14/0x20 [obdclass]
Oct  4 18:40:44 soak-8 kernel: [<ffffffffc0a79e95>] class_config_parse_llog+0x125/0x350 [obdclass]
Oct  4 18:40:44 soak-8 kernel: [<ffffffffc14465a8>] mgc_process_cfg_log+0x788/0xc40 [mgc]
Oct  4 18:40:44 soak-8 kernel: [<ffffffffc1449413>] mgc_process_log+0x3d3/0x890 [mgc]
Oct  4 18:40:44 soak-8 kernel: [<ffffffffc0a826b0>] ? class_config_dump_handler+0x7e0/0x7e0 [obdclass]
Oct  4 18:40:44 soak-8 kernel: [<ffffffffc1449b18>] ? do_config_log_add+0x248/0x580 [mgc]
Oct  4 18:40:44 soak-8 kernel: [<ffffffffc144a9f0>] mgc_process_config+0x890/0x13f0 [mgc]
Oct  4 18:40:44 soak-8 kernel: [<ffffffffc0a861ea>] lustre_process_log+0x2da/0xae0 [obdclass]
Oct  4 18:40:44 soak-8 kernel: [<ffffffffc0927ba7>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
Oct  4 18:40:44 soak-8 kernel: [<ffffffffc0a713e9>] ? lprocfs_counter_add+0xf9/0x160 [obdclass]
Oct  4 18:40:44 soak-8 kernel: [<ffffffffc0ab2da2>] server_start_targets+0x1352/0x2a70 [obdclass]
Oct  4 18:40:44 soak-8 kernel: [<ffffffffc0cd0b48>] ? ptlrpc_pinger_wake_up+0x28/0x30 [ptlrpc]
Oct  4 18:40:44 soak-8 kernel: [<ffffffffc0a71511>] ? lprocfs_counter_sub+0xc1/0x130 [obdclass]
Oct  4 18:40:44 soak-8 kernel: [<ffffffffc0a826b0>] ? class_config_dump_handler+0x7e0/0x7e0 [obdclass]
Oct  4 18:40:44 soak-8 kernel: [<ffffffffc0ab554d>] server_fill_super+0x108d/0x184c [obdclass]
Oct  4 18:40:44 soak-8 kernel: [<ffffffffc0a8c168>] lustre_fill_super+0x328/0x950 [obdclass]
Oct  4 18:40:44 soak-8 kernel: [<ffffffffc0a8be40>] ? lustre_common_put_super+0x270/0x270 [obdclass]
Oct  4 18:40:44 soak-8 kernel: [<ffffffff81204afd>] mount_nodev+0x4d/0xb0
Oct  4 18:40:44 soak-8 kernel: [<ffffffffc0a83f78>] lustre_mount+0x38/0x60 [obdclass]
Oct  4 18:40:44 soak-8 kernel: [<ffffffff81205589>] mount_fs+0x39/0x1b0
Oct  4 18:40:44 soak-8 kernel: [<ffffffff81222067>] vfs_kern_mount+0x67/0x110
Oct  4 18:40:44 soak-8 kernel: [<ffffffff81224573>] do_mount+0x233/0xaf0
Oct  4 18:40:44 soak-8 kernel: [<ffffffff8118760e>] ? __get_free_pages+0xe/0x40
Oct  4 18:40:44 soak-8 kernel: [<ffffffff812251b6>] SyS_mount+0x96/0xf0
Oct  4 18:40:44 soak-8 kernel: [<ffffffff816b4fc9>] system_call_fastpath+0x16/0x1b
Oct  4 18:42:44 soak-8 kernel: INFO: task mount.lustre:2489 blocked for more than 120 seconds.
Comment by Cliff White (Inactive) [ 04/Oct/17 ]

Did a debug=-1 dump of soak-8 (the MDS) while the mount was hanging. Attached.

Comment by Cliff White (Inactive) [ 04/Oct/17 ]

Tested build 3637, worked fine.
Build 3638 fails immedately

[  230.588325] LustreError: 166-1: MGC192.168.1.108@o2ib: Connection to MGS (at 192.168.1.108@o2ib) was lost; in progress operations using this service will fail
[  230.635151] LustreError: 15c-8: MGC192.168.1.108@o2ib: The configuration from log 'soaked-MDT0000' failed (-5). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
[  230.638598] Lustre: MGS: Received new LWP connection from 192.168.1.108@o2ib, removing former export from same NID
[  230.638625] Lustre: MGS: Connection restored to 8c9d0ffb-0dac-9c76-1262-80c392d2d0ce (at 192.168.1.108@o2ib)
[  230.638628] Lustre: Skipped 2 previous similar messages
[  230.796182] LustreError: 2039:0:(obd_mount_server.c:1370:server_start_targets()) failed to start server soaked-MDT0000: -5
[  230.873594] LustreError: 2039:0:(obd_mount_server.c:1863:server_fill_super()) Unable to start targets: -5
[  230.905264] LustreError: 2039:0:(obd_mount_server.c:1573:server_put_super()) no obd soaked-MDT0000
Comment by Andreas Dilger [ 04/Oct/17 ]

What patches were landed between these two builds?

Comment by James Nunez (Inactive) [ 05/Oct/17 ]

Between builds 3637 and 3638, the following patches landed:
LU-9907 build: add patchless server for lbuild — oleg.drokin / gitweb
LU-7988 hsm: update many cookie status at once — oleg.drokin / gitweb
LU-7001 osp: fix llog processing — oleg.drokin / gitweb
LU-9810 lnet: fix build with M-OFED 4.1 — oleg.drokin / gitweb
LU-9810 lnet: prefer Fast Reg — oleg.drokin / gitweb
LU-9260 test: Use the correct mount device when test against lustre — oleg.drokin / gitweb
LU-9941 lov: lsm_is_composite isn't right — oleg.drokin / gitweb
LU-8342 utils: Set dnodesize/recordsize at zfs dataset create — oleg.drokin / gitweb
LU-9044 test: remove conf-sanity tests from ALWAYS_EXCEPT — oleg.drokin / gitweb
LU-9870 build: handle SNMP missing on build box — oleg.drokin / gitweb
LU-9891 tests: Increase space not released for ZFS — oleg.drokin / gitweb
LU-7746 tests: skip tests for older (upstream) client — oleg.drokin / gitweb
LU-9921 lnet: resolve unsafe list access — oleg.drokin / gitweb
LU-9950 build: add support for Ubuntu(debian) arm64 — oleg.drokin / gitweb

Comment by Cliff White (Inactive) [ 05/Oct/17 ]

Having some problems also with lustre-review builds, so may also be soak, will re-load 3637 tomorrow to confirm.

Comment by Andreas Dilger [ 05/Oct/17 ]

Based on the above patch descriptions, my first guess would be LU-7001.

Alternately, if this is only being seen on IB then LU-9810 seems likely.

Comment by John Hammond [ 05/Oct/17 ]

It's likely that LU-9810 is broken for OPA. Since I seem to recall that OPA supports FMR but not FastReg. Amir?

Comment by Amir Shehata (Inactive) [ 05/Oct/17 ]

That's true. OPA supports FMR in software but not FastReg.

Was this run on OPA?

Comment by Amir Shehata (Inactive) [ 05/Oct/17 ]

Do we see this on startup?

LCONSOLE_INFO("Using FastReg for registration\n");
Comment by Cliff White (Inactive) [ 05/Oct/17 ]

We sure do.

/scratch/logs/syslog/soak-8.log:Oct  4 23:19:28 soak-8 kernel: LNet: Using FastReg for registration
/scratch/logs/syslog/soak-8.log:Oct  5 00:06:13 soak-8 kernel: LNet: Using FastReg for registration
/scratch/logs/syslog/soak-8.log:Oct  5 00:30:19 soak-8 kernel: LNet: Using FastReg for registration
/scratch/logs/syslog/soak-8.log:Oct  5 18:12:54 soak-8 kernel: LNet: Using FastReg for registration
/scratch/logs/syslog/soak-9.log:Oct  3 18:03:02 soak-9 kernel: LNet: Using FastReg for registration
/scratch/logs/syslog/soak-9.log:Oct  4 15:55:02 soak-9 kernel: LNet: Using FastReg for registration
/scratch/logs/syslog/soak-9.log:Oct  4 18:36:43 soak-9 kernel: LNet: Using FastReg for registration
/scratch/logs/syslog/soak-9.log:Oct  4 23:19:30 soak-9 kernel: LNet: Using FastReg for registration
/scratch/logs/syslog/soak-9.log:Oct  5 00:30:19 soak-9 kernel: LNet: Using FastReg for registration
/scratch/logs/syslog/soak-9.log:Oct  5 18:12:55 soak-9 kernel: LNet: Using FastReg for registration
Comment by Cliff White (Inactive) [ 05/Oct/17 ]

Just to be double-certain,this morning I re-loaded 3637 (last good build) did a complete re-format to obtain a clean FS, verified that everything mounted. Then, I installed build 3638, power-cycled all systems. Verified that LNET was up on all systems, and that all systems could lctl ping the MGS/MDS (soak-8) - attempted to mount the filesystem. Initial MGS/MDS mount fails.

Oct  5 18:16:28 soak-8 kernel: ZFS: Loaded module v0.7.1-1, ZFS pool version 5000, ZFS filesystem version 5
Oct  5 18:16:29 soak-8 kernel: LDISKFS-fs (dm-1): mounted filesystem with ordered data mode. Opts: user_xattr,errors=remount-ro,user_xattr,no_mbcache,nodelalloc
Oct  5 18:16:30 soak-8 kernel: Lustre: MGS: Connection restored to 0e7fadb0-8a70-1717-c708-3785eb9ea4ec (at 192.168.1.108@o2ib)
Oct  5 18:16:31 soak-8 kernel: Lustre: 2365:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1507227390/real 1507227390]  req@ffff8808142d0000 x1580442241204256/t0(0) o253->MGC192.168.1.108@o2ib@192.168.1.108@o2ib:26/25 lens 4768/4768 e 0 to 1 dl 1507227397 ref 2 fl Rpc:eX/0/ffffffff rc 0/-1
Oct  5 18:16:31 soak-8 kernel: LustreError: 2216:0:(events.c:304:request_in_callback()) event type 2, status -103, service mgs
Oct  5 18:16:31 soak-8 kernel: LustreError: 2417:0:(pack_generic.c:588:__lustre_unpack_msg()) message length 0 too small for magic/version check
Oct  5 18:16:31 soak-8 kernel: LustreError: 2417:0:(sec.c:2069:sptlrpc_svc_unwrap_request()) error unpacking request from 12345-192.168.1.108@o2ib x1580442241204256
Oct  5 18:16:31 soak-8 kernel: LustreError: 166-1: MGC192.168.1.108@o2ib: Connection to MGS (at 192.168.1.108@o2ib) was lost; in progress operations using this service will fail
Oct  5 18:16:31 soak-8 kernel: Lustre: MGS: Received new LWP connection from 192.168.1.108@o2ib, removing former export from same NID
Oct  5 18:16:31 soak-8 kernel: Lustre: MGS: Connection restored to 0e7fadb0-8a70-1717-c708-3785eb9ea4ec (at 192.168.1.108@o2ib)
Oct  5 18:16:38 soak-8 kernel: Lustre: 2365:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1507227391/real 1507227391]  req@ffff88080ef08000 x1580442241204384/t0(0) o503->MGC192.168.1.108@o2ib@192.168.1.108@o2ib:26/25 lens 272/8416 e 0 to 1 dl 1507227398 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
Oct  5 18:16:38 soak-8 kernel: LustreError: 166-1: MGC192.168.1.108@o2ib: Connection to MGS (at 192.168.1.108@o2ib) was lost; in progress operations using this service will fail
Oct  5 18:16:38 soak-8 kernel: LustreError: 15c-8: MGC192.168.1.108@o2ib: The configuration from log 'soaked-MDT0000' failed (-5). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
Oct  5 18:16:39 soak-8 kernel: Lustre: MGS: Received new LWP connection from 192.168.1.108@o2ib, removing former export from same NID
Oct  5 18:16:39 soak-8 kernel: Lustre: MGS: Connection restored to 0e7fadb0-8a70-1717-c708-3785eb9ea4ec (at 192.168.1.108@o2ib)
Oct  5 18:16:39 soak-8 kernel: Lustre: Skipped 1 previous similar message
Oct  5 18:16:39 soak-8 kernel: LustreError: 2365:0:(obd_mount_server.c:1370:server_start_targets()) failed to start server soaked-MDT0000: -5
Oct  5 18:16:39 soak-8 kernel: LustreError: 2365:0:(obd_mount_server.c:1863:server_fill_super()) Unable to start targets: -5
Oct  5 18:16:39 soak-8 kernel: LustreError: 2365:0:(obd_mount_server.c:1573:server_put_super()) no obd soaked-MDT0000
Oct  5 18:16:46 soak-8 kernel: Lustre: 2365:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1507227399/real 1507227399]  req@ffff88080ef08000 x1580442241204416/t0(0) o251->MGC192.168.1.108@o2ib@192.168.1.108@o2ib:26/25 lens 224/224 e 0 to 1 dl 1507227405 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
Oct  5 18:16:46 soak-8 kernel: Lustre: server umount soaked-MDT0000 complete
Oct  5 18:16:46 soak-8 kernel: LustreError: 2365:0:(obd_mount.c:1504:lustre_fill_super()) Unable to mount  (-5)
Oct  5 18:16:46 soak-8 sshd[2342]: Received disconnect from 192.168.1.116 port 35786:11: disconnected by user
Oct  5 18:16:46 soak-8 sshd[2342]: Disconnected from 192.168.1.116 port 35786
Comment by Gerrit Updater [ 05/Oct/17 ]

James Nunez (james.a.nunez@intel.com) uploaded a new patch: https://review.whamcloud.com/29341
Subject: LU-10068 lnet: Revert prefer Fast Reg
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 004295627f99c906f360acc0624285b4631782a1

Comment by Amir Shehata (Inactive) [ 05/Oct/17 ]

so on MLX5 I'm able to mount, but I'm getting the following error:

[ 1157.516112] LNetError: 34447:0:(o2iblnd.c:1940:kiblnd_fmr_pool_map()) Failed to map mr 10/11 elements
[ 1157.516122] LNetError: 34528:0:(o2iblnd_cb.c:560:kiblnd_fmr_map_tx()) Can't map 41033 pages: -22

That's again due to the fastreg code.

Comment by John Hammond [ 06/Oct/17 ]

Fails on OPA as well.

Comment by James A Simmons [ 06/Oct/17 ]

I tried the patch for LU-9983 to see it would address this but it didn't

Comment by Cliff White (Inactive) [ 06/Oct/17 ]

Trying https://review.whamcloud.com/29341 on soak - servers mount okay, some clients are having timeouts

soak-20 login: [ 3363.772469] INFO: task mount.lustre:2191 blocked for more than 120 seconds.^M
[ 3363.782165] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.^M
[ 3363.792785] mount.lustre    D 0000000000000001     0  2191   2190 0x00000080^M
[ 3363.802418]  ffff88103fad3940 0000000000000082 ffff881013fedee0 ffff88103fad3fd8^M
[ 3363.812340]  ffff88103fad3fd8 ffff88103fad3fd8 ffff881013fedee0 ffff881013b87270^M
[ 3363.822136]  7fffffffffffffff ffff881013b87268 ffff881013fedee0 0000000000000001^M
[ 3363.831817] Call Trace:^M
[ 3363.835841]  [<ffffffff816a94e9>] schedule+0x29/0x70^M
[ 3363.842720]  [<ffffffff816a6ff9>] schedule_timeout+0x239/0x2c0^M
[ 3363.850481]  [<ffffffff810c76f5>] ? sched_clock_cpu+0x85/0xc0^M
[ 3363.858146]  [<ffffffff810c12c8>] ? check_preempt_curr+0x78/0xa0^M
[ 3363.866025]  [<ffffffff810c1309>] ? ttwu_do_wakeup+0x19/0xd0^M
[ 3363.873501]  [<ffffffff816a989d>] wait_for_completion+0xfd/0x140^M
[ 3363.881287]  [<ffffffff810c4810>] ? wake_up_state+0x20/0x20^M
[ 3363.888596]  [<ffffffffc09de6b4>] llog_process_or_fork+0x244/0x450 [obdclass]^M
[ 3363.897603]  [<ffffffffc09de8d4>] llog_process+0x14/0x20 [obdclass]^M
[ 3363.905633]  [<ffffffffc0a0ed65>] class_config_parse_llog+0x125/0x350 [obdclass]^M
[ 3363.914874]  [<ffffffffc091c368>] mgc_process_cfg_log+0x788/0xc40 [mgc]^M
[ 3363.923229]  [<ffffffffc091f1e3>] mgc_process_log+0x3d3/0x890 [mgc]^M
[ 3363.931153]  [<ffffffffc0a17480>] ? class_config_dump_handler+0x7e0/0x7e0 [obdclass]^M
[ 3363.940736]  [<ffffffffc091f8e8>] ? do_config_log_add+0x248/0x580 [mgc]^M
[ 3363.949020]  [<ffffffffc09207c0>] mgc_process_config+0x890/0x13f0 [mgc]^M
[ 3363.957326]  [<ffffffffc0a1ad7e>] lustre_process_log+0x2de/0xaf0 [obdclass]^M
[ 3363.965969]  [<ffffffff816aba0e>] ? _raw_spin_unlock_bh+0x1e/0x20^M
[ 3363.973647]  [<ffffffff8132318b>] ? fprop_local_init_percpu+0x1b/0x30^M
[ 3363.981681]  [<ffffffffc0e20608>] ll_fill_super+0xaf8/0x1220 [lustre]^M
[ 3363.989721]  [<ffffffffc0a20ce6>] lustre_fill_super+0x286/0x910 [obdclass]^M
[ 3363.998214]  [<ffffffffc0a20a60>] ? lustre_common_put_super+0x270/0x270 [obdclass]^M
[ 3364.007482]  [<ffffffff81204b0d>] mount_nodev+0x4d/0xb0^M
[ 3364.014099]  [<ffffffffc0a18ba8>] lustre_mount+0x38/0x60 [obdclass]^M
[ 3364.021874]  [<ffffffff81205599>] mount_fs+0x39/0x1b0^M
[ 3364.028291]  [<ffffffff81222067>] vfs_kern_mount+0x67/0x110^M
[ 3364.035288]  [<ffffffff81224573>] do_mount+0x233/0xaf0^M
[ 3364.041786]  [<ffffffff8118760e>] ? __get_free_pages+0xe/0x40^M
[ 3364.048971]  [<ffffffff812251b6>] SyS_mount+0x96/0xf0^M
[ 3364.055364]  [<ffffffff816b5009>] system_call_fastpath+0x16/0x1b^M

Re-trying clients after power-cycling does work

Comment by Amir Shehata (Inactive) [ 06/Oct/17 ]

there appears to be 3 separate issues:
1. Fastreg is not supported on OPA so reverting LU-9810 works (as a side note, I also tried John's patch for LU-9983 and that resolves the problem we were seeing there).
2. Fastreg broken on MLX-5. I've been debugging this problem yesterday, and we already have a few bugs that are all related. For this particular problem ib_map_mr_sg() is called to map 11 fragments but ends up mapping 10. I'm trying to look at the mlx5 driver code and understand why it stops before mapping all fragments. The fragments look like:

length = 4096, offset = 0, addr = 105c43c000, page offset = 0, page_addr = 105c43c000
length = 4096, offset = 0, addr = 84cfb0000, page offset = 0, page_addr = 84cfb0000
length = 4096, offset = 0, addr = 84cfb1000, page offset = 0, page_addr = 84cfb1000
length = 4096, offset = 0, addr = 84cfb2000, page offset = 0, page_addr = 84cfb2000
length = 4096, offset = 0, addr = 84cfb3000, page offset = 0, page_addr = 84cfb3000
length = 4096, offset = 0, addr = 84cfb4000, page offset = 0, page_addr = 84cfb4000
length = 4096, offset = 0, addr = 84cfb5000, page offset = 0, page_addr = 84cfb5000
length = 4096, offset = 0, addr = 84cfb6000, page offset = 0, page_addr = 84cfb6000
length = 4096, offset = 0, addr = 84cfb7000, page offset = 0, page_addr = 84cfb7000
length = 73, offset = 0, addr = 84cfb8000, page offset = 0, page_addr = 84cfb8000
length = 4096, offset = 0, addr = 105c43d000, page offset = 0, page_addr = 105c43d000

So it looks like it stops on the fragment of length 73.

3. MLX-4 failure. I still need to investigate further, because it could be different than both of the above.

Comment by Cliff White (Inactive) [ 06/Oct/17 ]

Tested James patch reverting Fast Reg. https://review.whamcloud.com/29341. Soak is running okay, but router have hit LBUG (LU-10103)

Comment by Gerrit Updater [ 10/Oct/17 ]

Amir Shehata (amir.shehata@intel.com) uploaded a new patch: https://review.whamcloud.com/29547
Subject: LU-10068 lnet: Combined patch for testing on soak
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: f4009ebfecc1cfc4066a4b83808b224eae2ab852

Comment by Andreas Dilger [ 18/Dec/17 ]

The combined patch is no longer needed (part reverted, part landed), can this ticket be closed?

Comment by Cliff White (Inactive) [ 18/Dec/17 ]

I am ok with closing it.

Comment by Andreas Dilger [ 19/Dec/17 ]

This was fixed by the LU-9983 patch.

Generated at Sat Feb 10 02:31:45 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.