[LU-10068] OST fails to mount:LustreError: 14558:0:(pack_generic.c:588:__lustre_unpack_msg()) message length 0 too small for magic/version check Created: 03/Oct/17 Updated: 12/Jan/18 Resolved: 19/Dec/17 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.11.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Cliff White (Inactive) | Assignee: | Amir Shehata (Inactive) |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | soak | ||
| Environment: |
Soak cluster - latest lustre-master build (3650) version=2.10.53_32_g20ffe21 |
||
| Attachments: |
|
||||||||
| Issue Links: |
|
||||||||
| Severity: | 3 | ||||||||
| Rank (Obsolete): | 9223372036854775807 | ||||||||
| Description |
|
Reformatted, created complete new filesystem. [ 584.585335] Lustre: 13317:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1507069853/real 1507069853] req@ffff88083f758300 x1580277278179360/t0(0) o253->MGC192.168.1.108@o2ib@192.168.1.108@o2ib:26/25 lens 4768/4768 e 0 to 1 dl 1507069860 ref 2 fl Rpc:eX/0/ffffffff rc 0/-1 [ 584.680525] LustreError: 166-1: MGC192.168.1.108@o2ib: Connection to MGS (at 192.168.1.108@o2ib) was lost; in progress operations using this service will fail [ 584.727353] LustreError: 15f-b: soaked-OST0000: cannot register this server with the MGS: rc = -5. Is the MGS running? [ 584.755152] Lustre: MGC192.168.1.108@o2ib: Connection restored to MGC192.168.1.108@o2ib_0 (at 192.168.1.108@o2ib) [ 584.796681] LustreError: 13317:0:(obd_mount_server.c:1863:server_fill_super()) Unable to start targets: -5 [ 584.828634] LustreError: 13317:0:(obd_mount_server.c:1573:server_put_super()) no obd soaked-OST0000 [ 584.858501] LustreError: 13317:0:(obd_mount_server.c:132:server_deregister_mount()) soaked-OST0000 not registered [ 585.118058] Lustre: server umount soaked-OST0000 complete [ 585.135868] LustreError: 13317:0:(obd_mount.c:1504:lustre_fill_super()) Unable to mount (-5) Errors on MDS [17524.112385] Lustre: soaked-MDT0000: new disk, initializing [17524.138680] Lustre: soaked-MDT0000: Imperative Recovery not enabled, recovery window 300-900 [17524.153739] Lustre: ctl-soaked-MDT0000: super-sequence allocation rc = 0 [0x0000000200000400-0x0000000240000400]:0:mdt [17528.212947] Lustre: MGS: Connection restored to aa8414d2-f089-bf2e-b8c3-406370f048cc (at 192.168.1.102@o2ib) [17528.273221] LustreError: 12811:0:(events.c:304:request_in_callback()) event type 2, status -103, service mgs [17528.287356] LustreError: 14558:0:(pack_generic.c:588:__lustre_unpack_msg()) message length 0 too small for magic/version check [17528.305726] LustreError: 14558:0:(sec.c:2069:sptlrpc_svc_unwrap_request()) error unpacking request from 12345-192.168.1.102@o2ib x1580277278179360 [17528.418152] Lustre: MGS: Received new LWP connection from 192.168.1.102@o2ib, removing former export from same NID Not sure where to go from here. |
| Comments |
| Comment by Cliff White (Inactive) [ 04/Oct/17 ] |
|
I re-loaded b2.10 GA - filesystem formatted and mounted with not issues. Oct 4 16:01:24 soak-2 kernel: Lustre: 24621:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1507132883/real 1507132883] req@ffff8800b2960300 x1580342920085536/t0(0) o253->MGC192.168.1.108@o2ib@192.168.1.108@o2ib:26/25 lens 4768/4768 e 0 to 1 dl 1507132890 ref 2 fl Rpc:eX/0/ffffffff rc 0/-1 Oct 4 16:01:24 soak-2 kernel: LustreError: 166-1: MGC192.168.1.108@o2ib: Connection to MGS (at 192.168.1.108@o2ib) was lost; in progress operations using this service will fail Oct 4 16:01:24 soak-2 kernel: Lustre: MGC192.168.1.108@o2ib: Connection restored to MGC192.168.1.108@o2ib_0 (at 192.168.1.108@o2ib) Oct 4 16:01:24 soak-2 kernel: LustreError: 23440:0:(events.c:199:client_bulk_callback()) event type 2, status -103, desc ffff88082acd7e00 Oct 4 16:01:24 soak-2 kernel: Lustre: 24621:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1507132884/real 1507132884] req@ffff8803ff218000 x1580342920085632/t0(0) o256->MGC192.168.1.108@o2ib@192.168.1.108@o2ib:26/25 lens 304/240 e 0 to 1 dl 1507132891 ref 2 fl Rpc:eX/0/ffffffff rc 0/-1 Oct 4 16:01:24 soak-2 kernel: LustreError: 166-1: MGC192.168.1.108@o2ib: Connection to MGS (at 192.168.1.108@o2ib) was lost; in progress operations using this service will fail Oct 4 16:01:25 soak-2 kernel: LustreError: 24621:0:(mgc_request.c:251:do_config_log_add()) MGC192.168.1.108@o2ib: failed processing log, type 4: rc = -5 Oct 4 16:01:25 soak-2 kernel: Lustre: MGC192.168.1.108@o2ib: Connection restored to MGC192.168.1.108@o2ib_0 (at 192.168.1.108@o2ib) Oct 4 16:01:32 soak-2 kernel: Lustre: 24621:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1507132885/real 1507132885] req@ffff8803ff218000 x1580342920085696/t0(0) o503->MGC192.168.1.108@o2ib@192.168.1.108@o2ib:26/25 lens 272/8416 e 0 to 1 dl 1507132892 ref 2 fl Rpc:X/0/ffffffff rc 0/-1 Oct 4 16:01:32 soak-2 kernel: LustreError: 166-1: MGC192.168.1.108@o2ib: Connection to MGS (at 192.168.1.108@o2ib) was lost; in progress operations using this service will fail Oct 4 16:01:33 soak-2 kernel: LustreError: 24621:0:(mgc_request.c:1885:mgc_llog_local_copy()) MGC192.168.1.108@o2ib: failed to copy remote log soaked-OST0000: rc = -5 Oct 4 16:01:33 soak-2 kernel: LustreError: 24937:0:(ldlm_resource.c:1100:ldlm_resource_complain()) MGC192.168.1.108@o2ib: namespace resource [0x64656b616f73:0x0:0x0].0x0 (ffff8804184e1200) refcount nonzero (1) after lock cleanup; forcing cleanup. Oct 4 16:01:33 soak-2 kernel: LustreError: 24937:0:(ldlm_resource.c:1682:ldlm_resource_dump()) --- Resource: [0x64656b616f73:0x0:0x0].0x0 (ffff8804184e1200) refcount = 2 Oct 4 16:01:33 soak-2 kernel: LustreError: 24937:0:(ldlm_resource.c:1685:ldlm_resource_dump()) Granted locks (in reverse order): Oct 4 16:01:33 soak-2 kernel: LustreError: 24937:0:(ldlm_resource.c:1688:ldlm_resource_dump()) ### ### ns: MGC192.168.1.108@o2ib lock: ffff88040eb20400/0xc3f9d16c2df9943e lrc: 2/1,0 mode: CR/CR res: [0x64656b616f73:0x0:0x0].0x0 rrc: 3 type: PLN flags: 0x1106400000000 nid: local remote: 0xc8dccd9f1ede04d5 expref: -99 pid: 24621 timeout: 0 lvb_type: 0 Oct 4 16:01:33 soak-2 kernel: Lustre: MGC192.168.1.108@o2ib: Connection restored to MGC192.168.1.108@o2ib_0 (at 192.168.1.108@o2ib) Oct 4 16:01:34 soak-2 kernel: LustreError: 23440:0:(events.c:199:client_bulk_callback()) event type 2, status -103, desc ffff88082cc40800 Oct 4 16:01:34 soak-2 kernel: LustreError: 24665:0:(mgc_request.c:603:do_requeue()) failed processing log: -5 Oct 4 16:01:45 soak-2 kernel: LustreError: 137-5: soaked-OST0001_UUID: not available for connect from 192.168.1.108@o2ib (no target). If you are running an HA pair check that the target is mounted on the other server. |
| Comment by Cliff White (Inactive) [ 04/Oct/17 ] |
|
In addition, the OST mounts are very, very slow |
| Comment by Cliff White (Inactive) [ 04/Oct/17 ] |
|
However, now all client mount attempts fail: [ 1891.859459] Lustre: 9069:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1507133982/real 1507133982] req@ffff880813ea0300 x1580342921134176/t0(0) o503->MGC192.168.1.108@o2ib@192.168.1.108@o2ib:26/25 lens 272/8416 e 0 to 1 dl 1507133989 ref 2 fl Rpc:X/0/ffffffff rc 0/-1 [ 1891.952822] LustreError: 166-1: MGC192.168.1.108@o2ib: Connection to MGS (at 192.168.1.108@o2ib) was lost; in progress operations using this service will fail [ 1891.999529] LustreError: 15c-8: MGC192.168.1.108@o2ib: The configuration from log 'soaked-client' failed (-5). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information. [ 1892.002601] Lustre: MGC192.168.1.108@o2ib: Connection restored to MGC192.168.1.108@o2ib_0 (at 192.168.1.108@o2ib) [ 1892.110166] Lustre: Unmounted soaked-client |
| Comment by James Nunez (Inactive) [ 04/Oct/17 ] |
|
The last master build that ran successfully on soak was build #3637. |
| Comment by Cliff White (Inactive) [ 04/Oct/17 ] |
|
Reverted back to build 3649, and it's even worse! The initial MGT/MDT mount hangs, big time. Oct 4 18:40:43 soak-8 kernel: INFO: task mount.lustre:2489 blocked for more than 120 seconds. Oct 4 18:40:43 soak-8 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Oct 4 18:40:43 soak-8 kernel: mount.lustre D 0000000000000001 0 2489 2488 0x00000082 Oct 4 18:40:44 soak-8 kernel: ffff8804102d77b0 0000000000000082 ffff88041140dee0 ffff8804102d7fd8 Oct 4 18:40:44 soak-8 kernel: ffff8804102d7fd8 ffff8804102d7fd8 ffff88041140dee0 ffff88082c4c4810 Oct 4 18:40:44 soak-8 kernel: 7fffffffffffffff ffff88082c4c4808 ffff88041140dee0 0000000000000001 Oct 4 18:40:44 soak-8 kernel: Call Trace: Oct 4 18:40:44 soak-8 kernel: [<ffffffff816a94a9>] schedule+0x29/0x70 Oct 4 18:40:44 soak-8 kernel: [<ffffffff816a6fb9>] schedule_timeout+0x239/0x2c0 Oct 4 18:40:44 soak-8 kernel: [<ffffffff81050b5c>] ? native_smp_send_reschedule+0x4c/0x70 Oct 4 18:40:44 soak-8 kernel: [<ffffffff810c0548>] ? resched_curr+0xa8/0xc0 Oct 4 18:40:44 soak-8 kernel: [<ffffffff810c12c8>] ? check_preempt_curr+0x78/0xa0 Oct 4 18:40:44 soak-8 kernel: [<ffffffff810c1309>] ? ttwu_do_wakeup+0x19/0xd0 Oct 4 18:40:44 soak-8 kernel: [<ffffffff816a985d>] wait_for_completion+0xfd/0x140 Oct 4 18:40:44 soak-8 kernel: [<ffffffff810c4810>] ? wake_up_state+0x20/0x20 Oct 4 18:40:44 soak-8 kernel: [<ffffffffc0a496b4>] llog_process_or_fork+0x244/0x450 [obdclass] Oct 4 18:40:44 soak-8 kernel: [<ffffffffc0a498d4>] llog_process+0x14/0x20 [obdclass] Oct 4 18:40:44 soak-8 kernel: [<ffffffffc0a79e95>] class_config_parse_llog+0x125/0x350 [obdclass] Oct 4 18:40:44 soak-8 kernel: [<ffffffffc14465a8>] mgc_process_cfg_log+0x788/0xc40 [mgc] Oct 4 18:40:44 soak-8 kernel: [<ffffffffc1449413>] mgc_process_log+0x3d3/0x890 [mgc] Oct 4 18:40:44 soak-8 kernel: [<ffffffffc0a826b0>] ? class_config_dump_handler+0x7e0/0x7e0 [obdclass] Oct 4 18:40:44 soak-8 kernel: [<ffffffffc1449b18>] ? do_config_log_add+0x248/0x580 [mgc] Oct 4 18:40:44 soak-8 kernel: [<ffffffffc144a9f0>] mgc_process_config+0x890/0x13f0 [mgc] Oct 4 18:40:44 soak-8 kernel: [<ffffffffc0a861ea>] lustre_process_log+0x2da/0xae0 [obdclass] Oct 4 18:40:44 soak-8 kernel: [<ffffffffc0927ba7>] ? libcfs_debug_msg+0x57/0x80 [libcfs] Oct 4 18:40:44 soak-8 kernel: [<ffffffffc0a713e9>] ? lprocfs_counter_add+0xf9/0x160 [obdclass] Oct 4 18:40:44 soak-8 kernel: [<ffffffffc0ab2da2>] server_start_targets+0x1352/0x2a70 [obdclass] Oct 4 18:40:44 soak-8 kernel: [<ffffffffc0cd0b48>] ? ptlrpc_pinger_wake_up+0x28/0x30 [ptlrpc] Oct 4 18:40:44 soak-8 kernel: [<ffffffffc0a71511>] ? lprocfs_counter_sub+0xc1/0x130 [obdclass] Oct 4 18:40:44 soak-8 kernel: [<ffffffffc0a826b0>] ? class_config_dump_handler+0x7e0/0x7e0 [obdclass] Oct 4 18:40:44 soak-8 kernel: [<ffffffffc0ab554d>] server_fill_super+0x108d/0x184c [obdclass] Oct 4 18:40:44 soak-8 kernel: [<ffffffffc0a8c168>] lustre_fill_super+0x328/0x950 [obdclass] Oct 4 18:40:44 soak-8 kernel: [<ffffffffc0a8be40>] ? lustre_common_put_super+0x270/0x270 [obdclass] Oct 4 18:40:44 soak-8 kernel: [<ffffffff81204afd>] mount_nodev+0x4d/0xb0 Oct 4 18:40:44 soak-8 kernel: [<ffffffffc0a83f78>] lustre_mount+0x38/0x60 [obdclass] Oct 4 18:40:44 soak-8 kernel: [<ffffffff81205589>] mount_fs+0x39/0x1b0 Oct 4 18:40:44 soak-8 kernel: [<ffffffff81222067>] vfs_kern_mount+0x67/0x110 Oct 4 18:40:44 soak-8 kernel: [<ffffffff81224573>] do_mount+0x233/0xaf0 Oct 4 18:40:44 soak-8 kernel: [<ffffffff8118760e>] ? __get_free_pages+0xe/0x40 Oct 4 18:40:44 soak-8 kernel: [<ffffffff812251b6>] SyS_mount+0x96/0xf0 Oct 4 18:40:44 soak-8 kernel: [<ffffffff816b4fc9>] system_call_fastpath+0x16/0x1b Oct 4 18:42:44 soak-8 kernel: INFO: task mount.lustre:2489 blocked for more than 120 seconds. |
| Comment by Cliff White (Inactive) [ 04/Oct/17 ] |
|
Did a debug=-1 dump of soak-8 (the MDS) while the mount was hanging. Attached. |
| Comment by Cliff White (Inactive) [ 04/Oct/17 ] |
|
Tested build 3637, worked fine. [ 230.588325] LustreError: 166-1: MGC192.168.1.108@o2ib: Connection to MGS (at 192.168.1.108@o2ib) was lost; in progress operations using this service will fail [ 230.635151] LustreError: 15c-8: MGC192.168.1.108@o2ib: The configuration from log 'soaked-MDT0000' failed (-5). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information. [ 230.638598] Lustre: MGS: Received new LWP connection from 192.168.1.108@o2ib, removing former export from same NID [ 230.638625] Lustre: MGS: Connection restored to 8c9d0ffb-0dac-9c76-1262-80c392d2d0ce (at 192.168.1.108@o2ib) [ 230.638628] Lustre: Skipped 2 previous similar messages [ 230.796182] LustreError: 2039:0:(obd_mount_server.c:1370:server_start_targets()) failed to start server soaked-MDT0000: -5 [ 230.873594] LustreError: 2039:0:(obd_mount_server.c:1863:server_fill_super()) Unable to start targets: -5 [ 230.905264] LustreError: 2039:0:(obd_mount_server.c:1573:server_put_super()) no obd soaked-MDT0000 |
| Comment by Andreas Dilger [ 04/Oct/17 ] |
|
What patches were landed between these two builds? |
| Comment by James Nunez (Inactive) [ 05/Oct/17 ] |
|
Between builds 3637 and 3638, the following patches landed: |
| Comment by Cliff White (Inactive) [ 05/Oct/17 ] |
|
Having some problems also with lustre-review builds, so may also be soak, will re-load 3637 tomorrow to confirm. |
| Comment by Andreas Dilger [ 05/Oct/17 ] |
|
Based on the above patch descriptions, my first guess would be Alternately, if this is only being seen on IB then |
| Comment by John Hammond [ 05/Oct/17 ] |
|
It's likely that |
| Comment by Amir Shehata (Inactive) [ 05/Oct/17 ] |
|
That's true. OPA supports FMR in software but not FastReg. Was this run on OPA? |
| Comment by Amir Shehata (Inactive) [ 05/Oct/17 ] |
|
Do we see this on startup? LCONSOLE_INFO("Using FastReg for registration\n");
|
| Comment by Cliff White (Inactive) [ 05/Oct/17 ] |
|
We sure do. /scratch/logs/syslog/soak-8.log:Oct 4 23:19:28 soak-8 kernel: LNet: Using FastReg for registration /scratch/logs/syslog/soak-8.log:Oct 5 00:06:13 soak-8 kernel: LNet: Using FastReg for registration /scratch/logs/syslog/soak-8.log:Oct 5 00:30:19 soak-8 kernel: LNet: Using FastReg for registration /scratch/logs/syslog/soak-8.log:Oct 5 18:12:54 soak-8 kernel: LNet: Using FastReg for registration /scratch/logs/syslog/soak-9.log:Oct 3 18:03:02 soak-9 kernel: LNet: Using FastReg for registration /scratch/logs/syslog/soak-9.log:Oct 4 15:55:02 soak-9 kernel: LNet: Using FastReg for registration /scratch/logs/syslog/soak-9.log:Oct 4 18:36:43 soak-9 kernel: LNet: Using FastReg for registration /scratch/logs/syslog/soak-9.log:Oct 4 23:19:30 soak-9 kernel: LNet: Using FastReg for registration /scratch/logs/syslog/soak-9.log:Oct 5 00:30:19 soak-9 kernel: LNet: Using FastReg for registration /scratch/logs/syslog/soak-9.log:Oct 5 18:12:55 soak-9 kernel: LNet: Using FastReg for registration |
| Comment by Cliff White (Inactive) [ 05/Oct/17 ] |
|
Just to be double-certain,this morning I re-loaded 3637 (last good build) did a complete re-format to obtain a clean FS, verified that everything mounted. Then, I installed build 3638, power-cycled all systems. Verified that LNET was up on all systems, and that all systems could lctl ping the MGS/MDS (soak-8) - attempted to mount the filesystem. Initial MGS/MDS mount fails. Oct 5 18:16:28 soak-8 kernel: ZFS: Loaded module v0.7.1-1, ZFS pool version 5000, ZFS filesystem version 5 Oct 5 18:16:29 soak-8 kernel: LDISKFS-fs (dm-1): mounted filesystem with ordered data mode. Opts: user_xattr,errors=remount-ro,user_xattr,no_mbcache,nodelalloc Oct 5 18:16:30 soak-8 kernel: Lustre: MGS: Connection restored to 0e7fadb0-8a70-1717-c708-3785eb9ea4ec (at 192.168.1.108@o2ib) Oct 5 18:16:31 soak-8 kernel: Lustre: 2365:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1507227390/real 1507227390] req@ffff8808142d0000 x1580442241204256/t0(0) o253->MGC192.168.1.108@o2ib@192.168.1.108@o2ib:26/25 lens 4768/4768 e 0 to 1 dl 1507227397 ref 2 fl Rpc:eX/0/ffffffff rc 0/-1 Oct 5 18:16:31 soak-8 kernel: LustreError: 2216:0:(events.c:304:request_in_callback()) event type 2, status -103, service mgs Oct 5 18:16:31 soak-8 kernel: LustreError: 2417:0:(pack_generic.c:588:__lustre_unpack_msg()) message length 0 too small for magic/version check Oct 5 18:16:31 soak-8 kernel: LustreError: 2417:0:(sec.c:2069:sptlrpc_svc_unwrap_request()) error unpacking request from 12345-192.168.1.108@o2ib x1580442241204256 Oct 5 18:16:31 soak-8 kernel: LustreError: 166-1: MGC192.168.1.108@o2ib: Connection to MGS (at 192.168.1.108@o2ib) was lost; in progress operations using this service will fail Oct 5 18:16:31 soak-8 kernel: Lustre: MGS: Received new LWP connection from 192.168.1.108@o2ib, removing former export from same NID Oct 5 18:16:31 soak-8 kernel: Lustre: MGS: Connection restored to 0e7fadb0-8a70-1717-c708-3785eb9ea4ec (at 192.168.1.108@o2ib) Oct 5 18:16:38 soak-8 kernel: Lustre: 2365:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1507227391/real 1507227391] req@ffff88080ef08000 x1580442241204384/t0(0) o503->MGC192.168.1.108@o2ib@192.168.1.108@o2ib:26/25 lens 272/8416 e 0 to 1 dl 1507227398 ref 2 fl Rpc:X/0/ffffffff rc 0/-1 Oct 5 18:16:38 soak-8 kernel: LustreError: 166-1: MGC192.168.1.108@o2ib: Connection to MGS (at 192.168.1.108@o2ib) was lost; in progress operations using this service will fail Oct 5 18:16:38 soak-8 kernel: LustreError: 15c-8: MGC192.168.1.108@o2ib: The configuration from log 'soaked-MDT0000' failed (-5). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information. Oct 5 18:16:39 soak-8 kernel: Lustre: MGS: Received new LWP connection from 192.168.1.108@o2ib, removing former export from same NID Oct 5 18:16:39 soak-8 kernel: Lustre: MGS: Connection restored to 0e7fadb0-8a70-1717-c708-3785eb9ea4ec (at 192.168.1.108@o2ib) Oct 5 18:16:39 soak-8 kernel: Lustre: Skipped 1 previous similar message Oct 5 18:16:39 soak-8 kernel: LustreError: 2365:0:(obd_mount_server.c:1370:server_start_targets()) failed to start server soaked-MDT0000: -5 Oct 5 18:16:39 soak-8 kernel: LustreError: 2365:0:(obd_mount_server.c:1863:server_fill_super()) Unable to start targets: -5 Oct 5 18:16:39 soak-8 kernel: LustreError: 2365:0:(obd_mount_server.c:1573:server_put_super()) no obd soaked-MDT0000 Oct 5 18:16:46 soak-8 kernel: Lustre: 2365:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1507227399/real 1507227399] req@ffff88080ef08000 x1580442241204416/t0(0) o251->MGC192.168.1.108@o2ib@192.168.1.108@o2ib:26/25 lens 224/224 e 0 to 1 dl 1507227405 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1 Oct 5 18:16:46 soak-8 kernel: Lustre: server umount soaked-MDT0000 complete Oct 5 18:16:46 soak-8 kernel: LustreError: 2365:0:(obd_mount.c:1504:lustre_fill_super()) Unable to mount (-5) Oct 5 18:16:46 soak-8 sshd[2342]: Received disconnect from 192.168.1.116 port 35786:11: disconnected by user Oct 5 18:16:46 soak-8 sshd[2342]: Disconnected from 192.168.1.116 port 35786 |
| Comment by Gerrit Updater [ 05/Oct/17 ] |
|
James Nunez (james.a.nunez@intel.com) uploaded a new patch: https://review.whamcloud.com/29341 |
| Comment by Amir Shehata (Inactive) [ 05/Oct/17 ] |
|
so on MLX5 I'm able to mount, but I'm getting the following error: [ 1157.516112] LNetError: 34447:0:(o2iblnd.c:1940:kiblnd_fmr_pool_map()) Failed to map mr 10/11 elements [ 1157.516122] LNetError: 34528:0:(o2iblnd_cb.c:560:kiblnd_fmr_map_tx()) Can't map 41033 pages: -22 That's again due to the fastreg code. |
| Comment by John Hammond [ 06/Oct/17 ] |
|
Fails on OPA as well. |
| Comment by James A Simmons [ 06/Oct/17 ] |
|
I tried the patch for |
| Comment by Cliff White (Inactive) [ 06/Oct/17 ] |
|
Trying https://review.whamcloud.com/29341 on soak - servers mount okay, some clients are having timeouts soak-20 login: [ 3363.772469] INFO: task mount.lustre:2191 blocked for more than 120 seconds.^M [ 3363.782165] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.^M [ 3363.792785] mount.lustre D 0000000000000001 0 2191 2190 0x00000080^M [ 3363.802418] ffff88103fad3940 0000000000000082 ffff881013fedee0 ffff88103fad3fd8^M [ 3363.812340] ffff88103fad3fd8 ffff88103fad3fd8 ffff881013fedee0 ffff881013b87270^M [ 3363.822136] 7fffffffffffffff ffff881013b87268 ffff881013fedee0 0000000000000001^M [ 3363.831817] Call Trace:^M [ 3363.835841] [<ffffffff816a94e9>] schedule+0x29/0x70^M [ 3363.842720] [<ffffffff816a6ff9>] schedule_timeout+0x239/0x2c0^M [ 3363.850481] [<ffffffff810c76f5>] ? sched_clock_cpu+0x85/0xc0^M [ 3363.858146] [<ffffffff810c12c8>] ? check_preempt_curr+0x78/0xa0^M [ 3363.866025] [<ffffffff810c1309>] ? ttwu_do_wakeup+0x19/0xd0^M [ 3363.873501] [<ffffffff816a989d>] wait_for_completion+0xfd/0x140^M [ 3363.881287] [<ffffffff810c4810>] ? wake_up_state+0x20/0x20^M [ 3363.888596] [<ffffffffc09de6b4>] llog_process_or_fork+0x244/0x450 [obdclass]^M [ 3363.897603] [<ffffffffc09de8d4>] llog_process+0x14/0x20 [obdclass]^M [ 3363.905633] [<ffffffffc0a0ed65>] class_config_parse_llog+0x125/0x350 [obdclass]^M [ 3363.914874] [<ffffffffc091c368>] mgc_process_cfg_log+0x788/0xc40 [mgc]^M [ 3363.923229] [<ffffffffc091f1e3>] mgc_process_log+0x3d3/0x890 [mgc]^M [ 3363.931153] [<ffffffffc0a17480>] ? class_config_dump_handler+0x7e0/0x7e0 [obdclass]^M [ 3363.940736] [<ffffffffc091f8e8>] ? do_config_log_add+0x248/0x580 [mgc]^M [ 3363.949020] [<ffffffffc09207c0>] mgc_process_config+0x890/0x13f0 [mgc]^M [ 3363.957326] [<ffffffffc0a1ad7e>] lustre_process_log+0x2de/0xaf0 [obdclass]^M [ 3363.965969] [<ffffffff816aba0e>] ? _raw_spin_unlock_bh+0x1e/0x20^M [ 3363.973647] [<ffffffff8132318b>] ? fprop_local_init_percpu+0x1b/0x30^M [ 3363.981681] [<ffffffffc0e20608>] ll_fill_super+0xaf8/0x1220 [lustre]^M [ 3363.989721] [<ffffffffc0a20ce6>] lustre_fill_super+0x286/0x910 [obdclass]^M [ 3363.998214] [<ffffffffc0a20a60>] ? lustre_common_put_super+0x270/0x270 [obdclass]^M [ 3364.007482] [<ffffffff81204b0d>] mount_nodev+0x4d/0xb0^M [ 3364.014099] [<ffffffffc0a18ba8>] lustre_mount+0x38/0x60 [obdclass]^M [ 3364.021874] [<ffffffff81205599>] mount_fs+0x39/0x1b0^M [ 3364.028291] [<ffffffff81222067>] vfs_kern_mount+0x67/0x110^M [ 3364.035288] [<ffffffff81224573>] do_mount+0x233/0xaf0^M [ 3364.041786] [<ffffffff8118760e>] ? __get_free_pages+0xe/0x40^M [ 3364.048971] [<ffffffff812251b6>] SyS_mount+0x96/0xf0^M [ 3364.055364] [<ffffffff816b5009>] system_call_fastpath+0x16/0x1b^M Re-trying clients after power-cycling does work |
| Comment by Amir Shehata (Inactive) [ 06/Oct/17 ] |
|
there appears to be 3 separate issues: length = 4096, offset = 0, addr = 105c43c000, page offset = 0, page_addr = 105c43c000 length = 4096, offset = 0, addr = 84cfb0000, page offset = 0, page_addr = 84cfb0000 length = 4096, offset = 0, addr = 84cfb1000, page offset = 0, page_addr = 84cfb1000 length = 4096, offset = 0, addr = 84cfb2000, page offset = 0, page_addr = 84cfb2000 length = 4096, offset = 0, addr = 84cfb3000, page offset = 0, page_addr = 84cfb3000 length = 4096, offset = 0, addr = 84cfb4000, page offset = 0, page_addr = 84cfb4000 length = 4096, offset = 0, addr = 84cfb5000, page offset = 0, page_addr = 84cfb5000 length = 4096, offset = 0, addr = 84cfb6000, page offset = 0, page_addr = 84cfb6000 length = 4096, offset = 0, addr = 84cfb7000, page offset = 0, page_addr = 84cfb7000 length = 73, offset = 0, addr = 84cfb8000, page offset = 0, page_addr = 84cfb8000 length = 4096, offset = 0, addr = 105c43d000, page offset = 0, page_addr = 105c43d000 So it looks like it stops on the fragment of length 73. 3. MLX-4 failure. I still need to investigate further, because it could be different than both of the above. |
| Comment by Cliff White (Inactive) [ 06/Oct/17 ] |
|
Tested James patch reverting Fast Reg. https://review.whamcloud.com/29341. Soak is running okay, but router have hit LBUG (LU-10103) |
| Comment by Gerrit Updater [ 10/Oct/17 ] |
|
Amir Shehata (amir.shehata@intel.com) uploaded a new patch: https://review.whamcloud.com/29547 |
| Comment by Andreas Dilger [ 18/Dec/17 ] |
|
The combined patch is no longer needed (part reverted, part landed), can this ticket be closed? |
| Comment by Cliff White (Inactive) [ 18/Dec/17 ] |
|
I am ok with closing it. |
| Comment by Andreas Dilger [ 19/Dec/17 ] |
|
This was fixed by the |