[LU-9049] DNE MDT Never completes recovery Created: 25/Jan/17 Updated: 08/May/18 Resolved: 22/Jun/17 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.10.0 |
| Fix Version/s: | Lustre 2.10.0 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Cliff White (Inactive) | Assignee: | Lai Siyao |
| Resolution: | Fixed | Votes: | 1 |
| Labels: | soak | ||
| Environment: | Soak test cluster, version=lustre: 2.9.51_45_g3b3eeeb (tip of master) |
| Attachments: | |
| Issue Links: | |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
Failover of lola-11 to lola-10:

2017-01-24 16:44:56,220:fsmgmt.fsmgmt:INFO Mounting soaked-MDT0003 on lola-10 ...

Immediately prior to the failover mount, lola-10 reports hung processes:

Jan 24 16:41:03 lola-10 kernel: Lustre: 4554:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1485304863/real 1485304863] req@ffff8804131dd680 x1557446666855008/t0(0) o38->soaked-MDT0003-osp-MDT0002@192.168.1.111@o2ib10:24/4 lens 520/544 e 0 to 1 dl 1485304880 ref 1 fl Rpc:eXN/0/ffffffff rc 0/-1
Jan 24 16:41:03 lola-10 kernel: Lustre: 4554:0:(client.c:2113:ptlrpc_expire_one_request()) Skipped 1 previous similar message
Jan 24 16:41:35 lola-10 kernel: LNet: Service thread pid 6159 was inactive for 226.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Jan 24 16:41:35 lola-10 kernel: Pid: 6159, comm: mdt00_005
Jan 24 16:41:35 lola-10 kernel: Call Trace:
Jan 24 16:41:35 lola-10 kernel: [<ffffffff8105e4e3>] ? __wake_up+0x53/0x70
Jan 24 16:41:35 lola-10 kernel: [<ffffffffa0a33bb2>] top_trans_stop+0x782/0xbd0 [ptlrpc]
Jan 24 16:41:35 lola-10 kernel: [<ffffffff81067650>] ? default_wake_function+0x0/0x20
Jan 24 16:41:35 lola-10 kernel: [<ffffffffa11b4aac>] lod_trans_stop+0x2bc/0x330 [lod]
Jan 24 16:41:35 lola-10 kernel: [<ffffffffa1259f31>] mdd_trans_stop+0x21/0xc6 [mdd]
Jan 24 16:41:35 lola-10 kernel: [<ffffffffa1246e63>] mdd_create+0x1373/0x17e0 [mdd]
Jan 24 16:41:35 lola-10 kernel: [<ffffffffa1105c44>] ? mdt_version_save+0x84/0x1a0 [mdt]
Jan 24 16:41:35 lola-10 kernel: [<ffffffffa110fcfc>] mdt_reint_create+0xbdc/0xfe0 [mdt]
Jan 24 16:41:35 lola-10 kernel: [<ffffffffa11013bc>] ? mdt_root_squash+0x2c/0x3f0 [mdt]
Jan 24 16:41:35 lola-10 kernel: [<ffffffffa09bb2bb>] ? lustre_pack_reply_v2+0x1eb/0x280 [ptlrpc]
Jan 24 16:41:35 lola-10 kernel: [<ffffffff81299b7a>] ? strlcpy+0x4a/0x60
Jan 24 16:41:35 lola-10 kernel: [<ffffffffa1102a2a>] ? old_init_ucred_common+0xda/0x2b0 [mdt]
Jan 24 16:41:35 lola-10 kernel: [<ffffffffa1104f5d>] mdt_reint_rec+0x5d/0x200 [mdt]
Jan 24 16:41:35 lola-10 kernel: [<ffffffffa10efd9b>] mdt_reint_internal+0x62b/0xa50 [mdt]
Jan 24 16:41:35 lola-10 kernel: [<ffffffffa10f066b>] mdt_reint+0x6b/0x120 [mdt]
Jan 24 16:41:35 lola-10 kernel: [<ffffffffa0a1f17c>] tgt_request_handle+0x8ec/0x1440 [ptlrpc]
Jan 24 16:41:35 lola-10 kernel: [<ffffffffa09ca66b>] ptlrpc_server_handle_request+0x2eb/0xbd0 [ptlrpc]
Jan 24 16:41:35 lola-10 kernel: [<ffffffffa067c84a>] ? lc_watchdog_touch+0x7a/0x190 [libcfs]
Jan 24 16:41:35 lola-10 kernel: [<ffffffffa09c56e9>] ? ptlrpc_wait_event+0xa9/0x2e0 [ptlrpc]
Jan 24 16:41:35 lola-10 kernel: [<ffffffff81059ca9>] ? __wake_up_common+0x59/0x90
Jan 24 16:41:35 lola-10 kernel: [<ffffffffa09cba11>] ptlrpc_main+0xac1/0x18d0 [ptlrpc]
Jan 24 16:41:35 lola-10 kernel: [<ffffffffa09caf50>] ? ptlrpc_main+0x0/0x18d0 [ptlrpc]
Jan 24 16:41:35 lola-10 kernel: [<ffffffff810a138e>] kthread+0x9e/0xc0
Jan 24 16:41:35 lola-10 kernel: [<ffffffff8100c28a>] child_rip+0xa/0x20
Jan 24 16:41:35 lola-10 kernel: [<ffffffff810a12f0>] ? kthread+0x0/0xc0
Jan 24 16:41:35 lola-10 kernel: [<ffffffff8100c280>] ? child_rip+0x0/0x20
Jan 24 16:41:35 lola-10 kernel: LustreError: dumping log to /tmp/lustre-log.1485304895.6159

Mount occurs:

Jan 24 16:46:22 lola-10 kernel: Lustre: soaked-MDT0003: Imperative Recovery enabled, recovery window shrunk from 300-900 down to 150-900
...
Jan 24 16:46:27 lola-10 kernel: Lustre: soaked-MDT0003: Will be in recovery for at least 2:30, or until 22 clients reconnect
...
Jan 24 16:47:44 lola-10 kernel: Lustre: 6025:0:(service.c:1331:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/5), not sending early reply
Jan 24 16:47:44 lola-10 kernel: req@ffff88041171b980 x1557448608035648/t21476913294(0) o36->57c15fb3-4c15-952f-3686-35f4c9caa941@192.168.1.126@o2ib100:-1/-1 lens 632/568 e 19 to 0 dl 1485305269 ref 2 fl Interpret:/0/0 rc 0/0
Jan 24 16:47:47 lola-10 kernel: Lustre: 6025:0:(service.c:1331:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/5), not sending early reply
Jan 24 16:47:47 lola-10 kernel: req@ffff880737e13cc0 x1557448490734944/t21476914413(0) o36->1e7aaa7d-f417-ebd3-eff8-7787d5dac449@192.168.1.123@o2ib100:-1/-1 lens 616/3128 e 19 to 0 dl 1485305272 ref 2 fl Interpret:/0/0 rc 0/0
Jan 24 16:47:47 lola-10 kernel: Lustre: 6025:0:(service.c:1331:ptlrpc_at_send_early_reply()) Skipped 22 previous similar messages
Jan 24 16:47:50 lola-10 kernel: Lustre: soaked-MDT0002: Client ffcd9b78-a534-1d1b-1494-f25c5b00edf1 (at 192.168.1.131@o2ib100) reconnecting
Recovery appears to stall:

Jan 24 16:50:18 lola-10 kernel: Lustre: 7291:0:(ldlm_lib.c:1787:extend_recovery_timer()) soaked-MDT0003: extended recovery timer reaching hard limit: 900, extend: 1
Jan 24 16:50:19 lola-10 kernel: Lustre: 7291:0:(ldlm_lib.c:1787:extend_recovery_timer()) soaked-MDT0003: extended recovery timer reaching hard limit: 900, extend: 1
Jan 24 16:50:19 lola-10 kernel: Lustre: 7291:0:(ldlm_lib.c:1787:extend_recovery_timer()) Skipped 2 previous similar messages
Jan 24 16:50:21 lola-10 kernel: Lustre: 7291:0:(ldlm_lib.c:1787:extend_recovery_timer()) soaked-MDT0003: extended recovery timer reaching hard limit: 900, extend: 1

System is still spinning at the present time. |
| Comments |
| Comment by Cliff White (Inactive) [ 25/Jan/17 ] |
|
Stack dump from lola-10 and first lustre-log dumped. The system is available if anyone wishes to have a look. |
| Comment by Cliff White (Inactive) [ 25/Jan/17 ] |
|
At the same time, lola-8 (MGS/MDT0) is having hangs:

Jan 24 16:40:04 lola-8 kernel: Lustre: 5828:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1485304804/real 1485304804] req@ffff8804e8eb1cc0 x1557447056153856/t0(0) o38->soaked-MDT0003-osp-MDT0000@192.168.1.111@o2ib10:24/4 lens 520/544 e 0 to 1 dl 1485304815 ref 1 fl Rpc:eXN/0/ffffffff rc 0/-1
Jan 24 16:40:04 lola-8 kernel: Lustre: 5828:0:(client.c:2113:ptlrpc_expire_one_request()) Skipped 1 previous similar message
Jan 24 16:41:09 lola-8 kernel: LNet: Service thread pid 6336 was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Jan 24 16:41:09 lola-8 kernel: Pid: 6336, comm: mdt00_014
Jan 24 16:41:09 lola-8 kernel: Call Trace:
Jan 24 16:41:09 lola-8 kernel: [<ffffffffa0baebb2>] top_trans_stop+0x782/0xbd0 [ptlrpc]
Jan 24 16:41:09 lola-8 kernel: [<ffffffff81067650>] ? default_wake_function+0x0/0x20
Jan 24 16:41:09 lola-8 kernel: [<ffffffffa134caac>] lod_trans_stop+0x2bc/0x330 [lod]
Jan 24 16:41:09 lola-8 kernel: [<ffffffffa136dc3c>] ? lod_attr_get+0x9c/0x1b0 [lod]
Jan 24 16:41:09 lola-8 kernel: [<ffffffffa13f1f31>] mdd_trans_stop+0x21/0xc6 [mdd]
Jan 24 16:41:09 lola-8 kernel: [<ffffffffa13d67f2>] mdd_unlink+0x532/0x1020 [mdd]
Jan 24 16:41:09 lola-8 kernel: [<ffffffffa129cec8>] mdo_unlink+0x18/0x50 [mdt]
Jan 24 16:41:09 lola-8 kernel: [<ffffffffa12a5fcf>] mdt_reint_unlink+0xcaf/0x10c0 [mdt]
Jan 24 16:41:09 lola-8 kernel: [<ffffffffa129cf5d>] mdt_reint_rec+0x5d/0x200 [mdt]
Jan 24 16:41:09 lola-8 kernel: [<ffffffffa1287d9b>] mdt_reint_internal+0x62b/0xa50 [mdt]
Jan 24 16:41:09 lola-8 kernel: [<ffffffffa128866b>] mdt_reint+0x6b/0x120 [mdt]
Jan 24 16:41:09 lola-8 kernel: [<ffffffffa0b9a17c>] tgt_request_handle+0x8ec/0x1440 [ptlrpc]
Jan 24 16:41:09 lola-8 kernel: [<ffffffffa0b4566b>] ptlrpc_server_handle_request+0x2eb/0xbd0 [ptlrpc]
Jan 24 16:41:09 lola-8 kernel: [<ffffffffa082a84a>] ? lc_watchdog_touch+0x7a/0x190 [libcfs]
Jan 24 16:41:09 lola-8 kernel: [<ffffffffa0b406e9>] ? ptlrpc_wait_event+0xa9/0x2e0 [ptlrpc]
Jan 24 16:41:09 lola-8 kernel: [<ffffffff81059ca9>] ? __wake_up_common+0x59/0x90
Jan 24 16:41:09 lola-8 kernel: [<ffffffffa0b46a11>] ptlrpc_main+0xac1/0x18d0 [ptlrpc]
Jan 24 16:41:09 lola-8 kernel: [<ffffffffa0b45f50>] ? ptlrpc_main+0x0/0x18d0 [ptlrpc]
Jan 24 16:41:09 lola-8 kernel: [<ffffffff810a138e>] kthread+0x9e/0xc0
Jan 24 16:41:09 lola-8 kernel: [<ffffffff8100c28a>] child_rip+0xa/0x20
Jan 24 16:41:09 lola-8 kernel: [<ffffffff810a12f0>] ? kthread+0x0/0xc0
Jan 24 16:41:09 lola-8 kernel: [<ffffffff8100c280>] ? child_rip+0x0/0x20

Mount occurs:

Jan 24 16:41:44 lola-8 kernel: Lustre: 5828:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1485304904/real 1485304904] req@ffff8804e8e9a3c0 x1557447056161856/t0(0) o38->soaked-MDT0003-osp-MDT0000@192.168.1.111@o2ib10:24/4 lens 520/544 e 0 to 1 dl 1485304925 ref 1 fl Rpc:eXN/0/ffffffff rc 0/-1
Jan 24 16:41:44 lola-8 kernel: Lustre: 5828:0:(client.c:2113:ptlrpc_expire_one_request()) Skipped 2 previous similar messages
Jan 24 16:42:49 lola-8 kernel: Lustre: soaked-MDT0000: Received new LWP connection from 192.168.1.110@o2ib10, removing former export from same NID
Jan 24 16:42:49 lola-8 kernel: Lustre: soaked-MDT0000: Connection restored to soaked-MDT0000-lwp-MDT0002_UUID (at 192.168.1.110@o2ib10)
Jan 24 16:44:39 lola-8 kernel: Lustre: 5828:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1485305079/real 1485305079] req@ffff8804eb39d080 x1557447056175808/t0(0) o38->soaked-MDT0003-osp-MDT0000@192.168.1.111@o2ib10:24/4 lens 520/544 e 0 to 1 dl 1485305110 ref 1 fl Rpc:eXN/0/ffffffff rc 0/-1
Jan 24 16:44:39 lola-8 kernel: Lustre: 5828:0:(client.c:2113:ptlrpc_expire_one_request()) Skipped 1 previous similar message
Jan 24 16:45:01 lola-8 CROND[7107]: (root) CMD (logger -p local0.notice -t TIME "Time stamp for console")
Jan 24 16:45:01 lola-8 TIME: Time stamp for console
Jan 24 16:46:22 lola-8 kernel: Lustre: soaked-MDT0000: Received LWP connection from 192.168.1.110@o2ib10, removing former export from 192.168.1.111@o2ib10
Jan 24 16:46:22 lola-8 kernel: Lustre: soaked-MDT0000: Connection restored to soaked-MDT0003-mdtlov_UUID (at 192.168.1.110@o2ib10)
Jan 24 16:47:44 lola-8 kernel: Lustre: 6298:0:(service.c:1331:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/5), not sending early reply
Jan 24 16:47:44 lola-8 kernel: req@ffff8805945686c0 x1557448469206752/t21499374434(0) o36->905c8d4a-b076-6b11-d52f-9bf74da1631a@192.168.1.120@o2ib100:-1/-1 lens 624/3128 e 24 to 0 dl 1485305269 ref 2 fl Interpret:/0/0 rc 0/0
Jan 24 16:47:50 lola-8 kernel: Lustre: soaked-MDT0000: Client 905c8d4a-b076-6b11-d52f-9bf74da1631a (at 192.168.1.120@o2ib100) reconnecting
Jan 24 16:47:50 lola-8 kernel: Lustre: soaked-MDT0000: Connection restored to 905c8d4a-b076-6b11-d52f-9bf74da1631a (at 192.168.1.120@o2ib100)

System continues to have blocked processes:

Jan 24 16:53:38 lola-8 kernel: INFO: task mdt_out00_002:6070 blocked for more than 120 seconds.
Jan 24 16:53:38 lola-8 kernel: Tainted: P -- ------------ 2.6.32-573.26.1.el6_lustre.x86_64 #1
Jan 24 16:53:38 lola-8 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jan 24 16:53:38 lola-8 kernel: mdt_out00_002 D 0000000000000016 0 6070 2 0x00000080
Jan 24 16:53:38 lola-8 kernel: ffff88082230bb00 0000000000000046 0000000000000000 ffff88082230bcc0
Jan 24 16:53:38 lola-8 kernel: ffff8803c91c2f00 ffff8803eb000118 000007f09680dbc0 ffff8803eb000118
Jan 24 16:53:38 lola-8 kernel: ffff88082230bb00 000000010080a9dc ffff8808276045f8 ffff88082230bfd8
Jan 24 16:53:38 lola-8 kernel: Call Trace:
Jan 24 16:53:38 lola-8 kernel: [<ffffffffa0939f8d>] lu_object_find_at+0x3d/0xe0 [obdclass]
Jan 24 16:53:38 lola-8 kernel: [<ffffffffa0b5c612>] ? __req_capsule_get+0x162/0x6e0 [ptlrpc]
Jan 24 16:53:38 lola-8 kernel: [<ffffffff81067650>] ? default_wake_function+0x0/0x20
Jan 24 16:53:38 lola-8 kernel: [<ffffffffa0b31e10>] ? lustre_swab_object_update_reply+0x0/0xc0 [ptlrpc]
Jan 24 16:53:38 lola-8 kernel: [<ffffffffa093ad1c>] dt_locate_at+0x1c/0xa0 [obdclass]
Jan 24 16:53:38 lola-8 kernel: [<ffffffffa0ba2e07>] out_handle+0x1067/0x18d0 [ptlrpc]
Jan 24 16:53:38 lola-8 kernel: [<ffffffff8153afce>] ? mutex_lock+0x1e/0x50
Jan 24 16:53:38 lola-8 kernel: [<ffffffffa0b9308a>] ? req_can_reconstruct+0x6a/0x120 [ptlrpc]
Jan 24 16:53:38 lola-8 kernel: [<ffffffffa0b9a17c>] tgt_request_handle+0x8ec/0x1440 [ptlrpc]
Jan 24 16:53:38 lola-8 kernel: [<ffffffffa0b4566b>] ptlrpc_server_handle_request+0x2eb/0xbd0 [ptlrpc]
Jan 24 16:53:38 lola-8 kernel: [<ffffffffa082a84a>] ? lc_watchdog_touch+0x7a/0x190 [libcfs]
Jan 24 16:53:38 lola-8 kernel: [<ffffffffa0b406e9>] ? ptlrpc_wait_event+0xa9/0x2e0 [ptlrpc]
Jan 24 16:53:38 lola-8 kernel: [<ffffffff81059ca9>] ? __wake_up_common+0x59/0x90
Jan 24 16:53:38 lola-8 kernel: [<ffffffffa0b46a11>] ptlrpc_main+0xac1/0x18d0 [ptlrpc]
Jan 24 16:53:38 lola-8 kernel: [<ffffffffa0b45f50>] ? ptlrpc_main+0x0/0x18d0 [ptlrpc]
Jan 24 16:53:38 lola-8 kernel: [<ffffffff810a138e>] kthread+0x9e/0xc0
Jan 24 16:53:38 lola-8 kernel: [<ffffffff8100c28a>] child_rip+0xa/0x20
Jan 24 16:53:38 lola-8 kernel: [<ffffffff810a12f0>] ? kthread+0x0/0xc0
Jan 24 16:53:38 lola-8 kernel: [<ffffffff8100c280>] ? child_rip+0x0/0x20 |
| Comment by Di Wang [ 25/Jan/17 ] |
|
There are a few interesting processes on lola-8. It may be a deadlock; we need to investigate a bit. Nasf: could you please check whether this lfsck thread is suspicious here? Thanks.

lfsck S 0000000000000010 0 6061 2 0x00000080
ffff880821d9f710 0000000000000046 0000000000000000 ffff880821d9f6d4
0000000000000000 ffff88043fe83e00 00003c78463d5afe ffff880038615a00
0000000000000200 0000000103f26d04 ffff880821d07068 ffff880821d9ffd8
Call Trace:
[<ffffffff8153a9b2>] schedule_timeout+0x192/0x2e0
[<ffffffff81089fa0>] ? process_timeout+0x0/0x10
[<ffffffffa0b2ae61>] ptlrpc_set_wait+0x321/0x960 [ptlrpc]
[<ffffffffa0b200d0>] ? ptlrpc_interrupted_set+0x0/0x120 [ptlrpc]
[<ffffffff81067650>] ? default_wake_function+0x0/0x20
[<ffffffffa0b373c5>] ? lustre_msg_set_jobid+0xf5/0x130 [ptlrpc]
[<ffffffffa0b2b521>] ptlrpc_queue_wait+0x81/0x220 [ptlrpc]
[<ffffffffa1458322>] osp_remote_sync+0xf2/0x1e0 [osp]
[<ffffffffa143b677>] osp_attr_get+0x447/0x710 [osp]
[<ffffffffa09167be>] ? lprocfs_counter_add+0x9e/0x140 [obdclass]
[<ffffffffa143cd45>] osp_object_init+0x1f5/0x360 [osp]
[<ffffffffa0938a58>] lu_object_alloc+0xd8/0x320 [obdclass]
[<ffffffffa0939e41>] lu_object_find_try+0x151/0x260 [obdclass]
[<ffffffffa0db4f3a>] ? fld_cache_lookup+0x3a/0x1e0 [fld]
[<ffffffffa093a001>] lu_object_find_at+0xb1/0xe0 [obdclass]
[<ffffffffa0db8cf3>] ? fld_server_lookup+0x53/0x330 [fld]
[<ffffffffa093a06f>] lu_object_find_slice+0x1f/0x80 [obdclass]
[<ffffffffa120f522>] lfsck_namespace_verify_stripe_slave+0x232/0xbf0 [lfsck]
[<ffffffffa10628cb>] ? ldiskfs_xattr_trusted_get+0x2b/0x30 [ldiskfs]
[<ffffffff8153c200>] ? hrtimer_nanosleep_restart+0x10/0x90
[<ffffffffa11cd375>] lfsck_namespace_open_dir+0x175/0x1a0 [lfsck]
[<ffffffffa11c3466>] lfsck_open_dir+0xa6/0x3a0 [lfsck]
[<ffffffffa11c6267>] lfsck_exec_oit+0x667/0xb30 [lfsck]
[<ffffffffa0db4f3a>] ? fld_cache_lookup+0x3a/0x1e0 [fld]
[<ffffffffa11c6f55>] lfsck_master_oit_engine+0x825/0x15b0 [lfsck]
[<ffffffff81067662>] ? default_wake_function+0x12/0x20
[<ffffffffa11ca650>] lfsck_master_engine+0xb50/0x1460 [lfsck]
[<ffffffff81067650>] ? default_wake_function+0x0/0x20
[<ffffffffa11c9b00>] ? lfsck_master_engine+0x0/0x1460 [lfsck]
[<ffffffff810a138e>] kthread+0x9e/0xc0
[<ffffffff8100c28a>] child_rip+0xa/0x20
[<ffffffff810a12f0>] ? kthread+0x0/0xc0
[<ffffffff8100c280>] ? child_rip+0x0/0x20

mdt_out00_002 D 0000000000000016 0 6070 2 0x00000080
ffff88082230bb00 0000000000000046 0000000000000000 ffff88082230bcc0
ffff8803c91c2f00 ffff8803eb000118 000007f09680dbc0 ffff8803eb000118
ffff88082230bb00 000000010080a9dc ffff8808276045f8 ffff88082230bfd8
Call Trace:
[<ffffffffa0939f8d>] lu_object_find_at+0x3d/0xe0 [obdclass]
[<ffffffffa0b5c612>] ? __req_capsule_get+0x162/0x6e0 [ptlrpc]
[<ffffffff81067650>] ? default_wake_function+0x0/0x20
[<ffffffffa0b31e10>] ? lustre_swab_object_update_reply+0x0/0xc0 [ptlrpc]
[<ffffffffa093ad1c>] dt_locate_at+0x1c/0xa0 [obdclass]
[<ffffffffa0ba2e07>] out_handle+0x1067/0x18d0 [ptlrpc]
[<ffffffff8153afce>] ? mutex_lock+0x1e/0x50
[<ffffffffa0b9308a>] ? req_can_reconstruct+0x6a/0x120 [ptlrpc]
[<ffffffffa0b9a17c>] tgt_request_handle+0x8ec/0x1440 [ptlrpc]
[<ffffffffa0b4566b>] ptlrpc_server_handle_request+0x2eb/0xbd0 [ptlrpc]
[<ffffffffa082a84a>] ? lc_watchdog_touch+0x7a/0x190 [libcfs]
[<ffffffffa0b406e9>] ? ptlrpc_wait_event+0xa9/0x2e0 [ptlrpc]
[<ffffffff81059ca9>] ? __wake_up_common+0x59/0x90
[<ffffffffa0b46a11>] ptlrpc_main+0xac1/0x18d0 [ptlrpc]
[<ffffffffa0b45f50>] ? ptlrpc_main+0x0/0x18d0 [ptlrpc]
[<ffffffff810a138e>] kthread+0x9e/0xc0
[<ffffffff8100c28a>] child_rip+0xa/0x20
[<ffffffff810a12f0>] ? kthread+0x0/0xc0
[<ffffffff8100c280>] ? child_rip+0x0/0x20

mdt00_004 S 0000000000000013 0 6077 2 0x00000080
ffff880837067870 0000000000000046 0000000000000000 ffffffff810a1c1c
ffff880836aaf520 ffff8808370677f0 0000240d5abd2b10 0000000000000000
ffff880837067830 0000000102588d0f ffff880836aafad8 ffff880837067fd8
Call Trace:
[<ffffffff810a1c1c>] ? remove_wait_queue+0x3c/0x50
[<ffffffffa0b04170>] ? ldlm_expired_completion_wait+0x0/0x230 [ptlrpc]
[<ffffffffa0b094dd>] ldlm_completion_ast+0x67d/0x9a0 [ptlrpc]
[<ffffffff81067650>] ? default_wake_function+0x0/0x20
[<ffffffffa0b0886e>] ldlm_cli_enqueue_local+0x21e/0x810 [ptlrpc]
[<ffffffffa0b08e60>] ? ldlm_completion_ast+0x0/0x9a0 [ptlrpc]
[<ffffffffa12807a0>] ? mdt_blocking_ast+0x0/0x2e0 [mdt]
[<ffffffffa128d4e4>] mdt_object_local_lock+0x3a4/0xb00 [mdt]
[<ffffffffa12807a0>] ? mdt_blocking_ast+0x0/0x2e0 [mdt]
[<ffffffffa0b08e60>] ? ldlm_completion_ast+0x0/0x9a0 [ptlrpc]
[<ffffffffa093a001>] ? lu_object_find_at+0xb1/0xe0 [obdclass]
[<ffffffffa128e143>] mdt_object_lock_internal+0x63/0x320 [mdt]
[<ffffffffa128e614>] mdt_object_lock+0x14/0x20 [mdt]
[<ffffffffa1295266>] mdt_getattr_name_lock+0xdd6/0x1910 [mdt]
[<ffffffffa12962c2>] mdt_intent_getattr+0x292/0x470 [mdt]
[<ffffffffa128691e>] mdt_intent_policy+0x4be/0xc70 [mdt]
[<ffffffffa0ae76d7>] ldlm_lock_enqueue+0x127/0x990 [ptlrpc]
[<ffffffffa0b134bb>] ldlm_handle_enqueue0+0x98b/0x14e0 [ptlrpc]
[<ffffffffa0b98fd1>] tgt_enqueue+0x61/0x230 [ptlrpc]
[<ffffffffa0b9a17c>] tgt_request_handle+0x8ec/0x1440 [ptlrpc]
[<ffffffffa0b4566b>] ptlrpc_server_handle_request+0x2eb/0xbd0 [ptlrpc]
[<ffffffffa082a84a>] ? lc_watchdog_touch+0x7a/0x190 [libcfs]
[<ffffffffa0b406e9>] ? ptlrpc_wait_event+0xa9/0x2e0 [ptlrpc]
[<ffffffff81059ca9>] ? __wake_up_common+0x59/0x90
[<ffffffffa0b46a11>] ptlrpc_main+0xac1/0x18d0 [ptlrpc]
[<ffffffffa0b45f50>] ? ptlrpc_main+0x0/0x18d0 [ptlrpc]
[<ffffffff810a138e>] kthread+0x9e/0xc0
[<ffffffff8100c28a>] child_rip+0xa/0x20
[<ffffffff810a12f0>] ? kthread+0x0/0xc0
[<ffffffff8100c280>] ? child_rip+0x0/0x20

mdt00_009 S 0000000000000000 0 6304 2 0x00000080
ffff88063d177930 0000000000000046 ffffffff00000018 ffff88063d177900
ffff88063d177890 ffffffff8153a9ba ffff88063d1778d0 0000000000000286
0000000000000000 000000005888a5eb ffff88063d163068 ffff88063d177fd8
Call Trace:
[<ffffffff8153a9ba>] ? schedule_timeout+0x19a/0x2e0
[<ffffffffa0b04170>] ? ldlm_expired_completion_wait+0x0/0x230 [ptlrpc]
[<ffffffffa0b094dd>] ldlm_completion_ast+0x67d/0x9a0 [ptlrpc]
[<ffffffff81067650>] ? default_wake_function+0x0/0x20
[<ffffffffa0b0886e>] ldlm_cli_enqueue_local+0x21e/0x810 [ptlrpc]
[<ffffffffa0b08e60>] ? ldlm_completion_ast+0x0/0x9a0 [ptlrpc]
[<ffffffffa12807a0>] ? mdt_blocking_ast+0x0/0x2e0 [mdt]
[<ffffffffa128d4e4>] mdt_object_local_lock+0x3a4/0xb00 [mdt]
[<ffffffffa12807a0>] ? mdt_blocking_ast+0x0/0x2e0 [mdt]
[<ffffffffa0b08e60>] ? ldlm_completion_ast+0x0/0x9a0 [ptlrpc]
[<ffffffffa128e143>] mdt_object_lock_internal+0x63/0x320 [mdt]
[<ffffffffa128e614>] mdt_object_lock+0x14/0x20 [mdt]
[<ffffffffa12a781a>] mdt_reint_create+0x6fa/0xfe0 [mdt]
[<ffffffffa12993bc>] ? mdt_root_squash+0x2c/0x3f0 [mdt]
[<ffffffffa0b362bb>] ? lustre_pack_reply_v2+0x1eb/0x280 [ptlrpc]
[<ffffffff81299b7a>] ? strlcpy+0x4a/0x60
[<ffffffffa12997df>] ? ucred_set_jobid+0x5f/0x70 [mdt]
[<ffffffffa129aa2a>] ? old_init_ucred_common+0xda/0x2b0 [mdt]
[<ffffffffa129cf5d>] mdt_reint_rec+0x5d/0x200 [mdt]
[<ffffffffa1287d9b>] mdt_reint_internal+0x62b/0xa50 [mdt]
[<ffffffffa128866b>] mdt_reint+0x6b/0x120 [mdt]
[<ffffffffa0b9a17c>] tgt_request_handle+0x8ec/0x1440 [ptlrpc]
[<ffffffffa0b4566b>] ptlrpc_server_handle_request+0x2eb/0xbd0 [ptlrpc]
[<ffffffffa082a84a>] ? lc_watchdog_touch+0x7a/0x190 [libcfs]
[<ffffffffa0b406e9>] ? ptlrpc_wait_event+0xa9/0x2e0 [ptlrpc]
[<ffffffff81059ca9>] ? __wake_up_common+0x59/0x90
[<ffffffffa0b46a11>] ptlrpc_main+0xac1/0x18d0 [ptlrpc]
[<ffffffffa0b45f50>] ? ptlrpc_main+0x0/0x18d0 [ptlrpc]
[<ffffffff810a138e>] kthread+0x9e/0xc0
[<ffffffff8100c28a>] child_rip+0xa/0x20
[<ffffffff810a12f0>] ? kthread+0x0/0xc0
[<ffffffff8100c280>] ? child_rip+0x0/0x20

mdt00_014 S 0000000000000012 0 6336 2 0x00000080
ffff88062f783990 0000000000000046 0000000000000000 ffff8804137e8398
0000000000000282 0000000000000003 0000073705fbcc95 0000000000000282
ffff88062f783940 0000000100748091 ffff88063393c5f8 ffff88062f783fd8
Call Trace:
[<ffffffffa0baebb2>] top_trans_stop+0x782/0xbd0 [ptlrpc]
[<ffffffff81067650>] ? default_wake_function+0x0/0x20
[<ffffffffa134caac>] lod_trans_stop+0x2bc/0x330 [lod]
[<ffffffffa136dc3c>] ? lod_attr_get+0x9c/0x1b0 [lod]
[<ffffffffa13f1f31>] mdd_trans_stop+0x21/0xc6 [mdd]
[<ffffffffa13d67f2>] mdd_unlink+0x532/0x1020 [mdd]
[<ffffffffa129cec8>] mdo_unlink+0x18/0x50 [mdt]
[<ffffffffa12a5fcf>] mdt_reint_unlink+0xcaf/0x10c0 [mdt]
[<ffffffffa129cf5d>] mdt_reint_rec+0x5d/0x200 [mdt]
[<ffffffffa1287d9b>] mdt_reint_internal+0x62b/0xa50 [mdt]
[<ffffffffa128866b>] mdt_reint+0x6b/0x120 [mdt]
[<ffffffffa0b9a17c>] tgt_request_handle+0x8ec/0x1440 [ptlrpc]
[<ffffffffa0b4566b>] ptlrpc_server_handle_request+0x2eb/0xbd0 [ptlrpc]
[<ffffffffa082a84a>] ? lc_watchdog_touch+0x7a/0x190 [libcfs]
[<ffffffffa0b406e9>] ? ptlrpc_wait_event+0xa9/0x2e0 [ptlrpc]
[<ffffffff81059ca9>] ? __wake_up_common+0x59/0x90
[<ffffffffa0b46a11>] ptlrpc_main+0xac1/0x18d0 [ptlrpc]
[<ffffffffa0b45f50>] ? ptlrpc_main+0x0/0x18d0 [ptlrpc]
[<ffffffff810a138e>] kthread+0x9e/0xc0
[<ffffffff8100c28a>] child_rip+0xa/0x20
[<ffffffff810a12f0>] ? kthread+0x0/0xc0
[<ffffffff8100c280>] ? child_rip+0x0/0x20 |
| Comment by Joseph Gmitter (Inactive) [ 26/Jan/17 ] |
|
Hi Alex, |
| Comment by Alex Zhuravlev [ 26/Jan/17 ] |
|
OK. |
| Comment by nasf (Inactive) [ 27/Jan/17 ] |
|
As explained above: 1) what the dying object is. In fact, if we can get the answers for 1) and/or 2), then getting the answer for 3) may not be difficult. Unfortunately, I still have no answer for any of them. |
| Comment by Alex Zhuravlev [ 27/Jan/17 ] |
|
I think we're ready to get rid of the check for LU_OBJECT_BANSHEE in htable_lookup() - the only reason for it was to prevent access to such objects, but we now have the check in every method. |
| Comment by Jian Yu [ 29/Jan/17 ] |
|
sanity-lfsck test 30 hit a similar failure on the master branch in a review-dne-part-2 test session. Console log on one of the two MDSs:

Lustre: DEBUG MARKER: == sanity-lfsck test 30: LFSCK can recover the orphans from backend /lost+found ====================== 20:00:39 (1485633639)
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param fail_loc=0x161d
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param fail_loc=0
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param fail_loc=0x1624
general protection fault: 0000 [#1] SMP
Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgs(OE) mgc(OE) osd_ldiskfs(OE) lquota(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) sha512_ssse3 sha512_generic crypto_null libcfs(OE) ldiskfs(OE) dm_mod rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod crc_t10dif crct10dif_generic ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_core iosf_mbi crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd ppdev parport_pc parport pcspkr i2c_piix4 virtio_balloon nfsd nfs_acl lockd auth_rpcgss grace sunrpc ip_tables ext4 mbcache jbd2 ata_generic pata_acpi cirrus virtio_blk drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops crct10dif_pclmul crct10dif_common ttm ata_piix crc32c_intel 8139too virtio_pci virtio_ring serio_raw virtio drm i2c_core 8139cp mii libata floppy
CPU: 0 PID: 11163 Comm: mdt00_001 Tainted: G OE ------------ 3.10.0-514.2.2.el7_lustre.x86_64 #1
Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2007
task: ffff880043db5e20 ti: ffff880044818000 task.ti: ffff880044818000
RIP: 0010:[<ffffffff811de300>] [<ffffffff811de300>] kmem_cache_alloc_trace+0x80/0x200
RSP: 0018:ffff88004481b820 EFLAGS: 00010286
RAX: 0000000000000000 RBX: ffff880036ad1c00 RCX: 000000000068969a
RDX: 0000000000689699 RSI: 0000000000008050 RDI: ffff88007d001a00
RBP: ffff88004481b858 R08: 0000000000019aa0 R09: ffff88007d001a00 R10: ffffffffa0aeb29e
R11: 64657474696d6d6f R12: 0002680000000100 R13: 0000000000008050 R14: 0000000000000058 R15: ffff88007d001a00
FS: 0000000000000000(0000) GS:ffff88007fc00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f203169a000 CR3: 000000007b855000 CR4: 00000000000406f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Stack: ffff88007d001a00 ffffffffa0aeb29e ffff880036ad1c00 ffff8800787d3000 ffff880078d57840 ffff88003f82d800 0000000000000000 ffff88004481b8d0 ffffffffa0aeb29e ffff8800395034bb 0000000900000659 0000000000000000
Call Trace:
[<ffffffffa0aeb29e>] ? tgt_last_rcvd_update+0x64e/0xc80 [ptlrpc]
[<ffffffffa0aeb29e>] tgt_last_rcvd_update+0x64e/0xc80 [ptlrpc]
[<ffffffffa0aef49d>] tgt_txn_stop_cb+0x19d/0x490 [ptlrpc]
[<ffffffffa07308b8>] ? libcfs_debug_vmsg2+0x708/0xb70 [libcfs]
[<ffffffffa087063e>] dt_txn_hook_stop+0x7e/0xd0 [obdclass]
[<ffffffffa0cb71fb>] osd_trans_stop+0x12b/0x590 [osd_ldiskfs]
[<ffffffffa0b09509>] top_trans_stop+0x99/0x970 [ptlrpc]
[<ffffffffa0f27399>] lod_trans_stop+0x259/0x340 [lod]
[<ffffffffa0fb5bad>] mdd_trans_stop+0x1d/0x28 [mdd]
[<ffffffffa0fa5493>] mdd_unlink+0x523/0xd80 [mdd]
[<ffffffffa0ea3145>] mdo_unlink+0x43/0x45 [mdt]
[<ffffffffa0e770c4>] mdt_reint_unlink+0xc04/0x11b0 [mdt]
[<ffffffffa0e7a500>] mdt_reint_rec+0x80/0x210 [mdt]
[<ffffffffa0e5cdb2>] mdt_reint_internal+0x5b2/0x9b0 [mdt]
[<ffffffffa0e680f7>] mdt_reint+0x67/0x140 [mdt]
[<ffffffffa0af57b5>] tgt_request_handle+0x915/0x1320 [ptlrpc]
[<ffffffffa0aa282b>] ptlrpc_server_handle_request+0x21b/0xa90 [ptlrpc]
[<ffffffffa0aa03e8>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
[<ffffffff810c4ff2>] ? default_wake_function+0x12/0x20
[<ffffffff810ba238>] ? __wake_up_common+0x58/0x90
[<ffffffffa0aa6830>] ptlrpc_main+0xaa0/0x1de0 [ptlrpc]
[<ffffffffa0aa5d90>] ? ptlrpc_register_service+0xe40/0xe40 [ptlrpc]
[<ffffffff810b064f>] kthread+0xcf/0xe0
[<ffffffff810b0580>] ? kthread_create_on_node+0x140/0x140
[<ffffffff81696898>] ret_from_fork+0x58/0x90
[<ffffffff810b0580>] ? kthread_create_on_node+0x140/0x140
Code: d0 00 00 49 8b 50 08 4d 8b 20 49 8b 40 10 4d 85 e4 0f 84 24 01 00 00 48 85 c0 0f 84 1b 01 00 00 49 63 47 20 48 8d 4a 01 4d 8b 07 <49> 8b 1c 04 4c 89 e0 65 49 0f c7 08 0f 94 c0 84 c0 74 b9 49 63
RIP [<ffffffff811de300>] kmem_cache_alloc_trace+0x80/0x200
RSP <ffff88004481b820> |
| Comment by Cliff White (Inactive) [ 19/Apr/17 ] |
|
Restarted soak with the latest lustre-master (build 3561) and mounted all nodes. |
| Comment by Cliff White (Inactive) [ 20/Apr/17 ] |
|
I left the system overnight, and recovery did eventually complete.

[17166.319310] Lustre: 3626:0:(ldlm_lib.c:1781:extend_recovery_timer()) soaked-MDT0002: extended recovery timer reaching hard limit: 900, extend: 1
[17166.339932] Lustre: 3626:0:(ldlm_lib.c:1781:extend_recovery_timer()) Skipped 211354 previous similar messages
[17462.141184] Lustre: 3626:0:(ldlm_lib.c:2046:target_recovery_overseer()) soaked-MDT0002 recovery is aborted by hard timeout
[17462.156428] Lustre: 3626:0:(ldlm_lib.c:2056:target_recovery_overseer()) recovery is aborted, evict exports in recovery
[17462.174477] Lustre: 3626:0:(ldlm_lib.c:2552:target_recovery_thread()) too long recovery - read logs
[17462.187265] Lustre: soaked-MDT0002: Recovery over after 283:48, of 3 clients 3 recovered and 0 were evicted.
[17462.187267] LustreError: dumping log to /tmp/lustre-log.1492647191.3626
I would still consider this an issue. |
| Comment by Peter Jones [ 20/Apr/17 ] |
|
Lai, could you please advise on this one? Peter |
| Comment by Gerrit Updater [ 05/May/17 ] |
|
Lai Siyao (lai.siyao@intel.com) uploaded a new patch: https://review.whamcloud.com/26965 |
| Comment by Cliff White (Inactive) [ 13/Jun/17 ] |
|
Tested patch on soak. Used patch set 7, build 48150
|
| Comment by Lai Siyao [ 13/Jun/17 ] |
|
This looks to be caused by OOM, and it's not directly related to the patch. |
| Comment by Lai Siyao [ 13/Jun/17 ] |
|
I just rebased the patch to the latest master. Cliff, will you test the latest update again to verify whether it's caused by this patch? |
| Comment by Cliff White (Inactive) [ 16/Jun/17 ] |
|
Tested the latest version of the patch. Had a hard failure on an OSS after OSS failover; it may not be related. |
| Comment by Lai Siyao [ 20/Jun/17 ] |
|
Cliff, could you test soak again with both https://review.whamcloud.com/26965 and https://review.whamcloud.com/#/c/27708/ applied? The latter is the fix for |
| Comment by Cliff White (Inactive) [ 22/Jun/17 ] |
|
Ran 40+ hours in soak, 12+ hours with LFSCK. Did not hit the LBUG. |
| Comment by Cliff White (Inactive) [ 22/Jun/17 ] |
|
May have spoken too soon. Right after I posted the above, we had this:

Jun 22 01:13:20 soak-8 kernel: mdt_out01_020: page allocation failure: order:4, mode:0x10c050
Jun 22 01:13:20 soak-8 kernel: CPU: 10 PID: 6106 Comm: mdt_out01_020 Tainted: P OE ------------ 3.10.0-514.21.1.el7_lustre.x86_64 #1
Jun 22 01:13:20 soak-8 kernel: Hardware name: Intel Corporation S2600GZ ........../S2600GZ, BIOS SE5C600.86B.01.08.0003.022620131521 02/26/2013
Jun 22 01:13:20 soak-8 kernel: 000000000010c050 000000005af41680 ffff8803eb6c38a8 ffffffff81687163
Jun 22 01:13:20 soak-8 kernel: ffff8803eb6c3938 ffffffff81187090 0000000000000000 ffff88083ffd8000
Jun 22 01:13:20 soak-8 kernel: 0000000000000004 000000000010c050 ffff8803eb6c3938 000000005af41680
Jun 22 01:13:20 soak-8 kernel: Call Trace:
Jun 22 01:13:20 soak-8 kernel: [<ffffffff81687163>] dump_stack+0x19/0x1b
Jun 22 01:13:20 soak-8 kernel: [<ffffffff81187090>] warn_alloc_failed+0x110/0x180
Jun 22 01:13:20 soak-8 kernel: [<ffffffff81682cf7>] __alloc_pages_slowpath+0x6b7/0x725
Jun 22 01:13:20 soak-8 kernel: [<ffffffff8118b645>] __alloc_pages_nodemask+0x405/0x420
Jun 22 01:13:20 soak-8 kernel: [<ffffffff811cf7fa>] alloc_pages_current+0xaa/0x170
Jun 22 01:13:20 soak-8 kernel: [<ffffffff81185f6e>] __get_free_pages+0xe/0x50
Jun 22 01:13:20 soak-8 kernel: [<ffffffff811db09e>] kmalloc_order_trace+0x2e/0xa0
Jun 22 01:13:20 soak-8 kernel: [<ffffffff811dd871>] __kmalloc+0x221/0x240
Jun 22 01:13:20 soak-8 kernel: [<ffffffffa0cd0399>] ? lprocfs_counter_add+0xf9/0x160 [obdclass]
Jun 22 01:13:20 soak-8 kernel: [<ffffffffa0f82e1d>] out_handle+0xa5d/0x1920 [ptlrpc]
Jun 22 01:13:20 soak-8 kernel: [<ffffffff810ce4c4>] ? update_curr+0x104/0x190
Jun 22 01:13:20 soak-8 kernel: [<ffffffff810c9bf8>] ? __enqueue_entity+0x78/0x80
Jun 22 01:13:20 soak-8 kernel: [<ffffffffa0f15b82>] ? lustre_msg_get_opc+0x22/0xf0 [ptlrpc]
Jun 22 01:13:20 soak-8 kernel: [<ffffffffa0f78a99>] ? tgt_request_preprocess.isra.26+0x299/0x790 [ptlrpc]
Jun 22 01:13:20 soak-8 kernel: [<ffffffffa0f798a5>] tgt_request_handle+0x915/0x1360 [ptlrpc]
Jun 22 01:13:20 soak-8 kernel: [<ffffffffa0f23143>] ptlrpc_server_handle_request+0x233/0xa90 [ptlrpc]
Jun 22 01:13:20 soak-8 kernel: [<ffffffffa0f20938>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
Jun 22 01:13:20 soak-8 kernel: [<ffffffff810c54f2>] ? default_wake_function+0x12/0x20
Jun 22 01:13:20 soak-8 kernel: [<ffffffff810ba628>] ? __wake_up_common+0x58/0x90
Jun 22 01:13:20 soak-8 kernel: [<ffffffffa0f27120>] ptlrpc_main+0xaa0/0x1dd0 [ptlrpc]
Jun 22 01:13:20 soak-8 kernel: [<ffffffffa0f26680>] ? ptlrpc_register_service+0xe30/0xe30 [ptlrpc]
Jun 22 01:13:20 soak-8 kernel: [<ffffffff810b0a4f>] kthread+0xcf/0xe0
Jun 22 01:13:20 soak-8 kernel: [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
Jun 22 01:13:20 soak-8 kernel: [<ffffffff81697798>] ret_from_fork+0x58/0x90
Jun 22 01:13:20 soak-8 kernel: [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
Jun 22 01:13:20 soak-8 kernel: Mem-Info:
Jun 22 01:13:20 soak-8 kernel: active_anon:8213 inactive_anon:9497 isolated_anon:0
 active_file:4337450 inactive_file:1514906 isolated_file:0
 unevictable:2118 dirty:265 writeback:1 unstable:0
 slab_reclaimable:1636391 slab_unreclaimable:224759
 mapped:10001 shmem:4325 pagetables:1435 bounce:0
 free:77455 free_pcp:174 free_cma:0
Jun 22 01:13:20 soak-8 kernel: Node 1 Normal free:129652kB min:45728kB low:57160kB high:68592kB active_anon:22588kB inactive_anon:24092kB active_file:9510128kB inactive_file:2799920kB unevictable:6448kB isolated(anon):0kB isolated(file):0kB present:16777216kB managed:16498508kB mlocked:6448kB dirty:1388kB writeback:0kB mapped:19960kB shmem:16728kB slab_reclaimable:2991008kB slab_unreclaimable:678664kB kernel_stack:6800kB pagetables:3076kB unstable:0kB bounce:0kB free_pcp:812kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
Jun 22 01:13:20 soak-8 kernel: lowmem_reserve[]: 0 0 0 0
Jun 22 01:13:20 soak-8 kernel: Node 1 Normal: 7564*4kB (UEM) 4942*8kB (UEM) 3212*16kB (UEM) 192*32kB (EM) 12*64kB (EM) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 128096kB
Jun 22 01:13:20 soak-8 kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
Jun 22 01:13:20 soak-8 kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Jun 22 01:13:20 soak-8 kernel: Node 1 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
Jun 22 01:13:20 soak-8 kernel: Node 1 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Jun 22 01:13:20 soak-8 kernel: 5859102 total pagecache pages
Jun 22 01:13:20 soak-8 kernel: 4 pages in swap cache
Jun 22 01:13:20 soak-8 kernel: Swap cache stats: add 16, delete 12, find 0/0
Jun 22 01:13:20 soak-8 kernel: Free swap = 16319420kB
Jun 22 01:13:21 soak-8 kernel: Total swap = 16319484kB
Jun 22 01:13:21 soak-8 kernel: 8369064 pages RAM
Jun 22 01:13:21 soak-8 kernel: 0 pages HighMem/MovableOnly
Jun 22 01:13:21 soak-8 kernel: 241204 pages reserved
Jun 22 01:13:21 soak-8 kernel: BUG: unable to handle kernel paging request at ffffeb04007e4c40
Jun 22 01:13:21 soak-8 kernel: IP: [<ffffffffa0c3c2ff>] lnet_cpt_of_md+0xdf/0x120 [lnet]
Jun 22 01:13:21 soak-8 kernel: PGD 0
Jun 22 01:13:21 soak-8 kernel: Oops: 0000 [#1] SMP
Jun 22 01:13:21 soak-8 kernel: Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgs(OE) mgc(OE) osd_ldiskfs(OE) ldiskfs(OE) lquota(OE) fid(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) zfs(POE) zunicode(POE) zavl(POE) zcommon(POE) znvpair(POE) spl(OE) zlib_deflate 8021q garp mrp stp llc rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm mlx4_ib ib_core intel_powerclamp coretemp intel_rapl iosf_mbi kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd dm_round_robin sb_edac ioatdma edac_core ipmi_devintf sg ntb mei_me mei iTCO_wdt iTCO_vendor_support i2c_i801 lpc_ich ipmi_ssif shpchp pcspkr wmi
Jun 22 01:13:21 soak-8 kernel: ipmi_si ipmi_msghandler dm_multipath nfsd dm_mod nfs_acl lockd grace auth_rpcgss sunrpc ip_tables ext4 mbcache jbd2 sd_mod crc_t10dif crct10dif_generic mlx4_en mgag200 drm_kms_helper isci syscopyarea sysfillrect sysimgblt igb fb_sys_fops ahci crct10dif_pclmul ttm crct10dif_common ptp crc32c_intel libsas libahci pps_core drm mlx4_core mpt2sas dca libata raid_class i2c_algo_bit scsi_transport_sas devlink i2c_core fjes
Jun 22 01:13:21 soak-8 kernel: CPU: 9 PID: 5062 Comm: mdt_out01_014 Tainted: P OE ------------ 3.10.0-514.21.1.el7_lustre.x86_64 #1
Jun 22 01:13:21 soak-8 kernel: Hardware name: Intel Corporation S2600GZ ........../S2600GZ, BIOS SE5C600.86B.01.08.0003.022620131521 02/26/2013
Jun 22 01:13:21 soak-8 kernel: task: ffff880829e50fb0 ti: ffff88080f344000 task.ti: ffff88080f344000
Jun 22 01:13:21 soak-8 kernel: RIP: 0010:[<ffffffffa0c3c2ff>] [<ffffffffa0c3c2ff>] lnet_cpt_of_md+0xdf/0x120 [lnet]
Jun 22 01:13:21 soak-8 kernel: RSP: 0018:ffff88080f347a18 EFLAGS: 00010202
Jun 22 01:13:21 soak-8 kernel: RAX: 00000104007e4c40 RBX: 00050000c0a8016d RCX: 000077ff80000000
Jun 22 01:13:21 soak-8 kernel: RDX: ffffea0000000000 RSI: 0000000000000001 RDI: ffff8804182023c0
Jun 22 01:13:21 soak-8 kernel: RBP: ffff88080f347a18 R08: 0000000000000009 R09: 00000000000090c0
Jun 22 01:13:21 soak-8 kernel: R10: ffffffffa0c46171 R11: ffffc9001f931100 R12: ffff880503b5c000
Jun 22 01:13:21 soak-8 kernel: R13: 00050000c0a8016d R14: 0000000000000001 R15: ffff8804acad0600
Jun 22 01:13:21 soak-8 kernel: FS: 0000000000000000(0000) GS:ffff88082d840000(0000) knlGS:0000000000000000
Jun 22 01:13:21 soak-8 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jun 22 01:13:21 soak-8 kernel: CR2: ffffeb04007e4c40 CR3: 00000000019be000 CR4: 00000000000407e0
Jun 22 01:13:21 soak-8 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Jun 22 01:13:21 soak-8 kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Jun 22 01:13:21 soak-8 kernel: Stack:
Jun 22 01:13:21 soak-8 kernel: ffff88080f347ad8 ffffffffa0c437dc ffff88080f347a90 ffff8804acad0638
Jun 22 01:13:21 soak-8 kernel: ffffea0020b0de00 ffff88080f347af8 ffffffffffffffff ffff880829e50fb0
Jun 22 01:13:21 soak-8 kernel: ffff880829e50fb0 ffff88082c379000 ffff88082c379000 0000000100017a88
Jun 22 01:13:21 soak-8 kernel: Call Trace:
Jun 22 01:13:21 soak-8 kernel: [<ffffffffa0c437dc>] lnet_select_pathway+0x5c/0x1140 [lnet]
Jun 22 01:13:21 soak-8 kernel: [<ffffffffa0c45fb1>] lnet_send+0x51/0x180 [lnet]
Jun 22 01:13:21 soak-8 kernel: [<ffffffffa0c46325>] LNetPut+0x245/0x7a0 [lnet]
Jun 22 01:13:21 soak-8 kernel: [<ffffffffa0f0ebc6>] ptl_send_buf+0x146/0x530 [ptlrpc]
Jun 22 01:13:21 soak-8 kernel: [<ffffffffa0baacde>] ? ktime_get_real_seconds+0xe/0x10 [libcfs]
Jun 22 01:13:21 soak-8 kernel: [<ffffffffa0f31fb7>] ? at_measured+0x1c7/0x380 [ptlrpc]
Jun 22 01:13:21 soak-8 kernel: [<ffffffffa0f11e6b>] ptlrpc_send_reply+0x29b/0x840 [ptlrpc]
Jun 22 01:13:21 soak-8 kernel: [<ffffffffa0ed098e>] target_send_reply_msg+0x8e/0x170 [ptlrpc]
Jun 22 01:13:21 soak-8 kernel: [<ffffffffa0edb476>] target_send_reply+0x306/0x730 [ptlrpc]
Jun 22 01:13:21 soak-8 kernel: [<ffffffffa0f18b07>] ? lustre_msg_set_last_committed+0x27/0xa0 [ptlrpc]
Jun 22 01:13:21 soak-8 kernel: [<ffffffffa0f79517>] tgt_request_handle+0x587/0x1360 [ptlrpc]
Jun 22 01:13:21 soak-8 kernel: [<ffffffffa0f23143>] ptlrpc_server_handle_request+0x233/0xa90 [ptlrpc]
Jun 22 01:13:21 soak-8 kernel: [<ffffffffa0f20938>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
Jun 22 01:13:21 soak-8 kernel: [<ffffffff810c54f2>] ? default_wake_function+0x12/0x20
Jun 22 01:13:21 soak-8 kernel: [<ffffffff810ba628>] ? __wake_up_common+0x58/0x90
Jun 22 01:13:21 soak-8 kernel: [<ffffffffa0f27120>] ptlrpc_main+0xaa0/0x1dd0 [ptlrpc]
Jun 22 01:13:21 soak-8 kernel: [<ffffffffa0f26680>] ? ptlrpc_register_service+0xe30/0xe30 [ptlrpc]
Jun 22 01:13:21 soak-8 kernel: [<ffffffff810b0a4f>] kthread+0xcf/0xe0
Jun 22 01:13:21 soak-8 kernel: [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
Jun 22 01:13:21 soak-8 kernel: [<ffffffff81697798>] ret_from_fork+0x58/0x90
Jun 22 01:13:21 soak-8 kernel: [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
Jun 22 01:13:21 soak-8 kernel: Code: ff 77 00 00 48 8b 3d 21 53 03 00 48 01 d0 48 0f 42 0d 26 8d d8 e0 48 ba 00 00 00 00 00 ea ff ff 48 01 c8 48 c1 e8 0c 48 c1 e0 06 <48> 8b 34 10 48 c1 ee 36 e8 e4 ec f6 ff 5d c3 66 90 b8 ff ff ff
Jun 22 01:13:21 soak-8 kernel: RIP [<ffffffffa0c3c2ff>] lnet_cpt_of_md+0xdf/0x120 [lnet]
Jun 22 01:13:21 soak-8 kernel: RSP <ffff88080f347a18>
Jun 22 01:13:21 soak-8 kernel: CR2: ffffeb04007e4c40

And on MDT001:

Jun 22 01:13:20 soak-9 kernel: LustreError: 11-0: soaked-MDT0000-osp-MDT0001: operation out_update to node 192.168.1.108@o2ib failed: rc = -12
Jun 22 01:13:20 soak-9 kernel: LustreError: 4612:0:(layout.c:2085:__req_capsule_get()) @@@ Wrong buffer for field `object_update_reply' (1 of 1) in format `OUT_UPDATE': 0 vs. 4096 (server)
 req@ffff88081121e300 x1570858955230384/t0(0) o1000->soaked-MDT0000-osp-MDT0001@192.168.1.108@o2ib:24/4 lens 376/192 e 0 to 0 dl 1498094043 ref 2 fl Interpret:ReM/0/0 rc -12/-12
Jun 22 01:13:20 soak-9 kernel: general protection fault: 0000 [#1] SMP
Jun 22 01:13:20 soak-9 kernel: Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgc(OE) osd_ldiskfs(OE) ldiskfs(OE) lquota(OE) fid(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) zfs(POE) zunicode(POE) zavl(POE) zcommon(POE) znvpair(POE) spl(OE) zlib_deflate 8021q garp mrp stp llc rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm mlx4_ib ib_core intel_powerclamp coretemp intel_rapl iosf_mbi kvm dm_round_robin irqbypass crc32_pclmul ghash_clmulni_intel ipmi_ssif aesni_intel sg lrw gf128mul ipmi_devintf glue_helper ablk_helper cryptd mei_me mei ntb ipmi_si ioatdma ipmi_msghandler iTCO_wdt iTCO_vendor_support lpc_ich wmi pcspkr sb_edac i2c_i801
Jun 22 01:13:20 soak-9 kernel: edac_core shpchp dm_multipath dm_mod nfsd nfs_acl lockd grace auth_rpcgss sunrpc ip_tables ext4 mbcache jbd2 sd_mod crc_t10dif crct10dif_generic mlx4_en mgag200 drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops igb ttm isci crct10dif_pclmul crct10dif_common ptp ahci libsas crc32c_intel libahci pps_core drm mlx4_core mpt2sas libata dca raid_class i2c_algo_bit scsi_transport_sas devlink i2c_core fjes
Jun 22 01:13:20 soak-9 kernel: CPU: 11 PID: 4612 Comm: osp_up0-1 Tainted: P OE ------------ 3.10.0-514.21.1.el7_lustre.x86_64 #1
Jun 22 01:13:20 soak-9 kernel: Hardware name: Intel Corporation S2600GZ ........../S2600GZ, BIOS SE5C600.86B.01.08.0003.022620131521 02/26/2013
Jun 22 01:13:20 soak-9 kernel: task: ffff8808155c0fb0 ti: ffff8808155c8000 task.ti: ffff8808155c8000
Jun 22 01:13:20 soak-9 kernel: RIP: 0010:[<ffffffffa14fe0d3>] [<ffffffffa14fe0d3>] osp_send_update_thread+0x1d3/0x600 [osp]
Jun 22 01:13:20 soak-9 kernel: RSP: 0018:ffff8808155cbe00 EFLAGS: 00010282
Jun 22 01:13:20 soak-9 kernel: RAX: ffff8803f3ad1900 RBX: ffff8804014772a0 RCX: 000000018040003e
Jun 22 01:13:20 soak-9 kernel: RDX: 000000018040003f RSI: 5a5a5a5a5a5a5a5a RDI: 0000000040000000
Jun 22 01:13:20 soak-9 kernel: RBP: ffff8808155cbec0 R08: ffff8807c6b12600 R09: 000000018040003e
Jun 22 01:13:20 soak-9 kernel: R10: 00000000c6b11e01 R11: ffffea001f1ac400 R12: ffff8808155c0fb0
Jun 22 01:13:20 soak-9 kernel: R13: ffff880401b55000 R14: 00000000fffffff4 R15: ffff8804014772b0
Jun 22 01:13:20 soak-9 kernel: FS: 0000000000000000(0000) GS:ffff88082d8c0000(0000) knlGS:0000000000000000
Jun 22 01:13:20 soak-9 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jun 22 01:13:20 soak-9 kernel: CR2: 00007f8d82582000 CR3: 00000000019be000 CR4: 00000000000407e0
Jun 22 01:13:20 soak-9 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Jun 22 01:13:20 soak-9 kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Jun 22 01:13:20 soak-9 kernel: Stack:
Jun 22 01:13:20 soak-9 kernel: ffff880401b551e8 ffff8808155c0fb0 ffff8808155c0fb0 ffff8803f3ad1900
Jun 22 01:13:20 soak-9 kernel: 0000000000000000 0000000000000000 0000000000000000 ffff8808155c0fb0
Jun 22 01:13:20 soak-9 kernel: ffffffff810c54e0 dead000000000100 dead000000000200 0000000210000003
Jun 22 01:13:20 soak-9 kernel: Call Trace:
Jun 22 01:13:20 soak-9 kernel: [<ffffffff810c54e0>] ? wake_up_state+0x20/0x20
Jun 22 01:13:20 soak-9 kernel: [<ffffffffa14fdf00>] ? osp_invalidate_request+0x390/0x390 [osp]
Jun 22 01:13:20 soak-9 kernel: [<ffffffff810b0a4f>] kthread+0xcf/0xe0
Jun 22 01:13:20 soak-9 kernel: [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
Jun 22 01:13:20 soak-9 kernel: [<ffffffff81697798>] ret_from_fork+0x58/0x90
Jun 22 01:13:20 soak-9 kernel: [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
Jun 22 01:13:20 soak-9 kernel: Code: 19 e0 48 8b 95 58 ff ff ff 48 8b 52 48 48 3b 53 30 74 5d 4c 89 ff e8 fd 08 19 e0 45 85 f6 78 65 48 8b 85 58 ff ff ff 48 8b 70 40 <f0> ff 8e 88 00 00 00 0f 94 c0 84 c0 0f 84 db fe ff ff 48 8d 7d
Jun 22 01:13:20 soak-9 kernel: RIP [<ffffffffa14fe0d3>] osp_send_update_thread+0x1d3/0x600 [osp]
Jun 22 01:13:20 soak-9 kernel: RSP <ffff8808155cbe00>
Jun 22 01:13:20 soak-9 kernel: ---[ end trace 6ac75bb1f736a48a ]---
Jun 22 01:13:20 soak-9 kernel: Kernel panic - not syncing: Fatal exception

Is this related or new? |
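For context on the first trace: "order:4" in the page allocation failure means the kernel needed 2^4 = 16 physically contiguous pages for a single __kmalloc() in out_handle(); on a long-running, fragmented server such high-order allocations can fail even with plenty of total free memory (note the free lists above show zero blocks at 128kB and larger). A quick illustration of the arithmetic (sketch only, not Lustre code; PAGE_SIZE assumed 4 KiB as on this x86_64 node):

```python
PAGE_SIZE = 4096  # assumed x86_64 page size

def order_to_bytes(order: int) -> int:
    """Size of one buddy-allocator block of the given order (2^order pages)."""
    return PAGE_SIZE << order

# The failed allocation in the log above was order:4,
# i.e. 16 physically contiguous pages:
print(order_to_bytes(4))               # 65536 bytes = 64 KiB
print(order_to_bytes(4) // PAGE_SIZE)  # 16 pages
```

This is why large RPC reply buffers are usually allocated with a vmalloc fallback rather than a plain contiguous kmalloc.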
| Comment by nasf (Inactive) [ 22/Jun/17 ] |
|
The new lnet_cpt_of_md trouble looks more like |
| Comment by Gerrit Updater [ 22/Jun/17 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/26965/ |
| Comment by Peter Jones [ 22/Jun/17 ] |
|
Landed for 2.10 |
| Comment by James A Simmons [ 07/May/18 ] |
|
While porting this to the upstream kernel, the question came up of which dependencies need to land before this patch does. |
| Comment by Lai Siyao [ 08/May/18 ] |
|
https://review.whamcloud.com/#/c/27708/ is the fix. I'd suggest porting the latter two patches; there is no direct dependency between them. |