[LU-7372] replay-dual test_26: test failed to respond and timed out Created: 02/Nov/15 Updated: 23/Dec/21 Resolved: 22/Jul/21 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.8.0, Lustre 2.9.0, Lustre 2.10.0, Lustre 2.11.0, Lustre 2.12.0, Lustre 2.10.3, Lustre 2.10.4, Lustre 2.10.5, Lustre 2.12.4 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Maloo | Assignee: | Zhenyu Xu |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Environment: |
Server/Client: master, build #3225, RHEL 6.7 |
||
| Attachments: |
|
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
This issue was created by maloo for Saurabh Tandan <saurabh.tandan@intel.com> This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/1e79d2a6-7d21-11e5-a254-5254006e85c2. The sub-test test_26 failed with the following error: test failed to respond and timed out Client dmesg: Lustre: DEBUG MARKER: test_26 fail mds1 1 times LustreError: 980:0:(ldlm_request.c:130:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1445937610, 300s ago), entering recovery for MGS@10.2.4.140@tcp ns: MGC10.2.4.140@tcp lock: ffff88007bdd82c0/0x956ab2c8047544d6 lrc: 4/1,0 mode: --/CR res: [0x65727473756c:0x2:0x0].0x0 rrc: 1 type: PLN flags: 0x1000000000000 nid: local remote: 0x223a79061b204538 expref: -99 pid: 980 timeout: 0 lvb_type: 0 Lustre: 29433:0:(client.c:2039:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1445937910/real 1445937910] req@ffff880028347980 x1516173751413108/t0(0) o250->MGC10.2.4.140@tcp@10.2.4.140@tcp:26/25 lens 520/544 e 0 to 1 dl 1445937916 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 Lustre: 29433:0:(client.c:2039:ptlrpc_expire_one_request()) Skipped 67 previous similar messages MDS console: 09:22:17:LustreError: 24638:0:(client.c:1138:ptlrpc_import_delay_req()) @@@ IMP_CLOSED req@ffff88004d92c980 x1516158358024328/t0(0) o101->lustre-MDT0000-lwp-MDT0000@0@lo:23/10 lens 456/496 e 0 to 0 dl 0 ref 2 fl Rpc:/0/ffffffff rc 0/-1 09:25:19:LustreError: 24638:0:(client.c:1138:ptlrpc_import_delay_req()) Skipped 6 previous similar messages 09:25:19:LustreError: 24638:0:(qsd_reint.c:55:qsd_reint_completion()) lustre-MDT0000: failed to enqueue global quota lock, glb fid:[0x200000006:0x10000:0x0], rc:-5 09:25:19:LustreError: 24638:0:(qsd_reint.c:55:qsd_reint_completion()) Skipped 1 previous similar message 09:25:19:INFO: task umount:24629 blocked for more than 120 seconds. 09:25:19: Not tainted 2.6.32-573.7.1.el6_lustre.x86_64 #1 09:25:19:"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 09:25:19:umount D 0000000000000000 0 24629 24628 0x00000080 09:25:19: ffff880059e2bb48 0000000000000086 0000000000000000 00000000000708b7 09:25:20: 0000603500000000 000000ac00000000 00001c1fd9b9c014 ffff880059e2bb98 09:25:20: ffff880059e2bb58 0000000101d3458a ffff880076ee3ad8 ffff880059e2bfd8 09:25:20:Call Trace: 09:25:20: [<ffffffff8153a756>] __mutex_lock_slowpath+0x96/0x210 09:25:20: [<ffffffff8153a27b>] mutex_lock+0x2b/0x50 09:25:20: [<ffffffffa02cb30d>] mgc_process_config+0x1dd/0x1210 [mgc] 09:25:20: [<ffffffffa0476b61>] ? libcfs_debug_msg+0x41/0x50 [libcfs] 09:25:20: [<ffffffffa07fe28d>] obd_process_config.clone.0+0x8d/0x2e0 [obdclass] 09:25:20: [<ffffffffa0476b61>] ? libcfs_debug_msg+0x41/0x50 [libcfs] 09:25:20: [<ffffffffa08024c2>] lustre_end_log+0x262/0x6a0 [obdclass] 09:25:20: [<ffffffffa082efb1>] server_put_super+0x911/0xed0 [obdclass] 09:25:20: [<ffffffff811b0116>] ? invalidate_inodes+0xf6/0x190 09:25:20: [<ffffffff8119437b>] generic_shutdown_super+0x5b/0xe0 09:25:20: [<ffffffff81194466>] kill_anon_super+0x16/0x60 09:25:20: [<ffffffffa07fa096>] lustre_kill_super+0x36/0x60 [obdclass] 09:25:20: [<ffffffff81194c07>] deactivate_super+0x57/0x80 09:25:20: [<ffffffff811b4a7f>] mntput_no_expire+0xbf/0x110 09:25:20: [<ffffffff811b55cb>] sys_umount+0x7b/0x3a0 09:25:20: [<ffffffff8100b0d2>] system_call_fastpath+0x16/0x1b Info required for matching: replay-dual test_26 |
| Comments |
| Comment by Peter Jones [ 04/Dec/15 ] |
|
Bobijam Could you please advise on this one? Thanks Peter |
| Comment by Saurabh Tandan (Inactive) [ 11/Dec/15 ] |
|
master, build# 3264, 2.7.64 tag |
| Comment by Sarah Liu [ 14/Dec/15 ] |
|
Another instance seen on RHEL 6.7 server/SLES11 SP3 client |
| Comment by Zhenyu Xu [ 16/Dec/15 ] |
|
mgs_ir_fini_fs() was waiting for the completion signal supposed to be issued by mgs_ir_notify(), while mgs_ir_notify() was waiting in the previous loop for the revoking of a CONFIG_T_RECOVERY lock indefinitely. mgs_ir_fini_fs [11146.534027] umount D ffff88007fc13680 0 416 415 0x00000080 [11146.534027] ffff88007a183a00 0000000000000086 ffff88007a183fd8 0000000000013680 [11146.534027] ffff88007a183fd8 0000000000013680 ffff880078725b00 ffff880061536728 [11146.534027] ffff880061536730 7fffffffffffffff ffff880078725b00 ffff880061536600 [11146.534027] Call Trace: [11146.534027] [<ffffffff8160a709>] schedule+0x29/0x70 [11146.534027] [<ffffffff81608649>] schedule_timeout+0x209/0x2d0 [11146.534027] [<ffffffff81601c94>] ? __slab_free+0x10e/0x277 [11146.534027] [<ffffffff8160ac16>] wait_for_completion+0x116/0x170 [11146.534027] [<ffffffff810a9660>] ? wake_up_state+0x20/0x20 [11146.534027] [<ffffffffa0cf0930>] mgs_ir_fini_fs+0x250/0x46c [mgs] [11146.534027] [<ffffffffa0cd8fac>] mgs_free_fsdb+0x4c/0xcd0 [mgs] [11146.534027] [<ffffffffa0ce5502>] mgs_cleanup_fsdb_list+0x52/0x70 [mgs] [11146.534027] [<ffffffffa0cc9f27>] mgs_device_fini+0x97/0x5b0 [mgs] [11146.534027] [<ffffffffa079b4e4>] class_cleanup+0x734/0xcc0 [obdclass] [11146.534027] [<ffffffffa079dd83>] class_process_config+0x1bf3/0x2cf0 [obdclass] [11146.534027] [<ffffffffa067b3c7>] ? libcfs_debug_msg+0x57/0x80 [libcfs] [11146.534027] [<ffffffffa079ef6f>] class_manual_cleanup+0xef/0xba0 [obdclass] [11146.534027] [<ffffffffa07d6880>] server_put_super+0xcc0/0xea0 [obdclass] [11146.534027] [<ffffffff811c9426>] generic_shutdown_super+0x56/0xe0 [11146.534027] [<ffffffff811c9692>] kill_anon_super+0x12/0x20 [11146.534027] [<ffffffffa07a2c42>] lustre_kill_super+0x32/0x50 [obdclass] [11146.534027] [<ffffffff811c9a3d>] deactivate_locked_super+0x3d/0x60 [11146.534027] [<ffffffff811ca046>] deactivate_super+0x46/0x60 [11146.534027] [<ffffffff811e6f35>] mntput_no_expire+0xc5/0x120 [11146.534027] [<ffffffff811e806f>] SyS_umount+0x9f/0x3c0 [11146.534027] [<ffffffff81615309>] system_call_fastpath+0x16/0x1b mgs_ir_notify [11146.534027] mgs_lustre_noti S ffff88007fd13680 0 28657 2 0x00000080 [11146.534027] ffff880060a43ba8 0000000000000046 ffff880060a43fd8 0000000000013680 [11146.534027] ffff880060a43fd8 0000000000013680 ffff88005b4b16c0 ffff88005fddba00 [11146.534027] ffff88005b4b16c0 0000000000000000 ffffffffa09d4210 ffff88005b4b16c0 [11146.534027] Call Trace: [11146.534027] [<ffffffffa09d4210>] ? ldlm_completion_ast_async+0x300/0x300 [ptlrpc] [11146.534027] [<ffffffff8160a709>] schedule+0x29/0x70 [11146.534027] [<ffffffffa09d4a8d>] ldlm_completion_ast+0x62d/0x910 [ptlrpc] [11146.534027] [<ffffffff810a9660>] ? wake_up_state+0x20/0x20 [11146.534027] [<ffffffffa0cc8721>] mgs_completion_ast_generic+0xb1/0x1d0 [mgs] [11146.534027] [<ffffffffa0cc8853>] mgs_completion_ast_ir+0x13/0x20 [mgs] [11146.534027] [<ffffffffa09d7150>] ldlm_cli_enqueue_local+0x230/0x940 [ptlrpc] [11146.534027] [<ffffffffa0cc8840>] ? mgs_completion_ast_generic+0x1d0/0x1d0 [mgs] [11146.534027] [<ffffffffa09d9f40>] ? ldlm_blocking_ast_nocheck+0x310/0x310 [ptlrpc] [11146.534027] [<ffffffffa0ccfdcc>] mgs_revoke_lock+0x1dc/0x360 [mgs] [11146.534027] [<ffffffffa09d9f40>] ? ldlm_blocking_ast_nocheck+0x310/0x310 [ptlrpc] [11146.534027] [<ffffffffa0cc8840>] ? mgs_completion_ast_generic+0x1d0/0x1d0 [mgs] [11146.534027] [<ffffffffa0cece4a>] mgs_ir_notify+0x14a/0x2b0 [mgs] [11146.534027] [<ffffffff810a9660>] ? wake_up_state+0x20/0x20 [11146.534027] [<ffffffffa0cecd00>] ? 
lprocfs_ir_set_state+0x170/0x170 [mgs] [11146.534027] [<ffffffff810973af>] kthread+0xcf/0xe0 [11146.534027] [<ffffffff810972e0>] ? kthread_create_on_node+0x140/0x140 [11146.534027] [<ffffffff81615258>] ret_from_fork+0x58/0x90 [11146.534027] [<ffffffff810972e0>] ? kthread_create_on_node+0x140/0x140 |
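To make the two-thread dependency above easier to follow, here is a minimal userspace sketch of the deadlock shape, assuming a pthread analogue: it is not Lustre code, and all names below are illustrative stand-ins for wait_for_completion() in mgs_ir_fini_fs() and the unbounded lock wait in mgs_ir_notify()/ldlm_completion_ast().

/*
 * Minimal userspace sketch (NOT Lustre code) of the hang: the "umount" side
 * waits for a completion that the "notify" side will only signal after an
 * unbounded wait for a lock grant that never arrives.
 */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

static pthread_mutex_t lock_mtx = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  lock_granted = PTHREAD_COND_INITIALIZER;	/* never signaled */
static pthread_mutex_t done_mtx = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  done_cv = PTHREAD_COND_INITIALIZER;
static bool done;

/* Stands in for mgs_ir_notify(): it must see the lock granted (or cancelled)
 * before it can signal completion, but that never happens. */
static void *notify_thread(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&lock_mtx);
	pthread_cond_wait(&lock_granted, &lock_mtx);	/* unbounded wait, like the completion AST */
	pthread_mutex_unlock(&lock_mtx);

	pthread_mutex_lock(&done_mtx);
	done = true;
	pthread_cond_signal(&done_cv);			/* never reached */
	pthread_mutex_unlock(&done_mtx);
	return NULL;
}

int main(void)
{
	pthread_t t;

	pthread_create(&t, NULL, notify_thread, NULL);

	/* Stands in for mgs_ir_fini_fs()/umount calling wait_for_completion() */
	pthread_mutex_lock(&done_mtx);
	while (!done)
		pthread_cond_wait(&done_cv, &done_mtx);	/* hangs forever */
	pthread_mutex_unlock(&done_mtx);

	printf("never reached\n");
	return 0;
}

The umount side cannot make progress because the only thread that would complete it is itself parked in an unbounded wait.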
| Comment by Jinshan Xiong (Inactive) [ 17/Dec/15 ] |
|
Hi Bobijam, Do you know which lock is this process waiting for? "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. umount D 0000000000000000 0 24629 24628 0x00000080 ffff880059e2bb48 0000000000000086 0000000000000000 00000000000708b7 0000603500000000 000000ac00000000 00001c1fd9b9c014 ffff880059e2bb98 ffff880059e2bb58 0000000101d3458a ffff880076ee3ad8 ffff880059e2bfd8 Call Trace: [<ffffffff8153a756>] __mutex_lock_slowpath+0x96/0x210 [<ffffffff8153a27b>] mutex_lock+0x2b/0x50 [<ffffffffa02cb30d>] mgc_process_config+0x1dd/0x1210 [mgc] [<ffffffffa0476b61>] ? libcfs_debug_msg+0x41/0x50 [libcfs] [<ffffffffa07fe28d>] obd_process_config.clone.0+0x8d/0x2e0 [obdclass] [<ffffffffa0476b61>] ? libcfs_debug_msg+0x41/0x50 [libcfs] [<ffffffffa08024c2>] lustre_end_log+0x262/0x6a0 [obdclass] [<ffffffffa082efb1>] server_put_super+0x911/0xed0 [obdclass] [<ffffffff811b0116>] ? invalidate_inodes+0xf6/0x190 [<ffffffff8119437b>] generic_shutdown_super+0x5b/0xe0 [<ffffffff81194466>] kill_anon_super+0x16/0x60 [<ffffffffa07fa096>] lustre_kill_super+0x36/0x60 [obdclass] [<ffffffff81194c07>] deactivate_super+0x57/0x80 [<ffffffff811b4a7f>] mntput_no_expire+0xbf/0x110 [<ffffffff811b55cb>] sys_umount+0x7b/0x3a0 It looks like the lock cancellation was blocked by this lock somehow. |
| Comment by Zhenyu Xu [ 17/Dec/15 ] |
|
but in the MDS debug log, the same process later shows a different backtrace, and hung in that situation ever since. 09:25:21:Lustre: lock timed out (enqueued at 1445937602, 300s ago) .... 09:31:08:"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 09:31:08:umount D 0000000000000000 0 24629 24628 0x00000080 09:31:08: ffff880059e2b948 0000000000000086 0000000000000000 ffffffffa046a27b 09:31:08: 00000000562f42f5 ffff8800378fd480 00001c639f546de4 ffff880052840c20 09:31:08: ffff8800378fd480 0000000101d7b690 ffff880076ee3ad8 ffff880059e2bfd8 09:31:08:Call Trace: 09:31:08: [<ffffffffa046a27b>] ? cfs_set_ptldebug_header+0x2b/0xc0 [libcfs] 09:31:08: [<ffffffff81539cd5>] schedule_timeout+0x215/0x2e0 09:31:08: [<ffffffff81539953>] wait_for_common+0x123/0x180 09:31:08: [<ffffffff810672b0>] ? default_wake_function+0x0/0x20 09:31:08: [<ffffffff81539a6d>] wait_for_completion+0x1d/0x20 09:31:08: [<ffffffffa0640fb1>] mgs_ir_fini_fs+0x1e1/0x340 [mgs] 09:31:08: [<ffffffffa0637348>] mgs_free_fsdb+0x48/0xf40 [mgs] 09:31:08: [<ffffffffa063828f>] mgs_cleanup_fsdb_list+0x4f/0x70 [mgs] 09:31:08: [<ffffffffa06209a0>] mgs_device_fini+0x120/0x5b0 [mgs] 09:31:08: [<ffffffffa07f3d42>] class_cleanup+0x572/0xd20 [obdclass] 09:31:08: [<ffffffffa07d4906>] ? class_name2dev+0x56/0xe0 [obdclass] 09:31:08: [<ffffffffa07f63c6>] class_process_config+0x1ed6/0x2830 [obdclass] 09:31:08: [<ffffffffa0476b61>] ? libcfs_debug_msg+0x41/0x50 [libcfs] 09:31:08: [<ffffffffa07f71df>] class_manual_cleanup+0x4bf/0x8e0 [obdclass] 09:31:08: [<ffffffffa07d4906>] ? class_name2dev+0x56/0xe0 [obdclass] 09:31:08: [<ffffffffa082f374>] server_put_super+0xcd4/0xed0 [obdclass] 09:31:08: [<ffffffff811b0116>] ? invalidate_inodes+0xf6/0x190 09:31:08: [<ffffffff8119437b>] generic_shutdown_super+0x5b/0xe0 09:31:08: [<ffffffff81194466>] kill_anon_super+0x16/0x60 09:31:08: [<ffffffffa07fa096>] lustre_kill_super+0x36/0x60 [obdclass] 09:31:08: [<ffffffff81194c07>] deactivate_super+0x57/0x80 09:31:09: [<ffffffff811b4a7f>] mntput_no_expire+0xbf/0x110 09:31:09: [<ffffffff811b55cb>] sys_umount+0x7b/0x3a0 09:31:09: [<ffffffff8100b0d2>] system_call_fastpath+0x16/0x1b |
| Comment by Jinshan Xiong (Inactive) [ 17/Dec/15 ] |
|
That's a different issue. The MGS should cancel the lock after lock expiration. |
| Comment by Zhenyu Xu [ 18/Dec/15 ] |
|
[<ffffffffa02cb30d>] mgc_process_config+0x1dd/0x1210 [mgc] locates config_log_end():

	if (cld_recover) {
		mutex_lock(&cld_recover->cld_lock);	// -----------> here
		cld_recover->cld_stopping = 1;
		mutex_unlock(&cld_recover->cld_lock);
		config_log_put(cld_recover);
	} |
| Comment by Saurabh Tandan (Inactive) [ 18/Dec/15 ] |
|
Another instance for EL6.7 Server/EL6.7 Client - ZFS |
| Comment by Saurabh Tandan (Inactive) [ 18/Dec/15 ] |
|
Another instance for EL7.1 Server/EL7.1 Client - ZFS |
| Comment by Saurabh Tandan (Inactive) [ 19/Dec/15 ] |
|
Another instance for EL7.1 Server/EL7.1 Client - DNE |
| Comment by Jinshan Xiong (Inactive) [ 19/Dec/15 ] |
|
Hi Bobijam, Please investigate why config log end could be blocked there; I guess there must be a deadlock or livelock between the lock cancellation code and the log end code. For the hang issue, we should make a fix in mgs_completion_ast_generic(): if ldlm_completion_ast() returns a failure due to timeout, it should cancel the lock by itself. Jinshan |
| Comment by Zhenyu Xu [ 24/Dec/15 ] |
|
Jinshan, I'm wondering whether the server needs the IR log at all, since only clients need it.

diff --git a/lustre/mgc/mgc_request.c b/lustre/mgc/mgc_request.c
index ba1a2e1..0f392dd 100644
--- a/lustre/mgc/mgc_request.c
+++ b/lustre/mgc/mgc_request.c
@@ -360,7 +360,8 @@ static int config_log_add(struct obd_device *obd, char *logname,
 	cld->cld_params = params_cld;

 	LASSERT(lsi->lsi_lmd);
-	if (!(lsi->lsi_lmd->lmd_flags & LMD_FLG_NOIR)) {
+	if (!(lsi->lsi_lmd->lmd_flags & LMD_FLG_NOIR) &&
+	    !lmd_is_client(lsi->lsi_lmd)) {
 		struct config_llog_data *recover_cld;
 		ptr = strrchr(seclogname, '-');
 		if (ptr != NULL) { |
| Comment by Jinshan Xiong (Inactive) [ 29/Dec/15 ] |
|
Hi Bobijam, Servers need to be notified too - for example, MDTs need to reconnect to restarted OSTs. |
| Comment by Zhenyu Xu [ 29/Dec/15 ] |
|
I'm wondering whether this has to do with the OBD_CONNECT_MNE_SWAB/OBD_CONNECT_MDS_MDS change, especially commit bee9c1897677473f12c0b807edd3e8fec452bc32, which "Do not evict MDS-MDS connection" while "The MNE_SWAB flag is overloading the MDS_MDS bit only for the MGS connection". The umount is waiting for a config_llog_data::cld_lock mutex, which is held by the re-queue thread, and that thread is waiting in ldlm_completion_ast() for the MGC lock to be granted or cancelled, which never happens. |
| Comment by Jinshan Xiong (Inactive) [ 05/Jan/16 ] |
|
It's not related - that can only be seen on heterogeneous nodes. In that case, we can make a fix to destroy the lock in mgs_completion_ast_generic() if it times out waiting for lock completion. |
| Comment by Jinshan Xiong (Inactive) [ 06/Jan/16 ] |
|
I took a further look at this issue. It turned out that when an LDLM lock completion times out in ldlm_completion_ast(), the lock is not actually timed out; the code just prints an error message and keeps waiting forever. Once the completion AST times out for the MGS, it only destroys the in-flight RPC but leaves the lock alone, which caused the hang in this ticket. I will create a patch to fix the problems I have seen so far. |
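The fix direction described here - bound the wait and clean up the lock when the completion times out instead of waiting forever - can be sketched with a userspace analogue (again an assumption-laden illustration, not the actual LDLM/MGS code in the patch):

/*
 * Illustrative userspace sketch of the fix idea: wait for the completion
 * with a bound, and on timeout cancel/clean up the lock rather than waiting
 * forever.  Nothing grants the "lock" in this sketch, so the wait always
 * expires and the cleanup path runs.
 */
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <time.h>

static pthread_mutex_t mtx = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cv = PTHREAD_COND_INITIALIZER;
static bool granted;			/* would be set when the lock is granted */

static void cancel_lock(void)
{
	/* stands in for cancelling/destroying the lock on timeout */
	fprintf(stderr, "completion timed out, cancelling lock\n");
}

int main(void)
{
	struct timespec deadline;
	int rc = 0;

	clock_gettime(CLOCK_REALTIME, &deadline);
	deadline.tv_sec += 5;		/* bounded wait instead of forever */

	pthread_mutex_lock(&mtx);
	while (!granted && rc == 0)
		rc = pthread_cond_timedwait(&cv, &mtx, &deadline);
	pthread_mutex_unlock(&mtx);

	if (rc == ETIMEDOUT)
		cancel_lock();		/* don't leave the lock dangling */
	else if (rc == 0)
		printf("lock granted\n");
	return 0;
}

In the MGS case the timeout branch corresponds to cancelling/destroying the lock, so neither the lock nor the umount thread is left waiting indefinitely.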
| Comment by Gerrit Updater [ 06/Jan/16 ] |
|
Jinshan Xiong (jinshan.xiong@intel.com) uploaded a new patch: http://review.whamcloud.com/17853 |
| Comment by Jay Lan (Inactive) [ 09/Jan/16 ] |
|
The above patch caused a compilation error in b2_5_fe. The patch made a call to ldlm_namespace_free_prior() from lustre/mgs/mgs_handler.c, but |
| Comment by Jinshan Xiong (Inactive) [ 12/Jan/16 ] |
|
Hi Jay, I pushed a new patch that should apply to b2_5_fe. Please give it a try. |
| Comment by Jay Lan (Inactive) [ 12/Jan/16 ] |
|
Hi Jinshan, The new patch is even further away from b2_5_fe |
| Comment by Zhenyu Xu [ 13/Jan/16 ] |
|
The port for b2_5_fe is at http://review.whamcloud.com/17976 |
| Comment by Jinshan Xiong (Inactive) [ 18/Jan/16 ] |
|
Patch set 2 failed Maloo testing, but it may have hit a different problem. The original problem was that the node got stuck at cleanup, but this time it moved forward. I will take a further look. |
| Comment by Sarah Liu [ 19/Jan/16 ] |
|
client and server: lustre-master # 3305 RHEL6.7 https://testing.hpdd.intel.com/test_sets/798b7bb0-bba4-11e5-8506-5254006e85c2 |
| Comment by Jinshan Xiong (Inactive) [ 19/Jan/16 ] |
|
I couldn't see any logs from the Maloo result. Is there any way to retain the logs after the test is complete? Otherwise we can reproduce it on our test nodes manually. |
| Comment by Joseph Gmitter (Inactive) [ 25/Jan/16 ] |
|
Sarah, |
| Comment by Sarah Liu [ 25/Jan/16 ] |
|
Hi Joe, Working on it right now. |
| Comment by Sarah Liu [ 26/Jan/16 ] |
|
Xiong,

[w3liu@shadow-1 ~]$ ls /home/autotest/logdir/test_logs/2016-01-25/lustre-reviews-el6_7-x86_64--review-dne-part-1--1_6_1__37054__-70190851309880-224528 | grep 26
replay-dual.test_26.console.shadow-4vm1.log
replay-dual.test_26.console.shadow-4vm2.log
replay-dual.test_26.console.shadow-4vm3.log
replay-dual.test_26.console.shadow-4vm4.log
replay-dual.test_26.console.shadow-4vm8.log
replay-dual.test_26.test_log.shadow-4vm1.log |
| Comment by Sarah Liu [ 26/Jan/16 ] |
|
I put all logs into one file, please check |
| Comment by Jinshan Xiong (Inactive) [ 27/Jan/16 ] |
|
Finally I could reproduce the problem on my own test cluster at home. It turned out the client running dbench was evicted. Please check the log in the attachment. Before it was evicted, it complained about 'version mismatch during replay'; it looks like the object was changed. I don't know why this would happen.

Lustre: 2818:0:(client.c:2886:ptlrpc_replay_interpret()) @@@ Version mismatch during replay req@ffff8800383a3c80 x1524469815762040/t360777260454(360777260454) o36->lustre-MDT0000-mdc-ffff88005a4a7800@172.16.0.128@tcp:12/10 lens 616/424 e 0 to 0 dl 1453855303 ref 2 fl Interpret:R/4/0 rc -75/-75
Lustre: 2818:0:(client.c:2886:ptlrpc_replay_interpret()) @@@ Version mismatch during replay req@ffff88003d87c9c0 x1524469815775168/t360777261048(360777261048) o101->lustre-MDT0000-mdc-ffff88005a4a7800@172.16.0.128@tcp:12/10 lens 808/544 e 0 to 0 dl 1453855304 ref 2 fl Interpret:R/4/0 rc -75/-75
Lustre: 2818:0:(client.c:2886:ptlrpc_replay_interpret()) Skipped 1 previous similar message
Lustre: 2818:0:(client.c:2886:ptlrpc_replay_interpret()) @@@ Version mismatch during replay req@ffff880037ca60c0 x1524469815787324/t360777261724(360777261724) o36->lustre-MDT0000-mdc-ffff88005a4a7800@172.16.0.128@tcp:12/10 lens 616/424 e 0 to 0 dl 1453855306 ref 2 fl Interpret:R/4/0 rc -75/-75
Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 9 sec
Lustre: 2818:0:(import.c:1339:completed_replay_interpret()) lustre-MDT0000-mdc-ffff88005a4a7800: version recovery fails, reconnecting |
| Comment by Jinshan Xiong (Inactive) [ 27/Jan/16 ] |
|
Actually the problem I saw is not related to this ticket at all, because I could see the same problem without the patch applied. |
| Comment by Jinshan Xiong (Inactive) [ 27/Jan/16 ] |
|
Hi Tappro, can you please take a look at the logs I posted above? |
| Comment by Di Wang [ 27/Jan/16 ] |
|
Hmm, this usually means recovery failed (most likely the FID and name do not match, IMHO). So do you mean this failure is related to the umount timeout or not? If not, do you mind if I create a new ticket for it? I have seen similar failures in other recovery tests as well. |
| Comment by Di Wang [ 27/Jan/16 ] |
|
Though this might not be the real cause of the "version mismatch", replay-dual.sh test 26 can pass with the fix of |
| Comment by Jinshan Xiong (Inactive) [ 27/Jan/16 ] |
|
Hi Di, the problem I mentioned above is unrelated to the umount issue we're addressing in this ticket. Please go ahead and file a new ticket. The problem is that the test can't pass at all; it keeps hitting that issue. |
| Comment by Jinshan Xiong (Inactive) [ 27/Jan/16 ] |
|
btw, are you seeing the same issue as I described above? |
| Comment by Di Wang [ 27/Jan/16 ] |
|
Ah, I don't know about the umount issue, and cannot reproduce it locally either. I posted the patch at http://review.whamcloud.com/18172, and the "version mismatch" seems to disappear with it, at least in my local tests. But I probably need to dig into it further. |
| Comment by Jinshan Xiong (Inactive) [ 27/Jan/16 ] |
|
Indeed, your patch helped a lot in my new test. I only saw one failure after that, and there was no eviction, but I didn't catch the log. I tried many times to reproduce it but failed - maybe just a hiccup somewhere. I'll rebase my patch on Di's and see how it goes. |
| Comment by Jay Lan (Inactive) [ 28/Jan/16 ] |
|
Zhenyu Xu ported an earlier version (patch set 3, IIRC) of #17853 for b2_5_fe at 17976. Could someone from Intel update 17976 to the latest patch set #4 for us? Thanks! |
| Comment by Jinshan Xiong (Inactive) [ 30/Jan/16 ] |
|
Hi Jay, Patch set 3 of 17853 is the same thing as patch 17976. |
| Comment by Saurabh Tandan (Inactive) [ 03/Feb/16 ] |
|
Another instance failing with the same error as above for tag 2.7.66, FULL - EL6.7 Server/EL6.7 Client - DNE, master, build# 3314. Encountered another instance for tag 2.7.66, FULL - EL7.1 Server/EL7.1 Client, master, build# 3314. Another failure for master: tag 2.7.66, FULL - EL7.1 Server/SLES11 SP3 Client, build# 3314 |
| Comment by Jinshan Xiong (Inactive) [ 04/Feb/16 ] |
|
For the failure at this test result: it shows the error message "dbench 31895 missing", but the dbench process (dbench_26_pid) actually had PID 7185. There is no error message in the console log either. I don't know where the error came from. |
| Comment by Saurabh Tandan (Inactive) [ 10/Feb/16 ] |
|
Another instance found for interop tag 2.7.66 - EL7 Server/2.7.1 Client, build# 3316 |
| Comment by Saurabh Tandan (Inactive) [ 10/Feb/16 ] |
|
Another instance found for Full tag 2.7.66 - EL6.7 Server/EL6.7 Client, build# 3314
Another instance found for Full tag 2.7.66 - EL6.7 Server/EL6.7 Client - ZFS, build# 3314
Another instance found for Full tag 2.7.66 - EL6.7 Server/EL6.7 Client - DNE, build# 3314
Another instance found for Full tag 2.7.66 - EL7.1 Server/EL6.7 Client, build# 3314
Another instance found for Full tag 2.7.66 - EL7.1 Server/EL7.1 Client, build# 3314
Another instance found for Full tag 2.7.66 - EL7.1 Server/SLES11 SP3 Client, build# 3314
Another instance found for Full tag 2.7.66 - EL7.1 Server/EL7.1 Client - ZFS, build# 3314
Another instance found for Full tag 2.7.66 - EL7.1 Server/EL7.1 Client - DNE, build# 3314 |
| Comment by Andreas Dilger [ 03/Jun/16 ] |
|
Not sure if |
| Comment by Gerrit Updater [ 07/Jun/16 ] |
|
Andreas Dilger (andreas.dilger@intel.com) uploaded a new patch: http://review.whamcloud.com/20669 |
| Comment by Gerrit Updater [ 11/Jun/16 ] |
|
Andreas Dilger (andreas.dilger@intel.com) merged in patch http://review.whamcloud.com/20669/ |
| Comment by Jian Yu [ 08/Jul/16 ] |
|
On master branch, replay-vbr test 1b also hit the same failure: |
| Comment by Li Xi (Inactive) [ 22/Feb/17 ] |
|
Hi, any conclusion on the patch https://review.whamcloud.com/#/c/17853/? We are hitting this problem really frequently. Would you please accelerate the review of that patch? |
| Comment by Peter Jones [ 21/Apr/17 ] |
|
Li Xi, it would probably be best to open a DDN ticket for the issue you are hitting. Peter |
| Comment by James Nunez (Inactive) [ 24/May/17 ] |
|
Using patch 17853, we are still seeing replay-dual test 26 hangs. The patch I used to test this is at https://review.whamcloud.com/#/c/27256/ and the replay-dual failure is at https://testing.hpdd.intel.com/test_sets/3d171e98-408f-11e7-b3f6-5254006e85c2. Looking at the MDS1 console, we can see that the MDS hangs in umount 07:36:47:[ 4392.087661] Lustre: 5245:0:(client.c:2114:ptlrpc_expire_one_request()) Skipped 24 previous similar messages 07:36:47:[ 4425.382238] LustreError: 137-5: lustre-MDT0000_UUID: not available for connect from 10.2.4.147@tcp (no target). If you are running an HA pair check that the target is mounted on the other server. 07:36:47:[ 4425.387696] LustreError: Skipped 901 previous similar messages 07:36:47:[ 4440.091059] INFO: task umount:13451 blocked for more than 120 seconds. 07:36:47:[ 4440.093651] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 07:36:47:[ 4440.096310] umount D ffff88005f304600 0 13451 13450 0x00000080 07:36:47:[ 4440.098974] ffff880065df3a20 0000000000000082 ffff8800663cbec0 ffff880065df3fd8 07:36:47:[ 4440.101633] ffff880065df3fd8 ffff880065df3fd8 ffff8800663cbec0 ffff88005f304740 07:36:47:[ 4440.104257] ffff88005f304748 7fffffffffffffff ffff8800663cbec0 ffff88005f304600 07:36:47:[ 4440.106805] Call Trace: 07:36:47:[ 4440.108922] [<ffffffff8168c3c9>] schedule+0x29/0x70 07:36:47:[ 4440.111173] [<ffffffff81689e29>] schedule_timeout+0x239/0x2c0 07:36:47:[ 4440.113525] [<ffffffff811dafb9>] ? discard_slab+0x39/0x50 07:36:47:[ 4440.115807] [<ffffffff811dcaaa>] ? __slab_free+0x2ca/0x2f0 07:36:47:[ 4440.118100] [<ffffffff8168c7a6>] wait_for_completion+0x116/0x170 07:36:47:[ 4440.120423] [<ffffffff810c54c0>] ? wake_up_state+0x20/0x20 07:36:47:[ 4440.122724] [<ffffffffa0d8205e>] mgs_ir_fini_fs+0x27e/0x2f0 [mgs] 07:36:47:[ 4440.125044] [<ffffffffa0d753f3>] mgs_put_fsdb+0x53/0x910 [mgs] 07:36:47:[ 4440.127330] [<ffffffffa0d75e3c>] mgs_cleanup_fsdb_list+0x6c/0x90 [mgs] 07:36:47:[ 4440.129664] [<ffffffffa0d59a97>] mgs_device_fini+0x97/0x5b0 [mgs] 07:36:47:[ 4440.132081] [<ffffffffa087d584>] class_cleanup+0x784/0xd90 [obdclass] 07:36:47:[ 4440.134416] [<ffffffffa088051d>] class_process_config+0x225d/0x2b10 [obdclass] 07:36:47:[ 4440.136806] [<ffffffffa0678d87>] ? libcfs_debug_msg+0x57/0x80 [libcfs] 07:36:47:[ 4440.139136] [<ffffffffa0880ebf>] class_manual_cleanup+0xef/0x810 [obdclass] 07:36:47:[ 4440.141491] [<ffffffffa08b1580>] server_put_super+0xb20/0xcd0 [obdclass] 07:36:47:[ 4440.143755] [<ffffffff81200d22>] generic_shutdown_super+0x72/0xf0 07:36:47:[ 4440.145932] [<ffffffff812010f2>] kill_anon_super+0x12/0x20 07:36:47:[ 4440.148033] [<ffffffffa0884a92>] lustre_kill_super+0x32/0x50 [obdclass] 07:36:47:[ 4440.150217] [<ffffffff812014a9>] deactivate_locked_super+0x49/0x60 07:36:47:[ 4440.152331] [<ffffffff81201aa6>] deactivate_super+0x46/0x60 07:36:47:[ 4440.154363] [<ffffffff8121eef5>] mntput_no_expire+0xc5/0x120 07:36:47:[ 4440.156357] [<ffffffff81220030>] SyS_umount+0xa0/0x3b0 07:36:47:[ 4440.158276] [<ffffffff816973c9>] system_call_fastpath+0x16/0x1b 07:36:47:[ 4560.160066] INFO: task umount:13451 blocked for more than 120 seconds. |
| Comment by Gerrit Updater [ 29/Jul/17 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/17853/ |
| Comment by Peter Jones [ 29/Jul/17 ] |
|
Landed for 2.11 |
| Comment by Gerrit Updater [ 02/Aug/17 ] |
|
Minh Diep (minh.diep@intel.com) uploaded a new patch: https://review.whamcloud.com/28323 |
| Comment by Gerrit Updater [ 10/Aug/17 ] |
|
John L. Hammond (john.hammond@intel.com) merged in patch https://review.whamcloud.com/28323/ |
| Comment by James Nunez (Inactive) [ 18/Dec/17 ] |
|
replay-dual test 26 is still on the ALWAYS_EXCEPT list and needs to be removed to fully test that the patches fix this issue. |
| Comment by Gerrit Updater [ 31/Dec/17 ] |
|
James Casper (jamesx.casper@intel.com) uploaded a new patch: https://review.whamcloud.com/30677 |
| Comment by Gerrit Updater [ 09/Jan/18 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/30677/ |
| Comment by Gerrit Updater [ 09/Jan/18 ] |
|
Minh Diep (minh.diep@intel.com) uploaded a new patch: https://review.whamcloud.com/30796 |
| Comment by Jian Yu [ 22/Jan/18 ] |
|
By searching on Maloo, I found replay-dual test 26 hung at least 17 times in the last week. Here is a failure instance on master branch: Console log on MDS: [51864.393931] LustreError: 14617:0:(qsd_reint.c:56:qsd_reint_completion()) lustre-MDT0000: failed to enqueue global quota lock, glb fid:[0x200000006:0x1010000:0x0], rc:-5 [51864.395871] LustreError: 14617:0:(qsd_reint.c:56:qsd_reint_completion()) Skipped 1 previous similar message [51960.236441] INFO: task umount:14605 blocked for more than 120 seconds. [51960.237288] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [51960.238163] umount D ffff88005c01cf10 0 14605 14604 0x00000080 [51960.238987] Call Trace: [51960.239287] [<ffffffff816ac5d9>] schedule_preempt_disabled+0x29/0x70 [51960.240077] [<ffffffff816aa407>] __mutex_lock_slowpath+0xc7/0x1d0 [51960.241003] [<ffffffff816a981f>] mutex_lock+0x1f/0x2f [51960.241719] [<ffffffffc069c3e7>] mgc_process_config+0x207/0x13f0 [mgc] [51960.242608] [<ffffffffc0950a26>] obd_process_config.constprop.14+0x76/0x230 [obdclass] [51960.243674] [<ffffffffc0638d47>] ? libcfs_debug_msg+0x57/0x80 [libcfs] [51960.244490] [<ffffffffc093d639>] ? lprocfs_counter_add+0xf9/0x160 [obdclass] [51960.245406] [<ffffffffc0951cbf>] lustre_end_log+0x1ff/0x550 [obdclass] [51960.246218] [<ffffffffc097d15e>] server_put_super+0x7de/0xcd0 [obdclass] [51960.247052] [<ffffffff812054d2>] generic_shutdown_super+0x72/0x100 [51960.247873] [<ffffffff812058a2>] kill_anon_super+0x12/0x20 [51960.248579] [<ffffffffc09500c2>] lustre_kill_super+0x32/0x50 [obdclass] [51960.249417] [<ffffffff81205c59>] deactivate_locked_super+0x49/0x60 [51960.250180] [<ffffffff812063c6>] deactivate_super+0x46/0x60 [51960.250878] [<ffffffff8122376f>] cleanup_mnt+0x3f/0x80 [51960.251555] [<ffffffff81223802>] __cleanup_mnt+0x12/0x20 [51960.252269] [<ffffffff810aee05>] task_work_run+0xc5/0xf0 [51960.252967] [<ffffffff8102ab52>] do_notify_resume+0x92/0xb0 [51960.253708] [<ffffffff816b8d37>] int_signal+0x12/0x17 |
| Comment by Minh Diep [ 04/May/18 ] |
|
+1 on b2_10 https://testing.hpdd.intel.com/test_sets/07ca0dd2-43d0-11e8-960d-52540065bddc |
| Comment by James Nunez (Inactive) [ 05/Jul/18 ] |
|
We have another case of replay-dual test 26 failing in umount. The interesting thing here is that the test just fails and continues with the next test. The test does not time out/hang due to this failure. The console logs of the MDS (vm4) have the same call trace in umount as above. Logs are at: https://testing.whamcloud.com/test_sets/f594d332-8023-11e8-97ff-52540065bddc |
| Comment by Gerrit Updater [ 30/Jul/18 ] |
|
James Nunez (jnunez@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/32902 |
| Comment by Gerrit Updater [ 09/Aug/18 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/32902/ |
| Comment by James Nunez (Inactive) [ 14/Aug/18 ] |
|
Looking at the MDS console logs, the following test sessions have essentially the same stack trace as what is described in this ticket. Looking at the kernel crash logs shows oom-killer. For https://testing.whamcloud.com/test_sets/a498de80-9ade-11e8-8ee3-52540065bddc, the kernel crash log shows that tar envokes the oom-killer: [60604.459486] tar invoked oom-killer: gfp_mask=0x200da, order=0, oom_score_adj=0 [60604.460400] tar cpuset=/ mems_allowed=0 [60604.460823] CPU: 0 PID: 16324 Comm: tar Kdump: loaded Tainted: G OE ------------ 3.10.0-862.9.1.el7.x86_64 #1 [60604.461874] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 [60604.462439] Call Trace: [60604.462724] [<ffffffff81b0e84e>] dump_stack+0x19/0x1b [60604.463220] [<ffffffff81b0a1d0>] dump_header+0x90/0x229 [60604.463737] [<ffffffff81b1badf>] ? notifier_call_chain+0x4f/0x70 [60604.464338] [<ffffffff814c17b8>] ? __blocking_notifier_call_chain+0x58/0x70 [60604.465018] [<ffffffff8159826e>] check_panic_on_oom+0x2e/0x60 [60604.465589] [<ffffffff8159868b>] out_of_memory+0x23b/0x4f0 [60604.466124] [<ffffffff8159f224>] __alloc_pages_nodemask+0xaa4/0xbb0 [60604.466735] [<ffffffff815ec525>] alloc_pages_vma+0xb5/0x200 [60604.467279] [<ffffffff815dae45>] __read_swap_cache_async+0x115/0x190 [60604.467886] [<ffffffff815daee6>] read_swap_cache_async+0x26/0x60 [60604.468472] [<ffffffff815dafc8>] swapin_readahead+0xa8/0x110 [60604.469034] [<ffffffff815c5f37>] handle_pte_fault+0x777/0xc30 [60604.469601] [<ffffffff815c7c3d>] handle_mm_fault+0x39d/0x9b0 [60604.470163] [<ffffffff81525092>] ? from_kgid_munged+0x12/0x20 [60604.470717] [<ffffffff81b1b557>] __do_page_fault+0x197/0x4f0 [60604.471260] [<ffffffff81b1b996>] trace_do_page_fault+0x56/0x150 [60604.471829] [<ffffffff81b1af22>] do_async_page_fault+0x22/0xf0 [60604.472405] [<ffffffff81b17788>] async_page_fault+0x28/0x30 [60604.472938] Mem-Info: [60604.473179] active_anon:0 inactive_anon:6 isolated_anon:0 active_file:274 inactive_file:308 isolated_file:0 unevictable:0 dirty:38 writeback:0 unstable:0 slab_reclaimable:3259 slab_unreclaimable:38342 mapped:181 shmem:12 pagetables:1345 bounce:0 free:12914 free_pcp:0 free_cma:0 [60604.476120] Node 0 DMA free:7020kB min:416kB low:520kB high:624kB active_anon:4kB inactive_anon:0kB active_file:4kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15992kB managed:15908kB mlocked:0kB dirty:12kB writeback:0kB mapped:4kB shmem:48kB slab_reclaimable:132kB slab_unreclaimable:724kB kernel_stack:16kB pagetables:124kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:1494 all_unreclaimable? yes [60604.480017] lowmem_reserve[]: 0 1660 1660 1660 [60604.480573] Node 0 DMA32 free:44636kB min:44636kB low:55792kB high:66952kB active_anon:0kB inactive_anon:24kB active_file:1092kB inactive_file:1232kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2080744kB managed:1704004kB mlocked:0kB dirty:140kB writeback:0kB mapped:720kB shmem:0kB slab_reclaimable:12904kB slab_unreclaimable:152644kB kernel_stack:2848kB pagetables:5256kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:3668 all_unreclaimable? 
yes [60604.484717] lowmem_reserve[]: 0 0 0 0 [60604.485180] Node 0 DMA: 9*4kB (UM) 5*8kB (UM) 10*16kB (M) 9*32kB (UM) 8*64kB (UM) 5*128kB (UM) 5*256kB (UM) 4*512kB (UM) 0*1024kB 1*2048kB (M) 0*4096kB = 7052kB [60604.486948] Node 0 DMA32: 2654*4kB (EM) 2385*8kB (EM) 941*16kB (UM) 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 44752kB [60604.488485] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB [60604.489308] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB [60604.490109] 303 total pagecache pages [60604.490472] 6 pages in swap cache [60604.490790] Swap cache stats: add 86869, delete 86863, find 18036/30348 [60604.491421] Free swap = 3521276kB [60604.491749] Total swap = 3671036kB [60604.492082] 524184 pages RAM [60604.492368] 0 pages HighMem/MovableOnly [60604.492812] 94206 pages reserved [60604.493137] [ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name [60604.493889] [ 333] 0 333 9868 0 22 108 0 systemd-journal [60604.494776] [ 359] 0 359 29149 0 26 79 0 lvmetad [60604.495542] [ 366] 0 366 11101 1 23 147 -1000 systemd-udevd [60604.496357] [ 453] 0 453 13877 0 26 119 -1000 auditd [60604.497124] [ 480] 0 480 6627 1 19 95 0 systemd-logind [60604.497929] [ 481] 999 481 134608 0 60 2165 0 polkitd [60604.498689] [ 482] 0 482 5381 0 15 59 0 irqbalance [60604.499486] [ 483] 81 483 14590 1 34 213 -900 dbus-daemon [60604.500283] [ 484] 32 484 17305 0 38 160 0 rpcbind [60604.501042] [ 485] 0 485 48770 0 36 126 0 gssproxy [60604.501815] [ 486] 0 486 137505 0 87 654 0 NetworkManager [60604.502639] [ 501] 998 501 30087 0 29 123 0 chronyd [60604.503403] [ 533] 0 533 26849 1 53 499 0 dhclient [60604.504171] [ 895] 0 895 143453 42 98 2797 0 tuned [60604.504929] [ 896] 0 896 28203 1 55 257 -1000 sshd [60604.505686] [ 900] 0 900 74575 8 73 916 0 rsyslogd [60604.506456] [ 906] 0 906 6791 1 18 62 0 xinetd [60604.507223] [ 913] 29 913 10605 0 24 209 0 rpc.statd [60604.508002] [ 917] 997 917 56469 0 23 285 0 munged [60604.508760] [ 992] 0 992 31570 1 19 155 0 crond [60604.509518] [ 993] 0 993 6476 0 19 52 0 atd [60604.510256] [ 996] 0 996 167981 0 69 580 0 automount [60604.511040] [ 1007] 0 1007 27522 1 12 33 0 agetty [60604.511804] [ 1009] 0 1009 27522 1 9 32 0 agetty [60604.512563] [ 1181] 0 1181 22408 0 43 265 0 master [60604.513318] [ 1200] 89 1200 22451 0 46 253 0 qmgr [60604.514075] [10963] 0 10963 39169 0 78 351 0 sshd [60604.514809] [10965] 0 10965 28296 1 14 58 0 run_test.sh [60604.515601] [11233] 0 11233 29536 1 16 790 0 bash [60604.516346] [21585] 0 21585 29536 0 13 790 0 bash [60604.517086] [21586] 0 21586 26988 0 10 27 0 tee [60604.517808] [21767] 0 21767 29573 1 14 840 0 bash [60604.518551] [31867] 89 31867 22434 0 45 251 0 pickup [60604.519302] [11943] 0 11943 29607 1 14 863 0 bash [60604.520033] [11944] 0 11944 26988 0 10 28 0 tee [60604.520775] [12397] 0 12397 29607 0 14 863 0 bash [60604.521528] [12398] 0 12398 29607 0 14 863 0 bash [60604.522264] [12911] 0 12911 24022 0 22 84 0 pdsh [60604.523013] [12912] 0 12912 29228 0 14 42 0 sed [60604.523754] [15184] 0 15184 29474 1 13 732 0 rundbench [60604.524547] [15195] 0 15195 1618 26 9 52 0 dbench [60604.525302] [15196] 0 15196 1620 61 9 62 0 dbench [60604.526052] [16324] 0 16324 30920 53 15 89 0 tar [60604.526795] [16325] 0 16325 30852 57 17 81 0 tar [60604.527548] Kernel panic - not syncing: Out of memory: system-wide panic_on_oom is enabled [60604.528460] CPU: 0 PID: 16324 Comm: tar Kdump: loaded Tainted: G OE ------------ 
3.10.0-862.9.1.el7.x86_64 #1 [60604.529473] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 [60604.530011] Call Trace: [60604.530259] [<ffffffff81b0e84e>] dump_stack+0x19/0x1b [60604.530746] [<ffffffff81b08b50>] panic+0xe8/0x21f [60604.531196] [<ffffffff81598295>] check_panic_on_oom+0x55/0x60 [60604.531747] [<ffffffff8159868b>] out_of_memory+0x23b/0x4f0 [60604.532268] [<ffffffff8159f224>] __alloc_pages_nodemask+0xaa4/0xbb0 [60604.532865] [<ffffffff815ec525>] alloc_pages_vma+0xb5/0x200 [60604.533400] [<ffffffff815dae45>] __read_swap_cache_async+0x115/0x190 [60604.533994] [<ffffffff815daee6>] read_swap_cache_async+0x26/0x60 [60604.534565] [<ffffffff815dafc8>] swapin_readahead+0xa8/0x110 [60604.535096] [<ffffffff815c5f37>] handle_pte_fault+0x777/0xc30 [60604.535643] [<ffffffff815c7c3d>] handle_mm_fault+0x39d/0x9b0 [60604.536180] [<ffffffff81525092>] ? from_kgid_munged+0x12/0x20 [60604.536733] [<ffffffff81b1b557>] __do_page_fault+0x197/0x4f0 [60604.537268] [<ffffffff81b1b996>] trace_do_page_fault+0x56/0x150 [60604.537828] [<ffffffff81b1af22>] do_async_page_fault+0x22/0xf0 [60604.538383] [<ffffffff81b17788>] async_page_fault+0x28/0x30 The kernel-crash for https://testing.whamcloud.com/test_sets/0bb1397a-9bb9-11e8-8ee3-52540065bddc has tar envoking the oom-killer, but we have a few different *_newfstat calls in the kernel crash: [44806.676691] Lustre: DEBUG MARKER: test_26 fail mds1 1 times [44998.330738] tar invoked oom-killer: gfp_mask=0x200da, order=0, oom_score_adj=0 [44998.332386] tar cpuset=/ mems_allowed=0 [44998.333090] CPU: 0 PID: 28253 Comm: tar Kdump: loaded Tainted: G W OE ------------ 3.10.0-862.9.1.el7.x86_64 #1 [44998.335039] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 [44998.335669] Call Trace: [44998.335982] [<ffffffffabd0e84e>] dump_stack+0x19/0x1b [44998.336497] [<ffffffffabd0a1d0>] dump_header+0x90/0x229 [44998.337027] [<ffffffffabd1badf>] ? notifier_call_chain+0x4f/0x70 [44998.337645] [<ffffffffab6c17b8>] ? __blocking_notifier_call_chain+0x58/0x70 [44998.338336] [<ffffffffab79826e>] check_panic_on_oom+0x2e/0x60 [44998.338902] [<ffffffffab79868b>] out_of_memory+0x23b/0x4f0 [44998.339448] [<ffffffffab79f224>] __alloc_pages_nodemask+0xaa4/0xbb0 [44998.340074] [<ffffffffab7ec525>] alloc_pages_vma+0xb5/0x200 [44998.340639] [<ffffffffab7dae45>] __read_swap_cache_async+0x115/0x190 [44998.341254] [<ffffffffab7daee6>] read_swap_cache_async+0x26/0x60 [44998.341850] [<ffffffffab7dafc8>] swapin_readahead+0xa8/0x110 [44998.342416] [<ffffffffab7c5f37>] handle_pte_fault+0x777/0xc30 [44998.342995] [<ffffffffab7c7c3d>] handle_mm_fault+0x39d/0x9b0 [44998.343566] [<ffffffffabd1b557>] __do_page_fault+0x197/0x4f0 [44998.344126] [<ffffffffabd1b996>] trace_do_page_fault+0x56/0x150 [44998.344722] [<ffffffffabd1af22>] do_async_page_fault+0x22/0xf0 [44998.345287] [<ffffffffabd17788>] async_page_fault+0x28/0x30 [44998.345888] [<ffffffffab959730>] ? copy_user_generic_string+0x30/0x40 [44998.346533] [<ffffffffab82142f>] ? cp_new_stat+0x14f/0x180 [44998.347077] [<ffffffffab8215b4>] SYSC_newfstat+0x34/0x60 [44998.347607] [<ffffffffab82179e>] SyS_newfstat+0xe/0x10 [44998.348120] [<ffffffffabd20795>] system_call_fastpath+0x1c/0x21 [44998.348710] [<ffffffffabd206e1>] ? 
system_call_after_swapgs+0xae/0x146 [44998.349352] Mem-Info: [44998.349595] active_anon:2 inactive_anon:18 isolated_anon:0 active_file:15 inactive_file:1090 isolated_file:0 unevictable:0 dirty:0 writeback:30 unstable:0 slab_reclaimable:3679 slab_unreclaimable:38446 mapped:55 shmem:0 pagetables:1385 bounce:0 free:12871 free_pcp:108 free_cma:0 [44998.352588] Node 0 DMA free:6956kB min:416kB low:520kB high:624kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15992kB managed:15908kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:72kB slab_unreclaimable:684kB kernel_stack:32kB pagetables:24kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes [44998.356534] lowmem_reserve[]: 0 1660 1660 1660 [44998.357078] Node 0 DMA32 free:44528kB min:44636kB low:55792kB high:66952kB active_anon:8kB inactive_anon:72kB active_file:60kB inactive_file:4360kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2080744kB managed:1704004kB mlocked:0kB dirty:0kB writeback:120kB mapped:220kB shmem:0kB slab_reclaimable:14644kB slab_unreclaimable:153100kB kernel_stack:2832kB pagetables:5516kB unstable:0kB bounce:0kB free_pcp:432kB local_pcp:4kB free_cma:0kB writeback_tmp:0kB pages_scanned:1441 all_unreclaimable? yes [44998.361327] lowmem_reserve[]: 0 0 0 0 [44998.361788] Node 0 DMA: 17*4kB (U) 11*8kB (U) 5*16kB (U) 3*32kB (U) 0*64kB 2*128kB (U) 1*256kB (U) 0*512kB 0*1024kB 3*2048kB (M) 0*4096kB = 6988kB [44998.363459] Node 0 DMA32: 1871*4kB (EM) 2399*8kB (UEM) 1117*16kB (M) 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 44548kB [44998.365051] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB [44998.365894] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB [44998.366710] 1163 total pagecache pages [44998.367087] 0 pages in swap cache [44998.367423] Swap cache stats: add 95573, delete 95573, find 17225/30172 [44998.368063] Free swap = 3521020kB [44998.368403] Total swap = 3671036kB [44998.368744] 524184 pages RAM [44998.369026] 0 pages HighMem/MovableOnly [44998.369409] 94206 pages reserved [44998.369738] [ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name [44998.370539] [ 337] 0 337 10055 1 24 98 0 systemd-journal [44998.371399] [ 359] 0 359 29149 0 26 79 0 lvmetad [44998.372180] [ 364] 0 364 11100 1 24 146 -1000 systemd-udevd [44998.373025] [ 455] 0 455 13877 0 27 119 -1000 auditd [44998.373804] [ 481] 0 481 5381 0 15 59 0 irqbalance [44998.374612] [ 484] 32 484 17305 0 36 139 0 rpcbind [44998.375402] [ 486] 81 486 14554 1 33 176 -900 dbus-daemon [44998.376211] [ 489] 0 489 48770 0 37 126 0 gssproxy [44998.377005] [ 492] 998 492 30087 0 29 124 0 chronyd [44998.377795] [ 499] 0 499 137506 0 87 1128 0 NetworkManager [44998.378649] [ 500] 999 500 134608 0 61 1403 0 polkitd [44998.379433] [ 501] 0 501 6594 1 17 77 0 systemd-logind [44998.380252] [ 546] 0 546 26849 1 53 498 0 dhclient [44998.381077] [ 900] 0 900 28203 1 60 257 -1000 sshd [44998.381846] [ 902] 0 902 143453 0 96 3303 0 tuned [44998.382623] [ 909] 0 909 74575 0 75 908 0 rsyslogd [44998.383418] [ 913] 29 913 10605 0 25 209 0 rpc.statd [44998.384215] [ 914] 0 914 6791 1 17 63 0 xinetd [44998.385003] [ 918] 997 918 56469 0 21 274 0 munged [44998.385798] [ 983] 0 983 6476 0 19 52 0 atd [44998.386544] [ 985] 0 985 31570 0 21 155 0 crond [44998.387319] [ 993] 0 993 167982 0 69 547 0 
automount [44998.388108] [ 997] 0 997 27522 1 9 32 0 agetty [44998.388892] [ 1001] 0 1001 27522 1 12 32 0 agetty [44998.389690] [ 1320] 0 1320 22408 0 42 259 0 master [44998.390467] [ 1346] 89 1346 22451 0 45 254 0 qmgr [44998.391223] [10968] 0 10968 39169 0 77 365 0 sshd [44998.391985] [10970] 0 10970 28296 1 13 58 0 run_test.sh [44998.392810] [11240] 0 11240 29470 1 16 733 0 bash [44998.393561] [18492] 89 18492 22434 0 46 252 0 pickup [44998.394334] [ 1201] 0 1201 29470 0 13 733 0 bash [44998.395082] [ 1202] 0 1202 26988 0 9 27 0 tee [44998.395840] [ 1396] 0 1396 29538 1 12 784 0 bash [44998.396621] [23895] 0 23895 29538 1 12 806 0 bash [44998.397382] [23896] 0 23896 26988 0 10 28 0 tee [44998.398123] [24348] 0 24348 29538 0 12 806 0 bash [44998.398885] [24349] 0 24349 29538 0 12 806 0 bash [44998.399652] [24854] 0 24854 24023 0 21 87 0 pdsh [44998.400403] [24855] 0 24855 29228 0 13 42 0 sed [44998.401140] [27114] 0 27114 29438 1 14 675 0 rundbench [44998.401941] [27125] 0 27125 1618 0 9 44 0 dbench [44998.402713] [27126] 0 27126 1620 1 9 56 0 dbench [44998.403486] [28252] 0 28252 30920 0 17 95 0 tar [44998.404231] [28253] 0 28253 30852 0 16 77 0 tar [44998.404986] [28255] 0 28255 40840 6 37 209 0 crond [44998.405759] Kernel panic - not syncing: Out of memory: system-wide panic_on_oom is enabled [44998.406685] CPU: 0 PID: 28253 Comm: tar Kdump: loaded Tainted: G W OE ------------ 3.10.0-862.9.1.el7.x86_64 #1 [44998.407727] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 [44998.408277] Call Trace: [44998.408522] [<ffffffffabd0e84e>] dump_stack+0x19/0x1b [44998.409021] [<ffffffffabd08b50>] panic+0xe8/0x21f [44998.409491] [<ffffffffab798295>] check_panic_on_oom+0x55/0x60 [44998.410058] [<ffffffffab79868b>] out_of_memory+0x23b/0x4f0 [44998.410588] [<ffffffffab79f224>] __alloc_pages_nodemask+0xaa4/0xbb0 [44998.411198] [<ffffffffab7ec525>] alloc_pages_vma+0xb5/0x200 [44998.411747] [<ffffffffab7dae45>] __read_swap_cache_async+0x115/0x190 [44998.412363] [<ffffffffab7daee6>] read_swap_cache_async+0x26/0x60 [44998.412951] [<ffffffffab7dafc8>] swapin_readahead+0xa8/0x110 [44998.413498] [<ffffffffab7c5f37>] handle_pte_fault+0x777/0xc30 [44998.414062] [<ffffffffab7c7c3d>] handle_mm_fault+0x39d/0x9b0 [44998.414610] [<ffffffffabd1b557>] __do_page_fault+0x197/0x4f0 [44998.415172] [<ffffffffabd1b996>] trace_do_page_fault+0x56/0x150 [44998.415747] [<ffffffffabd1af22>] do_async_page_fault+0x22/0xf0 [44998.416313] [<ffffffffabd17788>] async_page_fault+0x28/0x30 [44998.416860] [<ffffffffab959730>] ? copy_user_generic_string+0x30/0x40 [44998.417479] [<ffffffffab82142f>] ? cp_new_stat+0x14f/0x180 [44998.418024] [<ffffffffab8215b4>] SYSC_newfstat+0x34/0x60 [44998.418542] [<ffffffffab82179e>] SyS_newfstat+0xe/0x10 [44998.419056] [<ffffffffabd20795>] system_call_fastpath+0x1c/0x21 [44998.419651] [<ffffffffabd206e1>] ? system_call_after_swapgs+0xae/0x146 |
| Comment by Gerrit Updater [ 22/Aug/18 ] |
|
James Nunez (jnunez@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/33052 |
| Comment by Gerrit Updater [ 11/Sep/18 ] |
|
John L. Hammond (jhammond@whamcloud.com) merged in patch https://review.whamcloud.com/33052/ |
| Comment by Gerrit Updater [ 11/Jun/21 ] |
|
Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/43977 |
| Comment by Gerrit Updater [ 11/Jun/21 ] |
|
Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/43978 |
| Comment by Gerrit Updater [ 11/Jun/21 ] |
|
Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/43979 |
| Comment by Gerrit Updater [ 11/Jun/21 ] |
|
Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/43980 |
| Comment by Gerrit Updater [ 11/Jun/21 ] |
|
Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/43982 |
| Comment by Gerrit Updater [ 11/Jun/21 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/43977/ |
| Comment by Gerrit Updater [ 27/Jun/21 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/43978/ |
| Comment by Gerrit Updater [ 23/Dec/21 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/43982/ |