[LU-7372] replay-dual test_26: test failed to respond and timed out Created: 02/Nov/15  Updated: 23/Dec/21  Resolved: 22/Jul/21

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.8.0, Lustre 2.9.0, Lustre 2.10.0, Lustre 2.11.0, Lustre 2.12.0, Lustre 2.10.3, Lustre 2.10.4, Lustre 2.10.5, Lustre 2.12.4
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Maloo Assignee: Zhenyu Xu
Resolution: Fixed Votes: 0
Labels: None
Environment:

Server/Client: master, build# 3225, RHEL 6.7


Attachments: File 1453855057.tgz     HTML File log-7372    
Issue Links:
Duplicate
is duplicated by LU-10771 replay-dual test_26: Kernel panic - n... Open
is duplicated by LU-11038 replay-dual test_26: MDS crash with B... Open
is duplicated by LU-14749 runtests test 1 hangs on MDS unmount Resolved
Related
is related to LU-7716 Do not do subdir check if source and ... Resolved
is related to LU-7765 replay-dual test 26 buggy redirection Resolved
is related to LU-14406 replay-dual test 22d fails with “Remo... Open
is related to LU-14878 replay-dual test_26: dbench 5354 missing Open
is related to LU-4572 hung mdt threads Resolved
is related to LU-7640 stuck mdt thread required reboot of mds Resolved
is related to LU-7692 LNet: Service thread Hung Resolved
is related to LU-8502 replay-vbr: umount hangs waiting for ... Resolved
is related to LU-482 Test failure on test suite replay-dua... Resolved
is related to LU-7725 Error unpacking OUT message Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This issue was created by maloo for Saurabh Tandan <saurabh.tandan@intel.com>

This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/1e79d2a6-7d21-11e5-a254-5254006e85c2.

The sub-test test_26 failed with the following error:

test failed to respond and timed out

Client dmesg:

Lustre: DEBUG MARKER: test_26 fail mds1 1 times
LustreError: 980:0:(ldlm_request.c:130:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1445937610, 300s ago), entering recovery for MGS@10.2.4.140@tcp ns: MGC10.2.4.140@tcp lock: ffff88007bdd82c0/0x956ab2c8047544d6 lrc: 4/1,0 mode: --/CR res: [0x65727473756c:0x2:0x0].0x0 rrc: 1 type: PLN flags: 0x1000000000000 nid: local remote: 0x223a79061b204538 expref: -99 pid: 980 timeout: 0 lvb_type: 0
Lustre: 29433:0:(client.c:2039:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1445937910/real 1445937910]  req@ffff880028347980 x1516173751413108/t0(0) o250->MGC10.2.4.140@tcp@10.2.4.140@tcp:26/25 lens 520/544 e 0 to 1 dl 1445937916 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Lustre: 29433:0:(client.c:2039:ptlrpc_expire_one_request()) Skipped 67 previous similar messages

MDS console:

09:22:17:LustreError: 24638:0:(client.c:1138:ptlrpc_import_delay_req()) @@@ IMP_CLOSED   req@ffff88004d92c980 x1516158358024328/t0(0) o101->lustre-MDT0000-lwp-MDT0000@0@lo:23/10 lens 456/496 e 0 to 0 dl 0 ref 2 fl Rpc:/0/ffffffff rc 0/-1
09:25:19:LustreError: 24638:0:(client.c:1138:ptlrpc_import_delay_req()) Skipped 6 previous similar messages
09:25:19:LustreError: 24638:0:(qsd_reint.c:55:qsd_reint_completion()) lustre-MDT0000: failed to enqueue global quota lock, glb fid:[0x200000006:0x10000:0x0], rc:-5
09:25:19:LustreError: 24638:0:(qsd_reint.c:55:qsd_reint_completion()) Skipped 1 previous similar message
09:25:19:INFO: task umount:24629 blocked for more than 120 seconds.
09:25:19:      Not tainted 2.6.32-573.7.1.el6_lustre.x86_64 #1
09:25:19:"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
09:25:19:umount        D 0000000000000000     0 24629  24628 0x00000080
09:25:19: ffff880059e2bb48 0000000000000086 0000000000000000 00000000000708b7
09:25:20: 0000603500000000 000000ac00000000 00001c1fd9b9c014 ffff880059e2bb98
09:25:20: ffff880059e2bb58 0000000101d3458a ffff880076ee3ad8 ffff880059e2bfd8
09:25:20:Call Trace:
09:25:20: [<ffffffff8153a756>] __mutex_lock_slowpath+0x96/0x210
09:25:20: [<ffffffff8153a27b>] mutex_lock+0x2b/0x50
09:25:20: [<ffffffffa02cb30d>] mgc_process_config+0x1dd/0x1210 [mgc]
09:25:20: [<ffffffffa0476b61>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
09:25:20: [<ffffffffa07fe28d>] obd_process_config.clone.0+0x8d/0x2e0 [obdclass]
09:25:20: [<ffffffffa0476b61>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
09:25:20: [<ffffffffa08024c2>] lustre_end_log+0x262/0x6a0 [obdclass]
09:25:20: [<ffffffffa082efb1>] server_put_super+0x911/0xed0 [obdclass]
09:25:20: [<ffffffff811b0116>] ? invalidate_inodes+0xf6/0x190
09:25:20: [<ffffffff8119437b>] generic_shutdown_super+0x5b/0xe0
09:25:20: [<ffffffff81194466>] kill_anon_super+0x16/0x60
09:25:20: [<ffffffffa07fa096>] lustre_kill_super+0x36/0x60 [obdclass]
09:25:20: [<ffffffff81194c07>] deactivate_super+0x57/0x80
09:25:20: [<ffffffff811b4a7f>] mntput_no_expire+0xbf/0x110
09:25:20: [<ffffffff811b55cb>] sys_umount+0x7b/0x3a0
09:25:20: [<ffffffff8100b0d2>] system_call_fastpath+0x16/0x1b

Info required for matching: replay-dual test_26



 Comments   
Comment by Peter Jones [ 04/Dec/15 ]

Bobijam

Could you please advise on this one?

Thanks

Peter

Comment by Saurabh Tandan (Inactive) [ 11/Dec/15 ]

master, build# 3264, 2.7.64 tag
Regression: EL6.7 Server/EL6.7 Client
https://testing.hpdd.intel.com/test_sets/67ff2254-9f0a-11e5-ba94-5254006e85c2

Comment by Saurabh Tandan (Inactive) [ 11/Dec/15 ]

master, build# 3264, 2.7.64 tag
Regression: EL7.1 Server/EL6.7 Client
https://testing.hpdd.intel.com/test_sets/7a981fb0-9f0a-11e5-8d81-5254006e85c2

Comment by Saurabh Tandan (Inactive) [ 11/Dec/15 ]

master, build# 3264, 2.7.64 tag
Regression: EL7.1 Server/EL7.1 Client
https://testing.hpdd.intel.com/test_sets/76a79476-9f37-11e5-ba94-5254006e85c2

Comment by Saurabh Tandan (Inactive) [ 11/Dec/15 ]

master, build# 3264, 2.7.64 tag
Regression: EL7.1 Server/SLES11 SP3 Client
https://testing.hpdd.intel.com/test_sets/17e264d6-9f2b-11e5-bf9b-5254006e85c2

Comment by Sarah Liu [ 14/Dec/15 ]

another instance seen on RHEL6.7 server/SLES11 SP3 client
https://testing.hpdd.intel.com/test_sets/52597c80-a052-11e5-a33d-5254006e85c2

Comment by Zhenyu Xu [ 16/Dec/15 ]

mgs_ir_fini_fs() was waiting for the completion signal that mgs_ir_notify() is supposed to issue, while mgs_ir_notify() was stuck in its previous loop iteration, waiting indefinitely for a CONFIG_T_RECOVERY lock to be revoked.

mgs_ir_fini_fs
[11146.534027] umount          D ffff88007fc13680     0   416    415 0x00000080
[11146.534027]  ffff88007a183a00 0000000000000086 ffff88007a183fd8 0000000000013680
[11146.534027]  ffff88007a183fd8 0000000000013680 ffff880078725b00 ffff880061536728
[11146.534027]  ffff880061536730 7fffffffffffffff ffff880078725b00 ffff880061536600
[11146.534027] Call Trace:
[11146.534027]  [<ffffffff8160a709>] schedule+0x29/0x70
[11146.534027]  [<ffffffff81608649>] schedule_timeout+0x209/0x2d0
[11146.534027]  [<ffffffff81601c94>] ? __slab_free+0x10e/0x277
[11146.534027]  [<ffffffff8160ac16>] wait_for_completion+0x116/0x170
[11146.534027]  [<ffffffff810a9660>] ? wake_up_state+0x20/0x20
[11146.534027]  [<ffffffffa0cf0930>] mgs_ir_fini_fs+0x250/0x46c [mgs]
[11146.534027]  [<ffffffffa0cd8fac>] mgs_free_fsdb+0x4c/0xcd0 [mgs]
[11146.534027]  [<ffffffffa0ce5502>] mgs_cleanup_fsdb_list+0x52/0x70 [mgs]
[11146.534027]  [<ffffffffa0cc9f27>] mgs_device_fini+0x97/0x5b0 [mgs]
[11146.534027]  [<ffffffffa079b4e4>] class_cleanup+0x734/0xcc0 [obdclass]
[11146.534027]  [<ffffffffa079dd83>] class_process_config+0x1bf3/0x2cf0 [obdclass]
[11146.534027]  [<ffffffffa067b3c7>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
[11146.534027]  [<ffffffffa079ef6f>] class_manual_cleanup+0xef/0xba0 [obdclass]
[11146.534027]  [<ffffffffa07d6880>] server_put_super+0xcc0/0xea0 [obdclass]
[11146.534027]  [<ffffffff811c9426>] generic_shutdown_super+0x56/0xe0
[11146.534027]  [<ffffffff811c9692>] kill_anon_super+0x12/0x20
[11146.534027]  [<ffffffffa07a2c42>] lustre_kill_super+0x32/0x50 [obdclass]
[11146.534027]  [<ffffffff811c9a3d>] deactivate_locked_super+0x3d/0x60
[11146.534027]  [<ffffffff811ca046>] deactivate_super+0x46/0x60
[11146.534027]  [<ffffffff811e6f35>] mntput_no_expire+0xc5/0x120
[11146.534027]  [<ffffffff811e806f>] SyS_umount+0x9f/0x3c0
[11146.534027]  [<ffffffff81615309>] system_call_fastpath+0x16/0x1b
mgs_ir_notify
[11146.534027] mgs_lustre_noti S ffff88007fd13680     0 28657      2 0x00000080
[11146.534027]  ffff880060a43ba8 0000000000000046 ffff880060a43fd8 0000000000013680
[11146.534027]  ffff880060a43fd8 0000000000013680 ffff88005b4b16c0 ffff88005fddba00
[11146.534027]  ffff88005b4b16c0 0000000000000000 ffffffffa09d4210 ffff88005b4b16c0
[11146.534027] Call Trace:
[11146.534027]  [<ffffffffa09d4210>] ? ldlm_completion_ast_async+0x300/0x300 [ptlrpc]
[11146.534027]  [<ffffffff8160a709>] schedule+0x29/0x70
[11146.534027]  [<ffffffffa09d4a8d>] ldlm_completion_ast+0x62d/0x910 [ptlrpc]
[11146.534027]  [<ffffffff810a9660>] ? wake_up_state+0x20/0x20
[11146.534027]  [<ffffffffa0cc8721>] mgs_completion_ast_generic+0xb1/0x1d0 [mgs]
[11146.534027]  [<ffffffffa0cc8853>] mgs_completion_ast_ir+0x13/0x20 [mgs]
[11146.534027]  [<ffffffffa09d7150>] ldlm_cli_enqueue_local+0x230/0x940 [ptlrpc]
[11146.534027]  [<ffffffffa0cc8840>] ? mgs_completion_ast_generic+0x1d0/0x1d0 [mgs]
[11146.534027]  [<ffffffffa09d9f40>] ? ldlm_blocking_ast_nocheck+0x310/0x310 [ptlrpc]
[11146.534027]  [<ffffffffa0ccfdcc>] mgs_revoke_lock+0x1dc/0x360 [mgs]
[11146.534027]  [<ffffffffa09d9f40>] ? ldlm_blocking_ast_nocheck+0x310/0x310 [ptlrpc]
[11146.534027]  [<ffffffffa0cc8840>] ? mgs_completion_ast_generic+0x1d0/0x1d0 [mgs]
[11146.534027]  [<ffffffffa0cece4a>] mgs_ir_notify+0x14a/0x2b0 [mgs]
[11146.534027]  [<ffffffff810a9660>] ? wake_up_state+0x20/0x20
[11146.534027]  [<ffffffffa0cecd00>] ? lprocfs_ir_set_state+0x170/0x170 [mgs]
[11146.534027]  [<ffffffff810973af>] kthread+0xcf/0xe0
[11146.534027]  [<ffffffff810972e0>] ? kthread_create_on_node+0x140/0x140
[11146.534027]  [<ffffffff81615258>] ret_from_fork+0x58/0x90
[11146.534027]  [<ffffffff810972e0>] ? kthread_create_on_node+0x140/0x140
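
To make the hang pattern above concrete, here is a minimal userspace analogue (illustrative only; the names are hypothetical and this is not the actual MGS code; builds with cc -pthread): one thread plays the mgs_ir_fini_fs() role and waits for a completion, while the other plays the mgs_ir_notify() role and only posts that completion after an event - the lock revocation - that never arrives. The demo bounds its wait so it can report the hang; the real wait_for_completion() in mgs_ir_fini_fs() is unbounded.

#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <time.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
static bool lock_revoked;  /* the CONFIG_T_RECOVERY revocation that never happens */
static bool notify_done;   /* the "completion" that the fini side waits for */

/* plays mgs_ir_notify(): wait for the revocation, then signal completion */
static void *notify_thread(void *arg)
{
        (void)arg;
        pthread_mutex_lock(&lock);
        while (!lock_revoked)                    /* like the completion AST wait: */
                pthread_cond_wait(&cond, &lock); /* no effective deadline */
        notify_done = true;                      /* never reached */
        pthread_cond_broadcast(&cond);
        pthread_mutex_unlock(&lock);
        return NULL;
}

/* plays mgs_ir_fini_fs(): wait for the completion posted by notify_thread() */
static int fini_fs(void)
{
        struct timespec deadline;
        int rc = 0;

        clock_gettime(CLOCK_REALTIME, &deadline);
        deadline.tv_sec += 5;                    /* bounded only so the demo terminates */

        pthread_mutex_lock(&lock);
        while (!notify_done && rc == 0)
                rc = pthread_cond_timedwait(&cond, &lock, &deadline);
        pthread_mutex_unlock(&lock);
        return rc;
}

int main(void)
{
        pthread_t tid;

        pthread_create(&tid, NULL, notify_thread, NULL);
        if (fini_fs() != 0)
                printf("fini timed out: notify never completed, umount would hang here\n");
        return 0;   /* notify_thread is still blocked when the process exits */
}
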
Comment by Jinshan Xiong (Inactive) [ 17/Dec/15 ]

Hi Bobijam,

Do you know which lock this process is waiting for?

"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
umount        D 0000000000000000     0 24629  24628 0x00000080
 ffff880059e2bb48 0000000000000086 0000000000000000 00000000000708b7
 0000603500000000 000000ac00000000 00001c1fd9b9c014 ffff880059e2bb98
 ffff880059e2bb58 0000000101d3458a ffff880076ee3ad8 ffff880059e2bfd8
Call Trace:
 [<ffffffff8153a756>] __mutex_lock_slowpath+0x96/0x210
 [<ffffffff8153a27b>] mutex_lock+0x2b/0x50
 [<ffffffffa02cb30d>] mgc_process_config+0x1dd/0x1210 [mgc]
 [<ffffffffa0476b61>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
 [<ffffffffa07fe28d>] obd_process_config.clone.0+0x8d/0x2e0 [obdclass]
 [<ffffffffa0476b61>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
 [<ffffffffa08024c2>] lustre_end_log+0x262/0x6a0 [obdclass]
 [<ffffffffa082efb1>] server_put_super+0x911/0xed0 [obdclass]
 [<ffffffff811b0116>] ? invalidate_inodes+0xf6/0x190
 [<ffffffff8119437b>] generic_shutdown_super+0x5b/0xe0
 [<ffffffff81194466>] kill_anon_super+0x16/0x60
 [<ffffffffa07fa096>] lustre_kill_super+0x36/0x60 [obdclass]
 [<ffffffff81194c07>] deactivate_super+0x57/0x80
 [<ffffffff811b4a7f>] mntput_no_expire+0xbf/0x110
 [<ffffffff811b55cb>] sys_umount+0x7b/0x3a0

It looks like the lock cancellation was blocked by this lock somehow.

Comment by Zhenyu Xu [ 17/Dec/15 ]

But in the MDS debug log, the same process later shows a different backtrace, and it has been hung in that state ever since.

09:25:21:Lustre: lock timed out (enqueued at 1445937602, 300s ago)
....
09:31:08:"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
09:31:08:umount        D 0000000000000000     0 24629  24628 0x00000080
09:31:08: ffff880059e2b948 0000000000000086 0000000000000000 ffffffffa046a27b
09:31:08: 00000000562f42f5 ffff8800378fd480 00001c639f546de4 ffff880052840c20
09:31:08: ffff8800378fd480 0000000101d7b690 ffff880076ee3ad8 ffff880059e2bfd8
09:31:08:Call Trace:
09:31:08: [<ffffffffa046a27b>] ? cfs_set_ptldebug_header+0x2b/0xc0 [libcfs]
09:31:08: [<ffffffff81539cd5>] schedule_timeout+0x215/0x2e0
09:31:08: [<ffffffff81539953>] wait_for_common+0x123/0x180
09:31:08: [<ffffffff810672b0>] ? default_wake_function+0x0/0x20
09:31:08: [<ffffffff81539a6d>] wait_for_completion+0x1d/0x20
09:31:08: [<ffffffffa0640fb1>] mgs_ir_fini_fs+0x1e1/0x340 [mgs]
09:31:08: [<ffffffffa0637348>] mgs_free_fsdb+0x48/0xf40 [mgs]
09:31:08: [<ffffffffa063828f>] mgs_cleanup_fsdb_list+0x4f/0x70 [mgs]
09:31:08: [<ffffffffa06209a0>] mgs_device_fini+0x120/0x5b0 [mgs]
09:31:08: [<ffffffffa07f3d42>] class_cleanup+0x572/0xd20 [obdclass]
09:31:08: [<ffffffffa07d4906>] ? class_name2dev+0x56/0xe0 [obdclass]
09:31:08: [<ffffffffa07f63c6>] class_process_config+0x1ed6/0x2830 [obdclass]
09:31:08: [<ffffffffa0476b61>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
09:31:08: [<ffffffffa07f71df>] class_manual_cleanup+0x4bf/0x8e0 [obdclass]
09:31:08: [<ffffffffa07d4906>] ? class_name2dev+0x56/0xe0 [obdclass]
09:31:08: [<ffffffffa082f374>] server_put_super+0xcd4/0xed0 [obdclass]
09:31:08: [<ffffffff811b0116>] ? invalidate_inodes+0xf6/0x190
09:31:08: [<ffffffff8119437b>] generic_shutdown_super+0x5b/0xe0
09:31:08: [<ffffffff81194466>] kill_anon_super+0x16/0x60
09:31:08: [<ffffffffa07fa096>] lustre_kill_super+0x36/0x60 [obdclass]
09:31:08: [<ffffffff81194c07>] deactivate_super+0x57/0x80
09:31:09: [<ffffffff811b4a7f>] mntput_no_expire+0xbf/0x110
09:31:09: [<ffffffff811b55cb>] sys_umount+0x7b/0x3a0
09:31:09: [<ffffffff8100b0d2>] system_call_fastpath+0x16/0x1b
Comment by Jinshan Xiong (Inactive) [ 17/Dec/15 ]

That's a different issue. The MGS should cancel the lock after lock expiration.

Comment by Zhenyu Xu [ 18/Dec/15 ]

[<ffffffffa02cb30d>] mgc_process_config+0x1dd/0x1210 [mgc]

corresponds to this code:

config_log_end()
        if (cld_recover) {
                mutex_lock(&cld_recover->cld_lock);          // -----------> here
                cld_recover->cld_stopping = 1;
                mutex_unlock(&cld_recover->cld_lock);
                config_log_put(cld_recover);
        }
Comment by Saurabh Tandan (Inactive) [ 18/Dec/15 ]

Another instance for EL6.7 Server/EL6.7 Client - ZFS
Master, build# 3270
https://testing.hpdd.intel.com/test_sets/925fea10-a275-11e5-bdef-5254006e85c2

Comment by Saurabh Tandan (Inactive) [ 18/Dec/15 ]

Another instance for EL7.1 Server/EL7.1 Client - ZFS
Master, build# 3264
https://testing.hpdd.intel.com/test_sets/42ea859e-a135-11e5-83b8-5254006e85c2

Comment by Saurabh Tandan (Inactive) [ 19/Dec/15 ]

Another instance for EL7.1 Server/EL7.1 Client - DNE
Master, build# 3270
https://testing.hpdd.intel.com/test_sets/69d69aba-a26d-11e5-bdef-5254006e85c2

Comment by Jinshan Xiong (Inactive) [ 19/Dec/15 ]

Hi Bobijam,

Please investigate why the config log end path could be blocked there; I suspect there is a deadlock or livelock between the lock cancellation code and the log end code.

For the hang itself, we should make a fix in mgs_completion_ast_generic(): if ldlm_completion_ast() returns a failure due to timeout, it should cancel the lock by itself.

Jinshan

Comment by Zhenyu Xu [ 24/Dec/15 ]

Jinshan,

I'm wondering whether the server actually needs the IR log, since only clients need it.

diff --git a/lustre/mgc/mgc_request.c b/lustre/mgc/mgc_request.c
index ba1a2e1..0f392dd 100644
--- a/lustre/mgc/mgc_request.c
+++ b/lustre/mgc/mgc_request.c
@@ -360,7 +360,8 @@ static int config_log_add(struct obd_device *obd, char *logname,
        cld->cld_params = params_cld;
 
         LASSERT(lsi->lsi_lmd);
-        if (!(lsi->lsi_lmd->lmd_flags & LMD_FLG_NOIR)) {
+        if (!(lsi->lsi_lmd->lmd_flags & LMD_FLG_NOIR) &&
+           !lmd_is_client(lsi->lsi_lmd)) {
                struct config_llog_data *recover_cld;
                ptr = strrchr(seclogname, '-');
                if (ptr != NULL) {
Comment by Jinshan Xiong (Inactive) [ 29/Dec/15 ]

Hi Bobijam,

Servers need to be notified too - for example, MDTs need to reconnect to restarted OSTs.

Comment by Zhenyu Xu [ 29/Dec/15 ]

I'm wondering whether this has to do with the OBD_CONNECT_MNE_SWAB/OBD_CONNECT_MDS_MDS change, especially commit bee9c1897677473f12c0b807edd3e8fec452bc32, which "Do not evict MDS-MDS connection" while "The MNE_SWAB flag is overloading the MDS_MDS bit only for the MGS connection".

The umount is waiting on the config_llog_data::cld_lock mutex, which is held by the requeue thread; that thread is waiting in ldlm_completion_ast() for the MGC lock to be granted or cancelled, and that never happens.

Comment by Jinshan Xiong (Inactive) [ 05/Jan/16 ]

It's not related - this can only be seen on heterogeneous nodes.

In that case, we can make a fix to destroy the lock in mgs_completion_ast_generic() if it times out waiting for the lock completion.

Comment by Jinshan Xiong (Inactive) [ 06/Jan/16 ]

I took a further look at this issue. It turns out that when the ldlm lock completion times out in ldlm_completion_ast(), it doesn't actually time the lock out; it just prints an error message and keeps waiting forever.

Once the completion AST times out for the MGS, it only destroys the in-flight RPC but leaves the lock alone; this is what causes the hang in this ticket.

I will create a patch to fix the problems I have seen so far.
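
A rough sketch of that fix direction (hypothetical names, not the actual patch or the Lustre API): when the completion wait times out, tear the lock down instead of only destroying the in-flight RPC, so the device fini / umount path can make progress.

#include <errno.h>
#include <stdio.h>

/* Hypothetical stand-ins: wait_for_grant() for the completion wait in
 * ldlm_completion_ast(), lock_teardown() for whatever cancels the lock
 * on the MGS side. */
struct fake_lock {
        int granted;
        int torn_down;
};

static int wait_for_grant(struct fake_lock *lk, int timeout_s)
{
        /* In the current code this path only prints "lock timed out" and
         * keeps waiting; here the wait is bounded and reports the timeout. */
        (void)timeout_s;
        return lk->granted ? 0 : -ETIMEDOUT;
}

static void lock_teardown(struct fake_lock *lk)
{
        lk->torn_down = 1;
}

static int completion_ast_with_cleanup(struct fake_lock *lk)
{
        int rc = wait_for_grant(lk, 300);

        if (rc == -ETIMEDOUT)
                lock_teardown(lk);   /* do not leave the stuck lock behind */
        return rc;
}

int main(void)
{
        struct fake_lock lk = { 0, 0 };

        printf("rc=%d torn_down=%d\n", completion_ast_with_cleanup(&lk), lk.torn_down);
        return 0;
}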

Comment by Gerrit Updater [ 06/Jan/16 ]

Jinshan Xiong (jinshan.xiong@intel.com) uploaded a new patch: http://review.whamcloud.com/17853
Subject: LU-7372 mgs: reprocess all locks at device fini
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 967963e6ef7ecb9eef140effbcec795612633e18

Comment by Jay Lan (Inactive) [ 09/Jan/16 ]

The above patch caused a compilation error in b2_5_fe.

The patch made a call to ldlm_namespace_free_prior() from lustre/mgs/mgs_handler.c, but that symbol is not visible in the scope of mgs_handler.c: it is not declared in ldlm/ldlm_internal.h, which is not included by mgs_handler.c in any case.

Comment by Jinshan Xiong (Inactive) [ 12/Jan/16 ]

Hi Jay,

I pushed a new patch that should apply to b2_5_fe. Please give it a try.

Comment by Jay Lan (Inactive) [ 12/Jan/16 ]

Hi Jinshan,

The new patch is even further away from b2_5_fe. My code does not even have ldlm_cancel_lock_for_export(); the version in b2_5_fe is ldlm_cancel_locks_for_export().

Comment by Zhenyu Xu [ 13/Jan/16 ]

The port for b2_5_fe is at http://review.whamcloud.com/17976

Comment by Jinshan Xiong (Inactive) [ 18/Jan/16 ]

Patch set 2 failed Maloo testing, but it may have hit a different problem. The original problem is that the node got stuck at cleanup, but this time it moved past that point. I will take a further look.

Comment by Sarah Liu [ 19/Jan/16 ]

client and server: lustre-master # 3305 RHEL6.7

https://testing.hpdd.intel.com/test_sets/798b7bb0-bba4-11e5-8506-5254006e85c2

Comment by Jinshan Xiong (Inactive) [ 19/Jan/16 ]

I couldn't see any log from the Maloo result. Is there any way to retain the logs after the test completes? Otherwise we can reproduce it on our test nodes manually.

Comment by Joseph Gmitter (Inactive) [ 25/Jan/16 ]

Sarah,
Are you able to help Jinshan test manually and obtain the needed logs?
Thanks.
Joe

Comment by Sarah Liu [ 25/Jan/16 ]

Hi Joe,

Working on it right now.
The test-only patch is:
http://review.whamcloud.com/#/c/18130/

Comment by Sarah Liu [ 26/Jan/16 ]

Xiong,
Here are the logs for your patch set 3;
please see them on shadow under this directory:

[w3liu@shadow-1 ~]$ ls /home/autotest/logdir/test_logs/2016-01-25/lustre-reviews-el6_7-x86_64--review-dne-part-1--1_6_1__37054__-70190851309880-224528 | grep 26
replay-dual.test_26.console.shadow-4vm1.log
replay-dual.test_26.console.shadow-4vm2.log
replay-dual.test_26.console.shadow-4vm3.log
replay-dual.test_26.console.shadow-4vm4.log
replay-dual.test_26.console.shadow-4vm8.log
replay-dual.test_26.test_log.shadow-4vm1.log
Comment by Sarah Liu [ 26/Jan/16 ]

I put all the logs into one file; please check.

Comment by Jinshan Xiong (Inactive) [ 27/Jan/16 ]

Finally I could reproduce the problem on my own test cluster at home. It turned out the client running dbench was evicted. Please check the log in the attachment.

Before it was evicted, the client complained about a 'version mismatch during replay'; it looks like the object was changed. I don't know how this could happen.

Lustre: 2818:0:(client.c:2886:ptlrpc_replay_interpret()) @@@ Version mismatch during replay
  req@ffff8800383a3c80 x1524469815762040/t360777260454(360777260454) o36->lustre-MDT0000-mdc-ffff88005a4a7800@172.16.0.128@tcp:12/10 lens 616/424 e 0 to 0 dl 1453855303 ref 2 fl Interpret:R/4/0 rc -75/-75
Lustre: 2818:0:(client.c:2886:ptlrpc_replay_interpret()) @@@ Version mismatch during replay
  req@ffff88003d87c9c0 x1524469815775168/t360777261048(360777261048) o101->lustre-MDT0000-mdc-ffff88005a4a7800@172.16.0.128@tcp:12/10 lens 808/544 e 0 to 0 dl 1453855304 ref 2 fl Interpret:R/4/0 rc -75/-75
Lustre: 2818:0:(client.c:2886:ptlrpc_replay_interpret()) Skipped 1 previous similar message
Lustre: 2818:0:(client.c:2886:ptlrpc_replay_interpret()) @@@ Version mismatch during replay
  req@ffff880037ca60c0 x1524469815787324/t360777261724(360777261724) o36->lustre-MDT0000-mdc-ffff88005a4a7800@172.16.0.128@tcp:12/10 lens 616/424 e 0 to 0 dl 1453855306 ref 2 fl Interpret:R/4/0 rc -75/-75
Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 9 sec
Lustre: 2818:0:(import.c:1339:completed_replay_interpret()) lustre-MDT0000-mdc-ffff88005a4a7800: version recovery fails, reconnecting
Comment by Jinshan Xiong (Inactive) [ 27/Jan/16 ]

Actually, the problem I saw is not relevant to this ticket at all, because I can see the same problem without the patch applied.

Comment by Jinshan Xiong (Inactive) [ 27/Jan/16 ]

Hi Tappro, can you please take a look at the logs I posted above?

Comment by Di Wang [ 27/Jan/16 ]

Hmm, this usually means recovery failed (most likely the fid and name do not match, IMHO). So do you mean this failure is related to the umount timeout or not? If not, do you mind if I create a new ticket for it? I have seen similar failures in other recovery tests as well.

Comment by Di Wang [ 27/Jan/16 ]

Though this might not be the real cause of the "version mismatch", replay-dual test 26 passes consistently with the LU-7716 fix in my local testing.

Comment by Jinshan Xiong (Inactive) [ 27/Jan/16 ]

Hi Di, the problem I mentioned above is unrelated to the umount issue we're addressing in this ticket. Please go ahead and file a new ticket. The problem is that the test can't pass at all; it keeps hitting that issue.

Comment by Jinshan Xiong (Inactive) [ 27/Jan/16 ]

btw, are you seeing the same issue as I described above?

Comment by Di Wang [ 27/Jan/16 ]

Ah, I did not know about the umount issue, and I cannot reproduce it locally either. But I can see the "Version mismatch" issue you described above, which seems related to rename recovery.

I posted a patch at http://review.whamcloud.com/18172, and the "version mismatch" seems to disappear with it, at least in my local tests. But I probably need to dig into it further.

Comment by Jinshan Xiong (Inactive) [ 27/Jan/16 ]

Indeed, your patch helped a lot in my new test run. I only saw one failure after that, with no eviction, but I didn't catch the log. I tried many times to reproduce it again but failed - it may just have been a hiccup somewhere.

I'll rebase my patch on Di's and see how it goes.

Comment by Jay Lan (Inactive) [ 28/Jan/16 ]

Zhenyu Xu ported an earlier version (patch set 3, IIRC) of #17853 for b2_5_fe at
http://review.whamcloud.com/17976 (see comment at 12/Jan/16 9:26 PM)

Could someone from Intel update 17976 to the latest patch set #4 for us? Thanks!

Comment by Jinshan Xiong (Inactive) [ 30/Jan/16 ]

Hi Jay,

Patch set 3 of 17853 is the same as patch 17976.

Comment by Saurabh Tandan (Inactive) [ 03/Feb/16 ]

Another instance failing with the same error as above for tag 2.7.66 for FULL - EL6.7 Server/EL6.7 Client - DNE, master build# 3314.
https://testing.hpdd.intel.com/test_sets/86ca0268-ca83-11e5-9215-5254006e85c2

Encountered another instance for tag 2.7.66 for FULL - EL7.1 Server/EL7.1 Client, master, build# 3314.
https://testing.hpdd.intel.com/test_sets/b91000ec-ca88-11e5-84d3-5254006e85c2

Another failure for master: tag 2.7.66 FULL - EL7.1 Server/SLES11 SP3 Client, build# 3314
https://testing.hpdd.intel.com/test_sets/a51072dc-ca7b-11e5-9609-5254006e85c2

Comment by Jinshan Xiong (Inactive) [ 04/Feb/16 ]

For the failure in these test results:
https://testing.hpdd.intel.com/test_sets/10bb2d7e-cb44-11e5-b3e8-5254006e85c2

It shows the error message "dbench 31895 missing", but the dbench process (dbench_26_pid) actually had PID 7185. There is no error message in the console log either. I don't know where the error came from.

Comment by Saurabh Tandan (Inactive) [ 10/Feb/16 ]

Another instance found for interop tag 2.7.66 - EL7 Server/2.7.1 Client, build# 3316
https://testing.hpdd.intel.com/test_sets/d59d88b4-cc91-11e5-b80c-5254006e85c2

Comment by Saurabh Tandan (Inactive) [ 10/Feb/16 ]

Another instance found for Full tag 2.7.66 - EL6.7 Server/EL6.7 Client, build# 3314
https://testing.hpdd.intel.com/test_sets/64aef89c-ca6e-11e5-9215-5254006e85c2

Another instance found for Full tag 2.7.66 - EL6.7 Server/EL6.7 Client - ZFS, build# 3314
https://testing.hpdd.intel.com/test_sets/999f5e28-cb47-11e5-a59a-5254006e85c2

Another instance found for Full tag 2.7.66 - EL6.7 Server/EL6.7 Client - DNE, build# 3314
https://testing.hpdd.intel.com/test_sets/86ca0268-ca83-11e5-9215-5254006e85c2

Another instance found for Full tag 2.7.66 - EL7.1 Server/EL6.7 Client, build# 3314
https://testing.hpdd.intel.com/test_sets/84997ede-ca91-11e5-9609-5254006e85c2

Another instance found for Full tag 2.7.66 - EL7.1 Server/EL7.1 Client, build# 3314
https://testing.hpdd.intel.com/test_sets/b91000ec-ca88-11e5-84d3-5254006e85c2

Another instance found for Full tag 2.7.66 - EL7.1 Server/SLES11 SP3 Client, build# 3314
https://testing.hpdd.intel.com/test_sets/a51072dc-ca7b-11e5-9609-5254006e85c2

Another instance found for Full tag 2.7.66 - EL7.1 Server/EL7.1 Client - ZFS, build# 3314
https://testing.hpdd.intel.com/test_sets/de1df1ae-cb88-11e5-b49e-5254006e85c2

Another instance found for Full tag 2.7.66 - EL7.1 Server/EL7.1 Client - DNE, build# 3314
https://testing.hpdd.intel.com/test_sets/a7b8000a-cac5-11e5-9609-5254006e85c2

Comment by Andreas Dilger [ 03/Jun/16 ]

Not sure if LU-7765 is the cause of this, but it can't be helping.

Comment by Gerrit Updater [ 07/Jun/16 ]

Andreas Dilger (andreas.dilger@intel.com) uploaded a new patch: http://review.whamcloud.com/20669
Subject: LU-7372 tests: disable replay-dual test_26
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: f521bf86fd77e3af72873281decdadc8fef6eb81

Comment by Gerrit Updater [ 11/Jun/16 ]

Andreas Dilger (andreas.dilger@intel.com) merged in patch http://review.whamcloud.com/20669/
Subject: LU-7372 tests: disable replay-dual test_26
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 8b8b000eede1353a879b71d052a8a6f998431cdd

Comment by Jian Yu [ 08/Jul/16 ]

On master branch, replay-vbr test 1b also hit the same failure:
https://testing.hpdd.intel.com/test_sets/95d54048-41f6-11e6-a0ce-5254006e85c2
https://testing.hpdd.intel.com/test_sets/d7876c2c-41bb-11e6-bbf5-5254006e85c2
https://testing.hpdd.intel.com/test_sets/470a0382-541f-11e6-88a7-5254006e85c2

Comment by Li Xi (Inactive) [ 22/Feb/17 ]

Hi, is there any conclusion on the patch https://review.whamcloud.com/#/c/17853/? We are hitting this problem very frequently. Could you please accelerate the review of that patch?

Comment by Peter Jones [ 21/Apr/17 ]

Li Xi

It would probably be best to open a DDN ticket for the issue you are hitting

Peter

Comment by James Nunez (Inactive) [ 24/May/17 ]

Using patch 17853, we are still seeing replay-dual test 26 hangs. The patch I used to test this is at https://review.whamcloud.com/#/c/27256/ and the replay-dual failure is at https://testing.hpdd.intel.com/test_sets/3d171e98-408f-11e7-b3f6-5254006e85c2.

Looking at the MDS1 console, we can see that the MDS hangs in umount

07:36:47:[ 4392.087661] Lustre: 5245:0:(client.c:2114:ptlrpc_expire_one_request()) Skipped 24 previous similar messages
07:36:47:[ 4425.382238] LustreError: 137-5: lustre-MDT0000_UUID: not available for connect from 10.2.4.147@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
07:36:47:[ 4425.387696] LustreError: Skipped 901 previous similar messages
07:36:47:[ 4440.091059] INFO: task umount:13451 blocked for more than 120 seconds.
07:36:47:[ 4440.093651] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
07:36:47:[ 4440.096310] umount          D ffff88005f304600     0 13451  13450 0x00000080
07:36:47:[ 4440.098974]  ffff880065df3a20 0000000000000082 ffff8800663cbec0 ffff880065df3fd8
07:36:47:[ 4440.101633]  ffff880065df3fd8 ffff880065df3fd8 ffff8800663cbec0 ffff88005f304740
07:36:47:[ 4440.104257]  ffff88005f304748 7fffffffffffffff ffff8800663cbec0 ffff88005f304600
07:36:47:[ 4440.106805] Call Trace:
07:36:47:[ 4440.108922]  [<ffffffff8168c3c9>] schedule+0x29/0x70
07:36:47:[ 4440.111173]  [<ffffffff81689e29>] schedule_timeout+0x239/0x2c0
07:36:47:[ 4440.113525]  [<ffffffff811dafb9>] ? discard_slab+0x39/0x50
07:36:47:[ 4440.115807]  [<ffffffff811dcaaa>] ? __slab_free+0x2ca/0x2f0
07:36:47:[ 4440.118100]  [<ffffffff8168c7a6>] wait_for_completion+0x116/0x170
07:36:47:[ 4440.120423]  [<ffffffff810c54c0>] ? wake_up_state+0x20/0x20
07:36:47:[ 4440.122724]  [<ffffffffa0d8205e>] mgs_ir_fini_fs+0x27e/0x2f0 [mgs]
07:36:47:[ 4440.125044]  [<ffffffffa0d753f3>] mgs_put_fsdb+0x53/0x910 [mgs]
07:36:47:[ 4440.127330]  [<ffffffffa0d75e3c>] mgs_cleanup_fsdb_list+0x6c/0x90 [mgs]
07:36:47:[ 4440.129664]  [<ffffffffa0d59a97>] mgs_device_fini+0x97/0x5b0 [mgs]
07:36:47:[ 4440.132081]  [<ffffffffa087d584>] class_cleanup+0x784/0xd90 [obdclass]
07:36:47:[ 4440.134416]  [<ffffffffa088051d>] class_process_config+0x225d/0x2b10 [obdclass]
07:36:47:[ 4440.136806]  [<ffffffffa0678d87>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
07:36:47:[ 4440.139136]  [<ffffffffa0880ebf>] class_manual_cleanup+0xef/0x810 [obdclass]
07:36:47:[ 4440.141491]  [<ffffffffa08b1580>] server_put_super+0xb20/0xcd0 [obdclass]
07:36:47:[ 4440.143755]  [<ffffffff81200d22>] generic_shutdown_super+0x72/0xf0
07:36:47:[ 4440.145932]  [<ffffffff812010f2>] kill_anon_super+0x12/0x20
07:36:47:[ 4440.148033]  [<ffffffffa0884a92>] lustre_kill_super+0x32/0x50 [obdclass]
07:36:47:[ 4440.150217]  [<ffffffff812014a9>] deactivate_locked_super+0x49/0x60
07:36:47:[ 4440.152331]  [<ffffffff81201aa6>] deactivate_super+0x46/0x60
07:36:47:[ 4440.154363]  [<ffffffff8121eef5>] mntput_no_expire+0xc5/0x120
07:36:47:[ 4440.156357]  [<ffffffff81220030>] SyS_umount+0xa0/0x3b0
07:36:47:[ 4440.158276]  [<ffffffff816973c9>] system_call_fastpath+0x16/0x1b
07:36:47:[ 4560.160066] INFO: task umount:13451 blocked for more than 120 seconds.
Comment by Gerrit Updater [ 29/Jul/17 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/17853/
Subject: LU-7372 mgs: reprocess all locks at device fini
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 2dc19f20ba9fcc1bcac6ae7ee5169ce10caab882

Comment by Peter Jones [ 29/Jul/17 ]

Landed for 2.11

Comment by Gerrit Updater [ 02/Aug/17 ]

Minh Diep (minh.diep@intel.com) uploaded a new patch: https://review.whamcloud.com/28323
Subject: LU-7372 mgs: reprocess all locks at device fini
Project: fs/lustre-release
Branch: b2_10
Current Patch Set: 1
Commit: 3eed63b655e18de53540b7aaa33aa69e17f78257

Comment by Gerrit Updater [ 10/Aug/17 ]

John L. Hammond (john.hammond@intel.com) merged in patch https://review.whamcloud.com/28323/
Subject: LU-7372 mgs: reprocess all locks at device fini
Project: fs/lustre-release
Branch: b2_10
Current Patch Set:
Commit: 6df5f062727fc216072bc8daa7dfbc545876a2b0

Comment by James Nunez (Inactive) [ 18/Dec/17 ]

replay-dual test 26 is still on the ALWAYS_EXCEPT list and needs to be removed to fully verify that the patches fix this issue.

Comment by Gerrit Updater [ 31/Dec/17 ]

James Casper (jamesx.casper@intel.com) uploaded a new patch: https://review.whamcloud.com/30677
Subject: LU-7372 test: Remove replay-dual test_26 from ALWAYS_EXCEPT
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 09d341f5fb7b09283cb4515b17f10905564fdf4c

Comment by Gerrit Updater [ 09/Jan/18 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/30677/
Subject: LU-7372 test: Remove replay-dual test_26 from ALWAYS_EXCEPT
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: bda752429ad6e0d7be0eb76c8abe1bf8561ac32f

Comment by Gerrit Updater [ 09/Jan/18 ]

Minh Diep (minh.diep@intel.com) uploaded a new patch: https://review.whamcloud.com/30796
Subject: LU-7372 test: Remove replay-dual test_26 from ALWAYS_EXCEPT
Project: fs/lustre-release
Branch: b2_10
Current Patch Set: 1
Commit: 3872c328fa09d5fbde801e5e88e783ae27bf99a6

Comment by Jian Yu [ 22/Jan/18 ]

By searching on Maloo, I found replay-dual test 26 hung at least 17 times in the last week.

Here is a failure instance on master branch:
https://testing.hpdd.intel.com/test_sets/358e9718-fda5-11e7-a7cd-52540065bddc

Console log on MDS:

[51864.393931] LustreError: 14617:0:(qsd_reint.c:56:qsd_reint_completion()) lustre-MDT0000: failed to enqueue global quota lock, glb fid:[0x200000006:0x1010000:0x0], rc:-5
[51864.395871] LustreError: 14617:0:(qsd_reint.c:56:qsd_reint_completion()) Skipped 1 previous similar message
[51960.236441] INFO: task umount:14605 blocked for more than 120 seconds.
[51960.237288] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[51960.238163] umount          D ffff88005c01cf10     0 14605  14604 0x00000080
[51960.238987] Call Trace:
[51960.239287]  [<ffffffff816ac5d9>] schedule_preempt_disabled+0x29/0x70
[51960.240077]  [<ffffffff816aa407>] __mutex_lock_slowpath+0xc7/0x1d0
[51960.241003]  [<ffffffff816a981f>] mutex_lock+0x1f/0x2f
[51960.241719]  [<ffffffffc069c3e7>] mgc_process_config+0x207/0x13f0 [mgc]
[51960.242608]  [<ffffffffc0950a26>] obd_process_config.constprop.14+0x76/0x230 [obdclass]
[51960.243674]  [<ffffffffc0638d47>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
[51960.244490]  [<ffffffffc093d639>] ? lprocfs_counter_add+0xf9/0x160 [obdclass]
[51960.245406]  [<ffffffffc0951cbf>] lustre_end_log+0x1ff/0x550 [obdclass]
[51960.246218]  [<ffffffffc097d15e>] server_put_super+0x7de/0xcd0 [obdclass]
[51960.247052]  [<ffffffff812054d2>] generic_shutdown_super+0x72/0x100
[51960.247873]  [<ffffffff812058a2>] kill_anon_super+0x12/0x20
[51960.248579]  [<ffffffffc09500c2>] lustre_kill_super+0x32/0x50 [obdclass]
[51960.249417]  [<ffffffff81205c59>] deactivate_locked_super+0x49/0x60
[51960.250180]  [<ffffffff812063c6>] deactivate_super+0x46/0x60
[51960.250878]  [<ffffffff8122376f>] cleanup_mnt+0x3f/0x80
[51960.251555]  [<ffffffff81223802>] __cleanup_mnt+0x12/0x20
[51960.252269]  [<ffffffff810aee05>] task_work_run+0xc5/0xf0
[51960.252967]  [<ffffffff8102ab52>] do_notify_resume+0x92/0xb0
[51960.253708]  [<ffffffff816b8d37>] int_signal+0x12/0x17
Comment by Minh Diep [ 04/May/18 ]

+1 on b2_10 https://testing.hpdd.intel.com/test_sets/07ca0dd2-43d0-11e8-960d-52540065bddc

Comment by James Nunez (Inactive) [ 05/Jul/18 ]

We have another case of replay-dual test 26 failing in umount. The interesting thing here is that the test just fails and continues with the next test. The test does not time out/hang due to this failure.

The console logs of the MDS (vm4) have the same call trace in umount as above.

Logs are at: https://testing.whamcloud.com/test_sets/f594d332-8023-11e8-97ff-52540065bddc

Comment by Gerrit Updater [ 30/Jul/18 ]

James Nunez (jnunez@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/32902
Subject: LU-7372 tests: stop running replay-dual test 26
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 668755f74905ae3e73f46b6eea82f54d5cdc4864

Comment by Gerrit Updater [ 09/Aug/18 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/32902/
Subject: LU-7372 tests: stop running replay-dual test 26
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 77fc116fb230f330761767e240882c2141b30c1a

Comment by James Nunez (Inactive) [ 14/Aug/18 ]

Looking at the MDS console logs, the following test sessions have essentially the same stack trace as described in this ticket. The kernel crash logs show the oom-killer being invoked.

For https://testing.whamcloud.com/test_sets/a498de80-9ade-11e8-8ee3-52540065bddc, the kernel crash log shows that tar invokes the oom-killer:

[60604.459486] tar invoked oom-killer: gfp_mask=0x200da, order=0, oom_score_adj=0
[60604.460400] tar cpuset=/ mems_allowed=0
[60604.460823] CPU: 0 PID: 16324 Comm: tar Kdump: loaded Tainted: G           OE  ------------   3.10.0-862.9.1.el7.x86_64 #1
[60604.461874] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[60604.462439] Call Trace:
[60604.462724]  [<ffffffff81b0e84e>] dump_stack+0x19/0x1b
[60604.463220]  [<ffffffff81b0a1d0>] dump_header+0x90/0x229
[60604.463737]  [<ffffffff81b1badf>] ? notifier_call_chain+0x4f/0x70
[60604.464338]  [<ffffffff814c17b8>] ? __blocking_notifier_call_chain+0x58/0x70
[60604.465018]  [<ffffffff8159826e>] check_panic_on_oom+0x2e/0x60
[60604.465589]  [<ffffffff8159868b>] out_of_memory+0x23b/0x4f0
[60604.466124]  [<ffffffff8159f224>] __alloc_pages_nodemask+0xaa4/0xbb0
[60604.466735]  [<ffffffff815ec525>] alloc_pages_vma+0xb5/0x200
[60604.467279]  [<ffffffff815dae45>] __read_swap_cache_async+0x115/0x190
[60604.467886]  [<ffffffff815daee6>] read_swap_cache_async+0x26/0x60
[60604.468472]  [<ffffffff815dafc8>] swapin_readahead+0xa8/0x110
[60604.469034]  [<ffffffff815c5f37>] handle_pte_fault+0x777/0xc30
[60604.469601]  [<ffffffff815c7c3d>] handle_mm_fault+0x39d/0x9b0
[60604.470163]  [<ffffffff81525092>] ? from_kgid_munged+0x12/0x20
[60604.470717]  [<ffffffff81b1b557>] __do_page_fault+0x197/0x4f0
[60604.471260]  [<ffffffff81b1b996>] trace_do_page_fault+0x56/0x150
[60604.471829]  [<ffffffff81b1af22>] do_async_page_fault+0x22/0xf0
[60604.472405]  [<ffffffff81b17788>] async_page_fault+0x28/0x30
[60604.472938] Mem-Info:
[60604.473179] active_anon:0 inactive_anon:6 isolated_anon:0
 active_file:274 inactive_file:308 isolated_file:0
 unevictable:0 dirty:38 writeback:0 unstable:0
 slab_reclaimable:3259 slab_unreclaimable:38342
 mapped:181 shmem:12 pagetables:1345 bounce:0
 free:12914 free_pcp:0 free_cma:0
[60604.476120] Node 0 DMA free:7020kB min:416kB low:520kB high:624kB active_anon:4kB inactive_anon:0kB active_file:4kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15992kB managed:15908kB mlocked:0kB dirty:12kB writeback:0kB mapped:4kB shmem:48kB slab_reclaimable:132kB slab_unreclaimable:724kB kernel_stack:16kB pagetables:124kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:1494 all_unreclaimable? yes
[60604.480017] lowmem_reserve[]: 0 1660 1660 1660
[60604.480573] Node 0 DMA32 free:44636kB min:44636kB low:55792kB high:66952kB active_anon:0kB inactive_anon:24kB active_file:1092kB inactive_file:1232kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2080744kB managed:1704004kB mlocked:0kB dirty:140kB writeback:0kB mapped:720kB shmem:0kB slab_reclaimable:12904kB slab_unreclaimable:152644kB kernel_stack:2848kB pagetables:5256kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:3668 all_unreclaimable? yes
[60604.484717] lowmem_reserve[]: 0 0 0 0
[60604.485180] Node 0 DMA: 9*4kB (UM) 5*8kB (UM) 10*16kB (M) 9*32kB (UM) 8*64kB (UM) 5*128kB (UM) 5*256kB (UM) 4*512kB (UM) 0*1024kB 1*2048kB (M) 0*4096kB = 7052kB
[60604.486948] Node 0 DMA32: 2654*4kB (EM) 2385*8kB (EM) 941*16kB (UM) 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 44752kB
[60604.488485] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[60604.489308] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[60604.490109] 303 total pagecache pages
[60604.490472] 6 pages in swap cache
[60604.490790] Swap cache stats: add 86869, delete 86863, find 18036/30348
[60604.491421] Free swap  = 3521276kB
[60604.491749] Total swap = 3671036kB
[60604.492082] 524184 pages RAM
[60604.492368] 0 pages HighMem/MovableOnly
[60604.492812] 94206 pages reserved
[60604.493137] [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
[60604.493889] [  333]     0   333     9868        0      22      108             0 systemd-journal
[60604.494776] [  359]     0   359    29149        0      26       79             0 lvmetad
[60604.495542] [  366]     0   366    11101        1      23      147         -1000 systemd-udevd
[60604.496357] [  453]     0   453    13877        0      26      119         -1000 auditd
[60604.497124] [  480]     0   480     6627        1      19       95             0 systemd-logind
[60604.497929] [  481]   999   481   134608        0      60     2165             0 polkitd
[60604.498689] [  482]     0   482     5381        0      15       59             0 irqbalance
[60604.499486] [  483]    81   483    14590        1      34      213          -900 dbus-daemon
[60604.500283] [  484]    32   484    17305        0      38      160             0 rpcbind
[60604.501042] [  485]     0   485    48770        0      36      126             0 gssproxy
[60604.501815] [  486]     0   486   137505        0      87      654             0 NetworkManager
[60604.502639] [  501]   998   501    30087        0      29      123             0 chronyd
[60604.503403] [  533]     0   533    26849        1      53      499             0 dhclient
[60604.504171] [  895]     0   895   143453       42      98     2797             0 tuned
[60604.504929] [  896]     0   896    28203        1      55      257         -1000 sshd
[60604.505686] [  900]     0   900    74575        8      73      916             0 rsyslogd
[60604.506456] [  906]     0   906     6791        1      18       62             0 xinetd
[60604.507223] [  913]    29   913    10605        0      24      209             0 rpc.statd
[60604.508002] [  917]   997   917    56469        0      23      285             0 munged
[60604.508760] [  992]     0   992    31570        1      19      155             0 crond
[60604.509518] [  993]     0   993     6476        0      19       52             0 atd
[60604.510256] [  996]     0   996   167981        0      69      580             0 automount
[60604.511040] [ 1007]     0  1007    27522        1      12       33             0 agetty
[60604.511804] [ 1009]     0  1009    27522        1       9       32             0 agetty
[60604.512563] [ 1181]     0  1181    22408        0      43      265             0 master
[60604.513318] [ 1200]    89  1200    22451        0      46      253             0 qmgr
[60604.514075] [10963]     0 10963    39169        0      78      351             0 sshd
[60604.514809] [10965]     0 10965    28296        1      14       58             0 run_test.sh
[60604.515601] [11233]     0 11233    29536        1      16      790             0 bash
[60604.516346] [21585]     0 21585    29536        0      13      790             0 bash
[60604.517086] [21586]     0 21586    26988        0      10       27             0 tee
[60604.517808] [21767]     0 21767    29573        1      14      840             0 bash
[60604.518551] [31867]    89 31867    22434        0      45      251             0 pickup
[60604.519302] [11943]     0 11943    29607        1      14      863             0 bash
[60604.520033] [11944]     0 11944    26988        0      10       28             0 tee
[60604.520775] [12397]     0 12397    29607        0      14      863             0 bash
[60604.521528] [12398]     0 12398    29607        0      14      863             0 bash
[60604.522264] [12911]     0 12911    24022        0      22       84             0 pdsh
[60604.523013] [12912]     0 12912    29228        0      14       42             0 sed
[60604.523754] [15184]     0 15184    29474        1      13      732             0 rundbench
[60604.524547] [15195]     0 15195     1618       26       9       52             0 dbench
[60604.525302] [15196]     0 15196     1620       61       9       62             0 dbench
[60604.526052] [16324]     0 16324    30920       53      15       89             0 tar
[60604.526795] [16325]     0 16325    30852       57      17       81             0 tar
[60604.527548] Kernel panic - not syncing: Out of memory: system-wide panic_on_oom is enabled

[60604.528460] CPU: 0 PID: 16324 Comm: tar Kdump: loaded Tainted: G           OE  ------------   3.10.0-862.9.1.el7.x86_64 #1
[60604.529473] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[60604.530011] Call Trace:
[60604.530259]  [<ffffffff81b0e84e>] dump_stack+0x19/0x1b
[60604.530746]  [<ffffffff81b08b50>] panic+0xe8/0x21f
[60604.531196]  [<ffffffff81598295>] check_panic_on_oom+0x55/0x60
[60604.531747]  [<ffffffff8159868b>] out_of_memory+0x23b/0x4f0
[60604.532268]  [<ffffffff8159f224>] __alloc_pages_nodemask+0xaa4/0xbb0
[60604.532865]  [<ffffffff815ec525>] alloc_pages_vma+0xb5/0x200
[60604.533400]  [<ffffffff815dae45>] __read_swap_cache_async+0x115/0x190
[60604.533994]  [<ffffffff815daee6>] read_swap_cache_async+0x26/0x60
[60604.534565]  [<ffffffff815dafc8>] swapin_readahead+0xa8/0x110
[60604.535096]  [<ffffffff815c5f37>] handle_pte_fault+0x777/0xc30
[60604.535643]  [<ffffffff815c7c3d>] handle_mm_fault+0x39d/0x9b0
[60604.536180]  [<ffffffff81525092>] ? from_kgid_munged+0x12/0x20
[60604.536733]  [<ffffffff81b1b557>] __do_page_fault+0x197/0x4f0
[60604.537268]  [<ffffffff81b1b996>] trace_do_page_fault+0x56/0x150
[60604.537828]  [<ffffffff81b1af22>] do_async_page_fault+0x22/0xf0
[60604.538383]  [<ffffffff81b17788>] async_page_fault+0x28/0x30

The kernel crash log for https://testing.whamcloud.com/test_sets/0bb1397a-9bb9-11e8-8ee3-52540065bddc also shows tar invoking the oom-killer, but with a few different *_newfstat calls in the trace:

[44806.676691] Lustre: DEBUG MARKER: test_26 fail mds1 1 times
[44998.330738] tar invoked oom-killer: gfp_mask=0x200da, order=0, oom_score_adj=0
[44998.332386] tar cpuset=/ mems_allowed=0
[44998.333090] CPU: 0 PID: 28253 Comm: tar Kdump: loaded Tainted: G        W  OE  ------------   3.10.0-862.9.1.el7.x86_64 #1
[44998.335039] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[44998.335669] Call Trace:
[44998.335982]  [<ffffffffabd0e84e>] dump_stack+0x19/0x1b
[44998.336497]  [<ffffffffabd0a1d0>] dump_header+0x90/0x229
[44998.337027]  [<ffffffffabd1badf>] ? notifier_call_chain+0x4f/0x70
[44998.337645]  [<ffffffffab6c17b8>] ? __blocking_notifier_call_chain+0x58/0x70
[44998.338336]  [<ffffffffab79826e>] check_panic_on_oom+0x2e/0x60
[44998.338902]  [<ffffffffab79868b>] out_of_memory+0x23b/0x4f0
[44998.339448]  [<ffffffffab79f224>] __alloc_pages_nodemask+0xaa4/0xbb0
[44998.340074]  [<ffffffffab7ec525>] alloc_pages_vma+0xb5/0x200
[44998.340639]  [<ffffffffab7dae45>] __read_swap_cache_async+0x115/0x190
[44998.341254]  [<ffffffffab7daee6>] read_swap_cache_async+0x26/0x60
[44998.341850]  [<ffffffffab7dafc8>] swapin_readahead+0xa8/0x110
[44998.342416]  [<ffffffffab7c5f37>] handle_pte_fault+0x777/0xc30
[44998.342995]  [<ffffffffab7c7c3d>] handle_mm_fault+0x39d/0x9b0
[44998.343566]  [<ffffffffabd1b557>] __do_page_fault+0x197/0x4f0
[44998.344126]  [<ffffffffabd1b996>] trace_do_page_fault+0x56/0x150
[44998.344722]  [<ffffffffabd1af22>] do_async_page_fault+0x22/0xf0
[44998.345287]  [<ffffffffabd17788>] async_page_fault+0x28/0x30
[44998.345888]  [<ffffffffab959730>] ? copy_user_generic_string+0x30/0x40
[44998.346533]  [<ffffffffab82142f>] ? cp_new_stat+0x14f/0x180
[44998.347077]  [<ffffffffab8215b4>] SYSC_newfstat+0x34/0x60
[44998.347607]  [<ffffffffab82179e>] SyS_newfstat+0xe/0x10
[44998.348120]  [<ffffffffabd20795>] system_call_fastpath+0x1c/0x21
[44998.348710]  [<ffffffffabd206e1>] ? system_call_after_swapgs+0xae/0x146
[44998.349352] Mem-Info:
[44998.349595] active_anon:2 inactive_anon:18 isolated_anon:0
 active_file:15 inactive_file:1090 isolated_file:0
 unevictable:0 dirty:0 writeback:30 unstable:0
 slab_reclaimable:3679 slab_unreclaimable:38446
 mapped:55 shmem:0 pagetables:1385 bounce:0
 free:12871 free_pcp:108 free_cma:0
[44998.352588] Node 0 DMA free:6956kB min:416kB low:520kB high:624kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15992kB managed:15908kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:72kB slab_unreclaimable:684kB kernel_stack:32kB pagetables:24kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
[44998.356534] lowmem_reserve[]: 0 1660 1660 1660
[44998.357078] Node 0 DMA32 free:44528kB min:44636kB low:55792kB high:66952kB active_anon:8kB inactive_anon:72kB active_file:60kB inactive_file:4360kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2080744kB managed:1704004kB mlocked:0kB dirty:0kB writeback:120kB mapped:220kB shmem:0kB slab_reclaimable:14644kB slab_unreclaimable:153100kB kernel_stack:2832kB pagetables:5516kB unstable:0kB bounce:0kB free_pcp:432kB local_pcp:4kB free_cma:0kB writeback_tmp:0kB pages_scanned:1441 all_unreclaimable? yes
[44998.361327] lowmem_reserve[]: 0 0 0 0
[44998.361788] Node 0 DMA: 17*4kB (U) 11*8kB (U) 5*16kB (U) 3*32kB (U) 0*64kB 2*128kB (U) 1*256kB (U) 0*512kB 0*1024kB 3*2048kB (M) 0*4096kB = 6988kB
[44998.363459] Node 0 DMA32: 1871*4kB (EM) 2399*8kB (UEM) 1117*16kB (M) 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 44548kB
[44998.365051] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[44998.365894] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[44998.366710] 1163 total pagecache pages
[44998.367087] 0 pages in swap cache
[44998.367423] Swap cache stats: add 95573, delete 95573, find 17225/30172
[44998.368063] Free swap  = 3521020kB
[44998.368403] Total swap = 3671036kB
[44998.368744] 524184 pages RAM
[44998.369026] 0 pages HighMem/MovableOnly
[44998.369409] 94206 pages reserved
[44998.369738] [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
[44998.370539] [  337]     0   337    10055        1      24       98             0 systemd-journal
[44998.371399] [  359]     0   359    29149        0      26       79             0 lvmetad
[44998.372180] [  364]     0   364    11100        1      24      146         -1000 systemd-udevd
[44998.373025] [  455]     0   455    13877        0      27      119         -1000 auditd
[44998.373804] [  481]     0   481     5381        0      15       59             0 irqbalance
[44998.374612] [  484]    32   484    17305        0      36      139             0 rpcbind
[44998.375402] [  486]    81   486    14554        1      33      176          -900 dbus-daemon
[44998.376211] [  489]     0   489    48770        0      37      126             0 gssproxy
[44998.377005] [  492]   998   492    30087        0      29      124             0 chronyd
[44998.377795] [  499]     0   499   137506        0      87     1128             0 NetworkManager
[44998.378649] [  500]   999   500   134608        0      61     1403             0 polkitd
[44998.379433] [  501]     0   501     6594        1      17       77             0 systemd-logind
[44998.380252] [  546]     0   546    26849        1      53      498             0 dhclient
[44998.381077] [  900]     0   900    28203        1      60      257         -1000 sshd
[44998.381846] [  902]     0   902   143453        0      96     3303             0 tuned
[44998.382623] [  909]     0   909    74575        0      75      908             0 rsyslogd
[44998.383418] [  913]    29   913    10605        0      25      209             0 rpc.statd
[44998.384215] [  914]     0   914     6791        1      17       63             0 xinetd
[44998.385003] [  918]   997   918    56469        0      21      274             0 munged
[44998.385798] [  983]     0   983     6476        0      19       52             0 atd
[44998.386544] [  985]     0   985    31570        0      21      155             0 crond
[44998.387319] [  993]     0   993   167982        0      69      547             0 automount
[44998.388108] [  997]     0   997    27522        1       9       32             0 agetty
[44998.388892] [ 1001]     0  1001    27522        1      12       32             0 agetty
[44998.389690] [ 1320]     0  1320    22408        0      42      259             0 master
[44998.390467] [ 1346]    89  1346    22451        0      45      254             0 qmgr
[44998.391223] [10968]     0 10968    39169        0      77      365             0 sshd
[44998.391985] [10970]     0 10970    28296        1      13       58             0 run_test.sh
[44998.392810] [11240]     0 11240    29470        1      16      733             0 bash
[44998.393561] [18492]    89 18492    22434        0      46      252             0 pickup
[44998.394334] [ 1201]     0  1201    29470        0      13      733             0 bash
[44998.395082] [ 1202]     0  1202    26988        0       9       27             0 tee
[44998.395840] [ 1396]     0  1396    29538        1      12      784             0 bash
[44998.396621] [23895]     0 23895    29538        1      12      806             0 bash
[44998.397382] [23896]     0 23896    26988        0      10       28             0 tee
[44998.398123] [24348]     0 24348    29538        0      12      806             0 bash
[44998.398885] [24349]     0 24349    29538        0      12      806             0 bash
[44998.399652] [24854]     0 24854    24023        0      21       87             0 pdsh
[44998.400403] [24855]     0 24855    29228        0      13       42             0 sed
[44998.401140] [27114]     0 27114    29438        1      14      675             0 rundbench
[44998.401941] [27125]     0 27125     1618        0       9       44             0 dbench
[44998.402713] [27126]     0 27126     1620        1       9       56             0 dbench
[44998.403486] [28252]     0 28252    30920        0      17       95             0 tar
[44998.404231] [28253]     0 28253    30852        0      16       77             0 tar
[44998.404986] [28255]     0 28255    40840        6      37      209             0 crond
[44998.405759] Kernel panic - not syncing: Out of memory: system-wide panic_on_oom is enabled

[44998.406685] CPU: 0 PID: 28253 Comm: tar Kdump: loaded Tainted: G        W  OE  ------------   3.10.0-862.9.1.el7.x86_64 #1
[44998.407727] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[44998.408277] Call Trace:
[44998.408522]  [<ffffffffabd0e84e>] dump_stack+0x19/0x1b
[44998.409021]  [<ffffffffabd08b50>] panic+0xe8/0x21f
[44998.409491]  [<ffffffffab798295>] check_panic_on_oom+0x55/0x60
[44998.410058]  [<ffffffffab79868b>] out_of_memory+0x23b/0x4f0
[44998.410588]  [<ffffffffab79f224>] __alloc_pages_nodemask+0xaa4/0xbb0
[44998.411198]  [<ffffffffab7ec525>] alloc_pages_vma+0xb5/0x200
[44998.411747]  [<ffffffffab7dae45>] __read_swap_cache_async+0x115/0x190
[44998.412363]  [<ffffffffab7daee6>] read_swap_cache_async+0x26/0x60
[44998.412951]  [<ffffffffab7dafc8>] swapin_readahead+0xa8/0x110
[44998.413498]  [<ffffffffab7c5f37>] handle_pte_fault+0x777/0xc30
[44998.414062]  [<ffffffffab7c7c3d>] handle_mm_fault+0x39d/0x9b0
[44998.414610]  [<ffffffffabd1b557>] __do_page_fault+0x197/0x4f0
[44998.415172]  [<ffffffffabd1b996>] trace_do_page_fault+0x56/0x150
[44998.415747]  [<ffffffffabd1af22>] do_async_page_fault+0x22/0xf0
[44998.416313]  [<ffffffffabd17788>] async_page_fault+0x28/0x30
[44998.416860]  [<ffffffffab959730>] ? copy_user_generic_string+0x30/0x40
[44998.417479]  [<ffffffffab82142f>] ? cp_new_stat+0x14f/0x180
[44998.418024]  [<ffffffffab8215b4>] SYSC_newfstat+0x34/0x60
[44998.418542]  [<ffffffffab82179e>] SyS_newfstat+0xe/0x10
[44998.419056]  [<ffffffffabd20795>] system_call_fastpath+0x1c/0x21
[44998.419651]  [<ffffffffabd206e1>] ? system_call_after_swapgs+0xae/0x146
Comment by Gerrit Updater [ 22/Aug/18 ]

James Nunez (jnunez@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/33052
Subject: LU-7372 tests: stop running replay-dual test 26
Project: fs/lustre-release
Branch: b2_10
Current Patch Set: 1
Commit: debba6d3ae60a2448c5a59f46b751b605c2ee69c

Comment by Gerrit Updater [ 11/Sep/18 ]

John L. Hammond (jhammond@whamcloud.com) merged in patch https://review.whamcloud.com/33052/
Subject: LU-7372 tests: stop running replay-dual test 26
Project: fs/lustre-release
Branch: b2_10
Current Patch Set:
Commit: c5427cbab935259c54957b2ff50e7736f240cd08

Comment by Gerrit Updater [ 11/Jun/21 ]

Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/43977
Subject: LU-7372 tests: skip replay-dual test_24/25
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: 1398315916128263f37a5b53b0d1a9286c5b3574

Comment by Gerrit Updater [ 11/Jun/21 ]

Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/43978
Subject: LU-7372 tests: re-enable replay-dual test_26
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: 8934a96b91ce014a7fe73689fd2d293f436cd716

Comment by Gerrit Updater [ 11/Jun/21 ]

Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/43979
Subject: LU-7372 tests: re-enable replay-dual test_25
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: b188061abcc6e73ea52e99b18797bd74e01e6d75

Comment by Gerrit Updater [ 11/Jun/21 ]

Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/43980
Subject: LU-7372 tests: re-enable replay-dual test_24
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: 87211c8150c48a1a7876ac52cd2e30b34814eaa3

Comment by Gerrit Updater [ 11/Jun/21 ]

Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/43982
Subject: LU-7372 tests: re-enable replay-dual test_26
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 0f509199a25db416759c3bbcce85c6b79d623585

Comment by Gerrit Updater [ 11/Jun/21 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/43977/
Subject: LU-7372 tests: skip replay-dual test_24/25
Project: fs/lustre-release
Branch: b2_12
Current Patch Set:
Commit: 13e11cf70cc8102d006a681276094517c22e4a47

Comment by Gerrit Updater [ 27/Jun/21 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/43978/
Subject: LU-7372 tests: re-enable replay-dual test_26
Project: fs/lustre-release
Branch: b2_12
Current Patch Set:
Commit: 8cd0ad8c3f7755a9ff41da297a5130a6857fae5c

Comment by Gerrit Updater [ 23/Dec/21 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/43982/
Subject: LU-7372 tests: re-enable replay-dual test_26
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 2da8f7adbe4a0c3eeecf8fda44fb6a4e4f9a16dd
