[LU-11943] many input/output errors after soak has been running for a couple of hours Created: 07/Feb/19  Updated: 08/Apr/19  Resolved: 08/Mar/19

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.10.7
Fix Version/s: Lustre 2.13.0, Lustre 2.10.7, Lustre 2.12.1

Type: Bug Priority: Minor
Reporter: Sarah Liu Assignee: Patrick Farrell (Inactive)
Resolution: Fixed Votes: 0
Labels: soak
Environment:

Lustre version=2.10.6_35_g2bc39a1 lustre-b2_10-ib build #98 EL7.6 mlnx DNE; soak has PFL enabled in this test


Attachments: File 240435-mdtestfpp.out     File console-soak-10.log-20190224.gz     File console-soak-23.log-20190224.gz     File soak-10.log-console     File sys-soak-10.log-20190224.gz     File sys-soak-23.log-20190224.gz    
Issue Links:
Related
is related to LU-11358 racer test 1 hangs in locking with DNE Open
is related to LU-10527 LustreError: 7830:0:(llog_cat.c:313:l... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

For the first 4-5 hours of the run, soak appeared to be working well, but after that applications started failing frequently with errors like the following:

mdtest-1.9.3 was launched with 13 total task(s) on 6 node(s)
Command line used: /mnt/soaked/bin/mdtest -d /mnt/soaked/soaktest/test/mdtestfpp/233652 -i 5 -n 158 -u
Path: /mnt/soaked/soaktest/test/mdtestfpp
FS: 83.4 TiB   Used FS: 1.6%   Inodes: 82.1 Mi   Used Inodes: 0.1%
02/07/2019 18:13:12: Process 4(soak-27.spirit.whamcloud.com): FAILED in mdtest_stat, unable to stat file: Cannot send after transport endpoint shutdown
02/07/2019 18:13:29: Process 5(soak-28.spirit.whamcloud.com): FAILED in mdtest_stat, unable to stat file: Cannot send after transport endpoint shutdown
slurmstepd: error: *** STEP 233652.0 ON soak-18 CANCELLED AT 2019-02-07T18:12:14 ***
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: soak-28: task 5: Exited with exit code 1
srun: Terminating job step 233652.0
task 5 writing /mnt/soaked/soaktest/test/iorssf/233650/ssf
WARNING: Task 5 requested transfer of 28199936 bytes,
         but transferred 2093056 bytes at offset 17483960320
WARNING: This file system requires support of partial write()s, in aiori-POSIX.c (line 272).
WARNING: Requested xfer of 28199936 bytes, but xferred 2093056 bytes
Only transferred 2093056 of 28199936 bytes
** error **
ERROR in aiori-POSIX.c (line 256): transfer failed.
ERROR: Input/output error
** exiting **
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: soak-27: task 5: Exited with exit code 255
srun: Terminating job step 233650.0

Around the time the applications failed, the OSS side showed many lock callback timeout errors:

Feb  7 01:02:00 soak-4 systemd-logind: New session 30 of user root.
Feb  7 01:02:00 soak-4 systemd: Started Session 30 of user root.
Feb  7 01:02:00 soak-4 sshd[97515]: pam_unix(sshd:session): session opened for user root by (uid=0)
Feb  7 01:02:00 soak-4 sshd[97515]: pam_unix(sshd:session): session closed for user root
Feb  7 01:02:00 soak-4 systemd-logind: Removed session 30.
Feb  7 01:02:00 soak-4 systemd: Removed slice User Slice of root.
Feb  7 01:02:09 soak-4 sshd[97553]: error: Could not load host key: /etc/ssh/ssh_host_dsa_key
Feb  7 01:02:09 soak-4 sshd[97553]: Accepted publickey for root from 10.10.1.116 port 43696 ssh2: RSA SHA256:VGwjPuk53LIsLKjhGizbClh9X4HNRiAOs+XaQdKAWxM
Feb  7 01:02:09 soak-4 systemd: Created slice User Slice of root.
Feb  7 01:02:09 soak-4 systemd-logind: New session 31 of user root.
Feb  7 01:02:09 soak-4 systemd: Started Session 31 of user root.
Feb  7 01:02:09 soak-4 sshd[97553]: pam_unix(sshd:session): session opened for user root by (uid=0)
Feb  7 01:02:10 soak-4 sshd[97553]: Received disconnect from 10.10.1.116 port 43696:11: disconnected by user
Feb  7 01:02:10 soak-4 sshd[97553]: Disconnected from 10.10.1.116 port 43696
Feb  7 01:02:10 soak-4 sshd[97553]: pam_unix(sshd:session): session closed for user root
Feb  7 01:02:10 soak-4 systemd-logind: Removed session 31.
Feb  7 01:02:10 soak-4 systemd: Removed slice User Slice of root.
Feb  7 01:03:03 soak-4 kernel: LustreError: 0:0:(ldlm_lockd.c:334:waiting_locks_callback()) ### lock callback timer expired after 151s: evicting client at 192.168.1.128@o2ib  ns: filter-soaked-OST0004_UUID lock: ffff89923febd800/0x2a3d985a1228d475 lrc: 3/0,0 mode: PW/PW res: [0x400000402:0x4cf183:0x0].0x0 rrc: 17 type: EXT [0->18446744073709551615] (req 0->4095) flags: 0x60000000020020 nid: 192.168.1.128@o2ib remote: 0x5c14a0ad2708d24d expref: 6 pid: 27943 timeout: 4295139340 lvb_type: 0
Feb  7 01:03:19 soak-4 kernel: Lustre: soaked-OST0004: Connection restored to ffd42784-63d2-739c-d487-9f95d4a84e57 (at 192.168.1.128@o2ib)
Feb  7 01:03:19 soak-4 kernel: Lustre: Skipped 2 previous similar messages
Feb  7 01:03:44 soak-4 sshd[105089]: error: Could not load host key: /etc/ssh/ssh_host_dsa_key
Feb  7 01:03:44 soak-4 sshd[105089]: Accepted publickey for root from 10.10.1.116 port 43960 ssh2: RSA SHA256:VGwjPuk53LIsLKjhGizbClh9X4HNRiAOs+XaQdKAWxM

Feb  7 01:08:06 soak-4 kernel: LustreError: 0:0:(ldlm_lockd.c:334:waiting_locks_callback()) ### lock callback timer expired after 152s: evicting client at 192.168.1.127@o2ib  ns: filter-soaked-OST0004_UUID lock: ffff89964ba94200/0x2a3d985a1228d47c lrc: 3/0,0 mode: PW/PW res: [0x400000402:0x4cf183:0x0].0x0 rrc: 14 type: EXT [0->18446744073709551615] (req 0->4095) flags: 0x60000000020020 nid: 192.168.1.127@o2ib remote: 0xd8fc8581f4c8de49 expref: 6 pid: 74474 timeout: 4295441140 lvb_type: 0
Feb  7 01:08:18 soak-4 kernel: Lustre: soaked-OST0004: Connection restored to 44cc5f6f-1cd3-3d43-f778-9409d7aeb164 (at 192.168.1.127@o2ib)
Feb  7 01:08:46 soak-4 kernel: Lustre: 26413:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1549501726/real 1549501726]  req@ffff89968a49b000 x1624769706421056/t0(0) o38->soaked-MDT0001-lwp-OST0008@192.168.1.109@o2ib:12/10 lens 520/544 e 0 to 1 dl 1549501742 ref 1 fl Rpc:eXN/0/ffffffff rc 0/-1
Feb  7 01:08:46 soak-4 kernel: Lustre: 26413:0:(client.c:2114:ptlrpc_expire_one_request()) Skipped 3 previous similar messages
Feb  7 01:09:36 soak-4 kernel: Lustre: 26413:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1549501776/real 1549501776]  req@ffff89968a498f00 x1624769706421792/t0(0) o38->soaked-MDT0001-lwp-OST000c@192.168.1.109@o2ib:12/10 lens 520/544 e 0 to 1 dl 1549501797 ref 1 fl Rpc:eXN/0/ffffffff rc 0/-1
Feb  7 01:09:36 soak-4 kernel: Lustre: 26413:0:(client.c:2114:ptlrpc_expire_one_request()) Skipped 3 previous similar messages
Feb  7 01:09:43 soak-4 kernel: Lustre: soaked-OST0004: haven't heard from client acbada78-7ef5-2938-ae9f-1535dd1eb89f (at 192.168.1.121@o2ib) in 229 seconds. I think it's dead, and I am evicting it. exp ffff89923fa98c00, cur 1549501783 expire 1549501633 last 1549501554
Feb  7 01:10:01 soak-4 systemd: Created slice User Slice of root.
Feb  7 01:10:01 soak-4 systemd: Started Session 44 of user root.
Feb  7 01:10:01 soak-4 CROND[187082]: (root) CMD (/usr/lib64/sa/sa1 1 1)

Later, the following errors appeared on the MDS side:

Feb  7 09:15:03 soak-8 kernel: LNet: Service thread pid 13609 was inactive for 200.68s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Feb  7 09:15:03 soak-8 kernel: Pid: 13609, comm: mdt00_013 3.10.0-957.el7_lustre.x86_64 #1 SMP Mon Jan 7 20:06:41 UTC 2019
Feb  7 09:15:03 soak-8 kernel: Call Trace:
Feb  7 09:15:03 soak-8 kernel: [<ffffffffc0f9ba11>] ldlm_completion_ast+0x5b1/0x920 [ptlrpc]
Feb  7 09:15:03 soak-8 kernel: [<ffffffffc0f9dabb>] ldlm_cli_enqueue_fini+0x93b/0xdc0 [ptlrpc]
Feb  7 09:15:03 soak-8 kernel: [<ffffffffc0fa0672>] ldlm_cli_enqueue+0x6c2/0x810 [ptlrpc]
Feb  7 09:15:03 soak-8 kernel: [<ffffffffc1640622>] osp_md_object_lock+0x172/0x2e0 [osp]
Feb  7 09:15:03 soak-8 kernel: [<ffffffffc1574ea3>] lod_object_lock+0xf3/0x950 [lod]
Feb  7 09:15:03 soak-8 kernel: [<ffffffffc15e448e>] mdd_object_lock+0x3e/0xe0 [mdd]
Feb  7 09:15:03 soak-8 kernel: [<ffffffffc1487fc5>] mdt_remote_object_lock+0x1e5/0x710 [mdt]
Feb  7 09:15:03 soak-8 kernel: [<ffffffffc1489516>] mdt_object_lock_internal+0x166/0x300 [mdt]
Feb  7 09:15:03 soak-8 kernel: [<ffffffffc14896f0>] mdt_reint_object_lock+0x20/0x60 [mdt]
Feb  7 09:15:03 soak-8 kernel: [<ffffffffc14a05f4>] mdt_reint_link+0x7e4/0xc30 [mdt]
Feb  7 09:15:03 soak-8 kernel: [<ffffffffc14a2c73>] mdt_reint_rec+0x83/0x210 [mdt]
Feb  7 09:15:03 soak-8 kernel: [<ffffffffc148418b>] mdt_reint_internal+0x5fb/0x9c0 [mdt]
Feb  7 09:15:03 soak-8 kernel: [<ffffffffc148ff77>] mdt_reint+0x67/0x140 [mdt]
Feb  7 09:15:03 soak-8 kernel: [<ffffffffc10342aa>] tgt_request_handle+0x92a/0x1370 [ptlrpc]
Feb  7 09:15:03 soak-8 kernel: [<ffffffffc0fdcd5b>] ptlrpc_server_handle_request+0x23b/0xaa0 [ptlrpc]
Feb  7 09:15:03 soak-8 kernel: [<ffffffffc0fe04a2>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
Feb  7 09:15:03 soak-8 kernel: [<ffffffff884c1c31>] kthread+0xd1/0xe0
Feb  7 09:15:03 soak-8 kernel: [<ffffffff88b74c37>] ret_from_fork_nospec_end+0x0/0x39
Feb  7 09:15:03 soak-8 kernel: [<ffffffffffffffff>] 0xffffffffffffffff
Feb  7 09:15:03 soak-8 kernel: LustreError: dumping log to /tmp/lustre-log.1549530903.13609
Feb  7 09:15:05 soak-8 kernel: LNet: Service thread pid 13582 was inactive for 202.07s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Feb  7 09:15:05 soak-8 kernel: Pid: 13582, comm: mdt01_007 3.10.0-957.el7_lustre.x86_64 #1 SMP Mon Jan 7 20:06:41 UTC 2019
Feb  7 09:15:05 soak-8 kernel: Call Trace:
Feb  7 09:15:05 soak-8 kernel: [<ffffffffc0f9ba11>] ldlm_completion_ast+0x5b1/0x920 [ptlrpc]
Feb  7 09:15:05 soak-8 kernel: [<ffffffffc0f9dabb>] ldlm_cli_enqueue_fini+0x93b/0xdc0 [ptlrpc]
Feb  7 09:15:05 soak-8 kernel: [<ffffffffc0fa0672>] ldlm_cli_enqueue+0x6c2/0x810 [ptlrpc]
Feb  7 09:15:05 soak-8 kernel: [<ffffffffc1640622>] osp_md_object_lock+0x172/0x2e0 [osp]
Feb  7 09:15:05 soak-8 kernel: [<ffffffffc1574ea3>] lod_object_lock+0xf3/0x950 [lod]
Feb  7 09:15:05 soak-8 kernel: [<ffffffffc15e448e>] mdd_object_lock+0x3e/0xe0 [mdd]
Feb  7 09:15:05 soak-8 kernel: [<ffffffffc1487fc5>] mdt_remote_object_lock+0x1e5/0x710 [mdt]
Feb  7 09:15:05 soak-8 kernel: [<ffffffffc1489516>] mdt_object_lock_internal+0x166/0x300 [mdt]
Feb  7 09:15:05 soak-8 kernel: [<ffffffffc14896f0>] mdt_reint_object_lock+0x20/0x60 [mdt]
Feb  7 09:15:05 soak-8 kernel: [<ffffffffc14a05f4>] mdt_reint_link+0x7e4/0xc30 [mdt]
Feb  7 09:15:05 soak-8 kernel: [<ffffffffc14a2c73>] mdt_reint_rec+0x83/0x210 [mdt]
Feb  7 09:15:05 soak-8 kernel: [<ffffffffc148418b>] mdt_reint_internal+0x5fb/0x9c0 [mdt]
Feb  7 09:15:05 soak-8 kernel: [<ffffffffc148ff77>] mdt_reint+0x67/0x140 [mdt]
Feb  7 09:15:05 soak-8 kernel: [<ffffffffc10342aa>] tgt_request_handle+0x92a/0x1370 [ptlrpc]
Feb  7 09:15:05 soak-8 kernel: [<ffffffffc0fdcd5b>] ptlrpc_server_handle_request+0x23b/0xaa0 [ptlrpc]
Feb  7 09:15:05 soak-8 kernel: [<ffffffffc0fe04a2>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
Feb  7 09:15:05 soak-8 kernel: [<ffffffff884c1c31>] kthread+0xd1/0xe0
Feb  7 09:15:05 soak-8 kernel: [<ffffffff88b74c37>] ret_from_fork_nospec_end+0x0/0x39
Feb  7 09:15:05 soak-8 kernel: [<ffffffffffffffff>] 0xffffffffffffffff
Feb  7 09:15:05 soak-8 kernel: Pid: 18685, comm: mdt00_026 3.10.0-957.el7_lustre.x86_64 #1 SMP Mon Jan 7 20:06:41 UTC 2019
Feb  7 09:15:05 soak-8 kernel: Call Trace:
Feb  7 09:15:05 soak-8 kernel: [<ffffffffc0f9ba11>] ldlm_completion_ast+0x5b1/0x920 [ptlrpc]
Feb  7 09:15:05 soak-8 kernel: [<ffffffffc0f9dabb>] ldlm_cli_enqueue_fini+0x93b/0xdc0 [ptlrpc]
Feb  7 09:15:05 soak-8 kernel: [<ffffffffc0fa0672>] ldlm_cli_enqueue+0x6c2/0x810 [ptlrpc]
Feb  7 09:15:05 soak-8 kernel: [<ffffffffc1640622>] osp_md_object_lock+0x172/0x2e0 [osp]
Feb  7 09:15:05 soak-8 kernel: [<ffffffffc1574ea3>] lod_object_lock+0xf3/0x950 [lod]
Feb  7 09:15:05 soak-8 kernel: [<ffffffffc15e448e>] mdd_object_lock+0x3e/0xe0 [mdd]
Feb  7 09:15:05 soak-8 kernel: [<ffffffffc1487fc5>] mdt_remote_object_lock+0x1e5/0x710 [mdt]
Feb  7 09:15:05 soak-8 kernel: [<ffffffffc1489516>] mdt_object_lock_internal+0x166/0x300 [mdt]
Feb  7 09:15:05 soak-8 kernel: [<ffffffffc14896f0>] mdt_reint_object_lock+0x20/0x60 [mdt]
Feb  7 09:15:05 soak-8 kernel: [<ffffffffc14a05f4>] mdt_reint_link+0x7e4/0xc30 [mdt]
Feb  7 09:15:05 soak-8 kernel: [<ffffffffc14a2c73>] mdt_reint_rec+0x83/0x210 [mdt]
Feb  7 09:15:05 soak-8 kernel: [<ffffffffc148418b>] mdt_reint_internal+0x5fb/0x9c0 [mdt]
Feb  7 09:15:05 soak-8 kernel: [<ffffffffc148ff77>] mdt_reint+0x67/0x140 [mdt]
Feb  7 09:15:05 soak-8 kernel: [<ffffffffc10342aa>] tgt_request_handle+0x92a/0x1370 [ptlrpc]
Feb  7 09:15:05 soak-8 kernel: [<ffffffffc0fdcd5b>] ptlrpc_server_handle_request+0x23b/0xaa0 [ptlrpc]
Feb  7 09:15:05 soak-8 kernel: [<ffffffffc0fe04a2>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
Feb  7 09:15:05 soak-8 kernel: [<ffffffff884c1c31>] kthread+0xd1/0xe0
Feb  7 09:15:05 soak-8 kernel: [<ffffffff88b74c37>] ret_from_fork_nospec_end+0x0/0x39
Feb  7 09:15:05 soak-8 kernel: [<ffffffffffffffff>] 0xffffffffffffffff
Feb  7 09:15:14 soak-8 sshd[78874]: error: Could not load host key: /etc/ssh/ssh_host_dsa_key
Feb  7 09:15:14 soak-8 sshd[78874]: Accepted publickey for root from 10.10.1.116 port 59708 ssh2: RSA SHA256:VGwjPuk53LIsLKjhGizbClh9X4HNRiAOs+XaQdKAWxM
Feb  7 09:16:40 soak-8 systemd: Removed slice User Slice of root.

Feb  7 09:16:43 soak-8 kernel: LustreError: 13609:0:(ldlm_request.c:148:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1549530702, 300s ago), entering recovery for soaked-MDT0001_UUID@192.168.1.108@o2ib ns: soaked-MDT0001-osp-MDT0000 lock: ffff92322f761800/0x7ebebf899932ee34 lrc: 4/0,1 mode: --/EX res: [0x24000cf43:0x16f41:0x0].0x0 bits 0x2 rrc: 4 type: IBT flags: 0x1000001000000 nid: local remote: 0xc7baaf97e1b53a16 expref: -99 pid: 13609 timeout: 0 lvb_type: 0
Feb  7 09:16:43 soak-8 kernel: LustreError: 13609:0:(ldlm_request.c:148:ldlm_expired_completion_wait()) Skipped 2 previous similar messages
Feb  7 09:16:55 soak-8 sshd[79087]: error: Could not load host key: /etc/ssh/ssh_host_dsa_key
Feb  7 09:16:56 soak-8 sshd[79087]: Accepted publickey for root from 10.10.1.116 port 59970 ssh2: RSA SHA256:VGwjPuk53LIsLKjhGizbClh9X4HNRiAOs+XaQdKAWxM


 Comments   
Comment by Peter Jones [ 11/Feb/19 ]

Mike believes that this issue is a duplicate of LU-11358

Comment by Sarah Liu [ 15/Feb/19 ]

Here is an update on trying to isolate the issue

On the b2_10-ib branch, with the system configured with PFL and striped directories and a 7200s fault injection interval, running overnight, the fail/pass counts were:
1. build https://build.whamcloud.com/job/lustre-b2_10-ib/92/ — 37 fail / 301 pass
2. build https://build.whamcloud.com/job/lustre-b2_10-ib/95/ — 65 fail / 563 pass
3. build https://build.whamcloud.com/job/lustre-b2_10-ib/98/ — 58 fail / 593 pass

Then I decreased the fault injection interval from 7200s to 3600s and reran build #98; the result was 202 fail / 636 pass, roughly a three- to fourfold increase in the failure rate. A quick rate comparison is sketched below.
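A minimal sketch of that rate comparison, assuming the N/M figures above are fail/pass job counts:

runs = {
    "build #92, 7200s": (37, 301),
    "build #95, 7200s": (65, 563),
    "build #98, 7200s": (58, 593),
    "build #98, 3600s": (202, 636),
}

# Back-of-the-envelope failure-rate comparison for the runs listed above.
# Assumption: each "N/M" figure is "N failed jobs / M passed jobs".
for label, (fail, passed) in runs.items():
    rate = fail / (fail + passed)  # fraction of submitted jobs that failed
    print(f"{label}: {fail}/{fail + passed} jobs failed = {rate:.1%}")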

Clients were evicted:

Feb 14 22:44:16 soak-25 systemd: Started Session 3452 of user root.
Feb 14 22:44:16 soak-25 sshd[28919]: pam_unix(sshd:session): session opened for user root by (uid=0)
Feb 14 22:45:03 soak-25 kernel: LustreError: 167-0: soaked-MDT0000-mdc-ffff9474f3b93800: This client was evicted by soaked-MDT0000; in progress operations using this service will fail.
Feb 14 22:45:03 soak-25 kernel: LustreError: 28948:0:(lmv_obd.c:1387:lmv_statfs()) can't stat MDS #0 (soaked-MDT0000-mdc-ffff9474f3b93800), error -5
Feb 14 22:45:03 soak-25 kernel: Lustre: soaked-MDT0000-mdc-ffff9474f3b93800: Connection restored to 192.168.1.108@o2ib (at 192.168.1.108@o2ib)
Feb 14 22:45:03 soak-25 kernel: LustreError: 28948:0:(lmv_obd.c:1387:lmv_statfs()) Skipped 1 previous similar message
Feb 14 22:45:03 soak-25 kernel: LustreError: 28948:0:(llite_lib.c:1759:ll_statfs_internal()) md_statfs fails: rc = -5
Feb 14 22:45:03 soak-25 kernel: LustreError: 28948:0:(llite_lib.c:1759:ll_statfs_internal()) Skipped 3 previous similar messages
Feb 14 22:45:03 soak-25 sshd[28919]: Received disconnect from 10.10.1.116 port 37138:11: disconnected by user
Feb 14 22:45:03 soak-25 sshd[28919]: Disconnected from 10.10.1.116 port 37138
Feb 14 22:45:03 soak-25 sshd[28919]: pam_unix(sshd:session): session closed for user root
Feb 14 22:45:03 soak-25 systemd-logind: Removed session 3452.
Feb 14 22:45:03 soak-25 systemd: Removed slice User Slice of root.

Checking MDS 0 around the same time, recovery had problems:

Feb 14 22:43:54 soak-8 kernel: LNet: HW NUMA nodes: 2, HW CPU cores: 32, npartitions: 2
Feb 14 22:43:54 soak-8 kernel: alg: No test for adler32 (adler32-zlib)
Feb 14 22:43:55 soak-8 kernel: Lustre: Lustre: Build Version: 2.10.6_35_g2bc39a1
Feb 14 22:43:55 soak-8 kernel: LNet: Added LNI 192.168.1.108@o2ib [8/256/0/180]
Feb 14 22:43:56 soak-8 kernel: LDISKFS-fs (dm-2): mounted filesystem with ordered data mode. Opts: user_xattr,errors=remount-ro,user_xattr,no_mbcache,nodelalloc
Feb 14 22:43:57 soak-8 kernel: Lustre: MGS: Connection restored to 1c77e4d7-3455-4f6a-9bfb-8452b7d6cd45 (at 0@lo)
Feb 14 22:43:58 soak-8 kernel: Lustre: soaked-MDT0000: Imperative Recovery not enabled, recovery window 300-900
Feb 14 22:43:58 soak-8 sshd[12292]: pam_unix(sshd:session): session closed for user root
Feb 14 22:43:58 soak-8 kernel: LustreError: 12581:0:(llog_osd.c:949:llog_osd_next_block()) soaked-MDT0003-osp-MDT0000: invalid llog tail at log id [0x3:0xc001c16b:0x2]:0 offset 53706752 last_rec idx 1515870810 tail idx 0lrt len 0 read_size 32768
Feb 14 22:43:58 soak-8 kernel: LustreError: 12581:0:(lod_dev.c:419:lod_sub_recovery_thread()) soaked-MDT0003-osp-MDT0000 getting update log failed: rc = -22
Feb 14 22:43:58 soak-8 systemd-logind: Removed session 3.
Feb 14 22:43:58 soak-8 systemd: Removed slice User Slice of root.
Feb 14 22:44:01 soak-8 kernel: Lustre: soaked-MDT0000: Connection restored to 5b845528-9dc9-9909-283c-824e2a4fa986 (at 192.168.1.129@o2ib)
Feb 14 22:44:01 soak-8 kernel: Lustre: 12582:0:(ldlm_lib.c:2061:target_recovery_overseer()) recovery is aborted, evict exports in recovery
Feb 14 22:44:01 soak-8 kernel: Lustre: soaked-MDT0000: disconnecting 28 stale clients
Feb 14 22:44:01 soak-8 kernel: LustreError: 12582:0:(ldlm_lib.c:1642:abort_lock_replay_queue()) @@@ aborted:  req@ffff9b251af28000 x1625382729642208/t0(0) o101->5b845528-9dc9-9909-283c-824e2a4fa986@192.168.1.129@o2ib:163/0 lens 328/0 e 0 to 0 dl 1550184283 ref 1 fl Complete:/40/ffffffff rc 0/-1
Feb 14 22:44:03 soak-8 kernel: Lustre: soaked-MDT0000: Connection restored to 192.168.1.111@o2ib (at 192.168.1.111@o2ib)
Feb 14 22:44:03 soak-8 kernel: Lustre: Skipped 3 previous similar messages
Feb 14 22:44:05 soak-8 kernel: Lustre: MGS: Connection restored to f14e792b-93e7-99b9-85bf-e51e2e75f9ff (at 192.168.1.144@o2ib)
Feb 14 22:44:05 soak-8 kernel: Lustre: Skipped 2 previous similar messages

MDT0000 dmesg around the same time:

[2019-02-14T22:42:11+00:00] INFO: Report handlers complete
[  181.040623] LNet: HW NUMA nodes: 2, HW CPU cores: 32, npartitions: 2
[  181.051701] alg: No test for adler32 (adler32-zlib)
[  181.895037] Lustre: Lustre: Build Version: 2.10.6_35_g2bc39a1
[  182.141271] LNet: Added LNI 192.168.1.108@o2ib [8/256/0/180]
[  182.670505] LDISKFS-fs (dm-2): mounted filesystem with ordered data mode. Opts: user_xattr,errors=remount-ro,user_xattr,no_mbcache,nodelalloc
[  184.313895] Lustre: MGS: Connection restored to 1c77e4d7-3455-4f6a-9bfb-8452b7d6cd45 (at 0@lo)
[  184.846659] Lustre: soaked-MDT0000: Imperative Recovery not enabled, recovery window 300-900
[  184.880565] LustreError: 12581:0:(llog_osd.c:949:llog_osd_next_block()) soaked-MDT0003-osp-MDT0000: invalid llog tail at log id [0x3:0xc001c16b:0x2]:0 offset 53706752 last_rec idx 1515870810 tail idx 0lrt len 0 read_size 32768
[  184.902991] LustreError: 12581:0:(lod_dev.c:419:lod_sub_recovery_thread()) soaked-MDT0003-osp-MDT0000 getting update log failed: rc = -22
[  188.370619] Lustre: soaked-MDT0000: Connection restored to 5b845528-9dc9-9909-283c-824e2a4fa986 (at 192.168.1.129@o2ib)
[  188.464517] Lustre: 12582:0:(ldlm_lib.c:2061:target_recovery_overseer()) recovery is aborted, evict exports in recovery
[  188.476813] Lustre: soaked-MDT0000: disconnecting 28 stale clients
[  188.484807] LustreError: 12582:0:(ldlm_lib.c:1642:abort_lock_replay_queue()) @@@ aborted:  req@ffff9b251af28000 x1625382729642208/t0(0) o101->5b845528-9dc9-9909-283c-824e2a4fa986@192.168.1.129@o2ib:163/0 lens 328/0 e 0 to 0 dl 1550184283 ref 1 fl Complete:/40/ffffffff rc 0/-1
[  189.865022] Lustre: soaked-MDT0000: Connection restored to 192.168.1.111@o2ib (at 192.168.1.111@o2ib)
[  189.875353] Lustre: Skipped 3 previous similar messages
[  192.357184] Lustre: MGS: Connection restored to f14e792b-93e7-99b9-85bf-e51e2e75f9ff (at 192.168.1.144@o2ib)
[  192.368199] Lustre: Skipped 2 previous similar messages
[  197.230719] Lustre: soaked-MDT0000: Connection restored to 13da700d-6c96-7915-6d52-28bb8b744777 (at 192.168.1.119@o2ib)
[  197.242793] Lustre: Skipped 11 previous similar messages
[  205.596871] Lustre: soaked-MDT0000: Connection restored to 192.168.1.105@o2ib (at 192.168.1.105@o2ib)
[  205.607244] Lustre: Skipped 7 previous similar messages
[  222.557556] Lustre: MGS: Connection restored to fbf36631-903d-94c0-e6e4-1a49ce404920 (at 192.168.1.130@o2ib)
[  222.568592] Lustre: Skipped 33 previous similar messages
[  255.598439] Lustre: MGS: Connection restored to 192.168.1.105@o2ib (at 192.168.1.105@o2ib)
[  255.607695] Lustre: Skipped 19 previous similar messages
[ 1194.439447] Lustre: 12595:0:(mdd_device.c:1591:mdd_changelog_clear()) soaked-MDD0000: No entry for user 1
[ 1874.158569] LustreError: 137-5: soaked-MDT0001_UUID: not available for connect from 192.168.1.110@o2ib (no target). If you are running an HA pair check that the target is mounted on the other server.
[ 1884.911880] Lustre: 12424:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1550185938/real 1550185938]  req@ffff9b20e3d30f00 x1625486012545520/t0(0) o400->soaked-MDT0001-osp-MDT0000@192.168.1.109@o2ib:24/4 lens 224/224 e 0 to 1 dl 1550185945 ref 1 fl Rpc:eXN/0/ffffffff rc 0/-1
[ 1884.944412] Lustre: soaked-MDT0001-osp-MDT0000: Connection to soaked-MDT0001 (at 192.168.1.109@o2ib) was lost; in progress operations using this service will wait for recovery to complete
[ 1886.720057] LustreError: 137-5: soaked-MDT0001_UUID: not available for connect from 192.168.1.123@o2ib (no target). If you are running an HA pair check that the target is mounted on the other server.
...
[10521.406121] Lustre: Skipped 2 previous similar messages
[10551.358073] Lustre: 12598:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1550194604/real 1550194604]  req@ffff9b2501e3e000 x1625486098246384/t0(0) o105->soaked-MDT0000@192.168.1.111@o2ib:15/16 lens 304/224 e 0 to 1 dl 1550194611 ref 1 fl Rpc:eX/2/ffffffff rc 0/-1
[10551.389449] Lustre: 12598:0:(client.c:2114:ptlrpc_expire_one_request()) Skipped 95159 previous similar messages
[10615.361454] Lustre: 12598:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1550194668/real 1550194668]  req@ffff9b2501e3e000 x1625486098246384/t0(0) o105->soaked-MDT0000@192.168.1.111@o2ib:15/16 lens 304/224 e 0 to 1 dl 1550194675 ref 1 fl Rpc:eX/2/ffffffff rc 0/-1
[10615.392830] Lustre: 12598:0:(client.c:2114:ptlrpc_expire_one_request()) Skipped 492391 previous similar messages
[10712.839564] LNet: Service thread pid 16282 was inactive for 200.48s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
[10712.858522] Pid: 16282, comm: mdt00_022 3.10.0-957.el7_lustre.x86_64 #1 SMP Mon Jan 7 20:06:41 UTC 2019
[10712.869032] Call Trace:
[10712.871792]  [<ffffffffc0fc9a11>] ldlm_completion_ast+0x5b1/0x920 [ptlrpc]
[10712.879579]  [<ffffffffc0fcab53>] ldlm_cli_enqueue_local+0x233/0x860 [ptlrpc]
[10712.887616]  [<ffffffffc14cca1c>] mdt_reint_rename_or_migrate.isra.39+0x67c/0x860 [mdt]
[10712.896612]  [<ffffffffc14ccc33>] mdt_reint_rename+0x13/0x20 [mdt]
[10712.903547]  [<ffffffffc14d0c73>] mdt_reint_rec+0x83/0x210 [mdt]
[10712.910282]  [<ffffffffc14b218b>] mdt_reint_internal+0x5fb/0x9c0 [mdt]
[10712.917601]  [<ffffffffc14bdf77>] mdt_reint+0x67/0x140 [mdt]
[10712.923952]  [<ffffffffc10622aa>] tgt_request_handle+0x92a/0x1370 [ptlrpc]
[10712.931718]  [<ffffffffc100ad5b>] ptlrpc_server_handle_request+0x23b/0xaa0 [ptlrpc]
[10712.940334]  [<ffffffffc100e4a2>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
[10712.947393]  [<ffffffffbcac1c31>] kthread+0xd1/0xe0
[10712.952870]  [<ffffffffbd174c37>] ret_from_fork_nospec_end+0x0/0x39
[10712.959891]  [<ffffffffffffffff>] 0xffffffffffffffff
[10712.965483] LustreError: dumping log to /tmp/lustre-log.1550194765.16282
[10714.094205] LNet: Service thread pid 16277 was inactive for 201.74s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
[10714.113169] Pid: 16277, comm: mdt01_023 3.10.0-957.el7_lustre.x86_64 #1 SMP Mon Jan 7 20:06:41 UTC 2019
[10714.123670] Call Trace:
[10714.126418]  [<ffffffffc0fc9a11>] ldlm_completion_ast+0x5b1/0x920 [ptlrpc]
[10714.134183]  [<ffffffffc0fcab53>] ldlm_cli_enqueue_local+0x233/0x860 [ptlrpc]
[10714.142219]  [<ffffffffc14cca1c>] mdt_reint_rename_or_migrate.isra.39+0x67c/0x860 [mdt]
[10714.151209]  [<ffffffffc14ccc33>] mdt_reint_rename+0x13/0x20 [mdt]
[10714.158135]  [<ffffffffc14d0c73>] mdt_reint_rec+0x83/0x210 [mdt]
[10714.164869]  [<ffffffffc14b218b>] mdt_reint_internal+0x5fb/0x9c0 [mdt]
[10714.172180]  [<ffffffffc14bdf77>] mdt_reint+0x67/0x140 [mdt]
[10714.178525]  [<ffffffffc10622aa>] tgt_request_handle+0x92a/0x1370 [ptlrpc]
[10714.186283]  [<ffffffffc100ad5b>] ptlrpc_server_handle_request+0x23b/0xaa0 [ptlrpc]
[10714.194892]  [<ffffffffc100e4a2>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
[10714.201941]  [<ffffffffbcac1c31>] kthread+0xd1/0xe0
[10714.207408]  [<ffffffffbd174c37>] ret_from_fork_nospec_end+0x0/0x39
[10714.214416]  [<ffffffffffffffff>] 0xffffffffffffffff
[10717.449725] LNet: 12398:0:(o2iblnd_cb.c:3192:kiblnd_check_conns()) Timed out tx for 192.168.1.111@o2ib: 2 seconds
[10725.781274] Lustre: MGS: haven't heard from client 32a49f15-8d69-4a99-1a3c-7cfee33c5424 (at 192.168.1.111@o2ib) in 227 seconds. I think it's dead, and I am evicting it. exp ffff9b20ebbf5c00, cur 1550194778 expire 1550194628 last 1550194551
[10725.805010] Lustre: Skipped 3 previous similar messages
[10730.449987] LNet: 12398:0:(o2iblnd_cb.c:3192:kiblnd_check_conns()) Timed out tx for 192.168.1.111@o2ib: 15 seconds
[10743.450263] LNet: 12398:0:(o2iblnd_cb.c:3192:kiblnd_check_conns()) Timed out tx for 192.168.1.111@o2ib: 28 seconds
[10747.416333] Lustre: 12409:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1550194774/real 0]  req@ffff9b2427efb000 x1625486098318112/t0(0) o38->soaked-MDT0003-osp-MDT0000@192.168.1.111@o2ib:24/4 lens 520/544 e 0 to 1 dl 1550194800 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
[10747.447516] Lustre: 12409:0:(client.c:2114:ptlrpc_expire_one_request()) Skipped 770929 previous similar messages
[10755.450564] LNet: 12398:0:(o2iblnd_cb.c:3192:kiblnd_check_conns()) Timed out tx for 192.168.1.111@o2ib: 40 seconds
[10768.450863] LNet: 12398:0:(o2iblnd_cb.c:3192:kiblnd_check_conns()) Timed out tx for 192.168.1.111@o2ib: 53 seconds
[10780.451086] LNet: 12398:0:(o2iblnd_cb.c:3192:kiblnd_check_conns()) Timed out tx for 192.168.1.111@o2ib: 10 seconds
[10806.451671] LNet: 12398:0:(o2iblnd_cb.c:3192:kiblnd_check_conns()) Timed out tx for 192.168.1.111@o2ib: 36 seconds
[10806.463238] LNet: 12398:0:(o2iblnd_cb.c:3192:kiblnd_check_conns()) Skipped 2 previous similar messages
[10812.354796] LustreError: 16277:0:(ldlm_request.c:130:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1550194565, 300s ago); not entering recovery in server code, just going back to sleep ns: mdt-soaked-MDT0000_UUID lock: ffff9b2507f3fa00/0xbb9303ff9824fff1 lrc: 3/0,1 mode: --/EX res: [0x200000004:0x1:0x0].0x0 bits 0x2 rrc: 9 type: IBT flags: 0x40210000000000 nid: local remote: 0x0 expref: -99 pid: 16277 timeout: 0 lvb_type: 0
[10812.355047] LustreError: dumping log to /tmp/lustre-log.1550194865.16282
[10812.372441] Lustre: soaked-MDT0000: Received new LWP connection from 192.168.1.109@o2ib, removing former export from same NID
[10812.372444] Lustre: Skipped 1 previous similar message
[10812.372488] Lustre: soaked-MDT0000: Connection restored to 192.168.1.109@o2ib (at 192.168.1.109@o2ib)
[10812.372490] Lustre: Skipped 8 previous similar messages
[10812.440503] LustreError: 16277:0:(ldlm_request.c:130:ldlm_expired_completion_wait()) Skipped 1 previous similar message
[10844.452579] LNet: 12398:0:(o2iblnd_cb.c:3192:kiblnd_check_conns()) Timed out tx for 192.168.1.111@o2ib: 74 seconds
[10844.464157] LNet: 12398:0:(o2iblnd_cb.c:3192:kiblnd_check_conns()) Skipped 2 previous similar messages

I will reload build #92 with 3600s interval and see how it goes

----Changes from #92 to #95 are------
LU-10527 obdclass: don't recycle loghandle upon ENOSPC (detail)
LU-11582 llite: protect reading inode->i_data.nrpages (detail)
LU-11461 scripts: Support symlink target (detail)
LU-9230 ldlm: speed up preparation for list of lock cancel (detail)
LU-7631 tests: wait_osts_up waits for MDS precreates (detail)
LU-11187 ldiskfs: don't mark mmp buffer head dirty (detail)
LU-10055 mdt: use max_mdsize in reply for layout intent (detail)
LU-10818 obdecho: don't set ma_need in echo_attr_get_complex() (detail)
LU-10612 tests: reply_single.sh,test_48: No space left (detail)
LU-11195 lod: Mark comps cached on replay of layout change (detail)
LU-11536 ofd: ofd_create_hdl may return 0 in case of ENOSPC (detail)
LU-10629 lod: Clear OST pool with setstripe (detail)
LU-11068 build: remove invalid kernel srpm location (detail)

----Changes from #95 to #98 are------
LU-8999 quota: fix issue of multiple call of seq start (detail)
LU-11186 ofd: fix for a final oid at sequence (detail)
LU-11658 lov: cl_cache could miss initialize (detail)
LU-11567 utils: llog_reader print changelog index (detail)
LU-10985 mdt: properly handle unknown intent requests (detail)
LU-11056 lwp: fix lwp reconnection issue (detail)
LU-9874 osd-ldiskfs: simplify project transfer codes (detail)
LU-10461 tests: call exit in the skip routine (detail)
LU-9810 lnd: use less CQ entries for each connection (detail)
LU-10213 lnd: calculate qp max_send_wrs properly (detail)
LU-8895 target: limit grant allocation (detail)
LU-7988 hsm: added coordinator housekeeping flag (detail)
LU-10516 misc: require ldiskfsprogs-1.42.13.wc6 or later (detail)
LU-11049 ssk: correctly handle null byte by lgss_sk (detail)
LU-11196 tests: clean up after conf-sanity test_101 (detail)
LU-11736 utils: don't set max_sectors_kb on MDT/MGT (detail)
LU-11757 lod: use calculated stripe count (detail)
LU-11006 lnet: fix show peer yaml tree with no peer (detail)
LU-11737 lfsck: do not ignore dryrun (detail)
LU-10565 osd: unify interface for vfs (detail)
LU-11187 ldiskfs: update rhel7.6 series (detail)
LU-8402 ldlm: assume OBD_CONNECT_IBITS (detail)

Comment by Sarah Liu [ 19/Feb/19 ]

With the same fault injection interval (3600s) and the same build (#98), after clearing the update logs on all MDSes the result changed to 82 fail / 530 pass, so clearing the logs helps (a quick comparison is sketched below). I will try an earlier build to see how it does.
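A quick comparison of the two build #98 runs at the 3600s interval, with and without the update logs cleared, reading the figures as fail/pass counts as above:

# Build #98, 3600s fault interval: job failure rate with the MDS update logs
# left in place vs. cleared (figures taken from the comments above).
for label, fail, passed in [("logs kept", 202, 636),
                            ("logs cleared", 82, 530)]:
    print(f"{label}: {fail / (fail + passed):.1%} of jobs failed")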

Comment by Patrick Farrell (Inactive) [ 20/Feb/19 ]

Just want to update the ticket to clarify a bit.

The current plan (per the call this morning) is to go to the tip of b2_10 and run the tests on that codebase.

  1. What is the approximate ETA on that?
  2. Is it possible to stop testing on failure?  One problem with the existing logs is the volume of information.
Comment by Patrick Farrell (Inactive) [ 20/Feb/19 ]

By the way, what is our fault injection?  What exactly are we doing?  I know failover, but of which components, and how?

Comment by Sarah Liu [ 21/Feb/19 ]

hi Patrick, please see below

What is the approximate ETA on that?

I will work on moving soak to the tip of b2_10 tomorrow and should have some data later tomorrow.

Is it possible to stop testing on failure? One problem with the existing logs is the volume of information.

As far as I know, soak cannot stop on its own when it hits a problem, but I can stop it manually.

what is our fault injection? What exactly are we doing? I know failover, but of which components, and how?

Soak currently injects four kinds of faults: MDS restart, MDS failover, OSS restart, and OSS failover.
For an MDS/OSS restart, it reboots the node (power cycle) and waits for the node to come back and remount its targets.
For an MDS/OSS failover, it reboots the node as above, then fails the targets over to the configured secondary node and fails them back afterwards. A rough sketch of the sequence is shown below.
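For illustration, here is a minimal sketch (in Python, like the soak fsmgmt harness whose log appears later in this ticket) of the mds_failover sequence described above. All helper functions and their behaviour are hypothetical placeholders, not the actual soak implementation:

import logging
import time

logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s:%(name)s:%(levelname)s %(message)s")
log = logging.getLogger("fsmgmt.fsmgmt")


# Hypothetical stand-ins for the real soak helpers; each only logs the step it
# would perform (the real framework power-cycles nodes via powerman and mounts
# the Lustre targets over ssh).
def power_cycle(node):
    log.info("executing cmd pm -h powerman -c %s", node)


def node_is_up(node):
    log.info("trying to connect to %s ...", node)
    return True  # placeholder for an ssh reachability probe


def mount_target(host, target):
    log.info("Mounting %s on %s ...", target, host)


def umount_target(host, target):
    log.info("Unmounting %s on %s ...", target, host)


def wait_for_recovery(host, target):
    log.info("Wait for recovery of %s on %s", target, host)


def mds_failover(primary, secondary, target):
    """Schematic mds_failover fault: power-cycle the primary MDS, fail the MDT
    over to the secondary, then fail it back once recovery completes."""
    log.info("triggering fault mds_failover")
    power_cycle(primary)
    while not node_is_up(primary):  # soak polls until the rebooted node answers
        time.sleep(10)
    log.info("%s is up!!!", primary)

    log.info("Failing over %s ...", target)
    mount_target(secondary, target)
    wait_for_recovery(secondary, target)

    log.info("Failing back %s ...", target)
    umount_target(secondary, target)
    mount_target(primary, target)
    log.info("mds_failover just completed")


mds_failover("soak-10", "soak-11", "soaked-MDT0002")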

Comment by Patrick Farrell (Inactive) [ 22/Feb/19 ]

Thanks Sarah!

So James was saying Soak is up and running today, and we've got some failures.

Could you help me out by giving me an example with this information:

  1. Application information for failing application
    Application, command invocation, nodes, any logs or error messages specifically from the application
  2. Whatever fault injection we did around that time (what it was, which nodes)
    Pointers towards logs from these nodes

I will try to see about getting on the system some time later today.

Comment by Sarah Liu [ 25/Feb/19 ]

Here is an example failure, from 240435-mdtestfpp.out.
The console, syslog, and application logs are attached.
1. Application failure: the soak-* hosts are the test nodes involved. I picked one node, soak-23, to gather the details listed below. The failure occurred around 2/22 16:04:36.
Command used:

Command line used: /mnt/soaked/bin/mdtest -d /mnt/soaked/soaktest/test/mdtestfpp/240435 -i 5 -n 383 -u
/mnt/soaked/soaktest/test/mdtestfpp/240435
stripe_count:  5 stripe_size:   1048576 stripe_offset: -1
lmv_stripe_count: 4 lmv_stripe_offset: 1 lmv_hash_type: fnv_1a_64
mdtidx           FID[seq:oid:ver]
     1           [0x240029440:0xe7f:0x0]                
     2           [0x2800255ce:0xe7f:0x0]                
     3           [0x2c002abdf:0xe7f:0x0]                
     0           [0x2000226f5:0xc29:0x0]                
lmv_stripe_count: 4 lmv_stripe_offset: 1 lmv_hash_type: fnv_1a_64
srun: Warning: can't honor --ntasks-per-node set to 2 which doesn't match the requested tasks 23 with the number of requested nodes 23. Ignoring --ntasks-per-node.
-- started at 02/22/2019 15:13:27 --

mdtest-1.9.3 was launched with 23 total task(s) on 23 node(s)
Command line used: /mnt/soaked/bin/mdtest -d /mnt/soaked/soaktest/test/mdtestfpp/240435 -i 5 -n 383 -u
Path: /mnt/soaked/soaktest/test/mdtestfpp
FS: 83.4 TiB   Used FS: 6.8%   Inodes: 78.6 Mi   Used Inodes: 1.1%
02/22/2019 16:04:58: Process 9(soak-26.spirit.whamcloud.com): FAILED in create_remove_items_helper, unable to create directory: Input/output error
02/22/2019 16:01:53: Process 16(soak-37.spirit.whamcloud.com): FAILED in create_remove_items_helper, unable to create directory: Input/output error
02/22/2019 16:01:36: Process 19(soak-40.spirit.whamcloud.com): FAILED in create_remove_items_helper, unable to create directory: Input/output error
02/22/2019 16:01:48: Process 21(soak-42.spirit.whamcloud.com): FAILED in create_remove_items_helper, unable to create directory: Input/output error
02/22/2019 16:04:46: Process 4(soak-21.spirit.whamcloud.com): FAILED in create_remove_items_helper, unable to create directory: Input/output error
02/22/2019 16:04:34: Process 7(soak-24.spirit.whamcloud.com): FAILED in create_remove_items_helper, unable to create directory: Input/output error
02/22/2019 16:04:42: Process 3(soak-20.spirit.whamcloud.com): FAILED in create_remove_items_helper, unable to create directory: Input/output error
02/22/2019 16:05:13: Process 11(soak-28.spirit.whamcloud.com): FAILED in create_remove_items_helper, unable to create directory: Input/output error
02/22/2019 16:04:36: Process 6(soak-23.spirit.whamcloud.com): FAILED in create_remove_items_helper, unable to create directory: Input/output error
02/22/2019 16:03:40: Process 1(soak-18.spirit.whamcloud.com): FAILED in create_remove_items_helper, unable to create directory: Input/output error
02/22/2019 16:02:02: Process 22(soak-43.spirit.whamcloud.com): FAILED in create_remove_items_helper, unable to create directory: Input/output error
02/22/2019 16:01:50: Process 18(soak-39.spirit.whamcloud.com): FAILED in create_remove_items_helper, unable to create directory: Input/output error
02/22/2019 16:01:48: Process 20(soak-41.spirit.whamcloud.com): FAILED in create_remove_items_helper, unable to create directory: Input/output error
02/22/2019 16:02:36: Process 13(soak-30.spirit.whamcloud.com): FAILED in create_remove_items_helper, unable to create directory: Input/output error
02/22/2019 16:02:51: Process 2(soak-19.spirit.whamcloud.com): FAILED in create_remove_items_helper, unable to create directory: Input/output error
02/22/2019 16:01:24: Process 15(soak-36.spirit.whamcloud.com): FAILED in create_remove_items_helper, unable to create directory: Input/output error

23 tasks, 8809 files/directories
02/22/2019 16:02:48: Process 0(soak-17.spirit.whamcloud.com): FAILED in create_remove_items_helper, unable to create directory: Input/output error
02/22/2019 16:02:15: Process 17(soak-38.spirit.whamcloud.com): FAILED in create_remove_items_helper, unable to create directory: Input/output error
02/22/2019 16:02:21: Process 12(soak-29.spirit.whamcloud.com): FAILED in create_remove_items_helper, unable to create directory: Input/output error
02/22/2019 16:02:10: Process 14(soak-35.spirit.whamcloud.com): FAILED in create_remove_items_helper, unable to create directory: Input/output error
02/22/2019 16:04:51: Process 5(soak-22.spirit.whamcloud.com): FAILED in create_remove_items_helper, unable to create directory: Input/output error
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD 
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
srun: error: soak-26: task 9: Exited with exit code 1
...
srun: error: soak-23: task 6: Exited with exit code 1
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: soak-24: task 7: Exited with exit code 1
srun: error: soak-17: task 0: Exited with exit code 1
srun: error: soak-21: task 4: Exited with exit code 1
srun: error: soak-20: task 3: Exited with exit code 1
srun: error: soak-19: task 2: Exited with exit code 1
srun: error: soak-28: task 11: Exited with exit code 1
srun: error: soak-18: task 1: Exited with exit code 1
srun: error: soak-22: task 5: Exited with exit code 1
srun: error: soak-37: task 16: Exited with exit code 1
srun: error: soak-30: task 13: Exited with exit code 1
srun: error: soak-40: task 19: Exited with exit code 1
srun: error: soak-38: task 17: Exited with exit code 1
srun: error: soak-43: task 22: Exited with exit code 1
srun: error: soak-36: task 15: Exited with exit code 1
srun: error: soak-42: task 21: Exited with exit code 1
srun: error: soak-41: task 20: Exited with exit code 1
srun: error: soak-39: task 18: Exited with exit code 1
srun: error: soak-35: task 14: Exited with exit code 1
srun: error: soak-29: task 12: Exited with exit code 1
srun: error: soak-27: task 10: Killed
srun: error: soak-25: task 8: Killed
rm: cannot remove ‘/mnt/soaked/soaktest/test/mdtestfpp/240435/#test-dir.3/mdtest_tree.15.0’: Directory not empty
rm: cannot remove ‘/mnt/soaked/soaktest/test/mdtestfpp/240435/#test-dir.3/mdtest_tree.14.0’: Directory not empty
rm: cannot remove ‘/mnt/soaked/soaktest/test/mdtestfpp/240435/#test-dir.3/mdtest_tree.0.0’: Directory not empty
rm: cannot remove ‘/mnt/soaked/soaktest/test/mdtestfpp/240435/#test-dir.3/mdtest_tree.5.0’: Directory not empty
rm: cannot remove ‘/mnt/soaked/soaktest/test/mdtestfpp/240435/#test-dir.3/mdtest_tree.8.0’: Directory not empty
rm: cannot remove ‘/mnt/soaked/soaktest/test/mdtestfpp/240435/#test-dir.3/mdtest_tree.17.0’: Directory not empty

2. In the syslog of soak-23 (soak-23.log-20190224) around 2/22 16:04:36, the client was evicted and could not stat MDS #2, which is served by soak-10.

Feb 22 16:03:51 soak-23 sshd[108453]: Disconnected from 10.10.1.116 port 35852
Feb 22 16:03:51 soak-23 sshd[108453]: pam_unix(sshd:session): session closed for user root
Feb 22 16:04:27 soak-23 kernel: LustreError: 167-0: soaked-MDT0002-mdc-ffff9adc14f2b800: This client was evicted by soaked-MDT0002; in progress operations using this service will fail.
Feb 22 16:04:27 soak-23 kernel: LustreError: 108482:0:(lmv_obd.c:1387:lmv_statfs()) can't stat MDS #2 (soaked-MDT0002-mdc-ffff9adc14f2b800), error -5
Feb 22 16:04:27 soak-23 kernel: LustreError: 108401:0:(llite_lib.c:1759:ll_statfs_internal()) md_statfs fails: rc = -5
Feb 22 16:04:27 soak-23 kernel: Lustre: soaked-MDT0002-mdc-ffff9adc14f2b800: Connection restored to 192.168.1.110@o2ib (at 192.168.1.110@o2ib)
Feb 22 16:04:27 soak-23 kernel: LustreError: 108482:0:(lmv_obd.c:1387:lmv_statfs()) Skipped 1 previous similar message
Feb 22 16:04:27 soak-23 systemd-logind: Removed session 2460.
Feb 22 16:04:27 soak-23 systemd: Removed slice User Slice of root.

3. In the soak-10 syslog (soak-10.log-20190224) around the same time, the node had just come back from a reboot. Checking soak.log shows that an mds_failover fault was triggered on soak-10 right before this.

Feb 22 16:03:18 soak-10 multipathd: 360080e50001fedb80000015752012949: queue_if_no_path enabled
Feb 22 16:03:18 soak-10 multipathd: 360080e50001fedb80000015752012949: Recovered to normal mode
Feb 22 16:03:18 soak-10 multipathd: 360080e50001fedb80000015752012949: remaining active paths: 1
Feb 22 16:03:18 soak-10 kernel: sd 0:0:0:44: rdac: array soak-netapp5660-1, ctlr 0, queueing MODE_SELECT command
Feb 22 16:03:19 soak-10 kernel: sd 0:0:0:44: rdac: array soak-netapp5660-1, ctlr 0, MODE_SELECT completed
Feb 22 16:03:19 soak-10 kernel: LNet: HW NUMA nodes: 2, HW CPU cores: 32, npartitions: 2
Feb 22 16:03:19 soak-10 kernel: alg: No test for adler32 (adler32-zlib)
Feb 22 16:03:20 soak-10 kernel: Lustre: Lustre: Build Version: 2.10.6_51_gd6dabe0
Feb 22 16:03:20 soak-10 kernel: LNet: Added LNI 192.168.1.110@o2ib [8/256/0/180]
Feb 22 16:03:20 soak-10 kernel: LDISKFS-fs (dm-5): mounted filesystem with ordered data mode. Opts: user_xattr,errors=remount-ro,user_xattr,no_mbcache,nodelalloc
Feb 22 16:03:22 soak-10 kernel: Lustre: soaked-MDT0002: Imperative Recovery enabled, recovery window shrunk from 300-900 down to 150-900
Feb 22 16:03:22 soak-10 kernel: LustreError: 12847:0:(llog.c:588:llog_process_thread()) soaked-MDT0003-osp-MDT0002: [0x1:0xc002abb4:0x2] Invalid record: index 4 but expected 2
Feb 22 16:03:22 soak-10 kernel: LustreError: 12847:0:(lod_dev.c:419:lod_sub_recovery_thread()) soaked-MDT0003-osp-MDT0002 getting update log failed: rc = -34
Feb 22 16:03:22 soak-10 kernel: Lustre: 12848:0:(ldlm_lib.c:2061:target_recovery_overseer()) recovery is aborted, evict exports in recovery
Feb 22 16:03:22 soak-10 kernel: Lustre: soaked-MDT0002: disconnecting 4 stale clients
Feb 22 16:03:22 soak-10 sshd[12127]: Received disconnect from 10.10.1.116 port 33696:11: disconnected by user
Feb 22 16:03:22 soak-10 sshd[12127]: Disconnected from 10.10.1.116 port 33696
Feb 22 16:03:22 soak-10 sshd[12127]: pam_unix(sshd:session): session closed for user root
Feb 22 16:03:22 soak-10 systemd-logind: Removed session 3.

soak.log

2019-02-22 15:55:47,097:fsmgmt.fsmgmt:INFO     triggering fault mds_failover
2019-02-22 15:55:47,097:fsmgmt.fsmgmt:INFO     reseting MDS node soak-10
2019-02-22 15:55:47,099:fsmgmt.fsmgmt:INFO     executing cmd pm -h powerman -c soak-10> /dev/null
2019-02-22 15:56:17,656:fsmgmt.fsmgmt:INFO     trying to connect to soak-10 ...
2019-02-22 15:56:27,673:fsmgmt.fsmgmt:INFO     trying to connect to soak-10 ...
2019-02-22 15:56:36,884:fsmgmt.fsmgmt:INFO     trying to connect to soak-10 ...
2019-02-22 15:56:45,068:fsmgmt.fsmgmt:INFO     trying to connect to soak-10 ...
2019-02-22 15:56:53,246:fsmgmt.fsmgmt:INFO     trying to connect to soak-10 ...
2019-02-22 15:57:01,410:fsmgmt.fsmgmt:INFO     trying to connect to soak-10 ...
2019-02-22 15:57:09,590:fsmgmt.fsmgmt:INFO     trying to connect to soak-10 ...
2019-02-22 15:57:17,770:fsmgmt.fsmgmt:INFO     trying to connect to soak-10 ...
2019-02-22 15:57:25,942:fsmgmt.fsmgmt:INFO     trying to connect to soak-10 ...
2019-02-22 15:57:34,122:fsmgmt.fsmgmt:INFO     trying to connect to soak-10 ...
2019-02-22 15:57:42,300:fsmgmt.fsmgmt:INFO     trying to connect to soak-10 ...
2019-02-22 15:57:52,315:fsmgmt.fsmgmt:INFO     trying to connect to soak-10 ...
2019-02-22 15:58:00,492:fsmgmt.fsmgmt:INFO     trying to connect to soak-10 ...
2019-02-22 15:58:08,672:fsmgmt.fsmgmt:INFO     trying to connect to soak-10 ...
2019-02-22 15:58:16,852:fsmgmt.fsmgmt:INFO     trying to connect to soak-10 ...
2019-02-22 15:58:25,022:fsmgmt.fsmgmt:INFO     trying to connect to soak-10 ...
2019-02-22 15:58:33,202:fsmgmt.fsmgmt:INFO     trying to connect to soak-10 ...
2019-02-22 15:58:41,374:fsmgmt.fsmgmt:INFO     trying to connect to soak-10 ...
2019-02-22 15:58:46,895:fsmgmt.fsmgmt:INFO     trying to connect to soak-10 ...
2019-02-22 15:58:52,071:fsmgmt.fsmgmt:INFO     trying to connect to soak-10 ...
2019-02-22 15:58:52,891:fsmgmt.fsmgmt:INFO     soak-10 is up!!!
2019-02-22 15:59:03,903:fsmgmt.fsmgmt:INFO     Failing over soaked-MDT0002 ...
2019-02-22 15:59:03,903:fsmgmt.fsmgmt:INFO     Mounting soaked-MDT0002 on soak-11 ...
2019-02-22 16:00:21,285:fsmgmt.fsmgmt:INFO     ... soaked-MDT0002 mounted successfully on soak-11
2019-02-22 16:00:21,285:fsmgmt.fsmgmt:INFO     ... soaked-MDT0002 failed over
2019-02-22 16:00:21,285:fsmgmt.fsmgmt:INFO     Wait for recovery to complete
2019-02-22 16:00:21,969:fsmgmt.fsmgmt:DEBUG    Recovery Result Record: {'soak-11': {'soaked-MDT0003': 'COMPLETE', 'soaked-MDT0002': 'COMPLETE'}}
2019-02-22 16:00:21,969:fsmgmt.fsmgmt:INFO     Node soak-11: 'soaked-MDT0002' recovery completed
2019-02-22 16:00:21,969:fsmgmt.fsmgmt:INFO     Failing back soaked-MDT0002 ...
2019-02-22 16:00:21,970:fsmgmt.fsmgmt:INFO     Unmounting soaked-MDT0002 on soak-11 ...
2019-02-22 16:00:27,601:sched.sched :INFO     job 240214 completed with status FAIL
2019-02-22 16:00:27,601:sched.sched :ERROR    job 240214 of type simul FAILED
2019-02-22 16:00:27,602:sched.sched :DEBUG    JobID: 240214 type: simul #tasks: 2 timelimit: 04:00:00
cmd: simul -d /mnt/soaked/soaktest/test -n 165 -N 11 
2019-02-22 16:00:28,621:sched.sched :INFO     job 240441 completed with status FAIL
2019-02-22 16:00:28,622:sched.sched :ERROR    job 240441 of type simul FAILED
2019-02-22 16:00:28,622:sched.sched :DEBUG    JobID: 240441 type: simul #tasks: 15 timelimit: 04:00:00
cmd: simul -d /mnt/soaked/soaktest/test -n 375 -N 11 
2019-02-22 16:00:28,622:sched.sched :INFO     DEFUG1 28 nrnodes 896 nrcpus 15
2019-02-22 16:00:28,622:sched.sched :INFO     DEFUG22 13 nrnodes 416 nrcpus
2019-02-22 16:00:28,623:sched.sched :INFO     executing: cmd=sbatch -vv -p soaktest --share  -n 5  -N 2-5  --ntasks-per-node=2  -t 04:00:00  --uid soaktest -J iorssf -o /scratch/soak-results//jobs/%j-iorssf.out -D /mnt/soaked/soaktest/test
2019-02-22 16:00:28,645:sched.sched :INFO     New iorssf job sumbitted with id 240457 and 5 tasks
2019-02-22 16:00:28,645:sched.sched :DEBUG    JobID: 240457 type: iorssf #tasks: 5 timelimit: 04:00:00
cmd: IOR -vvv -a POSIX -E -w -W -r -Z -q -R -W -b 8039360k -t 11840k -o /mnt/soaked/soaktest/test/ssf 
2019-02-22 16:00:28,646:sched.sched :INFO     DEFUG1 28 nrnodes 896 nrcpus 15
2019-02-22 16:00:28,646:sched.sched :INFO     DEFUG22 13 nrnodes 416 nrcpus
2019-02-22 16:00:28,646:sched.sched :INFO     executing: cmd=sbatch -vv -p soaktest --share  -n 23  -N 11-23  --ntasks-per-node=2  -t 04:00:00  --uid soaktest -J iorfpp -o /scratch/soak-results//jobs/%j-iorfpp.out -D /mnt/soaked/soaktest/test
2019-02-22 16:00:28,665:sched.sched :INFO     New iorfpp job sumbitted with id 240458 and 23 tasks
2019-02-22 16:00:28,665:sched.sched :DEBUG    JobID: 240458 type: iorfpp #tasks: 23 timelimit: 04:00:00
cmd: IOR -vvv -a POSIX -F -w -W -r -Z -q -R -W -b 3976512k -t 28608k -o /mnt/soaked/soaktest/test/iorfpp_file
2019-02-22 16:00:29,301:fsmgmt.fsmgmt:INFO     ... soaked-MDT0002 unmounted successfully on soak-11
2019-02-22 16:00:29,301:fsmgmt.fsmgmt:INFO     Mounting soaked-MDT0002 on soak-10 ...
2019-02-22 16:03:39,432:fsmgmt.fsmgmt:INFO     ... soaked-MDT0002 mounted successfully on soak-10
2019-02-22 16:03:39,432:fsmgmt.fsmgmt:INFO     ... soaked-MDT0002 failed back
2019-02-22 16:03:39,432:fsmgmt.fsmgmt:INFO     mds_failover just completed
2019-02-22 16:03:39,433:fsmgmt.fsmgmt:INFO     next fault in 522s
Comment by Patrick Farrell (Inactive) [ 26/Feb/19 ]

Hmm, OK.

Well, this error and the associated evictions are enough to cause job failure:

Feb 22 16:03:22 soak-10 kernel: LustreError: 12847:0:(llog.c:588:llog_process_thread()) soaked-MDT0003-osp-MDT0002: [0x1:0xc002abb4:0x2] Invalid record: index 4 but expected 2
Feb 22 16:03:22 soak-10 kernel: LustreError: 12847:0:(lod_dev.c:419:lod_sub_recovery_thread()) soaked-MDT0003-osp-MDT0002 getting update log failed: rc = -34
Feb 22 16:03:22 soak-10 kernel: Lustre: 12848:0:(ldlm_lib.c:2061:target_recovery_overseer()) recovery is aborted, evict exports in recovery

I'm trying to find an existing report of an issue like this, but nothing is coming up immediately. I'll see what else, if anything, I can find...

Comment by Patrick Farrell (Inactive) [ 26/Feb/19 ]

And, thank you!  Very helpful.

I need to get onto soak and look around, but if you're able, can you check how often (if at all) we've seen this set of messages? Specifically the "Invalid record" message, with something like:

 grep llog_process_thread [...] | grep "Invalid record"

 

Comment by Sarah Liu [ 26/Feb/19 ]

In about 19 hours of running, I found 3 such messages:

Feb 22 09:27:13 soak-10 kernel: LustreError: 12838:0:(llog.c:588:llog_process_thread()) soaked-MDT0003-osp-MDT0002: [0x1:0xc002abb4:0x2] Invalid record: index 4 but expected 2
Feb 22 15:10:32 soak-10 kernel: LustreError: 12830:0:(llog.c:588:llog_process_thread()) soaked-MDT0003-osp-MDT0002: [0x1:0xc002abb4:0x2] Invalid record: index 4 but expected 2
Feb 22 16:03:22 soak-10 kernel: LustreError: 12847:0:(llog.c:588:llog_process_thread()) soaked-MDT0003-osp-MDT0002: [0x1:0xc002abb4:0x2] Invalid record: index 4 but expected 2

I will move soak to the tip of b2_10 later today; you can still get these logs on soak.

Comment by Patrick Farrell (Inactive) [ 26/Feb/19 ]

I'm sorry if I'm asking questions that have been answered elsewhere, but I've lost track of all the info.

So, the application failure rate on soak is pretty high right now. The primary culprit - for the large majority of failures - is various errors coming out of the llog code on MDS/MDT failover/failback, all of which point, essentially, to the log being incorrect or damaged. These cause aborted recoveries, which cause a lot of evictions, which sometimes lead to application failures.

A few questions.

  1. Have we tried master on soak recently?  Did it have a similar failure rate to 2.10?
  2. What earlier versions of 2.10 have we tried?  Did any of them have significantly lower failure rates?
  3. What is the most recent version of 2.10 we are confident had lower failure rates?
Comment by Patrick Farrell (Inactive) [ 26/Feb/19 ]

I've identified two candidate changes that could have introduced these issues. I have not identified any llog-related changes in master that are currently missing from b2_10.

These are the changes:

commit 51e962be60cf599ecf154ea3a6b1c0f9882daac2
Author: Bruno Faccini <bruno.faccini@intel.com>
Date: Wed Jan 17 16:22:58 2018 +0100

LU-10527 obdclass: don't recycle loghandle upon ENOSPC

In llog_cat_add_rec(), upon -ENOSPC error being returned from
llog_cat_new_log(), don't reset "cathandle->u.chd.chd_current_log"
to NULL.
Not doing so will avoid to have llog_cat_declare_add_rec() repeatedly
and unnecessarily create new+partially initialized LLOGs/llog_handle
and assigned to "cathandle->u.chd.chd_current_log", this without
llog_init_handle() never being called to initialize
"loghandle->lgh_hdr".

Also, unnecessary LASSERT(llh) has been removed in
llog_cat_current_log() as it prevented to gracefully handle this
case by simply returning the loghandle.
Thanks to S.Cheremencev (Cray) to report this.

Both ways to fix have been kept in patch as the 1st part allows for
better performance in terms of number of FS operations being done
with permanent changelog's ENOSPC condition, even if this covers
a somewhat unlikely situation.


That change first appeared in 2.10.6.

 

The other change:
-------

commit 10cc97e3c1487692b460702bf46220b1acb452ee
Author: Alexander Boyko <alexander.boyko@seagate.com>
Date: Wed Mar 22 14:39:48 2017 +0300

LU-7001 osp: fix llog processing

The osp_sync_thread base on fact that llog_cat_process
will not end until umount. This is worng when processing reaches
bottom of catalog, or if catalog is wrapped.
The patch fixes this issue.

For wrapped catalog llog_process_thread could process old
record.
1 thread llog_process_thread read chunk and proccing first record
2 thread add rec to this catalog at this chunk and
update bitmap
1 check bitmap for next idx and process old record

Test conf-sanity 106 was added.

Lustre-change: https://review.whamcloud.com/26132
Lustre-commit: 8da9fb0cf14cc79bf1985d144d0a201e136dfe51

Signed-off-by: Alexander Boyko <alexander.boyko@seagate.com>
Seagate-bug-id: MRP-4235
Change-Id: Ifc983018e3a325622ef3215bec4b69f5c9ac2ba2
Reviewed-by: Andriy Skulysh
Reviewed-by: Mike Pershin <mike.pershin@intel.com>
Signed-off-by: Minh Diep <minh.diep@intel.com>
Reviewed-on: https://review.whamcloud.com/32097
Tested-by: Jenkins
Tested-by: Maloo <hpdd-maloo@intel.com>
Reviewed-by: John L. Hammond <john.hammond@intel.com>

------

This change first appeared in 2.10.4.  (Which had a lot of changes from 2.10.3.)

Comment by Sarah Liu [ 26/Feb/19 ]

Have we tried master on soak recently? Did it have a similar failure rate to 2.10?

The last time master was tested was on 1/11, and that run was without PFL and DoM striping. When we did the 2.12 testing, it had a similar failure rate to the current 2.10 runs with PFL and DoM striping (LU-11818); with PFL and DoM disabled, it did better.

What earlier versions of 2.10 have we tried? Did any of them have significantly lower failure rates?

The earlier builds tested are 2.10.6 (b2_10-ib build #92), #95, and #98; build #98 seems to have higher failure rates.

What is the most recent version of 2.10 we are confident had lower failure rates?

I think the 2.10.6 RC build (#92) is in good shape.

Comment by Patrick Farrell (Inactive) [ 26/Feb/19 ]

I see you said we were going to run build #92 recently:

"I will reload build #92 with 3600s interval and see how it goes"

But I don't see the results.  Did we do that?  How did it go?

If #92 is indeed OK, then that suggests:
LU-10527 obdclass: don't recycle loghandle upon ENOSPC (detail)

Which is in your list of changes from 92 to 95.

If the recent run of #92 was indeed OK, can we remove that patch and try tip of b2_10?

Comment by Sarah Liu [ 27/Feb/19 ]

No, I didn't rerun #92 with the 3600s interval. Instead I tried #98 first with the update logs removed, which helped the pass rate, and then the triage team suggested moving to the tip of b2_10.

Let me first try the review build; if it doesn't help with the pass rate, I can move to #92 to confirm it is the latest good build.

Comment by Patrick Farrell (Inactive) [ 27/Feb/19 ]

Sounds perfect, thanks!

Comment by Patrick Farrell (Inactive) [ 27/Feb/19 ]

Quick early readout.  I don't have application failure rates (Sarah will have to provide that), but I did look at the failovers and evictions today.  Overall failover still looks pretty choppy - quite a few have evictions - but the MDT failover is (tentatively) much improved.

Over the last month or so, about 50% of the ~320 MDT failovers we did resulted in an aborted recovery.  All or almost all of those were from llog problems.  So far today we've done 9 MDT failovers, with 0 aborts.

So...  Tentatively, much improved.  We'll have to see what the failure rate is like, and how things shake out overnight.
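As a rough sanity check on "tentatively much improved", assuming independent failovers and the ~50% historical abort rate quoted above:

# If MDT failover recovery still aborted ~50% of the time, nine clean
# failovers in a row would be unlikely (assumes independent trials).
historical_abort_rate = 0.5      # ~50% of ~320 failovers over the last month
clean_today = 9

p_all_clean = (1 - historical_abort_rate) ** clean_today
print(f"P(9 abort-free failovers at the old rate) = {p_all_clean:.2%}")  # ~0.20%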

Comment by Sarah Liu [ 28/Feb/19 ]

A quick update: soak has been running with Patrick's patch (https://build.whamcloud.com/job/lustre-reviews-ib/4213/) for about 6 hours with only mds_failover and mds_restart faults, and the result is 105 pass / 4 fail (simul only). It seems the revert works. I will leave it running overnight and update the ticket tomorrow.

> show
Job results:
        FAIL
                simul: 4
                        240992 241065 241032 241086 
                Total: 4
        QUEUED
                mdtestfpp: 3
                iorssf: 2
                iorfpp: 1
                fiorandom: 10
                blogbench: 10
                mdtestssf: 4
                fiosequent: 10
                fiosas: 10
                kcompile: 10
                simul: 8
                Total: 68
        SUCCESS
                mdtestssf: 23
                iorssf: 23
                iorfpp: 21
                mdtestfpp: 26
                simul: 12
                Total: 105
Fault injection results:
        CURRENT
                mds_failover in progress
        COMPLETED
                mds_failover: 5
                mds_restart: 5
                Total: 10
Comment by Sarah Liu [ 28/Feb/19 ]

Review build #4213 has been running for 20 hours; the fail/pass numbers are 19 fail / 384 pass, with only simul failing due to timeouts.

No crashes on the servers, but one client hit a kernel NULL pointer dereference. soak-40 console:

soak-40 login: [   60.771568] IPv6: ADDRCONF(NETDEV_CHANGE): ib0: link becomes ready
[ 4976.239443] LNet: HW NUMA nodes: 2, HW CPU cores: 88, npartitions: 2
[ 4976.252443] alg: No test for adler32 (adler32-zlib)
[ 4977.043702] Lustre: Lustre: Build Version: 2.10.6_64_g1307d8a
[ 4977.175814] LNet: Added LNI 192.168.1.140@o2ib [8/256/0/180]
[ 8232.416484] LustreError: 11-0: soaked-OST000c-osc-ffff9847cefdd800: operation ost_connect to node 192.168.1.104@o2ib failed: rc = -16
[ 8232.431730] Lustre: Mounted soaked-client
[10013.355513] perf: interrupt took too long (3164 > 3156), lowering kernel.perf_event_max_sample_rate to 63000
[11889.420578] Lustre: 14411:0:(client.c:2116:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1551302693/real 0]  req@ffff9840c6e6b300 
x1626651565453584/t0(0) o400->MGC192.168.1.108@o2ib@192.168.1.108@o2ib:26/25 lens 224/224 e 0 to 1 dl 1551302700 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
[11889.451642] LustreError: 166-1: MGC192.168.1.108@o2ib: Connection to MGS (at 192.168.1.108@o2ib) was lost; in progress operations using this service will fail
[11895.468004] Lustre: 14364:0:(client.c:2116:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1551302700/real 0]  req@ffff9840c6e6e600 
x1626651565467584/t0(0) o250->MGC192.168.1.108@o2ib@192.168.1.108@o2ib:26/25 lens 520/544 e 0 to 1 dl 1551302706 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
[11907.508055] BUG: unable to handle kernel NULL pointer dereference at 0000000000000028
[11907.516881] IP: [<ffffffffc099d920>] kiblnd_connect_peer+0x70/0x670 [ko2iblnd]
[11907.525014] PGD 0 
[11907.527286] Oops: 0000 [#1] SMP 
[11907.530934] Modules linked in: osc(OE) mgc(OE) lustre(OE) lmv(OE) fld(OE) mdc(OE) fid(OE) lov(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) rpcsec_gss_krb5 nfsv4 dns_resolver nfs lockd grace fscache rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_umad(OE) mlx5_ib(OE) ib_uverbs(OE) mlx5_core(OE) mlxfw(OE) mlx4_en(OE) sb_edac intel_powerclamp coretemp intel_rapl iosf_mbi kvm_intel kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd iTCO_wdt mxm_wmi iTCO_vendor_support ipmi_ssif mei_me joydev pcspkr mei ioatdma sg lpc_ich i2c_i801 ipmi_si ipmi_devintf ipmi_msghandler wmi acpi_power_meter acpi_pad auth_rpcgss sunrpc ip_tables ext4 mbcache jbd2 mlx4_ib(OE) ib_core(OE) sd_mod crc_t10dif crct10dif_generic mgag200 drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops igb ttm ahci mlx4_core(OE) drm ptp libahci crct10dif_pclmul crct10dif_common pps_core crc32c_intel devlink libata dca drm_panel_orientation_quirks mlx_compat(OE) i2c_algo_bit
[11907.634252] CPU: 33 PID: 20895 Comm: mdtest Kdump: loaded Tainted: G           OE  ------------   3.10.0-957.el7.x86_64 #1
[11907.646651] Hardware name: Intel Corporation S2600WT2R/S2600WT2R, BIOS SE5C610.86B.01.01.0016.033120161139 03/31/2016
[11907.658563] task: ffff9847d5e030c0 ti: ffff983faff90000 task.ti: ffff983faff90000
[11907.666964] RIP: 0010:[<ffffffffc099d920>]  [<ffffffffc099d920>] kiblnd_connect_peer+0x70/0x670 [ko2iblnd]
[11907.677817] RSP: 0018:ffff983faff937a8  EFLAGS: 00010202
[11907.683782] RAX: 0000000000000000 RBX: ffff9847cda39800 RCX: 0000000000000106
[11907.691792] RDX: ffff984386f09d80 RSI: ffffffffc09a4460 RDI: ffff984386f09d80
[11907.699803] RBP: ffff983faff937f8 R08: 0000000000000002 R09: ffffffffc09abd24
[11907.707813] R10: ffff9838ffc07900 R11: ffff984386f09d80 R12: 00050000c0a8016c
[11907.715825] R13: ffff984386f09d80 R14: ffff984386f09d80 R15: ffff9847d77cfa00
[11907.723838] FS:  0000000000000000(0000) GS:ffff9847de0c0000(0000) knlGS:0000000000000000
[11907.732921] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[11907.739371] CR2: 0000000000000028 CR3: 00000008cca10000 CR4: 00000000003607e0
[11907.747383] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[11907.757744] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[11907.768095] Call Trace:
[11907.773201]  [<ffffffffc099ff6c>] kiblnd_launch_tx+0x90c/0xc20 [ko2iblnd]
[11907.783193]  [<ffffffffc09a05d7>] kiblnd_send+0x357/0xa10 [ko2iblnd]
[11907.792681]  [<ffffffffc0af1bc1>] lnet_ni_send+0x41/0xd0 [lnet]
[11907.801654]  [<ffffffffc0af6fb7>] lnet_send+0x77/0x180 [lnet]
[11907.810429]  [<ffffffffc0af7305>] LNetPut+0x245/0x7a0 [lnet]
[11907.819136]  [<ffffffffc0e94006>] ptl_send_buf+0x146/0x530 [ptlrpc]
[11907.828470]  [<ffffffffc0aeca44>] ? LNetMDAttach+0x3f4/0x450 [lnet]
[11907.837843]  [<ffffffffc0e95bfd>] ptl_send_rpc+0x67d/0xe60 [ptlrpc]
[11907.847119]  [<ffffffffc0e8b4a8>] ptlrpc_send_new_req+0x468/0xa60 [ptlrpc]
[11907.857087]  [<ffffffffc0ed439a>] ? null_alloc_reqbuf+0x19a/0x3a0 [ptlrpc]
[11907.867016]  [<ffffffffc0e900c1>] ptlrpc_set_wait+0x3d1/0x920 [ptlrpc]
[11907.876515]  [<ffffffffc0b6b8c9>] ? lustre_get_jobid+0x99/0x4d0 [obdclass]
[11907.886425]  [<ffffffffc0e9b635>] ? lustre_msg_set_jobid+0x95/0x100 [ptlrpc]
[11907.896423]  [<ffffffffc0e9068d>] ptlrpc_queue_wait+0x7d/0x220 [ptlrpc]
[11907.905865]  [<ffffffffc0cff58c>] mdc_close+0x1bc/0x8a0 [mdc]
[11907.914312]  [<ffffffffc0d3707c>] lmv_close+0x21c/0x550 [lmv]
[11907.922702]  [<ffffffffc0fbb6fe>] ll_close_inode_openhandle+0x2fe/0xe20 [lustre]
[11907.932901]  [<ffffffffc0fbe5c0>] ll_md_real_close+0xf0/0x1e0 [lustre]
[11907.942087]  [<ffffffffc0fbecf8>] ll_file_release+0x648/0xa80 [lustre]
[11907.951236]  [<ffffffff936433dc>] __fput+0xec/0x260
[11907.958513]  [<ffffffff9364363e>] ____fput+0xe/0x10
[11907.965763]  [<ffffffff934be79b>] task_work_run+0xbb/0xe0
[11907.973631]  [<ffffffff9349dc61>] do_exit+0x2d1/0xa40
[11907.981039]  [<ffffffff93b6f608>] ? __do_page_fault+0x228/0x500
[11907.989417]  [<ffffffff9349e44f>] do_group_exit+0x3f/0xa0
[11907.997175]  [<ffffffff9349e4c4>] SyS_exit_group+0x14/0x20
[11908.004972]  [<ffffffff93b74ddb>] system_call_fastpath+0x22/0x27
[11908.013306] Code: 48 8b 04 25 80 0e 01 00 48 8b 80 60 07 00 00 49 c7 c1 24 bd 9a c0 41 b8 02 00 00 00 b9 06 01 00 00 4c 89 ea 48 c7 c6 60 44 9a c0 <48> 8b 78 28 e8 37 61 c9 ff 48 3d 00 f0 ff ff 49 89 c6 0f 87 d1 
[11908.038427] RIP  [<ffffffffc099d920>] kiblnd_connect_peer+0x70/0x670 [ko2iblnd]
[11908.048166]  RSP <ffff983faff937a8>
[11908.053569] CR2: 0000000000000028
[    0.000000] Initializing cgroup subsys cpuset
[    0.000000] Initializing cgroup subsys cpu
[    0.000000] Initializing cgroup subsys cpuacct
Comment by Patrick Farrell (Inactive) [ 28/Feb/19 ]

We are now up to 54 failovers without an abort.  I'll update the other bug - I'm not sure how, but that patch is causing a lot of trouble.

Comment by Patrick Farrell (Inactive) [ 28/Feb/19 ]

Suggested fix is here:
https://review.whamcloud.com/34346

I'll push it for master shortly.

Comment by Patrick Farrell (Inactive) [ 28/Feb/19 ]

Changed the fixes to associate them with this ticket.

Master:
https://review.whamcloud.com/#/c/34347/

b2_10:

https://review.whamcloud.com/#/c/34346/
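
Both patches carry the subject "LU-11943 llog: Reset current log on ENOSPC" (see the Gerrit updates below). As a rough illustration of that idea only (the names, capacity and stubs below are hypothetical, not the landed code), a catalog that drops its cached plain log when an add fails with -ENOSPC and starts a new one might look like:

/*
 * Conceptual sketch of "reset the current log on ENOSPC"; not the patch.
 */
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

#define PLAIN_LOG_CAPACITY 4

struct plain_log {
    int used;                          /* records written so far */
};

struct catalog {
    struct plain_log *current;         /* cached "current" plain log */
};

/* Stub backend: pretend space runs out after a few records. */
static int plain_log_add(struct plain_log *log, const char *rec)
{
    (void)rec;
    if (log->used >= PLAIN_LOG_CAPACITY)
        return -ENOSPC;
    log->used++;
    return 0;
}

static struct plain_log *catalog_new_plain_log(struct catalog *cat)
{
    (void)cat;
    return calloc(1, sizeof(struct plain_log));
}

/*
 * On -ENOSPC, drop the cached handle and start a new plain log instead
 * of recycling the full one, so later adds do not keep failing.
 */
static int catalog_add_rec(struct catalog *cat, const char *rec)
{
    int rc;

    if (cat->current == NULL)
        cat->current = catalog_new_plain_log(cat);

    rc = plain_log_add(cat->current, rec);
    if (rc == -ENOSPC) {
        free(cat->current);            /* reset the current log */
        cat->current = catalog_new_plain_log(cat);
        rc = plain_log_add(cat->current, rec);
    }
    return rc;
}

int main(void)
{
    struct catalog cat = { .current = NULL };

    for (int i = 0; i < 10; i++)
        printf("add rec %d: rc = %d\n", i, catalog_add_rec(&cat, "rec"));
    free(cat.current);
    return 0;
}

Running the sketch shows every add returning 0 because a full (or out-of-space) plain log is discarded and replaced instead of being retried.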

 

Comment by Sarah Liu [ 01/Mar/19 ]

Update on lustre-reviews-ib build #4241 (https://review.whamcloud.com/34346):

After 17 hours of running with only mds_failover and mds_restart, the fail/pass numbers are 97 fail / 453 pass. With this modified patch, the pass rate is still worse than with the previous revert patch.

Checking one of the failed applications, mdtestfpp.out:

lmv_stripe_count: 0 lmv_stripe_offset: 0 lmv_hash_type: none
lmv_stripe_count: 0 lmv_stripe_offset: -1 lmv_hash_type: none
srun: Warning: can't honor --ntasks-per-node set to 2 which doesn't match the requested tasks 12 with the number of requested nodes 8. Ignoring --ntasks-per-node.
-- started at 03/01/2019 12:45:51 --

mdtest-1.9.3 was launched with 12 total task(s) on 6 node(s)
Command line used: /mnt/soaked/bin/mdtest -d /mnt/soaked/soaktest/test/mdtestfpp/241953 -i 5 -n 105 -u
03/01/2019 12:49:18: Process 0(): FAILED in show_file_system_size, unable to statfs() file system: Input/output error
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD 
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** STEP 241953.0 ON soak-28 CANCELLED AT 2019-03-01T12:49:18 ***

Then checking the client soak-28 syslog around 3/1 12:49, it was evicted by MDS #2 (soak-10):

Mar  1 12:46:46 soak-28 kernel: Lustre: 10153:0:(client.c:2116:ptlrpc_expire_one_request()) Skipped 2 previous similar messages
Mar  1 12:47:05 soak-28 kernel: LNet: 10142:0:(o2iblnd_cb.c:3192:kiblnd_check_conns()) Timed out tx for 192.168.1.110@o2ib: 6 seconds
Mar  1 12:47:56 soak-28 kernel: LNet: 10142:0:(o2iblnd_cb.c:3192:kiblnd_check_conns()) Timed out tx for 192.168.1.110@o2ib: 0 seconds
Mar  1 12:49:16 soak-28 kernel: LustreError: 65561:0:(lmv_obd.c:1387:lmv_statfs()) can't stat MDS #2 (soaked-MDT0002-mdc-ffff98d1ac0b5000), error -19
Mar  1 12:49:16 soak-28 kernel: LustreError: 65561:0:(lmv_obd.c:1387:lmv_statfs()) Skipped 1 previous similar message
Mar  1 12:49:16 soak-28 kernel: LustreError: 65561:0:(llite_lib.c:1759:ll_statfs_internal()) md_statfs fails: rc = -19
Mar  1 12:49:17 soak-28 kernel: LustreError: 65561:0:(llite_lib.c:1759:ll_statfs_internal()) Skipped 3 previous similar messages
Mar  1 12:49:18 soak-28 kernel: LustreError: 167-0: soaked-MDT0002-mdc-ffff98d1ac0b5000: This client was evicted by soaked-MDT0002; in progress operations using this service will fail.
Mar  1 12:49:18 soak-28 kernel: LustreError: 65550:0:(lmv_obd.c:1387:lmv_statfs()) can't stat MDS #2 (soaked-MDT0002-mdc-ffff98d1ac0b5000), error -5
Mar  1 12:49:18 soak-28 kernel: Lustre: soaked-MDT0002-mdc-ffff98d1ac0b5000: Connection restored to 192.168.1.110@o2ib (at 192.168.1.110@o2ib)
Mar  1 12:49:18 soak-28 kernel: LustreError: 65550:0:(llite_lib.c:1759:ll_statfs_internal()) md_statfs fails: rc = -5
Mar  1 12:49:20 soak-28 sshd[65650]: error: Could not load host key: /etc/ssh/ssh_host_dsa_key
Mar  1 12:49:20 soak-28 sshd[65650]: Accepted publickey for root from 10.10.1.116 port 36250 ssh2: RSA SHA256:VGwjPuk53LIsLKjhGizbClh9X4HNRiAOs+XaQdKAWxM
Mar  1 12:49:20 soak-28 systemd: Created slice User Slice of root.
Mar  1 12:49:20 soak-28 systemd-logind: New session 1419 of user root.
Mar  1 12:49:20 soak-28 systemd: Started Session 1419 of user root.
Mar  1 12:49:20 soak-28 sshd[65650]: pam_unix(sshd:session): session opened for user root by (uid=0)
Mar  1 12:49:21 soak-28 sshd[65650]: pam_unix(sshd:session): session closed for user root
Mar  1 12:49:21 soak-28 systemd-logind: Removed session 1419.
Mar  1 12:49:21 soak-28 systemd: Removed slice User Slice of root.
Mar  1 12:49:30 soak-28 sshd[65684]: error: Could not load host key: /etc/ssh/ssh_host_dsa_key
Mar  1 12:49:30 soak-28 sshd[65684]: Accepted publickey for root from 10.10.1.116 port 36388 ssh2: RSA SHA256:VGwjPuk53LIsLKjhGizbClh9X4HNRiAOs+XaQdKAWxM
Mar  1 12:49:30 soak-28 systemd: Created slice User Slice of root.
Mar  1 12:49:30 soak-28 systemd-logind: New session 1420 of user root.
Mar  1 12:49:30 soak-28 systemd: Started Session 1420 of user root.
Mar  1 12:49:30 soak-28 sshd[65684]: pam_unix(sshd:session): session opened for user root by (uid=0)
Mar  1 12:49:30 soak-28 sshd[65684]: Received disconnect from 10.10.1.116 port 36388:11: disconnected by user
Mar  1 12:49:30 soak-28 sshd[65684]: Disconnected from 10.10.1.116 port 36388
Mar  1 12:49:30 soak-28 sshd[65684]: pam_unix(sshd:session): session closed for user root
Mar  1 12:49:30 soak-28 systemd-logind: Removed session 1420.
Mar  1 12:49:30 soak-28 systemd: Removed slice User Slice of root.
Mar  1 12:49:56 soak-28 kernel: LustreError: 65742:0:(llite_lib.c:2345:ll_prep_inode()) new_inode -fatal: rc -2
Mar  1 12:49:57 soak-28 kernel: LustreError: 65772:0:(llite_lib.c:2345:ll_prep_inode()) new_inode -fatal: rc -2
Mar  1 12:49:57 soak-28 kernel: LustreError: 65772:0:(llite_lib.c:2345:ll_prep_inode()) Skipped 10 previous similar messages

Checking soak-10 around the same time:

Mar  1 12:47:37 soak-10 multipathd: 360080e50001fedb80000015752012949: remaining active paths: 1
Mar  1 12:47:37 soak-10 kernel: sd 0:0:0:44: rdac: array soak-netapp5660-1, ctlr 0, queueing MODE_SELECT command
Mar  1 12:47:37 soak-10 kernel: Lustre: soaked-MDT0002: Not available for connect from 192.168.1.123@o2ib (not set up)
Mar  1 12:47:38 soak-10 kernel: sd 0:0:0:44: rdac: array soak-netapp5660-1, ctlr 0, MODE_SELECT completed
Mar  1 12:47:38 soak-10 kernel: Lustre: soaked-MDT0002: Imperative Recovery enabled, recovery window shrunk from 300-900 down to 150-900
Mar  1 12:47:38 soak-10 kernel: LustreError: 13072:0:(llog.c:588:llog_process_thread()) soaked-MDT0003-osp-MDT0002: [0x2:0xc0030980:0x2] Invalid record: index 6 but expected 2
Mar  1 12:47:38 soak-10 kernel: LustreError: 13072:0:(lod_dev.c:419:lod_sub_recovery_thread()) soaked-MDT0003-osp-MDT0002 getting update log failed: rc = -34
Mar  1 12:47:39 soak-10 sshd[12270]: pam_unix(sshd:session): session closed for user root
Mar  1 12:47:39 soak-10 systemd-logind: Removed session 3.
Mar  1 12:47:39 soak-10 systemd: Removed slice User Slice of root.
Mar  1 12:47:39 soak-10 kernel: Lustre: soaked-MDT0002: Connection restored to b61568ce-eadb-273d-d151-58976b27b2a3 (at 192.168.1.122@o2ib)
Mar  1 12:47:39 soak-10 kernel: Lustre: 13073:0:(ldlm_lib.c:2061:target_recovery_overseer()) recovery is aborted, evict exports in recovery
Mar  1 12:47:39 soak-10 kernel: Lustre: soaked-MDT0002: disconnecting 16 stale clients
Mar  1 12:47:41 soak-10 kernel: Lustre: soaked-MDT0002: Connection restored to 4ae090bf-6bf9-e882-df72-d1d71f1e04c8 (at 192.168.1.120@o2ib)
Mar  1 12:47:42 soak-10 multipathd: 360080e50001fedb80000015752012949: sdg - rdac checker reports path is up
Mar  1 12:47:42 soak-10 multipathd: 8:96: reinstated
Mar  1 12:47:42 soak-10 multipathd: 360080e50001fedb80000015752012949: load table [0 31247843328 multipath 2 queue_if_no_path retain_attached_hw_handler 1 rdac 1 1 round-robin 0 1 1 8:96 1]
Mar  1 12:47:44 soak-10 kernel: Lustre: soaked-MDT0002: Connection restored to d4732544-d66c-f143-5fc3-3aff0109b21f (at 192.168.1.135@o2ib)
Mar  1 12:47:46 soak-10 kernel: Lustre: soaked-MDT0002: Connection restored to  (at 192.168.1.124@o2ib)
Mar  1 12:47:46 soak-10 kernel: Lustre: Skipped 13 previous similar messages
Mar  1 12:47:47 soak-10 multipathd: 360080e50001fedb80000015752012949: sdg - rdac checker reports path is ghost
Mar  1 12:47:47 soak-10 multipathd: 8:96: reinstated
Mar  1 12:47:48 soak-10 kernel: sd 0:0:0:44: [sdg] FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Mar  1 12:47:48 soak-10 kernel: sd 0:0:0:44: [sdg] Sense Key : Illegal Request [current] 
Mar  1 12:47:48 soak-10 kernel: sd 0:0:0:44: [sdg] <<vendor>>ASC=0x94 ASCQ=0x1 
Mar  1 12:47:48 soak-10 kernel: sd 0:0:0:44: [sdg] CDB: Write(16) 8a 00 00 00 00 00 00 b1 23 78 00 00 00 08 00 00
Mar  1 12:47:48 soak-10 kernel: blk_update_request: I/O error, dev sdg, sector 11608952
Mar  1 12:47:48 soak-10 kernel: device-mapper: multipath: Failing path 8:96.
Mar  1 12:47:48 soak-10 multipathd: sdg: mark as failed
Comment by Patrick Farrell (Inactive) [ 01/Mar/19 ]

Whoa.  That's very confusing.  Unless we made a mistake with the build, that should be impossible, but I do see the errors.

I'll see what I can figure out.

Comment by Patrick Farrell (Inactive) [ 01/Mar/19 ]

Would we know if the MDS were asserting during shutdown/failover?  How do we cause the MDS failover - what exactly do we do to it?

Comment by Sarah Liu [ 01/Mar/19 ]

The Lustre version is:

----------------
soak-[4-11,17-31,35-43]
----------------
version=2.10.6_64_g155f4e8

lustre-reviews-ib build #4241 shows the same version; unfortunately, the #4241 build is gone from Jenkins.

We don't assert the server; failover is done by a hard reboot. If you want to check the soak code, it is under:

[root@soak-16 fsmgmt]# pwd
/scratch/soak-merge/soak-git/soak-test-harness/harness/fsmgmt
[root@soak-16 fsmgmt]# less builtin.py
Comment by Patrick Farrell (Inactive) [ 01/Mar/19 ]

Sarah,

Just reviewed with Minh.

He made a mistake yesterday.  4241 is an unrelated build.
The one we need is:
https://build.whamcloud.com/job/lustre-reviews-ib/4248/

Sorry.  Are you able to install and test that on Soak?

Comment by Gerrit Updater [ 02/Mar/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/34346/
Subject: LU-11943 llog: Reset current log on ENOSPC
Project: fs/lustre-release
Branch: b2_10
Current Patch Set:
Commit: 0cba3cc14fa68b8f43833abfb6f7bc3e547bce03

Comment by Sarah Liu [ 05/Mar/19 ]

Here is an update on patch #34346 tested on soak (running about 48 hours): 200 fail / 1192 pass.
I did see some application failures with input/output errors; here is an example failure:

Application mdtestfpp failed as:

-- started at 03/02/2019 01:10:03 --

mdtest-1.9.3 was launched with 15 total task(s) on 15 node(s)
Command line used: /mnt/soaked/bin/mdtest -d /mnt/soaked/soaktest/test/mdtestfpp/242297 -i 5 -n 467 -u
Path: /mnt/soaked/soaktest/test/mdtestfpp
FS: 83.4 TiB   Used FS: 7.9%   Inodes: 77.7 Mi   Used Inodes: 1.1%
srun: error: soak-25: task 1: Exited with exit code 1
03/02/2019 01:41:10: Process 1(soak-25.spirit.whamcloud.com): FAILED in create_remove_items_helper, unable to remove directory: Input/output error
srun: Terminating job step 242297.0
slurmstepd: error: *** STEP 242297.0 ON soak-24 CANCELLED AT 2019-03-02T01:41:04 ***
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.

Checking the syslog of client node soak-24: it was evicted by MDS1, which is soak-9:

Mar  2 01:40:48 soak-24 kernel: Lustre: soaked-MDT0001-mdc-ffff9eb348691000: Connection to soaked-MDT0001 (at 192.168.1.108@o2ib) was lost; in progress operations using this service will wait for recovery to complete
Mar  2 01:41:08 soak-24 kernel: LustreError: 167-0: soaked-MDT0001-mdc-ffff9eb348691000: This client was evicted by soaked-MDT0001; in progress operations using this service will fail.

Checking soak-9 around the same time:

Mar  2 01:40:03 soak-9 systemd-logind: New session 4 of user root.
Mar  2 01:40:03 soak-9 systemd: Started Session 4 of user root.
Mar  2 01:40:03 soak-9 sshd[12203]: pam_unix(sshd:session): session opened for user root by (uid=0)
Mar  2 01:40:03 soak-9 kernel: LNet: HW NUMA nodes: 2, HW CPU cores: 32, npartitions: 2
Mar  2 01:40:03 soak-9 kernel: alg: No test for adler32 (adler32-zlib)
Mar  2 01:40:04 soak-9 kernel: Lustre: Lustre: Build Version: 2.10.6_64_g56b299e
Mar  2 01:40:04 soak-9 kernel: LNet: Added LNI 192.168.1.109@o2ib [8/256/0/180]
Mar  2 01:40:05 soak-9 kernel: LDISKFS-fs (dm-1): mounted filesystem with ordered data mode. Opts: user_xattr,errors=remount-ro,user_xattr,no_mbcache,nodelalloc
Mar  2 01:40:06 soak-9 kernel: Lustre: soaked-MDT0001: Imperative Recovery enabled, recovery window shrunk from 300-900 down to 150-900
Mar  2 01:40:06 soak-9 kernel: LustreError: 12482:0:(llog_osd.c:978:llog_osd_next_block()) soaked-MDT0002-osp-MDT0001: missed desired record? 17 > 1
Mar  2 01:40:06 soak-9 kernel: LustreError: 12482:0:(lod_dev.c:419:lod_sub_recovery_thread()) soaked-MDT0002-osp-MDT0001 getting update log failed: rc = -2
Mar  2 01:40:06 soak-9 sshd[12203]: pam_unix(sshd:session): session closed for user root
Mar  2 01:40:06 soak-9 systemd-logind: Removed session 4.
Mar  2 01:40:06 soak-9 systemd: Removed slice User Slice of root.
Mar  2 01:40:12 soak-9 kernel: Lustre: soaked-MDT0001: Connection restored to 8ca4d6ac-9a33-9228-6f4a-bf7741f4f05f (at 192.168.1.117@o2ib)
Mar  2 01:40:12 soak-9 kernel: Lustre: 12484:0:(ldlm_lib.c:2061:target_recovery_overseer()) recovery is aborted, evict exports in recovery
Mar  2 01:40:12 soak-9 kernel: Lustre: soaked-MDT0001: disconnecting 26 stale clients
Comment by Sarah Liu [ 05/Mar/19 ]

After the system ran for about 48 hours, one MDS crashed after it was restarted; I think it hit LU-11781. Also attaching the console log of soak-10, soak-10-console.

soak-10

[ 6965.662564] device-mapper: multipath: Reinstating path 8:96.
[ 6965.669190] sd 7:0:0:44: rdac: array soak-netapp5660-1, ctlr 0, queueing MODE_SELECT command
[ 6966.303860] sd 7:0:0:44: rdac: array soak-netapp5660-1, ctlr 0, MODE_SELECT completed
[ 6970.938822] LustreError: 16808:0:(osd_oi.c:734:osd_oi_insert()) dm-6: the FID [0x280043a38:0xebde:0x0] is used by two objects: 163053849/1029500233 283115771/1029500235
[ 6970.955975] LNet: 12293:0:(o2iblnd_cb.c:396:kiblnd_handle_rx()) PUT_NACK from 192.168.1.109@o2ib
[ 6971.241968] Lustre: soaked-MDT0003-osp-MDT0002: Connection restored to 192.168.1.111@o2ib (at 192.168.1.111@o2ib)
[ 6972.374798] LNet: Service thread pid 17480 completed after 341.59s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
[ 7054.766137] BUG: unable to handle kernel NULL pointer dereference at           (null)
[ 7054.774915] IP: [<          (null)>]           (null)
[ 7054.780574] PGD 0 
[ 7054.782850] Oops: 0010 [#1] SMP 
[ 7054.786480] Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgc(OE) osd_ldiskfs(OE) ldiskfs(OE) lquota(OE) fid(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) rpcsec_gss_krb5 nfsv4 dns_resolver nfs lockd grace fscache rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_umad(OE) mlx5_ib(OE) ib_uverbs(OE) mlx5_core(OE) mlxfw(OE) mlx4_en(OE) sb_edac intel_powerclamp coretemp intel_rapl iosf_mbi kvm dm_round_robin irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul ipmi_ssif iTCO_wdt glue_helper ablk_helper iTCO_vendor_support sg cryptd pcspkr joydev mei_me ioatdma mei lpc_ich ipmi_si i2c_i801 ipmi_devintf ipmi_msghandler wmi dm_multipath dm_mod auth_rpcgss sunrpc ip_tables ext4 mbcache jbd2 sd_mod crc_t10dif crct10dif_generic mlx4_ib(OE) ib_core(OE) mgag200 drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops igb mlx4_core(OE) ttm mpt3sas ahci isci crct10dif_pclmul crct10dif_common libahci crc32c_intel ptp drm devlink libsas raid_class pps_core libata dca mlx_compat(OE) drm_panel_orientation_quirks scsi_transport_sas i2c_algo_bit
[ 7054.897558] CPU: 15 PID: 16809 Comm: mdt_out01_007 Kdump: loaded Tainted: G           OE  ------------   3.10.0-957.el7_lustre.x86_64 #1
[ 7054.911247] Hardware name: Intel Corporation S2600GZ ........../S2600GZ, BIOS SE5C600.86B.01.08.0003.022620131521 02/26/2013
[ 7054.923780] task: ffff9618ed955140 ti: ffff9618dc770000 task.ti: ffff9618dc770000
[ 7054.932139] RIP: 0010:[<0000000000000000>]  [<          (null)>]           (null)
[ 7054.940496] RSP: 0018:ffff9618dc773b98  EFLAGS: 00010246
[ 7054.946427] RAX: 0000000000000000 RBX: ffff961cdf4601e0 RCX: 0000000000000002
[ 7054.954392] RDX: ffffffffc12bf7e0 RSI: ffff9618dc773ba0 RDI: ffff961cdf4601e8
[ 7054.962358] RBP: ffff9618dc773bd8 R08: 0000000000000004 R09: 0000000000000000
[ 7054.970325] R10: 0000000000000001 R11: 00000000007fffff R12: ffff9619b9a5eb00
[ 7054.978291] R13: ffff961910518948 R14: ffff961cbd713800 R15: ffff961cdf4601e8
[ 7054.986260] FS:  0000000000000000(0000) GS:ffff961d2d9c0000(0000) knlGS:0000000000000000
[ 7054.995292] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 7055.001707] CR2: 0000000000000000 CR3: 000000003f810000 CR4: 00000000000607e0
[ 7055.009663] Call Trace:
[ 7055.012418]  [<ffffffffc12bde7e>] ? osd_ldiskfs_it_fill+0xbe/0x260 [osd_ldiskfs]
[ 7055.020706]  [<ffffffffc12be057>] osd_it_ea_load+0x37/0x100 [osd_ldiskfs]
[ 7055.028356]  [<ffffffffc0ca53c8>] dt_index_walk+0xf8/0x430 [obdclass]
[ 7055.035596]  [<ffffffffc0ca5700>] ? dt_index_walk+0x430/0x430 [obdclass]
[ 7055.043127]  [<ffffffffc0ca67dd>] dt_index_read+0x39d/0x6b0 [obdclass]
[ 7055.050525]  [<ffffffffc0f2bd32>] tgt_obd_idx_read+0x612/0x860 [ptlrpc]
[ 7055.057982]  [<ffffffffc0f2e38a>] tgt_request_handle+0x92a/0x1370 [ptlrpc]
[ 7055.065720]  [<ffffffffc0ed6e3b>] ptlrpc_server_handle_request+0x23b/0xaa0 [ptlrpc]
[ 7055.074343]  [<ffffffffc0ed3468>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
[ 7055.081929]  [<ffffffffa18d67c2>] ? default_wake_function+0x12/0x20
[ 7055.088943]  [<ffffffffa18cba9b>] ? __wake_up_common+0x5b/0x90
[ 7055.095532]  [<ffffffffc0eda582>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
[ 7055.102601]  [<ffffffffc0ed9af0>] ? ptlrpc_register_service+0xe30/0xe30 [ptlrpc]
[ 7055.110867]  [<ffffffffa18c1c31>] kthread+0xd1/0xe0
[ 7055.116316]  [<ffffffffa18c1b60>] ? insert_kthread_work+0x40/0x40
[ 7055.123126]  [<ffffffffa1f74c37>] ret_from_fork_nospec_begin+0x21/0x21
[ 7055.130416]  [<ffffffffa18c1b60>] ? insert_kthread_work+0x40/0x40
[ 7055.137216] Code:  Bad RIP value.
[ 7055.140957] RIP  [<          (null)>]           (null)
[ 7055.146714]  RSP <ffff9618dc773b98>
[ 7055.150607] CR2: 0000000000000000
[    0.000000] Initializing cgroup subsys cpuset
[    0.000000] Initializing cgroup subsys cpu
[    0.000000] Initializing cgroup subsys cpuacct
Comment by Gerrit Updater [ 08/Mar/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/34347/
Subject: LU-11943 llog: Reset current log on ENOSPC
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 7f4a1dd92a900a55cbb980bac3d20c32968334ec

Comment by Peter Jones [ 08/Mar/19 ]

Landed for 2.13

Comment by Gerrit Updater [ 19/Mar/19 ]

Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34459
Subject: LU-11943 llog: Reset current log on ENOSPC
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: 374a1414097fcaf83a27e7b107aa984809dba58a

Comment by Gerrit Updater [ 08/Apr/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/34459/
Subject: LU-11943 llog: Reset current log on ENOSPC
Project: fs/lustre-release
Branch: b2_12
Current Patch Set:
Commit: 4fd6fbbd0f0a0f45cd41b560c4711002e86747b6
