Details
-
Bug
-
Resolution: Fixed
-
Minor
-
Lustre 2.10.7
-
Lustre version=2.10.6_35_g2bc39a1 lustre-b2_10-ib build #98 EL7.6 mlnx DNE; soak has PFL enabled in this test
-
3
-
9223372036854775807
Description
In the first 4/5 hours running, soak seems working well, but after that, applications started to fail a lot with sth like
mdtest-1.9.3 was launched with 13 total task(s) on 6 node(s) Command line used: /mnt/soaked/bin/mdtest -d /mnt/soaked/soaktest/test/mdtestfpp/233652 -i 5 -n 158 -u Path: /mnt/soaked/soaktest/test/mdtestfpp FS: 83.4 TiB Used FS: 1.6% Inodes: 82.1 Mi Used Inodes: 0.1% 02/07/2019 18:13:12: Process 4(soak-27.spirit.whamcloud.com): FAILED in mdtest_stat, unable to stat file: Cannot send after transport endpoint shutdown 02/07/2019 18:13:29: Process 5(soak-28.spirit.whamcloud.com): FAILED in mdtest_stat, unable to stat file: Cannot send after transport endpoint shutdown slurmstepd: error: *** STEP 233652.0 ON soak-18 CANCELLED AT 2019-02-07T18:12:14 *** srun: Job step aborted: Waiting up to 32 seconds for job step to finish. srun: error: soak-28: task 5: Exited with exit code 1 srun: Terminating job step 233652.0
task 5 writing /mnt/soaked/soaktest/test/iorssf/233650/ssf WARNING: Task 5 requested transfer of 28199936 bytes, but transferred 2093056 bytes at offset 17483960320 WARNING: This file system requires support of partial write()s, in aiori-POSIX.c (line 272). WARNING: Requested xfer of 28199936 bytes, but xferred 2093056 bytes Only transferred 2093056 of 28199936 bytes ** error ** ERROR in aiori-POSIX.c (line 256): transfer failed. ERROR: Input/output error ** exiting ** srun: Job step aborted: Waiting up to 32 seconds for job step to finish. srun: error: soak-27: task 5: Exited with exit code 255 srun: Terminating job step 233650.0
Around the time when application failed, checking the OSS side, saw many lock time out error
Feb 7 01:02:00 soak-4 systemd-logind: New session 30 of user root. Feb 7 01:02:00 soak-4 systemd: Started Session 30 of user root. Feb 7 01:02:00 soak-4 sshd[97515]: pam_unix(sshd:session): session opened for user root by (uid=0) Feb 7 01:02:00 soak-4 sshd[97515]: pam_unix(sshd:session): session closed for user root Feb 7 01:02:00 soak-4 systemd-logind: Removed session 30. Feb 7 01:02:00 soak-4 systemd: Removed slice User Slice of root. Feb 7 01:02:09 soak-4 sshd[97553]: error: Could not load host key: /etc/ssh/ssh_host_dsa_key Feb 7 01:02:09 soak-4 sshd[97553]: Accepted publickey for root from 10.10.1.116 port 43696 ssh2: RSA SHA256:VGwjPuk53LIsLKjhGizbClh9X4HNRiAOs+XaQdK AWxM Feb 7 01:02:09 soak-4 systemd: Created slice User Slice of root. Feb 7 01:02:09 soak-4 systemd-logind: New session 31 of user root. Feb 7 01:02:09 soak-4 systemd: Started Session 31 of user root. Feb 7 01:02:09 soak-4 sshd[97553]: pam_unix(sshd:session): session opened for user root by (uid=0) Feb 7 01:02:10 soak-4 sshd[97553]: Received disconnect from 10.10.1.116 port 43696:11: disconnected by user Feb 7 01:02:10 soak-4 sshd[97553]: Disconnected from 10.10.1.116 port 43696 Feb 7 01:02:10 soak-4 sshd[97553]: pam_unix(sshd:session): session closed for user root Feb 7 01:02:10 soak-4 systemd-logind: Removed session 31. Feb 7 01:02:10 soak-4 systemd: Removed slice User Slice of root. Feb 7 01:03:03 soak-4 kernel: LustreError: 0:0:(ldlm_lockd.c:334:waiting_locks_callback()) ### lock callback timer expired after 151s: evicting client at 192.168.1.128@o2ib ns: filter-soaked-OST0004_UUID lock: ffff89923febd800/0x2a3d985a1228d475 lrc: 3/0,0 mode: PW/PW res: [0x400000402:0x4cf183:0x0].0x0 rrc: 17 type: EXT [0->18446744073709551615] (req 0->4095) flags: 0x60000000020020 nid: 192.168.1.128@o2ib remote: 0x5c14a0ad2708d24d expref: 6 pid: 27943 timeout: 4295139340 lvb_type: 0 Feb 7 01:03:19 soak-4 kernel: Lustre: soaked-OST0004: Connection restored to ffd42784-63d2-739c-d487-9f95d4a84e57 (at 192.168.1.128@o2ib) Feb 7 01:03:19 soak-4 kernel: Lustre: Skipped 2 previous similar messages Feb 7 01:03:44 soak-4 sshd[105089]: error: Could not load host key: /etc/ssh/ssh_host_dsa_key Feb 7 01:03:44 soak-4 sshd[105089]: Accepted publickey for root from 10.10.1.116 port 43960 ssh2: RSA SHA256:VGwjPuk53LIsLKjhGizbClh9X4HNRiAOs+XaQdKAWxM Feb 7 01:08:06 soak-4 kernel: LustreError: 0:0:(ldlm_lockd.c:334:waiting_locks_callback()) ### lock callback timer expired after 152s: evicting client at 192.168.1.127@o2ib ns: filter-soaked-OST0004_UUID lock: ffff89964ba94200/0x2a3d985a1228d47c lrc: 3/0,0 mode: PW/PW res: [0x400000402:0x4cf183:0x0].0x0 rrc: 14 type: EXT [0->18446744073709551615] (req 0->4095) flags: 0x60000000020020 nid: 192.168.1.127@o2ib remote: 0xd8fc8581f4c8de49 expref: 6 pid: 74474 timeout: 4295441140 lvb_type: 0 Feb 7 01:08:18 soak-4 kernel: Lustre: soaked-OST0004: Connection restored to 44cc5f6f-1cd3-3d43-f778-9409d7aeb164 (at 192.168.1.127@o2ib) Feb 7 01:08:46 soak-4 kernel: Lustre: 26413:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1549501726/real 1549501726] req@ffff89968a49b000 x1624769706421056/t0(0) o38->soaked-MDT0001-lwp-OST0008@192.168.1.109@o2ib:12/10 lens 520/544 e 0 to 1 dl 1549501742 ref 1 fl Rpc:eXN/0/ffffffff rc 0/-1 Feb 7 01:08:46 soak-4 kernel: Lustre: 26413:0:(client.c:2114:ptlrpc_expire_one_request()) Skipped 3 previous similar messages Feb 7 01:09:36 soak-4 kernel: Lustre: 26413:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1549501776/real 1549501776] req@ffff89968a498f00 x1624769706421792/t0(0) o38->soaked-MDT0001-lwp-OST000c@192.168.1.109@o2ib:12/10 lens 520/544 e 0 to 1 dl 1549501797 ref 1 fl Rpc:eXN/0/ffffffff rc 0/-1 Feb 7 01:09:36 soak-4 kernel: Lustre: 26413:0:(client.c:2114:ptlrpc_expire_one_request()) Skipped 3 previous similar messages Feb 7 01:09:43 soak-4 kernel: Lustre: soaked-OST0004: haven't heard from client acbada78-7ef5-2938-ae9f-1535dd1eb89f (at 192.168.1.121@o2ib) in 229 seconds. I think it's dead, and I am evicting it. exp ffff89923fa98c00, cur 1549501783 expire 1549501633 last 1549501554 Feb 7 01:10:01 soak-4 systemd: Created slice User Slice of root. Feb 7 01:10:01 soak-4 systemd: Started Session 44 of user root. Feb 7 01:10:01 soak-4 CROND[187082]: (root) CMD (/usr/lib64/sa/sa1 1 1)
Later on MDS side, saw following error
Feb 7 09:15:03 soak-8 kernel: LNet: Service thread pid 13609 was inactive for 200.68s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes: Feb 7 09:15:03 soak-8 kernel: Pid: 13609, comm: mdt00_013 3.10.0-957.el7_lustre.x86_64 #1 SMP Mon Jan 7 20:06:41 UTC 2019 Feb 7 09:15:03 soak-8 kernel: Call Trace: Feb 7 09:15:03 soak-8 kernel: [<ffffffffc0f9ba11>] ldlm_completion_ast+0x5b1/0x920 [ptlrpc] Feb 7 09:15:03 soak-8 kernel: [<ffffffffc0f9dabb>] ldlm_cli_enqueue_fini+0x93b/0xdc0 [ptlrpc] Feb 7 09:15:03 soak-8 kernel: [<ffffffffc0fa0672>] ldlm_cli_enqueue+0x6c2/0x810 [ptlrpc] Feb 7 09:15:03 soak-8 kernel: [<ffffffffc1640622>] osp_md_object_lock+0x172/0x2e0 [osp] Feb 7 09:15:03 soak-8 kernel: [<ffffffffc1574ea3>] lod_object_lock+0xf3/0x950 [lod] Feb 7 09:15:03 soak-8 kernel: [<ffffffffc15e448e>] mdd_object_lock+0x3e/0xe0 [mdd] Feb 7 09:15:03 soak-8 kernel: [<ffffffffc1487fc5>] mdt_remote_object_lock+0x1e5/0x710 [mdt] Feb 7 09:15:03 soak-8 kernel: [<ffffffffc1489516>] mdt_object_lock_internal+0x166/0x300 [mdt] Feb 7 09:15:03 soak-8 kernel: [<ffffffffc14896f0>] mdt_reint_object_lock+0x20/0x60 [mdt] Feb 7 09:15:03 soak-8 kernel: [<ffffffffc14a05f4>] mdt_reint_link+0x7e4/0xc30 [mdt] Feb 7 09:15:03 soak-8 kernel: [<ffffffffc14a2c73>] mdt_reint_rec+0x83/0x210 [mdt] Feb 7 09:15:03 soak-8 kernel: [<ffffffffc148418b>] mdt_reint_internal+0x5fb/0x9c0 [mdt] Feb 7 09:15:03 soak-8 kernel: [<ffffffffc148ff77>] mdt_reint+0x67/0x140 [mdt] Feb 7 09:15:03 soak-8 kernel: [<ffffffffc10342aa>] tgt_request_handle+0x92a/0x1370 [ptlrpc] Feb 7 09:15:03 soak-8 kernel: [<ffffffffc0fdcd5b>] ptlrpc_server_handle_request+0x23b/0xaa0 [ptlrpc] Feb 7 09:15:03 soak-8 kernel: [<ffffffffc0fe04a2>] ptlrpc_main+0xa92/0x1e40 [ptlrpc] Feb 7 09:15:03 soak-8 kernel: [<ffffffff884c1c31>] kthread+0xd1/0xe0 Feb 7 09:15:03 soak-8 kernel: [<ffffffff88b74c37>] ret_from_fork_nospec_end+0x0/0x39 Feb 7 09:15:03 soak-8 kernel: [<ffffffffffffffff>] 0xffffffffffffffff Feb 7 09:15:03 soak-8 kernel: LustreError: dumping log to /tmp/lustre-log.1549530903.13609 Feb 7 09:15:05 soak-8 kernel: LNet: Service thread pid 13582 was inactive for 202.07s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes: Feb 7 09:15:05 soak-8 kernel: Pid: 13582, comm: mdt01_007 3.10.0-957.el7_lustre.x86_64 #1 SMP Mon Jan 7 20:06:41 UTC 2019 Feb 7 09:15:05 soak-8 kernel: Call Trace: Feb 7 09:15:05 soak-8 kernel: [<ffffffffc0f9ba11>] ldlm_completion_ast+0x5b1/0x920 [ptlrpc] Feb 7 09:15:05 soak-8 kernel: [<ffffffffc0f9dabb>] ldlm_cli_enqueue_fini+0x93b/0xdc0 [ptlrpc] Feb 7 09:15:05 soak-8 kernel: [<ffffffffc0fa0672>] ldlm_cli_enqueue+0x6c2/0x810 [ptlrpc] Feb 7 09:15:05 soak-8 kernel: [<ffffffffc1640622>] osp_md_object_lock+0x172/0x2e0 [osp] Feb 7 09:15:05 soak-8 kernel: [<ffffffffc1574ea3>] lod_object_lock+0xf3/0x950 [lod] Feb 7 09:15:05 soak-8 kernel: [<ffffffffc15e448e>] mdd_object_lock+0x3e/0xe0 [mdd] Feb 7 09:15:05 soak-8 kernel: [<ffffffffc1487fc5>] mdt_remote_object_lock+0x1e5/0x710 [mdt] Feb 7 09:15:05 soak-8 kernel: [<ffffffffc1489516>] mdt_object_lock_internal+0x166/0x300 [mdt] Feb 7 09:15:05 soak-8 kernel: [<ffffffffc14896f0>] mdt_reint_object_lock+0x20/0x60 [mdt] Feb 7 09:15:05 soak-8 kernel: [<ffffffffc14a05f4>] mdt_reint_link+0x7e4/0xc30 [mdt] Feb 7 09:15:05 soak-8 kernel: [<ffffffffc14a2c73>] mdt_reint_rec+0x83/0x210 [mdt] Feb 7 09:15:05 soak-8 kernel: [<ffffffffc148418b>] mdt_reint_internal+0x5fb/0x9c0 [mdt] Feb 7 09:15:05 soak-8 kernel: [<ffffffffc148ff77>] mdt_reint+0x67/0x140 [mdt] Feb 7 09:15:05 soak-8 kernel: [<ffffffffc10342aa>] tgt_request_handle+0x92a/0x1370 [ptlrpc] Feb 7 09:15:05 soak-8 kernel: [<ffffffffc0fdcd5b>] ptlrpc_server_handle_request+0x23b/0xaa0 [ptlrpc] Feb 7 09:15:05 soak-8 kernel: [<ffffffffc0fe04a2>] ptlrpc_main+0xa92/0x1e40 [ptlrpc] Feb 7 09:15:05 soak-8 kernel: [<ffffffff884c1c31>] kthread+0xd1/0xe0 Feb 7 09:15:05 soak-8 kernel: [<ffffffff88b74c37>] ret_from_fork_nospec_end+0x0/0x39 Feb 7 09:15:05 soak-8 kernel: [<ffffffffffffffff>] 0xffffffffffffffff Feb 7 09:15:05 soak-8 kernel: Pid: 18685, comm: mdt00_026 3.10.0-957.el7_lustre.x86_64 #1 SMP Mon Jan 7 20:06:41 UTC 2019 Feb 7 09:15:05 soak-8 kernel: Call Trace: Feb 7 09:15:05 soak-8 kernel: [<ffffffffc0f9ba11>] ldlm_completion_ast+0x5b1/0x920 [ptlrpc] Feb 7 09:15:05 soak-8 kernel: [<ffffffffc0f9dabb>] ldlm_cli_enqueue_fini+0x93b/0xdc0 [ptlrpc] Feb 7 09:15:05 soak-8 kernel: [<ffffffffc0fa0672>] ldlm_cli_enqueue+0x6c2/0x810 [ptlrpc] Feb 7 09:15:05 soak-8 kernel: [<ffffffffc1640622>] osp_md_object_lock+0x172/0x2e0 [osp] Feb 7 09:15:05 soak-8 kernel: [<ffffffffc1574ea3>] lod_object_lock+0xf3/0x950 [lod] Feb 7 09:15:05 soak-8 kernel: [<ffffffffc15e448e>] mdd_object_lock+0x3e/0xe0 [mdd] Feb 7 09:15:05 soak-8 kernel: [<ffffffffc1487fc5>] mdt_remote_object_lock+0x1e5/0x710 [mdt] Feb 7 09:15:05 soak-8 kernel: [<ffffffffc1489516>] mdt_object_lock_internal+0x166/0x300 [mdt] Feb 7 09:15:05 soak-8 kernel: [<ffffffffc14896f0>] mdt_reint_object_lock+0x20/0x60 [mdt] Feb 7 09:15:05 soak-8 kernel: [<ffffffffc14a05f4>] mdt_reint_link+0x7e4/0xc30 [mdt] Feb 7 09:15:05 soak-8 kernel: [<ffffffffc14a2c73>] mdt_reint_rec+0x83/0x210 [mdt] Feb 7 09:15:05 soak-8 kernel: [<ffffffffc148418b>] mdt_reint_internal+0x5fb/0x9c0 [mdt] Feb 7 09:15:05 soak-8 kernel: [<ffffffffc148ff77>] mdt_reint+0x67/0x140 [mdt] Feb 7 09:15:05 soak-8 kernel: [<ffffffffc10342aa>] tgt_request_handle+0x92a/0x1370 [ptlrpc] Feb 7 09:15:05 soak-8 kernel: [<ffffffffc0fdcd5b>] ptlrpc_server_handle_request+0x23b/0xaa0 [ptlrpc] Feb 7 09:15:05 soak-8 kernel: [<ffffffffc0fe04a2>] ptlrpc_main+0xa92/0x1e40 [ptlrpc] Feb 7 09:15:05 soak-8 kernel: [<ffffffff884c1c31>] kthread+0xd1/0xe0 Feb 7 09:15:05 soak-8 kernel: [<ffffffff88b74c37>] ret_from_fork_nospec_end+0x0/0x39 Feb 7 09:15:05 soak-8 kernel: [<ffffffffffffffff>] 0xffffffffffffffff Feb 7 09:15:14 soak-8 sshd[78874]: error: Could not load host key: /etc/ssh/ssh_host_dsa_key Feb 7 09:15:14 soak-8 sshd[78874]: Accepted publickey for root from 10.10.1.116 port 59708 ssh2: RSA SHA256:VGwjPuk53LIsLKjhGizbClh9X4HNRiAOs+XaQdKAWxM Feb 7 09:16:40 soak-8 systemd: Removed slice User Slice of root. Feb 7 09:16:43 soak-8 kernel: LustreError: 13609:0:(ldlm_request.c:148:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1549530702, 300s ago), entering recovery for soaked-MDT0001_UUID@192.168.1.108@o2ib ns: soaked-MDT0001-osp-MDT0000 lock: ffff92322f761800/0x7ebebf899932ee34 lrc: 4/0,1 mode: --/EX res: [0x24000cf43:0x16f41:0x0].0x0 bits 0x2 rrc: 4 type: IBT flags: 0x1000001000000 nid: local remote: 0xc7baaf97e1b53a16 expref: -99 pid: 13609 timeout: 0 lvb_type: 0 Feb 7 09:16:43 soak-8 kernel: LustreError: 13609:0:(ldlm_request.c:148:ldlm_expired_completion_wait()) Skipped 2 previous similar messages Feb 7 09:16:55 soak-8 sshd[79087]: error: Could not load host key: /etc/ssh/ssh_host_dsa_key Feb 7 09:16:56 soak-8 sshd[79087]: Accepted publickey for root from 10.10.1.116 port 59970 ssh2: RSA SHA256:VGwjPuk53LIsLKjhGizbClh9X4HNRiAOs+XaQdKAWxM