
LU-11943: Many input/output errors after soak has been running for a couple of hours

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version: Lustre 2.10.7
    • Environment: Lustre version=2.10.6_35_g2bc39a1 lustre-b2_10-ib build #98 EL7.6 mlnx DNE; soak has PFL enabled in this test
    • Severity: 3

    Description

      For the first four or five hours of the run, soak appeared to be working well, but after that applications started to fail frequently with errors like the following:

      mdtest-1.9.3 was launched with 13 total task(s) on 6 node(s)
      Command line used: /mnt/soaked/bin/mdtest -d /mnt/soaked/soaktest/test/mdtestfpp/233652 -i 5 -n 158 -u
      Path: /mnt/soaked/soaktest/test/mdtestfpp
      FS: 83.4 TiB   Used FS: 1.6%   Inodes: 82.1 Mi   Used Inodes: 0.1%
      02/07/2019 18:13:12: Process 4(soak-27.spirit.whamcloud.com): FAILED in mdtest_stat, unable to stat file: Cannot send after transport endpoint shutdown
      02/07/2019 18:13:29: Process 5(soak-28.spirit.whamcloud.com): FAILED in mdtest_stat, unable to stat file: Cannot send after transport endpoint shutdown
      slurmstepd: error: *** STEP 233652.0 ON soak-18 CANCELLED AT 2019-02-07T18:12:14 ***
      srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
      srun: error: soak-28: task 5: Exited with exit code 1
      srun: Terminating job step 233652.0
      
      task 5 writing /mnt/soaked/soaktest/test/iorssf/233650/ssf
      WARNING: Task 5 requested transfer of 28199936 bytes,
               but transferred 2093056 bytes at offset 17483960320
      WARNING: This file system requires support of partial write()s, in aiori-POSIX.c (line 272).
      WARNING: Requested xfer of 28199936 bytes, but xferred 2093056 bytes
      Only transferred 2093056 of 28199936 bytes
      ** error **
      ERROR in aiori-POSIX.c (line 256): transfer failed.
      ERROR: Input/output error
      ** exiting **
      srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
      srun: error: soak-27: task 5: Exited with exit code 255
      srun: Terminating job step 233650.0
      

      Around the time the applications failed, the OSS side showed many lock callback timeout errors:

      Feb  7 01:02:00 soak-4 systemd-logind: New session 30 of user root.
      Feb  7 01:02:00 soak-4 systemd: Started Session 30 of user root.
      Feb  7 01:02:00 soak-4 sshd[97515]: pam_unix(sshd:session): session opened for user root by (uid=0)
      Feb  7 01:02:00 soak-4 sshd[97515]: pam_unix(sshd:session): session closed for user root
      Feb  7 01:02:00 soak-4 systemd-logind: Removed session 30.
      Feb  7 01:02:00 soak-4 systemd: Removed slice User Slice of root.
      Feb  7 01:02:09 soak-4 sshd[97553]: error: Could not load host key: /etc/ssh/ssh_host_dsa_key
      Feb  7 01:02:09 soak-4 sshd[97553]: Accepted publickey for root from 10.10.1.116 port 43696 ssh2: RSA SHA256:VGwjPuk53LIsLKjhGizbClh9X4HNRiAOs+XaQdKAWxM
      Feb  7 01:02:09 soak-4 systemd: Created slice User Slice of root.
      Feb  7 01:02:09 soak-4 systemd-logind: New session 31 of user root.
      Feb  7 01:02:09 soak-4 systemd: Started Session 31 of user root.
      Feb  7 01:02:09 soak-4 sshd[97553]: pam_unix(sshd:session): session opened for user root by (uid=0)
      Feb  7 01:02:10 soak-4 sshd[97553]: Received disconnect from 10.10.1.116 port 43696:11: disconnected by user
      Feb  7 01:02:10 soak-4 sshd[97553]: Disconnected from 10.10.1.116 port 43696
      Feb  7 01:02:10 soak-4 sshd[97553]: pam_unix(sshd:session): session closed for user root
      Feb  7 01:02:10 soak-4 systemd-logind: Removed session 31.
      Feb  7 01:02:10 soak-4 systemd: Removed slice User Slice of root.
      Feb  7 01:03:03 soak-4 kernel: LustreError: 0:0:(ldlm_lockd.c:334:waiting_locks_callback()) ### lock callback timer expired after 151s: evicting client at 192.168.1.128@o2ib  ns: filter-soaked-OST0004_UUID lock: ffff89923febd800/0x2a3d985a1228d475 lrc: 3/0,0 mode: PW/PW res: [0x400000402:0x4cf183:0x0].0x0 rrc: 17 type: EXT [0->18446744073709551615] (req 0->4095) flags: 0x60000000020020 nid: 192.168.1.128@o2ib remote: 0x5c14a0ad2708d24d expref: 6 pid: 27943 timeout: 4295139340 lvb_type: 0
      Feb  7 01:03:19 soak-4 kernel: Lustre: soaked-OST0004: Connection restored to ffd42784-63d2-739c-d487-9f95d4a84e57 (at 192.168.1.128@o2ib)
      Feb  7 01:03:19 soak-4 kernel: Lustre: Skipped 2 previous similar messages
      Feb  7 01:03:44 soak-4 sshd[105089]: error: Could not load host key: /etc/ssh/ssh_host_dsa_key
      Feb  7 01:03:44 soak-4 sshd[105089]: Accepted publickey for root from 10.10.1.116 port 43960 ssh2: RSA SHA256:VGwjPuk53LIsLKjhGizbClh9X4HNRiAOs+XaQdKAWxM
      
      Feb  7 01:08:06 soak-4 kernel: LustreError: 0:0:(ldlm_lockd.c:334:waiting_locks_callback()) ### lock callback timer expired after 152s: evicting client at 192.168.1.127@o2ib  ns: filter-soaked-OST0004_UUID lock: ffff89964ba94200/0x2a3d985a1228d47c lrc: 3/0,0 mode: PW/PW res: [0x400000402:0x4cf183:0x0].0x0 rrc: 14 type: EXT [0->18446744073709551615] (req 0->4095) flags: 0x60000000020020 nid: 192.168.1.127@o2ib remote: 0xd8fc8581f4c8de49 expref: 6 pid: 74474 timeout: 4295441140 lvb_type: 0
      Feb  7 01:08:18 soak-4 kernel: Lustre: soaked-OST0004: Connection restored to 44cc5f6f-1cd3-3d43-f778-9409d7aeb164 (at 192.168.1.127@o2ib)
      Feb  7 01:08:46 soak-4 kernel: Lustre: 26413:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1549501726/real 1549501726]  req@ffff89968a49b000 x1624769706421056/t0(0) o38->soaked-MDT0001-lwp-OST0008@192.168.1.109@o2ib:12/10 lens 520/544 e 0 to 1 dl 1549501742 ref 1 fl Rpc:eXN/0/ffffffff rc 0/-1
      Feb  7 01:08:46 soak-4 kernel: Lustre: 26413:0:(client.c:2114:ptlrpc_expire_one_request()) Skipped 3 previous similar messages
      Feb  7 01:09:36 soak-4 kernel: Lustre: 26413:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1549501776/real 1549501776]  req@ffff89968a498f00 x1624769706421792/t0(0) o38->soaked-MDT0001-lwp-OST000c@192.168.1.109@o2ib:12/10 lens 520/544 e 0 to 1 dl 1549501797 ref 1 fl Rpc:eXN/0/ffffffff rc 0/-1
      Feb  7 01:09:36 soak-4 kernel: Lustre: 26413:0:(client.c:2114:ptlrpc_expire_one_request()) Skipped 3 previous similar messages
      Feb  7 01:09:43 soak-4 kernel: Lustre: soaked-OST0004: haven't heard from client acbada78-7ef5-2938-ae9f-1535dd1eb89f (at 192.168.1.121@o2ib) in 229 seconds. I think it's dead, and I am evicting it. exp ffff89923fa98c00, cur 1549501783 expire 1549501633 last 1549501554
      Feb  7 01:10:01 soak-4 systemd: Created slice User Slice of root.
      Feb  7 01:10:01 soak-4 systemd: Started Session 44 of user root.
      Feb  7 01:10:01 soak-4 CROND[187082]: (root) CMD (/usr/lib64/sa/sa1 1 1)
      

      Later, the MDS side showed the following errors:

      Feb  7 09:15:03 soak-8 kernel: LNet: Service thread pid 13609 was inactive for 200.68s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      Feb  7 09:15:03 soak-8 kernel: Pid: 13609, comm: mdt00_013 3.10.0-957.el7_lustre.x86_64 #1 SMP Mon Jan 7 20:06:41 UTC 2019
      Feb  7 09:15:03 soak-8 kernel: Call Trace:
      Feb  7 09:15:03 soak-8 kernel: [<ffffffffc0f9ba11>] ldlm_completion_ast+0x5b1/0x920 [ptlrpc]
      Feb  7 09:15:03 soak-8 kernel: [<ffffffffc0f9dabb>] ldlm_cli_enqueue_fini+0x93b/0xdc0 [ptlrpc]
      Feb  7 09:15:03 soak-8 kernel: [<ffffffffc0fa0672>] ldlm_cli_enqueue+0x6c2/0x810 [ptlrpc]
      Feb  7 09:15:03 soak-8 kernel: [<ffffffffc1640622>] osp_md_object_lock+0x172/0x2e0 [osp]
      Feb  7 09:15:03 soak-8 kernel: [<ffffffffc1574ea3>] lod_object_lock+0xf3/0x950 [lod]
      Feb  7 09:15:03 soak-8 kernel: [<ffffffffc15e448e>] mdd_object_lock+0x3e/0xe0 [mdd]
      Feb  7 09:15:03 soak-8 kernel: [<ffffffffc1487fc5>] mdt_remote_object_lock+0x1e5/0x710 [mdt]
      Feb  7 09:15:03 soak-8 kernel: [<ffffffffc1489516>] mdt_object_lock_internal+0x166/0x300 [mdt]
      Feb  7 09:15:03 soak-8 kernel: [<ffffffffc14896f0>] mdt_reint_object_lock+0x20/0x60 [mdt]
      Feb  7 09:15:03 soak-8 kernel: [<ffffffffc14a05f4>] mdt_reint_link+0x7e4/0xc30 [mdt]
      Feb  7 09:15:03 soak-8 kernel: [<ffffffffc14a2c73>] mdt_reint_rec+0x83/0x210 [mdt]
      Feb  7 09:15:03 soak-8 kernel: [<ffffffffc148418b>] mdt_reint_internal+0x5fb/0x9c0 [mdt]
      Feb  7 09:15:03 soak-8 kernel: [<ffffffffc148ff77>] mdt_reint+0x67/0x140 [mdt]
      Feb  7 09:15:03 soak-8 kernel: [<ffffffffc10342aa>] tgt_request_handle+0x92a/0x1370 [ptlrpc]
      Feb  7 09:15:03 soak-8 kernel: [<ffffffffc0fdcd5b>] ptlrpc_server_handle_request+0x23b/0xaa0 [ptlrpc]
      Feb  7 09:15:03 soak-8 kernel: [<ffffffffc0fe04a2>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
      Feb  7 09:15:03 soak-8 kernel: [<ffffffff884c1c31>] kthread+0xd1/0xe0
      Feb  7 09:15:03 soak-8 kernel: [<ffffffff88b74c37>] ret_from_fork_nospec_end+0x0/0x39
      Feb  7 09:15:03 soak-8 kernel: [<ffffffffffffffff>] 0xffffffffffffffff
      Feb  7 09:15:03 soak-8 kernel: LustreError: dumping log to /tmp/lustre-log.1549530903.13609
      Feb  7 09:15:05 soak-8 kernel: LNet: Service thread pid 13582 was inactive for 202.07s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      Feb  7 09:15:05 soak-8 kernel: Pid: 13582, comm: mdt01_007 3.10.0-957.el7_lustre.x86_64 #1 SMP Mon Jan 7 20:06:41 UTC 2019
      Feb  7 09:15:05 soak-8 kernel: Call Trace:
      Feb  7 09:15:05 soak-8 kernel: [<ffffffffc0f9ba11>] ldlm_completion_ast+0x5b1/0x920 [ptlrpc]
      Feb  7 09:15:05 soak-8 kernel: [<ffffffffc0f9dabb>] ldlm_cli_enqueue_fini+0x93b/0xdc0 [ptlrpc]
      Feb  7 09:15:05 soak-8 kernel: [<ffffffffc0fa0672>] ldlm_cli_enqueue+0x6c2/0x810 [ptlrpc]
      Feb  7 09:15:05 soak-8 kernel: [<ffffffffc1640622>] osp_md_object_lock+0x172/0x2e0 [osp]
      Feb  7 09:15:05 soak-8 kernel: [<ffffffffc1574ea3>] lod_object_lock+0xf3/0x950 [lod]
      Feb  7 09:15:05 soak-8 kernel: [<ffffffffc15e448e>] mdd_object_lock+0x3e/0xe0 [mdd]
      Feb  7 09:15:05 soak-8 kernel: [<ffffffffc1487fc5>] mdt_remote_object_lock+0x1e5/0x710 [mdt]
      Feb  7 09:15:05 soak-8 kernel: [<ffffffffc1489516>] mdt_object_lock_internal+0x166/0x300 [mdt]
      Feb  7 09:15:05 soak-8 kernel: [<ffffffffc14896f0>] mdt_reint_object_lock+0x20/0x60 [mdt]
      Feb  7 09:15:05 soak-8 kernel: [<ffffffffc14a05f4>] mdt_reint_link+0x7e4/0xc30 [mdt]
      Feb  7 09:15:05 soak-8 kernel: [<ffffffffc14a2c73>] mdt_reint_rec+0x83/0x210 [mdt]
      Feb  7 09:15:05 soak-8 kernel: [<ffffffffc148418b>] mdt_reint_internal+0x5fb/0x9c0 [mdt]
      Feb  7 09:15:05 soak-8 kernel: [<ffffffffc148ff77>] mdt_reint+0x67/0x140 [mdt]
      Feb  7 09:15:05 soak-8 kernel: [<ffffffffc10342aa>] tgt_request_handle+0x92a/0x1370 [ptlrpc]
      Feb  7 09:15:05 soak-8 kernel: [<ffffffffc0fdcd5b>] ptlrpc_server_handle_request+0x23b/0xaa0 [ptlrpc]
      Feb  7 09:15:05 soak-8 kernel: [<ffffffffc0fe04a2>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
      Feb  7 09:15:05 soak-8 kernel: [<ffffffff884c1c31>] kthread+0xd1/0xe0
      Feb  7 09:15:05 soak-8 kernel: [<ffffffff88b74c37>] ret_from_fork_nospec_end+0x0/0x39
      Feb  7 09:15:05 soak-8 kernel: [<ffffffffffffffff>] 0xffffffffffffffff
      Feb  7 09:15:05 soak-8 kernel: Pid: 18685, comm: mdt00_026 3.10.0-957.el7_lustre.x86_64 #1 SMP Mon Jan 7 20:06:41 UTC 2019
      Feb  7 09:15:05 soak-8 kernel: Call Trace:
      Feb  7 09:15:05 soak-8 kernel: [<ffffffffc0f9ba11>] ldlm_completion_ast+0x5b1/0x920 [ptlrpc]
      Feb  7 09:15:05 soak-8 kernel: [<ffffffffc0f9dabb>] ldlm_cli_enqueue_fini+0x93b/0xdc0 [ptlrpc]
      Feb  7 09:15:05 soak-8 kernel: [<ffffffffc0fa0672>] ldlm_cli_enqueue+0x6c2/0x810 [ptlrpc]
      Feb  7 09:15:05 soak-8 kernel: [<ffffffffc1640622>] osp_md_object_lock+0x172/0x2e0 [osp]
      Feb  7 09:15:05 soak-8 kernel: [<ffffffffc1574ea3>] lod_object_lock+0xf3/0x950 [lod]
      Feb  7 09:15:05 soak-8 kernel: [<ffffffffc15e448e>] mdd_object_lock+0x3e/0xe0 [mdd]
      Feb  7 09:15:05 soak-8 kernel: [<ffffffffc1487fc5>] mdt_remote_object_lock+0x1e5/0x710 [mdt]
      Feb  7 09:15:05 soak-8 kernel: [<ffffffffc1489516>] mdt_object_lock_internal+0x166/0x300 [mdt]
      Feb  7 09:15:05 soak-8 kernel: [<ffffffffc14896f0>] mdt_reint_object_lock+0x20/0x60 [mdt]
      Feb  7 09:15:05 soak-8 kernel: [<ffffffffc14a05f4>] mdt_reint_link+0x7e4/0xc30 [mdt]
      Feb  7 09:15:05 soak-8 kernel: [<ffffffffc14a2c73>] mdt_reint_rec+0x83/0x210 [mdt]
      Feb  7 09:15:05 soak-8 kernel: [<ffffffffc148418b>] mdt_reint_internal+0x5fb/0x9c0 [mdt]
      Feb  7 09:15:05 soak-8 kernel: [<ffffffffc148ff77>] mdt_reint+0x67/0x140 [mdt]
      Feb  7 09:15:05 soak-8 kernel: [<ffffffffc10342aa>] tgt_request_handle+0x92a/0x1370 [ptlrpc]
      Feb  7 09:15:05 soak-8 kernel: [<ffffffffc0fdcd5b>] ptlrpc_server_handle_request+0x23b/0xaa0 [ptlrpc]
      Feb  7 09:15:05 soak-8 kernel: [<ffffffffc0fe04a2>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
      Feb  7 09:15:05 soak-8 kernel: [<ffffffff884c1c31>] kthread+0xd1/0xe0
      Feb  7 09:15:05 soak-8 kernel: [<ffffffff88b74c37>] ret_from_fork_nospec_end+0x0/0x39
      Feb  7 09:15:05 soak-8 kernel: [<ffffffffffffffff>] 0xffffffffffffffff
      Feb  7 09:15:14 soak-8 sshd[78874]: error: Could not load host key: /etc/ssh/ssh_host_dsa_key
      Feb  7 09:15:14 soak-8 sshd[78874]: Accepted publickey for root from 10.10.1.116 port 59708 ssh2: RSA SHA256:VGwjPuk53LIsLKjhGizbClh9X4HNRiAOs+XaQdKAWxM
      Feb  7 09:16:40 soak-8 systemd: Removed slice User Slice of root.
      
      Feb  7 09:16:43 soak-8 kernel: LustreError: 13609:0:(ldlm_request.c:148:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1549530702, 300s ago), entering recovery for soaked-MDT0001_UUID@192.168.1.108@o2ib ns: soaked-MDT0001-osp-MDT0000 lock: ffff92322f761800/0x7ebebf899932ee34 lrc: 4/0,1 mode: --/EX res: [0x24000cf43:0x16f41:0x0].0x0 bits 0x2 rrc: 4 type: IBT flags: 0x1000001000000 nid: local remote: 0xc7baaf97e1b53a16 expref: -99 pid: 13609 timeout: 0 lvb_type: 0
      Feb  7 09:16:43 soak-8 kernel: LustreError: 13609:0:(ldlm_request.c:148:ldlm_expired_completion_wait()) Skipped 2 previous similar messages
      Feb  7 09:16:55 soak-8 sshd[79087]: error: Could not load host key: /etc/ssh/ssh_host_dsa_key
      Feb  7 09:16:56 soak-8 sshd[79087]: Accepted publickey for root from 10.10.1.116 port 59970 ssh2: RSA SHA256:VGwjPuk53LIsLKjhGizbClh9X4HNRiAOs+XaQdKAWxM
      

      Attachments

        1. 240435-mdtestfpp.out
          14 kB
        2. console-soak-10.log-20190224.gz
          515 kB
        3. console-soak-23.log-20190224.gz
          121 kB
        4. soak-10.log-console
          1.13 MB
        5. sys-soak-10.log-20190224.gz
          1.58 MB
        6. sys-soak-23.log-20190224.gz
          1.04 MB

        Issue Links

          Activity

            [LU-11943] Many input/output errors after soak has been running for a couple of hours

            Changed the fixes associated with this ticket.

            Master:
            https://review.whamcloud.com/#/c/34347/

            b2_10:

            https://review.whamcloud.com/#/c/34346/

             

            pfarrell Patrick Farrell (Inactive) added a comment

            Suggested fix is here:
            https://review.whamcloud.com/34346

            I'll push it for master shortly.

            pfarrell Patrick Farrell (Inactive) added a comment

            We are now up to 54 failovers without an abort.  I'll update the other bug - I'm not sure how, but that patch is causing a lot of trouble.

            pfarrell Patrick Farrell (Inactive) added a comment
            sarah Sarah Liu added a comment -

            Review build #4213 has been running for 20 hours; the fail/pass count is 19 fail / 384 pass, and only simul failed, due to a timeout.

            No crashes on the servers, but one client hit a kernel NULL pointer dereference.
            soak-40 console

            soak-40 login: [   60.771568] IPv6: ADDRCONF(NETDEV_CHANGE): ib0: link becomes ready
            [ 4976.239443] LNet: HW NUMA nodes: 2, HW CPU cores: 88, npartitions: 2
            [ 4976.252443] alg: No test for adler32 (adler32-zlib)
            [ 4977.043702] Lustre: Lustre: Build Version: 2.10.6_64_g1307d8a
            [ 4977.175814] LNet: Added LNI 192.168.1.140@o2ib [8/256/0/180]
            [ 8232.416484] LustreError: 11-0: soaked-OST000c-osc-ffff9847cefdd800: operation ost_connect to node 192.168.1.104@o2ib failed: rc = -16
            [ 8232.431730] Lustre: Mounted soaked-client
            [10013.355513] perf: interrupt took too long (3164 > 3156), lowering kernel.perf_event_max_sample_rate to 63000
            [11889.420578] Lustre: 14411:0:(client.c:2116:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1551302693/real 0]  req@ffff9840c6e6b300 x1626651565453584/t0(0) o400->MGC192.168.1.108@o2ib@192.168.1.108@o2ib:26/25 lens 224/224 e 0 to 1 dl 1551302700 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
            [11889.451642] LustreError: 166-1: MGC192.168.1.108@o2ib: Connection to MGS (at 192.168.1.108@o2ib) was lost; in progress operations using this service will fail
            [11895.468004] Lustre: 14364:0:(client.c:2116:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1551302700/real 0]  req@ffff9840c6e6e600 x1626651565467584/t0(0) o250->MGC192.168.1.108@o2ib@192.168.1.108@o2ib:26/25 lens 520/544 e 0 to 1 dl 1551302706 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
            [11907.508055] BUG: unable to handle kernel NULL pointer dereference at 0000000000000028
            [11907.516881] IP: [<ffffffffc099d920>] kiblnd_connect_peer+0x70/0x670 [ko2iblnd]
            [11907.525014] PGD 0 
            [11907.527286] Oops: 0000 [#1] SMP 
            [11907.530934] Modules linked in: osc(OE) mgc(OE) lustre(OE) lmv(OE) fld(OE) mdc(OE) fid(OE) lov(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) rpcsec_gss_krb5 nfsv4 dns_resolver nfs lockd grace fscache rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_umad(OE) mlx5_ib(OE) ib_uverbs(OE) mlx5_core(OE) mlxfw(OE) mlx4_en(OE) sb_edac intel_powerclamp coretemp intel_rapl iosf_mbi kvm_intel kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd iTCO_wdt mxm_wmi iTCO_vendor_support ipmi_ssif mei_me joydev pcspkr mei ioatdma sg lpc_ich i2c_i801 ipmi_si ipmi_devintf ipmi_msghandler wmi acpi_power_meter acpi_pad auth_rpcgss sunrpc ip_tables ext4 mbcache jbd2 mlx4_ib(OE) ib_core(OE) sd_mod crc_t10dif crct10dif_generic mgag200 drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops igb ttm ahci mlx4_core(OE) drm ptp libahci crct10dif_pclmul crct10dif_common pps_core crc32c_intel devlink libata dca drm_panel_orientation_quirks mlx_compat(OE) i2c_algo_bit
            [11907.634252] CPU: 33 PID: 20895 Comm: mdtest Kdump: loaded Tainted: G           OE  ------------   3.10.0-957.el7.x86_64 #1
            [11907.646651] Hardware name: Intel Corporation S2600WT2R/S2600WT2R, BIOS SE5C610.86B.01.01.0016.033120161139 03/31/2016
            [11907.658563] task: ffff9847d5e030c0 ti: ffff983faff90000 task.ti: ffff983faff90000
            [11907.666964] RIP: 0010:[<ffffffffc099d920>]  [<ffffffffc099d920>] kiblnd_connect_peer+0x70/0x670 [ko2iblnd]
            [11907.677817] RSP: 0018:ffff983faff937a8  EFLAGS: 00010202
            [11907.683782] RAX: 0000000000000000 RBX: ffff9847cda39800 RCX: 0000000000000106
            [11907.691792] RDX: ffff984386f09d80 RSI: ffffffffc09a4460 RDI: ffff984386f09d80
            [11907.699803] RBP: ffff983faff937f8 R08: 0000000000000002 R09: ffffffffc09abd24
            [11907.707813] R10: ffff9838ffc07900 R11: ffff984386f09d80 R12: 00050000c0a8016c
            [11907.715825] R13: ffff984386f09d80 R14: ffff984386f09d80 R15: ffff9847d77cfa00
            [11907.723838] FS:  0000000000000000(0000) GS:ffff9847de0c0000(0000) knlGS:0000000000000000
            [11907.732921] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
            [11907.739371] CR2: 0000000000000028 CR3: 00000008cca10000 CR4: 00000000003607e0
            [11907.747383] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
            [11907.757744] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
            [11907.768095] Call Trace:
            [11907.773201]  [<ffffffffc099ff6c>] kiblnd_launch_tx+0x90c/0xc20 [ko2iblnd]
            [11907.783193]  [<ffffffffc09a05d7>] kiblnd_send+0x357/0xa10 [ko2iblnd]
            [11907.792681]  [<ffffffffc0af1bc1>] lnet_ni_send+0x41/0xd0 [lnet]
            [11907.801654]  [<ffffffffc0af6fb7>] lnet_send+0x77/0x180 [lnet]
            [11907.810429]  [<ffffffffc0af7305>] LNetPut+0x245/0x7a0 [lnet]
            [11907.819136]  [<ffffffffc0e94006>] ptl_send_buf+0x146/0x530 [ptlrpc]
            [11907.828470]  [<ffffffffc0aeca44>] ? LNetMDAttach+0x3f4/0x450 [lnet]
            [11907.837843]  [<ffffffffc0e95bfd>] ptl_send_rpc+0x67d/0xe60 [ptlrpc]
            [11907.847119]  [<ffffffffc0e8b4a8>] ptlrpc_send_new_req+0x468/0xa60 [ptlrpc]
            [11907.857087]  [<ffffffffc0ed439a>] ? null_alloc_reqbuf+0x19a/0x3a0 [ptlrpc]
            [11907.867016]  [<ffffffffc0e900c1>] ptlrpc_set_wait+0x3d1/0x920 [ptlrpc]
            [11907.876515]  [<ffffffffc0b6b8c9>] ? lustre_get_jobid+0x99/0x4d0 [obdclass]
            [11907.886425]  [<ffffffffc0e9b635>] ? lustre_msg_set_jobid+0x95/0x100 [ptlrpc]
            [11907.896423]  [<ffffffffc0e9068d>] ptlrpc_queue_wait+0x7d/0x220 [ptlrpc]
            [11907.905865]  [<ffffffffc0cff58c>] mdc_close+0x1bc/0x8a0 [mdc]
            [11907.914312]  [<ffffffffc0d3707c>] lmv_close+0x21c/0x550 [lmv]
            [11907.922702]  [<ffffffffc0fbb6fe>] ll_close_inode_openhandle+0x2fe/0xe20 [lustre]
            [11907.932901]  [<ffffffffc0fbe5c0>] ll_md_real_close+0xf0/0x1e0 [lustre]
            [11907.942087]  [<ffffffffc0fbecf8>] ll_file_release+0x648/0xa80 [lustre]
            [11907.951236]  [<ffffffff936433dc>] __fput+0xec/0x260
            [11907.958513]  [<ffffffff9364363e>] ____fput+0xe/0x10
            [11907.965763]  [<ffffffff934be79b>] task_work_run+0xbb/0xe0
            [11907.973631]  [<ffffffff9349dc61>] do_exit+0x2d1/0xa40
            [11907.981039]  [<ffffffff93b6f608>] ? __do_page_fault+0x228/0x500
            [11907.989417]  [<ffffffff9349e44f>] do_group_exit+0x3f/0xa0
            [11907.997175]  [<ffffffff9349e4c4>] SyS_exit_group+0x14/0x20
            [11908.004972]  [<ffffffff93b74ddb>] system_call_fastpath+0x22/0x27
            [11908.013306] Code: 48 8b 04 25 80 0e 01 00 48 8b 80 60 07 00 00 49 c7 c1 24 bd 9a c0 41 b8 02 00 00 00 b9 06 01 00 00 4c 89 ea 48 c7 c6 60 44 9a c0 <48> 8b 78 28 e8 37 61 c9 ff 48 3d 00 f0 ff ff 49 89 c6 0f 87 d1 
            [11908.038427] RIP  [<ffffffffc099d920>] kiblnd_connect_peer+0x70/0x670 [ko2iblnd]
            [11908.048166]  RSP <ffff983faff937a8>
            [11908.053569] CR2: 0000000000000028
            [    0.000000] Initializing cgroup subsys cpuset
            [    0.000000] Initializing cgroup subsys cpu
            [    0.000000] Initializing cgroup subsys cpuacct
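
            In the oops above, RIP is kiblnd_connect_peer+0x70 with RAX == 0 and CR2 == 0x28, which typically means the faulting instruction read a member at offset 0x28 of a NULL struct pointer. For illustration, a minimal sketch of how such an offset can be mapped back to a field name with offsetof(); the struct below is a made-up stand-in, not the real ko2iblnd types:

            /* Illustrative sketch: mapping a small faulting address (0x28 here,
             * i.e. CR2 with a NULL base pointer) back to a struct member via
             * offsetof().  The struct is a made-up stand-in, not the real
             * ko2iblnd data structures. */
            #include <stdio.h>
            #include <stddef.h>

            struct fake_dev {
                    char  pad[0x28];          /* members before the field */
                    void *field_at_0x28;      /* candidate faulting member */
            };

            int main(void)
            {
                    printf("offsetof(fake_dev, field_at_0x28) = 0x%zx\n",
                           offsetof(struct fake_dev, field_at_0x28));
                    return 0;
            }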
            
            sarah Sarah Liu added a comment - edited

            A quick update: soak has been running with Patrick's patch (https://build.whamcloud.com/job/lustre-reviews-ib/4213/) for about 6 hours, with only mds_failover and mds_restart fault injection, at 105 pass / 4 fail (simul only). It seems the revert works. I will leave it running overnight and update the ticket tomorrow.

            > show
            Job results:
                    FAIL
                            simul: 4
                                    240992 241065 241032 241086 
                            Total: 4
                    QUEUED
                            mdtestfpp: 3
                            iorssf: 2
                            iorfpp: 1
                            fiorandom: 10
                            blogbench: 10
                            mdtestssf: 4
                            fiosequent: 10
                            fiosas: 10
                            kcompile: 10
                            simul: 8
                            Total: 68
                    SUCCESS
                            mdtestssf: 23
                            iorssf: 23
                            iorfpp: 21
                            mdtestfpp: 26
                            simul: 12
                            Total: 105
            Fault injection results:
                    CURRENT
                            mds_failover in progress
                    COMPLETED
                            mds_failover: 5
                            mds_restart: 5
                            Total: 10
            

            Quick early readout.  I don't have application failure rates (Sarah will have to provide that), but I did look at the failovers and evictions today.  Overall failover still looks pretty choppy - quite a few have evictions - but the MDT failover is (tentatively) much improved.

            Over the last month or so, about 50% of the ~320 MDT failovers we did resulted in an aborted recovery.  All or almost all of those were from llog problems.  So far today we've done 9 MDT failovers, with 0 aborts.

            So...  Tentatively, much improved.  We'll have to see what the failure rate is like, and how things shake out overnight.

            pfarrell Patrick Farrell (Inactive) added a comment

            Sounds perfect, thanks!

            pfarrell Patrick Farrell (Inactive) added a comment
            sarah Sarah Liu added a comment -

            No, I didn't rerun #92 with a 3600s interval; instead I tried #98 first with the update log removed, which helped the pass rate, and then the triage team suggested moving to the tip of b2_10.

            Let me first try the review build; if it doesn't help with the pass rate, I can move back to #92 to confirm it is the latest good build.


            I see you said we were going to run build #92 recently:

            "I will reload build #92 with 3600s interval and see how it goes"

            But I don't see the results.  Did we do that?  How did it go?

            If #92 is indeed OK, then that suggests:
            LU-10527 obdclass: don't recycle loghandle upon ENOSPC (detail)

            Which is in your list of changes from 92 to 95.

            If the recent run of #92 was indeed OK, can we remove that patch and try tip of b2_10?

            pfarrell Patrick Farrell (Inactive) added a comment
            sarah Sarah Liu added a comment - edited

            Have we tried master on soak recently? Did it have a similar failure rate to 2.10?

            The last time we tested master was on 1/11, and that run was without PFL and DoM striping. When we did the 2.12 testing, it had a similar failure rate to the current 2.10 runs with PFL and DoM striping (LU-11818); with PFL and DoM disabled, it did better.

            What earlier versions of 2.10 have we tried? Did any of them have significantly lower failure rates?

            The earlier builds tested are 2.10.6 (b2_10-ib build #92), #95, and #98; build #98 seems to have higher failure rates.

            What is the most recent version of 2.10 we are confident had lower failure rates?

            I think the 2.10.6 RC build (#92) is in good shape.


            I've identified two candidate changes that could have introduced these issues. I have not identified any llog-related changes in master that are currently missing in b2_10.

            These are the changes:

            commit 51e962be60cf599ecf154ea3a6b1c0f9882daac2
            Author: Bruno Faccini <bruno.faccini@intel.com>
            Date: Wed Jan 17 16:22:58 2018 +0100

            LU-10527 obdclass: don't recycle loghandle upon ENOSPC

            In llog_cat_add_rec(), upon -ENOSPC error being returned from
            llog_cat_new_log(), don't reset "cathandle->u.chd.chd_current_log"
            to NULL.
            Not doing so will avoid to have llog_cat_declare_add_rec() repeatedly
            and unnecessarily create new+partially initialized LLOGs/llog_handle
            and assigned to "cathandle->u.chd.chd_current_log", this without
            llog_init_handle() never being called to initialize
            "loghandle->lgh_hdr".

            Also, unnecessary LASSERT(llh) has been removed in
            llog_cat_current_log() as it prevented to gracefully handle this
            case by simply returning the loghandle.
            Thanks to S.Cheremencev (Cray) to report this.

            Both ways to fix have been kept in patch as the 1st part allows for
            better performance in terms of number of FS operations being done
            with permanent changelog's ENOSPC condition, even if this covers
            a somewhat unlikely situation.


            That change first appeared in 2.10.6.
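
            For illustration, here is a minimal, self-contained toy model of the behaviour the commit message describes (the struct and function names below are invented for the example and are not the actual llog code): resetting the catalog's current-log handle to NULL on ENOSPC means every later attempt creates another new, partially initialized handle, while keeping the handle avoids that.

            /*
             * Toy model, not Lustre source: why resetting the "current log"
             * handle to NULL on ENOSPC leads to repeated creation of new,
             * partially initialized handles.
             */
            #include <stdio.h>
            #include <stdlib.h>
            #include <errno.h>

            struct log_handle { int initialized; };
            struct catalog { struct log_handle *current; int alloc_count; };

            static struct log_handle *get_current_log(struct catalog *cat)
            {
                    if (cat->current == NULL) {
                            /* stands in for allocating a new llog handle */
                            cat->current = calloc(1, sizeof(*cat->current));
                            cat->alloc_count++;
                    }
                    return cat->current;
            }

            static int add_record(struct catalog *cat, int reset_on_enospc)
            {
                    struct log_handle *lh = get_current_log(cat);
                    int rc = -ENOSPC;                /* pretend the target is full */

                    if (rc == -ENOSPC && reset_on_enospc) {
                            cat->current = NULL;     /* pre-fix behaviour */
                            free(lh);
                    }
                    return rc;
            }

            int main(void)
            {
                    struct catalog reset_cat = { 0 }, keep_cat = { 0 };

                    for (int i = 0; i < 5; i++) {
                            add_record(&reset_cat, 1);   /* old behaviour */
                            add_record(&keep_cat, 0);    /* fixed behaviour */
                    }
                    printf("handles created with reset-on-ENOSPC: %d\n",
                           reset_cat.alloc_count);
                    printf("handles created keeping the handle:   %d\n",
                           keep_cat.alloc_count);
                    free(keep_cat.current);
                    return 0;
            }

            Built with a plain C compiler, the first counter comes out as 5 and the second as 1, matching the "repeatedly and unnecessarily create new+partially initialized" wording in the commit message above.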

             

            The other change:
            -------

            commit 10cc97e3c1487692b460702bf46220b1acb452ee
            Author: Alexander Boyko <alexander.boyko@seagate.com>
            Date: Wed Mar 22 14:39:48 2017 +0300

            LU-7001 osp: fix llog processing

            The osp_sync_thread base on fact that llog_cat_process
            will not end until umount. This is worng when processing reaches
            bottom of catalog, or if catalog is wrapped.
            The patch fixes this issue.

            For wrapped catalog llog_process_thread could process old
            record.
            1 thread llog_process_thread read chunk and proccing first record
            2 thread add rec to this catalog at this chunk and
            update bitmap
            1 check bitmap for next idx and process old record

            Test conf-sanity 106 was added.

            Lustre-change: https://review.whamcloud.com/26132
            Lustre-commit: 8da9fb0cf14cc79bf1985d144d0a201e136dfe51

            Signed-off-by: Alexander Boyko <alexander.boyko@seagate.com>
            Seagate-bug-id: MRP-4235
            Change-Id: Ifc983018e3a325622ef3215bec4b69f5c9ac2ba2
            Reviewed-by: Andriy Skulysh
            Reviewed-by: Mike Pershin <mike.pershin@intel.com>
            Signed-off-by: Minh Diep <minh.diep@intel.com>
            Reviewed-on: https://review.whamcloud.com/32097
            Tested-by: Jenkins
            Tested-by: Maloo <hpdd-maloo@intel.com>
            Reviewed-by: John L. Hammond <john.hammond@intel.com>

            ------

            This change first appeared in 2.10.4.  (Which had a lot of changes from 2.10.3.)
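
            As a rough illustration of the wrapped-catalog race described above, here is a small self-contained sketch with toy types (invented for the example, not the actual osp/llog code): a reader works from a stale snapshot of a chunk but consults the live bitmap when deciding whether the next slot holds a valid record, so it can end up "processing" an old record it never re-read.

            /*
             * Toy model, not Lustre source: the reader snapshots a chunk, then
             * trusts the *live* bitmap, so a slot filled in by a concurrent
             * writer is "processed" using stale snapshot data.
             */
            #include <pthread.h>
            #include <stdio.h>
            #include <string.h>
            #include <unistd.h>

            #define NSLOTS 4

            static int live_records[NSLOTS];   /* the chunk as stored on disk */
            static int live_bitmap[NSLOTS];    /* which slots hold valid records */
            static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

            static void *writer(void *arg)     /* thread 2: adds a record */
            {
                    (void)arg;
                    usleep(1000);              /* let the reader snapshot first */
                    pthread_mutex_lock(&lock);
                    live_records[1] = 42;      /* new record written ... */
                    live_bitmap[1] = 1;        /* ... and its bitmap bit set */
                    pthread_mutex_unlock(&lock);
                    return NULL;
            }

            int main(void)                     /* thread 1: the log processor */
            {
                    pthread_t tid;
                    int snapshot[NSLOTS];

                    live_records[0] = 7;
                    live_bitmap[0] = 1;

                    memcpy(snapshot, live_records, sizeof(snapshot)); /* stale copy */
                    pthread_create(&tid, NULL, writer, NULL);
                    usleep(5000);              /* writer runs before we iterate */

                    for (int idx = 0; idx < NSLOTS; idx++) {
                            pthread_mutex_lock(&lock);
                            int valid = live_bitmap[idx];  /* live bitmap ... */
                            pthread_mutex_unlock(&lock);
                            if (valid)                     /* ... stale data */
                                    printf("processing slot %d -> %d\n",
                                           idx, snapshot[idx]);
                    }
                    /* Slot 1 prints 0 (stale) rather than 42: an old record got
                     * processed, analogous to what the commit describes. */
                    pthread_join(tid, NULL);
                    return 0;
            }

            Compile with "cc -pthread"; the output for slot 1 shows the stale value rather than the newly written record.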

            pfarrell Patrick Farrell (Inactive) added a comment

            People

              pfarrell Patrick Farrell (Inactive)
              sarah Sarah Liu
              Votes: 0
              Watchers: 6

              Dates

                Created:
                Updated:
                Resolved: