LU-11943

Many input/output errors after soak running for a couple of hours

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version: Lustre 2.10.7
    • Environment: Lustre version=2.10.6_35_g2bc39a1 lustre-b2_10-ib build #98 EL7.6 mlnx DNE; soak has PFL enabled in this test
    • 3
    • 9223372036854775807

    Description

      In the first 4-5 hours of the run, soak seemed to be working well, but after that applications started to fail frequently with errors such as:

      mdtest-1.9.3 was launched with 13 total task(s) on 6 node(s)
      Command line used: /mnt/soaked/bin/mdtest -d /mnt/soaked/soaktest/test/mdtestfpp/233652 -i 5 -n 158 -u
      Path: /mnt/soaked/soaktest/test/mdtestfpp
      FS: 83.4 TiB   Used FS: 1.6%   Inodes: 82.1 Mi   Used Inodes: 0.1%
      02/07/2019 18:13:12: Process 4(soak-27.spirit.whamcloud.com): FAILED in mdtest_stat, unable to stat file: Cannot send after transport endpoint shutdown
      02/07/2019 18:13:29: Process 5(soak-28.spirit.whamcloud.com): FAILED in mdtest_stat, unable to stat file: Cannot send after transport endpoint shutdown
      slurmstepd: error: *** STEP 233652.0 ON soak-18 CANCELLED AT 2019-02-07T18:12:14 ***
      srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
      srun: error: soak-28: task 5: Exited with exit code 1
      srun: Terminating job step 233652.0
      
      task 5 writing /mnt/soaked/soaktest/test/iorssf/233650/ssf
      WARNING: Task 5 requested transfer of 28199936 bytes,
               but transferred 2093056 bytes at offset 17483960320
      WARNING: This file system requires support of partial write()s, in aiori-POSIX.c (line 272).
      WARNING: Requested xfer of 28199936 bytes, but xferred 2093056 bytes
      Only transferred 2093056 of 28199936 bytes
      ** error **
      ERROR in aiori-POSIX.c (line 256): transfer failed.
      ERROR: Input/output error
      ** exiting **
      srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
      srun: error: soak-27: task 5: Exited with exit code 255
      srun: Terminating job step 233650.0
      
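      When applications report "Cannot send after transport endpoint shutdown" (ESHUTDOWN) or "Input/output error" (EIO) as above, a first check on an affected client is whether any of its MDC/OSC imports have dropped out of the FULL state. A minimal sketch using standard lctl parameters on a 2.10 client (not taken from the soak tooling):

      # On an affected client (e.g. soak-27/soak-28): import state of every MDT/OST
      lctl get_param mdc.*.import osc.*.import | grep -E 'target:|state:'
      # Evictions and reconnects also show up in the client kernel log
      dmesg | grep -E 'evicted|Connection restored|was lost'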

      Around the time the applications failed, the OSS side showed many lock callback timeout errors:

      Feb  7 01:02:00 soak-4 systemd-logind: New session 30 of user root.
      Feb  7 01:02:00 soak-4 systemd: Started Session 30 of user root.
      Feb  7 01:02:00 soak-4 sshd[97515]: pam_unix(sshd:session): session opened for user root by (uid=0)
      Feb  7 01:02:00 soak-4 sshd[97515]: pam_unix(sshd:session): session closed for user root
      Feb  7 01:02:00 soak-4 systemd-logind: Removed session 30.
      Feb  7 01:02:00 soak-4 systemd: Removed slice User Slice of root.
      Feb  7 01:02:09 soak-4 sshd[97553]: error: Could not load host key: /etc/ssh/ssh_host_dsa_key
      Feb  7 01:02:09 soak-4 sshd[97553]: Accepted publickey for root from 10.10.1.116 port 43696 ssh2: RSA SHA256:VGwjPuk53LIsLKjhGizbClh9X4HNRiAOs+XaQdKAWxM
      Feb  7 01:02:09 soak-4 systemd: Created slice User Slice of root.
      Feb  7 01:02:09 soak-4 systemd-logind: New session 31 of user root.
      Feb  7 01:02:09 soak-4 systemd: Started Session 31 of user root.
      Feb  7 01:02:09 soak-4 sshd[97553]: pam_unix(sshd:session): session opened for user root by (uid=0)
      Feb  7 01:02:10 soak-4 sshd[97553]: Received disconnect from 10.10.1.116 port 43696:11: disconnected by user
      Feb  7 01:02:10 soak-4 sshd[97553]: Disconnected from 10.10.1.116 port 43696
      Feb  7 01:02:10 soak-4 sshd[97553]: pam_unix(sshd:session): session closed for user root
      Feb  7 01:02:10 soak-4 systemd-logind: Removed session 31.
      Feb  7 01:02:10 soak-4 systemd: Removed slice User Slice of root.
      Feb  7 01:03:03 soak-4 kernel: LustreError: 0:0:(ldlm_lockd.c:334:waiting_locks_callback()) ### lock callback timer expired after 151s: evicting client at 192.168.1.128@o2ib  ns: filter-soaked-OST0004_UUID lock: ffff89923febd800/0x2a3d985a1228d475 lrc: 3/0,0 mode: PW/PW res: [0x400000402:0x4cf183:0x0].0x0 rrc: 17 type: EXT [0->18446744073709551615] (req 0->4095) flags: 0x60000000020020 nid: 192.168.1.128@o2ib remote: 0x5c14a0ad2708d24d expref: 6 pid: 27943 timeout: 4295139340 lvb_type: 0
      Feb  7 01:03:19 soak-4 kernel: Lustre: soaked-OST0004: Connection restored to ffd42784-63d2-739c-d487-9f95d4a84e57 (at 192.168.1.128@o2ib)
      Feb  7 01:03:19 soak-4 kernel: Lustre: Skipped 2 previous similar messages
      Feb  7 01:03:44 soak-4 sshd[105089]: error: Could not load host key: /etc/ssh/ssh_host_dsa_key
      Feb  7 01:03:44 soak-4 sshd[105089]: Accepted publickey for root from 10.10.1.116 port 43960 ssh2: RSA SHA256:VGwjPuk53LIsLKjhGizbClh9X4HNRiAOs+XaQdKAWxM
      
      Feb  7 01:08:06 soak-4 kernel: LustreError: 0:0:(ldlm_lockd.c:334:waiting_locks_callback()) ### lock callback timer expired after 152s: evicting client at 192.168.1.127@o2ib  ns: filter-soaked-OST0004_UUID lock: ffff89964ba94200/0x2a3d985a1228d47c lrc: 3/0,0 mode: PW/PW res: [0x400000402:0x4cf183:0x0].0x0 rrc: 14 type: EXT [0->18446744073709551615] (req 0->4095) flags: 0x60000000020020 nid: 192.168.1.127@o2ib remote: 0xd8fc8581f4c8de49 expref: 6 pid: 74474 timeout: 4295441140 lvb_type: 0
      Feb  7 01:08:18 soak-4 kernel: Lustre: soaked-OST0004: Connection restored to 44cc5f6f-1cd3-3d43-f778-9409d7aeb164 (at 192.168.1.127@o2ib)
      Feb  7 01:08:46 soak-4 kernel: Lustre: 26413:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1549501726/real 1549501726]  req@ffff89968a49b000 x1624769706421056/t0(0) o38->soaked-MDT0001-lwp-OST0008@192.168.1.109@o2ib:12/10 lens 520/544 e 0 to 1 dl 1549501742 ref 1 fl Rpc:eXN/0/ffffffff rc 0/-1
      Feb  7 01:08:46 soak-4 kernel: Lustre: 26413:0:(client.c:2114:ptlrpc_expire_one_request()) Skipped 3 previous similar messages
      Feb  7 01:09:36 soak-4 kernel: Lustre: 26413:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1549501776/real 1549501776]  req@ffff89968a498f00 x1624769706421792/t0(0) o38->soaked-MDT0001-lwp-OST000c@192.168.1.109@o2ib:12/10 lens 520/544 e 0 to 1 dl 1549501797 ref 1 fl Rpc:eXN/0/ffffffff rc 0/-1
      Feb  7 01:09:36 soak-4 kernel: Lustre: 26413:0:(client.c:2114:ptlrpc_expire_one_request()) Skipped 3 previous similar messages
      Feb  7 01:09:43 soak-4 kernel: Lustre: soaked-OST0004: haven't heard from client acbada78-7ef5-2938-ae9f-1535dd1eb89f (at 192.168.1.121@o2ib) in 229 seconds. I think it's dead, and I am evicting it. exp ffff89923fa98c00, cur 1549501783 expire 1549501633 last 1549501554
      Feb  7 01:10:01 soak-4 systemd: Created slice User Slice of root.
      Feb  7 01:10:01 soak-4 systemd: Started Session 44 of user root.
      Feb  7 01:10:01 soak-4 CROND[187082]: (root) CMD (/usr/lib64/sa/sa1 1 1)
      
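      The scale of the problem on an OSS such as soak-4 can be summarized straight from its syslog, for example with a rough sketch that greps for the exact messages shown above:

      # How many lock-callback and idle-client evictions has this OSS issued?
      grep -c 'lock callback timer expired' /var/log/messages
      grep -c "haven't heard from client" /var/log/messages
      # Which client NIDs were evicted, and how often?
      grep -oE 'evicting client at [0-9.]+@o2ib' /var/log/messages | sort | uniq -c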

      Later, the MDS side showed the following errors:

      Feb  7 09:15:03 soak-8 kernel: LNet: Service thread pid 13609 was inactive for 200.68s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      Feb  7 09:15:03 soak-8 kernel: Pid: 13609, comm: mdt00_013 3.10.0-957.el7_lustre.x86_64 #1 SMP Mon Jan 7 20:06:41 UTC 2019
      Feb  7 09:15:03 soak-8 kernel: Call Trace:
      Feb  7 09:15:03 soak-8 kernel: [<ffffffffc0f9ba11>] ldlm_completion_ast+0x5b1/0x920 [ptlrpc]
      Feb  7 09:15:03 soak-8 kernel: [<ffffffffc0f9dabb>] ldlm_cli_enqueue_fini+0x93b/0xdc0 [ptlrpc]
      Feb  7 09:15:03 soak-8 kernel: [<ffffffffc0fa0672>] ldlm_cli_enqueue+0x6c2/0x810 [ptlrpc]
      Feb  7 09:15:03 soak-8 kernel: [<ffffffffc1640622>] osp_md_object_lock+0x172/0x2e0 [osp]
      Feb  7 09:15:03 soak-8 kernel: [<ffffffffc1574ea3>] lod_object_lock+0xf3/0x950 [lod]
      Feb  7 09:15:03 soak-8 kernel: [<ffffffffc15e448e>] mdd_object_lock+0x3e/0xe0 [mdd]
      Feb  7 09:15:03 soak-8 kernel: [<ffffffffc1487fc5>] mdt_remote_object_lock+0x1e5/0x710 [mdt]
      Feb  7 09:15:03 soak-8 kernel: [<ffffffffc1489516>] mdt_object_lock_internal+0x166/0x300 [mdt]
      Feb  7 09:15:03 soak-8 kernel: [<ffffffffc14896f0>] mdt_reint_object_lock+0x20/0x60 [mdt]
      Feb  7 09:15:03 soak-8 kernel: [<ffffffffc14a05f4>] mdt_reint_link+0x7e4/0xc30 [mdt]
      Feb  7 09:15:03 soak-8 kernel: [<ffffffffc14a2c73>] mdt_reint_rec+0x83/0x210 [mdt]
      Feb  7 09:15:03 soak-8 kernel: [<ffffffffc148418b>] mdt_reint_internal+0x5fb/0x9c0 [mdt]
      Feb  7 09:15:03 soak-8 kernel: [<ffffffffc148ff77>] mdt_reint+0x67/0x140 [mdt]
      Feb  7 09:15:03 soak-8 kernel: [<ffffffffc10342aa>] tgt_request_handle+0x92a/0x1370 [ptlrpc]
      Feb  7 09:15:03 soak-8 kernel: [<ffffffffc0fdcd5b>] ptlrpc_server_handle_request+0x23b/0xaa0 [ptlrpc]
      Feb  7 09:15:03 soak-8 kernel: [<ffffffffc0fe04a2>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
      Feb  7 09:15:03 soak-8 kernel: [<ffffffff884c1c31>] kthread+0xd1/0xe0
      Feb  7 09:15:03 soak-8 kernel: [<ffffffff88b74c37>] ret_from_fork_nospec_end+0x0/0x39
      Feb  7 09:15:03 soak-8 kernel: [<ffffffffffffffff>] 0xffffffffffffffff
      Feb  7 09:15:03 soak-8 kernel: LustreError: dumping log to /tmp/lustre-log.1549530903.13609
      Feb  7 09:15:05 soak-8 kernel: LNet: Service thread pid 13582 was inactive for 202.07s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      Feb  7 09:15:05 soak-8 kernel: Pid: 13582, comm: mdt01_007 3.10.0-957.el7_lustre.x86_64 #1 SMP Mon Jan 7 20:06:41 UTC 2019
      Feb  7 09:15:05 soak-8 kernel: Call Trace:
      Feb  7 09:15:05 soak-8 kernel: [<ffffffffc0f9ba11>] ldlm_completion_ast+0x5b1/0x920 [ptlrpc]
      Feb  7 09:15:05 soak-8 kernel: [<ffffffffc0f9dabb>] ldlm_cli_enqueue_fini+0x93b/0xdc0 [ptlrpc]
      Feb  7 09:15:05 soak-8 kernel: [<ffffffffc0fa0672>] ldlm_cli_enqueue+0x6c2/0x810 [ptlrpc]
      Feb  7 09:15:05 soak-8 kernel: [<ffffffffc1640622>] osp_md_object_lock+0x172/0x2e0 [osp]
      Feb  7 09:15:05 soak-8 kernel: [<ffffffffc1574ea3>] lod_object_lock+0xf3/0x950 [lod]
      Feb  7 09:15:05 soak-8 kernel: [<ffffffffc15e448e>] mdd_object_lock+0x3e/0xe0 [mdd]
      Feb  7 09:15:05 soak-8 kernel: [<ffffffffc1487fc5>] mdt_remote_object_lock+0x1e5/0x710 [mdt]
      Feb  7 09:15:05 soak-8 kernel: [<ffffffffc1489516>] mdt_object_lock_internal+0x166/0x300 [mdt]
      Feb  7 09:15:05 soak-8 kernel: [<ffffffffc14896f0>] mdt_reint_object_lock+0x20/0x60 [mdt]
      Feb  7 09:15:05 soak-8 kernel: [<ffffffffc14a05f4>] mdt_reint_link+0x7e4/0xc30 [mdt]
      Feb  7 09:15:05 soak-8 kernel: [<ffffffffc14a2c73>] mdt_reint_rec+0x83/0x210 [mdt]
      Feb  7 09:15:05 soak-8 kernel: [<ffffffffc148418b>] mdt_reint_internal+0x5fb/0x9c0 [mdt]
      Feb  7 09:15:05 soak-8 kernel: [<ffffffffc148ff77>] mdt_reint+0x67/0x140 [mdt]
      Feb  7 09:15:05 soak-8 kernel: [<ffffffffc10342aa>] tgt_request_handle+0x92a/0x1370 [ptlrpc]
      Feb  7 09:15:05 soak-8 kernel: [<ffffffffc0fdcd5b>] ptlrpc_server_handle_request+0x23b/0xaa0 [ptlrpc]
      Feb  7 09:15:05 soak-8 kernel: [<ffffffffc0fe04a2>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
      Feb  7 09:15:05 soak-8 kernel: [<ffffffff884c1c31>] kthread+0xd1/0xe0
      Feb  7 09:15:05 soak-8 kernel: [<ffffffff88b74c37>] ret_from_fork_nospec_end+0x0/0x39
      Feb  7 09:15:05 soak-8 kernel: [<ffffffffffffffff>] 0xffffffffffffffff
      Feb  7 09:15:05 soak-8 kernel: Pid: 18685, comm: mdt00_026 3.10.0-957.el7_lustre.x86_64 #1 SMP Mon Jan 7 20:06:41 UTC 2019
      Feb  7 09:15:05 soak-8 kernel: Call Trace:
      Feb  7 09:15:05 soak-8 kernel: [<ffffffffc0f9ba11>] ldlm_completion_ast+0x5b1/0x920 [ptlrpc]
      Feb  7 09:15:05 soak-8 kernel: [<ffffffffc0f9dabb>] ldlm_cli_enqueue_fini+0x93b/0xdc0 [ptlrpc]
      Feb  7 09:15:05 soak-8 kernel: [<ffffffffc0fa0672>] ldlm_cli_enqueue+0x6c2/0x810 [ptlrpc]
      Feb  7 09:15:05 soak-8 kernel: [<ffffffffc1640622>] osp_md_object_lock+0x172/0x2e0 [osp]
      Feb  7 09:15:05 soak-8 kernel: [<ffffffffc1574ea3>] lod_object_lock+0xf3/0x950 [lod]
      Feb  7 09:15:05 soak-8 kernel: [<ffffffffc15e448e>] mdd_object_lock+0x3e/0xe0 [mdd]
      Feb  7 09:15:05 soak-8 kernel: [<ffffffffc1487fc5>] mdt_remote_object_lock+0x1e5/0x710 [mdt]
      Feb  7 09:15:05 soak-8 kernel: [<ffffffffc1489516>] mdt_object_lock_internal+0x166/0x300 [mdt]
      Feb  7 09:15:05 soak-8 kernel: [<ffffffffc14896f0>] mdt_reint_object_lock+0x20/0x60 [mdt]
      Feb  7 09:15:05 soak-8 kernel: [<ffffffffc14a05f4>] mdt_reint_link+0x7e4/0xc30 [mdt]
      Feb  7 09:15:05 soak-8 kernel: [<ffffffffc14a2c73>] mdt_reint_rec+0x83/0x210 [mdt]
      Feb  7 09:15:05 soak-8 kernel: [<ffffffffc148418b>] mdt_reint_internal+0x5fb/0x9c0 [mdt]
      Feb  7 09:15:05 soak-8 kernel: [<ffffffffc148ff77>] mdt_reint+0x67/0x140 [mdt]
      Feb  7 09:15:05 soak-8 kernel: [<ffffffffc10342aa>] tgt_request_handle+0x92a/0x1370 [ptlrpc]
      Feb  7 09:15:05 soak-8 kernel: [<ffffffffc0fdcd5b>] ptlrpc_server_handle_request+0x23b/0xaa0 [ptlrpc]
      Feb  7 09:15:05 soak-8 kernel: [<ffffffffc0fe04a2>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
      Feb  7 09:15:05 soak-8 kernel: [<ffffffff884c1c31>] kthread+0xd1/0xe0
      Feb  7 09:15:05 soak-8 kernel: [<ffffffff88b74c37>] ret_from_fork_nospec_end+0x0/0x39
      Feb  7 09:15:05 soak-8 kernel: [<ffffffffffffffff>] 0xffffffffffffffff
      Feb  7 09:15:14 soak-8 sshd[78874]: error: Could not load host key: /etc/ssh/ssh_host_dsa_key
      Feb  7 09:15:14 soak-8 sshd[78874]: Accepted publickey for root from 10.10.1.116 port 59708 ssh2: RSA SHA256:VGwjPuk53LIsLKjhGizbClh9X4HNRiAOs+XaQdKAWxM
      Feb  7 09:16:40 soak-8 systemd: Removed slice User Slice of root.
      
      Feb  7 09:16:43 soak-8 kernel: LustreError: 13609:0:(ldlm_request.c:148:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1549530702, 300s ago), entering recovery for soaked-MDT0001_UUID@192.168.1.108@o2ib ns: soaked-MDT0001-osp-MDT0000 lock: ffff92322f761800/0x7ebebf899932ee34 lrc: 4/0,1 mode: --/EX res: [0x24000cf43:0x16f41:0x0].0x0 bits 0x2 rrc: 4 type: IBT flags: 0x1000001000000 nid: local remote: 0xc7baaf97e1b53a16 expref: -99 pid: 13609 timeout: 0 lvb_type: 0
      Feb  7 09:16:43 soak-8 kernel: LustreError: 13609:0:(ldlm_request.c:148:ldlm_expired_completion_wait()) Skipped 2 previous similar messages
      Feb  7 09:16:55 soak-8 sshd[79087]: error: Could not load host key: /etc/ssh/ssh_host_dsa_key
      Feb  7 09:16:56 soak-8 sshd[79087]: Accepted publickey for root from 10.10.1.116 port 59970 ssh2: RSA SHA256:VGwjPuk53LIsLKjhGizbClh9X4HNRiAOs+XaQdKAWxM
      
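      Each hung service thread above also triggers a binary debug dump (e.g. /tmp/lustre-log.1549530903.13609). These dumps are not plain text; they can be decoded on the server with lctl, roughly like this (standard lctl usage, not part of the original report):

      # Decode the binary debug dump referenced above into readable text
      lctl debug_file /tmp/lustre-log.1549530903.13609 /tmp/lustre-log.1549530903.13609.txt
      less /tmp/lustre-log.1549530903.13609.txt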

      Attachments

        1. 240435-mdtestfpp.out
          14 kB
        2. console-soak-10.log-20190224.gz
          515 kB
        3. console-soak-23.log-20190224.gz
          121 kB
        4. soak-10.log-console
          1.13 MB
        5. sys-soak-10.log-20190224.gz
          1.58 MB
        6. sys-soak-23.log-20190224.gz
          1.04 MB

        Issue Links

          Activity

            [LU-11943] Many input/output errors after soak running for a couple of hours

            Would we know if the MDS were asserting during shutdown/failover?  How do we cause the MDS failover - What do we do to it exactly?

            pfarrell Patrick Farrell (Inactive) added a comment -

            Woah.  That's very confusing.  Unless we made a mistake with the build, that should be impossible.  But I do see the errors.  That's extremely confusing.

            I'll see what I can figure out.

            pfarrell Patrick Farrell (Inactive) added a comment -
            sarah Sarah Liu added a comment -

            update of lustre-reviews-ib build #4241 (https://review.whamcloud.com/34346)

            After 17 hours of running with only mds_failover and mds_restart, the fail/pass rate is 97 fail / 453 pass. With this modified patch, the pass rate was still worse than with the previous revert patch.

            Checking one of the failed applications' mdtestfpp.out:

            lmv_stripe_count: 0 lmv_stripe_offset: 0 lmv_hash_type: none
            lmv_stripe_count: 0 lmv_stripe_offset: -1 lmv_hash_type: none
            srun: Warning: can't honor --ntasks-per-node set to 2 which doesn't match the requested tasks 12 with the number of requested nodes 8. Ignoring --ntasks-per-node.
            -- started at 03/01/2019 12:45:51 --
            
            mdtest-1.9.3 was launched with 12 total task(s) on 6 node(s)
            Command line used: /mnt/soaked/bin/mdtest -d /mnt/soaked/soaktest/test/mdtestfpp/241953 -i 5 -n 105 -u
            03/01/2019 12:49:18: Process 0(): FAILED in show_file_system_size, unable to statfs() file system: Input/output error
            --------------------------------------------------------------------------
            MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD 
            with errorcode 1.
            
            NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
            You may or may not see output from other processes, depending on
            exactly when Open MPI kills them.
            --------------------------------------------------------------------------
            srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
            slurmstepd: error: *** STEP 241953.0 ON soak-28 CANCELLED AT 2019-03-01T12:49:18 ***
            

            Then, checking the client soak-28 syslog around 3/1 12:49, it had been evicted by MDS #2 (soak-10):

            Mar  1 12:46:46 soak-28 kernel: Lustre: 10153:0:(client.c:2116:ptlrpc_expire_one_request()) Skipped 2 previous similar messages
            Mar  1 12:47:05 soak-28 kernel: LNet: 10142:0:(o2iblnd_cb.c:3192:kiblnd_check_conns()) Timed out tx for 192.168.1.110@o2ib: 6 seconds
            Mar  1 12:47:56 soak-28 kernel: LNet: 10142:0:(o2iblnd_cb.c:3192:kiblnd_check_conns()) Timed out tx for 192.168.1.110@o2ib: 0 seconds
            Mar  1 12:49:16 soak-28 kernel: LustreError: 65561:0:(lmv_obd.c:1387:lmv_statfs()) can't stat MDS #2 (soaked-MDT0002-mdc-ffff98d1ac0b5000), error -19
            Mar  1 12:49:16 soak-28 kernel: LustreError: 65561:0:(lmv_obd.c:1387:lmv_statfs()) Skipped 1 previous similar message
            Mar  1 12:49:16 soak-28 kernel: LustreError: 65561:0:(llite_lib.c:1759:ll_statfs_internal()) md_statfs fails: rc = -19
            Mar  1 12:49:17 soak-28 kernel: LustreError: 65561:0:(llite_lib.c:1759:ll_statfs_internal()) Skipped 3 previous similar messages
            Mar  1 12:49:18 soak-28 kernel: LustreError: 167-0: soaked-MDT0002-mdc-ffff98d1ac0b5000: This client was evicted by soaked-MDT0002; in progress operations using this service will fail.
            Mar  1 12:49:18 soak-28 kernel: LustreError: 65550:0:(lmv_obd.c:1387:lmv_statfs()) can't stat MDS #2 (soaked-MDT0002-mdc-ffff98d1ac0b5000), error -5
            Mar  1 12:49:18 soak-28 kernel: Lustre: soaked-MDT0002-mdc-ffff98d1ac0b5000: Connection restored to 192.168.1.110@o2ib (at 192.168.1.110@o2ib)
            Mar  1 12:49:18 soak-28 kernel: LustreError: 65550:0:(llite_lib.c:1759:ll_statfs_internal()) md_statfs fails: rc = -5
            Mar  1 12:49:20 soak-28 sshd[65650]: error: Could not load host key: /etc/ssh/ssh_host_dsa_key
            Mar  1 12:49:20 soak-28 sshd[65650]: Accepted publickey for root from 10.10.1.116 port 36250 ssh2: RSA SHA256:VGwjPuk53LIsLKjhGizbClh9X4HNRiAOs+XaQdKAWxM
            Mar  1 12:49:20 soak-28 systemd: Created slice User Slice of root.
            Mar  1 12:49:20 soak-28 systemd-logind: New session 1419 of user root.
            Mar  1 12:49:20 soak-28 systemd: Started Session 1419 of user root.
            Mar  1 12:49:20 soak-28 sshd[65650]: pam_unix(sshd:session): session opened for user root by (uid=0)
            Mar  1 12:49:21 soak-28 sshd[65650]: pam_unix(sshd:session): session closed for user root
            Mar  1 12:49:21 soak-28 systemd-logind: Removed session 1419.
            Mar  1 12:49:21 soak-28 systemd: Removed slice User Slice of root.
            Mar  1 12:49:30 soak-28 sshd[65684]: error: Could not load host key: /etc/ssh/ssh_host_dsa_key
            Mar  1 12:49:30 soak-28 sshd[65684]: Accepted publickey for root from 10.10.1.116 port 36388 ssh2: RSA SHA256:VGwjPuk53LIsLKjhGizbClh9X4HNRiAOs+XaQdKAWxM
            Mar  1 12:49:30 soak-28 systemd: Created slice User Slice of root.
            Mar  1 12:49:30 soak-28 systemd-logind: New session 1420 of user root.
            Mar  1 12:49:30 soak-28 systemd: Started Session 1420 of user root.
            Mar  1 12:49:30 soak-28 sshd[65684]: pam_unix(sshd:session): session opened for user root by (uid=0)
            Mar  1 12:49:30 soak-28 sshd[65684]: Received disconnect from 10.10.1.116 port 36388:11: disconnected by user
            Mar  1 12:49:30 soak-28 sshd[65684]: Disconnected from 10.10.1.116 port 36388
            Mar  1 12:49:30 soak-28 sshd[65684]: pam_unix(sshd:session): session closed for user root
            Mar  1 12:49:30 soak-28 systemd-logind: Removed session 1420.
            Mar  1 12:49:30 soak-28 systemd: Removed slice User Slice of root.
            Mar  1 12:49:56 soak-28 kernel: LustreError: 65742:0:(llite_lib.c:2345:ll_prep_inode()) new_inode -fatal: rc -2
            Mar  1 12:49:57 soak-28 kernel: LustreError: 65772:0:(llite_lib.c:2345:ll_prep_inode()) new_inode -fatal: rc -2
            Mar  1 12:49:57 soak-28 kernel: LustreError: 65772:0:(llite_lib.c:2345:ll_prep_inode()) Skipped 10 previous similar messages
            

            Checking soak-10 around the same time:

            Mar  1 12:47:37 soak-10 multipathd: 360080e50001fedb80000015752012949: remaining active paths: 1
            Mar  1 12:47:37 soak-10 kernel: sd 0:0:0:44: rdac: array soak-netapp5660-1, ctlr 0, queueing MODE_SELECT command
            Mar  1 12:47:37 soak-10 kernel: Lustre: soaked-MDT0002: Not available for connect from 192.168.1.123@o2ib (not set up)
            Mar  1 12:47:38 soak-10 kernel: sd 0:0:0:44: rdac: array soak-netapp5660-1, ctlr 0, MODE_SELECT completed
            Mar  1 12:47:38 soak-10 kernel: Lustre: soaked-MDT0002: Imperative Recovery enabled, recovery window shrunk from 300-900 down to 150-900
            Mar  1 12:47:38 soak-10 kernel: LustreError: 13072:0:(llog.c:588:llog_process_thread()) soaked-MDT0003-osp-MDT0002: [0x2:0xc0030980:0x2] Invalid record: index 6 but expected 2
            Mar  1 12:47:38 soak-10 kernel: LustreError: 13072:0:(lod_dev.c:419:lod_sub_recovery_thread()) soaked-MDT0003-osp-MDT0002 getting update log failed: rc = -34
            Mar  1 12:47:39 soak-10 sshd[12270]: pam_unix(sshd:session): session closed for user root
            Mar  1 12:47:39 soak-10 systemd-logind: Removed session 3.
            Mar  1 12:47:39 soak-10 systemd: Removed slice User Slice of root.
            Mar  1 12:47:39 soak-10 kernel: Lustre: soaked-MDT0002: Connection restored to b61568ce-eadb-273d-d151-58976b27b2a3 (at 192.168.1.122@o2ib)
            Mar  1 12:47:39 soak-10 kernel: Lustre: 13073:0:(ldlm_lib.c:2061:target_recovery_overseer()) recovery is aborted, evict exports in recovery
            Mar  1 12:47:39 soak-10 kernel: Lustre: soaked-MDT0002: disconnecting 16 stale clients
            Mar  1 12:47:41 soak-10 kernel: Lustre: soaked-MDT0002: Connection restored to 4ae090bf-6bf9-e882-df72-d1d71f1e04c8 (at 192.168.1.120@o2ib)
            Mar  1 12:47:42 soak-10 multipathd: 360080e50001fedb80000015752012949: sdg - rdac checker reports path is up
            Mar  1 12:47:42 soak-10 multipathd: 8:96: reinstated
            Mar  1 12:47:42 soak-10 multipathd: 360080e50001fedb80000015752012949: load table [0 31247843328 multipath 2 queue_if_no_path retain_attached_hw_handler 1 rdac 1 1 round-robin 0 1 1 8:96 1]
            Mar  1 12:47:44 soak-10 kernel: Lustre: soaked-MDT0002: Connection restored to d4732544-d66c-f143-5fc3-3aff0109b21f (at 192.168.1.135@o2ib)
            Mar  1 12:47:46 soak-10 kernel: Lustre: soaked-MDT0002: Connection restored to  (at 192.168.1.124@o2ib)
            Mar  1 12:47:46 soak-10 kernel: Lustre: Skipped 13 previous similar messages
            Mar  1 12:47:47 soak-10 multipathd: 360080e50001fedb80000015752012949: sdg - rdac checker reports path is ghost
            Mar  1 12:47:47 soak-10 multipathd: 8:96: reinstated
            Mar  1 12:47:48 soak-10 kernel: sd 0:0:0:44: [sdg] FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
            Mar  1 12:47:48 soak-10 kernel: sd 0:0:0:44: [sdg] Sense Key : Illegal Request [current] 
            Mar  1 12:47:48 soak-10 kernel: sd 0:0:0:44: [sdg] <<vendor>>ASC=0x94 ASCQ=0x1 
            Mar  1 12:47:48 soak-10 kernel: sd 0:0:0:44: [sdg] CDB: Write(16) 8a 00 00 00 00 00 00 b1 23 78 00 00 00 08 00 00
            Mar  1 12:47:48 soak-10 kernel: blk_update_request: I/O error, dev sdg, sector 11608952
            Mar  1 12:47:48 soak-10 kernel: device-mapper: multipath: Failing path 8:96.
            Mar  1 12:47:48 soak-10 multipathd: sdg: mark as failed
            
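            The llog_process_thread() error above (rc = -34, i.e. ERANGE) means the update log for soaked-MDT0003-osp-MDT0002 could not be processed on the failover node, after which recovery was aborted and the stale clients were evicted. The recovery outcome on the restarted target can be confirmed with something like the following (plain lctl/grep on the MDS; the target name comes from the log above):

            # On soak-10 after the failover: did soaked-MDT0002 complete or abort recovery?
            lctl get_param mdt.soaked-MDT0002.recovery_status
            # How often has recovery been aborted on this server so far?
            grep -c 'recovery is aborted' /var/log/messages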

            Changed fixes to associate with this ticket.

            Master:
            https://review.whamcloud.com/#/c/34347/

            b2_10:

            https://review.whamcloud.com/#/c/34346/

             

            pfarrell Patrick Farrell (Inactive) added a comment -

            Suggested fix is here:
            https://review.whamcloud.com/34346

            I'll push it for master shortly.

            pfarrell Patrick Farrell (Inactive) added a comment -

            We are now up to 54 failovers without an abort.  I'll update the other bug - I'm not sure how, but that patch is causing a lot of trouble.

            pfarrell Patrick Farrell (Inactive) added a comment -
            sarah Sarah Liu added a comment -

            Review build #4213 has been running for 20 hours; the fail/pass numbers are 19 fail / 384 pass, and only simul failed, due to a timeout.

            No crashes on the servers, but one client hit a kernel NULL pointer dereference.
            soak-40 console:

            soak-40 login: [   60.771568] IPv6: ADDRCONF(NETDEV_CHANGE): ib0: link becomes ready
            [ 4976.239443] LNet: HW NUMA nodes: 2, HW CPU cores: 88, npartitions: 2
            [ 4976.252443] alg: No test for adler32 (adler32-zlib)
            [ 4977.043702] Lustre: Lustre: Build Version: 2.10.6_64_g1307d8a
            [ 4977.175814] LNet: Added LNI 192.168.1.140@o2ib [8/256/0/180]
            [ 8232.416484] LustreError: 11-0: soaked-OST000c-osc-ffff9847cefdd800: operation ost_connect to node 192.168.1.104@o2ib failed: rc = -16
            [ 8232.431730] Lustre: Mounted soaked-client
            [10013.355513] perf: interrupt took too long (3164 > 3156), lowering kernel.perf_event_max_sample_rate to 63000
            [11889.420578] Lustre: 14411:0:(client.c:2116:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1551302693/real 0]  req@ffff9840c6e6b300 x1626651565453584/t0(0) o400->MGC192.168.1.108@o2ib@192.168.1.108@o2ib:26/25 lens 224/224 e 0 to 1 dl 1551302700 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
            [11889.451642] LustreError: 166-1: MGC192.168.1.108@o2ib: Connection to MGS (at 192.168.1.108@o2ib) was lost; in progress operations using this service will fail
            [11895.468004] Lustre: 14364:0:(client.c:2116:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1551302700/real 0]  req@ffff9840c6e6e600 x1626651565467584/t0(0) o250->MGC192.168.1.108@o2ib@192.168.1.108@o2ib:26/25 lens 520/544 e 0 to 1 dl 1551302706 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
            [11907.508055] BUG: unable to handle kernel NULL pointer dereference at 0000000000000028
            [11907.516881] IP: [<ffffffffc099d920>] kiblnd_connect_peer+0x70/0x670 [ko2iblnd]
            [11907.525014] PGD 0 
            [11907.527286] Oops: 0000 [#1] SMP 
            [11907.530934] Modules linked in: osc(OE) mgc(OE) lustre(OE) lmv(OE) fld(OE) mdc(OE) fid(OE) lov(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) rpcsec_gss_krb5 nfsv4 dns_resolver nfs lockd grace fscache rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_umad(OE) mlx5_ib(OE) ib_uverbs(OE) mlx5_core(OE) mlxfw(OE) mlx4_en(OE) sb_edac intel_powerclamp coretemp intel_rapl iosf_mbi kvm_intel kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd iTCO_wdt mxm_wmi iTCO_vendor_support ipmi_ssif mei_me joydev pcspkr mei ioatdma sg lpc_ich i2c_i801 ipmi_si ipmi_devintf ipmi_msghandler wmi acpi_power_meter acpi_pad auth_rpcgss sunrpc ip_tables ext4 mbcache jbd2 mlx4_ib(OE) ib_core(OE) sd_mod crc_t10dif crct10dif_generic mgag200 drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops igb ttm ahci mlx4_core(OE) drm ptp libahci crct10dif_pclmul crct10dif_common pps_core crc32c_intel devlink libata dca drm_panel_orientation_quirks mlx_compat(OE) i2c_algo_bit
            [11907.634252] CPU: 33 PID: 20895 Comm: mdtest Kdump: loaded Tainted: G           OE  ------------   3.10.0-957.el7.x86_64 #1
            [11907.646651] Hardware name: Intel Corporation S2600WT2R/S2600WT2R, BIOS SE5C610.86B.01.01.0016.033120161139 03/31/2016
            [11907.658563] task: ffff9847d5e030c0 ti: ffff983faff90000 task.ti: ffff983faff90000
            [11907.666964] RIP: 0010:[<ffffffffc099d920>]  [<ffffffffc099d920>] kiblnd_connect_peer+0x70/0x670 [ko2iblnd]
            [11907.677817] RSP: 0018:ffff983faff937a8  EFLAGS: 00010202
            [11907.683782] RAX: 0000000000000000 RBX: ffff9847cda39800 RCX: 0000000000000106
            [11907.691792] RDX: ffff984386f09d80 RSI: ffffffffc09a4460 RDI: ffff984386f09d80
            [11907.699803] RBP: ffff983faff937f8 R08: 0000000000000002 R09: ffffffffc09abd24
            [11907.707813] R10: ffff9838ffc07900 R11: ffff984386f09d80 R12: 00050000c0a8016c
            [11907.715825] R13: ffff984386f09d80 R14: ffff984386f09d80 R15: ffff9847d77cfa00
            [11907.723838] FS:  0000000000000000(0000) GS:ffff9847de0c0000(0000) knlGS:0000000000000000
            [11907.732921] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
            [11907.739371] CR2: 0000000000000028 CR3: 00000008cca10000 CR4: 00000000003607e0
            [11907.747383] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
            [11907.757744] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
            [11907.768095] Call Trace:
            [11907.773201]  [<ffffffffc099ff6c>] kiblnd_launch_tx+0x90c/0xc20 [ko2iblnd]
            [11907.783193]  [<ffffffffc09a05d7>] kiblnd_send+0x357/0xa10 [ko2iblnd]
            [11907.792681]  [<ffffffffc0af1bc1>] lnet_ni_send+0x41/0xd0 [lnet]
            [11907.801654]  [<ffffffffc0af6fb7>] lnet_send+0x77/0x180 [lnet]
            [11907.810429]  [<ffffffffc0af7305>] LNetPut+0x245/0x7a0 [lnet]
            [11907.819136]  [<ffffffffc0e94006>] ptl_send_buf+0x146/0x530 [ptlrpc]
            [11907.828470]  [<ffffffffc0aeca44>] ? LNetMDAttach+0x3f4/0x450 [lnet]
            [11907.837843]  [<ffffffffc0e95bfd>] ptl_send_rpc+0x67d/0xe60 [ptlrpc]
            [11907.847119]  [<ffffffffc0e8b4a8>] ptlrpc_send_new_req+0x468/0xa60 [ptlrpc]
            [11907.857087]  [<ffffffffc0ed439a>] ? null_alloc_reqbuf+0x19a/0x3a0 [ptlrpc]
            [11907.867016]  [<ffffffffc0e900c1>] ptlrpc_set_wait+0x3d1/0x920 [ptlrpc]
            [11907.876515]  [<ffffffffc0b6b8c9>] ? lustre_get_jobid+0x99/0x4d0 [obdclass]
            [11907.886425]  [<ffffffffc0e9b635>] ? lustre_msg_set_jobid+0x95/0x100 [ptlrpc]
            [11907.896423]  [<ffffffffc0e9068d>] ptlrpc_queue_wait+0x7d/0x220 [ptlrpc]
            [11907.905865]  [<ffffffffc0cff58c>] mdc_close+0x1bc/0x8a0 [mdc]
            [11907.914312]  [<ffffffffc0d3707c>] lmv_close+0x21c/0x550 [lmv]
            [11907.922702]  [<ffffffffc0fbb6fe>] ll_close_inode_openhandle+0x2fe/0xe20 [lustre]
            [11907.932901]  [<ffffffffc0fbe5c0>] ll_md_real_close+0xf0/0x1e0 [lustre]
            [11907.942087]  [<ffffffffc0fbecf8>] ll_file_release+0x648/0xa80 [lustre]
            [11907.951236]  [<ffffffff936433dc>] __fput+0xec/0x260
            [11907.958513]  [<ffffffff9364363e>] ____fput+0xe/0x10
            [11907.965763]  [<ffffffff934be79b>] task_work_run+0xbb/0xe0
            [11907.973631]  [<ffffffff9349dc61>] do_exit+0x2d1/0xa40
            [11907.981039]  [<ffffffff93b6f608>] ? __do_page_fault+0x228/0x500
            [11907.989417]  [<ffffffff9349e44f>] do_group_exit+0x3f/0xa0
            [11907.997175]  [<ffffffff9349e4c4>] SyS_exit_group+0x14/0x20
            [11908.004972]  [<ffffffff93b74ddb>] system_call_fastpath+0x22/0x27
            [11908.013306] Code: 48 8b 04 25 80 0e 01 00 48 8b 80 60 07 00 00 49 c7 c1 24 bd 9a c0 41 b8 02 00 00 00 b9 06 01 00 00 4c 89 ea 48 c7 c6 60 44 9a c0 <48> 8b 78 28 e8 37 61 c9 ff 48 3d 00 f0 ff ff 49 89 c6 0f 87 d1 
            [11908.038427] RIP  [<ffffffffc099d920>] kiblnd_connect_peer+0x70/0x670 [ko2iblnd]
            [11908.048166]  RSP <ffff983faff937a8>
            [11908.053569] CR2: 0000000000000028
            [    0.000000] Initializing cgroup subsys cpuset
            [    0.000000] Initializing cgroup subsys cpu
            [    0.000000] Initializing cgroup subsys cpuacct
            
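            The oops is at kiblnd_connect_peer+0x70 with CR2 = 0x28, i.e. a read through a NULL pointer at offset 0x28. With the matching debuginfo for this build installed, the offset can be mapped back to a source line, for instance (a sketch; the module path is an assumption):

            # Map the faulting offset back to a source line; requires ko2iblnd.ko
            # (or its .debug file) from exactly this build, path assumed here
            gdb -batch -ex 'list *(kiblnd_connect_peer+0x70)' ./ko2iblnd.ko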
            sarah Sarah Liu added a comment - - edited

            A quick update: soak has been running with Patrick's patch (https://build.whamcloud.com/job/lustre-reviews-ib/4213/) for about 6 hours, with only mds_failover and mds_restart: 105 pass / 4 fail (simul only). It seems the revert works. I will leave it running overnight and update the ticket tomorrow.

            > show
            Job results:
                    FAIL
                            simul: 4
                                    240992 241065 241032 241086 
                            Total: 4
                    QUEUED
                            mdtestfpp: 3
                            iorssf: 2
                            iorfpp: 1
                            fiorandom: 10
                            blogbench: 10
                            mdtestssf: 4
                            fiosequent: 10
                            fiosas: 10
                            kcompile: 10
                            simul: 8
                            Total: 68
                    SUCCESS
                            mdtestssf: 23
                            iorssf: 23
                            iorfpp: 21
                            mdtestfpp: 26
                            simul: 12
                            Total: 105
            Fault injection results:
                    CURRENT
                            mds_failover in progress
                    COMPLETED
                            mds_failover: 5
                            mds_restart: 5
                            Total: 10
            

            Quick early readout.  I don't have application failure rates (Sarah will have to provide that), but I did look at the failovers and evictions today.  Overall failover still looks pretty choppy - quite a few have evictions - but the MDT failover is (tentatively) much improved.

            Over the last month or so, about 50% of the ~320 MDT failovers we did resulted in an aborted recovery.  All or almost all of those were from llog problems.  So far today we've done 9 MDT failovers, with 0 aborts.

            So...  Tentatively, much improved.  We'll have to see what the failure rate is like, and how things shake out overnight.
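            For reference, an abort rate like the ~50% quoted above can be tallied from the attached console logs with a loop such as the following (a sketch; it assumes log files named like the attachments on this ticket and greps for the "recovery is aborted" message seen in the soak-10 log earlier):

            # Count failovers that ended in an aborted recovery, per server console log
            for f in console-soak-*.log*.gz; do
                printf '%s: %s aborted recoveries\n' "$f" "$(zgrep -c 'recovery is aborted' "$f")"
            done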

            pfarrell Patrick Farrell (Inactive) added a comment -

            Sounds perfect, thanks!

            pfarrell Patrick Farrell (Inactive) added a comment -
            sarah Sarah Liu added a comment -

            No, I didn't rerun #92 with the 3600s interval; instead I tried #98 first with removal of the update log, and that helped the pass rate. Then the triage team suggested moving to the tip of b2_10.

            Let me first try the review build; if it doesn't help with the pass rate, then I can move to #92 to make sure it is the latest good build.


            People

              Assignee: pfarrell Patrick Farrell (Inactive)
              Reporter: sarah Sarah Liu
              Votes: 0
              Watchers: 6

              Dates

                Created:
                Updated:
                Resolved: