LU-11943

many input/output errors after soak running for a couple of hours

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.10.7
    • Environment: Lustre version=2.10.6_35_g2bc39a1 lustre-b2_10-ib build #98 EL7.6 mlnx DNE; soak has PFL enabled in this test
    • Severity: 3

    Description

      For the first four or five hours of the run, soak seemed to be working well, but after that applications started to fail frequently with errors like:

      mdtest-1.9.3 was launched with 13 total task(s) on 6 node(s)
      Command line used: /mnt/soaked/bin/mdtest -d /mnt/soaked/soaktest/test/mdtestfpp/233652 -i 5 -n 158 -u
      Path: /mnt/soaked/soaktest/test/mdtestfpp
      FS: 83.4 TiB   Used FS: 1.6%   Inodes: 82.1 Mi   Used Inodes: 0.1%
      02/07/2019 18:13:12: Process 4(soak-27.spirit.whamcloud.com): FAILED in mdtest_stat, unable to stat file: Cannot send after transport endpoint shutdown
      02/07/2019 18:13:29: Process 5(soak-28.spirit.whamcloud.com): FAILED in mdtest_stat, unable to stat file: Cannot send after transport endpoint shutdown
      slurmstepd: error: *** STEP 233652.0 ON soak-18 CANCELLED AT 2019-02-07T18:12:14 ***
      srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
      srun: error: soak-28: task 5: Exited with exit code 1
      srun: Terminating job step 233652.0
      
      task 5 writing /mnt/soaked/soaktest/test/iorssf/233650/ssf
      WARNING: Task 5 requested transfer of 28199936 bytes,
               but transferred 2093056 bytes at offset 17483960320
      WARNING: This file system requires support of partial write()s, in aiori-POSIX.c (line 272).
      WARNING: Requested xfer of 28199936 bytes, but xferred 2093056 bytes
      Only transferred 2093056 of 28199936 bytes
      ** error **
      ERROR in aiori-POSIX.c (line 256): transfer failed.
      ERROR: Input/output error
      ** exiting **
      srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
      srun: error: soak-27: task 5: Exited with exit code 255
      srun: Terminating job step 233650.0
      
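      For reference, a minimal sketch of reproducing the mdtest workload by hand, assuming a Slurm allocation and the mount point shown above (the soak harness wraps this differently):

      # mdtest flags: -i <iterations>, -n <items per task>, -u <unique working dir per task>
      srun -N 6 -n 13 /mnt/soaked/bin/mdtest \
          -d /mnt/soaked/soaktest/test/mdtestfpp/${SLURM_JOB_ID} \
          -i 5 -n 158 -u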

      Around the time the applications failed, the OSS side showed many lock timeout errors:

      Feb  7 01:02:00 soak-4 systemd-logind: New session 30 of user root.
      Feb  7 01:02:00 soak-4 systemd: Started Session 30 of user root.
      Feb  7 01:02:00 soak-4 sshd[97515]: pam_unix(sshd:session): session opened for user root by (uid=0)
      Feb  7 01:02:00 soak-4 sshd[97515]: pam_unix(sshd:session): session closed for user root
      Feb  7 01:02:00 soak-4 systemd-logind: Removed session 30.
      Feb  7 01:02:00 soak-4 systemd: Removed slice User Slice of root.
      Feb  7 01:02:09 soak-4 sshd[97553]: error: Could not load host key: /etc/ssh/ssh_host_dsa_key
      Feb  7 01:02:09 soak-4 sshd[97553]: Accepted publickey for root from 10.10.1.116 port 43696 ssh2: RSA SHA256:VGwjPuk53LIsLKjhGizbClh9X4HNRiAOs+XaQdKAWxM
      Feb  7 01:02:09 soak-4 systemd: Created slice User Slice of root.
      Feb  7 01:02:09 soak-4 systemd-logind: New session 31 of user root.
      Feb  7 01:02:09 soak-4 systemd: Started Session 31 of user root.
      Feb  7 01:02:09 soak-4 sshd[97553]: pam_unix(sshd:session): session opened for user root by (uid=0)
      Feb  7 01:02:10 soak-4 sshd[97553]: Received disconnect from 10.10.1.116 port 43696:11: disconnected by user
      Feb  7 01:02:10 soak-4 sshd[97553]: Disconnected from 10.10.1.116 port 43696
      Feb  7 01:02:10 soak-4 sshd[97553]: pam_unix(sshd:session): session closed for user root
      Feb  7 01:02:10 soak-4 systemd-logind: Removed session 31.
      Feb  7 01:02:10 soak-4 systemd: Removed slice User Slice of root.
      Feb  7 01:03:03 soak-4 kernel: LustreError: 0:0:(ldlm_lockd.c:334:waiting_locks_callback()) ### lock callback timer expired after 151s: evicting client at 192.168.1.128@o2ib  ns: filter-soaked-OST0004_UUID lock: ffff89923febd800/0x2a3d985a1228d475 lrc: 3/0,0 mode: PW/PW res: [0x400000402:0x4cf183:0x0].0x0 rrc: 17 type: EXT [0->18446744073709551615] (req 0->4095) flags: 0x60000000020020 nid: 192.168.1.128@o2ib remote: 0x5c14a0ad2708d24d expref: 6 pid: 27943 timeout: 4295139340 lvb_type: 0
      Feb  7 01:03:19 soak-4 kernel: Lustre: soaked-OST0004: Connection restored to ffd42784-63d2-739c-d487-9f95d4a84e57 (at 192.168.1.128@o2ib)
      Feb  7 01:03:19 soak-4 kernel: Lustre: Skipped 2 previous similar messages
      Feb  7 01:03:44 soak-4 sshd[105089]: error: Could not load host key: /etc/ssh/ssh_host_dsa_key
      Feb  7 01:03:44 soak-4 sshd[105089]: Accepted publickey for root from 10.10.1.116 port 43960 ssh2: RSA SHA256:VGwjPuk53LIsLKjhGizbClh9X4HNRiAOs+XaQdKAWxM
      
      Feb  7 01:08:06 soak-4 kernel: LustreError: 0:0:(ldlm_lockd.c:334:waiting_locks_callback()) ### lock callback timer expired after 152s: evicting client at 192.168.1.127@o2ib  ns: filter-soaked-OST0004_UUID lock: ffff89964ba94200/0x2a3d985a1228d47c lrc: 3/0,0 mode: PW/PW res: [0x400000402:0x4cf183:0x0].0x0 rrc: 14 type: EXT [0->18446744073709551615] (req 0->4095) flags: 0x60000000020020 nid: 192.168.1.127@o2ib remote: 0xd8fc8581f4c8de49 expref: 6 pid: 74474 timeout: 4295441140 lvb_type: 0
      Feb  7 01:08:18 soak-4 kernel: Lustre: soaked-OST0004: Connection restored to 44cc5f6f-1cd3-3d43-f778-9409d7aeb164 (at 192.168.1.127@o2ib)
      Feb  7 01:08:46 soak-4 kernel: Lustre: 26413:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1549501726/real 1549501726]  req@ffff89968a49b000 x1624769706421056/t0(0) o38->soaked-MDT0001-lwp-OST0008@192.168.1.109@o2ib:12/10 lens 520/544 e 0 to 1 dl 1549501742 ref 1 fl Rpc:eXN/0/ffffffff rc 0/-1
      Feb  7 01:08:46 soak-4 kernel: Lustre: 26413:0:(client.c:2114:ptlrpc_expire_one_request()) Skipped 3 previous similar messages
      Feb  7 01:09:36 soak-4 kernel: Lustre: 26413:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1549501776/real 1549501776]  req@ffff89968a498f00 x1624769706421792/t0(0) o38->soaked-MDT0001-lwp-OST000c@192.168.1.109@o2ib:12/10 lens 520/544 e 0 to 1 dl 1549501797 ref 1 fl Rpc:eXN/0/ffffffff rc 0/-1
      Feb  7 01:09:36 soak-4 kernel: Lustre: 26413:0:(client.c:2114:ptlrpc_expire_one_request()) Skipped 3 previous similar messages
      Feb  7 01:09:43 soak-4 kernel: Lustre: soaked-OST0004: haven't heard from client acbada78-7ef5-2938-ae9f-1535dd1eb89f (at 192.168.1.121@o2ib) in 229 seconds. I think it's dead, and I am evicting it. exp ffff89923fa98c00, cur 1549501783 expire 1549501633 last 1549501554
      Feb  7 01:10:01 soak-4 systemd: Created slice User Slice of root.
      Feb  7 01:10:01 soak-4 systemd: Started Session 44 of user root.
      Feb  7 01:10:01 soak-4 CROND[187082]: (root) CMD (/usr/lib64/sa/sa1 1 1)
      
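      For reference, the timeout and lock state behind these evictions can be inspected on the OSS with standard lctl parameters (a diagnostic sketch; exact parameter names vary somewhat by Lustre version):

      lctl get_param timeout    # obd_timeout, which drives the ~150s callback window
      lctl get_param ldlm.namespaces.filter-soaked-OST0004_UUID.lock_count
      dmesg | grep -E 'evict|callback timer expired'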

      Later, on the MDS side, the following errors appeared:

      Feb  7 09:15:03 soak-8 kernel: LNet: Service thread pid 13609 was inactive for 200.68s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      Feb  7 09:15:03 soak-8 kernel: Pid: 13609, comm: mdt00_013 3.10.0-957.el7_lustre.x86_64 #1 SMP Mon Jan 7 20:06:41 UTC 2019
      Feb  7 09:15:03 soak-8 kernel: Call Trace:
      Feb  7 09:15:03 soak-8 kernel: [<ffffffffc0f9ba11>] ldlm_completion_ast+0x5b1/0x920 [ptlrpc]
      Feb  7 09:15:03 soak-8 kernel: [<ffffffffc0f9dabb>] ldlm_cli_enqueue_fini+0x93b/0xdc0 [ptlrpc]
      Feb  7 09:15:03 soak-8 kernel: [<ffffffffc0fa0672>] ldlm_cli_enqueue+0x6c2/0x810 [ptlrpc]
      Feb  7 09:15:03 soak-8 kernel: [<ffffffffc1640622>] osp_md_object_lock+0x172/0x2e0 [osp]
      Feb  7 09:15:03 soak-8 kernel: [<ffffffffc1574ea3>] lod_object_lock+0xf3/0x950 [lod]
      Feb  7 09:15:03 soak-8 kernel: [<ffffffffc15e448e>] mdd_object_lock+0x3e/0xe0 [mdd]
      Feb  7 09:15:03 soak-8 kernel: [<ffffffffc1487fc5>] mdt_remote_object_lock+0x1e5/0x710 [mdt]
      Feb  7 09:15:03 soak-8 kernel: [<ffffffffc1489516>] mdt_object_lock_internal+0x166/0x300 [mdt]
      Feb  7 09:15:03 soak-8 kernel: [<ffffffffc14896f0>] mdt_reint_object_lock+0x20/0x60 [mdt]
      Feb  7 09:15:03 soak-8 kernel: [<ffffffffc14a05f4>] mdt_reint_link+0x7e4/0xc30 [mdt]
      Feb  7 09:15:03 soak-8 kernel: [<ffffffffc14a2c73>] mdt_reint_rec+0x83/0x210 [mdt]
      Feb  7 09:15:03 soak-8 kernel: [<ffffffffc148418b>] mdt_reint_internal+0x5fb/0x9c0 [mdt]
      Feb  7 09:15:03 soak-8 kernel: [<ffffffffc148ff77>] mdt_reint+0x67/0x140 [mdt]
      Feb  7 09:15:03 soak-8 kernel: [<ffffffffc10342aa>] tgt_request_handle+0x92a/0x1370 [ptlrpc]
      Feb  7 09:15:03 soak-8 kernel: [<ffffffffc0fdcd5b>] ptlrpc_server_handle_request+0x23b/0xaa0 [ptlrpc]
      Feb  7 09:15:03 soak-8 kernel: [<ffffffffc0fe04a2>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
      Feb  7 09:15:03 soak-8 kernel: [<ffffffff884c1c31>] kthread+0xd1/0xe0
      Feb  7 09:15:03 soak-8 kernel: [<ffffffff88b74c37>] ret_from_fork_nospec_end+0x0/0x39
      Feb  7 09:15:03 soak-8 kernel: [<ffffffffffffffff>] 0xffffffffffffffff
      Feb  7 09:15:03 soak-8 kernel: LustreError: dumping log to /tmp/lustre-log.1549530903.13609
      Feb  7 09:15:05 soak-8 kernel: LNet: Service thread pid 13582 was inactive for 202.07s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      Feb  7 09:15:05 soak-8 kernel: Pid: 13582, comm: mdt01_007 3.10.0-957.el7_lustre.x86_64 #1 SMP Mon Jan 7 20:06:41 UTC 2019
      Feb  7 09:15:05 soak-8 kernel: Call Trace:
      Feb  7 09:15:05 soak-8 kernel: [<ffffffffc0f9ba11>] ldlm_completion_ast+0x5b1/0x920 [ptlrpc]
      Feb  7 09:15:05 soak-8 kernel: [<ffffffffc0f9dabb>] ldlm_cli_enqueue_fini+0x93b/0xdc0 [ptlrpc]
      Feb  7 09:15:05 soak-8 kernel: [<ffffffffc0fa0672>] ldlm_cli_enqueue+0x6c2/0x810 [ptlrpc]
      Feb  7 09:15:05 soak-8 kernel: [<ffffffffc1640622>] osp_md_object_lock+0x172/0x2e0 [osp]
      Feb  7 09:15:05 soak-8 kernel: [<ffffffffc1574ea3>] lod_object_lock+0xf3/0x950 [lod]
      Feb  7 09:15:05 soak-8 kernel: [<ffffffffc15e448e>] mdd_object_lock+0x3e/0xe0 [mdd]
      Feb  7 09:15:05 soak-8 kernel: [<ffffffffc1487fc5>] mdt_remote_object_lock+0x1e5/0x710 [mdt]
      Feb  7 09:15:05 soak-8 kernel: [<ffffffffc1489516>] mdt_object_lock_internal+0x166/0x300 [mdt]
      Feb  7 09:15:05 soak-8 kernel: [<ffffffffc14896f0>] mdt_reint_object_lock+0x20/0x60 [mdt]
      Feb  7 09:15:05 soak-8 kernel: [<ffffffffc14a05f4>] mdt_reint_link+0x7e4/0xc30 [mdt]
      Feb  7 09:15:05 soak-8 kernel: [<ffffffffc14a2c73>] mdt_reint_rec+0x83/0x210 [mdt]
      Feb  7 09:15:05 soak-8 kernel: [<ffffffffc148418b>] mdt_reint_internal+0x5fb/0x9c0 [mdt]
      Feb  7 09:15:05 soak-8 kernel: [<ffffffffc148ff77>] mdt_reint+0x67/0x140 [mdt]
      Feb  7 09:15:05 soak-8 kernel: [<ffffffffc10342aa>] tgt_request_handle+0x92a/0x1370 [ptlrpc]
      Feb  7 09:15:05 soak-8 kernel: [<ffffffffc0fdcd5b>] ptlrpc_server_handle_request+0x23b/0xaa0 [ptlrpc]
      Feb  7 09:15:05 soak-8 kernel: [<ffffffffc0fe04a2>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
      Feb  7 09:15:05 soak-8 kernel: [<ffffffff884c1c31>] kthread+0xd1/0xe0
      Feb  7 09:15:05 soak-8 kernel: [<ffffffff88b74c37>] ret_from_fork_nospec_end+0x0/0x39
      Feb  7 09:15:05 soak-8 kernel: [<ffffffffffffffff>] 0xffffffffffffffff
      Feb  7 09:15:05 soak-8 kernel: Pid: 18685, comm: mdt00_026 3.10.0-957.el7_lustre.x86_64 #1 SMP Mon Jan 7 20:06:41 UTC 2019
      Feb  7 09:15:05 soak-8 kernel: Call Trace:
      Feb  7 09:15:05 soak-8 kernel: [<ffffffffc0f9ba11>] ldlm_completion_ast+0x5b1/0x920 [ptlrpc]
      Feb  7 09:15:05 soak-8 kernel: [<ffffffffc0f9dabb>] ldlm_cli_enqueue_fini+0x93b/0xdc0 [ptlrpc]
      Feb  7 09:15:05 soak-8 kernel: [<ffffffffc0fa0672>] ldlm_cli_enqueue+0x6c2/0x810 [ptlrpc]
      Feb  7 09:15:05 soak-8 kernel: [<ffffffffc1640622>] osp_md_object_lock+0x172/0x2e0 [osp]
      Feb  7 09:15:05 soak-8 kernel: [<ffffffffc1574ea3>] lod_object_lock+0xf3/0x950 [lod]
      Feb  7 09:15:05 soak-8 kernel: [<ffffffffc15e448e>] mdd_object_lock+0x3e/0xe0 [mdd]
      Feb  7 09:15:05 soak-8 kernel: [<ffffffffc1487fc5>] mdt_remote_object_lock+0x1e5/0x710 [mdt]
      Feb  7 09:15:05 soak-8 kernel: [<ffffffffc1489516>] mdt_object_lock_internal+0x166/0x300 [mdt]
      Feb  7 09:15:05 soak-8 kernel: [<ffffffffc14896f0>] mdt_reint_object_lock+0x20/0x60 [mdt]
      Feb  7 09:15:05 soak-8 kernel: [<ffffffffc14a05f4>] mdt_reint_link+0x7e4/0xc30 [mdt]
      Feb  7 09:15:05 soak-8 kernel: [<ffffffffc14a2c73>] mdt_reint_rec+0x83/0x210 [mdt]
      Feb  7 09:15:05 soak-8 kernel: [<ffffffffc148418b>] mdt_reint_internal+0x5fb/0x9c0 [mdt]
      Feb  7 09:15:05 soak-8 kernel: [<ffffffffc148ff77>] mdt_reint+0x67/0x140 [mdt]
      Feb  7 09:15:05 soak-8 kernel: [<ffffffffc10342aa>] tgt_request_handle+0x92a/0x1370 [ptlrpc]
      Feb  7 09:15:05 soak-8 kernel: [<ffffffffc0fdcd5b>] ptlrpc_server_handle_request+0x23b/0xaa0 [ptlrpc]
      Feb  7 09:15:05 soak-8 kernel: [<ffffffffc0fe04a2>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
      Feb  7 09:15:05 soak-8 kernel: [<ffffffff884c1c31>] kthread+0xd1/0xe0
      Feb  7 09:15:05 soak-8 kernel: [<ffffffff88b74c37>] ret_from_fork_nospec_end+0x0/0x39
      Feb  7 09:15:05 soak-8 kernel: [<ffffffffffffffff>] 0xffffffffffffffff
      Feb  7 09:15:14 soak-8 sshd[78874]: error: Could not load host key: /etc/ssh/ssh_host_dsa_key
      Feb  7 09:15:14 soak-8 sshd[78874]: Accepted publickey for root from 10.10.1.116 port 59708 ssh2: RSA SHA256:VGwjPuk53LIsLKjhGizbClh9X4HNRiAOs+XaQdKAWxM
      Feb  7 09:16:40 soak-8 systemd: Removed slice User Slice of root.
      
      Feb  7 09:16:43 soak-8 kernel: LustreError: 13609:0:(ldlm_request.c:148:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1549530702, 300s ago), entering recovery for soaked-MDT0001_UUID@192.168.1.108@o2ib ns: soaked-MDT0001-osp-MDT0000 lock: ffff92322f761800/0x7ebebf899932ee34 lrc: 4/0,1 mode: --/EX res: [0x24000cf43:0x16f41:0x0].0x0 bits 0x2 rrc: 4 type: IBT flags: 0x1000001000000 nid: local remote: 0xc7baaf97e1b53a16 expref: -99 pid: 13609 timeout: 0 lvb_type: 0
      Feb  7 09:16:43 soak-8 kernel: LustreError: 13609:0:(ldlm_request.c:148:ldlm_expired_completion_wait()) Skipped 2 previous similar messages
      Feb  7 09:16:55 soak-8 sshd[79087]: error: Could not load host key: /etc/ssh/ssh_host_dsa_key
      Feb  7 09:16:56 soak-8 sshd[79087]: Accepted publickey for root from 10.10.1.116 port 59970 ssh2: RSA SHA256:VGwjPuk53LIsLKjhGizbClh9X4HNRiAOs+XaQdKAWxM
      
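      The binary debug dumps referenced above (/tmp/lustre-log.*) can be converted to text with lctl's standard debug_file command, for example:

      lctl debug_file /tmp/lustre-log.1549530903.13609 /tmp/lustre-log.13609.txt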

      Attachments

        1. 240435-mdtestfpp.out
          14 kB
        2. console-soak-10.log-20190224.gz
          515 kB
        3. console-soak-23.log-20190224.gz
          121 kB
        4. soak-10.log-console
          1.13 MB
        5. sys-soak-10.log-20190224.gz
          1.58 MB
        6. sys-soak-23.log-20190224.gz
          1.04 MB


          Activity

            [LU-11943] many input/output errors after soak running for a couple of hours
            sarah Sarah Liu added a comment - - edited

            After the system had been running for about 48 hours, one MDS crashed after it was restarted; I think it hit LU-11781. Also attaching the console log of soak-10, soak-10-console.

            soak-10

            [ 6965.662564] device-mapper: multipath: Reinstating path 8:96.
            [ 6965.669190] sd 7:0:0:44: rdac: array soak-netapp5660-1, ctlr 0, queueing MODE_SELECT command
            [ 6966.303860] sd 7:0:0:44: rdac: array soak-netapp5660-1, ctlr 0, MODE_SELECT completed
            [ 6970.938822] LustreError: 16808:0:(osd_oi.c:734:osd_oi_insert()) dm-6: the FID [0x280043a38:0xebde:0x0] is used by two objects: 163053849/1029500233 283115771/1029500235
            [ 6970.955975] LNet: 12293:0:(o2iblnd_cb.c:396:kiblnd_handle_rx()) PUT_NACK from 192.168.1.109@o2ib
            [ 6971.241968] Lustre: soaked-MDT0003-osp-MDT0002: Connection restored to 192.168.1.111@o2ib (at 192.168.1.111@o2ib)
            [ 6972.374798] LNet: Service thread pid 17480 completed after 341.59s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
            [ 7054.766137] BUG: unable to handle kernel NULL pointer dereference at           (null)
            [ 7054.774915] IP: [<          (null)>]           (null)
            [ 7054.780574] PGD 0 
            [ 7054.782850] Oops: 0010 [#1] SMP 
            [ 7054.786480] Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgc(OE) osd_ldiskfs(OE) ldiskfs(OE) lquota(OE) fid(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) rpcsec_gss_krb5 nfsv4 dns_resolver nfs lockd grace fscache rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_umad(OE) mlx5_ib(OE) ib_uverbs(OE) mlx5_core(OE) mlxfw(OE) mlx4_en(OE) sb_edac intel_powerclamp coretemp intel_rapl iosf_mbi kvm dm_round_robin irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul ipmi_ssif iTCO_wdt glue_helper ablk_helper iTCO_vendor_support sg cryptd pcspkr joydev mei_me ioatdma mei lpc_ich ipmi_si i2c_i801 ipmi_devintf ipmi_msghandler wmi dm_multipath dm_mod auth_rpcgss sunrpc ip_tables ext4 mbcache jbd2 sd_mod crc_t10dif crct10dif_generic mlx4_ib(OE) ib_core(OE) mgag200 drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops igb mlx4_core(OE) ttm mpt3sas ahci isci crct10dif_pclmul crct10dif_common libahci crc32c_intel ptp drm devlink libsas raid_class pps_core libata dca mlx_compat(OE) drm_panel_orientation_quirks scsi_transport_sas i2c_algo_bit
            [ 7054.897558] CPU: 15 PID: 16809 Comm: mdt_out01_007 Kdump: loaded Tainted: G           OE  ------------   3.10.0-957.el7_lustre.x86_64 #1
            [ 7054.911247] Hardware name: Intel Corporation S2600GZ ........../S2600GZ, BIOS SE5C600.86B.01.08.0003.022620131521 02/26/2013
            [ 7054.923780] task: ffff9618ed955140 ti: ffff9618dc770000 task.ti: ffff9618dc770000
            [ 7054.932139] RIP: 0010:[<0000000000000000>]  [<          (null)>]           (null)
            [ 7054.940496] RSP: 0018:ffff9618dc773b98  EFLAGS: 00010246
            [ 7054.946427] RAX: 0000000000000000 RBX: ffff961cdf4601e0 RCX: 0000000000000002
            [ 7054.954392] RDX: ffffffffc12bf7e0 RSI: ffff9618dc773ba0 RDI: ffff961cdf4601e8
            [ 7054.962358] RBP: ffff9618dc773bd8 R08: 0000000000000004 R09: 0000000000000000
            [ 7054.970325] R10: 0000000000000001 R11: 00000000007fffff R12: ffff9619b9a5eb00
            [ 7054.978291] R13: ffff961910518948 R14: ffff961cbd713800 R15: ffff961cdf4601e8
            [ 7054.986260] FS:  0000000000000000(0000) GS:ffff961d2d9c0000(0000) knlGS:0000000000000000
            [ 7054.995292] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
            [ 7055.001707] CR2: 0000000000000000 CR3: 000000003f810000 CR4: 00000000000607e0
            [ 7055.009663] Call Trace:
            [ 7055.012418]  [<ffffffffc12bde7e>] ? osd_ldiskfs_it_fill+0xbe/0x260 [osd_ldiskfs]
            [ 7055.020706]  [<ffffffffc12be057>] osd_it_ea_load+0x37/0x100 [osd_ldiskfs]
            [ 7055.028356]  [<ffffffffc0ca53c8>] dt_index_walk+0xf8/0x430 [obdclass]
            [ 7055.035596]  [<ffffffffc0ca5700>] ? dt_index_walk+0x430/0x430 [obdclass]
            [ 7055.043127]  [<ffffffffc0ca67dd>] dt_index_read+0x39d/0x6b0 [obdclass]
            [ 7055.050525]  [<ffffffffc0f2bd32>] tgt_obd_idx_read+0x612/0x860 [ptlrpc]
            [ 7055.057982]  [<ffffffffc0f2e38a>] tgt_request_handle+0x92a/0x1370 [ptlrpc]
            [ 7055.065720]  [<ffffffffc0ed6e3b>] ptlrpc_server_handle_request+0x23b/0xaa0 [ptlrpc]
            [ 7055.074343]  [<ffffffffc0ed3468>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
            [ 7055.081929]  [<ffffffffa18d67c2>] ? default_wake_function+0x12/0x20
            [ 7055.088943]  [<ffffffffa18cba9b>] ? __wake_up_common+0x5b/0x90
            [ 7055.095532]  [<ffffffffc0eda582>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
            [ 7055.102601]  [<ffffffffc0ed9af0>] ? ptlrpc_register_service+0xe30/0xe30 [ptlrpc]
            [ 7055.110867]  [<ffffffffa18c1c31>] kthread+0xd1/0xe0
            [ 7055.116316]  [<ffffffffa18c1b60>] ? insert_kthread_work+0x40/0x40
            [ 7055.123126]  [<ffffffffa1f74c37>] ret_from_fork_nospec_begin+0x21/0x21
            [ 7055.130416]  [<ffffffffa18c1b60>] ? insert_kthread_work+0x40/0x40
            [ 7055.137216] Code:  Bad RIP value.
            [ 7055.140957] RIP  [<          (null)>]           (null)
            [ 7055.146714]  RSP <ffff9618dc773b98>
            [ 7055.150607] CR2: 0000000000000000
            [    0.000000] Initializing cgroup subsys cpuset
            [    0.000000] Initializing cgroup subsys cpu
            [    0.000000] Initializing cgroup subsys cpuacct
            
            sarah Sarah Liu added a comment -

            Here is an update on patch #34346 tested on soak (running about 48 hours): 200 fail / 1192 pass.
            I did see some application failures with input/output errors; here is an example failure:

            The application mdtestfpp failed with:

            -- started at 03/02/2019 01:10:03 --
            
            mdtest-1.9.3 was launched with 15 total task(s) on 15 node(s)
            Command line used: /mnt/soaked/bin/mdtest -d /mnt/soaked/soaktest/test/mdtestfpp/242297 -i 5 -n 467 -u
            Path: /mnt/soaked/soaktest/test/mdtestfpp
            FS: 83.4 TiB   Used FS: 7.9%   Inodes: 77.7 Mi   Used Inodes: 1.1%
            srun: error: soak-25: task 1: Exited with exit code 1
            03/02/2019 01:41:10: Process 1(soak-25.spirit.whamcloud.com): FAILED in create_remove_items_helper, unable to remove directory: Input/output error
            srun: Terminating job step 242297.0
            slurmstepd: error: *** STEP 242297.0 ON soak-24 CANCELLED AT 2019-03-02T01:41:04 ***
            srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
            

            Checking the syslog of client node soak-24, it was evicted by MDS1, which is soak-9:

            Mar  2 01:40:48 soak-24 kernel: Lustre: soaked-MDT0001-mdc-ffff9eb348691000: Connection to soaked-MDT0001 (at 192.168.1.108@o2ib) was lost; in progress operations using this service will wait for recovery to complete
            Mar  2 01:41:08 soak-24 kernel: LustreError: 167-0: soaked-MDT0001-mdc-ffff9eb348691000: This client was evicted by soaked-MDT0001; in progress operations using this service will fail.
            

            Checking soak-9 around the same time:

            Mar  2 01:40:03 soak-9 systemd-logind: New session 4 of user root.
            Mar  2 01:40:03 soak-9 systemd: Started Session 4 of user root.
            Mar  2 01:40:03 soak-9 sshd[12203]: pam_unix(sshd:session): session opened for user root by (uid=0)
            Mar  2 01:40:03 soak-9 kernel: LNet: HW NUMA nodes: 2, HW CPU cores: 32, npartitions: 2
            Mar  2 01:40:03 soak-9 kernel: alg: No test for adler32 (adler32-zlib)
            Mar  2 01:40:04 soak-9 kernel: Lustre: Lustre: Build Version: 2.10.6_64_g56b299e
            Mar  2 01:40:04 soak-9 kernel: LNet: Added LNI 192.168.1.109@o2ib [8/256/0/180]
            Mar  2 01:40:05 soak-9 kernel: LDISKFS-fs (dm-1): mounted filesystem with ordered data mode. Opts: user_xattr,errors=remount-ro,user_xattr,no_mbcache,nodelalloc
            Mar  2 01:40:06 soak-9 kernel: Lustre: soaked-MDT0001: Imperative Recovery enabled, recovery window shrunk from 300-900 down to 150-900
            Mar  2 01:40:06 soak-9 kernel: LustreError: 12482:0:(llog_osd.c:978:llog_osd_next_block()) soaked-MDT0002-osp-MDT0001: missed desired record? 17 > 1
            Mar  2 01:40:06 soak-9 kernel: LustreError: 12482:0:(lod_dev.c:419:lod_sub_recovery_thread()) soaked-MDT0002-osp-MDT0001 getting update log failed: rc = -2
            Mar  2 01:40:06 soak-9 sshd[12203]: pam_unix(sshd:session): session closed for user root
            Mar  2 01:40:06 soak-9 systemd-logind: Removed session 4.
            Mar  2 01:40:06 soak-9 systemd: Removed slice User Slice of root.
            Mar  2 01:40:12 soak-9 kernel: Lustre: soaked-MDT0001: Connection restored to 8ca4d6ac-9a33-9228-6f4a-bf7741f4f05f (at 192.168.1.117@o2ib)
            Mar  2 01:40:12 soak-9 kernel: Lustre: 12484:0:(ldlm_lib.c:2061:target_recovery_overseer()) recovery is aborted, evict exports in recovery
            Mar  2 01:40:12 soak-9 kernel: Lustre: soaked-MDT0001: disconnecting 26 stale clients
            
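            For reference, the recovery state of the restarted MDT (including the "evict exports in recovery" abort above) can be checked with the standard recovery_status parameter:

            lctl get_param mdt.soaked-MDT0001.recovery_status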

            gerrit Gerrit Updater added a comment -

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/34346/
            Subject: LU-11943 llog: Reset current log on ENOSPC
            Project: fs/lustre-release
            Branch: b2_10
            Current Patch Set:
            Commit: 0cba3cc14fa68b8f43833abfb6f7bc3e547bce03
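            The merged fix can be inspected locally from the lustre-release tree (standard Whamcloud git URL; commit hash from the notice above):

            git clone git://git.whamcloud.com/fs/lustre-release.git
            cd lustre-release
            git show 0cba3cc14fa68b8f43833abfb6f7bc3e547bce03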


            pfarrell Patrick Farrell (Inactive) added a comment -

            Sarah,

            Just reviewed with Minh.

            He made a mistake yesterday.  4241 is an unrelated build.
            The one we need is:
            https://build.whamcloud.com/job/lustre-reviews-ib/4248/

            Sorry.  Are you able to install and test that on Soak?

            sarah Sarah Liu added a comment - - edited

            The Lustre version is:

            ----------------
            soak-[4-11,17-31,35-43]
            ----------------
            version=2.10.6_64_g155f4e8
            

            lustre-reviews-ib build #4241 shows the same version; unfortunately, that build is gone from Jenkins.

            We don't assert the server; failover is done by hard reboot. If you want to check the soak code, it is under the path shown below (see the failover sketch after it):

            [root@soak-16 fsmgmt]# pwd
            /scratch/soak-merge/soak-git/soak-test-harness/harness/fsmgmt
            [root@soak-16 fsmgmt]# less builtin.py
            
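            For context, a hypothetical sketch of driving a hard-reboot failover like this (the ipmitool invocation, BMC hostname, and credentials are placeholders; builtin.py may use a different mechanism):

            # Power-cycle the primary MDS out from under Lustre, then mount the
            # MDT on its failover partner once the node is confirmed down.
            ipmitool -I lanplus -H soak-9-bmc -U admin -P secret power cycle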

            pfarrell Patrick Farrell (Inactive) added a comment -

            Would we know if the MDS were asserting during shutdown/failover? How do we cause the MDS failover - what exactly do we do to it?


            pfarrell Patrick Farrell (Inactive) added a comment -

            Woah. That's very confusing. Unless we made a mistake with the build, that should be impossible, but I do see the errors.

            I'll see what I can figure out.

            sarah Sarah Liu added a comment -

            Update on lustre-reviews-ib build #4241 (https://review.whamcloud.com/34346):

            After 17 hours running with only mds_failover and mds_restart, the fail/pass rate is 97 fail / 453 pass. With this modified patch, the pass rate was still worse than with the previous revert patch.

            Checking one of the failed applications, mdtestfpp.out:

            lmv_stripe_count: 0 lmv_stripe_offset: 0 lmv_hash_type: none
            lmv_stripe_count: 0 lmv_stripe_offset: -1 lmv_hash_type: none
            srun: Warning: can't honor --ntasks-per-node set to 2 which doesn't match the requested tasks 12 with the number of requested nodes 8. Ignoring --ntasks-per-node.
            -- started at 03/01/2019 12:45:51 --
            
            mdtest-1.9.3 was launched with 12 total task(s) on 6 node(s)
            Command line used: /mnt/soaked/bin/mdtest -d /mnt/soaked/soaktest/test/mdtestfpp/241953 -i 5 -n 105 -u
            03/01/2019 12:49:18: Process 0(): FAILED in show_file_system_size, unable to statfs() file system: Input/output error
            --------------------------------------------------------------------------
            MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD 
            with errorcode 1.
            
            NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
            You may or may not see output from other processes, depending on
            exactly when Open MPI kills them.
            --------------------------------------------------------------------------
            srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
            slurmstepd: error: *** STEP 241953.0 ON soak-28 CANCELLED AT 2019-03-01T12:49:18 ***
            

            Then checking the client soak-28 syslog around 3/1 12:49, it was evicted by MDS #2 (soak-10):

            Mar  1 12:46:46 soak-28 kernel: Lustre: 10153:0:(client.c:2116:ptlrpc_expire_one_request()) Skipped 2 previous similar messages
            Mar  1 12:47:05 soak-28 kernel: LNet: 10142:0:(o2iblnd_cb.c:3192:kiblnd_check_conns()) Timed out tx for 192.168.1.110@o2ib: 6 seconds
            Mar  1 12:47:56 soak-28 kernel: LNet: 10142:0:(o2iblnd_cb.c:3192:kiblnd_check_conns()) Timed out tx for 192.168.1.110@o2ib: 0 seconds
            Mar  1 12:49:16 soak-28 kernel: LustreError: 65561:0:(lmv_obd.c:1387:lmv_statfs()) can't stat MDS #2 (soaked-MDT0002-mdc-ffff98d1ac0b5000), error -19
            Mar  1 12:49:16 soak-28 kernel: LustreError: 65561:0:(lmv_obd.c:1387:lmv_statfs()) Skipped 1 previous similar message
            Mar  1 12:49:16 soak-28 kernel: LustreError: 65561:0:(llite_lib.c:1759:ll_statfs_internal()) md_statfs fails: rc = -19
            Mar  1 12:49:17 soak-28 kernel: LustreError: 65561:0:(llite_lib.c:1759:ll_statfs_internal()) Skipped 3 previous similar messages
            Mar  1 12:49:18 soak-28 kernel: LustreError: 167-0: soaked-MDT0002-mdc-ffff98d1ac0b5000: This client was evicted by soaked-MDT0002; in progress operations using this service will fail.
            Mar  1 12:49:18 soak-28 kernel: LustreError: 65550:0:(lmv_obd.c:1387:lmv_statfs()) can't stat MDS #2 (soaked-MDT0002-mdc-ffff98d1ac0b5000), error -5
            Mar  1 12:49:18 soak-28 kernel: Lustre: soaked-MDT0002-mdc-ffff98d1ac0b5000: Connection restored to 192.168.1.110@o2ib (at 192.168.1.110@o2ib)
            Mar  1 12:49:18 soak-28 kernel: LustreError: 65550:0:(llite_lib.c:1759:ll_statfs_internal()) md_statfs fails: rc = -5
            Mar  1 12:49:20 soak-28 sshd[65650]: error: Could not load host key: /etc/ssh/ssh_host_dsa_key
            Mar  1 12:49:20 soak-28 sshd[65650]: Accepted publickey for root from 10.10.1.116 port 36250 ssh2: RSA SHA256:VGwjPuk53LIsLKjhGizbClh9X4HNRiAOs+XaQdKAWxM
            Mar  1 12:49:20 soak-28 systemd: Created slice User Slice of root.
            Mar  1 12:49:20 soak-28 systemd-logind: New session 1419 of user root.
            Mar  1 12:49:20 soak-28 systemd: Started Session 1419 of user root.
            Mar  1 12:49:20 soak-28 sshd[65650]: pam_unix(sshd:session): session opened for user root by (uid=0)
            Mar  1 12:49:21 soak-28 sshd[65650]: pam_unix(sshd:session): session closed for user root
            Mar  1 12:49:21 soak-28 systemd-logind: Removed session 1419.
            Mar  1 12:49:21 soak-28 systemd: Removed slice User Slice of root.
            Mar  1 12:49:30 soak-28 sshd[65684]: error: Could not load host key: /etc/ssh/ssh_host_dsa_key
            Mar  1 12:49:30 soak-28 sshd[65684]: Accepted publickey for root from 10.10.1.116 port 36388 ssh2: RSA SHA256:VGwjPuk53LIsLKjhGizbClh9X4HNRiAOs+XaQdKAWxM
            Mar  1 12:49:30 soak-28 systemd: Created slice User Slice of root.
            Mar  1 12:49:30 soak-28 systemd-logind: New session 1420 of user root.
            Mar  1 12:49:30 soak-28 systemd: Started Session 1420 of user root.
            Mar  1 12:49:30 soak-28 sshd[65684]: pam_unix(sshd:session): session opened for user root by (uid=0)
            Mar  1 12:49:30 soak-28 sshd[65684]: Received disconnect from 10.10.1.116 port 36388:11: disconnected by user
            Mar  1 12:49:30 soak-28 sshd[65684]: Disconnected from 10.10.1.116 port 36388
            Mar  1 12:49:30 soak-28 sshd[65684]: pam_unix(sshd:session): session closed for user root
            Mar  1 12:49:30 soak-28 systemd-logind: Removed session 1420.
            Mar  1 12:49:30 soak-28 systemd: Removed slice User Slice of root.
            Mar  1 12:49:56 soak-28 kernel: LustreError: 65742:0:(llite_lib.c:2345:ll_prep_inode()) new_inode -fatal: rc -2
            Mar  1 12:49:57 soak-28 kernel: LustreError: 65772:0:(llite_lib.c:2345:ll_prep_inode()) new_inode -fatal: rc -2
            Mar  1 12:49:57 soak-28 kernel: LustreError: 65772:0:(llite_lib.c:2345:ll_prep_inode()) Skipped 10 previous similar messages
            

            Checking soak-10 around the same time:

            Mar  1 12:47:37 soak-10 multipathd: 360080e50001fedb80000015752012949: remaining active paths: 1
            Mar  1 12:47:37 soak-10 kernel: sd 0:0:0:44: rdac: array soak-netapp5660-1, ctlr 0, queueing MODE_SELECT command
            Mar  1 12:47:37 soak-10 kernel: Lustre: soaked-MDT0002: Not available for connect from 192.168.1.123@o2ib (not set up)
            Mar  1 12:47:38 soak-10 kernel: sd 0:0:0:44: rdac: array soak-netapp5660-1, ctlr 0, MODE_SELECT completed
            Mar  1 12:47:38 soak-10 kernel: Lustre: soaked-MDT0002: Imperative Recovery enabled, recovery window shrunk from 300-900 down to 150-900
            Mar  1 12:47:38 soak-10 kernel: LustreError: 13072:0:(llog.c:588:llog_process_thread()) soaked-MDT0003-osp-MDT0002: [0x2:0xc0030980:0x2] Invalid record: index 6 but expected 2
            Mar  1 12:47:38 soak-10 kernel: LustreError: 13072:0:(lod_dev.c:419:lod_sub_recovery_thread()) soaked-MDT0003-osp-MDT0002 getting update log failed: rc = -34
            Mar  1 12:47:39 soak-10 sshd[12270]: pam_unix(sshd:session): session closed for user root
            Mar  1 12:47:39 soak-10 systemd-logind: Removed session 3.
            Mar  1 12:47:39 soak-10 systemd: Removed slice User Slice of root.
            Mar  1 12:47:39 soak-10 kernel: Lustre: soaked-MDT0002: Connection restored to b61568ce-eadb-273d-d151-58976b27b2a3 (at 192.168.1.122@o2ib)
            Mar  1 12:47:39 soak-10 kernel: Lustre: 13073:0:(ldlm_lib.c:2061:target_recovery_overseer()) recovery is aborted, evict exports in recovery
            Mar  1 12:47:39 soak-10 kernel: Lustre: soaked-MDT0002: disconnecting 16 stale clients
            Mar  1 12:47:41 soak-10 kernel: Lustre: soaked-MDT0002: Connection restored to 4ae090bf-6bf9-e882-df72-d1d71f1e04c8 (at 192.168.1.120@o2ib)
            Mar  1 12:47:42 soak-10 multipathd: 360080e50001fedb80000015752012949: sdg - rdac checker reports path is up
            Mar  1 12:47:42 soak-10 multipathd: 8:96: reinstated
            Mar  1 12:47:42 soak-10 multipathd: 360080e50001fedb80000015752012949: load table [0 31247843328 multipath 2 queue_if_no_path retain_attached_hw_handler 1 rdac 1 1 round-robin 0 1 1 8:96 1]
            Mar  1 12:47:44 soak-10 kernel: Lustre: soaked-MDT0002: Connection restored to d4732544-d66c-f143-5fc3-3aff0109b21f (at 192.168.1.135@o2ib)
            Mar  1 12:47:46 soak-10 kernel: Lustre: soaked-MDT0002: Connection restored to  (at 192.168.1.124@o2ib)
            Mar  1 12:47:46 soak-10 kernel: Lustre: Skipped 13 previous similar messages
            Mar  1 12:47:47 soak-10 multipathd: 360080e50001fedb80000015752012949: sdg - rdac checker reports path is ghost
            Mar  1 12:47:47 soak-10 multipathd: 8:96: reinstated
            Mar  1 12:47:48 soak-10 kernel: sd 0:0:0:44: [sdg] FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
            Mar  1 12:47:48 soak-10 kernel: sd 0:0:0:44: [sdg] Sense Key : Illegal Request [current] 
            Mar  1 12:47:48 soak-10 kernel: sd 0:0:0:44: [sdg] <<vendor>>ASC=0x94 ASCQ=0x1 
            Mar  1 12:47:48 soak-10 kernel: sd 0:0:0:44: [sdg] CDB: Write(16) 8a 00 00 00 00 00 00 b1 23 78 00 00 00 08 00 00
            Mar  1 12:47:48 soak-10 kernel: blk_update_request: I/O error, dev sdg, sector 11608952
            Mar  1 12:47:48 soak-10 kernel: device-mapper: multipath: Failing path 8:96.
            Mar  1 12:47:48 soak-10 multipathd: sdg: mark as failed
            
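            For reference, the negative return codes in these logs are standard kernel errno values:

            #  -2   ENOENT  ("new_inode -fatal: rc -2")
            #  -5   EIO     (statfs failing after the eviction)
            # -19   ENODEV  (MDC not set up while soak-10 restarted)
            # -34   ERANGE  ("getting update log failed: rc = -34")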

            pfarrell Patrick Farrell (Inactive) added a comment -

            Changed fixes to associate with this ticket.

            Master:
            https://review.whamcloud.com/#/c/34347/

            b2_10:

            https://review.whamcloud.com/#/c/34346/

             


            pfarrell Patrick Farrell (Inactive) added a comment -

            Suggested fix is here:
            https://review.whamcloud.com/34346

            I'll push it for master shortly.


            pfarrell Patrick Farrell (Inactive) added a comment -

            We are now up to 54 failovers without an abort. I'll update the other bug - I'm not sure how, but that patch is causing a lot of trouble.


            People

              Assignee: pfarrell Patrick Farrell (Inactive)
              Reporter: sarah Sarah Liu
