[LU-9340] PFL fails performance tests Spirit Created: 13/Apr/17 Updated: 20/May/17 Resolved: 20/May/17 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.10.0 |
| Fix Version/s: | Lustre 2.10.0 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Cliff White (Inactive) | Assignee: | Jinshan Xiong (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | pfl | ||
| Environment: |
Spirit performance cluster |
||
| Severity: | 3 | ||||||||||||||||
| Rank (Obsolete): | 9223372036854775807 | ||||||||||||||||
| Description |
|
Attempting to run P02 and P03 performance tests, with striping set as:
Immediate MPI failures with IOR:
Commencing write performance test: Thu Apr 13 21:04:16 2017
024: ior ERROR: write() failed, errno 61, No data available (aiori-POSIX.c:335)
024: --------------------------------------------------------------------------
024: MPI_ABORT was invoked on rank 24 in communicator MPI_COMM_WORLD --
..........
231: ior ERROR: write() failed, errno 61, No data available (aiori-POSIX.c:335)
088: In: PMI_Abort(-1, N/A)
287: ior ERROR: write() failed, errno 61, No data available (aiori-POSIX.c:335)
134: In: PMI_Abort(-1, N/A)
057: --------------------------------------------------------------------------
057: MPI_ABORT was invoked on rank 57 in communicator MPI_COMM_WORLD --
057:
057: NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
057: You may or may not see output from other processes, depending on
057: exactly when Open MPI kills them.
057: -------------------------------------------
Lustre Errors on all nodes attached. |
| Comments |
| Comment by Jinshan Xiong (Inactive) [ 14/Apr/17 ] |
|
From the command above, it seems you only wanted to put the first component in '$ior_ostPool'. Is this exactly what you want, or do you actually want all objects in the pool? |
| Comment by Cliff White (Inactive) [ 17/Apr/17 ] |
|
Need to have all objects in the pool of course. Please correct the syntax as needed. |
| Comment by Jinshan Xiong (Inactive) [ 17/Apr/17 ] |
|
For the record: the --pool option has to be added for each -E in order to allocate objects from the corresponding OST pool. |
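The rule above can be sketched as a small helper that assembles an `lfs setstripe` command line with `--pool` repeated after every `-E`. The original command is not shown in this ticket, so the helper name `build_setstripe_cmd`, the pool name, the component boundaries, and the directory path are all hypothetical; only the `-E`, `-c`, and `--pool` options come from `lfs setstripe` itself.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* One PFL component: its end offset (-E) and stripe count (-c). */
struct pfl_component {
    const char *end;          /* e.g. "4M", or "-1" for end of file */
    int         stripe_count; /* -1 means stripe over all available OSTs */
};

/*
 * Build an "lfs setstripe" command line in which every -E component
 * carries its own --pool option. Returns the string length, or -1 if
 * buf is too small. (Hypothetical helper, for illustration only.)
 */
int build_setstripe_cmd(char *buf, size_t len, const char *pool,
                        const struct pfl_component *comp, size_t ncomp,
                        const char *path)
{
    size_t used;
    int n;

    assert(buf != NULL && pool != NULL && path != NULL);

    n = snprintf(buf, len, "lfs setstripe");
    if (n < 0 || (size_t)n >= len)
        return -1;
    used = (size_t)n;

    /* Repeat --pool for each component so none falls outside the pool. */
    for (size_t i = 0; i < ncomp; i++) {
        n = snprintf(buf + used, len - used, " -E %s -c %d --pool %s",
                     comp[i].end, comp[i].stripe_count, pool);
        if (n < 0 || (size_t)n >= len - used)
            return -1;
        used += (size_t)n;
    }

    n = snprintf(buf + used, len - used, " %s", path);
    if (n < 0 || (size_t)n >= len - used)
        return -1;
    return (int)(used + (size_t)n);
}
```

For two components ending at 4M and at end of file, with an assumed pool `mypool`, this produces `lfs setstripe -E 4M -c 1 --pool mypool -E -1 -c -1 --pool mypool /mnt/lustre/ior`.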
| Comment by Jinshan Xiong (Inactive) [ 17/Apr/17 ] |
|
The kernel crashed while running the performance test:
Apr 17 17:04:34 spirit-28 systemd-logind: Removed session c259.
Apr 17 17:04:45 spirit-28 kernel: LustreError: 3782:0:(osc_object.c:393:osc_req_attr_set()) page@ffff880ba8ebce00[2 ffff8807e85adc90 3 1 (null)]
Apr 17 17:04:45 spirit-28 kernel: LustreError: 3783:0:(osc_object.c:393:osc_req_attr_set()) page@ffff880b506dde00[2 ffff8807e85adc90 3 1 (null)]
Apr 17 17:04:45 spirit-28 kernel: LustreError: 3783:0:(osc_object.c:393:osc_req_attr_set()) vvp-page@ffff880b506dde50(1:0) vm@ffffea00367ef600 6fffff00000821 2:0 ffff880b506dde00 16640 lru
Apr 17 17:04:45 spirit-28 kernel: LustreError: 3783:0:(osc_object.c:393:osc_req_attr_set()) lov-page@ffff880b506dde90, raid0
Apr 17 17:04:45 spirit-28 kernel: LustreError: 3783:0:(osc_object.c:393:osc_req_attr_set()) osc-page@ffff880b506ddef8 4096: 1< 0x845fed 257 0 + + > 2< 16777216 0 4096 0x7 0x108 | (null) ffff88100cad6170 ffff88101a113310 > 3< 1 6 0 > 4< 0 0 8 33579008 - | - - - + > 5< - - - + | 0 - | 0 - ->
Apr 17 17:04:45 spirit-28 kernel: LustreError: 3783:0:(osc_object.c:393:osc_req_attr_set()) end page@ffff880b506dde00
Apr 17 17:04:45 spirit-28 kernel: LustreError: 3783:0:(osc_object.c:393:osc_req_attr_set()) uncovered page!
Apr 17 17:04:45 spirit-28 kernel: BUG: unable to handle kernel NULL pointer dereference at 0000000000000016
Apr 17 17:04:45 spirit-28 kernel: IP: [<ffffffffa0b66836>] ldlm_resource_dump+0x86/0x530 [ptlrpc]
Apr 17 17:04:45 spirit-28 kernel: PGD 0
Apr 17 17:04:45 spirit-28 kernel: Oops: 0000 [#1] SMP
Apr 17 17:04:45 spirit-28 kernel: Modules linked in: osc(OE) mgc(OE) lustre(OE) lmv(OE) fld(OE) mdc(OE) fid(OE) lov(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm mlx4_ib ib_core intel_powerclamp coretemp kvm_intel kvm irqbypass crc32_pclmul ghash_clmulni_intel ipmi_devintf aesni_intel ipmi_ssif lrw gf128mul glue_helper ablk_helper cryptd mei_me ipmi_si mei ipmi_msghandler sg iTCO_wdt sb_edac iTCO_vendor_support edac_core ioatdma lpc_ich shpchp wmi pcspkr i2c_i801 nfsd nfs_acl lockd grace auth_rpcgss sunrpc ip_tables ext4 mbcache jbd2 mlx4_en sd_mod crc_t10dif crct10dif_generic
Apr 17 17:04:45 spirit-28 kernel: mgag200 drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops isci igb ttm libsas crct10dif_pclmul crct10dif_common ptp ahci crc32c_intel scsi_transport_sas libahci pps_core drm mlx4_core libata dca i2c_algo_bit devlink i2c_core fjes
Apr 17 17:04:45 spirit-28 kernel: CPU: 23 PID: 3783 Comm: ptlrpcd_01_05 Tainted: G OE ------------ 3.10.0-514.10.2.el7.x86_64 #1
Apr 17 17:04:45 spirit-28 kernel: Hardware name: Intel Corporation S2600GZ/S2600GZ, BIOS SE5C600.86B.02.01.0002.082220131453 08/22/2013
Apr 17 17:04:45 spirit-28 kernel: task: ffff880805458fb0 ti: ffff880805464000 task.ti: ffff880805464000
Apr 17 17:04:45 spirit-28 kernel: RIP: 0010:[<ffffffffa0b66836>] [<ffffffffa0b66836>] ldlm_resource_dump+0x86/0x530 [ptlrpc]
Apr 17 17:04:45 spirit-28 kernel: RSP: 0018:ffff880805467990 EFLAGS: 00010206
Apr 17 17:04:45 spirit-28 kernel: RAX: 00000000ffffffff RBX: 0000000000020000 RCX: 0000000000020000
Apr 17 17:04:45 spirit-28 kernel: RDX: 00000000ffffffff RSI: ffffffffa0bf7170 RDI: ffffffffa0c3a0c0
Apr 17 17:04:45 spirit-28 kernel: RBP: ffff8808054679d8 R08: 0000000000000003 R09: ffff880805467e58
Apr 17 17:04:45 spirit-28 kernel: R10: 00000000000d8948 R11: 0000000000100000 R12: ffff880b506ddef8
Apr 17 17:04:45 spirit-28 kernel: R13: ffff880805633400 R14: fffffffffffffffe R15: ffff88101a113310
Apr 17 17:04:45 spirit-28 kernel: FS: 0000000000000000(0000) GS:ffff88101dac0000(0000) knlGS:0000000000000000
Apr 17 17:04:45 spirit-28 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Apr 17 17:04:45 spirit-28 kernel: CR2: 0000000000000016 CR3: 00000000019ba000 CR4: 00000000001407e0
Apr 17 17:04:45 spirit-28 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Apr 17 17:04:45 spirit-28 kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Apr 17 17:04:45 spirit-28 kernel: Stack:
Apr 17 17:04:45 spirit-28 kernel: ffff8808054679e8 ffff88100cbae000 000200000000004c fffffffffffffffe
Apr 17 17:04:45 spirit-28 kernel: ffff88100b2d3c70 ffff880b506ddef8 ffff880805633400 ffff88100c713430
Apr 17 17:04:45 spirit-28 kernel: ffff88101a113310 ffff880805467a18 ffffffffa0e55da1 ffff880805467e58
Apr 17 17:04:45 spirit-28 kernel: Call Trace:
Apr 17 17:04:45 spirit-28 kernel: [<ffffffffa0e55da1>] osc_req_attr_set+0x4c1/0x720 [osc]
Apr 17 17:04:45 spirit-28 kernel: [<ffffffffa09ccea0>] cl_req_attr_set+0x60/0x150 [obdclass]
Apr 17 17:04:45 spirit-28 kernel: [<ffffffffa0e50813>] osc_build_rpc+0x463/0xfb0 [osc]
Apr 17 17:04:45 spirit-28 kernel: [<ffffffffa0e69290>] osc_io_unplug0+0xb70/0x1950 [osc]
Apr 17 17:04:45 spirit-28 kernel: [<ffffffffa0e6ac70>] osc_io_unplug+0x10/0x20 [osc]
Apr 17 17:04:45 spirit-28 kernel: [<ffffffffa0e46491>] brw_queue_work+0x31/0xd0 [osc]
Apr 17 17:04:45 spirit-28 kernel: [<ffffffffa0b921f7>] work_interpreter+0x37/0xf0 [ptlrpc]
Apr 17 17:04:45 spirit-28 kernel: [<ffffffffa0b8eed5>] ptlrpc_check_set.part.23+0x425/0x1dd0 [ptlrpc]
Apr 17 17:04:45 spirit-28 kernel: [<ffffffffa0b908db>] ptlrpc_check_set+0x5b/0xe0 [ptlrpc]
Apr 17 17:04:45 spirit-28 kernel: [<ffffffffa0bbc04b>] ptlrpcd_check+0x4db/0x5d0 [ptlrpc]
Apr 17 17:04:45 spirit-28 kernel: [<ffffffffa0bbc3fb>] ptlrpcd+0x2bb/0x560 [ptlrpc]
Apr 17 17:04:45 spirit-28 kernel: [<ffffffff810c5080>] ? wake_up_state+0x20/0x20
Apr 17 17:04:45 spirit-28 kernel: [<ffffffffa0bbc140>] ? ptlrpcd_check+0x5d0/0x5d0 [ptlrpc]
Apr 17 17:04:45 spirit-28 kernel: [<ffffffff810b06ff>] kthread+0xcf/0xe0
Apr 17 17:04:45 spirit-28 kernel: [<ffffffff810b0630>] ? kthread_create_on_node+0x140/0x140
Apr 17 17:04:45 spirit-28 kernel: [<ffffffff81696a58>] ret_from_fork+0x58/0x90
Apr 17 17:04:45 spirit-28 kernel: [<ffffffff810b0630>] ? kthread_create_on_node+0x140/0x140
Apr 17 17:04:45 spirit-28 kernel: Code: 00 00 00 00 48 c7 c7 c0 a0 c3 a0 c7 05 b0 38 0d 00 00 00 01 00 48 c7 05 95 38 0d 00 10 70 bf a0 48 c7 05 92 38 0d 00 60 41 be a0 <41> 8b 46 18 4d 8b 4e 68 4d 8d 66 20 4d 8b 46 60 49 8b 4e 58 49
Apr 17 17:04:45 spirit-28 kernel: RIP [<ffffffffa0b66836>] ldlm_resource_dump+0x86/0x530 [ptlrpc]
Apr 17 17:04:45 spirit-28 kernel: RSP <ffff880805467990>
Apr 17 17:04:45 spirit-28 kernel: CR2: 0000000000000016
First it failed to find a lock covering a page in the RPC, and then the kernel crashed when it tried to dump the DLM lock. |
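The register dump appears consistent with an error-encoded pointer rather than a literal NULL. The faulting bytes `<41> 8b 46 18` in the Code line decode to `mov 0x18(%r14),%eax`, and R14 holds fffffffffffffffe, so the access wraps around to 0x16, which matches CR2. The sketch below only reproduces that arithmetic in user space; the helper names are hypothetical, `MAX_ERRNO` mirrors the bound the Linux kernel uses for `ERR_PTR()` values, and reading -2 as -ENOENT is an interpretation of the dump, not something stated in the ticket.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_ERRNO 4095 /* same bound the Linux kernel uses for ERR_PTR values */

/* Nonzero if v lies in the top MAX_ERRNO values of the address space. */
static int is_err_value(uint64_t v)
{
    return v >= (uint64_t)-MAX_ERRNO;
}

/* Effective address of "mov disp8(%base), ..." with 64-bit wraparound. */
static uint64_t effective_address(uint64_t base, int8_t disp8)
{
    return base + (uint64_t)(int64_t)disp8;
}

/* Signed error code carried by an ERR_PTR-style value. */
static long long err_value(uint64_t v)
{
    assert(is_err_value(v));
    return (long long)(int64_t)v;
}
```

With `base = 0xfffffffffffffffe` and `disp8 = 0x18`, `effective_address()` yields 0x16 and `err_value()` yields -2.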
| Comment by Gerrit Updater [ 18/Apr/17 ] |
|
Jinshan Xiong (jinshan.xiong@intel.com) merged in patch https://review.whamcloud.com/26677/ |
| Comment by Gerrit Updater [ 27/Apr/17 ] |
|
Bobi Jam (bobijam@hotmail.com) uploaded a new patch: https://review.whamcloud.com/26861 |
| Comment by Oleg Drokin [ 01/May/17 ] |
|
Hm, seems I am hitting this too. I'll try to add the patch to my testing. |
| Comment by Andreas Dilger [ 02/May/17 ] |
|
It looks like https://review.whamcloud.com/26861 fixes the problem with the readahead, but there should also be a patch to prevent ldlm_resource_dump() from crashing in common mis-use cases (e.g. res == NULL or IS_ERR(res)). |
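A minimal user-space sketch of the kind of guard suggested here, assuming a stand-in structure and function name (`demo_resource`, `demo_resource_dump`) rather than the real `struct ldlm_resource` or `ldlm_resource_dump()`. `ptr_is_err()` imitates the kernel's `IS_ERR()` range test. It is not the patch that landed for this ticket.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define MAX_ERRNO 4095 /* same bound the Linux kernel uses for ERR_PTR values */

/* User-space imitation of IS_ERR(): true for the top MAX_ERRNO addresses. */
static int ptr_is_err(const void *p)
{
    return (uintptr_t)p >= (uintptr_t)-MAX_ERRNO;
}

/* Hypothetical stand-in for a lock resource. */
struct demo_resource {
    unsigned int refcount;
};

/*
 * Dump a resource, refusing NULL and error-encoded pointers instead of
 * dereferencing them. Returns 0 if dumped, -1 if the pointer was unusable.
 */
int demo_resource_dump(const struct demo_resource *res)
{
    if (res == NULL || ptr_is_err(res)) {
        fprintf(stderr, "resource dump skipped: invalid pointer %p\n",
                (const void *)res);
        return -1;
    }

    printf("resource %p refcount %u\n", (const void *)res, res->refcount);
    return 0;
}
```

The check costs two comparisons on a debugging path, so a bad pointer produces a log line instead of a second fault while the original error is being reported.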
| Comment by Jinshan Xiong (Inactive) [ 02/May/17 ] |
|
I noticed that too, but if the code gets there and no ldlm resource exists, things are already wrong. The only difference is whether it dies from dereferencing a NULL pointer or later in an LASSERT. |
| Comment by Cliff White (Inactive) [ 03/May/17 ] |
|
Re-tried tests with new build - IOR fails immediately |
| Comment by Cliff White (Inactive) [ 03/May/17 ] |
|
Attached output of lfs getstripe on the IOR parent directory |
| Comment by Gerrit Updater [ 05/May/17 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/26861/ |
| Comment by Peter Jones [ 05/May/17 ] |
|
Landed for 2.10 |
| Comment by James Nunez (Inactive) [ 05/May/17 ] |
|
We need to reopen this ticket because the patch that landed did not fix the issue on Spirit. |
| Comment by Jinshan Xiong (Inactive) [ 11/May/17 ] |
|
Cliff - are you able to reproduce this issue on Spirit? |
| Comment by Gerrit Updater [ 12/May/17 ] |
|
Jinshan Xiong (jinshan.xiong@intel.com) uploaded a new patch: https://review.whamcloud.com/27097 |
| Comment by James A Simmons [ 12/May/17 ] |
|
I'm preparing our small test system to see if this patch fixes the 50% drop in performance we see in our PFL testing. |
| Comment by Jinshan Xiong (Inactive) [ 12/May/17 ] |
|
James - this patch won't address any performance issues. |
| Comment by Gerrit Updater [ 15/May/17 ] |
|
Jinshan Xiong (jinshan.xiong@intel.com) uploaded a new patch: https://review.whamcloud.com/27116 |
| Comment by Gerrit Updater [ 17/May/17 ] |
|
Andreas Dilger (andreas.dilger@intel.com) merged in patch https://review.whamcloud.com/27116/ |
| Comment by Gerrit Updater [ 20/May/17 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/27097/ |
| Comment by Peter Jones [ 20/May/17 ] |
|
Landed for 2.10 |