[LU-9394] lu_object_find_try - kernel NULL pointer dereference Created: 24/Apr/17 Updated: 10/May/17 Resolved: 09/May/17 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.10.0 |
| Fix Version/s: | Lustre 2.10.0 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Cliff White (Inactive) | Assignee: | Alex Zhuravlev |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | soak | ||
| Environment: |
Soak performance cluster. |
||
| Attachments: |
|
||||||||
| Issue Links: |
|
||||||||
| Severity: | 3 | ||||||||
| Rank (Obsolete): | 9223372036854775807 | ||||||||
| Description |
|
Soak cluster had completed MDS failover of MDT0002, lfsck was aborted on the MDS after timeout. Apr 22 15:02:44 soak-2 kernel: LustreError: Skipped 3 previous similar messages Apr 22 15:02:44 soak-2 kernel: Lustre: soaked-MDT0002-lwp-OST0006: Connection restored to 192.168.1.110@o2ib10 (at 192.168.1.110@o2ib10) Apr 22 15:02:44 soak-2 kernel: Lustre: Skipped 4 previous similar messages Apr 22 15:03:09 soak-2 kernel: LustreError: 167-0: soaked-MDT0002-lwp-OST0012: This client was evicted by soaked-MDT0002; in progress operations using this service will fail. Apr 22 15:03:09 soak-2 kernel: LustreError: Skipped 1 previous similar message Apr 22 15:03:09 soak-2 kernel: Lustre: soaked-MDT0002-lwp-OST0012: Connection restored to 192.168.1.110@o2ib10 (at 192.168.1.110@o2ib10) Apr 22 15:03:09 soak-2 kernel: Lustre: Skipped 1 previous similar message Apr 22 15:07:11 soak-2 kernel: Lustre: soaked-OST0006: deleting orphan objects from 0x480000401:15286395 to 0x480000401:15292401 Apr 22 15:07:11 soak-2 kernel: Lustre: soaked-OST0000: deleting orphan objects from 0x300000400:15323501 to 0x300000400:15333313 Apr 22 15:07:11 soak-2 kernel: Lustre: soaked-OST000c: deleting orphan objects from 0x600000400:15327168 to 0x600000400:15338321 Apr 22 15:07:11 soak-2 kernel: Lustre: soaked-OST0012: deleting orphan objects from 0x740000402:15313619 to 0x740000402:15327665 Apr 22 15:21:56 soak-2 rsyslogd: [origin software="rsyslogd" swVersion="7.4.7" x-pid="3806" x-info="http://www.rsyslog.com"] start Dmesg from vmcore: 32845.239323] Lustre: soaked-OST000c: deleting orphan objects from 0x600000400:15327168 to 0x600000400:15338321 [32845.239325] Lustre: soaked-OST0012: deleting orphan objects from 0x740000402:15313619 to 0x740000402:15327665 [33465.017357] LustreError: 15131:0:(osd_object.c:427:osd_object_init()) soaked-OST000c: lookup [0x1000c0000:0x21c57a8:0x0]/0x387ef2 failed: rc = 17 [33465.035118] BUG: unable to handle kernel NULL pointer dereference at 0000000000000011 [33465.045485] IP: [<ffffffffa0a590d8>] lu_object_find_try+0x178/0x2b0 [obdclass] [33465.055153] PGD 0 [33465.058735] Oops: 0000 [#1] SMP [33465.063750] Modules linked in: osp(OE) ofd(OE) lfsck(OE) ost(OE) mgc(OE) osd_zfs(OE) lquota(OE) fid(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) 8021q garp mrp stp llc rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm mlx4_ib ib_core intel_powerclamp coretemp intel_rapl iosf_mbi kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd dm_round_robin iTCO_wdt iTCO_vendor_support mei_me pcspkr ses ipmi_devintf enclosure mei ntb sg ipmi_si i2c_i801 lpc_ich sb_edac ipmi_msghandler ioatdma edac_core shpchp wmi nfsd dm_multipath nfs_acl dm_mod lockd grace auth_rpcgss sunrpc ip_tables ext4 mbcache [33465.151994] jbd2 zfs(POE) zunicode(POE) zavl(POE) zcommon(POE) znvpair(POE) spl(OE) zlib_deflate sd_mod crc_t10dif crct10dif_generic mlx4_en mgag200 drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops igb isci ttm ahci crct10dif_pclmul crct10dif_common libsas ptp crc32c_intel libahci pps_core mlx4_core drm mpt2sas dca libata raid_class i2c_algo_bit scsi_transport_sas devlink i2c_core fjes [33465.195938] CPU: 15 PID: 15131 Comm: ldlm_cn01_018 Tainted: P OE ------------ 3.10.0-514.10.2.el7_lustre.x86_64 #1 [33465.211995] Hardware name: Intel Corporation S2600GZ ........../S2600GZ, BIOS SE5C600.86B.01.08.0003.022620131521 02/26/2013 [33465.225976] task: ffff88042a2dbec0 ti: ffff88073b4f8000 task.ti: ffff88073b4f8000 [33465.236319] RIP: 0010:[<ffffffffa0a590d8>] [<ffffffffa0a590d8>] lu_object_find_try+0x178/0x2b0 [obdclass] [33465.249161] RSP: 0018:ffff88073b4fbaa8 EFLAGS: 00010207 [33465.256998] RAX: 0000000000000011 RBX: ffff880825cd6cc0 RCX: ffff8800ad22a918 [33465.266794] RDX: 00000001002a0021 RSI: ffffea0009502a80 RDI: ffff8800ad22a930 [33465.276797] RBP: ffff88073b4fbb08 R08: ffff8802540aaa80 R09: 00000001002a0020 [33465.286649] R10: 00000000540aad01 R11: ffffea0009502a80 R12: ffff8803fd4d5020 [33465.296218] R13: ffff8804b928e1f0 R14: 0000000000000000 R15: ffff88080b3d9000 [33465.305840] FS: 0000000000000000(0000) GS:ffff88082d9c0000(0000) knlGS:0000000000000000 [33465.316722] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [33465.324590] CR2: 0000000000000011 CR3: 00000000019ba000 CR4: 00000000000407e0 [33465.333919] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [33465.343445] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 [33465.353428] Stack: [33465.357274] ffff88073b4fbc00 fffffffffffffffe ffff88073b4fbb30 000000000002246f [33465.367105] ffff8800ad22a800 ffff88000000000d 00000000a848b037 ffff88073b4fbc00 [33465.376714] ffff88080b3d9000 ffff8803fd4d5020 0000000000000000 00000001000c0000 [33465.386284] Call Trace: [33465.390755] [<ffffffffa0a592bc>] lu_object_find_at+0xac/0xe0 [obdclass] [33465.400101] [<ffffffffa0fb27db>] ? ofd_key_init+0x3b/0xd0 [ofd] [33465.408740] [<ffffffffa0a59306>] lu_object_find+0x16/0x20 [obdclass] [33465.417192] [<ffffffffa0fca4e5>] ofd_object_find+0x35/0x100 [ofd] [33465.425447] [<ffffffffa0fd8d12>] ofd_lvbo_update+0x4b2/0xfc8 [ofd] [33465.434020] [<ffffffffa0fd7ca9>] ? ofd_lvbo_free+0x59/0xe0 [ofd] [33465.442680] [<ffffffffa0c566f2>] ldlm_request_cancel+0x132/0x720 [ptlrpc] [33465.451967] [<ffffffffa0c5dc6a>] ldlm_handle_cancel+0xba/0x250 [ptlrpc] [33465.461254] [<ffffffffa0c5df41>] ldlm_cancel_handler+0x141/0x490 [ptlrpc] [33465.470567] [<ffffffffa0c8df1b>] ptlrpc_server_handle_request+0x21b/0xa90 [ptlrpc] [33465.480874] [<ffffffffa09359a8>] ? lc_watchdog_touch+0x68/0x180 [libcfs] [33465.489808] [<ffffffffa0c8bac8>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc] [33465.498533] [<ffffffff810c5092>] ? default_wake_function+0x12/0x20 [33465.507323] [<ffffffff810ba2e8>] ? __wake_up_common+0x58/0x90 [33465.515728] [<ffffffffa0c91f40>] ptlrpc_main+0xaa0/0x1de0 [ptlrpc] [33465.524220] [<ffffffffa0c914a0>] ? ptlrpc_register_service+0xe60/0xe60 [ptlrpc] [33465.534131] [<ffffffff810b06ff>] kthread+0xcf/0xe0 [33465.540714] [<ffffffff810b0630>] ? kthread_create_on_node+0x140/0x140 [33465.549134] [<ffffffff81696c98>] ret_from_fork+0x58/0x90 [33465.556096] [<ffffffff810b0630>] ? kthread_create_on_node+0x140/0x140 [33465.564386] Code: c6 62 e0 48 83 f8 fe 0f 85 54 ff ff ff 48 8b 7d a0 4c 89 f1 4c 89 e2 4c 89 fe e8 24 fb ff ff 48 3d 00 f0 ff ff 0f 87 36 ff ff ff <48> 8b 38 ba 10 00 00 00 4c 89 e6 48 89 45 a8 e8 04 80 8c e0 85 [33465.588258] RIP [<ffffffffa0a590d8>] lu_object_find_try+0x178/0x2b0 [obdclass] [33465.597871] RSP <ffff88073b4fbaa8> |
| Comments |
| Comment by Cliff White (Inactive) [ 25/Apr/17 ] |
|
System crashes consistently with this error, stopping testing of this build on soak. |
| Comment by Peter Jones [ 25/Apr/17 ] |
|
Fan Yong Could you please advise on this one? Peter |
| Comment by Joseph Gmitter (Inactive) [ 26/Apr/17 ] |
|
Hi Alex, Can you please take this issue? Thanks. |
| Comment by Cliff White (Inactive) [ 26/Apr/17 ] |
|
Attempting to test https://review.whamcloud.com/26751 - hit this bug again, appears to be the same. Hard crash on soak-2. The crash dump is available on the node. [93807.458922] Lustre: 15422:0:(client.c:2115:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1493238567/real 1493238567] req@ffff8800498d1800 x1565678189641584/t0(0) o105->soaked-OST000c@192.168.1.138@o2ib100:15/16 lens 360/224 e 0 to 1 dl 1493238574 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
[93807.496438] Lustre: 15422:0:(client.c:2115:ptlrpc_expire_one_request()) Skipped 5 previous similar messages
[95810.236195] LustreError: 16392:0:(osd_object.c:427:osd_object_init()) soaked-OST000c: lookup [0x1000c0000:0x2a8aea9:0x0]/0x505415 failed: rc = 17
[95810.253694] BUG: unable to handle kernel NULL pointer dereference at 0000000000000011
[95810.264019] IP: [<ffffffffa0a82038>] lu_object_find_try+0x178/0x2b0 [obdclass]
[95810.273549] PGD 0
[95810.277182] Oops: 0000 [#1] SMP
|
| Comment by Cliff White (Inactive) [ 28/Apr/17 ] |
|
Hit this again on three nodes, - soak-[2-4] Dropping this build on soak until I see a fix. |
| Comment by Alex Zhuravlev [ 28/Apr/17 ] |
|
Cliff, was it ZFS on OST ? |
| Comment by Gerrit Updater [ 28/Apr/17 ] |
|
Alex Zhuravlev (alexey.zhuravlev@intel.com) uploaded a new patch: https://review.whamcloud.com/26893 |
| Comment by Cliff White (Inactive) [ 28/Apr/17 ] |
|
Yes, ZFS on OST ldiskfs on MDT |
| Comment by Cliff White (Inactive) [ 01/May/17 ] |
|
Ran the patch for 24 hours without lfsck, no issues. 2017-04-30 19:39:52,529:fsmgmt.fsmgmt:INFO lfsck started on soak-8 System appears to have immediately wedged. Apr 30 19:40:19 soak-8 kernel: NMI watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [OI_scrub:4251]
Apr 30 19:40:19 soak-8 kernel: Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgs(OE) mgc(OE) osd_ldiskfs(OE) ldiskfs(OE) lquota(OE) fid(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) zfs(POE) zunicode(POE) zavl(POE) zcommon(POE) znvpair(POE) spl(OE) zlib_deflate 8021q garp mrp stp llc rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm mlx4_ib ib_core intel_powerclamp coretemp intel_rapl iosf_mbi kvm irqbypass crc32_pclmul dm_round_robin ghash_clmulni_intel aesni_intel ipmi_ssif ipmi_devintf sb_edac mei_me sg lrw gf128mul glue_helper ablk_helper cryptd ipmi_si iTCO_wdt edac_core ipmi_msghandler i2c_i801 mei iTCO_vendor_support lpc_ich pcspkr
Apr 30 19:40:19 soak-8 kernel: shpchp ioatdma dm_multipath dm_mod wmi nfsd nfs_acl lockd auth_rpcgss grace sunrpc ip_tables ext4 mbcache jbd2 sd_mod crc_t10dif crct10dif_generic mlx4_en mgag200 drm_kms_helper syscopyarea sysfillrect isci sysimgblt fb_sys_fops igb crct10dif_pclmul ahci ttm crct10dif_common mpt2sas ptp libahci libsas crc32c_intel raid_class pps_core drm mlx4_core scsi_transport_sas dca libata i2c_algo_bit ntb i2c_core devlink fjes
Apr 30 19:40:19 soak-8 kernel: CPU: 2 PID: 4251 Comm: OI_scrub Tainted: P OE ------------ 3.10.0-514.16.1.el7_lustre.x86_64 #1
Apr 30 19:40:19 soak-8 kernel: Hardware name: Intel Corporation S2600GZ ........../S2600GZ, BIOS SE5C600.86B.01.08.0003.022620131521 02/26/2013
Apr 30 19:40:19 soak-8 kernel: task: ffff880418a22f10 ti: ffff8807ef304000 task.ti: ffff8807ef304000
Apr 30 19:40:19 soak-8 kernel: RIP: 0010:[<ffffffffa117779a>] [<ffffffffa117779a>] ldiskfs_itable_unused_count+0x1a/0x30 [ldiskfs]
Apr 30 19:40:19 soak-8 kernel: RSP: 0018:ffff8807ef307d10 EFLAGS: 00000297
Apr 30 19:40:19 soak-8 kernel: RAX: 0000000000000000 RBX: ffff8807ef307d98 RCX: ffff88082b4e7800
Apr 30 19:40:19 soak-8 kernel: RDX: 00000000000072ce RSI: ffff880825d38a00 RDI: ffff88082b4e7000
Apr 30 19:40:19 soak-8 kernel: RBP: ffff8807ef307d10 R08: ffff8807ef307d57 R09: 0000000000000004
Apr 30 19:40:19 soak-8 kernel: R10: 00000000eaf1dc01 R11: ffffea001fabc740 R12: 0000000000000001
Apr 30 19:40:19 soak-8 kernel: R13: 0000000000000018 R14: ffff8807ef307c80 R15: 0000000000000202
Apr 30 19:40:19 soak-8 kernel: FS: 0000000000000000(0000) GS:ffff88042e080000(0000) knlGS:0000000000000000
Apr 30 19:40:19 soak-8 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Apr 30 19:40:19 soak-8 kernel: CR2: 00007fb7eb77de00 CR3: 00000000019be000 CR4: 00000000000407e0
Apr 30 19:40:19 soak-8 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Apr 30 19:40:19 soak-8 kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Apr 30 19:40:19 soak-8 kernel: Stack:
Apr 30 19:40:19 soak-8 kernel: ffff8807ef307df0 ffffffffa1234ead ffffffffa1233880 ffff8804b960d46c
Apr 30 19:40:19 soak-8 kernel: ffffffffa122f7b0 ffff880420000000 ffff88082b4e7000 ffff8804b960d468
Apr 30 19:40:19 soak-8 kernel: 010000000000000c 0000000000000000 0000000000000000 ffff8804b960c000
Apr 30 19:40:19 soak-8 kernel: Call Trace:
Apr 30 19:40:19 soak-8 kernel: [<ffffffffa1234ead>] osd_inode_iteration+0x21d/0xd90 [osd_ldiskfs]
Apr 30 19:40:19 soak-8 kernel: [<ffffffffa1233880>] ? osd_ios_ROOT_scan+0x300/0x300 [osd_ldiskfs]
Apr 30 19:40:19 soak-8 kernel: [<ffffffffa122f7b0>] ? osd_preload_next+0xc0/0xc0 [osd_ldiskfs]
Apr 30 19:40:19 soak-8 kernel: [<ffffffffa1236380>] osd_scrub_main+0x960/0xf30 [osd_ldiskfs]
Apr 30 19:40:19 soak-8 kernel: [<ffffffff810c54c0>] ? wake_up_state+0x20/0x20
Apr 30 19:40:19 soak-8 kernel: [<ffffffffa1235a20>] ? osd_inode_iteration+0xd90/0xd90 [osd_ldiskfs]
Apr 30 19:40:19 soak-8 kernel: [<ffffffff810b0a4f>] kthread+0xcf/0xe0
Apr 30 19:40:19 soak-8 kernel: [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
Apr 30 19:40:19 soak-8 kernel: [<ffffffff81697318>] ret_from_fork+0x58/0x90
Apr 30 19:40:19 soak-8 kernel: [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
Apr 30 19:40:19 soak-8 kernel: Code: 30 c1 e0 10 09 d0 5d c3 66 0f 1f 84 00 00 00 00 00 66 66 66 66 90 55 48 8b 8f a8 03 00 00 31 c0 0f b7 56 1c 48 89 e5 48 83 39 3f <76> 07 0f b7 46 32 c1 e0 10 09 d0 5d c3 66 0f 1f 84 00 00 00 00
soak-8 is now completely wedged, available for examination |
| Comment by nasf (Inactive) [ 02/May/17 ] |
Cliff, have you applied the patch https://review.whamcloud.com/#/c/26751/ when hit above trouble? |
| Comment by Cliff White (Inactive) [ 02/May/17 ] |
|
No, was only testing the single patch above. |
| Comment by Cliff White (Inactive) [ 02/May/17 ] |
|
Soak-8 system log attached |
| Comment by Cliff White (Inactive) [ 05/May/17 ] |
|
Moved this to tip of master. Same failure on soak-8 |
| Comment by Gerrit Updater [ 09/May/17 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/26893/ |
| Comment by Peter Jones [ 09/May/17 ] |
|
Landed for 2.10 |