Details
-
Bug
-
Resolution: Duplicate
-
Major
-
None
-
Lustre 2.8.0
-
3
-
9223372036854775807
Description
In the course of testing LU-8370, Patch Set 5 of change http://review.whamcloud.com/21334, I ran into the following NULL pointer dereference on an MDS node:
2016-08-04 02:47:57 [115783.326226] BUG: unable to handle kernel NULL pointer dereference at (null) 2016-08-04 02:47:57 [115783.335993] IP: [<ffffffff81651232>] down_write+0x32/0x43 2016-08-04 02:47:57 [115783.342892] PGD 0 2016-08-04 02:47:57 [115783.345995] Oops: 0002 [#1] SMP 2016-08-04 02:47:57 [115783.350434] Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgs(OE) mgc(OE) osd_zfs(OE) lquota(OE) fid(OE) fld(OE ) ptlrpc(OE) obdclass(OE) rpcsec_gss_krb5 ko2iblnd(OE) lnet(OE) sha512_generic crypto_null libcfs(OE) nfsv3 ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_sa ib_mad mlx5_ib iTCO_wdt iTCO_vendor_support intel_powerclamp coretemp intel_rapl kvm ib_core ib_addr pcspkr mlx5_core sb_ed ac edac_core mei_me lpc_ich mei mfd_core ses enclosure ipmi_devintf zfs(POE) zunicode(POE) zavl(POE) zcommon(POE) sg znvpair(POE) spl(OE) zlib_deflate i2c_i801 ioatdma shpchp ipmi_si ipmi_msghandler acpi_power_meter acpi_cpufreq binfmt_misc nfsd nfs_acl ip_tables auth_rpcgss nfsv4 dns_resolver nfs l ockd grace fscache dm_round_robin sd_mod crc_t10dif crct10dif_generic mgag200 syscopyarea sysfillrect crct10dif_pclmul sysimgblt crct10dif_common i2c_ algo_bit crc32_pclmul crc32c_intel drm_kms_helper mxm_wmi ghash_clmulni_intel ttm ahci aesni_intel lrw ixgbe gf128mul glue_helper libahci ablk_helper dca drm mpt3sas ptp libata i2c_core cryptd raid_class pps_core scsi_transport_sas mdio wmi sunrpc dm_mirror dm_region_hash dm_log scsi_transport_iscsi dm_multipath dm_mod 2016-08-04 02:47:57 [115783.476527] CPU: 14 PID: 162546 Comm: mdt03_008 Tainted: P W OE ------------ 3.10.0-327.22.2.2chaos.ch6.x86_64 #1 2016-08-04 02:47:57 [115783.489957] Hardware name: Intel Corporation S2600WTTR/S2600WTTR, BIOS SE5C610.86B.01.01.0016.033120161139 03/31/2016 2016-08-04 02:47:57 [115783.502713] task: ffff883f17ffb980 ti: ffff887ed50d4000 task.ti: ffff887ed50d4000 2016-08-04 02:47:57 [115783.511991] RIP: 0010:[<ffffffff81651232>] [<ffffffff81651232>] down_write+0x32/0x43 2016-08-04 02:47:57 [115783.521679] RSP: 0018:ffff887ed50d78e0 EFLAGS: 00010246 2016-08-04 02:47:57 [115783.528512] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff887ed50d7fd8 2016-08-04 02:47:57 [115783.537390] RDX: ffffffff00000001 RSI: 000000000000002f RDI: ffffffff81886372 2016-08-04 02:47:57 [115783.546265] RBP: ffff887ed50d78e8 R08: ffff8870f59ada00 R09: ffff883f7ec07b00 2016-08-04 02:47:57 [115783.555130] R10: ffffffffa0f6f1ec R11: ffff88768502ba60 R12: ffff887f246b6a00 2016-08-04 02:47:57 [115783.563974] R13: 0000000000000000 R14: ffff88768502ba40 R15: ffff8870f59ada00 2016-08-04 02:47:57 [115783.572814] FS: 0000000000000000(0000) GS:ffff887f7eac0000(0000) knlGS:0000000000000000 2016-08-04 02:47:57 [115783.582719] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 2016-08-04 02:47:57 [115783.589997] CR2: 0000000000000000 CR3: 0000000001962000 CR4: 00000000001407e0 2016-08-04 02:47:57 [115783.598840] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 2016-08-04 02:47:57 [115783.607672] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 2016-08-04 02:47:57 [115783.616493] Stack: 2016-08-04 02:47:57 [115783.619577] 0000000000000000 ffff887ed50d7938 ffffffffa0ca6f84 ffff88768502ba40 2016-08-04 02:47:57 [115783.628726] ffff8878897d8000 ffff887ef6bcd2c0 ffff887f246b6a00 ffff887ef6bcd2c0 2016-08-04 02:47:57 [115783.637872] ffff8878897d8000 ffff88768502ba40 ffff8870f59ada00 ffff887ed50d7978 2016-08-04 02:47:57 [115783.647011] Call Trace: 2016-08-04 02:47:57 [115783.650601] [<ffffffffa0ca6f84>] llog_cat_add_rec+0x1d4/0x780 [obdclass] 2016-08-04 02:47:57 [115783.659046] [<ffffffffa0c9fa3a>] llog_add+0x7a/0x1a0 [obdclass] 2016-08-04 02:47:57 [115783.666616] [<ffffffffa0f6f1ec>] ? sub_updates_write+0x7f6/0xef8 [ptlrpc] 2016-08-04 02:47:57 [115783.675142] [<ffffffffa0f6f5e3>] sub_updates_write+0xbed/0xef8 [ptlrpc] 2016-08-04 02:47:57 [115783.683642] [<ffffffffa0f5dc0f>] top_trans_stop+0x62f/0x970 [ptlrpc] 2016-08-04 02:47:57 [115783.691833] [<ffffffffa1213399>] lod_trans_stop+0x259/0x340 [lod] 2016-08-04 02:47:57 [115783.699562] [<ffffffffa0d02380>] ? linkea_add_buf+0x80/0x170 [obdclass] 2016-08-04 02:47:57 [115783.707871] [<ffffffffa12a18fa>] mdd_trans_stop+0x1a/0x1c [mdd] 2016-08-04 02:47:57 [115783.715384] [<ffffffffa128cb08>] mdd_link+0x2e8/0x930 [mdd] 2016-08-04 02:47:57 [115783.722511] [<ffffffffa0ee9522>] ? lustre_msg_get_versions+0x22/0xf0 [ptlrpc] 2016-08-04 02:47:57 [115783.731563] [<ffffffffa115f08e>] mdt_reint_link+0xade/0xc30 [mdt] 2016-08-04 02:47:57 [115783.739437] [<ffffffff81309f82>] ? strlcpy+0x42/0x60 2016-08-04 02:47:57 [115783.745864] [<ffffffffa1161470>] mdt_reint_rec+0x80/0x210 [mdt] 2016-08-04 02:47:57 [115783.753352] [<ffffffffa1144971>] mdt_reint_internal+0x5e1/0x990 [mdt] 2016-08-04 02:47:57 [115783.761426] [<ffffffffa114e0d7>] mdt_reint+0x67/0x140 [mdt] 2016-08-04 02:47:57 [115783.768525] [<ffffffffa0f49695>] tgt_request_handle+0x915/0x1320 [ptlrpc] 2016-08-04 02:47:57 [115783.776987] [<ffffffffa0ef60cb>] ptlrpc_server_handle_request+0x21b/0xa90 [ptlrpc] 2016-08-04 02:47:57 [115783.786323] [<ffffffffa0784758>] ? lc_watchdog_touch+0x68/0x180 [libcfs] 2016-08-04 02:47:57 [115783.794830] [<ffffffffa0ef3c9b>] ? ptlrpc_wait_event+0xab/0x350 [ptlrpc] 2016-08-04 02:47:57 [115783.803172] [<ffffffff810bd4c2>] ? default_wake_function+0x12/0x20 2016-08-04 02:47:57 [115783.810903] [<ffffffff810b33f8>] ? __wake_up_common+0x58/0x90 2016-08-04 02:47:57 [115783.818150] [<ffffffffa0efa170>] ptlrpc_main+0xa90/0x1db0 [ptlrpc] 2016-08-04 02:47:57 [115783.825862] [<ffffffff81015588>] ? __switch_to+0xf8/0x4d0 2016-08-04 02:47:57 [115783.832686] [<ffffffffa0ef96e0>] ? ptlrpc_register_service+0xe40/0xe40 [ptlrpc] 2016-08-04 02:47:57 [115783.841633] [<ffffffff810a997f>] kthread+0xcf/0xe0 2016-08-04 02:47:57 [115783.847742] [<ffffffff810a98b0>] ? kthread_create_on_node+0x140/0x140 2016-08-04 02:47:57 [115783.855707] [<ffffffff8165d658>] ret_from_fork+0x58/0x90 2016-08-04 02:47:57 [115783.862363] [<ffffffff810a98b0>] ? kthread_create_on_node+0x140/0x140 2016-08-04 02:47:57 [115783.870270] Code: d2 be 2f 00 00 00 48 89 e5 53 48 89 fb 48 c7 c7 72 63 88 81 e8 00 49 a6 ff e8 8b 10 00 00 48 ba 01 00 00 00 ff ff ff ff 48 89 d8 <f0> 48 0f c1 10 85 d2 74 05 e8 f0 e9 cb ff 5b 5d c3 55 48 89 e5 2016-08-04 02:47:57 [115783.893325] RIP [<ffffffff81651232>] down_write+0x32/0x43 2016-08-04 02:47:57 [115783.900088] RSP <ffff887ed50d78e0> 2016-08-04 02:47:57 [115783.904604] CR2: 0000000000000000
Di Wang suggested that this was due to LU-7800, and asked that we try patch 18542. We tested Patch Set version 4 of that change, but it proved disastrous in testing.
There were lots of messages of this form on the primary MDT:
2016-08-09 14:52:30 [358322.078236] LNet: Service thread pid 134876 was inactive for 200.32s. The thread might be hung, or it might only be slow and will resume later. Dumping the sta ck trace for debugging purposes: 2016-08-09 14:52:30 [358322.098689] Pid: 134876, comm: mdt03_010 2016-08-09 14:52:30 [358322.103732] 2016-08-09 14:52:30 [358322.103732] Call Trace: 2016-08-09 14:52:30 [358322.109385] [<ffffffff81652479>] schedule+0x29/0x70 2016-08-09 14:52:30 [358322.115535] [<ffffffff8164fea5>] schedule_timeout+0x175/0x310 2016-08-09 14:52:30 [358322.122680] [<ffffffff8108ef30>] ? process_timeout+0x0/0x10 2016-08-09 14:52:30 [358322.129629] [<ffffffffa0f0b950>] ? ldlm_expired_completion_wait+0x0/0x250 [ptlrpc] 2016-08-09 14:52:30 [358322.138798] [<ffffffffa0f0c141>] ldlm_completion_ast+0x5a1/0x910 [ptlrpc] 2016-08-09 14:52:30 [358322.147085] [<ffffffff810bd4f0>] ? default_wake_function+0x0/0x20 2016-08-09 14:52:30 [358322.154615] [<ffffffffa0f0e870>] ldlm_cli_enqueue_local+0x230/0x860 [ptlrpc] 2016-08-09 14:52:30 [358322.163224] [<ffffffffa0f0bba0>] ? ldlm_completion_ast+0x0/0x910 [ptlrpc] 2016-08-09 14:52:30 [358322.171534] [<ffffffffa118e5b0>] ? mdt_blocking_ast+0x0/0x2d0 [mdt] 2016-08-09 14:52:30 [358322.179253] [<ffffffffa119e10a>] mdt_object_local_lock+0x52a/0xb00 [mdt] 2016-08-09 14:52:30 [358322.187458] [<ffffffffa118e5b0>] ? mdt_blocking_ast+0x0/0x2d0 [mdt] 2016-08-09 14:52:30 [358322.195194] [<ffffffffa0f0bba0>] ? ldlm_completion_ast+0x0/0x910 [ptlrpc] 2016-08-09 14:52:30 [358322.203502] [<ffffffffa12527f5>] ? lod_index_lookup+0x25/0x30 [lod] 2016-08-09 14:52:30 [358322.211233] [<ffffffffa0bbdd4f>] ? __mdd_lookup.isra.17+0x26f/0x450 [mdd] 2016-08-09 14:52:30 [358322.219527] [<ffffffffa119e73e>] mdt_object_lock_internal+0x5e/0x2f0 [mdt] 2016-08-09 14:52:30 [358322.227934] [<ffffffffa08e5306>] ? lu_object_find+0x16/0x20 [obdclass] 2016-08-09 14:52:30 [358322.235922] [<ffffffffa119f614>] mdt_getattr_name_lock+0x8a4/0x1900 [mdt] 2016-08-09 14:52:30 [358322.244214] [<ffffffffa11a0920>] mdt_intent_getattr+0x2b0/0x470 [mdt] 2016-08-09 14:52:30 [358322.252111] [<ffffffffa11a3c2c>] mdt_intent_policy+0x5bc/0xbb0 [mdt] 2016-08-09 14:52:30 [358322.259931] [<ffffffffa0ef20d7>] ldlm_lock_enqueue+0x387/0x970 [ptlrpc] 2016-08-09 14:52:30 [358322.268035] [<ffffffffa0f1a9e2>] ldlm_handle_enqueue0+0x772/0x16b0 [ptlrpc] 2016-08-09 14:52:30 [358322.276536] [<ffffffffa0f42030>] ? lustre_swab_ldlm_request+0x0/0x30 [ptlrpc] 2016-08-09 14:52:30 [358322.285221] [<ffffffffa0f9a272>] tgt_enqueue+0x62/0x210 [ptlrpc] 2016-08-09 14:52:30 [358322.292658] [<ffffffffa0f9e695>] tgt_request_handle+0x915/0x1320 [ptlrpc] 2016-08-09 14:52:30 [358322.300956] [<ffffffffa0f4b0cb>] ptlrpc_server_handle_request+0x21b/0xa90 [ptlrpc] 2016-08-09 14:52:30 [358322.310111] [<ffffffffa0794758>] ? lc_watchdog_touch+0x68/0x180 [libcfs] 2016-08-09 14:52:30 [358322.318353] [<ffffffffa0f48c9b>] ? ptlrpc_wait_event+0xab/0x350 [ptlrpc] 2016-08-09 14:52:30 [358322.326534] [<ffffffff810bd502>] ? default_wake_function+0x12/0x20 2016-08-09 14:52:30 [358322.334141] [<ffffffff810b3438>] ? __wake_up_common+0x58/0x90 2016-08-09 14:52:30 [358322.341281] [<ffffffffa0f4f170>] ptlrpc_main+0xa90/0x1db0 [ptlrpc] 2016-08-09 14:52:30 [358322.348907] [<ffffffffa0f4e6e0>] ? ptlrpc_main+0x0/0x1db0 [ptlrpc] 2016-08-09 14:52:30 [358322.356497] [<ffffffff810a99bf>] kthread+0xcf/0xe0 2016-08-09 14:52:30 [358322.362539] [<ffffffff810a98f0>] ? kthread+0x0/0xe0 2016-08-09 14:52:30 [358322.368674] [<ffffffff8165d9d8>] ret_from_fork+0x58/0x90 2016-08-09 14:52:30 [358322.375283] [<ffffffff810a98f0>] ? kthread+0x0/0xe0 2016-08-09 14:52:30 [358322.381399] 2016-08-09 14:52:30 [358322.383646] LustreError: dumping log to /tmp/lustre-log.1470779550.134876 2016-08-09 14:53:40 [358391.814084] LustreError: 61123:0:(ldlm_request.c:106:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1470779320, 300s ago); not entering recovery in server code, just going back to sleep ns: mdt-lquake-MDT0000_UUID lock: ffff887779436400/0xd7fdbcbd7173ced9 lrc: 3/1,0 mode: --/PR res: [0x200000411:0x2:0x0].0x0 bits 0x13 rrc: 72 type: IBT flags: 0x40210000000000 nid: local remote: 0x0 expref: -99 pid: 61123 timeout: 0 lvb_type: 0 2016-08-09 14:54:10 [358421.838538] LustreError: 134876:0:(ldlm_request.c:106:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1470779350, 300s ago); not entering recovery in server code, just going back to sleep ns: mdt-lquake-MDT0000_UUID lock: ffff887779434800/0xd7fdbcbd7173cefc lrc: 3/1,0 mode: --/PR res: [0x200000411:0x2:0x0].0x0 bits 0x13 rrc: 72 type: IBT flags: 0x40210000000000 nid: local remote: 0x0 expref: -99 pid: 134876 timeout: 0 lvb_type: 0 2016-08-09 14:55:17 [358489.126391] Lustre: 47712:0:(service.c:1336:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/5), not sending early reply 2016-08-09 14:55:17 [358489.126391] req@ffff8833d0102d00 x1542220339501052/t0(0) o101->9f9de1b9-b610-2a38-3f5f-841c2475a01f@192.168.128.33@o2ib18:462/0 lens 696/3384 e 22 to 0 dl 1470779722 ref 2 fl Interpret:/0/0 rc 0/0 2016-08-09 14:55:57 [358529.607401] Lustre: 134880:0:(service.c:1336:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/5), not sending early reply 2016-08-09 14:55:57 [358529.607401] req@ffff887880dff500 x1542220340548108/t0(0) o101->ad48e3df-a2c8-571f-bbf1-cbdff6fd3bfa@192.168.128.44@o2ib18:502/0 lens 696/3384 e 22 to 0 dl 1470779762 ref 2 fl Interpret:/0/0 rc 0/0 2016-08-09 14:56:07 [358539.256087] Lustre: lquake-MDT0000: Client 9f9de1b9-b610-2a38-3f5f-841c2475a01f (at 192.168.128.33@o2ib18) reconnecting 2016-08-09 14:56:07 [358539.269040] Lustre: lquake-MDT0000: Connection restored to (at 192.168.128.33@o2ib18) 2016-08-09 14:58:35 [358687.752322] Lustre: 61096:0:(service.c:1336:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/5), not sending early reply 2016-08-09 14:58:35 [358687.752322] req@ffff883e4bd06f00 x1542220323771140/t0(0) o101->b7ee1b55-5cca-e2f3-3fcc-0202bce14aec@192.168.128.186@o2ib18:660/0 lens 696/3384 e 22 to 0 dl 1470779920 ref 2 fl Interpret:/0/0 rc 0/0 2016-08-09 14:59:05 [358717.792806] Lustre: 61096:0:(service.c:1336:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/5), not sending early reply 2016-08-09 14:59:05 [358717.792806] req@ffff8833da488f00 x1542220323771456/t0(0) o101->b7ee1b55-5cca-e2f3-3fcc-0202bce14aec@192.168.128.186@o2ib18:690/0 lens 696/3384 e 11 to 0 dl 1470779950 ref 2 fl Interpret:/0/0 rc 0/0 2016-08-09 14:59:25 [358737.051472] Lustre: lquake-MDT0000: Connection restored to 12dd100c-21bd-8e8c-a8e2-05374050e7f7 (at 192.168.128.186@o2ib18) 2016-08-09 14:59:25 [358737.064963] Lustre: Skipped 1 previous similar message
Simple "ls -l" commands from clients were hanging for very long periods of time (a few minutes at least).
This was with our lustre tag 2.8.0_0.0.llnlpreview.30 installed on the servers (see repo lustre-release-fe-llnl). We reverted to patch 2.8.0_0.0.llnlpreview.29, which did not have the LU-7800 patch, and everything is working as it was previously.
Note that the LU-7800 patch did not apply cleanly to 2.8. I did need to resolve some minor conflicts. I made the simple, naive conflict resolution, so it is worth checking if there was additional change needed for 2.8.
But if porting issues weren't to blame, the LU-7800 patch caused far more trouble than it was intending to fix.