[LU-8489] llog_cat_add_rec NULL pointer dereference Created: 09/Aug/16  Updated: 12/Oct/16  Resolved: 18/Aug/16

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.8.0
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Christopher Morrone Assignee: Peter Jones
Resolution: Duplicate Votes: 0
Labels: llnl

Issue Links:
Duplicate
Related
is related to LU-8484 MDS server crash during sanity test 6... Resolved
is related to LU-7800 Panic during recovery of soak-test. Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

In the course of testing LU-8370, Patch Set 5 of change http://review.whamcloud.com/21334, I ran into the following NULL pointer dereference on an MDS node:

2016-08-04 02:47:57 [115783.326226] BUG: unable to handle kernel NULL pointer dereference at           (null)
2016-08-04 02:47:57 [115783.335993] IP: [<ffffffff81651232>] down_write+0x32/0x43
2016-08-04 02:47:57 [115783.342892] PGD 0 
2016-08-04 02:47:57 [115783.345995] Oops: 0002 [#1] SMP 
2016-08-04 02:47:57 [115783.350434] Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgs(OE) mgc(OE) osd_zfs(OE) lquota(OE) fid(OE) fld(OE
) ptlrpc(OE) obdclass(OE) rpcsec_gss_krb5 ko2iblnd(OE) lnet(OE) sha512_generic crypto_null libcfs(OE) nfsv3 ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad
 rdma_cm ib_cm iw_cm ib_sa ib_mad mlx5_ib iTCO_wdt iTCO_vendor_support intel_powerclamp coretemp intel_rapl kvm ib_core ib_addr pcspkr mlx5_core sb_ed
ac edac_core mei_me lpc_ich mei mfd_core ses enclosure ipmi_devintf zfs(POE) zunicode(POE) zavl(POE) zcommon(POE) sg znvpair(POE) spl(OE) zlib_deflate
 i2c_i801 ioatdma shpchp ipmi_si ipmi_msghandler acpi_power_meter acpi_cpufreq binfmt_misc nfsd nfs_acl ip_tables auth_rpcgss nfsv4 dns_resolver nfs l
ockd grace fscache dm_round_robin sd_mod crc_t10dif crct10dif_generic mgag200 syscopyarea sysfillrect crct10dif_pclmul sysimgblt crct10dif_common i2c_
algo_bit crc32_pclmul crc32c_intel drm_kms_helper mxm_wmi ghash_clmulni_intel ttm ahci aesni_intel lrw ixgbe gf128mul glue_helper libahci ablk_helper 
dca drm mpt3sas ptp libata i2c_core cryptd raid_class pps_core scsi_transport_sas mdio wmi sunrpc dm_mirror dm_region_hash dm_log scsi_transport_iscsi
 dm_multipath dm_mod
2016-08-04 02:47:57 [115783.476527] CPU: 14 PID: 162546 Comm: mdt03_008 Tainted: P        W  OE  ------------   3.10.0-327.22.2.2chaos.ch6.x86_64 #1
2016-08-04 02:47:57 [115783.489957] Hardware name: Intel Corporation S2600WTTR/S2600WTTR, BIOS SE5C610.86B.01.01.0016.033120161139 03/31/2016
2016-08-04 02:47:57 [115783.502713] task: ffff883f17ffb980 ti: ffff887ed50d4000 task.ti: ffff887ed50d4000
2016-08-04 02:47:57 [115783.511991] RIP: 0010:[<ffffffff81651232>]  [<ffffffff81651232>] down_write+0x32/0x43
2016-08-04 02:47:57 [115783.521679] RSP: 0018:ffff887ed50d78e0  EFLAGS: 00010246
2016-08-04 02:47:57 [115783.528512] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff887ed50d7fd8
2016-08-04 02:47:57 [115783.537390] RDX: ffffffff00000001 RSI: 000000000000002f RDI: ffffffff81886372
2016-08-04 02:47:57 [115783.546265] RBP: ffff887ed50d78e8 R08: ffff8870f59ada00 R09: ffff883f7ec07b00
2016-08-04 02:47:57 [115783.555130] R10: ffffffffa0f6f1ec R11: ffff88768502ba60 R12: ffff887f246b6a00
2016-08-04 02:47:57 [115783.563974] R13: 0000000000000000 R14: ffff88768502ba40 R15: ffff8870f59ada00
2016-08-04 02:47:57 [115783.572814] FS:  0000000000000000(0000) GS:ffff887f7eac0000(0000) knlGS:0000000000000000
2016-08-04 02:47:57 [115783.582719] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
2016-08-04 02:47:57 [115783.589997] CR2: 0000000000000000 CR3: 0000000001962000 CR4: 00000000001407e0
2016-08-04 02:47:57 [115783.598840] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
2016-08-04 02:47:57 [115783.607672] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
2016-08-04 02:47:57 [115783.616493] Stack:
2016-08-04 02:47:57 [115783.619577]  0000000000000000 ffff887ed50d7938 ffffffffa0ca6f84 ffff88768502ba40
2016-08-04 02:47:57 [115783.628726]  ffff8878897d8000 ffff887ef6bcd2c0 ffff887f246b6a00 ffff887ef6bcd2c0
2016-08-04 02:47:57 [115783.637872]  ffff8878897d8000 ffff88768502ba40 ffff8870f59ada00 ffff887ed50d7978
2016-08-04 02:47:57 [115783.647011] Call Trace:
2016-08-04 02:47:57 [115783.650601]  [<ffffffffa0ca6f84>] llog_cat_add_rec+0x1d4/0x780 [obdclass]
2016-08-04 02:47:57 [115783.659046]  [<ffffffffa0c9fa3a>] llog_add+0x7a/0x1a0 [obdclass]
2016-08-04 02:47:57 [115783.666616]  [<ffffffffa0f6f1ec>] ? sub_updates_write+0x7f6/0xef8 [ptlrpc]
2016-08-04 02:47:57 [115783.675142]  [<ffffffffa0f6f5e3>] sub_updates_write+0xbed/0xef8 [ptlrpc]
2016-08-04 02:47:57 [115783.683642]  [<ffffffffa0f5dc0f>] top_trans_stop+0x62f/0x970 [ptlrpc]
2016-08-04 02:47:57 [115783.691833]  [<ffffffffa1213399>] lod_trans_stop+0x259/0x340 [lod]
2016-08-04 02:47:57 [115783.699562]  [<ffffffffa0d02380>] ? linkea_add_buf+0x80/0x170 [obdclass]
2016-08-04 02:47:57 [115783.707871]  [<ffffffffa12a18fa>] mdd_trans_stop+0x1a/0x1c [mdd]
2016-08-04 02:47:57 [115783.715384]  [<ffffffffa128cb08>] mdd_link+0x2e8/0x930 [mdd]
2016-08-04 02:47:57 [115783.722511]  [<ffffffffa0ee9522>] ? lustre_msg_get_versions+0x22/0xf0 [ptlrpc]
2016-08-04 02:47:57 [115783.731563]  [<ffffffffa115f08e>] mdt_reint_link+0xade/0xc30 [mdt]
2016-08-04 02:47:57 [115783.739437]  [<ffffffff81309f82>] ? strlcpy+0x42/0x60
2016-08-04 02:47:57 [115783.745864]  [<ffffffffa1161470>] mdt_reint_rec+0x80/0x210 [mdt]
2016-08-04 02:47:57 [115783.753352]  [<ffffffffa1144971>] mdt_reint_internal+0x5e1/0x990 [mdt]
2016-08-04 02:47:57 [115783.761426]  [<ffffffffa114e0d7>] mdt_reint+0x67/0x140 [mdt]
2016-08-04 02:47:57 [115783.768525]  [<ffffffffa0f49695>] tgt_request_handle+0x915/0x1320 [ptlrpc]
2016-08-04 02:47:57 [115783.776987]  [<ffffffffa0ef60cb>] ptlrpc_server_handle_request+0x21b/0xa90 [ptlrpc]
2016-08-04 02:47:57 [115783.786323]  [<ffffffffa0784758>] ? lc_watchdog_touch+0x68/0x180 [libcfs]
2016-08-04 02:47:57 [115783.794830]  [<ffffffffa0ef3c9b>] ? ptlrpc_wait_event+0xab/0x350 [ptlrpc]
2016-08-04 02:47:57 [115783.803172]  [<ffffffff810bd4c2>] ? default_wake_function+0x12/0x20
2016-08-04 02:47:57 [115783.810903]  [<ffffffff810b33f8>] ? __wake_up_common+0x58/0x90
2016-08-04 02:47:57 [115783.818150]  [<ffffffffa0efa170>] ptlrpc_main+0xa90/0x1db0 [ptlrpc]
2016-08-04 02:47:57 [115783.825862]  [<ffffffff81015588>] ? __switch_to+0xf8/0x4d0
2016-08-04 02:47:57 [115783.832686]  [<ffffffffa0ef96e0>] ? ptlrpc_register_service+0xe40/0xe40 [ptlrpc]
2016-08-04 02:47:57 [115783.841633]  [<ffffffff810a997f>] kthread+0xcf/0xe0
2016-08-04 02:47:57 [115783.847742]  [<ffffffff810a98b0>] ? kthread_create_on_node+0x140/0x140
2016-08-04 02:47:57 [115783.855707]  [<ffffffff8165d658>] ret_from_fork+0x58/0x90
2016-08-04 02:47:57 [115783.862363]  [<ffffffff810a98b0>] ? kthread_create_on_node+0x140/0x140
2016-08-04 02:47:57 [115783.870270] Code: d2 be 2f 00 00 00 48 89 e5 53 48 89 fb 48 c7 c7 72 63 88 81 e8 00 49 a6 ff e8 8b 10 00 00 48 ba 01 00 00 00 ff ff ff ff 48 89 d8 <f0> 48 0f c1 10 85 d2 74 05 e8 f0 e9 cb ff 5b 5d c3 55 48 89 e5 
2016-08-04 02:47:57 [115783.893325] RIP  [<ffffffff81651232>] down_write+0x32/0x43
2016-08-04 02:47:57 [115783.900088]  RSP <ffff887ed50d78e0>
2016-08-04 02:47:57 [115783.904604] CR2: 0000000000000000

Di Wang suggested that this was due to LU-7800, and asked that we try patch 18542. We tested Patch Set version 4 of that change, but it proved disastrous in testing.

There were lots of messages of this form on the primary MDT:

2016-08-09 14:52:30 [358322.078236] LNet: Service thread pid 134876 was inactive for 200.32s. The thread might be hung, or it might only be slow and will resume later. Dumping the sta
ck trace for debugging purposes:
2016-08-09 14:52:30 [358322.098689] Pid: 134876, comm: mdt03_010
2016-08-09 14:52:30 [358322.103732] 
2016-08-09 14:52:30 [358322.103732] Call Trace:
2016-08-09 14:52:30 [358322.109385]  [<ffffffff81652479>] schedule+0x29/0x70
2016-08-09 14:52:30 [358322.115535]  [<ffffffff8164fea5>] schedule_timeout+0x175/0x310
2016-08-09 14:52:30 [358322.122680]  [<ffffffff8108ef30>] ? process_timeout+0x0/0x10
2016-08-09 14:52:30 [358322.129629]  [<ffffffffa0f0b950>] ? ldlm_expired_completion_wait+0x0/0x250 [ptlrpc]
2016-08-09 14:52:30 [358322.138798]  [<ffffffffa0f0c141>] ldlm_completion_ast+0x5a1/0x910 [ptlrpc]
2016-08-09 14:52:30 [358322.147085]  [<ffffffff810bd4f0>] ? default_wake_function+0x0/0x20
2016-08-09 14:52:30 [358322.154615]  [<ffffffffa0f0e870>] ldlm_cli_enqueue_local+0x230/0x860 [ptlrpc]
2016-08-09 14:52:30 [358322.163224]  [<ffffffffa0f0bba0>] ? ldlm_completion_ast+0x0/0x910 [ptlrpc]
2016-08-09 14:52:30 [358322.171534]  [<ffffffffa118e5b0>] ? mdt_blocking_ast+0x0/0x2d0 [mdt]
2016-08-09 14:52:30 [358322.179253]  [<ffffffffa119e10a>] mdt_object_local_lock+0x52a/0xb00 [mdt]
2016-08-09 14:52:30 [358322.187458]  [<ffffffffa118e5b0>] ? mdt_blocking_ast+0x0/0x2d0 [mdt]
2016-08-09 14:52:30 [358322.195194]  [<ffffffffa0f0bba0>] ? ldlm_completion_ast+0x0/0x910 [ptlrpc]
2016-08-09 14:52:30 [358322.203502]  [<ffffffffa12527f5>] ? lod_index_lookup+0x25/0x30 [lod]
2016-08-09 14:52:30 [358322.211233]  [<ffffffffa0bbdd4f>] ? __mdd_lookup.isra.17+0x26f/0x450 [mdd]
2016-08-09 14:52:30 [358322.219527]  [<ffffffffa119e73e>] mdt_object_lock_internal+0x5e/0x2f0 [mdt]
2016-08-09 14:52:30 [358322.227934]  [<ffffffffa08e5306>] ? lu_object_find+0x16/0x20 [obdclass]
2016-08-09 14:52:30 [358322.235922]  [<ffffffffa119f614>] mdt_getattr_name_lock+0x8a4/0x1900 [mdt]
2016-08-09 14:52:30 [358322.244214]  [<ffffffffa11a0920>] mdt_intent_getattr+0x2b0/0x470 [mdt]
2016-08-09 14:52:30 [358322.252111]  [<ffffffffa11a3c2c>] mdt_intent_policy+0x5bc/0xbb0 [mdt]
2016-08-09 14:52:30 [358322.259931]  [<ffffffffa0ef20d7>] ldlm_lock_enqueue+0x387/0x970 [ptlrpc]
2016-08-09 14:52:30 [358322.268035]  [<ffffffffa0f1a9e2>] ldlm_handle_enqueue0+0x772/0x16b0 [ptlrpc]
2016-08-09 14:52:30 [358322.276536]  [<ffffffffa0f42030>] ? lustre_swab_ldlm_request+0x0/0x30 [ptlrpc]
2016-08-09 14:52:30 [358322.285221]  [<ffffffffa0f9a272>] tgt_enqueue+0x62/0x210 [ptlrpc]
2016-08-09 14:52:30 [358322.292658]  [<ffffffffa0f9e695>] tgt_request_handle+0x915/0x1320 [ptlrpc]
2016-08-09 14:52:30 [358322.300956]  [<ffffffffa0f4b0cb>] ptlrpc_server_handle_request+0x21b/0xa90 [ptlrpc]
2016-08-09 14:52:30 [358322.310111]  [<ffffffffa0794758>] ? lc_watchdog_touch+0x68/0x180 [libcfs]
2016-08-09 14:52:30 [358322.318353]  [<ffffffffa0f48c9b>] ? ptlrpc_wait_event+0xab/0x350 [ptlrpc]
2016-08-09 14:52:30 [358322.326534]  [<ffffffff810bd502>] ? default_wake_function+0x12/0x20
2016-08-09 14:52:30 [358322.334141]  [<ffffffff810b3438>] ? __wake_up_common+0x58/0x90
2016-08-09 14:52:30 [358322.341281]  [<ffffffffa0f4f170>] ptlrpc_main+0xa90/0x1db0 [ptlrpc]
2016-08-09 14:52:30 [358322.348907]  [<ffffffffa0f4e6e0>] ? ptlrpc_main+0x0/0x1db0 [ptlrpc]
2016-08-09 14:52:30 [358322.356497]  [<ffffffff810a99bf>] kthread+0xcf/0xe0
2016-08-09 14:52:30 [358322.362539]  [<ffffffff810a98f0>] ? kthread+0x0/0xe0
2016-08-09 14:52:30 [358322.368674]  [<ffffffff8165d9d8>] ret_from_fork+0x58/0x90
2016-08-09 14:52:30 [358322.375283]  [<ffffffff810a98f0>] ? kthread+0x0/0xe0
2016-08-09 14:52:30 [358322.381399] 
2016-08-09 14:52:30 [358322.383646] LustreError: dumping log to /tmp/lustre-log.1470779550.134876
2016-08-09 14:53:40 [358391.814084] LustreError: 61123:0:(ldlm_request.c:106:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1470779320, 300s ago); not entering recovery in server code, just going back to sleep ns: mdt-lquake-MDT0000_UUID lock: ffff887779436400/0xd7fdbcbd7173ced9 lrc: 3/1,0 mode: --/PR res: [0x200000411:0x2:0x0].0x0 bits 0x13 rrc: 72 type: IBT flags: 0x40210000000000 nid: local remote: 0x0 expref: -99 pid: 61123 timeout: 0 lvb_type: 0
2016-08-09 14:54:10 [358421.838538] LustreError: 134876:0:(ldlm_request.c:106:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1470779350, 300s ago); not entering recovery in server code, just going back to sleep ns: mdt-lquake-MDT0000_UUID lock: ffff887779434800/0xd7fdbcbd7173cefc lrc: 3/1,0 mode: --/PR res: [0x200000411:0x2:0x0].0x0 bits 0x13 rrc: 72 type: IBT flags: 0x40210000000000 nid: local remote: 0x0 expref: -99 pid: 134876 timeout: 0 lvb_type: 0
2016-08-09 14:55:17 [358489.126391] Lustre: 47712:0:(service.c:1336:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/5), not sending early reply
2016-08-09 14:55:17 [358489.126391]   req@ffff8833d0102d00 x1542220339501052/t0(0) o101->9f9de1b9-b610-2a38-3f5f-841c2475a01f@192.168.128.33@o2ib18:462/0 lens 696/3384 e 22 to 0 dl 1470779722 ref 2 fl Interpret:/0/0 rc 0/0
2016-08-09 14:55:57 [358529.607401] Lustre: 134880:0:(service.c:1336:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/5), not sending early reply
2016-08-09 14:55:57 [358529.607401]   req@ffff887880dff500 x1542220340548108/t0(0) o101->ad48e3df-a2c8-571f-bbf1-cbdff6fd3bfa@192.168.128.44@o2ib18:502/0 lens 696/3384 e 22 to 0 dl 1470779762 ref 2 fl Interpret:/0/0 rc 0/0
2016-08-09 14:56:07 [358539.256087] Lustre: lquake-MDT0000: Client 9f9de1b9-b610-2a38-3f5f-841c2475a01f (at 192.168.128.33@o2ib18) reconnecting
2016-08-09 14:56:07 [358539.269040] Lustre: lquake-MDT0000: Connection restored to  (at 192.168.128.33@o2ib18)
2016-08-09 14:58:35 [358687.752322] Lustre: 61096:0:(service.c:1336:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/5), not sending early reply
2016-08-09 14:58:35 [358687.752322]   req@ffff883e4bd06f00 x1542220323771140/t0(0) o101->b7ee1b55-5cca-e2f3-3fcc-0202bce14aec@192.168.128.186@o2ib18:660/0 lens 696/3384 e 22 to 0 dl 1470779920 ref 2 fl Interpret:/0/0 rc 0/0
2016-08-09 14:59:05 [358717.792806] Lustre: 61096:0:(service.c:1336:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/5), not sending early reply
2016-08-09 14:59:05 [358717.792806]   req@ffff8833da488f00 x1542220323771456/t0(0) o101->b7ee1b55-5cca-e2f3-3fcc-0202bce14aec@192.168.128.186@o2ib18:690/0 lens 696/3384 e 11 to 0 dl 1470779950 ref 2 fl Interpret:/0/0 rc 0/0
2016-08-09 14:59:25 [358737.051472] Lustre: lquake-MDT0000: Connection restored to 12dd100c-21bd-8e8c-a8e2-05374050e7f7 (at 192.168.128.186@o2ib18)
2016-08-09 14:59:25 [358737.064963] Lustre: Skipped 1 previous similar message

Simple "ls -l" commands from clients were hanging for very long periods of time (a few minutes at least).

This was with our lustre tag 2.8.0_0.0.llnlpreview.30 installed on the servers (see repo lustre-release-fe-llnl). We reverted to patch 2.8.0_0.0.llnlpreview.29, which did not have the LU-7800 patch, and everything is working as it was previously.

Note that the LU-7800 patch did not apply cleanly to 2.8. I did need to resolve some minor conflicts. I made the simple, naive conflict resolution, so it is worth checking if there was additional change needed for 2.8.

But if porting issues weren't to blame, the LU-7800 patch caused far more trouble than it was intending to fix.



 Comments   
Comment by Di Wang [ 10/Aug/16 ]

It looks like in patch 18542, up_write(&cathandle->lgh_lock) is missing in some cases. I just updated the patch, could you please try(patch set 6) again?

Comment by Peter Jones [ 10/Aug/16 ]

Should we close this ticket as duplicate of LU-7800 and continue investigations there?

Comment by Di Wang [ 10/Aug/16 ]

Probably wait until LLNL confirm this patch could fix their issue, otherwise this might be a different bug. Thanks.

Comment by Christopher Morrone [ 18/Aug/16 ]

We haven't seen this particular assertion in a few days of running the latest LU-7800 patch, but now we see LU-8514.

Comment by Peter Jones [ 18/Aug/16 ]

ok then let's close this ticket and focus ongoing efforts under LU-8489

Generated at Sat Feb 10 02:18:00 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.