Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-8489

llog_cat_add_rec NULL pointer dereference

    XMLWordPrintable

Details

    • Bug
    • Resolution: Duplicate
    • Major
    • None
    • Lustre 2.8.0
    • 3
    • 9223372036854775807

    Description

      In the course of testing LU-8370, Patch Set 5 of change http://review.whamcloud.com/21334, I ran into the following NULL pointer dereference on an MDS node:

      2016-08-04 02:47:57 [115783.326226] BUG: unable to handle kernel NULL pointer dereference at           (null)
      2016-08-04 02:47:57 [115783.335993] IP: [<ffffffff81651232>] down_write+0x32/0x43
      2016-08-04 02:47:57 [115783.342892] PGD 0 
      2016-08-04 02:47:57 [115783.345995] Oops: 0002 [#1] SMP 
      2016-08-04 02:47:57 [115783.350434] Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgs(OE) mgc(OE) osd_zfs(OE) lquota(OE) fid(OE) fld(OE
      ) ptlrpc(OE) obdclass(OE) rpcsec_gss_krb5 ko2iblnd(OE) lnet(OE) sha512_generic crypto_null libcfs(OE) nfsv3 ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad
       rdma_cm ib_cm iw_cm ib_sa ib_mad mlx5_ib iTCO_wdt iTCO_vendor_support intel_powerclamp coretemp intel_rapl kvm ib_core ib_addr pcspkr mlx5_core sb_ed
      ac edac_core mei_me lpc_ich mei mfd_core ses enclosure ipmi_devintf zfs(POE) zunicode(POE) zavl(POE) zcommon(POE) sg znvpair(POE) spl(OE) zlib_deflate
       i2c_i801 ioatdma shpchp ipmi_si ipmi_msghandler acpi_power_meter acpi_cpufreq binfmt_misc nfsd nfs_acl ip_tables auth_rpcgss nfsv4 dns_resolver nfs l
      ockd grace fscache dm_round_robin sd_mod crc_t10dif crct10dif_generic mgag200 syscopyarea sysfillrect crct10dif_pclmul sysimgblt crct10dif_common i2c_
      algo_bit crc32_pclmul crc32c_intel drm_kms_helper mxm_wmi ghash_clmulni_intel ttm ahci aesni_intel lrw ixgbe gf128mul glue_helper libahci ablk_helper 
      dca drm mpt3sas ptp libata i2c_core cryptd raid_class pps_core scsi_transport_sas mdio wmi sunrpc dm_mirror dm_region_hash dm_log scsi_transport_iscsi
       dm_multipath dm_mod
      2016-08-04 02:47:57 [115783.476527] CPU: 14 PID: 162546 Comm: mdt03_008 Tainted: P        W  OE  ------------   3.10.0-327.22.2.2chaos.ch6.x86_64 #1
      2016-08-04 02:47:57 [115783.489957] Hardware name: Intel Corporation S2600WTTR/S2600WTTR, BIOS SE5C610.86B.01.01.0016.033120161139 03/31/2016
      2016-08-04 02:47:57 [115783.502713] task: ffff883f17ffb980 ti: ffff887ed50d4000 task.ti: ffff887ed50d4000
      2016-08-04 02:47:57 [115783.511991] RIP: 0010:[<ffffffff81651232>]  [<ffffffff81651232>] down_write+0x32/0x43
      2016-08-04 02:47:57 [115783.521679] RSP: 0018:ffff887ed50d78e0  EFLAGS: 00010246
      2016-08-04 02:47:57 [115783.528512] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff887ed50d7fd8
      2016-08-04 02:47:57 [115783.537390] RDX: ffffffff00000001 RSI: 000000000000002f RDI: ffffffff81886372
      2016-08-04 02:47:57 [115783.546265] RBP: ffff887ed50d78e8 R08: ffff8870f59ada00 R09: ffff883f7ec07b00
      2016-08-04 02:47:57 [115783.555130] R10: ffffffffa0f6f1ec R11: ffff88768502ba60 R12: ffff887f246b6a00
      2016-08-04 02:47:57 [115783.563974] R13: 0000000000000000 R14: ffff88768502ba40 R15: ffff8870f59ada00
      2016-08-04 02:47:57 [115783.572814] FS:  0000000000000000(0000) GS:ffff887f7eac0000(0000) knlGS:0000000000000000
      2016-08-04 02:47:57 [115783.582719] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      2016-08-04 02:47:57 [115783.589997] CR2: 0000000000000000 CR3: 0000000001962000 CR4: 00000000001407e0
      2016-08-04 02:47:57 [115783.598840] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      2016-08-04 02:47:57 [115783.607672] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      2016-08-04 02:47:57 [115783.616493] Stack:
      2016-08-04 02:47:57 [115783.619577]  0000000000000000 ffff887ed50d7938 ffffffffa0ca6f84 ffff88768502ba40
      2016-08-04 02:47:57 [115783.628726]  ffff8878897d8000 ffff887ef6bcd2c0 ffff887f246b6a00 ffff887ef6bcd2c0
      2016-08-04 02:47:57 [115783.637872]  ffff8878897d8000 ffff88768502ba40 ffff8870f59ada00 ffff887ed50d7978
      2016-08-04 02:47:57 [115783.647011] Call Trace:
      2016-08-04 02:47:57 [115783.650601]  [<ffffffffa0ca6f84>] llog_cat_add_rec+0x1d4/0x780 [obdclass]
      2016-08-04 02:47:57 [115783.659046]  [<ffffffffa0c9fa3a>] llog_add+0x7a/0x1a0 [obdclass]
      2016-08-04 02:47:57 [115783.666616]  [<ffffffffa0f6f1ec>] ? sub_updates_write+0x7f6/0xef8 [ptlrpc]
      2016-08-04 02:47:57 [115783.675142]  [<ffffffffa0f6f5e3>] sub_updates_write+0xbed/0xef8 [ptlrpc]
      2016-08-04 02:47:57 [115783.683642]  [<ffffffffa0f5dc0f>] top_trans_stop+0x62f/0x970 [ptlrpc]
      2016-08-04 02:47:57 [115783.691833]  [<ffffffffa1213399>] lod_trans_stop+0x259/0x340 [lod]
      2016-08-04 02:47:57 [115783.699562]  [<ffffffffa0d02380>] ? linkea_add_buf+0x80/0x170 [obdclass]
      2016-08-04 02:47:57 [115783.707871]  [<ffffffffa12a18fa>] mdd_trans_stop+0x1a/0x1c [mdd]
      2016-08-04 02:47:57 [115783.715384]  [<ffffffffa128cb08>] mdd_link+0x2e8/0x930 [mdd]
      2016-08-04 02:47:57 [115783.722511]  [<ffffffffa0ee9522>] ? lustre_msg_get_versions+0x22/0xf0 [ptlrpc]
      2016-08-04 02:47:57 [115783.731563]  [<ffffffffa115f08e>] mdt_reint_link+0xade/0xc30 [mdt]
      2016-08-04 02:47:57 [115783.739437]  [<ffffffff81309f82>] ? strlcpy+0x42/0x60
      2016-08-04 02:47:57 [115783.745864]  [<ffffffffa1161470>] mdt_reint_rec+0x80/0x210 [mdt]
      2016-08-04 02:47:57 [115783.753352]  [<ffffffffa1144971>] mdt_reint_internal+0x5e1/0x990 [mdt]
      2016-08-04 02:47:57 [115783.761426]  [<ffffffffa114e0d7>] mdt_reint+0x67/0x140 [mdt]
      2016-08-04 02:47:57 [115783.768525]  [<ffffffffa0f49695>] tgt_request_handle+0x915/0x1320 [ptlrpc]
      2016-08-04 02:47:57 [115783.776987]  [<ffffffffa0ef60cb>] ptlrpc_server_handle_request+0x21b/0xa90 [ptlrpc]
      2016-08-04 02:47:57 [115783.786323]  [<ffffffffa0784758>] ? lc_watchdog_touch+0x68/0x180 [libcfs]
      2016-08-04 02:47:57 [115783.794830]  [<ffffffffa0ef3c9b>] ? ptlrpc_wait_event+0xab/0x350 [ptlrpc]
      2016-08-04 02:47:57 [115783.803172]  [<ffffffff810bd4c2>] ? default_wake_function+0x12/0x20
      2016-08-04 02:47:57 [115783.810903]  [<ffffffff810b33f8>] ? __wake_up_common+0x58/0x90
      2016-08-04 02:47:57 [115783.818150]  [<ffffffffa0efa170>] ptlrpc_main+0xa90/0x1db0 [ptlrpc]
      2016-08-04 02:47:57 [115783.825862]  [<ffffffff81015588>] ? __switch_to+0xf8/0x4d0
      2016-08-04 02:47:57 [115783.832686]  [<ffffffffa0ef96e0>] ? ptlrpc_register_service+0xe40/0xe40 [ptlrpc]
      2016-08-04 02:47:57 [115783.841633]  [<ffffffff810a997f>] kthread+0xcf/0xe0
      2016-08-04 02:47:57 [115783.847742]  [<ffffffff810a98b0>] ? kthread_create_on_node+0x140/0x140
      2016-08-04 02:47:57 [115783.855707]  [<ffffffff8165d658>] ret_from_fork+0x58/0x90
      2016-08-04 02:47:57 [115783.862363]  [<ffffffff810a98b0>] ? kthread_create_on_node+0x140/0x140
      2016-08-04 02:47:57 [115783.870270] Code: d2 be 2f 00 00 00 48 89 e5 53 48 89 fb 48 c7 c7 72 63 88 81 e8 00 49 a6 ff e8 8b 10 00 00 48 ba 01 00 00 00 ff ff ff ff 48 89 d8 <f0> 48 0f c1 10 85 d2 74 05 e8 f0 e9 cb ff 5b 5d c3 55 48 89 e5 
      2016-08-04 02:47:57 [115783.893325] RIP  [<ffffffff81651232>] down_write+0x32/0x43
      2016-08-04 02:47:57 [115783.900088]  RSP <ffff887ed50d78e0>
      2016-08-04 02:47:57 [115783.904604] CR2: 0000000000000000
      

      Di Wang suggested that this was due to LU-7800, and asked that we try patch 18542. We tested Patch Set version 4 of that change, but it proved disastrous in testing.

      There were lots of messages of this form on the primary MDT:

      2016-08-09 14:52:30 [358322.078236] LNet: Service thread pid 134876 was inactive for 200.32s. The thread might be hung, or it might only be slow and will resume later. Dumping the sta
      ck trace for debugging purposes:
      2016-08-09 14:52:30 [358322.098689] Pid: 134876, comm: mdt03_010
      2016-08-09 14:52:30 [358322.103732] 
      2016-08-09 14:52:30 [358322.103732] Call Trace:
      2016-08-09 14:52:30 [358322.109385]  [<ffffffff81652479>] schedule+0x29/0x70
      2016-08-09 14:52:30 [358322.115535]  [<ffffffff8164fea5>] schedule_timeout+0x175/0x310
      2016-08-09 14:52:30 [358322.122680]  [<ffffffff8108ef30>] ? process_timeout+0x0/0x10
      2016-08-09 14:52:30 [358322.129629]  [<ffffffffa0f0b950>] ? ldlm_expired_completion_wait+0x0/0x250 [ptlrpc]
      2016-08-09 14:52:30 [358322.138798]  [<ffffffffa0f0c141>] ldlm_completion_ast+0x5a1/0x910 [ptlrpc]
      2016-08-09 14:52:30 [358322.147085]  [<ffffffff810bd4f0>] ? default_wake_function+0x0/0x20
      2016-08-09 14:52:30 [358322.154615]  [<ffffffffa0f0e870>] ldlm_cli_enqueue_local+0x230/0x860 [ptlrpc]
      2016-08-09 14:52:30 [358322.163224]  [<ffffffffa0f0bba0>] ? ldlm_completion_ast+0x0/0x910 [ptlrpc]
      2016-08-09 14:52:30 [358322.171534]  [<ffffffffa118e5b0>] ? mdt_blocking_ast+0x0/0x2d0 [mdt]
      2016-08-09 14:52:30 [358322.179253]  [<ffffffffa119e10a>] mdt_object_local_lock+0x52a/0xb00 [mdt]
      2016-08-09 14:52:30 [358322.187458]  [<ffffffffa118e5b0>] ? mdt_blocking_ast+0x0/0x2d0 [mdt]
      2016-08-09 14:52:30 [358322.195194]  [<ffffffffa0f0bba0>] ? ldlm_completion_ast+0x0/0x910 [ptlrpc]
      2016-08-09 14:52:30 [358322.203502]  [<ffffffffa12527f5>] ? lod_index_lookup+0x25/0x30 [lod]
      2016-08-09 14:52:30 [358322.211233]  [<ffffffffa0bbdd4f>] ? __mdd_lookup.isra.17+0x26f/0x450 [mdd]
      2016-08-09 14:52:30 [358322.219527]  [<ffffffffa119e73e>] mdt_object_lock_internal+0x5e/0x2f0 [mdt]
      2016-08-09 14:52:30 [358322.227934]  [<ffffffffa08e5306>] ? lu_object_find+0x16/0x20 [obdclass]
      2016-08-09 14:52:30 [358322.235922]  [<ffffffffa119f614>] mdt_getattr_name_lock+0x8a4/0x1900 [mdt]
      2016-08-09 14:52:30 [358322.244214]  [<ffffffffa11a0920>] mdt_intent_getattr+0x2b0/0x470 [mdt]
      2016-08-09 14:52:30 [358322.252111]  [<ffffffffa11a3c2c>] mdt_intent_policy+0x5bc/0xbb0 [mdt]
      2016-08-09 14:52:30 [358322.259931]  [<ffffffffa0ef20d7>] ldlm_lock_enqueue+0x387/0x970 [ptlrpc]
      2016-08-09 14:52:30 [358322.268035]  [<ffffffffa0f1a9e2>] ldlm_handle_enqueue0+0x772/0x16b0 [ptlrpc]
      2016-08-09 14:52:30 [358322.276536]  [<ffffffffa0f42030>] ? lustre_swab_ldlm_request+0x0/0x30 [ptlrpc]
      2016-08-09 14:52:30 [358322.285221]  [<ffffffffa0f9a272>] tgt_enqueue+0x62/0x210 [ptlrpc]
      2016-08-09 14:52:30 [358322.292658]  [<ffffffffa0f9e695>] tgt_request_handle+0x915/0x1320 [ptlrpc]
      2016-08-09 14:52:30 [358322.300956]  [<ffffffffa0f4b0cb>] ptlrpc_server_handle_request+0x21b/0xa90 [ptlrpc]
      2016-08-09 14:52:30 [358322.310111]  [<ffffffffa0794758>] ? lc_watchdog_touch+0x68/0x180 [libcfs]
      2016-08-09 14:52:30 [358322.318353]  [<ffffffffa0f48c9b>] ? ptlrpc_wait_event+0xab/0x350 [ptlrpc]
      2016-08-09 14:52:30 [358322.326534]  [<ffffffff810bd502>] ? default_wake_function+0x12/0x20
      2016-08-09 14:52:30 [358322.334141]  [<ffffffff810b3438>] ? __wake_up_common+0x58/0x90
      2016-08-09 14:52:30 [358322.341281]  [<ffffffffa0f4f170>] ptlrpc_main+0xa90/0x1db0 [ptlrpc]
      2016-08-09 14:52:30 [358322.348907]  [<ffffffffa0f4e6e0>] ? ptlrpc_main+0x0/0x1db0 [ptlrpc]
      2016-08-09 14:52:30 [358322.356497]  [<ffffffff810a99bf>] kthread+0xcf/0xe0
      2016-08-09 14:52:30 [358322.362539]  [<ffffffff810a98f0>] ? kthread+0x0/0xe0
      2016-08-09 14:52:30 [358322.368674]  [<ffffffff8165d9d8>] ret_from_fork+0x58/0x90
      2016-08-09 14:52:30 [358322.375283]  [<ffffffff810a98f0>] ? kthread+0x0/0xe0
      2016-08-09 14:52:30 [358322.381399] 
      2016-08-09 14:52:30 [358322.383646] LustreError: dumping log to /tmp/lustre-log.1470779550.134876
      2016-08-09 14:53:40 [358391.814084] LustreError: 61123:0:(ldlm_request.c:106:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1470779320, 300s ago); not entering recovery in server code, just going back to sleep ns: mdt-lquake-MDT0000_UUID lock: ffff887779436400/0xd7fdbcbd7173ced9 lrc: 3/1,0 mode: --/PR res: [0x200000411:0x2:0x0].0x0 bits 0x13 rrc: 72 type: IBT flags: 0x40210000000000 nid: local remote: 0x0 expref: -99 pid: 61123 timeout: 0 lvb_type: 0
      2016-08-09 14:54:10 [358421.838538] LustreError: 134876:0:(ldlm_request.c:106:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1470779350, 300s ago); not entering recovery in server code, just going back to sleep ns: mdt-lquake-MDT0000_UUID lock: ffff887779434800/0xd7fdbcbd7173cefc lrc: 3/1,0 mode: --/PR res: [0x200000411:0x2:0x0].0x0 bits 0x13 rrc: 72 type: IBT flags: 0x40210000000000 nid: local remote: 0x0 expref: -99 pid: 134876 timeout: 0 lvb_type: 0
      2016-08-09 14:55:17 [358489.126391] Lustre: 47712:0:(service.c:1336:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/5), not sending early reply
      2016-08-09 14:55:17 [358489.126391]   req@ffff8833d0102d00 x1542220339501052/t0(0) o101->9f9de1b9-b610-2a38-3f5f-841c2475a01f@192.168.128.33@o2ib18:462/0 lens 696/3384 e 22 to 0 dl 1470779722 ref 2 fl Interpret:/0/0 rc 0/0
      2016-08-09 14:55:57 [358529.607401] Lustre: 134880:0:(service.c:1336:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/5), not sending early reply
      2016-08-09 14:55:57 [358529.607401]   req@ffff887880dff500 x1542220340548108/t0(0) o101->ad48e3df-a2c8-571f-bbf1-cbdff6fd3bfa@192.168.128.44@o2ib18:502/0 lens 696/3384 e 22 to 0 dl 1470779762 ref 2 fl Interpret:/0/0 rc 0/0
      2016-08-09 14:56:07 [358539.256087] Lustre: lquake-MDT0000: Client 9f9de1b9-b610-2a38-3f5f-841c2475a01f (at 192.168.128.33@o2ib18) reconnecting
      2016-08-09 14:56:07 [358539.269040] Lustre: lquake-MDT0000: Connection restored to  (at 192.168.128.33@o2ib18)
      2016-08-09 14:58:35 [358687.752322] Lustre: 61096:0:(service.c:1336:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/5), not sending early reply
      2016-08-09 14:58:35 [358687.752322]   req@ffff883e4bd06f00 x1542220323771140/t0(0) o101->b7ee1b55-5cca-e2f3-3fcc-0202bce14aec@192.168.128.186@o2ib18:660/0 lens 696/3384 e 22 to 0 dl 1470779920 ref 2 fl Interpret:/0/0 rc 0/0
      2016-08-09 14:59:05 [358717.792806] Lustre: 61096:0:(service.c:1336:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/5), not sending early reply
      2016-08-09 14:59:05 [358717.792806]   req@ffff8833da488f00 x1542220323771456/t0(0) o101->b7ee1b55-5cca-e2f3-3fcc-0202bce14aec@192.168.128.186@o2ib18:690/0 lens 696/3384 e 11 to 0 dl 1470779950 ref 2 fl Interpret:/0/0 rc 0/0
      2016-08-09 14:59:25 [358737.051472] Lustre: lquake-MDT0000: Connection restored to 12dd100c-21bd-8e8c-a8e2-05374050e7f7 (at 192.168.128.186@o2ib18)
      2016-08-09 14:59:25 [358737.064963] Lustre: Skipped 1 previous similar message
      

      Simple "ls -l" commands from clients were hanging for very long periods of time (a few minutes at least).

      This was with our lustre tag 2.8.0_0.0.llnlpreview.30 installed on the servers (see repo lustre-release-fe-llnl). We reverted to patch 2.8.0_0.0.llnlpreview.29, which did not have the LU-7800 patch, and everything is working as it was previously.

      Note that the LU-7800 patch did not apply cleanly to 2.8. I did need to resolve some minor conflicts. I made the simple, naive conflict resolution, so it is worth checking if there was additional change needed for 2.8.

      But if porting issues weren't to blame, the LU-7800 patch caused far more trouble than it was intending to fix.

      Attachments

        Issue Links

          Activity

            People

              pjones Peter Jones
              morrone Christopher Morrone (Inactive)
              Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: