[LU-12615] Lustre mdt_object_remote() bug Created: 30/Jul/19  Updated: 12/Sep/19  Resolved: 21/Aug/19

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.13.0
Fix Version/s: Lustre 2.13.0, Lustre 2.12.3

Type: Bug Priority: Critical
Reporter: Alibaba Cloud Assignee: Hongchao Zhang
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Related
is related to LU-12605 Lustre target_handle_connect() bug Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

In the latest version of lustre file system, the mdt module has a null pointer dereference bug due to the lack of validation for specific fields of packets sent by client.

The kernel panic:

 

[93830.742518] BUG: unable to handle kernel NULL pointer dereference at 0000000000000050
[93830.745373] IP: [<ffffffffc101cc2a>] mdt_object_lock_internal+0x4a/0x360 [mdt]
[93830.748121] PGD 0
[93830.750311] Oops: 0000 [#1] SMP
[93830.752631] Modules linked in: ofd(OE) ost(OE) osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgs(OE) osd_ldiskfs(OE) lquota(OE) ldiskfs(OE) loop lustre(OE) obdecho(OE) mgc(OE) lov(OE) mdc(OE) osc(OE) lmv(OE) fid(OE) fld(OE) ptlrpc(OE) obdclass(OE) crc_t10dif crct10dif_generic ksocklnd(OE) lnet(OE) libcfs(OE) dm_flakey dm_mod nfit libnvdimm iosf_mbi crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ppdev ablk_helper parport_pc cryptd parport virtio_balloon joydev i2c_piix4 pcspkr ip_tables ext4 mbcache jbd2 ata_generic pata_acpi virtio_net virtio_console virtio_blk cirrus drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm ata_piix crct10dif_pclmul drm crct10dif_common crc32c_intel libata serio_raw virtio_pci virtio_ring virtio drm_panel_orientation_quirks floppy [last unloaded: stap_f794b15c5a4eaa0a0f39cf777c459f09_5532]
[93830.772472]
[93830.774428] CPU: 0 PID: 5222 Comm: mdt00_004 Kdump: loaded Tainted: G           OE  ------------   3.10.0-957.10.1.el7_lustre.x86_64 #1
[93830.779447] Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS 3288b3c 04/01/2014
[93830.782131] task: ffff9e93fb009040 ti: ffff9e90369d8000 task.ti: ffff9e90369d8000
[93830.784762] RIP: 0010:[<ffffffffc101cc2a>]  [<ffffffffc101cc2a>] mdt_object_lock_internal+0x4a/0x360 [mdt]
[93830.787609] RSP: 0018:ffff9e90369dba80  EFLAGS: 00010246
[93830.789999] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff9e90369dbae8
[93830.792546] RDX: ffff9e93ef99a0f0 RSI: 0000000000000000 RDI: ffff9e93ef99a000
[93830.795054] RBP: ffff9e90369dbad0 R08: 0000000000000000 R09: 0000000000000000
[93830.797536] R10: ffff9e9400aa3c00 R11: 0000000000000020 R12: ffff9e93ef99a0f0
[93830.799980] R13: ffff9e93ef99a000 R14: ffff9e90369dbae8 R15: 0000000000000000
[93830.802402] FS:  0000000000000000(0000) GS:ffff9e943fc00000(0000) knlGS:0000000000000000
[93830.804901] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[93830.807152] CR2: 0000000000000050 CR3: 0000000422b0e000 CR4: 00000000003606f0
[93830.809521] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[93830.811846] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[93830.814138] Call Trace:
[93830.816037]  [<ffffffffc094c2dc>] ? lustre_msg_get_flags+0x2c/0xa0 [ptlrpc]
[93830.818297]  [<ffffffffc1025f33>] mdt_intent_getxattr+0xb3/0x2c0 [mdt]
[93830.820529]  [<ffffffffc10229e4>] mdt_intent_policy+0x2d4/0xdd0 [mdt]
[93830.822702]  [<ffffffffc1025e80>] ? mdt_intent_getattr+0x480/0x480 [mdt]
[93830.824902]  [<ffffffffc08ffc66>] ldlm_lock_enqueue+0x356/0xa20 [ptlrpc]
[93830.827040]  [<ffffffffc05da3d3>] ? cfs_hash_bd_add_locked+0x63/0x80 [libcfs]
[93830.829184]  [<ffffffffc05dd96e>] ? cfs_hash_add+0xbe/0x1a0 [libcfs]
[93830.831278]  [<ffffffffc0928587>] ldlm_handle_enqueue0+0xa47/0x15a0 [ptlrpc]
[93830.833470]  [<ffffffffc0950520>] ? lustre_swab_ldlm_lock_desc+0x30/0x30 [ptlrpc]
[93830.835699]  [<ffffffffc09b1082>] tgt_enqueue+0x62/0x210 [ptlrpc]
[93830.837786]  [<ffffffffc09b72ca>] tgt_request_handle+0x91a/0x15c0 [ptlrpc]
[93830.839875]  [<ffffffffc05d6fa7>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
[93830.841956]  [<ffffffffc095a88e>] ptlrpc_server_handle_request+0x24e/0xab0 [ptlrpc]
[93830.844066]  [<ffffffffb1ccbadb>] ? __wake_up_common+0x5b/0x90
[93830.846014]  [<ffffffffc095e384>] ptlrpc_main+0xbb4/0x20f0 [ptlrpc]
[93830.847946]  [<ffffffffb1cd08c0>] ? finish_task_switch+0x50/0x1c0
[93830.849866]  [<ffffffffc095d7d0>] ? ptlrpc_register_service+0xfa0/0xfa0 [ptlrpc]
[93830.851842]  [<ffffffffb1cc1c71>] kthread+0xd1/0xe0
[93830.853559]  [<ffffffffb1cc1ba0>] ? insert_kthread_work+0x40/0x40
[93830.855362]  [<ffffffffb2375c1d>] ret_from_fork_nospec_begin+0x7/0x21
[93830.857178]  [<ffffffffb1cc1ba0>] ? insert_kthread_work+0x40/0x40
[93830.858951] Code: 89 f3 48 83 ec 28 65 48 8b 04 25 28 00 00 00 48 89 45 d0 31 c0 f6 05 bd a3 5d ff 01 74 0d f6 05 b8 a3 5d ff 04 0f 85 7e 00 00 00 <48> 8b 43 50 f6 40 1c 02 0f 85 d6 00 00 00 45 0f b6 c9 4d 89 f8 
[93830.864078] RIP  [<ffffffffc101cc2a>] mdt_object_lock_internal+0x4a/0x360 [mdt]
[93830.866025]  RSP <ffff9e90369dba80>
[93830.867566] CR2: 0000000000000050

In function mdt_object_remote(), there is no check about the pointer o, if the value of o is null, a null pointer dereference bug will happen.

static inline int mdt_object_remote(const struct mdt_object *o){ 
return lu_object_remote(&o->mot_obj);
}


 Comments   
Comment by Peter Jones [ 31/Jul/19 ]

Hongchao

Could you please investigate

Thanks

Peter

Comment by Hongchao Zhang [ 01/Aug/19 ]

Hi,
What is the client version?
if the mdt_object is NULL, there could be two cases, the first is there is no MDT_BODY field in the request,
the second is no OBD_MD_FLID flag is set on "body->mbo_valid".

btw, is the kernel dump available?
Thanks!

Comment by Andreas Dilger [ 01/Aug/19 ]

Hongchao, this is being induced by specially-crafted RPC requests to test error handling on the server.

It looks like this is via:

mdt_intent_getxattr()
  ->mdt_object_lock(info->mti_object)
    ->mdt_object_lock_internal()
      ->mdt_object_remote()
        ->lu_object_remote(&o->mot_obj)

The info->mti_object field is assigned in mdt_body_unpack(), but that is checking !IS_ERR(obj) before assigning it. It looks like the problem is caused by the request not containing the OBD_MD_FLID flag to indicate that the mbo_fid1 field is valid, which causes mdt_body_unpack() to return without setting mti_object, without returning an error. For some RPCs (e.g. mdt_sync()) there is an MDT_BODY field, but is not a requirement for OBD_MD_FLFID to be set because it can be used for both whole-device sync or single-file sync, so this is correct behavior for mdt_body_unpack(). The other place that mti_object is assigned is tsi2mdt_info() using tsi_corpus, but that is itself set in mdt_body_unpack() so is the same root cause.

I checked all of the mdt_reint_* routines and they appear to handle the error from mdt_object_find() properly, so you should focus on other places that are using mti_object directly. Looking at mdt_sync(), the handling is a bit strange because it is checking if mbo_fid1 is unset for the whole-device sync, then assumes mti_object is OK otherwise (which it may not be if OBD_MD_FLFID is unset).

I think there are two potential ways to handle this:

  • add checks for mti_object == NULL at the start of high-level RPC handling routines mdt_intent_getxattr(), mdt_getattr(), mdt_getxattr(), mdt_swap_layouts(), and mdt_sync() (some of which LASSERT(obj != NULL) or equivalent, and should instead just return -EPROTO in that case)
  • add checks in low-level routines like mdt_object_remote(), mdt_object_exists(), mdt_object_child(), mdt_object_fid() that are accessing mot_obj, but those functions are called in a lot of places and if mti_object is checked once at the top level then we don't need to check it repeatedly, so I'd rather not do this.
Comment by Gerrit Updater [ 11/Aug/19 ]

Hongchao Zhang (hongchao@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/35764
Subject: LU-12615 mdt: check mdt_object
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 9ad3ae7465eaedcc86772c2518e88def4b18b8e8

Comment by Gerrit Updater [ 21/Aug/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/35764/
Subject: LU-12615 mdt: check mdt_object
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: e5e0bdb7a5c2d47ceaa2d1c190806d1be4999129

Comment by Peter Jones [ 21/Aug/19 ]

Landed for 2.13

Comment by Gerrit Updater [ 22/Aug/19 ]

Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/35869
Subject: LU-12615 mdt: check mdt_object
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: f562e6cea802aa408bd94c84427e983eeae55730

Comment by Gerrit Updater [ 12/Sep/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/35869/
Subject: LU-12615 mdt: check mdt_object
Project: fs/lustre-release
Branch: b2_12
Current Patch Set:
Commit: 41e2bd9752851a7f3989c6edfe2134742d0ab59f

Generated at Sat Feb 10 02:54:09 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.