LU-5357: GPF in lod_trans_stop()

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.7.0
    • Affects Version/s: Lustre 2.6.0, Lustre 2.7.0
    • Severity: 3
    • Rank: 14940

    Description

      While removing a striped directory, if out_create_update_req() cannot allocate an update request or its 8K buffer, we get the following GPF (likely a use after free).

      export MDSCOUNT=4
      llmount.sh
      cd /mnt/lustre
      lfs mkdir -c4 d0
      echo /root/lustre-release/lustre/ptlrpc/../../lustre/target/out_lib.c:70 > /proc/fs/lustre/alloc_fail # fail to allocate dt_update in out_create_update_req()
      rmdir d0
      
      [   85.080526] LustreError: 4395:0:(class_obd.c:198:obd_alloc_fail()) force kmalloc of dt_update (72 bytes) failed at /root/lustre-release/lustre/ptlrpc/../../lustre/target/out_lib.c:70
      [   85.085387] LustreError: 4395:0:(class_obd.c:205:obd_alloc_fail()) 63673246 total bytes and 1048576 total pages (256 bytes) allocated by Lustre, 407071412 total bytes by LNET
      [   94.798945] Lustre: ctl-lustre-MDT0000: super-sequence allocation rc = 0 [0x0000000300000400-0x0000000340000400):2:mdt
      [   94.801799] Lustre: Skipped 1 previous similar message
      [   94.803637] Lustre: cli-ctl-lustre-MDT0002: Allocated super-sequence [0x0000000300000400-0x0000000340000400):2:mdt]
      [  103.921859] LustreError: 4804:0:(class_obd.c:198:obd_alloc_fail()) force kmalloc of dt_update (72 bytes) failed at /root/lustre-release/lustre/ptlrpc/../../lustre/target/out_lib.c:70
      [  103.927270] LustreError: 4804:0:(class_obd.c:205:obd_alloc_fail()) 63959198 total bytes and 1048576 total pages (256 bytes) allocated by Lustre, 407352740 total bytes by LNET
      [  103.932656] LustreError: 4804:0:(osp_md_object.c:237:osp_md_declare_attr_set()) lustre-MDT0001-osp-MDT0000: Get OSP update buf failed: -12
      [  103.936893] LustreError: 4804:0:(lod_object.c:1081:lod_declare_attr_set()) failed declaration: -12
      [  103.939281] general protection fault: 0000 [#1] SMP
      [  103.940247] last sysfs file: /sys/devices/system/cpu/possible
      [  103.940247] CPU 0
      [  103.940247] Modules linked in: lustre(U) ofd(U) osp(U) lod(U) ost(U) mdt(U) mdd(U) mgs(U) nodemap(U) osd_ldiskfs(U) ldiskfs(U) exportfs lquota(U) lfsck(U) jbd obdecho(U) mgc(U) lov(U) osc(U) mdc(U) lmv(U) fid(U) fld(U) ptlrpc(U) obdclass(U) ksocklnd(U) lnet(U) sha512_generic sha256_generic libcfs(U) autofs4 nfs lockd fscache auth_rpcgss nfs_acl sunrpc ipv6 microcode virtio_balloon virtio_net i2c_piix4 i2c_core ext4 jbd2 mbcache virtio_blk virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: speedstep_lib]
      [  103.940247]
      [  103.940247] Pid: 4804, comm: mdt00_004 Not tainted 2.6.32-431.5.1.el6.lustre.x86_64 #1 Bochs Bochs
      [  103.940247] RIP: 0010:[<ffffffffa0d3d890>]  [<ffffffffa0d3d890>] lod_trans_stop+0x110/0x210 [lod]
      [  103.940247] RSP: 0018:ffff8801e244bab0  EFLAGS: 00010292
      [  103.940247] RAX: 6b6b6b6b6b6b6b6b RBX: ffff8801e85d6080 RCX: 0000000000000000
      [  103.940247] RDX: 0000000000000000 RSI: ffff8801e244cf58 RDI: ffff880219e58000
      [  103.940247] RBP: ffff8801e244bae0 R08: 0000000000000000 R09: 0000000000000000
      [  103.940247] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
      [  103.940247] R13: ffff8802199fcab0 R14: ffff8801fbac8ae0 R15: ffff8801fe3fdba8
      [  103.940247] FS:  0000000000000000(0000) GS:ffff88002f800000(0000) knlGS:0000000000000000
      [  103.940247] CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
      [  103.940247] CR2: 000000377fedc920 CR3: 00000002162cd000 CR4: 00000000000006f0
      [  103.940247] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [  103.940247] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      [  103.940247] Process mdt00_004 (pid: 4804, threadinfo ffff8801e244a000, task ffff8801e97a8580)
      [  103.940247] Stack:
      [  103.940247]  ffff8801e244bad0 00000000fffffff4 ffff8801fbac8ae0 ffff8801eda3c820
      [  103.940247] <d> ffff8801fe3b2c30 ffff8801fe3fdba8 ffff8801e244baf0 ffffffffa081f08d
      [  103.940247] <d> ffff8801e244bbb0 ffffffffa08097e9 ffffffffa0cb47ba ffff8801eda3a7b0
      [  103.940247] Call Trace:
      [  103.940247]  [<ffffffffa081f08d>] mdd_trans_stop+0x1d/0x20 [mdd]
      [  103.940247]  [<ffffffffa08097e9>] mdd_unlink+0x4b9/0xcc0 [mdd]
      [  103.940247]  [<ffffffffa0cb47ba>] ? mdt_reint_unlink+0x9ca/0x10b0 [mdt]
      [  103.940247]  [<ffffffffa0cab968>] mdo_unlink+0x18/0x50 [mdt]
      [  103.940247]  [<ffffffffa0cb47f4>] mdt_reint_unlink+0xa04/0x10b0 [mdt]
      [  103.940247]  [<ffffffffa0c8ee45>] ? mdt_ucred+0x15/0x20 [mdt]
      [  103.940247]  [<ffffffffa0cab701>] mdt_reint_rec+0x41/0xe0 [mdt]
      [  103.940247]  [<ffffffffa0c96a63>] mdt_reint_internal+0x4c3/0x7c0 [mdt]
      [  103.940247]  [<ffffffffa0c972eb>] mdt_reint+0x6b/0x120 [mdt]
      [  103.940247]  [<ffffffffa06e9675>] tgt_request_handle+0x245/0xad0 [ptlrpc]
      [  103.940247]  [<ffffffffa069c921>] ptlrpc_main+0xcf1/0x1870 [ptlrpc]
      [  103.940247]  [<ffffffffa069bc30>] ? ptlrpc_main+0x0/0x1870 [ptlrpc]
      [  103.940247]  [<ffffffff8109eab6>] kthread+0x96/0xa0
      [  103.940247]  [<ffffffff8100c30a>] child_rip+0xa/0x20
      [  103.940247]  [<ffffffff81554710>] ? _spin_unlock_irq+0x30/0x40
      [  103.940247]  [<ffffffff8100bb10>] ? restore_args+0x0/0x30
      [  103.940247]  [<ffffffff8109ea20>] ? kthread+0x0/0xa0
      [  103.940247]  [<ffffffff8100c300>] ? child_rip+0x0/0x20
      [  103.940247] Code: 00 bb 01 00 00 48 c7 05 eb b4 03 00 00 00 00 00 c7 05 d9 b4 03 00 01 00 00 00 e8 3c 17 59 ff e9 34 ff ff ff 49 8b 45 00 49 39 c5 <4c> 8b 38 74 4d 48 8b 70 f8 48 8b 46 40 48 8b 40 18 48 85 c0 0f
      [  103.940247] RIP  [<ffffffffa0d3d890>] lod_trans_stop+0x110/0x210 [lod]
      [  103.940247]  RSP <ffff8801e244bab0>
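
      The RAX value in the register dump is worth decoding. 0x6b is the byte the kernel's slab debugging writes over freed objects (POISON_FREE in include/linux/poison.h), so a pointer that reads as 6b6b6b6b6b6b6b6b was most likely loaded from freed memory. On x86-64 with 48-bit virtual addresses that value is also non-canonical, which is why dereferencing it raises a general protection fault rather than an ordinary page fault. The following is a minimal userspace sketch of that reasoning; struct list_head_model and the helper names are illustrative and not part of Lustre or the kernel.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define POISON_FREE 0x6b	/* same value as the kernel's POISON_FREE */

/* Illustrative stand-in for a list head embedded in a freed object. */
struct list_head_model {
	uint64_t next;
	uint64_t prev;
};

/* With 48-bit virtual addressing, an address is canonical only if
 * bits 63..47 are all zeros or all ones. */
static int is_canonical_48(uint64_t addr)
{
	uint64_t top = addr >> 47;	/* the 17 bits 63..47 */

	return top == 0 || top == 0x1ffff;
}

/* Fill the object with the poison byte, as slab debugging does on free,
 * then read the "next" pointer back out of it. */
static uint64_t poisoned_next(void)
{
	struct list_head_model head;

	memset(&head, POISON_FREE, sizeof(head));
	return head.next;
}

/* Print the poisoned pointer and check that it cannot be dereferenced. */
static void demo_poison(void)
{
	uint64_t next = poisoned_next();

	printf("next = %016llx canonical = %d\n",
	       (unsigned long long)next, is_canonical_48(next));
	assert(!is_canonical_48(next));
}
```

      Calling demo_poison() from a main() prints the same 6b6b6b6b6b6b6b6b pattern seen in RAX and reports it as non-canonical.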
      

      The fault is in the list_for_each_entry_safe() loop. It seems likely that there were no updates in the list and that the first dt_trans_stop() freed the thandle.

      static int lod_trans_stop(const struct lu_env *env, struct dt_device *dt,
                                struct thandle *th)
      {
              struct thandle_update           *tu = th->th_update;
              struct dt_update_request        *update;
              struct dt_update_request        *tmp;
              int                             rc2 = 0;
              int                             rc;
              ENTRY;
      
              CERROR("dt = %p, th = %p\n", dt, th);
      
              rc = dt_trans_stop(env, th->th_dev, th);
              if (likely(tu == NULL))
                      RETURN(rc);
      
              list_for_each_entry_safe(update, tmp,
                                       &tu->tu_remote_update_list,
                                       dur_list) {
                      /* update will be freed inside dt_trans_stop */
                      rc2 = dt_trans_stop(env, update->dur_dt, th);
                      if (unlikely(rc2 != 0 && rc == 0))
                              rc = rc2;
              }
      
              RETURN(rc);
      }
      

      This was found via memory allocation fault injection.
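
      The hazard described above (a stop call frees the handle, and the caller then walks a list that hangs off that handle) can be modelled in userspace. The sketch below is not the fix that landed for this ticket and uses no Lustre types: struct handle, struct update_set, struct node and all function names are hypothetical stand-ins. It shows one defensive ordering, assuming the caller owns the remote entries: detach the list from the handle before the call that may free it, and read each link before releasing the entry.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-ins; none of these are Lustre structures. */
struct node {			/* plays the role of one remote update */
	struct node *next;
};

struct update_set {		/* plays the role of th->th_update */
	struct node *remote;	/* singly linked list of remote updates */
};

struct handle {			/* plays the role of struct thandle */
	struct update_set *updates;
};

/* Build a handle with nremote entries on its remote list. */
static struct handle *make_handle(int nremote)
{
	struct handle *h = calloc(1, sizeof(*h));

	assert(h != NULL);
	h->updates = calloc(1, sizeof(*h->updates));
	assert(h->updates != NULL);
	while (nremote-- > 0) {
		struct node *n = calloc(1, sizeof(*n));

		assert(n != NULL);
		n->next = h->updates->remote;
		h->updates->remote = n;
	}
	return h;
}

/* Models a stop call that releases the handle and what hangs off it. */
static void stop_local(struct handle *h)
{
	free(h->updates);	/* free(NULL) is a no-op */
	free(h);
}

/* Stop the local part first, then every remote entry.
 * Returns the number of stop operations performed. */
static int stop_all(struct handle *h)
{
	struct node *n = NULL;
	struct node *next;
	int stops = 0;

	if (h->updates != NULL) {
		n = h->updates->remote;		/* detach before h can go away */
		h->updates->remote = NULL;
	}
	stop_local(h);		/* h and h->updates are invalid from here on */
	stops++;

	for (; n != NULL; n = next) {
		next = n->next;	/* read the link before the entry is freed */
		free(n);
		stops++;
	}
	return stops;
}
```

      With an empty remote list, stop_all(make_handle(0)) performs only the local stop and never touches the freed handle, which is the case suspected in this report.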

            People

              Assignee: Di Wang (di.wang)
              Reporter: John Hammond (jhammond)