[LU-3160] recovery-random-scale test_fail_client_mds: RIP: cl_object_top+0xe/0x150 [obdclass]

Details


    Description

      While running recovery-random-scale (failing one random client and then failing the MDS), the dd operation on the other live client (wtm-4vm5) hung and the client crashed:

      2013-04-11 03:01:35: dd run starting
      + mkdir -p /mnt/lustre/d0.dd-wtm-4vm5.rosso.whamcloud.com
      + /usr/bin/lfs setstripe -c -1 /mnt/lustre/d0.dd-wtm-4vm5.rosso.whamcloud.com
      + cd /mnt/lustre/d0.dd-wtm-4vm5.rosso.whamcloud.com
      ++ /usr/bin/lfs df /mnt/lustre/d0.dd-wtm-4vm5.rosso.whamcloud.com
      + FREE_SPACE=97381848
      + BLKS=21910915
      + echo 'Free disk space is 97381848, 4k blocks to dd is 21910915'
      + load_pid=2634
      + wait 2634
      + dd bs=4k count=21910915 status=noxfer if=/dev/zero of=/mnt/lustre/d0.dd-wtm-4vm5.rosso.whamcloud.com/dd-file
      

      Console log on wtm-4vm5:

      BUG: unable to handle kernel NULL pointer dereference at 0000000000000098
      IP: [<ffffffffa05b4b4e>] cl_object_top+0xe/0x150 [obdclass]
      PGD 7b01e067 PUD 7b016067 PMD 0
      Oops: 0000 [#1] SMP
      last sysfs file: /sys/devices/system/cpu/possible
      CPU 0
      Modules linked in: lmv(U) mgc(U) lustre(U) lov(U) osc(U) mdc(U) fid(U) fld(U) ksocklnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) sha512_generic sha256_generic libcfs(U) nfsd exportfs autofs4 nfs lockd fscache nfs_acl auth_rpcgss sunrpc ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 ib_sa ib_mad ib_core microcode virtio_balloon 8139too 8139cp mii i2c_piix4 i2c_core ext3 jbd mbcache virtio_blk virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: speedstep_lib]
      
      Pid: 2683, comm: flush-lustre-1 Not tainted 2.6.32-279.19.1.el6.x86_64 #1 Red Hat KVM
      RIP: 0010:[<ffffffffa05b4b4e>]  [<ffffffffa05b4b4e>] cl_object_top+0xe/0x150 [obdclass]
      RSP: 0018:ffff88004c3c5980  EFLAGS: 00010282
      RAX: ffff88007bc75800 RBX: ffff88007d1c21e8 RCX: 0000000000000098
      RDX: ffff88003e2bb200 RSI: ffffffffa0602400 RDI: 0000000000000098
      RBP: ffff88004c3c5990 R08: 0000000000000001 R09: 0000000000000000
      R10: 000000000000000f R11: 000000000000000f R12: ffff88007d1bc3d0
      R13: 0000000000000004 R14: 0000000000000098 R15: ffff88007bc75800
      FS:  0000000000000000(0000) GS:ffff880002200000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
      CR2: 0000000000000098 CR3: 000000007b02d000 CR4: 00000000000006f0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      Process flush-lustre-1 (pid: 2683, threadinfo ffff88004c3c4000, task ffff88007b730080)
      Stack:
       ffff88004c3c5990 ffff88007d1c21e8 ffff88004c3c59d0 ffffffffa05c575d
      <d> 0000000000000000 ffff8800430e3c00 ffff88007d1bc378 0000000000000000
      <d> ffff88007d1bf768 ffff88007b020e80 ffff88004c3c5a30 ffffffffa09d3488
      Call Trace:
       [<ffffffffa05c575d>] cl_io_sub_init+0x3d/0xc0 [obdclass]
       [<ffffffffa09d3488>] lov_sub_get+0x218/0x690 [lov]
       [<ffffffffa09d5116>] lov_io_iter_init+0xd6/0x480 [lov]
       [<ffffffffa05c279d>] cl_io_iter_init+0x5d/0x110 [obdclass]
       [<ffffffffa05c6d3c>] cl_io_loop+0x4c/0x1b0 [obdclass]
       [<ffffffffa0a5233b>] cl_sync_file_range+0x2fb/0x4e0 [lustre]
       [<ffffffffa0a7ba7f>] ll_writepages+0x6f/0x1a0 [lustre]
       [<ffffffff811255d1>] do_writepages+0x21/0x40
       [<ffffffff8119fe8d>] writeback_single_inode+0xdd/0x290
       [<ffffffff811a029e>] writeback_sb_inodes+0xce/0x180
       [<ffffffff811a03fb>] writeback_inodes_wb+0xab/0x1b0
       [<ffffffff811a079b>] wb_writeback+0x29b/0x3f0
       [<ffffffff814e9c50>] ? thread_return+0x4e/0x76e
       [<ffffffff8107d572>] ? del_timer_sync+0x22/0x30
       [<ffffffff811a0a89>] wb_do_writeback+0x199/0x240
       [<ffffffff811a0b93>] bdi_writeback_task+0x63/0x1b0
       [<ffffffff81090857>] ? bit_waitqueue+0x17/0xd0
       [<ffffffff81134170>] ? bdi_start_fn+0x0/0x100
       [<ffffffff811341f6>] bdi_start_fn+0x86/0x100
       [<ffffffff81134170>] ? bdi_start_fn+0x0/0x100
       [<ffffffff81090626>] kthread+0x96/0xa0
       [<ffffffff8100c0ca>] child_rip+0xa/0x20
       [<ffffffff81090590>] ? kthread+0x0/0xa0
       [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
      Code: 04 00 00 00 04 00 e8 52 b7 e8 ff 48 c7 c7 60 2b 60 a0 e8 16 b3 e7 ff 66 0f 1f 44 00 00 55 48 89 e5 53 48 83 ec 08 0f 1f 44 00 00 <48> 8b 07 0f 1f 80 00 00 00 00 48 89 c2 48 8b 80 88 00 00 00 48
      RIP  [<ffffffffa05b4b4e>] cl_object_top+0xe/0x150 [obdclass]
       RSP <ffff88004c3c5980>
      

      Maloo report: https://maloo.whamcloud.com/test_sets/c1c906c6-a294-11e2-81ba-52540035b04c

      The console logs in the above Maloo report were not gathered completely due to TT-1107. Please refer to the attachment for the full console logs.

      Attachments

        Activity

          pjones Peter Jones added a comment -

          Landed for 2.4


          jay Jinshan Xiong (Inactive) added a comment -

          This should be a problem with force umount: it causes a failure when refreshing the layout. I have updated the patch; please take a look.

          jay Jinshan Xiong (Inactive) added a comment - edited

          Do you know why we didn't flush dirty data on layout lock revocation in the beginning? Thanks.

          We don't want to introduce a cascading problem: if one of the OSTs is inaccessible, the client will be evicted by the MDT because it can't cancel the layout lock in time.

          If what you said were true, we should have seen this problem in migration, because we're using an orphan object to restripe. I will take a look at this.

          niu Niu Yawei (Inactive) added a comment -

          It looks like we chose to flush dirty data when changing the layout but not on layout lock revocation. I'm not sure that was the best choice, but the problem can be worked around another way: ignore the -ENOENT error on layout refresh for IO.
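
          For illustration only, a minimal userspace sketch of the error-handling pattern proposed above (this is not Lustre code; refresh_layout() and flush_dirty_pages() are made-up stand-ins): if the layout refresh fails with -ENOENT because the file was already unlinked on the MDS, there is nothing left worth flushing, so the writeback path treats it as a no-op instead of failing or asserting.

          #include <errno.h>
          #include <stdio.h>

          /* Stand-in for fetching the layout from the server; here it always
           * reports that the file is gone, as after an unlink on the MDS. */
          static int refresh_layout(void)
          {
                  return -ENOENT;
          }

          /* Stand-in for the writeback path: -ENOENT from the layout refresh
           * means the file no longer exists, so drop the dirty data and report
           * success instead of failing (or asserting in) the flush. */
          static int flush_dirty_pages(void)
          {
                  int rc = refresh_layout();

                  if (rc == -ENOENT) {
                          printf("layout gone (-ENOENT): discarding dirty pages\n");
                          return 0;
                  }
                  return rc;
          }

          int main(void)
          {
                  return flush_dirty_pages() ? 1 : 0;
          }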

          niu Niu Yawei (Inactive) added a comment -

          Hi, Xiong

          Why don't we flush dirty data before losing the layout lock (see ll_md_blocking_ast())?

          This patch caused an LBUG when running conf-sanity test_0:

          • the client writes data to a file, then unlinks it;
          • the layout lock is revoked, but the dirty data is not flushed;
          • on client umount, the kernel tries to flush the dirty data back by calling ll_writepages(), which now verifies the layout (what this patch added), but the layout fetch will definitely fail because the file has already been removed on the MDS;
          • the dirty data is never flushed back in the end, and the LBUG is triggered.

          Do you know why we didn't flush dirty data on layout lock revocation in the beginning? Thanks.

          niu Niu Yawei (Inactive) added a comment -

          It looks like the patch revealed another problem: conf-sanity test_0 will always hit an LBUG (see LU-3230). I'm looking into it.

          niu Niu Yawei (Inactive) added a comment -

          http://review.whamcloud.com/6154

          jay Jinshan Xiong (Inactive) added a comment -

          Yes, this is exactly what we're doing. However, there are some cases where we need to start an IO at the OSC layer from a context that may already hold conf_lock, so taking the layout lock again could deadlock. ci_ignore_layout was introduced to indicate whether the IO needs to hold the layout lock to start. Usually, if the IO operates on pages we're pretty safe, because a layout change has to clean up the pages first.

          For the cl_sync_file_range() case, I set ci_ignore_layout to 1 because I thought that if dirty pages existed, the layout would not be changed. Obviously I was wrong, because there is a race.

          For this specific problem, it can be fixed by setting ci_ignore_layout to 0. Please also check the other callers of cl_sync_file_range() to make sure this holds.
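
          As an aside, a minimal userspace sketch of the deadlock concern described above (not Lustre code: conf_lock here is just a pthread rwlock, and toy_io/ignore_layout are made-up analogues of the per-IO flag): an IO started from a context that already holds the configuration lock must skip re-taking it, which is what a flag like this allows.

          #include <pthread.h>
          #include <stdio.h>

          static pthread_rwlock_t conf_lock = PTHREAD_RWLOCK_INITIALIZER;

          struct toy_io {
                  int ignore_layout;      /* 1: caller already holds conf_lock */
          };

          static void io_start(struct toy_io *io)
          {
                  if (!io->ignore_layout)
                          pthread_rwlock_rdlock(&conf_lock);      /* normal IO path */

                  printf("io running, ignore_layout=%d\n", io->ignore_layout);

                  if (!io->ignore_layout)
                          pthread_rwlock_unlock(&conf_lock);
          }

          /* A layout change holds the lock for write and then starts a nested IO;
           * re-taking this non-recursive lock from the same thread could
           * self-deadlock, so the nested IO sets the flag instead. */
          static void layout_change(void)
          {
                  struct toy_io nested = { .ignore_layout = 1 };

                  pthread_rwlock_wrlock(&conf_lock);
                  io_start(&nested);
                  pthread_rwlock_unlock(&conf_lock);
          }

          int main(void)
          {
                  struct toy_io normal = { .ignore_layout = 0 };

                  io_start(&normal);
                  layout_change();
                  return 0;
          }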

          niu Niu Yawei (Inactive) added a comment -

          Hi, Xiong

          I don't see why we didn't use lo_type_guard to protect the layout in the first place; it looks simpler and safer than the current way (active_ios & waitq).

          • Take the read lock for IO: call lov_conf_freeze() in lov_io_init_raid0() and lov_conf_thaw() in lov_io_fini();
          • Take the write lock for a layout change: call lov_conf_lock() & lov_conf_unlock() in lov_layout_change() to make the "layout delete -> layout reinstall" sequence atomic.

          Did I miss anything?
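
          For reference, a minimal userspace sketch of the rwlock scheme proposed in the comment above (not Lustre code; type_guard, io_thread() and layout_change() are only analogues of lo_type_guard and the lov_conf_*() calls): IO holds the lock for read across its lifetime, and a layout change holds it for write so the "layout delete -> layout reinstall" step is atomic with respect to running IOs.

          #include <pthread.h>
          #include <stdio.h>

          static pthread_rwlock_t type_guard = PTHREAD_RWLOCK_INITIALIZER;
          static const char *layout = "raid0, 3 stripes";

          /* IO path: analogue of lov_conf_freeze() in lov_io_init_raid0() and
           * lov_conf_thaw() in lov_io_fini(). While the read lock is held the
           * layout can neither change nor disappear underneath the IO. */
          static void *io_thread(void *arg)
          {
                  (void)arg;
                  for (int i = 0; i < 5; i++) {
                          pthread_rwlock_rdlock(&type_guard);
                          printf("io %d sees layout: %s\n", i, layout);
                          pthread_rwlock_unlock(&type_guard);
                  }
                  return NULL;
          }

          /* Layout change: analogue of lov_conf_lock()/lov_conf_unlock() in
           * lov_layout_change(); readers never observe the intermediate
           * "deleted" state. */
          static void layout_change(const char *new_layout)
          {
                  pthread_rwlock_wrlock(&type_guard);
                  layout = NULL;          /* "layout delete" ... */
                  layout = new_layout;    /* ... "layout reinstall" */
                  pthread_rwlock_unlock(&type_guard);
          }

          int main(void)
          {
                  pthread_t t;

                  pthread_create(&t, NULL, io_thread, NULL);
                  layout_change("raid0, 4 stripes");
                  pthread_join(t, NULL);
                  return 0;
          }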

          People

            Assignee:
            niu Niu Yawei (Inactive)
            Reporter:
            yujian Jian Yu
            Votes:
            0
            Watchers:
            12
