
LU-4515: Test failure sanity-quota test_34: Used space for user 60000 is 0, expected 2048

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.6.0
    • Affects Version/s: Lustre 2.6.0
    • Labels: None
    • Severity: 3
    • 12354

    Description

      This issue was created by maloo for Nathaniel Clark <nathaniel.l.clark@intel.com>

      This issue relates to the following test suite runs:
      http://maloo.whamcloud.com/test_sets/0c69b0d8-7e9d-11e3-b6d1-52540035b04c
      https://maloo.whamcloud.com/test_sets/e355e8f2-8148-11e3-81ba-52540035b04c

      The sub-test test_34 failed with the following error:

      Used space for user 60000 is 0, expected 2048

      Info required for matching: sanity-quota 34
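
      For reference, the quantity being checked is the block usage that quota accounting reports for UID 60000. Outside the test framework it can be read with lfs quota; the following is a minimal sketch, assuming a client mount at /mnt/lustre (the path and the bare numeric UID are illustrative, not taken from the test script):

          # Minimal sketch, not the actual test_34 body: report the accounted
          # block usage (in KB) for UID 60000 on an assumed client mount point.
          MNT=/mnt/lustre
          used=$(lfs quota -q -u 60000 "$MNT" | awk 'NR==1 { print $2 }')
          echo "Used space for user 60000 is ${used} KB"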

      Attachments

      Issue Links

      Activity
            niu Niu Yawei (Inactive) added a comment - http://review.whamcloud.com/10957


            jlevi Jodi Levi (Inactive) added a comment - Patch landed to Master. Please reopen ticket if more work is needed.


            adilger Andreas Dilger added a comment - Niu, we still need a patch to remove test_34 from the sanity-quota.sh ALWAYS_EXCEPT list. Please just delete that ALWAYS_EXCEPT line, since these exceptions should be added at the top of the script where ALWAYS_EXCEPT is first defined.
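
            For context, skipping a test in the Lustre test scripts is done by listing its number in ALWAYS_EXCEPT near the top of the script. A hedged sketch of that convention (variable name as commonly used in sanity-quota.sh, not quoted from the tree):

                # Hedged sketch of the convention at the top of sanity-quota.sh.
                # Re-enabling test_34 means removing "34" from this list (or deleting
                # any later line that appends it), not adding a second assignment.
                ALWAYS_EXCEPT="$SANITY_QUOTA_EXCEPT 34"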


            simmonsja James A Simmons added a comment - The revert helped, but a module loading race condition still exists when both client and server are on the same node. A patch exists to fix that issue at http://review.whamcloud.com/#/c/9038.

            niu Niu Yawei (Inactive) added a comment -

            I saw this in the MDS log before sanity-quota test_34:

            Lustre: DEBUG MARKER: mkdir -p /mnt/mds1; mount -t lustre -o user_xattr,acl  		                   lustre-mdt1/mdt1 /mnt/mds1
            ------------[ cut here ]------------
            WARNING: at fs/proc/generic.c:590 proc_register+0x129/0x220() (Tainted: P        W  ---------------   )
            Hardware name: KVM
            proc_dir_entry 'lustre/osc' already registered
            Modules linked in: osp(U) mdd(U) lfsck(U) lod(U) mdt(U) mgs(U) mgc(U) nodemap(U) osd_zfs(U) lquota(U) lustre(U) lov(U) mdc(U) fid(U) fld(U) ksocklnd(U) ptlrpc(U) obdclass(U) lnet(U) sha512_generic sha256_generic libcfs(U) nfsd exportfs autofs4 nfs lockd fscache auth_rpcgss nfs_acl sunrpc ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 ib_sa ib_mad ib_core zfs(P)(U) zcommon(P)(U) znvpair(P)(U) zavl(P)(U) zunicode(P)(U) spl(U) zlib_deflate microcode virtio_balloon 8139too 8139cp mii i2c_piix4 i2c_core ext3 jbd mbcache virtio_blk virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: speedstep_lib]
            Pid: 26088, comm: llog_process_th Tainted: P        W  ---------------    2.6.32-358.23.2.el6_lustre.ge80a1ca.x86_64 #1
            Call Trace:
             [<ffffffff8106e3e7>] ? warn_slowpath_common+0x87/0xc0
             [<ffffffff8106e4d6>] ? warn_slowpath_fmt+0x46/0x50
             [<ffffffff811f0299>] ? proc_register+0x129/0x220
             [<ffffffff811f05c2>] ? proc_mkdir_mode+0x42/0x60
             [<ffffffff811f05f6>] ? proc_mkdir+0x16/0x20
             [<ffffffffa0718c00>] ? lprocfs_seq_register+0x20/0x80 [obdclass]
             [<ffffffffa070a203>] ? class_search_type+0x53/0xa0 [obdclass]
             [<ffffffffa1098f0a>] ? osp_lprocfs_init+0x30a/0x530 [osp]
             [<ffffffffa1085c3a>] ? osp_device_alloc+0xfaa/0x29e0 [osp]
             [<ffffffffa0746263>] ? lu_context_init+0xa3/0x240 [obdclass]
             [<ffffffffa072ce2f>] ? obd_setup+0x1bf/0x290 [obdclass]
             [<ffffffffa072d108>] ? class_setup+0x208/0x870 [obdclass]
             [<ffffffffa073567c>] ? class_process_config+0xc6c/0x1ad0 [obdclass]
             [<ffffffffa072e263>] ? lustre_cfg_new+0x2d3/0x6e0 [obdclass]
             [<ffffffffa07375fb>] ? class_config_llog_handler+0xa2b/0x1880 [obdclass]
             [<ffffffff8150faee>] ? mutex_lock+0x1e/0x50
             [<ffffffffa06f3d5b>] ? llog_process_thread+0x8ab/0xd60 [obdclass]
             [<ffffffffa0741e4f>] ? keys_fill+0x6f/0x190 [obdclass]
             [<ffffffffa06f4255>] ? llog_process_thread_daemonize+0x45/0x70 [obdclass]
             [<ffffffffa06f4210>] ? llog_process_thread_daemonize+0x0/0x70 [obdclass]
             [<ffffffff81096a36>] ? kthread+0x96/0xa0
             [<ffffffff8100c0ca>] ? child_rip+0xa/0x20
             [<ffffffff810969a0>] ? kthread+0x0/0xa0
             [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
            ---[ end trace f9bd04d09a084708 ]---
            

            It looks like something is wrong with the osp proc files, which leaves wait_delete_completed() unable to do its job.

            This was most likely introduced by LU-3319. Given that the patch "LU-3319 procfs: move osp proc handling to seq_files" (which is exactly what caused this problem) has been reverted, I think we can add test_34 back now.
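
            For reference, wait_delete_completed in the test framework works by polling the MDT-side OSP sync counters under /proc until they drain, so if the osp proc entries fail to register (as in the warning above) the wait has nothing to watch. A rough sketch of that kind of poll, with the parameter names and loop structure as assumptions rather than the actual framework code:

                wait_osp_sync_idle() {
                    # Rough sketch, not the framework implementation; meant to run
                    # on the MDS node.  Parameter names are assumptions.
                    local max_wait=${1:-120} elapsed=0 pending
                    while [ "$elapsed" -lt "$max_wait" ]; do
                        pending=$(lctl get_param -n osp.*.sync_changes osp.*.sync_in_flight \
                                      2>/dev/null | awk '{ sum += $1 } END { print sum + 0 }')
                        # Nothing queued and nothing in flight: all pending setattr/destroy
                        # requests have reached the OSTs.
                        [ "$pending" -eq 0 ] && return 0
                        sleep 1
                        elapsed=$((elapsed + 1))
                    done
                    return 1
                }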



            utopiabound Nathaniel Clark added a comment - Jodi, how has this test been disabled? Could this be related to issues with TEI-1403?


            jlevi Jodi Levi (Inactive) added a comment - Test has been disabled. Will re-enable once fix is in place.


            bogl Bob Glossman (Inactive) added a comment - As Johann said, the test depends on wait_delete_completed operating correctly, and wait_delete_completed relies on lproc stats maintained in osp_sync.c. I see http://review.whamcloud.com/7158, which touches osp code, landed at just about the time these errors started. I don't see anything wrong in that change, but it does modify the close neighborhood, and the coincidence in time seems suspicious.


            johann Johann Lombardi (Inactive) added a comment - Bobijam, since 2.4, it is the MDT that is responsible for sending the OST_SETATTR RPCs to change ownership of objects on the OSTs. This is done through OSP, and that is why the test calls wait_delete_completed to wait for OSP to flush all pending requests.
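
            In other words, the client-visible chown completes as soon as the MDT inode is updated, while the OST objects (and therefore the OST-side quota accounting) only change owner once OSP has pushed the OST_SETATTR requests. An illustrative sequence, with hypothetical paths:

                chown 60000 /mnt/lustre/d34/f34      # MDT inode updated immediately
                lfs quota -q -u 60000 /mnt/lustre    # may still show 0 KB used here
                # ... wait for the MDT's OSP queues to drain (wait_delete_completed) ...
                lfs quota -q -u 60000 /mnt/lustre    # now reflects the OST objects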
            bobijam Zhenyu Xu added a comment - edited

            Looking at the client's ll_setattr_raw(): after setting attributes on the MDS, the client only sets OST attributes for size and time changes; it does nothing about a uid change on the OST.

                    if (attr->ia_valid & (ATTR_SIZE |
                                          ATTR_ATIME | ATTR_ATIME_SET |
                                          ATTR_MTIME | ATTR_MTIME_SET)) {
                            /* For truncate and utimes sending attributes to OSTs, setting
                             * mtime/atime to the past will be performed under PW [0:EOF]
                             * extent lock (new_size:EOF for truncate).  It may seem
                             * excessive to send mtime/atime updates to OSTs when not
                             * setting times to past, but it is necessary due to possible
                             * time de-synchronization between MDT inode and OST objects */
                            rc = ll_setattr_ost(inode, attr);
                    }
            

            I verified this behavior on an ldiskfs backend, and the test passed there. From what has been observed, the failure likely relates to the quota implementation with the zfs backend.
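
            One way to confirm the client-side half of this from the outside is to watch the client's OSC RPC counters around a chown; a hedged sketch, with the stats parameter and counter names assumed to follow the usual osc.*.stats layout rather than checked against this exact build:

                lctl set_param osc.*.stats=clear           # reset client OSC RPC counters
                chown 60000 /mnt/lustre/testfile           # assumed existing striped file
                lctl get_param osc.*.stats | grep -i setattr
                # An empty result here matches the ll_setattr_raw() path quoted above:
                # the uid change goes only to the MDT and reaches the OSTs later via OSP.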



            johann Johann Lombardi (Inactive) added a comment - The lack of OST_SETATTR requests is weird. The test relies on wait_delete_completed to wait for the setattr RPCs to be sent; is it possible that this function no longer works for setattr?

            People

              Assignee: niu Niu Yawei (Inactive)
              Reporter: maloo Maloo
              Votes: 0
              Watchers: 11
