Lustre / LU-13670

sanity-quota test_6: FAIL: dd not finished in 240 secs

Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Minor
    • Affects Version/s: Lustre 2.16.0, Lustre 2.15.6
    • Severity: 3

    Description

      This issue was created by maloo for S Buisson <sbuisson@ddn.com>

      This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/37495c46-a593-48c0-a5d8-28bc0e342fc3

      test_6 failed with the following error:

      dd not finished in 240 secs
      

      This happened with ZFS backend.
      In the OSS dmesg, we can see complaints about lost connection to the MDS:

      [ 3066.737564] Lustre: 17933:0:(client.c:2245:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1591889766/real 1591889766]  req@ffff8c0c6e17f180 x1669214324881792/t0(0) o601->lustre-MDT0000-lwp-OST0001@10.9.6.237@tcp:23/10 lens 336/336 e 0 to 1 dl 1591889773 ref 2 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'ll_ost_io00_005.0'
      [ 3066.742583] Lustre: lustre-MDT0000-lwp-OST0001: Connection to lustre-MDT0000 (at 10.9.6.237@tcp) was lost; in progress operations using this service will wait for recovery to complete
      [ 3075.071151] Lustre: 17933:0:(client.c:2245:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1591889774/real 1591889774]  req@ffff8c0c6e17df80 x1669214324882240/t0(0) o601->lustre-MDT0000-lwp-OST0001@10.9.6.237@tcp:23/10 lens 336/336 e 0 to 1 dl 1591889781 ref 2 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'ll_ost_io00_005.0'
      [ 3075.076488] Lustre: lustre-MDT0000-lwp-OST0001: Connection to lustre-MDT0000 (at 10.9.6.237@tcp) was lost; in progress operations using this service will wait for recovery to complete
      [ 3083.079772] Lustre: 17933:0:(client.c:2245:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1591889782/real 1591889782]  req@ffff8c0c6e17c480 x1669214324882752/t0(0) o601->lustre-MDT0000-lwp-OST0001@10.9.6.237@tcp:23/10 lens 336/336 e 0 to 1 dl 1591889789 ref 2 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'ll_ost_io00_005.0'
      [ 3083.084896] Lustre: lustre-MDT0000-lwp-OST0001: Connection to lustre-MDT0000 (at 10.9.6.237@tcp) was lost; in progress operations using this service will wait for recovery to complete
      [ 3091.610366] Lustre: 17933:0:(client.c:2245:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1591889791/real 1591889791]  req@ffff8c0c7be6b180 x1669214324883264/t0(0) o601->lustre-MDT0000-lwp-OST0001@10.9.6.237@tcp:23/10 lens 336/336 e 0 to 1 dl 1591889798 ref 2 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'ll_ost_io00_005.0'
      [ 3091.615435] Lustre: lustre-MDT0000-lwp-OST0001: Connection to lustre-MDT0000 (at 10.9.6.237@tcp) was lost; in progress operations using this service will wait for recovery to complete
      [ 3098.919547] Lustre: DEBUG MARKER: lctl set_param at_max=600
      [ 3099.698478] Lustre: DEBUG MARKER: dmesg
      [ 3101.618918] Lustre: 17933:0:(client.c:2245:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1591889801/real 1591889801]  req@ffff8c0c5d30fa80 x1669214324883776/t0(0) o601->lustre-MDT0000-lwp-OST0001@10.9.6.237@tcp:23/10 lens 336/336 e 0 to 1 dl 1591889808 ref 2 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'ll_ost_io00_005.0'
      [ 3101.623866] Lustre: lustre-MDT0000-lwp-OST0001: Connection to lustre-MDT0000 (at 10.9.6.237@tcp) was lost; in progress operations using this service will wait for recovery to complete
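      For reference, a minimal sketch of how this subtest is typically rerun on its own, assuming a standard lustre/tests environment (FSTYPE and ONLY are the usual test-framework knobs, not something specific to this report):

      # from a configured Lustre test node
      cd lustre/tests
      FSTYPE=zfs ONLY=6 bash sanity-quota.sh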
      

      VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
      sanity-quota test_6 - dd not finished in 240 secs

          Activity

            emoly.liu Emoly Liu added a comment - +1 on master: https://testing.whamcloud.com/test_sets/faeae133-7c45-448e-8ad9-2206c19288b1
            stancheff Shaun Tancheff added a comment - edited

            This looks to be a related failure with a slightly different signature, from: https://testing.whamcloud.com/test_sets/39097f0a-1e73-427d-bb04-5d47b09bbf36

            [10368.372670] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n osd-zfs.lustre-OST0007.quota_slave.enabled
            [10413.491561] Lustre: ll_ost00_008: service thread pid 156469 was inactive for 43.952 seconds. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
            [10413.494794] Pid: 156469, comm: ll_ost00_008 4.18.0-553.27.1.el8_lustre.x86_64 #1 SMP Mon Nov 18 21:14:09 UTC 2024
            [10413.496545] Call Trace TBD:
            [10413.497215] [<0>] cv_wait_common+0xaf/0x130 [spl]
            [10413.498111] [<0>] txg_wait_synced_impl+0xc6/0x110 [zfs]
            [10413.499426] [<0>] txg_wait_synced+0xc/0x40 [zfs]
            [10413.500345] [<0>] osd_trans_stop+0x515/0x550 [osd_zfs]
            [10413.501338] [<0>] ofd_attr_set+0xaf0/0x10e0 [ofd]
            [10413.502245] [<0>] ofd_setattr_hdl+0x4b7/0x8f0 [ofd]
            [10413.503132] [<0>] tgt_request_handle+0x3f4/0x1a30 [ptlrpc]
            [10413.504585] [<0>] ptlrpc_server_handle_request+0x2aa/0xcf0 [ptlrpc]
            [10413.505753] [<0>] ptlrpc_main+0xc9e/0x15c0 [ptlrpc]
            [10413.506695] [<0>] kthread+0x134/0x150
            [10413.507392] [<0>] ret_from_fork+0x35/0x40
            [10427.005026] Lustre: ll_ost00_008: service thread pid 156469 completed after 57.466s. This likely indicates the system was overloaded (too many service threads, or not enough hardware resources).
            [10432.477003] Lustre: DEBUG MARKER: lctl get_param -n at_max
            [10432.778084] Lustre: DEBUG MARKER: lctl set_param at_max=20
            [10487.850642] Lustre: DEBUG MARKER: lctl set_param -n osd*.*OS*.force_sync=1
            [10488.495676] Lustre: DEBUG MARKER: /usr/sbin/lctl set_param osd-*.lustre-OST*.quota_slave.timeout=10
            [10515.378550] Lustre: 181682:0:(service.c:1456:ptlrpc_at_send_early_reply()) @@@ Could not add any time (5/5), not sending early reply  req@ffff9337b6921380 x1817656977237248/t0(0) o4->1ded8de4-41c7-46ed-be64-b6d256d2b6e0@10.240.27.60@tcp:130/0 lens 488/448 e 1 to 0 dl 1733460500 ref 2 fl Interpret:/200/0 rc 0/0 job:'dd.60000' uid:60000 gid:60000
            [10516.402536] Lustre: 109211:0:(client.c:2364:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1733460480/real 1733460480]  req@ffff9337b6923400 x1817654230660736/t0(0) o601->lustre-MDT0000-lwp-OST0003@10.240.27.69@tcp:23/10 lens 336/336 e 0 to 1 dl 1733460496 ref 2 fl Rpc:XNQr/200/ffffffff rc 0/-1 job:'ll_ost_io00_001.0' uid:0 gid:0
            [10516.407923] Lustre: lustre-MDT0000-lwp-OST0003: Connection to lustre-MDT0000 (at 10.240.27.69@tcp) was lost; in progress operations using this service will wait for recovery to complete
            [10516.410677] Lustre: Skipped 7 previous similar messages
            [10516.415472] Lustre: lustre-MDT0000-lwp-OST0003: Connection restored to 10.240.27.69@tcp (at 10.240.27.69@tcp)
            [10516.417211] Lustre: Skipped 8 previous similar messages
            [10532.274379] Lustre: 109211:0:(client.c:2364:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1733460496/real 1733460496]  req@ffff9337a0f9cd00 x1817654230677760/t0(0) o601->lustre-MDT0000-lwp-OST0003@10.240.27.69@tcp:23/10 lens 336/336 e 0 to 1 dl 1733460512 ref 2 fl Rpc:XNQr/200/ffffffff rc 0/-1 job:'ll_ost_io00_001.0' uid:0 gid:0
            [10540.371558] Lustre: DEBUG MARKER: lctl set_param at_max=600
            [10541.022332] Lustre: DEBUG MARKER: dmesg
            [10541.386020] Lustre: DEBUG MARKER: /usr/sbin/lctl mark  sanity-quota test_6: @@@@@@ FAIL: [10413.491561] Lustre: ll_ost00_008: service thread pid 156469 was inactive for 43.952 seconds. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for de
            [10541.561909] Lustre: DEBUG MARKER: sanity-quota test_6: @@@@@@ FAIL: [10413.491561] Lustre: ll_ost00_008: service thread pid 156469 was inactive for 43.952 seconds. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
            [10541.798922] Lustre: DEBUG MARKER: /usr/sbin/lctl dk > /autotest/autotest-2/2024-12-06/lustre-reviews_review-dne-zfs-part-4_109434_15_58bdf9be-0c10-406b-b4e5-4a4e7607fd60//sanity-quota.test_6.debug_log.$(hostname -s).1733460521.log;
                           		dmesg > /autotest/autotest-2/2024-12-06/lustre-reviews_
            

            scherementsev Sergey Cheremencev added a comment -

            From https://testing.whamcloud.com/test_sets/8102d3fc-f790-4053-9e6b-51923ad0e251 :

            serega@Sergeys-MacBook-Pro lustre-release % grep "qsd_op_begin0.*op_begin" ~/Downloads/sanity-quota.test_6.debug_log.onyx-59vm6.1731512940.log 
            00040000:04000000:0.0:1731512927.747059:0:232570:0:(qsd_handler.c:743:qsd_op_begin0()) $$$ op_begin space:12  qsd:lustre-OST0001 qtype:usr id:60000 enforced:1 granted: 3072 pending:0 waiting:0 req:0 usage: 2832 qunit:1024 qtune:512 edquot:1 default:no
            00040000:04000000:0.0:1731512929.344954:0:232570:0:(qsd_handler.c:743:qsd_op_begin0()) $$$ op_begin space:12  qsd:lustre-OST0001 qtype:usr id:60000 enforced:1 granted: 3072 pending:0 waiting:0 req:0 usage: 2836 qunit:1024 qtune:512 edquot:1 default:no
            00040000:04000000:0.0:1731512931.629564:0:232570:0:(qsd_handler.c:743:qsd_op_begin0()) $$$ op_begin space:12  qsd:lustre-OST0001 qtype:usr id:60000 enforced:1 granted: 3072 pending:0 waiting:0 req:0 usage: 2840 qunit:1024 qtune:512 edquot:1 default:no
            00040000:04000000:0.0:1731512932.901844:0:232570:0:(qsd_handler.c:743:qsd_op_begin0()) $$$ op_begin space:12  qsd:lustre-OST0001 qtype:usr id:60000 enforced:1 granted: 3072 pending:0 waiting:0 req:0 usage: 2844 qunit:1024 qtune:512 edquot:1 default:no
            00040000:04000000:0.0:1731512933.823209:0:232570:0:(qsd_handler.c:743:qsd_op_begin0()) $$$ op_begin space:12  qsd:lustre-OST0001 qtype:usr id:60000 enforced:1 granted: 3072 pending:0 waiting:0 req:0 usage: 2848 qunit:1024 qtune:512 edquot:1 default:no
            00040000:04000000:0.0:1731512935.789237:0:232570:0:(qsd_handler.c:743:qsd_op_begin0()) $$$ op_begin space:12  qsd:lustre-OST0001 qtype:usr id:60000 enforced:1 granted: 3072 pending:0 waiting:0 req:0 usage: 2852 qunit:1024 qtune:512 edquot:1 default:no
            00040000:04000000:0.0:1731512937.531182:0:232570:0:(qsd_handler.c:743:qsd_op_begin0()) $$$ op_begin space:12  qsd:lustre-OST0001 qtype:usr id:60000 enforced:1 granted: 3072 pending:0 waiting:0 req:0 usage: 2856 qunit:1024 qtune:512 edquot:1 default:no
            00040000:04000000:0.0:1731512938.691124:0:232570:0:(qsd_handler.c:743:qsd_op_begin0()) $$$ op_begin space:12  qsd:lustre-OST0001 qtype:usr id:60000 enforced:1 granted: 3072 pending:0 waiting:0 req:0 usage: 2860 qunit:1024 qtune:512 edquot:1 default:no
            00040000:04000000:1.0:1731512940.366300:0:232570:0:(qsd_handler.c:743:qsd_op_begin0()) $$$ op_begin space:12  qsd:lustre-OST0001 qtype:usr id:60000 enforced:1 granted: 3072 pending:0 waiting:0 req:0 usage: 2864 qunit:1024 qtune:512 edquot:1 default:no 

            The usage is increasing by only 4 bytes every few seconds (2832 → 2864 over the ~13 seconds shown above), while it is trying to write 12 bytes (space:12) - which looks weird.

            On the client side I see that edquot is being set several times, and at the end of the test it gets EDQUOT in chkdq:

            00000008:04000000:0.0:1731512918.620394:0:99472:0:(osc_request.c:2138:osc_brw_fini_request()) setdq for [60000 60000 0] with valid 0x18000006b584fb9, flags 404100
            00000008:04000000:0.0:1731512920.347211:0:99472:0:(osc_request.c:2138:osc_brw_fini_request()) setdq for [60000 60000 0] with valid 0x18000006b584fb9, flags 404100
            00000008:04000000:0.0:1731512920.371962:0:99472:0:(osc_request.c:2138:osc_brw_fini_request()) setdq for [60000 60000 0] with valid 0x18000006b584fb9, flags 404100
            00000008:04000000:0.0:1731512925.884689:0:99471:0:(osc_request.c:2138:osc_brw_fini_request()) setdq for [60000 60000 0] with valid 0x18000006b584fb9, flags 404100
            00000008:04000000:0.0:1731512926.506032:0:99471:0:(osc_request.c:2138:osc_brw_fini_request()) setdq for [60000 60000 0] with valid 0x18000006b584fb9, flags 404100
            00000008:04000000:0.0:1731512935.788543:0:99471:0:(osc_request.c:2138:osc_brw_fini_request()) setdq for [60000 60000 0] with valid 0x18000006b584fb9, flags 404100
            00000008:04000000:0.0:1731512937.530517:0:99471:0:(osc_request.c:2138:osc_brw_fini_request()) setdq for [60000 60000 0] with valid 0x18000006b584fb9, flags 404100
            00000008:04000000:0.0:1731512938.690475:0:99471:0:(osc_request.c:2138:osc_brw_fini_request()) setdq for [60000 60000 0] with valid 0x18000006b584fb9, flags 404100
            00000008:04000000:0.0:1731512940.365420:0:99471:0:(osc_request.c:2138:osc_brw_fini_request()) setdq for [60000 60000 0] with valid 0x18000006b584fb9, flags 404100
            00000001:04000000:0.0:1731512940.365672:0:666845:0:(osc_quota.c:63:osc_quota_chkdq()) chkdq found noquota for user 60000 

            There are no requests from the OST to the QMT. It is clear that qsd has already set edquot for this user, so there is no need, at least right now, to send an acquire request and grant more space. There is also a chance that it could acquire some space through qsd_intent_lock: even when the QMT is prohibited from handling QUOTA_DQACQ, it is still able to handle ldlm IT_QUOTA_DQACQ requests. Could that break the logic of the test?
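            For what it's worth, the quota-slave state during such a hang could be inspected with something like the following (quota_slave.info is the usual osd parameter; the user name and mount point are assumptions based on sanity-quota defaults):

            # on the OST node: quota-slave status (enforcement on/off, connection to the quota master)
            lctl get_param osd-*.*-OST*.quota_slave.info

            # on the client: usage and limits as seen for the test user
            lfs quota -u quota_usr /mnt/lustre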

            I found that $TESTFILE is removed only if at_is_enabled. I'm not sure it would help with these failures, but do we need to set the stripe ("-i 0 -c 1") for the file we are going to write? If all OSTs are on the same node, that shouldn't be the reason it failed; it only matters when the OSTs are on different nodes.
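            As a sketch of that stripe setting (treating $LFS, $TESTFILE and $TSTUSR as the usual sanity-quota names), it would be something like:

            # put the whole file on a single OST so every write hits the same quota slave
            $LFS setstripe -c 1 -i 0 $TESTFILE
            chown $TSTUSR:$TSTUSR $TESTFILE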

            If the patch gets updated, stack_trap needs to be added in a few places (a consolidated sketch follows this list):

            1. local old_qst=$($LCTL get_param osd-*.$FSNAME-OST*.quota_slave.timeout)
               do_facet ost1 $LCTL set_param \
                  osd-*.$FSNAME-OST*.quota_slave.timeout=$((TIMEOUT / 2))
               stack_trap "do_facet ost1 $LCTL set_param osd-*.$FSNAME-OST*.quota_slave.timeout=$old_qst"
            2. do_facet ost1 dmesg > $TMP/lustre-log-${TESTNAME}.log
               stack_trap "rm -f $TMP/lustre-log-${TESTNAME}.log"
               watchdog=$(awk '/[Ss]ervice thread pid/ && /was inactive/ \
               ...
            3. at_max_set $TIMEOUT ost1
               stack_trap "at_max_set $at_max_saved ost1"
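            A consolidated sketch of how those three pieces could sit together in test_6 (do_facet, stack_trap, at_max_get/at_max_set are existing test-framework helpers; taking the first value with head -n1 and the exact parameter paths are assumptions):

            # save and later restore quota_slave.timeout on ost1
            local old_qst=$(do_facet ost1 $LCTL get_param -n \
                osd-*.$FSNAME-OST*.quota_slave.timeout | head -n1)
            do_facet ost1 $LCTL set_param \
                osd-*.$FSNAME-OST*.quota_slave.timeout=$((TIMEOUT / 2))
            stack_trap "do_facet ost1 $LCTL set_param osd-*.$FSNAME-OST*.quota_slave.timeout=$old_qst"

            # capture dmesg for the watchdog check and clean it up on exit
            do_facet ost1 dmesg > $TMP/lustre-log-${TESTNAME}.log
            stack_trap "rm -f $TMP/lustre-log-${TESTNAME}.log"

            # save/restore at_max on ost1
            local at_max_saved=$(at_max_get ost1)
            at_max_set $TIMEOUT ost1
            stack_trap "at_max_set $at_max_saved ost1"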

             

            I'm not sure why we need to wait 240 seconds at the end of the test. We could probably change it to something like 40-60 seconds, which would give us a chance to capture earlier debug logs and better understand the problem.
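            As a rough sketch of that shorter wait (the loop structure and the DDPID variable name here are assumptions, not the actual test code; quota_error is the existing sanity-quota helper):

            local max_wait=60       # instead of 240 seconds
            local waited=0
            # DDPID: pid of the backgrounded dd (assumed name)
            while kill -0 $DDPID 2>/dev/null; do
                if (( ++waited > max_wait )); then
                    quota_error u $TSTUSR "dd not finished in $max_wait secs"
                    break
                fi
                sleep 1
            done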

            yujian Jian Yu added a comment - +1 on Lustre b2_15 branch: https://testing.whamcloud.com/test_sets/8102d3fc-f790-4053-9e6b-51923ad0e251
            yujian Jian Yu added a comment -

            The subtest failed many times in zfs test sessions in the past 6 months.

            yujian Jian Yu added a comment - +1 on master branch: https://testing.whamcloud.com/test_sets/0d7fdd5a-b404-4b82-bd32-aaf538cee475

            People

              Assignee: wc-triage WC Triage
              Reporter: maloo Maloo
              Votes: 0
              Watchers: 6