
recovery-mds-scale test_failover_ost: tar: Cannot open: No space left on device


    Description

      After running recovery-mds-scale test_failover_ost for 1.5 hours (OST failed over 6 times), client load on one of the clients failed as follows:

      <snip>
      tar: etc/mail/submit.cf: Cannot open: No space left on device
      tar: etc/mail/trusted-users: Cannot open: No space left on device
      tar: etc/mail/virtusertable: Cannot open: No space left on device
      tar: etc/mail/access: Cannot open: No space left on device
      tar: etc/mail/aliasesdb-stamp: Cannot open: No space left on device
      tar: etc/gssapi_mech.conf: Cannot open: No space left on device
      tar: Exiting with failure status due to previous errors
      

      The console log on the client (client-32vm6) showed:

      19:40:31:INFO: task tar:2790 blocked for more than 120 seconds.
      19:40:31:"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      19:40:31:tar           D 0000000000000000     0  2790   2788 0x00000080
      19:40:31: ffff88004eb73a28 0000000000000082 ffff88004eb739d8 ffff88007c24fe50
      19:40:31: 0000000000000286 0000000000000003 0000000000000001 0000000000000286
      19:40:31: ffff88007bcb3ab8 ffff88004eb73fd8 000000000000fb88 ffff88007bcb3ab8
      19:40:31:Call Trace:
      19:40:31: [<ffffffffa03d775a>] ? cfs_waitq_signal+0x1a/0x20 [libcfs]
      19:40:31: [<ffffffff8150ea05>] schedule_timeout+0x215/0x2e0
      19:40:31: [<ffffffffa068517c>] ? ptlrpc_request_bufs_pack+0x5c/0x80 [ptlrpc]
      19:40:31: [<ffffffffa069a770>] ? lustre_swab_ost_body+0x0/0x10 [ptlrpc]
      19:40:31: [<ffffffff8150e683>] wait_for_common+0x123/0x180
      19:40:31: [<ffffffff81063310>] ? default_wake_function+0x0/0x20
      19:40:31: [<ffffffff8150e79d>] wait_for_completion+0x1d/0x20
      19:40:31: [<ffffffffa08cbf6c>] osc_io_setattr_end+0xbc/0x190 [osc]
      19:40:31: [<ffffffffa095cde0>] ? lov_io_end_wrapper+0x0/0x100 [lov]
      19:40:31: [<ffffffffa055cf30>] cl_io_end+0x60/0x150 [obdclass]
      19:40:31: [<ffffffffa055d7e0>] ? cl_io_start+0x0/0x140 [obdclass]
      19:40:31: [<ffffffffa095ced1>] lov_io_end_wrapper+0xf1/0x100 [lov]
      19:40:31: [<ffffffffa095c86e>] lov_io_call+0x8e/0x130 [lov]
      19:40:31: [<ffffffffa095e3bc>] lov_io_end+0x4c/0xf0 [lov]
      19:40:31: [<ffffffffa055cf30>] cl_io_end+0x60/0x150 [obdclass]
      19:40:31: [<ffffffffa0561f92>] cl_io_loop+0xc2/0x1b0 [obdclass]
      19:40:31: [<ffffffffa0a2aa08>] cl_setattr_ost+0x208/0x2c0 [lustre]
      19:40:31: [<ffffffffa09f8b0e>] ll_setattr_raw+0x9ce/0x1000 [lustre]
      19:40:31: [<ffffffffa09f919b>] ll_setattr+0x5b/0xf0 [lustre]
      19:40:31: [<ffffffff8119e708>] notify_change+0x168/0x340
      19:40:31: [<ffffffff811b284c>] utimes_common+0xdc/0x1b0
      19:40:31: [<ffffffff811828d1>] ? __fput+0x1a1/0x210
      19:40:31: [<ffffffff811b29fe>] do_utimes+0xde/0xf0
      19:40:31: [<ffffffff811b2b12>] sys_utimensat+0x32/0x90
      19:40:31: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
      

      Maloo report: https://maloo.whamcloud.com/test_sets/053120d2-bb19-11e2-8824-52540035b04c

          Activity

            yujian Jian Yu added a comment -

            This is blocking recovery-mds-scale testing.

            yujian Jian Yu added a comment - - edited

            Lustre Build: http://build.whamcloud.com/job/lustre-b2_5/39/ (2.5.1 RC1)
            Distro/Arch: RHEL6.5/x86_64
            Test Group: failover

            The same failure occurred:
            https://maloo.whamcloud.com/test_sets/36c8c4fe-a657-11e3-a191-52540035b04c

            Many sub-tests of replay-single in the failover test group also hit the "No space left on device" failure:
            https://maloo.whamcloud.com/test_sets/c55e55ee-a657-11e3-a191-52540035b04c

            yujian Jian Yu added a comment - - edited

            More instances on Lustre b2_5 branch:
            https://maloo.whamcloud.com/test_sets/fcbcabd0-770e-11e3-b181-52540035b04c
            https://maloo.whamcloud.com/test_sets/e669c6a8-8643-11e3-9f3f-52540035b04c
            https://maloo.whamcloud.com/test_sets/fc5c9556-8505-11e3-8da9-52540035b04c
            https://maloo.whamcloud.com/test_sets/f16b8720-9922-11e3-83d7-52540035b04c
            https://maloo.whamcloud.com/test_sets/7d4fc910-956b-11e3-936f-52540035b04c

            yujian Jian Yu added a comment -

            Lustre Build: http://build.whamcloud.com/job/lustre-b2_5/5/
            Test Group: failover

            The same failure occurred:
            https://maloo.whamcloud.com/test_sets/c86198c4-7505-11e3-95ae-52540035b04c

            hongchao.zhang Hongchao Zhang added a comment -

            In a local test with two nodes, where the server node runs master and the client runs b2_1, b2_3, b2_4 or master, the pages were not discarded after the lock's blocking AST was received. The pages are discarded after adding "LDLM_FL_DISCARD_DATA" instead of "LDLM_FL_AST_DISCARD_DATA" into "LDLM_FL_AST_MASK".

            LDLM_AST_DISCARD_DATA (renamed to LDLM_FL_AST_DISCARD_DATA in master) tells LDLM to send the blocking AST with the LDLM_FL_DISCARD_DATA flag, but it is itself not sent on the wire, so LDLM_FL_DISCARD_DATA ends up being ignored because it is masked out by "LDLM_FL_AST_MASK".

            Code snippet in master:

            void ldlm_add_bl_work_item(struct ldlm_lock *lock, struct ldlm_lock *new,
                                       cfs_list_t *work_list)
            {
                    if (!ldlm_is_ast_sent(lock)) {
                            LDLM_DEBUG(lock, "lock incompatible; sending blocking AST.");
                            ldlm_set_ast_sent(lock);
                            /* If the enqueuing client said so, tell the AST recipient to
                             * discard dirty data, rather than writing back. */
                            if (ldlm_is_ast_discard_data(new))         <---- check LDLM_FL_AST_DISCARD_DATA
                                    ldlm_set_discard_data(lock);       <---- set LDLM_FL_DISCARD_DATA
                            LASSERT(cfs_list_empty(&lock->l_bl_ast));
                            cfs_list_add(&lock->l_bl_ast, work_list);
                            LDLM_LOCK_GET(lock);
                            LASSERT(lock->l_blocking_lock == NULL);
                            lock->l_blocking_lock = LDLM_LOCK_GET(new);
                    }
            }
            

            The corresponding patch is tracked at http://review.whamcloud.com/#/c/8671/
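
            For illustration, here is a minimal stand-alone sketch of the masking behaviour described above. The bit values and the mask contents are assumptions for illustration only (other bits are omitted); the real definitions live in lustre/include/lustre_dlm_flags.h.

            /* Stand-alone sketch, not Lustre source: a flag that is missing from
             * LDLM_FL_AST_MASK is stripped when the client applies the flags
             * carried by an incoming blocking AST.  Bit values below are
             * illustrative assumptions, not the real definitions. */
            #include <stdint.h>
            #include <stdio.h>

            #define LDLM_FL_DISCARD_DATA     0x0000000000010000ULL /* sent on the wire */
            #define LDLM_FL_AST_DISCARD_DATA 0x0000000080000000ULL /* local-only hint */

            /* old mask: keeps only the local-only bit, which never arrives on the wire */
            #define LDLM_FL_AST_MASK_OLD     (LDLM_FL_AST_DISCARD_DATA)
            /* fixed mask (per the change tracked at http://review.whamcloud.com/#/c/8671/,
             * other bits omitted): keeps the wire bit */
            #define LDLM_FL_AST_MASK_NEW     (LDLM_FL_DISCARD_DATA)

            int main(void)
            {
                    /* flags carried by the blocking AST RPC from the OSS */
                    uint64_t wire_flags = LDLM_FL_DISCARD_DATA;

                    uint64_t kept_old = wire_flags & LDLM_FL_AST_MASK_OLD;
                    uint64_t kept_new = wire_flags & LDLM_FL_AST_MASK_NEW;

                    printf("old mask keeps LDLM_FL_DISCARD_DATA: %s\n",
                           (kept_old & LDLM_FL_DISCARD_DATA) ? "yes" : "no (pages written back)");
                    printf("new mask keeps LDLM_FL_DISCARD_DATA: %s\n",
                           (kept_new & LDLM_FL_DISCARD_DATA) ? "yes" : "no");
                    return 0;
            }

            With the old mask the client ends up writing the dirty pages back instead of discarding them, which matches the behaviour observed in the local test above.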

            yujian Jian Yu added a comment -

            Lustre Build: http://build.whamcloud.com/job/lustre-b2_4/70/ (2.4.2 RC2)
            https://maloo.whamcloud.com/test_sets/34a4c608-6be4-11e3-a73e-52540035b04c

            yujian Jian Yu added a comment -

            I performed the test of writing a large file with dd in the background and then deleting the file from another client. Unfortunately, the debug logs did not show the "pages were discarded" message on the clients, and there was also no "server preparing blocking AST" message on the OSS node. Still digging.

            yujian Jian Yu added a comment - - edited

            In "run_dd_dk_1386339667" debug logs on OSS (wtm-72), I found that:

            00010000:00010000:18.0:1386339564.644518:0:33665:0:(ldlm_lockd.c:848:ldlm_server_blocking_ast()) ### server preparing blocking AST ns: filter-lustre-OST0003_UUID lock: ffff880416972cc0/0xe8fc4c150d5421b0 lrc: 3/0,0 mode: PW/PW res: [0x2:0x0:0x0].0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->4095) flags: 0x50000000010020 nid: 10.10.18.127@tcp remote: 0xe0b36e7f4f2635f3 expref: 7 pid: 38018 timeout: 0 lvb_type: 0
            

            A blocking AST RPC was sent to the client.

            And on the client (wtm-68), in "run_dd_dk_1386339667":

            00000020:00000001:14.0:1386339585.987516:0:40723:0:(cl_page.c:420:cl_page_put()) Process entered
            00000020:00000001:14.0:1386339585.987517:0:40723:0:(cl_page.c:422:cl_page_put()) page@ffff880746ae8600[1 ffff8805e1f6ac30:397170 ^(null)_(null) 4 0 1 (null) (null) 0x0]
            00000020:00000001:14.0:1386339585.987518:0:40723:0:(cl_page.c:422:cl_page_put()) 1
            00000020:00000001:14.0:1386339585.987518:0:40723:0:(cl_page.c:160:cl_page_free()) Process entered
            00000020:00000001:14.0:1386339585.987519:0:40723:0:(lustre_fid.h:719:fid_flatten32()) Process leaving (rc=788528898 : 788528898 : 2effff02)
            00000020:00000010:14.0:1386339585.987519:0:40723:0:(cl_page.c:174:cl_page_free()) kfreed 'page': 440 at ffff880746ae8600.
            00000020:00000001:14.0:1386339585.987520:0:40723:0:(cl_page.c:175:cl_page_free()) Process leaving
            00000020:00000001:14.0:1386339585.987520:0:40723:0:(cl_page.c:437:cl_page_put()) Process leaving
            00020000:00000001:14.0:1386339585.987520:0:40723:0:(lov_page.c:82:lov_page_fini()) Process leaving
            00000020:00000001:14.0:1386339585.987521:0:40723:0:(lustre_fid.h:719:fid_flatten32()) Process leaving (rc=4194307 : 4194307 : 400003)
            00000020:00000010:14.0:1386339585.987521:0:40723:0:(cl_page.c:174:cl_page_free()) kfreed 'page': 304 at ffff880746ae8800.
            00000020:00000001:14.0:1386339585.987522:0:40723:0:(cl_page.c:175:cl_page_free()) Process leaving
            00000020:00000001:14.0:1386339585.987522:0:40723:0:(cl_page.c:437:cl_page_put()) Process leaving
            00000008:00000001:14.0:1386339585.987523:0:40723:0:(osc_cache.c:3137:osc_page_gang_lookup()) Process leaving (rc=0 : 0 : 0)
            
            00000008:00000001:14.0:1386339585.987534:0:40723:0:(osc_cache.c:3246:osc_lock_discard_pages()) Process leaving (rc=0 : 0 : 0)
            00000020:00001000:14.0:1386339585.987535:0:40723:0:(cl_object.c:971:cl_env_put()) 2@ffff88081d244a68
            00000008:00000001:14.0:1386339585.987536:0:40723:0:(osc_lock.c:1376:osc_lock_flush()) Process leaving (rc=0 : 0 : 0)
            00010000:00000001:14.0:1386339585.987538:0:40723:0:(ldlm_request.c:1353:ldlm_cli_cancel()) Process entered
            00010000:00000001:14.0:1386339585.987543:0:40723:0:(ldlm_request.c:1122:ldlm_cli_cancel_local()) Process entered
            00010000:00010000:14.0:1386339585.987543:0:40723:0:(ldlm_request.c:1127:ldlm_cli_cancel_local()) ### client-side cancel ns: lustre-OST0003-osc-ffff880435738000 lock: ffff88083821d980/0xe0b36e7f4f2635f3 lrc: 4/0,0 mode: PW/PW res: [0x2:0x0:0x0].0 rrc: 1 type: EXT [0->18446744073709551615] (req 0->4095) flags: 0x428400010000 nid: local remote: 0xe8fc4c150d5421b0 expref: -99 pid: 42589 timeout: 0 lvb_type: 1
            

            osc_lock_discard_pages() was called from osc_lock_flush(); however, the following code path was not covered in the debug logs:

                            if (descr->cld_mode >= CLM_WRITE) {
                                    result = osc_cache_writeback_range(env, obj,
                                                    descr->cld_start, descr->cld_end,
                                                    1, discard);
                                    LDLM_DEBUG(ols->ols_lock,
                                            "lock %p: %d pages were %s.\n", lock, result,
                                            discard ? "discarded" : "written");
                                    if (result > 0)
                                            result = 0;
                            }
            

            I have just started the test again to repeat the "write a large file in the background and delete it" operations more times.
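
            For reference, here is a tiny stand-alone sketch of the decision reported by the LDLM_DEBUG line in the code quoted above, under the assumption that the discard argument reflects the lock's LDLM_FL_DISCARD_DATA bit; the struct, helper name and bit value below are hypothetical stand-ins, not the real osc code.

            #include <stdbool.h>
            #include <stdint.h>
            #include <stdio.h>

            #define LDLM_FL_DISCARD_DATA 0x0000000000010000ULL /* assumed value */

            struct demo_lock { uint64_t l_flags; };    /* hypothetical stand-in */

            /* if the LDLM_FL_DISCARD_DATA bit was stripped on the client (see the
             * mask sketch above), the pages are written back rather than dropped */
            static bool demo_should_discard(const struct demo_lock *lock)
            {
                    return (lock->l_flags & LDLM_FL_DISCARD_DATA) != 0;
            }

            int main(void)
            {
                    struct demo_lock lock = { .l_flags = 0 }; /* bit lost to the mask */
                    bool discard = demo_should_discard(&lock);

                    printf("pages were %s\n", discard ? "discarded" : "written");
                    return 0;
            }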

            yujian Jian Yu added a comment - - edited

            I applied the LDLM_FL_AST_MASK change from http://review.whamcloud.com/8071 to the latest master branch and created http://review.whamcloud.com/8495 to get a new build.

            The following recovery-mds-scale test_failover_mds was performed on that build:
            https://maloo.whamcloud.com/sub_tests/9752d892-5e8a-11e3-a925-52540035b04c

            The "run_dd_debug" log on Client 2 (wtm-68) showed that:

            Total free disk space is 10589812, 4k blocks to dd is 2382707
            + pdsh -t 300 -S -w 'wtm-[67,68,71,72]' 'export PATH=$PATH:/sbin:/usr/sbin; lctl set_param debug=-1'
            + pdsh -t 300 -S -w 'wtm-[67,68,71,72]' 'export PATH=$PATH:/sbin:/usr/sbin; lctl set_param debug_mb=150'
            + pdsh -t 300 -S -w 'wtm-[67,68,71,72]' 'export PATH=$PATH:/sbin:/usr/sbin; lctl dk > /dev/null'
            + load_pid=42589
            + wait 42589
            + dd bs=4k count=2382707 status=noxfer if=/dev/zero of=/mnt/lustre/d0.dd-wtm-68/dd-file
            2382707+0 records in
            2382707+0 records out
            + '[' 0 -eq 0 ']'
            ++ date '+%F %H:%M:%S'
            + echoerr '2013-12-06 06:19:18: dd succeeded'
            + echo '2013-12-06 06:19:18: dd succeeded'
            2013-12-06 06:19:18: dd succeeded
            + cd /tmp
            + rm -rf /mnt/lustre/d0.dd-wtm-68
            + pdsh -t 300 -S -w 'wtm-[67,68,71,72]' 'export PATH=$PATH:/sbin:/usr/sbin;
            			lctl dk > /scratch/jianyu/test_logs/2013-12-06/060035/recovery-mds-scale.test_failover_mds.run_dd_dk.$(hostname -s)_$(date +%s).log'
            

            The lctl debug logs are in "run_dd_dk_1386339559" and "run_dd_dk_1386339667" (there are two separate dd runs) in the above Maloo report.

            I found ofd_destroy_by_fid() in the OSS (wtm-72) lctl debug logs. I need to look into the logs more deeply.


            adilger Andreas Dilger added a comment -

            Yu Jian, I think the code change in 8071 is potentially in the right area, but not necessarily the right fix. I think it will change the wire protocol to use a different flag (LDLM_FL_DISCARD_DATA vs. LDLM_FL_AST_DISCARD_DATA), but I'm not positive. It would be great to test this properly: write a large file in the background, delete it, and then check the kernel debug logs to see whether more pages are being written or whether they are being discarded.

            yujian Jian Yu added a comment -

            In addition, lustre/include/lustre_dlm_flags.h was added by http://review.whamcloud.com/5312 and does not exist on the Lustre b2_4 branch.


            People

              hongchao.zhang Hongchao Zhang
              yujian Jian Yu