
[LU-5225] Client is evicted by multiple OSTs on all OSSs

Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Minor

    Description

      As part of intensive testing of the LU-2827 patch, J. Simmons encountered a new issue when running with the patch on top of the latest/current 2.5.59/master version.

      Looking at the information provided (ftp.whamcloud.com/uploads/LU-4584/20140609-run1.tbz and 20140609-run2.tbz), it appears that at some point in time the client's RPCs are no longer sent. As a consequence, the client's lock cancel replies to the servers'/OSTs' blocking AST requests are never sent, which in turn leads to evictions.

      The reason why the client's RPCs are no longer sent cannot be determined using only the Lustre debug log level (dlmtrace) enabled on the client, but I can see from the client's eviction handling that these RPCs were sitting on the delayed queue.
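      (For readers unfamiliar with the client-side tooling: the state of the per-OST imports, including stalled RPC traffic, can be inspected roughly as follows. This is only an illustrative sketch; exact parameter names can vary between Lustre versions.)

      # Show the state of each OSC import on the client (connection state,
      # RPCs in flight, etc.) to spot connections that have stopped sending.
      lctl get_param osc.*.import

      # Per-OSC RPC statistics can also hint at stalled traffic.
      lctl get_param osc.*.rpc_stats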

          Activity


            It was the testing with 2.5.60 clients. When I updated the clients to a newer version and could not reproduce the problem, I figured some patch that had landed in the meantime fixed it. So I examined the list of patches merged since the broken client version. The only one that made sense was LU-4861, because:

            1) Since I was seeing evictions from the OSTs, it makes sense that a possible source of the problem could be the osc layer.

            2) The patch deals with a deadlock issue, which could explain why I saw evictions.

            It seems I'm familiar enough with the code to make a reasonably educated guess about what will fix my problems.

            I have been testing the LU-4861 patch with 2.5.2 clients with excellent success so far.

            simmonsja James A Simmons added a comment

            James,
            Thanks for working so hard on this and helping out, but I have an additional question: what made you point to the LU-4861 patch as a possible fix?

            bfaccini Bruno Faccini (Inactive) added a comment

            So far the results on my small scale system are very promising using the patch from LU-4861. If all goes well I will move it to the next scale system. If that works then we will use it in our test shot for Tuesday.

            simmonsja James A Simmons added a comment

            So I updated my 2.6 tree to what is current and tried to duplicate the problem on my small scale system. I couldn't, so I moved to a larger system (not titan) and got the same result. I think it is the patch from LU-4861 that fixed this, so I'm going to try a backport of the patch to 2.5 to see what happens.

            simmonsja James A Simmons added a comment

            James, I wonder if, during some dedicated time, you could get a client crash-dump (with a resized debug buffer and full debug trace enabled, or at least "net+dlmtrace+rpctrace" in addition to the default mask) when such an eviction scenario occurs? This can be automated if you add http://review.whamcloud.com/#/c/8875/ on top of your client's distro.

            bfaccini Bruno Faccini (Inactive) added a comment
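            (A rough sketch of the capture Bruno is describing; the debug flags come from his comment, while the buffer size and the panic trigger are illustrative and depend on the site's kdump setup:)

            # Enlarge the in-memory debug buffer (value in MB) and add the
            # requested flags on top of the default debug mask.
            lctl set_param debug_mb=1024
            lctl set_param debug="+net +dlmtrace +rpctrace"

            # Once the eviction is observed, force a crash dump so the debug
            # buffer can be recovered from the vmcore (requires kdump).
            echo c > /proc/sysrq-trigger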

            Sorry, you are right. Using 3 OSSs is the most recent setup, since I lost one OSS this last week. The original setup was 4 OSSs with 7 OSTs per OSS, so it should be 28 OSTs in total.

            simmonsja James A Simmons added a comment

            Hmm, only 3 OSSs? But you provided dmesg/debug-logs for 4 OSSs, and there we can see the same client being evicted by all 4 at about the same time:

            [root@client-17 20140609-run1]# ls -l
            total 14438608
            -rw-r--r-- 1 2004 2647      52742 Jun  9 08:06 clients-20140609
            -rw-r--r-- 1 2004 2647          0 Jun  9 08:13 dmesg-mds1-20140609
            -rw-r--r-- 1 2004 2647       1414 Jun  9 08:14 dmesg-oss1-20140609
            -rw-r--r-- 1 2004 2647       1928 Jun  9 08:14 dmesg-oss2-20140609
            -rw-r--r-- 1 2004 2647       2604 Jun  9 08:14 dmesg-oss3-20140609
            -rw-r--r-- 1 2004 2647       1296 Jun  9 08:14 dmesg-oss4-20140609
            -rw-r--r-- 1 2004 2647   18560470 Jun  9 08:32 lustre-log.c0-0c0s0n2.txt
            -rw-r--r-- 1 2004 2647 2293127795 Jun  9 08:32 lustre-log.c0-0c0s1n0.txt
            -rw-r--r-- 1 2004 2647 2292348834 Jun  9 08:33 lustre-log.c0-0c0s1n1.txt
            -rw-r--r-- 1 2004 2647 2292352991 Jun  9 08:33 lustre-log.c0-0c0s1n2.txt
            -rw-r--r-- 1 2004 2647 2292326850 Jun  9 08:34 lustre-log.c0-0c0s1n3.txt
            -rw-r--r-- 1 2004 2647     584732 Jun  9 08:34 lustre-log.c0-0c0s2n2.txt
            -rw-r--r-- 1 2004 2647 1442967552 Jun  9 08:34 lustre-log.c0-0c0s3n0.txt
            -rw-r--r-- 1 2004 2647          0 Jun  9 08:35 lustre-log.c0-0c0s3n1.txt
            -rw-r--r-- 1 2004 2647          0 Jun  9 08:35 lustre-log.c0-0c0s3n2.txt
            -rw-r--r-- 1 2004 2647          0 Jun  9 08:36 lustre-log.c0-0c0s3n3.txt
            -rw-r--r-- 1 2004 2647          0 Jun  9 08:36 lustre-log.c0-0c0s5n0.txt
            -rw-r--r-- 1 2004 2647          0 Jun  9 08:36 lustre-log.c0-0c0s5n1.txt
            -rw-r--r-- 1 2004 2647          0 Jun  9 08:36 lustre-log.c0-0c0s5n2.txt
            -rw-r--r-- 1 2004 2647          0 Jun  9 08:36 lustre-log.c0-0c0s5n3.txt
            -rw-r--r-- 1 2004 2647          0 Jun  9 08:36 lustre-log.c0-0c0s6n0.txt
            -rw-r--r-- 1 2004 2647          0 Jun  9 08:36 lustre-log.c0-0c0s6n1.txt
            -rw-r--r-- 1 2004 2647          0 Jun  9 08:36 lustre-log.c0-0c0s6n2.txt
            -rw-r--r-- 1 2004 2647          0 Jun  9 08:36 lustre-log.c0-0c0s6n3.txt
            -rw-r--r-- 1 2004 2647          0 Jun  9 08:36 lustre-log.c0-0c0s7n0.txt
            -rw-r--r-- 1 2004 2647          0 Jun  9 08:36 lustre-log.c0-0c0s7n1.txt
            -rw-r--r-- 1 2004 2647          0 Jun  9 08:36 lustre-log.c0-0c0s7n2.txt
            -rw-r--r-- 1 2004 2647          0 Jun  9 08:36 lustre-log.c0-0c0s7n3.txt
            -rw-r--r-- 1 2004 2647  406375578 Jun  9 08:13 mds1-log-20140609
            -rw-r--r-- 1 2004 2647  944812799 Jun  9 08:13 oss1-log-20140609
            -rw-r--r-- 1 2004 2647  939348575 Jun  9 08:12 oss2-log-20140609
            -rw-r--r-- 1 2004 2647  905196600 Jun  9 08:08 oss3-log-20140609
            -rw-r--r-- 1 2004 2647  942542266 Jun  9 08:08 oss4-log-20140609
            [root@client-17 20140609-run1]# grep -A 1 -i evict dmesg-oss*
            dmesg-oss1-20140609:[251714.315118] LustreError: 0:0:(ldlm_lockd.c:344:waiting_locks_callback()) ### lock callback timer expired after 148s: evicting client at 28@gni1  ns: filter-tickfs-OST0000_UUID lock: ffff88059326c7c0/0x2004179f53efbf08 lrc: 3/0,0 mode: PR/PR res: [0x3e9b:0x0:0x0].0 rrc: 13 type: EXT [469762048->585105407] (req 469762048->470810623) flags: 0x60000000010020 nid: 28@gni1 remote: 0xb50852018d8f69d9 expref: 50 pid: 30470 timeout: 4546331352 lvb_type: 0
            dmesg-oss1-20140609-[251714.355075] LustreError: 0:0:(ldlm_lockd.c:344:waiting_locks_callback()) Skipped 608 previous similar messages
            --
            dmesg-oss2-20140609:[251704.627677] LustreError: 0:0:(ldlm_lockd.c:344:waiting_locks_callback()) ### lock callback timer expired after 154s: evicting client at 28@gni1  ns: filter-tickfs-OST0015_UUID lock: ffff88057b31abc0/0xc23127a58a2b8bb6 lrc: 3/0,0 mode: PR/PR res: [0x3e3b:0x0:0x0].0 rrc: 10 type: EXT [412090368->479199231] (req 412090368->413138943) flags: 0x60000000010020 nid: 28@gni1 remote: 0xb50852018d8f5e8e expref: 65 pid: 19858 timeout: 4546325396 lvb_type: 0
            dmesg-oss2-20140609-[251704.667664] LustreError: 0:0:(ldlm_lockd.c:344:waiting_locks_callback()) Skipped 508 previous similar messages
            --
            dmesg-oss3-20140609:[251707.739428] LustreError: 0:0:(ldlm_lockd.c:344:waiting_locks_callback()) ### lock callback timer expired after 147s: evicting client at 28@gni1  ns: filter-tickfs-OST0002_UUID lock: ffff88056f8dc2c0/0x5619018c38d12223 lrc: 3/0,0 mode: PR/PR res: [0x3e99:0x0:0x0].0 rrc: 25 type: EXT [575668224->920649727] (req 575668224->576716799) flags: 0x60000000010020 nid: 28@gni1 remote: 0xb50852018d8f6c33 expref: 80 pid: 21886 timeout: 4546330359 lvb_type: 0
            dmesg-oss3-20140609-[251707.779405] LustreError: 0:0:(ldlm_lockd.c:344:waiting_locks_callback()) Skipped 406 previous similar messages
            --
            dmesg-oss4-20140609:[251704.795841] LustreError: 0:0:(ldlm_lockd.c:344:waiting_locks_callback()) ### lock callback timer expired after 399s: evicting client at 28@gni1  ns: filter-tickfs-OST0013_UUID lock: ffff880511d64ac0/0x12a297bb77a5f496 lrc: 3/0,0 mode: PR/PR res: [0x3e59:0x0:0x0].0 rrc: 2 type: EXT [402653184->614465535] (req 402653184->403701759) flags: 0x60000000010020 nid: 28@gni1 remote: 0xb50852018d8eefc2 expref: 62 pid: 2561 timeout: 4546325612 lvb_type: 0
            dmesg-oss4-20140609-[251704.835633] LustreError: 0:0:(ldlm_lockd.c:344:waiting_locks_callback()) Skipped 491 previous similar messages
            [root@client-17 20140609-run1]# 
            

            I had been misled by the number of "lock callback timer" messages into thinking there were "thousands" of OSTs, when that count actually reflects the number of locks ...

            BTW, I see 25 OSTs and not 21:

            [root@client-17 20140609-run1]# grep -i evict lustre-log.c0-0c0s1n2.txt | head -1
            00000100:02020000:31.0:1402326083.337206:0:2533:0:(import.c:1370:ptlrpc_import_recovery_state_machine()) 167-0: tickfs-OST0000-osc-ffff88060e906400: This client was evicted by tickfs-OST0000; in progress operations using this service will fail.
            [root@client-17 20140609-run1]# grep -i evict lustre-log.c0-0c0s1n2.txt.end | awk '{print $9}' | sort
            tickfs-OST0000;
            tickfs-OST0001;
            tickfs-OST0002;
            tickfs-OST0003;
            tickfs-OST0004;
            tickfs-OST0005;
            tickfs-OST0006;
            tickfs-OST0007;
            tickfs-OST0008;
            tickfs-OST0009;
            tickfs-OST000a;
            tickfs-OST000b;
            tickfs-OST000c;
            tickfs-OST000d;
            tickfs-OST000e;
            tickfs-OST000f;
            tickfs-OST0010;
            tickfs-OST0011;
            tickfs-OST0012;
            tickfs-OST0013;
            tickfs-OST0014;
            tickfs-OST0015;
            tickfs-OST0016;
            tickfs-OST0017;
            tickfs-OST0018;
            [root@client-17 20140609-run1]# 
            
            bfaccini Bruno Faccini (Inactive) added a comment
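            (As a side note, the unique evicting targets can be counted directly with a one-liner along these lines, reusing the same log file as the awk command above:)

            grep -i evict lustre-log.c0-0c0s1n2.txt.end | awk '{print $9}' | sort -u | wc -l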

            The reproducers are IOR and simul, both running jobs that generate single shared files. They are executed on a single Cray cage system (22 nodes) with one router between the file system and the compute nodes. That router is used as a bridge between the Gemini interconnect and the InfiniBand fabric. For each job a test directory is created whose stripe count is the current maximum across the OSTs available. That current count is 21 OSTs so the directory is created with 20 stripes. This is the case for all job sets. The commands to execute the jobs are:

            aprun -n 288 /lustre/tick/stf008/scratch/jsimmons/IOR -a POSIX -i 5 -C -v -g -w -r -e -b 256m -t 4m -o /lustre/tick/stf008/scratch/jsimmons/test_ior/testfile.out

            aprun -n 112 /lustre/tick/stf008/scratch/jsimmons/simul -d /lustre/tick/stf008/scratch/jsimmons/test_simul

            For aprun, -n specifies the number of cores to use. Since each compute node has 16 cores, this means, for example, that the 288-core job uses only 18 nodes in total. The simul job uses only 7 nodes. All cores are used on each node.
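            (For reference, a test directory striped across all available OSTs, as described above, can be set up with something like the following; the path is taken from the IOR command line above, and a stripe count of -1 means "all OSTs":)

            # Stripe every file created in this directory across all OSTs.
            lfs setstripe -c -1 /lustre/tick/stf008/scratch/jsimmons/test_ior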

            The file system itself is made of 4 MDSs with a single MDT per MDS. The back-end storage array is a DDN 10K attached to 3 OSSs, with each OSS having 7 OSTs mapped to it, bringing the total number of OSTs to 21.

            Each MDT is formatted to 30GB and each OST to 7.1TB.
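            (The MDT/OST layout and sizes can be double-checked from any client with, for example, the following; the mount point is assumed from the paths above:)

            # Lists every MDT and OST with its size and usage; this should show
            # the 30GB MDTs and 7.1TB OSTs described above.
            lfs df -h /lustre/tick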

            The clients are running a 2.6 version from about 2 weeks ago and the file system is running with 2.5.60 from last Friday.

            simmonsja James A Simmons added a comment

            Also, what would be helpful would be a precise description of the configuration and reproducer you use, for example:

            - do you still use the "single shared file IOR job" you mentioned in LU-4584/LU-2827? If yes, can you fully detail its environment/parameters, including directories and file striping?

            - in the sets of logs you provided, it seems you use 22 clients, and your Lustre filesystem has 4 OSSs, each of them managing hundreds of OSTs (round-robin indexed?). Right?

            bfaccini Bruno Faccini (Inactive) added a comment

            James, can you update this ticket with any interesting details you can think of, please?

            bfaccini Bruno Faccini (Inactive) added a comment

            People

              bfaccini Bruno Faccini (Inactive)
              Votes: 0
              Watchers: 4
