Details
- Type: Bug
- Resolution: Duplicate
- Priority: Minor
- None
- None
- None
- 3
- 14559
Description
As part of intensive testing of the LU-2827 patch, J. Simmons encountered a new issue when running with the patch on top of the latest/current 2.5.59/master version.
Looking at the information provided (ftp.whamcloud.com/uploads/LU-4584/20140609-run1.tbz and 20140609-run2.tbz), it appears that at some point the client's RPCs are no longer sent. The main consequence is that the client's lock cancel replies to the servers'/OSTs' blocking AST requests are not sent, leading to evictions.
The reason the client's RPCs are no longer sent cannot be determined with only the Lustre debug log level in use on the client (dlmtrace), but I can see, during the client's eviction handling, that these RPCs were sitting on the delayed queue.
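For context, the kind of client-side check this refers to can be sketched as follows; the dump path and grep patterns are only illustrative, not the exact ones used in this analysis:
# dump the client's Lustre debug buffer to a file (debug mask must include dlmtrace)
lctl dk > /tmp/client-debug.log
# look for the eviction handling and for requests left on the delayed queue
grep -i "evict" /tmp/client-debug.log | head
grep -i "delayed" /tmp/client-debug.log | head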
Attachments
Issue Links
- is related to LU-4861 "App hung - deadlock in cl_lock_mutex_get along cl_glimpse_lock path" (Resolved)
Activity
James,
Thanks for working and helping so hard on this, but I have an additional question: what made you point to the LU-4861 patch as a possible fix?
So far the results on my small-scale system are very promising using the patch from LU-4861. If all goes well I will move it to the next larger system. If that works, then we will use it in our test shot for Tuesday.
So I updated my 2.6 tree to what is current and tried to duplicate the problem on my small-scale system. I couldn't, so I moved to a larger system (not Titan), with the same result. I think it is the patch from LU-4861 that fixed this, so I'm going to try a back-port of the patch to 2.5 to see what happens.
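For what it's worth, a back-port of this kind is usually done by cherry-picking the Gerrit change onto the 2.5 maintenance branch of lustre-release; the change ref below is a placeholder, not the actual LU-4861 review number:
# hypothetical sketch: NN/NNNN/P stands for the real Gerrit change ref of the LU-4861 patch
git checkout b2_5
git fetch http://review.whamcloud.com/fs/lustre-release refs/changes/NN/NNNN/P
git cherry-pick FETCH_HEAD   # resolve any conflicts, then rebuild the client packages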
James, I wonder if, during some dedicated time, you could get a client crash-dump (with a resized debug buffer and full debug trace enabled, or at least "net+dlmtrace+rpctrace" in addition to the default mask) when such an eviction scenario occurs? This can be automated if you add http://review.whamcloud.com/#/c/8875/ on top of your client's distro.
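For reference, the requested client debug settings could be applied with something like the following (the buffer size is only an example):
lctl set_param debug_mb=512                       # enlarge the client debug buffer
lctl set_param debug="+net +dlmtrace +rpctrace"   # add the requested flags to the default mask
# reproduce the eviction, then dump the buffer:
lctl dk > /tmp/lustre-client-debug.log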
Sorry, you are right. Using 3 OSSs is the most recent setup, since I lost one OSS this past week. The original setup was 4 OSSs with 7 OSTs per OSS, so it should be 28 OSTs in total.
Hmm, only 3 OSSs? But you provided dmesg/debug logs for 4 OSSs, where we see the same client being evicted from all 4 at about the same time:
[root@client-17 20140609-run1]# ls -l
total 14438608
-rw-r--r-- 1 2004 2647 52742 Jun 9 08:06 clients-20140609
-rw-r--r-- 1 2004 2647 0 Jun 9 08:13 dmesg-mds1-20140609
-rw-r--r-- 1 2004 2647 1414 Jun 9 08:14 dmesg-oss1-20140609
-rw-r--r-- 1 2004 2647 1928 Jun 9 08:14 dmesg-oss2-20140609
-rw-r--r-- 1 2004 2647 2604 Jun 9 08:14 dmesg-oss3-20140609
-rw-r--r-- 1 2004 2647 1296 Jun 9 08:14 dmesg-oss4-20140609
-rw-r--r-- 1 2004 2647 18560470 Jun 9 08:32 lustre-log.c0-0c0s0n2.txt
-rw-r--r-- 1 2004 2647 2293127795 Jun 9 08:32 lustre-log.c0-0c0s1n0.txt
-rw-r--r-- 1 2004 2647 2292348834 Jun 9 08:33 lustre-log.c0-0c0s1n1.txt
-rw-r--r-- 1 2004 2647 2292352991 Jun 9 08:33 lustre-log.c0-0c0s1n2.txt
-rw-r--r-- 1 2004 2647 2292326850 Jun 9 08:34 lustre-log.c0-0c0s1n3.txt
-rw-r--r-- 1 2004 2647 584732 Jun 9 08:34 lustre-log.c0-0c0s2n2.txt
-rw-r--r-- 1 2004 2647 1442967552 Jun 9 08:34 lustre-log.c0-0c0s3n0.txt
-rw-r--r-- 1 2004 2647 0 Jun 9 08:35 lustre-log.c0-0c0s3n1.txt
-rw-r--r-- 1 2004 2647 0 Jun 9 08:35 lustre-log.c0-0c0s3n2.txt
-rw-r--r-- 1 2004 2647 0 Jun 9 08:36 lustre-log.c0-0c0s3n3.txt
-rw-r--r-- 1 2004 2647 0 Jun 9 08:36 lustre-log.c0-0c0s5n0.txt
-rw-r--r-- 1 2004 2647 0 Jun 9 08:36 lustre-log.c0-0c0s5n1.txt
-rw-r--r-- 1 2004 2647 0 Jun 9 08:36 lustre-log.c0-0c0s5n2.txt
-rw-r--r-- 1 2004 2647 0 Jun 9 08:36 lustre-log.c0-0c0s5n3.txt
-rw-r--r-- 1 2004 2647 0 Jun 9 08:36 lustre-log.c0-0c0s6n0.txt
-rw-r--r-- 1 2004 2647 0 Jun 9 08:36 lustre-log.c0-0c0s6n1.txt
-rw-r--r-- 1 2004 2647 0 Jun 9 08:36 lustre-log.c0-0c0s6n2.txt
-rw-r--r-- 1 2004 2647 0 Jun 9 08:36 lustre-log.c0-0c0s6n3.txt
-rw-r--r-- 1 2004 2647 0 Jun 9 08:36 lustre-log.c0-0c0s7n0.txt
-rw-r--r-- 1 2004 2647 0 Jun 9 08:36 lustre-log.c0-0c0s7n1.txt
-rw-r--r-- 1 2004 2647 0 Jun 9 08:36 lustre-log.c0-0c0s7n2.txt
-rw-r--r-- 1 2004 2647 0 Jun 9 08:36 lustre-log.c0-0c0s7n3.txt
-rw-r--r-- 1 2004 2647 406375578 Jun 9 08:13 mds1-log-20140609
-rw-r--r-- 1 2004 2647 944812799 Jun 9 08:13 oss1-log-20140609
-rw-r--r-- 1 2004 2647 939348575 Jun 9 08:12 oss2-log-20140609
-rw-r--r-- 1 2004 2647 905196600 Jun 9 08:08 oss3-log-20140609
-rw-r--r-- 1 2004 2647 942542266 Jun 9 08:08 oss4-log-20140609
[root@client-17 20140609-run1]# grep -A 1 -i evict dmesg-oss*
dmesg-oss1-20140609:[251714.315118] LustreError: 0:0:(ldlm_lockd.c:344:waiting_locks_callback()) ### lock callback timer expired after 148s: evicting client at 28@gni1 ns: filter-tickfs-OST0000_UUID lock: ffff88059326c7c0/0x2004179f53efbf08 lrc: 3/0,0 mode: PR/PR res: [0x3e9b:0x0:0x0].0 rrc: 13 type: EXT [469762048->585105407] (req 469762048->470810623) flags: 0x60000000010020 nid: 28@gni1 remote: 0xb50852018d8f69d9 expref: 50 pid: 30470 timeout: 4546331352 lvb_type: 0
dmesg-oss1-20140609-[251714.355075] LustreError: 0:0:(ldlm_lockd.c:344:waiting_locks_callback()) Skipped 608 previous similar messages
--
dmesg-oss2-20140609:[251704.627677] LustreError: 0:0:(ldlm_lockd.c:344:waiting_locks_callback()) ### lock callback timer expired after 154s: evicting client at 28@gni1 ns: filter-tickfs-OST0015_UUID lock: ffff88057b31abc0/0xc23127a58a2b8bb6 lrc: 3/0,0 mode: PR/PR res: [0x3e3b:0x0:0x0].0 rrc: 10 type: EXT [412090368->479199231] (req 412090368->413138943) flags: 0x60000000010020 nid: 28@gni1 remote: 0xb50852018d8f5e8e expref: 65 pid: 19858 timeout: 4546325396 lvb_type: 0
dmesg-oss2-20140609-[251704.667664] LustreError: 0:0:(ldlm_lockd.c:344:waiting_locks_callback()) Skipped 508 previous similar messages
--
dmesg-oss3-20140609:[251707.739428] LustreError: 0:0:(ldlm_lockd.c:344:waiting_locks_callback()) ### lock callback timer expired after 147s: evicting client at 28@gni1 ns: filter-tickfs-OST0002_UUID lock: ffff88056f8dc2c0/0x5619018c38d12223 lrc: 3/0,0 mode: PR/PR res: [0x3e99:0x0:0x0].0 rrc: 25 type: EXT [575668224->920649727] (req 575668224->576716799) flags: 0x60000000010020 nid: 28@gni1 remote: 0xb50852018d8f6c33 expref: 80 pid: 21886 timeout: 4546330359 lvb_type: 0
dmesg-oss3-20140609-[251707.779405] LustreError: 0:0:(ldlm_lockd.c:344:waiting_locks_callback()) Skipped 406 previous similar messages
--
dmesg-oss4-20140609:[251704.795841] LustreError: 0:0:(ldlm_lockd.c:344:waiting_locks_callback()) ### lock callback timer expired after 399s: evicting client at 28@gni1 ns: filter-tickfs-OST0013_UUID lock: ffff880511d64ac0/0x12a297bb77a5f496 lrc: 3/0,0 mode: PR/PR res: [0x3e59:0x0:0x0].0 rrc: 2 type: EXT [402653184->614465535] (req 402653184->403701759) flags: 0x60000000010020 nid: 28@gni1 remote: 0xb50852018d8eefc2 expref: 62 pid: 2561 timeout: 4546325612 lvb_type: 0
dmesg-oss4-20140609-[251704.835633] LustreError: 0:0:(ldlm_lockd.c:344:waiting_locks_callback()) Skipped 491 previous similar messages
[root@client-17 20140609-run1]#
This is also where I was misled by the number of "lock callback timer" messages into thinking of "thousands" of OSTs, when that number actually corresponds to the number of locks ...
BTW, I see 25 OSTs and not 21:
[root@client-17 20140609-run1]# grep -i evict lustre-log.c0-0c0s1n2.txt | head -1
00000100:02020000:31.0:1402326083.337206:0:2533:0:(import.c:1370:ptlrpc_import_recovery_state_machine()) 167-0: tickfs-OST0000-osc-ffff88060e906400: This client was evicted by tickfs-OST0000; in progress operations using this service will fail.
[root@client-17 20140609-run1]# grep -i evict lustre-log.c0-0c0s1n2.txt.end | awk '{print $9}' | sort
tickfs-OST0000; tickfs-OST0001; tickfs-OST0002; tickfs-OST0003; tickfs-OST0004; tickfs-OST0005; tickfs-OST0006; tickfs-OST0007; tickfs-OST0008; tickfs-OST0009; tickfs-OST000a; tickfs-OST000b; tickfs-OST000c; tickfs-OST000d; tickfs-OST000e; tickfs-OST000f; tickfs-OST0010; tickfs-OST0011; tickfs-OST0012; tickfs-OST0013; tickfs-OST0014; tickfs-OST0015; tickfs-OST0016; tickfs-OST0017; tickfs-OST0018;
[root@client-17 20140609-run1]#
The reproducers are IOR and simul, both running jobs that generate single shared files. They are executed on a single Cray cage system (22 nodes) with one router between the file system and the compute nodes. That router is used as a bridge between the Gemini interconnect and the InfiniBand fabric. For each job a test directory is created whose stripe count is the current maximum for the OSTs available. That current count is 21 OSTs, so the directory is created with 20 stripes. This is the case for all job sets. The commands to execute the jobs are:
aprun -n 288 /lustre/tick/stf008/scratch/jsimmons/IOR -a POSIX -i 5 -C -v -g -w -r -e -b 256m -t 4m -o /lustre/tick/stf008/scratch/jsimmons/test_ior/testfile.out
aprun -n 112 /lustre/tick/stf008/scratch/jsimmons/simul -d /lustre/tick/stf008/scratch/jsimmons/test_simul
For aprun, -n # is the number of cores to use. Since each compute node has 16 cores, this means, for example, that the 288-core job uses only 18 nodes in total. The simul job uses only 7 nodes. All cores are used on each node.
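To illustrate the wide-striped test directory mentioned above, the setup could look roughly like this; the exact setstripe invocation was not given, so the stripe count shown is only an example:
mkdir -p /lustre/tick/stf008/scratch/jsimmons/test_ior
# stripe the directory across (up to) all available OSTs; an explicit count such as -c 21 also works
lfs setstripe -c -1 /lustre/tick/stf008/scratch/jsimmons/test_ior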
The file system itself is made up of 4 MDSs with a single MDT per MDS. The back-end storage array is a DDN 10K attached to 3 OSSs, with each OSS having 7 OSTs mapped to it, bringing the total number of OSTs to 21.
Each MDT is formatted to 30 GB and each OST to 7.1 TB.
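As a cross-check of the OST count discussed above (21 vs. 25), the client's view can be listed with standard tools, assuming /lustre/tick is the client mount point:
lfs osts /lustre/tick               # list the OST indices/UUIDs the client knows about
lfs df /lustre/tick | grep -c OST   # count OSTs from the client's perspective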
The clients are running a 2.6 version from about 2 weeks ago and the file system is running with 2.5.60 from last Friday.
Also, what would be helpful is a precise description of the configuration/reproducer you use, such as:
- do you still use the "single shared file IOR job" you indicated in LU-4584/LU-2827? If yes, can you fully detail its environment/parameters, including directories and file striping?
- in the sets of logs you provided, it seems you use 22 clients, and your Lustre filesystem has 4 OSSs, each of them managing hundreds of OSTs (round-robin indexed?). Right?
James, can you please update it with any interesting details you can think of?
It was the testing with 2.5.60 clients. When I updated the clients to a newer version and could not reproduce the problem, I figured some patch that had landed fixed the problem. So I examined the list of patches merged since the broken client. The only one that made sense was LU-4861, because:
1) Since I was seeing evictions from the OSTs, it makes sense that a possible source of the problem could be the osc layer.
2) The patch deals with a deadlock issue, which could explain why I saw evictions.
It seems I'm familiar enough with the code to make a good enough educated guess at what will fix my problems.
I have been testing the LU-4861 patch with 2.5.2 clients with excellent success so far.