[LU-6831] The ticket for tracking all DNE2 bugs Created: 09/Jul/15 Updated: 25/Feb/20 |
|
| Status: | Reopened |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.8.0, Lustre 2.9.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Di Wang | Assignee: | Di Wang |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | dne2, llnl | ||
| Attachments: |
|
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
This ticket is for tracking all DNE2 bugs. |
| Comments |
| Comment by James A Simmons [ 13/Jul/15 ] |
|
In my latest testing I'm running into this bug:
mdtest-1.8.3 was launched with 1 total task(s) on 1 nodes
It is very easy to reproduce. What I did was create a striped directory with a count of 2 at an index of 1. The reason is that I'm avoiding MDS0, which has the smallest MDT. |
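A minimal sketch of the reproduction steps described above (the mount point /mnt/lustre and directory names are illustrative):
# Create a directory striped across 2 MDTs, starting at MDT index 1 (skipping MDS0)
lfs mkdir -i 1 -c 2 /mnt/lustre/test
# Drive metadata load against it with mdtest
mdtest -I 10000 -i 5 -d /mnt/lustre/test/shared_10k_1 |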
| Comment by Di Wang [ 13/Jul/15 ] |
|
James: I can run this on my node.
[root@testnode mdtest]# df
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/mapper/vg_testnode-lv_root
27228028 10704408 15133848 42% /
tmpfs 4014860 0 4014860 0% /dev/shm
/dev/sda1 487652 48320 413732 11% /boot
192.168.167.1:/Users/wangdi/work
243358976 188184448 54918528 78% /work
/dev/loop2 133560 1904 121932 2% /mnt/mds3
/dev/loop3 133560 1908 121928 2% /mnt/mds4
/dev/loop4 358552 13900 324420 5% /mnt/ost1
/dev/loop5 358552 13904 324416 5% /mnt/ost2
/dev/loop0 133560 2192 121644 2% /mnt/mds1
/dev/loop1 133560 2040 121796 2% /mnt/mds2
testnode@tcp:/lustre 717104 27804 648836 5% /mnt/lustre
[root@testnode mdtest]# cd /work/lustre_release_work/lustre-release_new/lustre/tests/
[root@testnode tests]# pwd
/work/lustre_release_work/lustre-release_new/lustre/tests
[root@testnode tests]# dne2_2_mds_md_test^C
[root@testnode tests]# ../utils/lfs mkdir -i1 -c2 /mnt/lustre/test
[root@testnode tests]# /work/mdtest/
COPYRIGHT Makefile mdtest mdtest.1 README RELEASE_LOG scripts/
[root@testnode tests]# /work/mdtest/mdtest -I 10000 -i 5 -d /mnt/lustre/test/shared_10k_1
-- started at 07/12/2015 00:10:25 --
mdtest-1.9.3 was launched with 1 total task(s) on 1 node(s)
Command line used: /work/mdtest/mdtest -I 10000 -i 5 -d /mnt/lustre/test/shared_10k_1
Path: /mnt/lustre/test
FS: 0.7 GiB Used FS: 3.9% Inodes: 0.2 Mi Used Inodes: 0.5%
1 tasks, 10000 files/directories
SUMMARY: (of 5 iterations)
Operation Max Min Mean Std Dev
--------- --- --- ---- -------
Directory creation: 6005.343 5420.750 5707.103 218.750
Directory stat : 5607.789 5152.637 5289.370 166.391
Directory removal : 5475.998 5276.844 5368.197 78.798
File creation : 3563.760 3035.905 3204.669 194.226
File stat : 2883.332 2761.046 2820.534 41.375
File read : 3320.368 2823.954 3061.123 194.269
File removal : 5018.647 4533.453 4712.663 186.385
Tree creation : 4696.869 2688.656 3332.096 703.002
Tree removal : 2096.104 1976.581 2042.187 48.555
-- finished at 07/12/2015 00:11:53 --
Could you please check if these values have been set correctly? Thanks.
[root@testnode lustre-release_new]# ./lustre/utils/lctl get_param mdt.*.enable_remote_dir
mdt.lustre-MDT0000.enable_remote_dir=1
mdt.lustre-MDT0001.enable_remote_dir=1
mdt.lustre-MDT0002.enable_remote_dir=1
mdt.lustre-MDT0003.enable_remote_dir=1 |
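For comparison, if these values report 0 on a system, a sketch of enabling the tunable (run on each MDS; the wildcard matches all local MDTs):
# Allow creation of remote/striped directories on every MDT
lctl set_param mdt.*.enable_remote_dir=1 |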
| Comment by James A Simmons [ 13/Jul/15 ] |
|
Doh. I assumed remote_dir was enabled by default. I just set it. Seems to work now. The strange thing is that even with remote_dir=0 everywhere, when I set the directory MDS stripe count to > 1, it appeared to work as long as the index was zero. |
| Comment by James A Simmons [ 14/Jul/15 ] |
|
Yes, as root it works, but not as a regular user. I created a directory with count = 2 and index = 1, then used the -D flag so any directories created under my DNE2 directory would inherit the properties. As myself (non-root) I tried a mkdir and it failed. |
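A sketch of the sequence described above, assuming the lfs setdirstripe syntax of this Lustre version (directory names are illustrative):
# Striped directory across 2 MDTs, starting at index 1
lfs setdirstripe -c 2 -i 1 /mnt/lustre/dne2_dir
# Default layout so new subdirectories inherit the striping
lfs setdirstripe -D -c 2 /mnt/lustre/dne2_dir
# As a non-root user, this is the step that failed
mkdir /mnt/lustre/dne2_dir/subdir |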
| Comment by Di Wang [ 14/Jul/15 ] |
|
James: you need to set these values to -1 to make sure all users can create remote and striped directories.
[root@mds01 ~]# lctl get_param mdt.*.enable_remote_dir_gid
mdt.lustre-MDT0000.enable_remote_dir_gid=0
mdt.lustre-MDT0004.enable_remote_dir_gid=0 |
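A sketch of the suggested change (run on each MDS; -1 removes the gid restriction so any user may create remote and striped directories):
lctl set_param mdt.*.enable_remote_dir_gid=-1 |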
| Comment by James A Simmons [ 15/Jul/15 ] |
|
Nope. Creation of the directories appears to work only part of the time: sometimes it works and sometimes it does not. mdtest never runs successfully for me. |
| Comment by Di Wang [ 15/Jul/15 ] |
|
If you run the test as a non-root user, then you probably need this patch: http://review.whamcloud.com/#/c/13990/ Also, please set enable_remote_dir_gid=-1 on all of the MDTs. |
| Comment by James A Simmons [ 15/Jul/15 ] |
|
Patch 13990 did the trick. Now I can create DNE2 striped directories as myself. Thanks. |
| Comment by James A Simmons [ 20/Jul/15 ] |
|
I updated to your latest patches and lost the ability to create remote directories. Now I get the following errors:
[ 1016.185382] Lustre: 19975:0:(lmv_obd.c:297:lmv_init_ea_size()) sultan-clilmv-ffff88080ac8ec00: NULL export for 1 |
| Comment by Di Wang [ 20/Jul/15 ] |
|
Which patches? Could you please list your patches here? Thanks. |
| Comment by James A Simmons [ 21/Jul/15 ] |
|
http://review.whamcloud.com/#/c/13990 |
| Comment by Di Wang [ 21/Jul/15 ] |
|
Strange. Does your build include this patch: http://review.whamcloud.com/#/c/15269/ ? If it does, please remove that one and retry. Thanks. |
| Comment by James A Simmons [ 21/Jul/15 ] |
|
Nope. I found the source of the problems. It was the patch from |
| Comment by James A Simmons [ 22/Jul/15 ] |
|
Doing more testing I found that the patch from |
| Comment by James A Simmons [ 23/Jul/15 ] |
|
Now I'm seeing clients get evicted during heavy metadata operations. Di Wang, have you seen this behavior, and does a patch exist to address it? |
| Comment by Di Wang [ 23/Jul/15 ] |
|
James: What test did you run? Do you have the trace? I am not sure if there are such fixes. Thanks. |
| Comment by James A Simmons [ 24/Jul/15 ] |
|
I see what is triggering the client evictions. I'm getting these errors on the clients:
LustreError: 10306:0:(lmv_intent.c:234:lmv_revalidate_slaves()) sultan-clilmv-ffff8803ea284c00: nlink 1 < 2 corrupt stripe 1 [0x2800013ba:0x84ad:0x0]:[0x2400013c8:0x84ad:0x0] |
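One possible way to check for namespace inconsistencies of this kind, assuming LFSCK namespace scanning is available in this build (fsname 'sultan' is taken from the log; run on the MDS):
lctl lfsck_start -M sultan-MDT0000 -t namespace
# Monitor progress and results
lctl get_param mdd.sultan-MDT0000.lfsck_namespace |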
| Comment by Gerrit Updater [ 24/Jul/15 ] |
|
wangdi (di.wang@intel.com) uploaded a new patch: http://review.whamcloud.com/15720 |
| Comment by Di Wang [ 24/Jul/15 ] |
|
James: please try this patch to see if it works. Thanks. Unfortunately, I cannot reproduce this problem locally. |
| Comment by James A Simmons [ 29/Jul/15 ] |
|
Yes, the patch on LU-6831 helped with the revalidate FID bug. |
| Comment by James A Simmons [ 29/Jul/15 ] |
|
For my DNE2 testing, here is the list of patches I am running against: http://review.whamcloud.com/#/c/14346 |
| Comment by Jessica A. Popp (Inactive) [ 30/Jul/15 ] |
|
Translating James' list to ticket numbers for tracking purposes: |
| Comment by James A Simmons [ 03/Aug/15 ] |
|
The patch for this ticket landed, but I'd like to see this kept open to handle any further bug reports. |
| Comment by Di Wang [ 03/Aug/15 ] |
|
Sorry, that might be a mistake; the patch on this ticket has not actually landed. |
| Comment by James A Simmons [ 03/Aug/15 ] |
|
An update on my latest testing: I'm still seeing problems when creating 1 million+ files per directory. Clearing out the debug logs, I see the problem is only on the client side. When running an application I see:
Command line used: /lustre/sultan/stf008/scratch/jsimmons/mdtest -I 100000 -i 5 -d /lustre/sultan/stf008/scratch/jsimmons/dne2_4_mds_md_test/shared_1000k_10
10 tasks, 1000000 files/directories
After the test fails, any attempt to remove the files created by the test also fails. When I attempt to remove the files, I see the following errors in dmesg:
LustreError: 5430:0:(llite_lib.c:2286:ll_prep_inode()) new_inode -fatal: rc -2
Di Wang, have you seen these errors during your testing? |
| Comment by Di Wang [ 03/Aug/15 ] |
|
James: no, I did not see these errors. Could you please collect a -1 debug log on the client side when you remove one of these files? Thanks. |
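A sketch of collecting the requested log on the client (the file path is a placeholder for one of the failing files):
# Enable full debugging and clear the existing buffer
lctl set_param debug=-1
lctl clear
# Reproduce the -2 failure
rm /mnt/lustre/test/some_failing_file
# Dump the kernel debug buffer to a file
lctl dk > /tmp/client-debug.log |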
| Comment by Di Wang [ 06/Aug/15 ] |
|
James: any news on this -2 problem? Thanks. |
| Comment by James A Simmons [ 06/Aug/15 ] |
|
Testing to see if the problem exists on a directory striped across 8 MDS servers. Waiting for the results. I will push some log data to you soon. |
| Comment by Gerrit Updater [ 07/Aug/15 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/15720/ |
| Comment by James A Simmons [ 11/Aug/15 ] |
|
I attached my client logs to |
| Comment by James A Simmons [ 26/Aug/15 ] |
|
Due to the loss of some of my MDS servers, I attempted to create new striped directories today, but instead I get this error every time:
lfs setdirstripe -c 4 /lustre/sultan/stf008/scratch/jsimmons/dne2_4_mds_md_test
This happens even when I'm root. |
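With MDS servers lost, a first check is whether all MDTs are still reachable from the client; a sketch, assuming the mount point shown above:
# Per-MDT/OST usage; missing or inactive MDTs stand out here
lfs df /lustre/sultan
# Client-side device list; each MDT should have an mdc entry in the UP state
lctl dl | grep mdc |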
| Comment by Di Wang [ 26/Aug/15 ] |
|
Could you please get the debug log (-1 level) on MDT0? I assume jsimmons is on MDT0? Thanks. |
| Comment by James A Simmons [ 10/Sep/15 ] |
|
Here is the full log from the node that was crashing this morning. Just to let you know, the IOC_LMV_SETSTRIPE problem is no longer an issue. |
| Comment by James A Simmons [ 18/Dec/15 ] |
|
I updated my software stack and I'm seeing a lot of these on the OSS servers:
[94725.339746] Lustre: sultan-OST0004: already connected client sultan-MDT0000-mdtlov_UUID (at 10.37.248.155@o2ib1) with handle 0xb4b2e32f66f3ee41. Rejecting client with the same UUID trying to reconnect with handle 0x157ffaac64917bbd
It seems to be only MDS1 having this. On that MDS the error message is:
[95881.016995] LustreError: 137-5: sultan-MDT0001_UUID: not available for connect from 10.37.248.130@o2ib1 (no target). If you are running an HA pair check that the target is mounted on the other server. |
| Comment by James A Simmons [ 23/Dec/15 ] |
|
A soft lockup is happening on a spin lock:
Dec 22 10:54:26 feral17.ccs.ornl.gov kernel: [ 793.147894] BUG: soft lockup - CPU#0 stuck for 67s! [osp_up7-0:20904]
Dec 22 10:54:26 feral17.ccs.ornl.gov kernel: [ 793.152993] Pid: 20901, comm: osp_up4-0 Tainted: P --------------- 2.6.32 |
| Comment by Di Wang [ 28/Dec/15 ] |
|
James: I just updated the patch (http://review.whamcloud.com/#/c/16969/). Please retry. Thanks. |
| Comment by James A Simmons [ 29/Dec/15 ] |
|
Yep. I'm testing it right now. |