[LMR-3] lhsm archive of more than 15k small files ends up in errors Created: 21/Feb/17  Updated: 08/Feb/24  Resolved: 08/Feb/24

Status: Resolved
Project: Lemur
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Mangala Jyothi Bhaskar Assignee: Michael MacDonald (Inactive)
Resolution: Incomplete Votes: 0
Labels: None
Environment:

IEEL3.0 lustre client : lustre: 2.7.16.10
CentOS7.2 3.10.0-327.36.2.el7.x86_64
Lemur
Interconnect Intel Omnipath


Attachments: Microsoft Word 30k_test2_fail.rtf     Text File agent.txt     Text File lemur_rpms.txt     Text File lhsmd posix conf.txt     Text File mdslog.txt    
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

1) I created 30k files of 10 KB each using dd. I have the lhsmd POSIX plugin running in debug mode, where I monitor the progress of the archival job.

2) When I issue the command "lhsm archive *.bin" in the directory where the 30k files are located, I see ALERTs in the debug logs saying that some handlers were unable to find the files, although they do exist. However, archival of other files by other handlers still proceeds.

3) At the end of the archival, when I check the MDT, I see that not all 30k files were archived.
lctl get_param -n mdt.*.hsm.agents
uuid=f9ee32b4-d8fa-821d-e19c-9b0700d1e276 archive_id=ANY requests=[current:0 ok:4241 *errors:25759*]

4) However, 15k files of the same size were all successfully archived; from 30k up to 1M files, archival ends up in errors, and our test case calls for successful archival of 1M files. (See the sketch after the attachment list below for one way to enumerate the files that failed to archive.)

lctl get_param -n mdt.*.hsm.agents --> successful 15k run
uuid=7fb34125-b8fd-bbc4-1632-007ceaa3df78 archive_id=ANY requests=[current:0 ok:15000 errors:0]

5) Attachments
a) agent conf file
b) lhsm posix conf file
c) Alerts seen on the lemur archival logs
d) Lemur rpms installed
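
For reference, here is a rough sketch of how the files that failed to archive could be enumerated on the client. It assumes that "lhsm status <file>" prints a line containing "archived" once a file has been successfully archived:

# Rough sketch: list files that did not reach the "archived" state.
# Assumes `lhsm status <file>` reports "archived" for successfully archived files.
for f in *.bin; do
    lhsm status "$f" | grep -q archived || echo "$f"
done > /tmp/unarchived.txt
wc -l /tmp/unarchived.txt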

Please let us know if you need any more information.

Thanks



 Comments   
Comment by Michael MacDonald (Inactive) [ 21/Feb/17 ]

Hi, I'll take a look at this. At first glance, it seems that you've given us everything we need to try to reproduce this, so thanks for that. I'll post updates or questions as warranted.

Comment by Michael MacDonald (Inactive) [ 21/Feb/17 ]

Oh, one question: Am I correct in assuming that you are also running IEEL 3.0 on your MDS?

Comment by Mangala Jyothi Bhaskar [ 21/Feb/17 ]

Great! Thank you, looking forward to it. Yes, IEEL3.0. To be more precise, IEEL3.0 for CentOS7.2, kernel "3.10.0-327.el7_lustre.g993c615.x86_64" and lustre "2.7.15.3-3.10.0_327.el7".

 

The client has a slightly different kernel compared to the lustre servers. It looks like the lustre versions are also different.

The client (as mentioned in the description) has lustre 2.7.16.10.

The servers have lustre 2.7.15.3.

Comment by Michael MacDonald (Inactive) [ 23/Feb/17 ]

Hi Jyothi.

Sorry, it took me a bit to get to this. Rather than trying to mirror your environment exactly, I set up a Lustre 2.9.0 filesystem. With this configuration, I was unable to reproduce this problem. This leads me to suspect a problem in the version of lustre shipped in IEEL3.0 rather than in Lemur.

I will have to defer to Lustre support on this, and I will get them looped in.

For reference, here is what I did to test this out:

Installed Lustre 2.9.0 from https://build.hpdd.intel.com/job/lustre-b2_9/2/ on 4 nodes (MDS, MGS, OSS, client) as usual.

On my client, I installed Lemur RPMs from http://lemur-release.s3-website-us-east-1.amazonaws.com/release/0.5.2/ and configured it with settings similar to those attached to this ticket. I used a 2GB tmpfs as my archive root. I started the agent in debug mode with the following command:

lhsmd -debug 2>&1 | tee /tmp/lhsmd.log

Next, I created 30k files using the following command:

for i in $(seq 1 30000); do dd if=/dev/urandom of=$i.bin bs=10k count=1; done 

Next, I archived the files using:

 lhsm archive *.bin

Finally, I verified the files' state using:

lhsm status *.bin | grep archived | wc -l

... and the output was 30000 as expected.

I did not observe any errors or other unusual log output in the agent log.
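
For completeness, a couple of optional cross-checks (a rough sketch; it assumes the agent log was captured to /tmp/lhsmd.log as above):

grep -c '^ALERT' /tmp/lhsmd.log      # on the client: count agent-side alert lines, if any
lctl get_param -n mdt.*.hsm.agents   # on the MDS: the coordinator's ok/error counters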

Comment by Michael MacDonald (Inactive) [ 23/Feb/17 ]

One other thing occurs to me, actually. I notice that you appear to be using Lemur RPMs from a non-release build. The version is 0.5.1_2_g885da1d, which corresponds to https://github.com/intel-hpdd/lemur/commit/885da1d4f93e4e8f09181812d0a211a3cd544a63. Did you build this locally or pull it from the devel section of our release site? While it shouldn't be a problem to build locally, I suggest that you try the 0.5.2 RPMs I used in my test. There are no lemur code changes between the version you're using and the 0.5.2 release (just some packaging housekeeping), but I suppose it's possible that there could be some difference in build environments which contributes to the problem.
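
As a quick sanity check, the installed Lemur packages and the client-side Lustre version can be confirmed with something like the following (the package name pattern is an assumption; adjust it to match your local build):

rpm -qa | grep -i lemur     # installed Lemur packages and versions
lctl get_param version      # client-side Lustre version, e.g. "lustre: 2.7.16.10"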

Comment by Mangala Jyothi Bhaskar [ 24/Feb/17 ]

Sorry for the delay in getting back. Did you say you tested it on Lustre 2.9.0? We use IEEL, and even the latest IEEL release, which is IEEL 3.1.0.2, ships lustre rpms at 2.7.19.8; for example, you will see this in a lustre rpm that was built for us, "lustre-osd-ldiskfs-2.7.19.8-3.10.0_514.el7_lustre.g0afcb1e.x86_64_g0afcb1e.x86_64.rpm". What release of IEEL would have 2.9.0?

Yes, I built the rpms locally. About a month ago I checked out the Lemur code from the git master branch and built my own rpms, since we have a specific kernel and lustre client to adhere to. On the client node where we run Lemur, we are required to stick to kernel "3.10.0-327.36.2.el7" and lustre client "lustre: 2.7.16.10". We can, however, upgrade the lustre servers to IEEL3.1.0.2, which would be lustre 2.7.19.8.

So if I directly install the 0.5.2 RPMs from the location you pointed to, would I be able to run them on the above kernel and lustre client versions? If yes, that would be great, and I can try the new lemur rpms, since (on a separate note) I also had issues connecting to non-AWS S3, and the link suggests those rpms may include a fix for S3 region handling.

Comment by Michael MacDonald (Inactive) [ 24/Feb/17 ]

Hi. Yes, I tested it against 2.9.0. I wanted to see if the problem you reported still manifests in the most recent release of Lustre, which happens to be that community release. I do not know when IEEL will be rebased on 2.9.x.

The Lemur RPMs we build should work with any version of Lustre released since 2.6.0. They are not tied to any particular kernel or Lustre release. We haven't tested specifically against your version of Lustre, but I'm not aware of any reason that it wouldn't work just fine.

If the problem repeats with the 0.5.2 RPMs we've provided, then I think we'll have to dig into the Lustre side of things and get Lustre support involved. Out of curiosity, have you previously worked with the in-tree Lustre copytool (lhsmtool_posix)? I ask because it would be helpful to understand whether Lemur is replacing an existing solution or if this is completely new.

Comment by Mangala Jyothi Bhaskar [ 27/Feb/17 ]

From your first look at the logs and behavior, do you mostly suspect lustre or the lemur plugin?

Good to know the Lemur rpms are not tied to any kernel or lustre version. I will get the rpms from the link you pointed to and test them against our current lustre, or one version later than that, which would be 2.7.19.8. If the problem persists, maybe we can take it up with lustre support.

As far as I know, we have not worked extensively with the in-tree copytool either. We might have tested some HSM features years ago, so I would say this is completely new; we haven't released any copytools as part of our solutions yet.

Comment by Michael MacDonald (Inactive) [ 27/Feb/17 ]

Hi.

Well, as the co-developer of Lemur, my inclination is to say it must be the other software's fault.

In all seriousness, though, I don't see anything in the logs you've posted which indicates a problem in Lemur. This error seems to occur a lot:

ALERT 2017/02/03 21:12:41 /root/rpmbuild/BUILD/lemur-0.5.1_2_g885da1d/src/github.com/intel-hpdd/lemur/cmd/lhsmd/agent/agent.go:161: handler-19: begin failed: no such file or directory: AI: 58956d9a ARCHIVE [0x200001ca2:0x14bc5:0x0] 0,EOF []

Looking through our code, it appears that this error is coming from within liblustreapi rather than the lemur code. It may be helpful to look at the MDS logs as well to see if there are any relevant error messages appearing there. You're only using 1 MDS, correct? If you are using > 1 MDS, then it's possible that there is a problem with Lemur's support for that, but I know that was tested in the past.

So, that was a longish way of saying that I suspect the version of Lustre in IEEL more than Lemur at this point.
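
If it helps, here is roughly what I would capture on the active MDS the next time this reproduces. This is a minimal sketch; the hsm.actions parameter name is an assumption based on the hsm.agents parameter you already queried:

lctl get_param -n mdt.*.hsm.agents    # per-agent ok/error counters
lctl get_param mdt.*.hsm.actions      # pending/failed HSM actions (assumed parameter name)
dmesg -T > /tmp/mds-dmesg.txt         # kernel messages from the archive run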

Comment by Mangala Jyothi Bhaskar [ 27/Feb/17 ]

I see. As of now I do not have MDS logs. I will have to reproduce this to get more relevant MDS logs, since that test was about 2-3 weeks ago. We are in the middle of a big benchmark, and I still don't have my hands on the resources to test this again.

I am thinking, once I have access to the resources again, I will first use the Lemur rpms you used and then reproduce the issue and send some MDS logs your way. 

No, this is not a DNE setup. There is one MDT and one primary MDS at a time. We do have a pair of MDS nodes (configured for High Availability), but even so, only one MDS server manages the MDT at any given point in time. Not sure if this is what you were asking about.

 

Comment by Michael MacDonald (Inactive) [ 27/Feb/17 ]

Yes, I was asking if this was a DNE setup. As I was reading through the code surrounding the error message I referenced, I was looking for potential sources of the ENOENT error.

As you are not running with DNE, and this same code works fine with Lustre 2.9.0, I am again led to suspect that the problem is with the version of Lustre in IEEL. I will see if I can get some resources together to reproduce it on our side, but I think it will be faster in your environment.

Comment by Mangala Jyothi Bhaskar [ 23/Mar/17 ]

I re-ran the 30K test with the lemur 0.6 rpms and I still see the issue.

 

lctl get_param -n mdt.*.hsm.agents
uuid=7035a1ec-a1bf-36e9-83f5-9847469a03ea archive_id=ANY requests=[current:0 ok:5760 errors:24240]

 

It ended up not archiving about 24K files. The best case I have seen so far is 15k files, and all 15k are archived with 0 errors. As you said, it could be the lustre version; even the latest IEEL build we have ships lustre version 2.7.19.8. Is there a way we can find out whether this has been a known issue with HSM, or whether any fixes have gone in since 2.7 (since you said you didn't see this on 2.9), or could it be something specific to the IEEL distribution? Would some kind of logs help? I have attached the "dmesg -T" output from the MDS server, captured while this archival job was running ("mdslog.txt"); I hope it helps.
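
One way we could scan for HSM-related changes between those releases ourselves is a quick git log over the HSM coordinator code. This is a rough sketch; it assumes a local clone of the lustre-release tree, and the tag names and file paths shown are assumptions to be checked against the actual tree:

git clone git://git.whamcloud.com/fs/lustre-release.git
cd lustre-release
# Tag names (v2_7_0, v2_9_0) are assumptions; adjust to the actual release tags.
git log --oneline v2_7_0..v2_9_0 -- lustre/mdt/mdt_coordinator.c lustre/mdt/mdt_hsm*.c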

 

On the lemur side I see the same kind of error that I mentioned before: "ALERT 2017/03/23 18:26:03 /tmp/rpmbuild/BUILD/lemur-0.6.0/src/github.com/intel-hpdd/lemur/cmd/lhsmd/agent/agent.go:161: handler-34: begin failed: no such file or directory: AI: 58ccb253 ARCHIVE [0x2000088d1:0x15a3:0x0] 0,EOF []
DEBUG 13:26:03.248862 agent.go:152: handler-34: incoming: AI: 58ccb254 ARCHIVE [0x2000088d1:0x15a4:0x0] 0,EOF []". I remember you mentioned this is a common error.

 

At most, as of now, I could try the latest IEEL3.1 servers with lustre 2.7.19.8. If I still see the same issue, I might have to find dedicated hardware and set up a different lustre version.

Comment by Michael MacDonald (Inactive) [ 23/Mar/17 ]

Hmm, that's strange. I'll check with the Lustre engineering folks to see if they have any insights. The error messages in your MDS log sure look like a smoking gun to me...

Comment by Mangala Jyothi Bhaskar [ 27/Mar/17 ]

Michael, did you get a chance to discuss this with Lustre engineering? 
