[LU-4724] Test failure on test suite sanity-hsm, subtest test_71 Created: 06/Mar/14  Updated: 02/Aug/19  Resolved: 11/Mar/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.6.0, Lustre 2.5.1, Lustre 2.12.3
Fix Version/s: Lustre 2.6.0, Lustre 2.5.1

Type: Bug Priority: Blocker
Reporter: Maloo Assignee: Michael MacDonald (Inactive)
Resolution: Fixed Votes: 0
Labels: HSM

Severity: 3
Rank (Obsolete): 12983

 Description   

This issue was created by maloo for John Hammond <john.hammond@intel.com>

This issue relates to the following test suite run: http://maloo.whamcloud.com/test_sets/e688c5ae-a550-11e3-9e53-52540035b04c.

The sub-test test_71 failed with the following error:

Copytool sent malformed event: {"event_time": "2014-03-06 03:11:53 -0800", "event_type": "LOGGED_MESSAGE", "level": "INFO", "message": "lhsmtool_posix[8611]: waiting for message from kernel"


Info required for matching: sanity-hsm 71



 Comments   
Comment by John Hammond [ 06/Mar/14 ]

Writing to the event FIFO uses fprintf() with an unbuffered stream, and there is no synchronization around calls to llapi_hsm_write_json_event() that I can see. Since the copytool (CT) is multithreaded, expect more such malformed events.

Comment by Michael MacDonald (Inactive) [ 06/Mar/14 ]

Looking into this.

Comment by Michael MacDonald (Inactive) [ 07/Mar/14 ]

I've been able to reproduce this locally by running test_71 in a loop. Will implement a fix and soak it for a while this morning before pushing for review.

Comment by Michael MacDonald (Inactive) [ 07/Mar/14 ]

Pushed http://review.whamcloud.com/9553 for review. Soaked this locally for 64 runs with zero failures. Without the patch, I was getting failures about 33% of the time. Lesson learned.

Comment by Michael MacDonald (Inactive) [ 07/Mar/14 ]

jhammond: I've added you as a reviewer. Please take a look when you have a chance.

Comment by Jian Yu [ 09/Mar/14 ]

Lustre Build: http://build.whamcloud.com/job/lustre-b2_5/40/ (2.5.1 RC2)
Distro/Arch: RHEL6.5/x86_64

sanity-hsm test 71 hit the same failure:
https://maloo.whamcloud.com/test_sets/f3e026c4-a687-11e3-9d0d-52540035b04c

This is a regression on Lustre 2.5.1 RC2, introduced by the patch http://review.whamcloud.com/9512 for LU-4020.

Comment by Bob Glossman (Inactive) [ 11/Mar/14 ]

backport to b2_5:
http://review.whamcloud.com/9579

Comment by Bob Glossman (Inactive) [ 11/Mar/14 ]

I see that Michael already did a backport. Will abandon mine. Sorry for the confusion.

Comment by Peter Jones [ 11/Mar/14 ]

Landed for 2.5.1 and 2.6

Generated at Sat Feb 10 01:45:17 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.