Details
- Bug
- Resolution: Fixed
- Blocker
- Lustre 2.2.0
- None
- 3
- 4251
Description
This bug appeared after commit e8ffe16619baf1ef7c5c6b117d338956372aa752, "LU-884 clio: client in memory checksum". Unfortunately, our tests do not report the fsx failure in sanity-benchmark, and sanity-benchmark is not part of every autotest run.
The failure looks like this:
/usr/lib64/lustre/tests/sanity-benchmark.sh: line 186: 16826 Bus error (core dumped) fsx -c 50 -p 1000 -S $FSX_SEED -P $TMP -l $FSX_SIZE -N $(($FSX_COUNT * 100)) $testfile
Example of report (master):
https://maloo.whamcloud.com/test_sets/02485a3a-45d0-11e1-8d6e-5254004bbbd3
sanity-benchmark is reported green even though fsx failed as shown above. The same code recently landed on orion, and we have started to see the same issue there.
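The report above shows a test set marked green although fsx died with a bus error. The ticket does not say why the harness missed it, so the following is only a sketch of one way a shell test could make a signal death visible. `run_checked` is a hypothetical helper, not part of the Lustre test framework. It relies on the shell convention that a process killed by signal N is reported with exit status 128 + N.

```shell
#!/bin/bash
# Hypothetical helper (not from the Lustre test framework): run a command,
# report how it ended, and pass its exit status back to the caller.
run_checked() {
    "$@"
    local rc=$?
    if [ "$rc" -gt 128 ]; then
        # Status above 128 means the command was killed by a signal;
        # `kill -l` translates the status back into a signal name.
        echo "FAIL: '$1' killed by signal $(kill -l "$rc") (status $rc)"
    elif [ "$rc" -ne 0 ]; then
        echo "FAIL: '$1' exited with status $rc"
    else
        echo "PASS: '$1'"
    fi
    return "$rc"
}
```

A caller would wrap the benchmark command, for example `run_checked some_command args || exit 1`, so that a crash like the one above propagates to the overall test result instead of being lost.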
Issue Links
- is related to
- LU-2305 Test failure sanityn, test_16: fsx bus error, core dumped
- Resolved
- LU-884 Client In-Memory Data Checksum
- Resolved
We also trigger the same situation (osc.cur_{dirty|grant|lost_grant}_bytes = 0) on a CEA test system running our/Bull build of Lustre v2.1.2. This build integrates the LU-1299 (patch set 11) and ORNL-22 patches, but not the one for this LU-1028.
The very bad consequence, which does not clearly appear in this JIRA's comments, is that this situation causes file corruption on affected clients: applications and commands are still allowed to write to the cache, while the later asynchronous flushes never succeed (-EDQUOT), and this failure occurs silently and out of context.
As a workaround, we are able to recover grants and the associated working mechanism by issuing synchronous/O_DIRECT writes to the affected OSTs/OSCs. But this problem is a showstopper for the customer's migration to v2.1.2, since there is always a timing window in which corruption can occur.
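To illustrate the condition described above, here is a minimal sketch that picks out OSCs whose grant has dropped to zero. It assumes input in the `name=value` form printed by `lctl get_param`. The function name `zero_grant_oscs` and the file path in the comments are hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: read "name=value" lines from stdin and print the
# names of the cur_grant_bytes parameters whose value is 0.
zero_grant_oscs() {
    awk -F= '$1 ~ /cur_grant_bytes$/ && $2 == 0 { print $1 }'
}

# Assumed usage on an affected client:
#   lctl get_param osc.*.cur_grant_bytes | zero_grant_oscs
#
# The workaround in the comment above is a synchronous/O_DIRECT write to
# the affected OST, for instance (hypothetical path, GNU dd assumed):
#   dd if=/dev/zero of=/mnt/lustre/grant_probe bs=1M count=1 oflag=direct
```

This only detects the state. As the comment notes, corruption can still occur in the window before the workaround is applied.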