[LU-5674] Maloo test report should include zfs debugging data when FSTYPE=zfs Created: 14/Mar/14 Updated: 16/Mar/16 Resolved: 15/Jul/15 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.7.0, Lustre 2.5.4 |
| Type: | Improvement | Priority: | Critical |
| Reporter: | Isaac Huang (Inactive) | Assignee: | Minh Diep |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | prz, triaged | ||
| Rank (Obsolete): | 1854 |
| Description |
|
If I haven't missed something, zfs debugging data hasn't been included in test reports, e.g.: It'd be very useful to have a tarball of /proc/spl/. Lots of useful data to troubleshoot ZFS problems can be found under that directory, e.g. dmu_tx_assign delay histogram. |
| Comments |
| Comment by Isaac Huang (Inactive) [ 25/Mar/14 ] |
|
Some of the test failures (e.g. |
| Comment by Mike Stok (Inactive) [ 29/Apr/14 ] |
|
I would like to understand this a little better. At the end of an autotest run with ZFS, autotest could just include the content of /proc/spl/... in the tarball it sends over to maloo, and maloo could make it available to download. (this is how I read the original request - Mike) How much data is involved, and would that data in isolation be useful? Would this be sufficient for a start, or do we need to include more information (for example the kernel which was running)? |
| Comment by Isaac Huang (Inactive) [ 06/May/14 ] |
|
Yes, a tgz of /proc/spl/ should be sufficient, and it's necessary only for failed tests. Everything under /proc/spl/ is text, so the total size shouldn't be large once compressed. |
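A minimal sketch of the tgz collection described above, in Python rather than in the test-framework's own shell. The function name `archive_tree` and the output path are hypothetical and are not taken from the landed patch. One detail worth noting: files under /proc report a size of 0 to stat(), so the sketch reads each file into memory and stores it with its real length instead of letting tarfile trust the reported size.

```python
import io
import os
import tarfile


def archive_tree(src_dir, dest_tgz):
    """Pack every readable regular file under src_dir into a gzip'd tarball.

    Hypothetical helper, not part of Lustre or autotest.
    Returns the list of member names that were archived.
    """
    src_dir = os.path.abspath(src_dir)
    base = os.path.dirname(src_dir)
    members = []
    with tarfile.open(dest_tgz, "w:gz") as tar:
        for root, _dirs, files in os.walk(src_dir):
            for name in sorted(files):
                path = os.path.join(root, name)
                try:
                    with open(path, "rb") as f:
                        data = f.read()
                except OSError:
                    continue  # skip entries that cannot be read
                # Store the real byte count, since stat() on proc files says 0.
                info = tarfile.TarInfo(os.path.relpath(path, base))
                info.size = len(data)
                tar.addfile(info, io.BytesIO(data))
                members.append(info.name)
    return members
```

On a ZFS server node this might be called as `archive_tree("/proc/spl", "/tmp/spl-debug.tgz")` from a failure handler, so that passing tests pay no cost.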
| Comment by Minh Diep [ 15/Aug/14 ] |
|
My take on this is that we should start this in the Lustre test-framework and generate such a file under ZFS tests. After that we can see whether autotest automatically grabs the files and sends them to maloo. |
| Comment by Isaac Huang (Inactive) [ 05/Sep/14 ] |
|
Another piece of vital debug information to collect is outputs from "zpool events -v", available since zfs 0.6.3. Recently in debugging a few ZFS issues (e.g. |
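A hedged sketch of capturing the "zpool events -v" output mentioned above into a file next to the other logs. The helper name `save_command_output` and its parameters are hypothetical; only the `zpool events -v` command itself comes from the ticket. The command is a parameter so the helper can be exercised on a machine without ZFS installed.

```python
import subprocess

# Command named in the ticket; available since zfs 0.6.3 per the comment above.
ZPOOL_EVENTS_CMD = ["zpool", "events", "-v"]


def save_command_output(out_path, cmd=ZPOOL_EVENTS_CMD, timeout=60):
    """Run cmd and write its combined stdout/stderr to out_path.

    Returns True if the command ran and exited with status 0, False if it
    could not be started, timed out, or exited non-zero.
    """
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            timeout=timeout,
            check=False,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    with open(out_path, "wb") as f:
        f.write(result.stdout)
    return result.returncode == 0
```

The timeout is there because log collection typically runs after a failure, when a hung pool could otherwise block the collector indefinitely.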
| Comment by Isaac Huang (Inactive) [ 09/Oct/14 ] |
|
Please also set the ZFS module option zfs_txg_history=3 on the MDS and OSS. Without this option, some debug information will not be exported in the proc file. |
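A small sketch of how the option above could be expressed and checked, assuming the standard Linux conventions of an `options` line in a modprobe.d file and of live module parameters appearing under /sys/module/<module>/parameters/. Both helper names are hypothetical, and the ticket does not say how the landed patch actually sets the option.

```python
import os


def modprobe_options_line(module, params):
    """Render one 'options' line in modprobe.d syntax."""
    settings = " ".join(f"{key}={value}" for key, value in params.items())
    return f"options {module} {settings}"


def read_module_param(module, name, sysfs_root="/sys/module"):
    """Return the live value of a module parameter, or None if unreadable."""
    path = os.path.join(sysfs_root, module, "parameters", name)
    try:
        with open(path) as f:
            return f.read().strip()
    except OSError:
        return None  # module not loaded, or parameter not present
```

For example, `modprobe_options_line("zfs", {"zfs_txg_history": 3})` yields `options zfs zfs_txg_history=3`, and `read_module_param("zfs", "zfs_txg_history")` can confirm on the MDS and OSS that the value took effect before a test run starts.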
| Comment by Minh Diep [ 15/Oct/14 ] |
| Comment by Peter Jones [ 30/Oct/14 ] |
|
Landed for 2.7 |
| Comment by Jian Yu [ 05/Nov/14 ] |
|
Here is the back-ported patch for Lustre b2_5 branch: http://review.whamcloud.com/12590 |
| Comment by Isaac Huang (Inactive) [ 01/Dec/14 ] |
|
Looks like ZFS info is missing from this report: Or have I missed something? |
| Comment by Minh Diep [ 01/Dec/14 ] |
|
The test that you mentioned doesn't follow the test-framework way of starting a test, so the ZFS log collection was not called. Additionally, the test timed out, which could also mean that the log would not be collected at the end if the client crashed. |
| Comment by Isaac Huang (Inactive) [ 01/Dec/14 ] |
|
Did you mean that even if the test I mentioned had failed a different way (i.e. not a timeout, so it'd be possible to collect the logs) the zfs logs would still not be collected? If yes, does it apply to all Maloo tests triggered from Gerrit? |
| Comment by Minh Diep [ 02/Dec/14 ] |
|
Sorry for not being clear. After looking at this, I think it likely that the test timing out caused the ZFS log not to be collected. If you find a case where a test failed but no log was collected, please open a new ticket instead of reopening this one. I believe this enhancement is complete. |
| Comment by Isaac Huang (Inactive) [ 02/Dec/14 ] |
|
When you said "the test timed out caused the zfs log to not collected", did you mean:
In the report I mentioned above, the OSS was in a good state; there was a deadlock on the MDS, which made some service threads unresponsive, but user space processes should still have worked. In addition, dmesg and Lustre debug logs were all available for both the OSS and the MDS, so why weren't the ZFS logs available as well? |
| Comment by Gerrit Updater [ 15/Jan/15 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/12590/ |
| Comment by Isaac Huang (Inactive) [ 16/Jan/15 ] |
|
https://testing.hpdd.intel.com/test_sets/deca9712-9bc1-11e4-857a-5254006e85c2 In the test report above, I couldn't find any of the ZFS data requested here. Since dmesg and other data that require a working user space were all there, I'd expect the ZFS data to be available as well. Please take a look - the absence of such data made it harder to debug. Thanks! |
| Comment by Minh Diep [ 03/Jun/15 ] |
|
Since the test timed out, there isn't any way to collect the ZFS log. |
| Comment by Minh Diep [ 15/Jul/15 ] |
|
This ticket has already landed in 2.7, etc. We cannot include any log if a node has crashed (i.e. timed out). This ticket should be closed. |