[LU-8081] Lustre 2.8.0, osd_trans_exec_op(), osd_trans_dump_creds() Created: 28/Apr/16  Updated: 18/Oct/16  Resolved: 10/Sep/16

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.8.0
Fix Version/s: Lustre 2.9.0

Type: Bug Priority: Minor
Reporter: christopher coffey Assignee: Bruno Faccini (Inactive)
Resolution: Fixed Votes: 0
Labels: None
Environment:

EL 6.7, x86_64, mellanox FDR, lustre 2.8.0 client and server. 1 combined mds/mdt, and 7 combined oss/ost.


Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

On the OSS's, a few times a day, I see in the logs:

Apr 27 16:00:57 oss1 kernel: Lustre: 41839:0:(osd_handler.c:1265:osd_trans_dump_creds()) create: 0/0/0, destroy: 0/0/0
Apr 27 16:00:57 oss1 kernel: Lustre: 41839:0:(osd_handler.c:1272:osd_trans_dump_creds()) attr_set: 1/1/0, xattr_set: 2/15/0
Apr 27 16:00:57 oss1 kernel: Lustre: 41839:0:(osd_handler.c:1282:osd_trans_dump_creds()) write: 2/12/0, punch: 1/4/0, quota 4/4/0
Apr 27 16:00:57 oss1 kernel: Lustre: 41839:0:(osd_handler.c:1289:osd_trans_dump_creds()) insert: 0/0/0, delete: 0/0/0
Apr 27 16:00:57 oss1 kernel: Lustre: 41839:0:(osd_handler.c:1296:osd_trans_dump_creds()) ref_add: 0/0/0, ref_del: 0/0/0
Apr 27 16:00:57 oss1 kernel: LustreError: 41839:0:(osd_internal.h:1073:osd_trans_exec_op()) blizzard-OST0000-osd: op = 7, rb = 7

Not sure what triggers it for sure. We have a mixed workload of mpi, and serial type jobs.

The values are always the same: write 2/10/0, punch 1/4/0 , etc. A limit I imagine. It almost seems like debug type logs that are not super important.



 Comments   
Comment by Peter Jones [ 28/Apr/16 ]

Bruno

Could you please advise?

Thanks

Peter

Comment by Andreas Dilger [ 28/Apr/16 ]

Hi Christopher,
thanks for filing this report. In addition to the above osd_trans_dump_creds() message, is there a stack trace or other debugging messages in the logs at the same time?

You are correct that this is partly a debugging message, but it indicates that the internal ldiskfs transaction accounting is incorrect, and in some rare cases this might result in a failure in lower layers of the code. In our regression testing this would trigger an immediate assertion, but for production use the Lustre-level assertion is disabled to allow the system to continue running in the vast majority of cases.

If you are able to take a short outage at some point in the future, you could set lctl set_param osd-ldiskfs.track_declares_assert=1 on the OSS nodes so that the failure triggers an assertion and dumps the stack of the server process running this transaction.

*Please Note* that track_declares_assert=1 will require that the OSS be rebooted after the assertion has triggered, and should only be enabled if you are interested to debug this problem further, and rebooting the OSS is acceptable for your environment. If not, then we can try and find some other way to debug this problem.

Comment by Andreas Dilger [ 28/Apr/16 ]

Another question that comes to mind is what the journal size is on your OSTs? This can be checked with dumpe2fs -h /dev/ostXXX | grep "Journal size" on each of the OST devices. Typically they should be 400MB in size, allowing individual transactions up to 100MB (25000 blocks), so it is a bit surprising that this debugging message is being printed with only a few tens of blocks reserved in this transaction.

Comment by Alex Zhuravlev [ 28/Apr/16 ]

Hmm, actually it looks like credits were consumed outside of our tracking mechanism:
we reserved 12 blocks for write, but at write we observed that <12 left in the transaction.
at the same time none of the previous operations consumed more than declared.

Comment by Gerrit Updater [ 28/Apr/16 ]

Andreas Dilger (andreas.dilger@intel.com) uploaded a new patch: http://review.whamcloud.com/19865
Subject: LU-8081 osd-ldiskfs: improve transaction debug message
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 8219dd1e5da64670ddcf84e7e779e315169e3b63

Comment by Gerrit Updater [ 10/Sep/16 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/19865/
Subject: LU-8081 osd-ldiskfs: improve transaction debug message
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 0cf638409812ce2533043908073b68698fb46c84

Comment by Peter Jones [ 10/Sep/16 ]

Landed for 2.9

Generated at Sat Feb 10 02:14:29 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.