[LU-6634] (osd_handler.c:901:osd_trans_start()) ASSERTION( get_current()->journal_info == ((void *)0) ) failed: when reaching Catalog full condition Created: 23/May/15  Updated: 18/Dec/15  Resolved: 04/Nov/15

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.7.0, Lustre 2.8.0
Fix Version/s: Lustre 2.8.0

Type: Bug Priority: Critical
Reporter: Bruno Faccini (Inactive) Assignee: Alex Zhuravlev
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Related
is related to LU-6556 changelog catalog corruption if all p... Resolved
is related to LU-7138 LBUG: (osd_handler.c:1017:osd_trans_s... Resolved
is related to LU-6954 LustreError: 12934:0:(mdd_device.c:30... Closed
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

I found this problem during my testing of my patch for LU-6556, when also trying verify a Catalog full condition ends up with ENOSPC.

Having a look to the code, this happens because in the error path for llog_cat_new_log(), llog_destroy() is called to destroy the new plain LLOG for which the reference can't be recorded into Catalog because there is no slot available to do so.
Doing so triggers the Assertion because there is already a started transaction from llog_cat_add(), when llog_destroy() wants to start its own transaction.



 Comments   
Comment by Bruno Faccini (Inactive) [ 23/May/15 ]

The full LBUG/Assert signature is :

LustreError: 11337:0:(llog_osd.c:530:llog_osd_write_rec()) no free catalog slots for log...
LustreError: 11337:0:(osd_handler.c:901:osd_trans_start()) ASSERTION( get_current()->journal_info == ((void *)0) ) failed:
LustreError: 11337:0:(osd_handler.c:901:osd_trans_start()) LBUG
Pid: 11337, comm: lctl

Call Trace:
 [<ffffffffa042f875>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
 [<ffffffffa042fe77>] lbug_with_loc+0x47/0xb0 [libcfs]
 [<ffffffffa12616cc>] osd_trans_start+0x25c/0x660 [osd_ldiskfs]
 [<ffffffffa054ef0a>] llog_osd_destroy+0x42a/0xd40 [obdclass]
 [<ffffffffa0548e6c>] llog_cat_new_log+0x1ec/0x710 [obdclass]
 [<ffffffffa054949a>] llog_cat_add_rec+0x10a/0x450 [obdclass]
 [<ffffffffa05499f5>] llog_cat_add+0x215/0x380 [obdclass]
 [<ffffffffa16c8905>] llog_test_9+0xf15/0x11a0 [llog_test]
 [<ffffffffa054a200>] ? llog_ctxt_destroy+0x160/0x1f0 [obdclass]
 [<ffffffffa16c55ff>] ? verify_handle+0x1f/0x240 [llog_test]
 [<ffffffffa16cde2a>] llog_run_tests+0x165a/0x1a00 [llog_test]
 [<ffffffffa16ce979>] llog_test_setup+0x7a9/0x8f0 [llog_test]
 [<ffffffffa05749cb>] obd_setup+0x19b/0x290 [obdclass]
 [<ffffffff8116fb23>] ? kmem_cache_alloc_trace+0x1a3/0x1b0
 [<ffffffffa055f9db>] ? class_new_export+0x78b/0x9a0 [obdclass]
 [<ffffffffa0574cc8>] class_setup+0x208/0x870 [obdclass]
 [<ffffffffa057c59c>] class_process_config+0x113c/0x27c0 [obdclass]
 [<ffffffff8117008c>] ? __kmalloc+0x20c/0x220
 [<ffffffff8117008c>] ? __kmalloc+0x20c/0x220
 [<ffffffffa0556a65>] class_handle_ioctl+0xbc5/0x21b0 [obdclass]
 [<ffffffffa053e2ab>] obd_class_ioctl+0x4b/0x190 [obdclass]
 [<ffffffff8119dfc2>] vfs_ioctl+0x22/0xa0
 [<ffffffff8119e164>] do_vfs_ioctl+0x84/0x580
 [<ffffffff8119e6e1>] sys_ioctl+0x81/0xa0
 [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b

and if required, I can provide a test to be added in llog_test to create the catalog full situation.

Comment by Mikhail Pershin [ 24/May/15 ]

this is obvious bug, log_destroy() shouldn't be called in error path in this function because it starts new transaction while in context of another one.

Comment by Mikhail Pershin [ 08/Jun/15 ]

Since this llog cannot be deleted inside current transaction, I propose to mark it as 'invalid' or 'orphaned' and delete outside this call, probably in llog processing. It can be also deleted by lfsck in case of server reboot also by checking such flag.

Comment by Mikhail Pershin [ 11/Sep/15 ]

This can be fixed now with simple fix after commit made by Wang Di for llog changes related to DNE. We have to use just llog_trans_destroy() there.

Comment by Di Wang [ 14/Sep/15 ]

Actually the patch on LU-6846 will fix this issue.http://review.whamcloud.com/16333

The idea is that we do not need take care of the created llog object, because if llog_cat_new_log() fails, and the log handle has been added to phd_entry, so llog_cat_close() will take care of the created object anyway.

Comment by Alex Zhuravlev [ 15/Sep/15 ]

Di, llog_cat_close() is called at mount. meaning llog_cat_new_log() leaves an orphan?

Comment by Gerrit Updater [ 15/Sep/15 ]

Alex Zhuravlev (alexey.zhuravlev@intel.com) uploaded a new patch: http://review.whamcloud.com/16427
Subject: LU-6634 llog: destroy plain llog if init fails
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 7edf0a47546b03ac886500d1b78f1cb01084850e

Comment by Di Wang [ 15/Sep/15 ]
Di, llog_cat_close() is called at mount. meaning llog_cat_new_log() leaves an orphan?

hmm, not sure I follow, you mean called at umount? why can not this code take care of this?

        list_for_each_entry_safe(loghandle, n, &cathandle->u.chd.chd_head,
                                 u.phd.phd_entry) {
                struct llog_log_hdr     *llh = loghandle->lgh_hdr;
                int                      index;

                /* unlink open-not-created llogs */
                list_del_init(&loghandle->u.phd.phd_entry);
                llh = loghandle->lgh_hdr;
                if (loghandle->lgh_obj != NULL && llh != NULL &&
                    (llh->llh_flags & LLOG_F_ZAP_WHEN_EMPTY) &&
                    (llh->llh_count == 1)) {
                        rc = llog_destroy(env, loghandle);

Comment by Alex Zhuravlev [ 15/Sep/15 ]

because this code is called at umount (when we close the llog). and of course, this is very different transaction.

Comment by Di Wang [ 15/Sep/15 ]
because this code is called at umount (when we close the llog). and of course, this is very different transaction.

Well, it is not an orphan, and will be destroyed during umount. TBH, I do not think it is a big deal. The reason I am a bit reluctant is that if you do destroy here, you have to reserve the credit for this destroy here? Btw: I did not see it declare the destroy in llog_cat_decalre_add_rec()?

Comment by Alex Zhuravlev [ 15/Sep/15 ]

there is no need to declare that as it's a reverse operation and it doesn't consume additional credits. it's an orphan because it isn't referenced by some on-disk structure. so if we crash before umount this object is leaked. I agree this isn't a very big deal, but if this can be fixed easily, why not?

Comment by Gerrit Updater [ 03/Nov/15 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/16427/
Subject: LU-6634 llog: destroy plain llog if init fails
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 9126d8159d5d5b61a600e7427d0c173084a710e6

Comment by Peter Jones [ 04/Nov/15 ]

Landed for 2.8

Generated at Sat Feb 10 02:01:54 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.