[LU-4688] target_destroy_export() LBUG Created: 28/Feb/14  Updated: 12/Aug/15  Resolved: 23/Apr/15

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.6.0, Lustre 2.7.0, Lustre 2.5.3, Lustre 2.5.4
Fix Version/s: Lustre 2.8.0

Type: Bug Priority: Critical
Reporter: James Nunez (Inactive) Assignee: Mikhail Pershin
Resolution: Fixed Votes: 0
Labels: 22pl, mq115
Environment:

OpenSFS Cluster running Lustre master build #1914
Combined MGS/MDS, one OSS with two OSTs and one client.


Issue Links:
Related
is related to LU-5395 lfsck_start not progressing Resolved
is related to LU-5377 target_destroy_export()) ASSERTION( a... Resolved
Severity: 3
Rank (Obsolete): 12884

 Description   

I’ve created a brand new file system on a freshly installed system on the OpenSFS cluster. I run dd a few times and everything looks fine. On the OSS, I run

# lctl set_param fail_loc=0x1610
fail_loc=0x1610
# lctl get_param fail_loc
fail_loc=5648

fail_loc=0x1610 (OBD_FAIL_LFSCK_DANGLING) is supposed to create files with dangling references; note that get_param reports the value in decimal, and 5648 is 0x1610. Then I run dd

# dd if=/dev/urandom of=/lustre/scratch/a_3 count=1 bs=64k
1+0 records in
1+0 records out
65536 bytes (66 kB) copied, 0.0157425 s, 4.2 MB/s

and get no errors for the first 50 or so files written. After that, every dd command produces the following error

# dd if=/dev/urandom of=/lustre/scratch/m_502 count=1 bs=64k
dd: writing `/lustre/scratch/m_502': Cannot allocate memory
1+0 records in
0+0 records out
0 bytes (0 B) copied, 0.292437 s, 0.0 kB/s

I run LFSCK on the MDS

# lctl lfsck_start -M scratch-MDT0000 -A --reset --type layout
Started LFSCK on the device scratch-MDT0000: layout.
# lctl get_param -n mdd.scratch-MDT0000.lfsck_layout

and see that some number of dangling references were repaired. Up to this point, all of this is expected behavior.
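
For context, here is a minimal standalone sketch of how a single fail_loc knob can gate fault injection. It is in the spirit of, but not copied from, Lustre's OBD_FAIL_CHECK machinery; sim_fail_loc, sim_fail_check(), and SIM_FAIL_LFSCK_DANGLING are illustrative names only (the numeric value matches OBD_FAIL_LFSCK_DANGLING):

#include <stdio.h>

#define SIM_FAIL_LFSCK_DANGLING 0x1610   /* same numeric value as OBD_FAIL_LFSCK_DANGLING */

static unsigned long sim_fail_loc;       /* models the knob set by "lctl set_param fail_loc=..." */

/* Simplified check: is the configured fail_loc aimed at this fault site? */
static int sim_fail_check(unsigned long id)
{
        return sim_fail_loc == id;
}

int main(void)
{
        sim_fail_loc = 0x1610;           /* what "lctl set_param fail_loc=0x1610" models */

        printf("fail_loc reads back as %lu (decimal form of 0x%lx)\n",
               sim_fail_loc, sim_fail_loc);

        if (sim_fail_check(SIM_FAIL_LFSCK_DANGLING))
                printf("fault active: the OST object would not really be created,\n"
                       "leaving a dangling reference for LFSCK to repair\n");
        return 0;
}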

The problem happens when I try to turn the dangling-reference fault injection off. On the OST, I run “lctl set_param fail_loc=0” and get “fail_loc=0” back. I then run dd on the client, get the same “Cannot allocate memory” error as above, and running LFSCK again finds and corrects more dangling references. I’m told that files can still be created with dangling references due to preallocation (a rough sketch of that window follows), but that after 32 or so files it should stop.
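
As a rough standalone sketch of that explanation (WINDOW, cached_dangling, and the loop below are illustrative only, not Lustre's actual object precreate code): object IDs preallocated while the fault injection was active are consumed first, so a bounded number of new files still come out dangling after fail_loc is cleared.

#include <stdio.h>

/* Toy model of the preallocation window mentioned above: a cache of object
 * IDs filled while fail_loc=0x1610 was set gets drained before any freshly
 * created objects are used. */
#define WINDOW 32

int main(void)
{
        int cached_dangling = WINDOW;   /* IDs preallocated while the fault was active */
        int file;

        for (file = 1; file <= 40; file++) {
                if (cached_dangling > 0) {
                        cached_dangling--;
                        printf("file %2d: uses a preallocated (dangling) object\n", file);
                } else {
                        printf("file %2d: uses a freshly created object\n", file);
                }
        }
        return 0;
}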

After writing about 30 files, the dd command on the client froze, the OST crashed, and on the OST console I see

Message from syslogd@c11-ib at Feb 27 20:42:26 ...
 kernel:LustreError: 2082:0:(ldlm_lib.c:1311:target_destroy_export()) ASSERTION( atomic_read(&exp->exp_cb_count) == 0 ) failed: value: 1

Message from syslogd@c11-ib at Feb 27 20:42:26 ...
 kernel:LustreError: 2082:0:(ldlm_lib.c:1311:target_destroy_export()) LBUG

The OST came back online after a few minutes. I’ve reproduced this twice on two different clean file systems.



 Comments   
Comment by James Nunez (Inactive) [ 28/Feb/14 ]

More information:

<3>LustreError: 2280:0:(ldlm_resource.c:1154:ldlm_resource_get()) Skipped 7 previous similar messages
<0>LustreError: 2082:0:(ldlm_lib.c:1311:target_destroy_export()) ASSERTION( atomic_read(&exp->exp_cb_count) == 0 ) failed: value: 1
<0>LustreError: 2082:0:(ldlm_lib.c:1311:target_destroy_export()) LBUG
<4>Pid: 2082, comm: obd_zombid
<4>
<4>Call Trace:
<4> [<ffffffffa0512895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
<4> [<ffffffffa0512e97>] lbug_with_loc+0x47/0xb0 [libcfs]
<4> [<ffffffffa08656bc>] target_destroy_export+0x10c/0x150 [ptlrpc]
<4> [<ffffffffa080840d>] ofd_destroy_export+0x3d/0x1a0 [ofd]
<4> [<ffffffffa0635003>] obd_zombie_impexp_cull+0x2d3/0x5d0 [obdclass]
<4> [<ffffffffa0635365>] obd_zombie_impexp_thread+0x65/0x190 [obdclass]
<4> [<ffffffff81065df0>] ? default_wake_function+0x0/0x20
<4> [<ffffffffa0635300>] ? obd_zombie_impexp_thread+0x0/0x190 [obdclass]
<4> [<ffffffff8109aee6>] kthread+0x96/0xa0
<4> [<ffffffff8100c20a>] child_rip+0xa/0x20
<4> [<ffffffff8109ae50>] ? kthread+0x0/0xa0
<4> [<ffffffff8100c200>] ? child_rip+0x0/0x20
<4>
<0>Kernel panic - not syncing: LBUG
<4>Pid: 2082, comm: obd_zombid Not tainted 2.6.32-431.5.1.el6_lustre.gf1171da.x86_64 #1
<4>Call Trace:
<4> [<ffffffff81527983>] ? panic+0xa7/0x16f
<4> [<ffffffffa0512eeb>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
<4> [<ffffffffa08656bc>] ? target_destroy_export+0x10c/0x150 [ptlrpc]
<4> [<ffffffffa080840d>] ? ofd_destroy_export+0x3d/0x1a0 [ofd]
<4> [<ffffffffa0635003>] ? obd_zombie_impexp_cull+0x2d3/0x5d0 [obdclass]
<4> [<ffffffffa0635365>] ? obd_zombie_impexp_thread+0x65/0x190 [obdclass]
<4> [<ffffffff81065df0>] ? default_wake_function+0x0/0x20
<4> [<ffffffffa0635300>] ? obd_zombie_impexp_thread+0x0/0x190 [obdclass]
<4> [<ffffffff8109aee6>] ? kthread+0x96/0xa0
<4> [<ffffffff8100c20a>] ? child_rip+0xa/0x20
<4> [<ffffffff8109ae50>] ? kthread+0x0/0xa0
<4> [<ffffffff8100c200>] ? child_rip+0x0/0x20
Comment by James Nunez (Inactive) [ 04/Mar/14 ]

I cannot reproduce this error with the latest Lustre master build #1927.

Comment by Andreas Dilger [ 10/Mar/14 ]

Closing per the last comment. Can reopen if it is hit again.

Comment by Sarah Liu [ 15/Jul/14 ]

Hit the error again in b2_6 build #2 (2.6.0-RC2).
Server and client: RHEL6, ldiskfs.

https://testing.hpdd.intel.com/test_sets/bc2e0496-0c47-11e4-b749-5254006e85c2

Comment by Zhenyu Xu [ 31/Jul/14 ]

A debug patch has been pushed at http://review.whamcloud.com/11287

Comment by Sarah Liu [ 29/Sep/14 ]

Hit this on master branch build #2671:
https://testing.hpdd.intel.com/test_sets/86f40836-472d-11e4-a9ec-5254006e85c2

Comment by Jian Yu [ 30/Sep/14 ]

I found that replay-dual was not in any review* test groups. Should we add it back?

Comment by Oleg Drokin [ 11/Oct/14 ]

I noticed an interesting crash of this nature in my logs where a second, racing thread also hits an assertion.
Hopefully this is enough to identify the problematic code:

<0>[60318.480420] LustreError: 31766:0:(ldlm_lib.c:1298:target_destroy_export()) ASSERTION( atomic_read(&exp->exp_cb_count) == 0 ) failed: value: 1
<0>[60318.480900] LustreError: 13152:0:(genops.c:831:class_export_put()) ASSERTION( __v > 0 && __v < ((int)0x5a5a5a5a5a5a5a5a) ) failed: value: 0
<0>[60318.480903] LustreError: 13152:0:(genops.c:831:class_export_put()) LBUG
<4>[60318.480904] Pid: 13152, comm: jbd2/loop0-8
<4>[60318.480905] 
<4>[60318.480906] Call Trace:
<4>[60318.480919]  [<ffffffffa08348a5>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
<4>[60318.480929]  [<ffffffffa0834ea7>] lbug_with_loc+0x47/0xb0 [libcfs]
<4>[60318.480958]  [<ffffffffa094a251>] class_export_put+0x251/0x2c0 [obdclass]
<4>[60318.481001]  [<ffffffffa11fc424>] tgt_cb_new_client+0x154/0x2a0 [ptlrpc]
<4>[60318.481010]  [<ffffffffa071d9b4>] osd_trans_commit_cb+0xb4/0x350 [osd_ldiskfs]
<4>[60318.481022]  [<ffffffffa06caae9>] ldiskfs_journal_commit_callback+0x89/0xc0 [ldiskfs]
<4>[60318.481026]  [<ffffffffa0368b77>] jbd2_journal_commit_transaction+0x1377/0x1690 [jbd2]
<4>[60318.481030]  [<ffffffff8108292b>] ? try_to_del_timer_sync+0x7b/0xe0
<4>[60318.481035]  [<ffffffffa036df67>] kjournald2+0xb7/0x210 [jbd2]
<4>[60318.481038]  [<ffffffff81098f90>] ? autoremove_wake_function+0x0/0x40
<4>[60318.481042]  [<ffffffffa036deb0>] ? kjournald2+0x0/0x210 [jbd2]
<4>[60318.481044]  [<ffffffff81098c06>] kthread+0x96/0xa0
<4>[60318.481047]  [<ffffffff8100c24a>] child_rip+0xa/0x20
<4>[60318.481049]  [<ffffffff81098b70>] ? kthread+0x0/0xa0
<4>[60318.481050]  [<ffffffff8100c240>] ? child_rip+0x0/0x20
<4>[60318.481051] 
<0>[60318.488119] Kernel panic - not syncing: LBUG
<4>[60318.488121] Pid: 13152, comm: jbd2/loop0-8 Not tainted 2.6.32-rhe6.5-debug #4
<4>[60318.488123] Call Trace:
<4>[60318.488130]  [<ffffffff81512d12>] ? panic+0xa7/0x16f
<4>[60318.488142]  [<ffffffffa0834efb>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
<4>[60318.488168]  [<ffffffffa094a251>] ? class_export_put+0x251/0x2c0 [obdclass]
<4>[60318.488211]  [<ffffffffa11fc424>] ? tgt_cb_new_client+0x154/0x2a0 [ptlrpc]
<4>[60318.488223]  [<ffffffffa071d9b4>] ? osd_trans_commit_cb+0xb4/0x350 [osd_ldiskfs]
<4>[60318.488233]  [<ffffffffa06caae9>] ? ldiskfs_journal_commit_callback+0x89/0xc0 [ldiskfs]
<4>[60318.488238]  [<ffffffffa0368b77>] ? jbd2_journal_commit_transaction+0x1377/0x1690 [jbd2]
<4>[60318.488243]  [<ffffffff8108292b>] ? try_to_del_timer_sync+0x7b/0xe0
<4>[60318.488248]  [<ffffffffa036df67>] ? kjournald2+0xb7/0x210 [jbd2]
<4>[60318.488250]  [<ffffffff81098f90>] ? autoremove_wake_function+0x0/0x40
<4>[60318.488254]  [<ffffffffa036deb0>] ? kjournald2+0x0/0x210 [jbd2]
<4>[60318.488257]  [<ffffffff81098c06>] ? kthread+0x96/0xa0
<4>[60318.488260]  [<ffffffff8100c24a>] ? child_rip+0xa/0x20
<4>[60318.488262]  [<ffffffff81098b70>] ? kthread+0x0/0xa0
<4>[60318.488264]  [<ffffffff8100c240>] ? child_rip+0x0/0x20
Comment by Jian Yu [ 11/Nov/14 ]

Lustre Build: https://build.hpdd.intel.com/job/lustre-b2_5/100/
Distro/Arch: RHEL6.5/x86_64

replay-dual test 24 hit the same failure again:
https://testing.hpdd.intel.com/test_sets/a6c1b3de-68c5-11e4-a63a-5254006e85c2

Comment by Jian Yu [ 02/Dec/14 ]

More instances on the Lustre b2_5 branch:
https://testing.hpdd.intel.com/test_sets/ab3fa49a-79df-11e4-807e-5254006e85c2
https://testing.hpdd.intel.com/test_sets/52ae6fde-7ac8-11e4-925f-5254006e85c2
https://testing.hpdd.intel.com/test_sets/82c46bfa-7de4-11e4-aa98-5254006e85c2

Comment by Jian Yu [ 12/Dec/14 ]

Another instance on the Lustre b2_5 branch:
https://testing.hpdd.intel.com/test_sets/5651325e-805c-11e4-a434-5254006e85c2

Comment by Gerrit Updater [ 10/Feb/15 ]

Mike Pershin (mike.pershin@intel.com) uploaded a new patch: http://review.whamcloud.com/13706
Subject: LU-4688 mdt: remove export_put() from mdt_export_evict()
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 9b52a1ae7aad617bdf95d88bf1d2ae9648733392

Comment by Mikhail Pershin [ 10/Feb/15 ]

I hit this issue during one of the Maloo runs for an unrelated patch: https://testing.hpdd.intel.com/test_sessions/a37ccdb6-b03d-11e4-a3a2-5254006e85c2 and did some investigation. I think the cause is the extra call to class_export_put() in mdt_export_evict(). I am not sure whether this has been wrong since that code was added or whether something changed later and this code was never updated. Everywhere else class_fail_export() is called, it is followed by class_export_put(), but only because the export reference was taken in the same function specifically for class_fail_export() to use. In mdt_export_evict(), however, the put drops the export reference taken by the caller, so the caller drops it a second time and causes this issue.
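
To make the imbalance concrete, here is a minimal standalone model of the reference counting described above (toy_export, toy_export_put(), and the counts are illustrative and only loosely named after exp_refcount/exp_cb_count; this is not Lustre code): if the evict path drops a reference it did not take, the refcount reaches zero while a transaction commit callback still holds the export, matching both the target_destroy_export() assertion in this ticket and the class_export_put() underflow in Oleg's log.

#include <stdio.h>
#include <stdatomic.h>

/* Toy export: a refcount plus a count of commit callbacks still holding it. */
struct toy_export {
        atomic_int refcount;
        atomic_int cb_count;
};

static void toy_export_put(struct toy_export *exp)
{
        int v = atomic_fetch_sub(&exp->refcount, 1) - 1;

        if (v < 0) {
                printf("would LBUG: refcount underflow (class_export_put() analogue)\n");
                return;
        }
        if (v == 0) {
                if (atomic_load(&exp->cb_count) != 0)
                        printf("would LBUG: cb_count == %d at destroy time "
                               "(target_destroy_export() analogue)\n",
                               atomic_load(&exp->cb_count));
                else
                        printf("export destroyed cleanly\n");
        }
}

int main(void)
{
        struct toy_export exp;

        /* One reference held by the caller of the evict path, and one held by
         * a transaction commit callback that has not run yet (cb_count == 1). */
        atomic_init(&exp.refcount, 2);
        atomic_init(&exp.cb_count, 1);

        toy_export_put(&exp);  /* the extra put in the evict helper (the bug) */
        toy_export_put(&exp);  /* the caller drops the reference it took:
                                * refcount hits 0 while cb_count is still 1 */
        toy_export_put(&exp);  /* the commit callback's put then underflows it */
        return 0;
}

This is consistent with the fix at http://review.whamcloud.com/13706, which removes the extra class_export_put() from mdt_export_evict() so that ownership of the caller's reference stays with the caller.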

Comment by Gerrit Updater [ 27/Mar/15 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/13706/
Subject: LU-4688 mdt: remove export_put() from mdt_export_evict()
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 76a8694321b07c360b238b73676928401ce837f8

Comment by Peter Jones [ 23/Apr/15 ]

Landed for 2.8
