  Lustre / LU-2163

racer test_1: LBUG: ASSERTION( !list_empty(&job->js_list) ) failed

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • Fix Version/s: Lustre 2.3.0, Lustre 2.4.0
    • Affects Version/s: Lustre 2.3.0, Lustre 2.4.0
    • Labels: None
    • Severity: 3
    • Rank (Obsolete): 5191

    Description

      This issue was created by maloo for yujian <yujian@whamcloud.com>

      This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/aeb1821c-1472-11e2-8ca0-52540035b04c.

      The sub-test test_1 hung, and the console log on the MDS showed:

      15:23:56:LustreError: 4359:0:(lprocfs_jobstats.c:248:lprocfs_job_stats_log()) ASSERTION( !list_empty(&job->js_list) ) failed: 
      15:23:56:LustreError: 4359:0:(lprocfs_jobstats.c:248:lprocfs_job_stats_log()) LBUG
      15:23:56:Pid: 4359, comm: mdt00_022
      15:23:56:
      15:23:56:Call Trace:
      15:23:56: [<ffffffffa0f3d905>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
      15:23:56: [<ffffffffa0f3df17>] lbug_with_loc+0x47/0xb0 [libcfs]
      15:23:56: [<ffffffffa1043c91>] lprocfs_job_stats_log+0x711/0x870 [obdclass]
      15:23:56: [<ffffffffa097d6a3>] mdt_counter_incr+0xc3/0xe0 [mdt]
      15:23:56: [<ffffffffa0954511>] mdt_getattr_internal+0x171/0xf10 [mdt]
      15:23:56: [<ffffffffa0958d5d>] mdt_getattr_name_lock+0xe8d/0x1950 [mdt]
      15:23:56: [<ffffffffa11d671d>] ? lustre_msg_buf+0x5d/0x60 [ptlrpc]
      15:23:56: [<ffffffffa1203cb6>] ? __req_capsule_get+0x176/0x750 [ptlrpc]
      15:23:56: [<ffffffffa11d8b04>] ? lustre_msg_get_flags+0x34/0xb0 [ptlrpc]
      15:23:56: [<ffffffffa0959e45>] mdt_intent_getattr+0x375/0x590 [mdt]
      15:23:56: [<ffffffffa0957191>] mdt_intent_policy+0x371/0x6a0 [mdt]
      15:23:56: [<ffffffffa118f881>] ldlm_lock_enqueue+0x361/0x8f0 [ptlrpc]
      15:23:56: [<ffffffffa11b79bf>] ldlm_handle_enqueue0+0x48f/0xf70 [ptlrpc]
      15:23:56: [<ffffffffa0957506>] mdt_enqueue+0x46/0x130 [mdt]
      15:23:56: [<ffffffffa094e802>] mdt_handle_common+0x922/0x1740 [mdt]
      15:23:56: [<ffffffffa094f6f5>] mdt_regular_handle+0x15/0x20 [mdt]
      15:23:56: [<ffffffffa11e7b3c>] ptlrpc_server_handle_request+0x41c/0xe00 [ptlrpc]
      15:23:56: [<ffffffffa0f3e65e>] ? cfs_timer_arm+0xe/0x10 [libcfs]
      15:23:56: [<ffffffffa11def37>] ? ptlrpc_wait_event+0xa7/0x2a0 [ptlrpc]
      15:23:56: [<ffffffff810533f3>] ? __wake_up+0x53/0x70
      15:23:56: [<ffffffffa11e9111>] ptlrpc_main+0xbf1/0x19e0 [ptlrpc]
      15:23:56: [<ffffffffa11e8520>] ? ptlrpc_main+0x0/0x19e0 [ptlrpc]
      15:23:56: [<ffffffff8100c14a>] child_rip+0xa/0x20
      15:23:56: [<ffffffffa11e8520>] ? ptlrpc_main+0x0/0x19e0 [ptlrpc]
      15:23:56: [<ffffffffa11e8520>] ? ptlrpc_main+0x0/0x19e0 [ptlrpc]
      15:23:56: [<ffffffff8100c140>] ? child_rip+0x0/0x20
      15:23:56:
      15:23:56:Kernel panic - not syncing: LBUG
      

      Info required for matching: racer 1

      Attachments

        Activity

          pjones Peter Jones added a comment -

          Landed for 2.3 and 2.4


          adilger Andreas Dilger added a comment -

          Oleg, could you please also cherry pick for master.

          adilger Andreas Dilger added a comment -

          If two threads race to add the same jobid to the job stats list in lprocfs_job_stats_log(), one thread will lose the race in cfs_hash_findadd_unique() and enter the "if (job != job2)" case. It can then fail LASSERT(!cfs_list_empty(&job->js_list)), depending on whether the other thread, in the "else" branch, has already added "job2" to the list or not.

          Simply locking the check for cfs_list_empty(&job->js_list) is not sufficient to fix the race; the locking would need to cover both the cfs_hash_findadd_unique() and cfs_list_add() calls, and since ojs_lock is global for the whole OST that could have a performance cost.

          Instead, just remove the LASSERT() entirely, since it provides no value; the "losing" thread can happily use the job_stat struct immediately, since it was fully initialized in job_alloc().

          Patch for this issue on b2_3: http://review.whamcloud.com/4263

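          To make the window described above concrete, here is a minimal standalone sketch in C. It is not Lustre source: job_stat, find_or_add(), stats_list, and js_on_list are illustrative stand-ins for the real cfs_hash/cfs_list code. The point is that the find-or-add step is atomic but the list insertion happens later, so the losing thread can observe the winner's entry before it has been linked onto the list.

      /*
       * Standalone sketch (NOT Lustre source) of the race described above.
       * The find-or-add step is atomic, but linking the new entry onto the
       * stats list happens later, so the thread that loses the race can see
       * the winner's entry before it is on the list.  Asserting "already on
       * the list" at that point is the analog of the removed
       * LASSERT(!list_empty(&job->js_list)).
       */
      #include <pthread.h>
      #include <stdbool.h>
      #include <stdio.h>
      #include <stdlib.h>
      #include <string.h>

      struct job_stat {
              struct job_stat *js_next;     /* linkage on the stats list */
              bool             js_on_list;  /* stand-in for !list_empty(&js_list) */
              char             js_jobid[32];
              unsigned long    js_count;
      };

      static struct job_stat *stats_list;   /* all jobids seen so far */
      static struct job_stat *hash_slot;    /* one-bucket "hash table" */
      static pthread_mutex_t  hash_lock = PTHREAD_MUTEX_INITIALIZER;
      static pthread_mutex_t  list_lock = PTHREAD_MUTEX_INITIALIZER;

      /* Atomic find-or-add, playing the role of cfs_hash_findadd_unique():
       * return the existing entry for this jobid if there is one, otherwise
       * insert @job and return it. */
      static struct job_stat *find_or_add(struct job_stat *job)
      {
              struct job_stat *ret;

              pthread_mutex_lock(&hash_lock);
              if (hash_slot != NULL &&
                  strcmp(hash_slot->js_jobid, job->js_jobid) == 0)
                      ret = hash_slot;          /* lost the race */
              else
                      ret = hash_slot = job;    /* won the race */
              pthread_mutex_unlock(&hash_lock);
              return ret;
      }

      static void *log_one(void *arg)
      {
              struct job_stat *job = calloc(1, sizeof(*job));
              struct job_stat *job2;

              (void)arg;
              strcpy(job->js_jobid, "dd.0");
              job2 = find_or_add(job);

              if (job != job2) {
                      /* Lost the race.  The winner may not have linked job2
                       * onto stats_list yet, so asserting job2->js_on_list
                       * here can fail spuriously -- this is where the removed
                       * LASSERT() sat.  Just use the winner's entry; it was
                       * fully initialized before it entered the hash. */
                      free(job);
                      job = job2;
              } else {
                      /* Won the race: only now link the entry onto the list. */
                      pthread_mutex_lock(&list_lock);
                      job->js_next = stats_list;
                      stats_list = job;
                      job->js_on_list = true;
                      pthread_mutex_unlock(&list_lock);
              }

              __sync_fetch_and_add(&job->js_count, 1);  /* GCC/Clang builtin */
              return NULL;
      }

      int main(void)
      {
              pthread_t t1, t2;

              pthread_create(&t1, NULL, log_one, NULL);
              pthread_create(&t2, NULL, log_one, NULL);
              pthread_join(t1, NULL);
              pthread_join(t2, NULL);

              printf("jobid %s logged %lu times\n",
                     stats_list->js_jobid, stats_list->js_count);
              return 0;
      }

          Built with "gcc -pthread", the losing thread simply reuses the winner's fully initialized entry, mirroring the fix of dropping the LASSERT() rather than widening ojs_lock to cover both the find-or-add and the list insert.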
          pjones Peter Jones added a comment -

          Oleg is looking into this one


          People

            Assignee: Oleg Drokin (green)
            Reporter: Maloo (maloo)
            Votes: 0
            Watchers: 7

            Dates

              Created:
              Updated:
              Resolved: