app_rpt: "fast hangup" crashing with latest dev. #459

mkmer · 2025-01-20T00:35:00Z

There is quite a bit of additional code beyond what I had tested earlier in PR #451

Now when executing the "fast disconnect" test I receive this crash:

[2025-01-19 19:24:50.540] WARNING[420715][C-00000024]: app_rpt/rpt_channel.c:470 send_newkey: Failed to send text !NEWKEY1! on IAX2/149.154.11.243:24357-1639
[2025-01-19 19:24:50.542] WARNING[420715][C-00000024]: app_rpt/rpt_channel.c:470 send_newkey: Failed to send text !NEWKEY1! on IAX2/149.154.11.243:24357-1639
 == Spawn extension (radio-secure, 287893, 1) exited non-zero on 'IAX2/149.154.11.243:24357-1639'
[2025-01-19 19:24:50.542] WARNING[420715][C-00000024]: channel.c:2630 ast_hangup: Hard hangup called by thread LWP 420715 on IAX2/149.154.11.243:24357-1639, while blocked by thread LWP 420617 in procedure ast_waitfor_nandfds!  Expect a failure

core-asterisk-2025-01-20T00-24-50Z-full.txt
core-asterisk-2025-01-20T00-24-50Z-locks.txt
core-asterisk-2025-01-20T00-24-50Z-brief.txt
core-asterisk-2025-01-20T00-24-50Z-info.txt

The text was updated successfully, but these errors were encountered:

mkmer · 2025-01-20T13:11:03Z

This is the offending code:

app_rpt/apps/app_rpt.c

Lines 6848 to 6852 in e4dee90

    
           if (!l->chan || ast_check_hangup(chan)) { 
        
           	/* This connection is already toast, just return -1 as normal and let the core kill the channel off */ 
        
           	ast_debug(3, "Channel %s is a dead link\n", ast_channel_name(chan)); 
        
           	return -1; 
        
           }

Commenting it out eliminates the crash but we end up with a Max retries error indefinitely:

[2025-01-20 08:07:23.356] WARNING[541574]: chan_iax2.c:3619 __attempt_transmit: Max retries exceeded to host 149.154.11.243 on IAX2/149.154.11.243:24305-8161 (type = 6, subclass = 2, ts=251998, seqno=37)

This is only occurring when the "fast" disconnect situation is triggered.

mkmer · 2025-01-20T13:36:56Z

Not really sure this is the right way, but return 0; plus elminating ast_check_hangup(chan) eliminates the crash AND eliminates the retry error...

This is my current "test" code:

		if (!l->chan) {
			/* This connection is already toast, just return -1 as normal and let the core kill the channel off */
			ast_debug(3, "Channel %s is a dead link\n", ast_channel_name(chan));
			return 0;
		}

InterLinked1 · 2025-01-20T15:42:16Z

There is quite a bit of additional code beyond what I had tested earlier in PR #451

Now when executing the "fast disconnect" test I receive this crash:
[2025-01-19 19:24:50.540] WARNING[420715][C-00000024]: app_rpt/rpt_channel.c:470 send_newkey: Failed to send text !NEWKEY1! on IAX2/149.154.11.243:24357-1639
[2025-01-19 19:24:50.542] WARNING[420715][C-00000024]: app_rpt/rpt_channel.c:470 send_newkey: Failed to send text !NEWKEY1! on IAX2/149.154.11.243:24357-1639
 == Spawn extension (radio-secure, 287893, 1) exited non-zero on 'IAX2/149.154.11.243:24357-1639'
[2025-01-19 19:24:50.542] WARNING[420715][C-00000024]: channel.c:2630 ast_hangup: Hard hangup called by thread LWP 420715 on IAX2/149.154.11.243:24357-1639, while blocked by thread LWP 420617 in procedure ast_waitfor_nandfds!  Expect a failure
core-asterisk-2025-01-20T00-24-50Z-full.txt core-asterisk-2025-01-20T00-24-50Z-locks.txt core-asterisk-2025-01-20T00-24-50Z-brief.txt core-asterisk-2025-01-20T00-24-50Z-info.txt

Are you sure this is the right coredump? This appears to be caused by something different:

Thread 1 (Thread 0x7fba0a6396c0 (LWP 420659)):
#0  0x00007fba8823331a in ast_sockaddr_isnull (addr=<optimized out>) at /usr/src/asterisk-20.11.1/include/asterisk/netsock2.h:129
No locals.
#1  function_iaxpeer (chan=0x7fba680015d0, cmd=<optimized out>, data=0x7fba0a6295c8 "CURRENTCHANNEL", buf=0x7fba0a62b700 "", len=4096) at chan_iax2.c:14566

Perhaps this is incidentally caused by the hard hangup issue.

https://github.com/asterisk/asterisk/blob/ff0fb401ae8372312070e4f911d5a46ffc42d137/include/asterisk/netsock2.h#L129

The backtrace says SIGSEGV, so somehow there is an issue here. My guess is some kind of invalid memory access since the crashing function already does a NULL check.

I have a suspicion about how the problem might be resolvable. Are you able to easily or at least periodically reproduce this, to test a patch?

Ultimately, this seems to be caused by rpt_manager_do_xstat in stack frame 6 (app_rpt/rpt_manager.c:227).

InterLinked1 · 2025-01-20T15:49:51Z

Okay, I see the log messages in a thread, so it is right.

Unfortunately, LWP 420715 isn't present in the core dump, so whatever thread triggered the issue had exited by the time the backtrace was generated.

To force a crash while the thread exists, can you do the following:

Go to channel.c line 2630.
Add an ast_assert(0) after that.

Assuming you are compiled with dev mode, that should force a crash there. If you could upload the backtrace from that, that would be great. I'd like to confirm if the offending thread is an app_rpt thread or something else.

mkmer · 2025-01-20T16:36:07Z

@InterLinked1 - your looking to add an assert in asterisk base code channel.c (right?) There is already and assert(0) in that location.

There are 2 scenarios for crashes (I made sure to grab each one after the specific crash - so they should be right this time):

PR without modifications:

[2025-01-20 11:24:06.322] WARNING[692197][C-00000024]: app_rpt/rpt_channel.c:470 send_newkey: Failed to send text !NEWKEY1! on IAX2/149.154.11.243:24501-5692
  == Spawn extension (radio-secure, 287893, 1) exited non-zero on 'IAX2/149.154.11.243:24501-5692'
[2025-01-20 11:24:06.322] WARNING[692197][C-00000024]: channel.c:2630 ast_hangup: Hard hangup called by thread LWP 692197 on IAX2/149.154.11.243:24501-5692, while blocked by thread LWP 692009 in procedure ast_waitfor_nandfds!  Expect a failure
    -- Hungup 'IAX2/149.154.11.243:24501-5692'

core-asterisk-2025-01-20T16-24-06Z-full.txt
core-asterisk-2025-01-20T16-24-06Z-info.txt
core-asterisk-2025-01-20T16-24-06Z-locks.txt
core-asterisk-2025-01-20T16-24-06Z-thread1.txt
core-asterisk-2025-01-20T16-24-06Z-brief.txt

Scenario #2: changing to return 0; but leaving ast_check_hangup(chan) during the fast disconnect gives a bunch of these and a core dump:

[2025-01-20 11:17:41.460] ERROR[678530]: channel.c:3055 ast_waitfor_nandfds: FRACK!, Failed assertion bad magic number 0x0 for object 0x7f4694006cd0 (0)
[2025-01-20 11:17:41.460] ERROR[678530]:   Got 7 backtrace records
# 0: /usr/sbin/asterisk(__ao2_unlock+0x13a) [0x55e4b279d96a]
# 1: /usr/sbin/asterisk(ast_waitfor_nandfds+0x167) [0x55e4b27de317]
# 2: /usr/sbin/asterisk(ast_waitfor_n+0x14) [0x55e4b27deb74]
# 3: /usr/lib/asterisk/modules/app_rpt.so(+0x1fa4e) [0x7f46cb21fa4e]
# 4: /usr/sbin/asterisk(+0x1bf26c) [0x55e4b28f026c]
# 5: /lib/x86_64-linux-gnu/libc.so.6(+0x891c4) [0x7f46db18f1c4]
# 6: /lib/x86_64-linux-gnu/libc.so.6(+0x10985c) [0x7f46db20f85c]

core-asterisk-2025-01-20T16-17-41Z-brief.txt
core-asterisk-2025-01-20T16-17-41Z-full.txt
core-asterisk-2025-01-20T16-17-41Z-info.txt
core-asterisk-2025-01-20T16-17-41Z-locks.txt
core-asterisk-2025-01-20T16-17-41Z-thread1.txt

IF I change return -1; to return 0; AND remove the ast_check_hangup(chan) there are no crashes (or errors/warnings that I see)

Also, I'm thinking we have a problem with how link pointers are freed up -> it looks like we remove the link from the repeater linked list, and free the memory based on l->chan == NULL here:

app_rpt/apps/app_rpt.c

Lines 3270 to 3291 in e4dee90

    
           if ((!l->chan) && (!l->disctime) && (!l->outbound)) { 
        
           	ast_debug(1, "LINKDISC AA\n"); 
        
           	/* remove from queue */ 
        
           	rpt_link_remove(myrpt, l); 
        
           	if (myrpt->links.next == &myrpt->links) 
        
           		channel_revert(myrpt); 
        
           	if (!strcmp(myrpt->cmdnode, l->name)) 
        
           		myrpt->cmdnode[0] = 0; 
        
           	rpt_mutex_unlock(&myrpt->lock); 
        
           	if (l->name[0] != '0') { 
        
           		rpt_telemetry(myrpt, REMDISC, l); 
        
           	} 
        
           	rpt_update_links(myrpt); 
        
           	if (myrpt->p.archivedir) { 
        
           		donodelog_fmt(myrpt, "LINKDISC,%s", l->name); 
        
           	} 
        
           	dodispgm(myrpt, l->name); 
        
           	/* hang-up on call to device */ 
        
           	ast_hangup(l->pchan); 
        
           	ast_free(l); 
        
           	rpt_mutex_lock(&myrpt->lock); 
        
           	break;

We probably also need to be even more careful looking at link pointers than may have been freed already (basically everywhere?)

InterLinked1 · 2025-01-20T17:08:12Z

Okay, I see. Could you run core show threads immediately prior to crashing and provide its output? Perhaps it's in the thread list there. If not, you could grep the logs on that LWP for a match, there should be something there at least.

Resolves: #459

InterLinked1 · 2025-01-20T17:10:55Z

Never mind, I see it's the same thread that's logging it in fact.
I don't actually think this will work in that case, but curious what happens if you try with this patch: #460

InterLinked1 added a commit that referenced this issue Jan 20, 2025

app_rpt: Acquire blocklock mutex when hard hanging up link channel.

3457244

Resolves: #459

InterLinked1 linked a pull request Jan 20, 2025 that will close this issue

app_rpt: Acquire blocklock mutex when hard hanging up link channel. #460

Draft

mkmer added the bug Something isn't working label Jan 28, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

app_rpt: "fast hangup" crashing with latest dev. #459

app_rpt: "fast hangup" crashing with latest dev. #459

mkmer commented Jan 20, 2025

mkmer commented Jan 20, 2025

mkmer commented Jan 20, 2025 •

edited

Loading

InterLinked1 commented Jan 20, 2025

InterLinked1 commented Jan 20, 2025

mkmer commented Jan 20, 2025

InterLinked1 commented Jan 20, 2025

InterLinked1 commented Jan 20, 2025

app_rpt: "fast hangup" crashing with latest dev. #459

app_rpt: "fast hangup" crashing with latest dev. #459

Comments

mkmer commented Jan 20, 2025

mkmer commented Jan 20, 2025

mkmer commented Jan 20, 2025 • edited Loading

InterLinked1 commented Jan 20, 2025

InterLinked1 commented Jan 20, 2025

mkmer commented Jan 20, 2025

InterLinked1 commented Jan 20, 2025

InterLinked1 commented Jan 20, 2025

mkmer commented Jan 20, 2025 •

edited

Loading