RPT_LINKS incomplete node list #469

Open
mattmelling opened this issue Jan 27, 2025 · 8 comments · May be fixed by #470

@mattmelling

I’m working with a network which has around 90-100 nodes connected at any one time. I want to monitor which nodes are connected and have written a script to subscribe to the RPT_LINKS event (Python, via the excellent pystrix package).

If I query the Allstarlink API and count the number of links connected to each of our hub nodes, I get:

$ cat nodecount.sh 
#!/usr/bin/env bash

NODES=(2167 2196 63061 601040)

t=0
for n in "${NODES[@]}"; do
    c=$(curl -s https://stats.allstarlink.org/api/stats/$n | jq '.stats .data .links | length')
    t=$(( $t + $c ))
done
echo $t
$ ./nodecount.sh 
94

However, the payload I receive in the RPT_LINKS event has significantly fewer:

RPT_LINKS=69,T2167,T53396,T547462,T620131,T55512,T61706,T577817,T64077,T418191,T1999,T46348,T58753,T60148,T63466,T63783,T61578,T40064,T2196,T494130,T48907,T63429,T62169,T63863,T59629,T623770,T632850,T550113,T61739,T620133,T52375,T547487,T286982,T605872,T547482,T602201,T3889266,T555399,T3197638,T567200,T550700,T601630,T601640,T547488,T602020,T547480,T614820,T286980,T605870,T555390,T606530,T60699,T424380,T635660,T605880,T602200,T547489,T51871,T1999,T578103,T578101,T1995,T42434,T50490,T501642,T50982,T546381,T569940,T521391,T5

I counted them, and there are indeed 69, which corresponds with RPT_NUMLINKS. Also note the truncated T5 at the end. To rule out my script, I checked rpt showvars NNNNNN and see the same list there. The hub nodes are all connected according to our dashboards, at most 2 hops away from each other.
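
A payload like the one above can be checked mechanically. The following is a minimal sketch, not part of app_rpt or pystrix: `parse_rpt_links` and `suspect_truncation` are hypothetical helper names, the format (a leading count, then comma-separated entries made of a mode letter plus a node name) is inferred from the payloads in this thread, and the 519-character ceiling is only the length observed here, not a documented limit.

```python
def parse_rpt_links(value):
    """Split an RPT_LINKS value into (declared_count, [(mode, node), ...]).

    Assumes the 'N,<mode><node>,<mode><node>,...' layout seen in this thread.
    """
    head, _, rest = value.partition(',')
    declared = int(head)
    # Each entry is one mode letter (e.g. 'T') followed by the node name.
    entries = [(item[0], item[1:]) for item in rest.split(',') if item]
    return declared, entries


def suspect_truncation(value, observed_limit=519):
    """Flag values that reach the length ceiling observed in this thread.

    The default of 519 is an observation from this issue, not a known constant.
    """
    return len(value) >= observed_limit


declared, entries = parse_rpt_links('3,T2167,T2196,T5')
print(declared, entries, suspect_truncation('3,T2167,T2196,T5'))
```

Comparing the declared count with the number of entries is not enough on its own, because a cut-off final entry such as T5 still counts as an entry. That is why the sketch also looks at the overall length.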

The topology is 2167 - 2196 - 63061

In this case I am connected directly to 2167, but cannot see 63061 or any of its adjacent nodes. If I connect directly to 63061, I do not see 2167 or any of its adjacent nodes. Both hubs are connected to 2196, and both can see 2196 and its adjacent nodes in the list.

RPT_LINKS seems to be emitted when there is activity, and the network itself is quite busy with the usual flurry of kerchunks between QSOs. I have also tried poking the hub nodes to solicit RPT_LINKS events between nodes, but the most I can see is around 70.

Is there any limit to the number of nodes that will be emitted with RPT_LINKS? Is there some configuration that would prevent an intermediate node from sending node details from one side to another (looking at 2196)?


Version information

I've observed this behavior on an ASL3 node:

OS            : Debian GNU/Linux 12 (bookworm)
OS Kernel     : 6.1.0-28-amd64

Asterisk      : 20.10.0+asl3-3.1.0-1.deb12
ASL [app_rpt] : 3.1.0
Full asl-show-version 3.1.0 output
    $ sudo asl-show-version 
    dpkg-query: no packages found matching cockpit*
    ********** AllStarLink [ASL] Version Info **********

    OS            : Debian GNU/Linux 12 (bookworm)
    OS Kernel     : 6.1.0-28-amd64

    Asterisk      : 20.10.0+asl3-3.1.0-1.deb12
    ASL [app_rpt] : 3.1.0

    Installed ASL packages :

      Package                         Version
      ==============================  ==============================
      allmon3                         1.4.1-1.deb12
      asl3                            3.6.0-1.deb
      asl3-asterisk                   2:20.10.0+asl3-3.1.0-1.deb12
      asl3-asterisk-config            2:20.10.0+asl3-3.1.0-1.deb12
      asl3-asterisk-modules           2:20.10.0+asl3-3.1.0-1.deb12
      asl3-menu                       1.10-1.deb12
      dahdi                           1:3.1.0-2
      dahdi-dkms                      1:3.4.0-4+asl
      dahdi-linux                     1:3.4.0-4+asl

Also seeing it after upgrading to 3.2.0:

OS            : Debian GNU/Linux 12 (bookworm)
OS Kernel     : 6.1.0-28-amd64

Asterisk      : 20.11.0+asl3-3.2.0-2.deb12
ASL [app_rpt] : 3.2.0
Full asl-show-version 3.2.0 output
    $ sudo asl-show-version
    dpkg-query: no packages found matching cockpit*
    ********** AllStarLink [ASL] Version Info **********
    
    OS            : Debian GNU/Linux 12 (bookworm)
    OS Kernel     : 6.1.0-28-amd64
    
    Asterisk      : 20.11.0+asl3-3.2.0-2.deb12
    ASL [app_rpt] : 3.2.0
    
    Installed ASL packages :
    
      Package                         Version
      ==============================  ==============================
      allmon3                         1.4.2-1.deb12
      asl3                            3.6.0-1.deb
      asl3-asterisk                   2:20.11.0+asl3-3.2.0-2.deb12
      asl3-asterisk-config            2:20.11.0+asl3-3.2.0-2.deb12
      asl3-asterisk-modules           2:20.11.0+asl3-3.2.0-2.deb12
      asl3-menu                       1.11-1.deb12
      dahdi                           1:3.1.0-2
      dahdi-dkms                      1:3.4.0-4+asl
      dahdi-linux                     1:3.4.0-4+asl

Also seeing the same thing on an older node running ASL Version 2.0.0-beta.6.

@mkmer
Collaborator

mkmer commented Jan 27, 2025

This is an interesting one. Let me confirm I understand what you've said here:

  • stats.allstarlink.org is reporting an accurate number of nodes
  • RPT_LINKS is apparently truncated AND RPT_NUMLINKS matches the node count found in RPT_LINKS.

It looks like we generate the allstarlink data here:

static inline int do_link_post(struct rpt *myrpt)

And this apparently is generating the correct list of nodes.

We then generate the Asterisk vars here:

void rpt_update_links(struct rpt *myrpt)

Generating the "same list" in a completely different way.

I suspect Asterisk has a limit on the internal variable size which is truncating these values, but I've chased that a bit and don't see the limit. It appears you have 518 chars in the truncated list, so we may be limited to 519 chars (adding \0 to the string). Maybe other eyes looking at the rpt_update_links() function will see what I'm missing.
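
To illustrate the hypothesis, here is a small simulation of what a fixed-size string buffer would do to a long node list. It is only a sketch: `join_into_buffer` is a hypothetical helper, the 520-byte size is a guess based on the lengths reported in this thread, and none of this is taken from the app_rpt source.

```python
def join_into_buffer(tokens, bufsize):
    """Mimic writing a comma-joined list into a C buffer of `bufsize` bytes.

    One byte is reserved for the terminating NUL, so at most
    bufsize - 1 characters survive.
    """
    full = ','.join(tokens)
    return full[:bufsize - 1]


# 200 made-up five-digit node numbers, each rendered as 'T<node>'.
tokens = ['T%d' % n for n in range(40000, 40200)]
out = join_into_buffer(tokens, 520)  # 520 is an assumed size, not a known one
print(len(out))
print(out[-10:])
```

With six-character tokens, roughly 74 entries fit before the cut, and the last one is chopped mid-token. Both effects resemble what the reported payloads show.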

@mattmelling
Author

This is an interesting one. Let me confirm I understand what you've said here:

* stats.allstarlink.org is reporting an accurate number of nodes

* RPT_LINKS is apparently truncated AND RPT_NUMLINKS matches the node count found in RPT_LINKS.

Yes, and RPT_LINKS is truncated in 2 ways:

  1. It is not showing the full number of nodes connected to the system. As demonstrated I was able to count 94 nodes connected to 4 hubs that are all connected to each other. The number of links reported in RPT_LINKS and RPT_NUMLINKS is significantly lower and seems to max out at around 70, no matter what is connected.
  2. The value emitted with the RPT_LINKS event is truncated to ~520 bytes. From an initial read I believe that it should support up to 5140 bytes.

To validate that it is a general issue, this is what I get when connecting to a completely different network - isolated from the network I am working with:

RPT_NUMLINKS=72
RPT_LINKS=72,T41223,T48574,T455601,T64090,T60027,T49838,T45567,T60966,T60733,T48593,T572060,T48957,T420663,T43915,T54162,T45914,T43845,T63582,T562221,T45426,T480641,T45563,T43891,T51288,T615400,T511612,T47767,T512513,T512510,T512511,T512514,T599912,T487292,T550540,T53844,T59988,T64042,T54889,T487290,T578821,T570771,T41689,T45873,T50707,T47429,T63295,T1111,T3656784,T3003379,T3867262,T1020,T1010,T1017,T1122,T1155,T1133,T1144,T56020,T1130,T48979,T54546,T51163,T54088,T49342,T57207,T51993,T58353,T53876,T61192,T41288,T41522,T4761

app_rpt is reporting ~70 nodes connected (at least 130 according to my hub node-counting script), and RPT_LINKS is 519 characters long, suspiciously just short of 520.

The back story to this is that I'd like to monitor which nodes are connected and raise an alert if certain nodes disconnect, so a precise understanding of who is connected is important. This is achievable by hammering the allstarlink.org API but I'd rather not do that :)
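
The monitoring described here could be sketched as a comparison between two successive node lists. This is an illustration under assumptions: `diff_links` is a hypothetical helper, the node lists are taken to be already parsed into plain node names, and the watch list is made up for the example.

```python
def diff_links(previous, current, watched):
    """Compare two node lists and report changes plus missing watched nodes."""
    prev, curr = set(previous), set(current)
    return {
        'joined': sorted(curr - prev),                   # new since last event
        'left': sorted(prev - curr),                     # gone since last event
        'watched_missing': sorted(set(watched) - curr),  # candidates for an alert
    }


changes = diff_links(
    previous=['2167', '2196', '63061'],
    current=['2167', '2196', '601040'],
    watched=['2196', '63061'],
)
print(changes)
```

A check like this is only as reliable as the list it is fed, so a truncated RPT_LINKS value would produce false alerts for any watched node that falls past the cut.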

@Allan-N
Collaborator

Allan-N commented Jan 27, 2025

app_rpt is reporting ~70 nodes connected (at least 130 according to my hub node-counting script)

Doesn't your node counting script aggregate the links from multiple nodes?

Wouldn't you need to do the same if you are counting the info from RPT_LINKS? Or are you thinking that RPT_LINKS includes the full tree?

But, the truncated list is something that needs to be investigated.

@mattmelling
Author

It does count links across multiple nodes, however they are all connected.
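
Because the hubs link to each other, a plain sum of per-hub link counts can include the same node more than once. The sketch below contrasts the summed count with the number of unique nodes. `count_links` is a hypothetical helper and the input data is invented for illustration. It makes no claim about how the stats API represents each link.

```python
def count_links(links_by_hub):
    """Return (summed, unique) link counts for a dict of hub -> linked nodes."""
    summed = sum(len(links) for links in links_by_hub.values())
    unique = set()
    for links in links_by_hub.values():
        unique.update(links)
    return summed, len(unique)


# Invented example: three hubs in a chain, each with one extra client node.
links_by_hub = {
    '2167': ['2196', '1001'],
    '2196': ['2167', '63061', '1002'],
    '63061': ['2196', '1003'],
}
summed, unique = count_links(links_by_hub)
print(summed, unique)
```

In this invented example the sum is 7 while only 6 distinct nodes are involved, so the two numbers are worth keeping apart when comparing against RPT_NUMLINKS.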

I was thinking that RPT_LINKS would include the full tree. From Event Management:

RPT_LINKS is a list of all nodes, whether connected directly, or connected through a node that is connected directly.

I can see a couple of interpretations here - through a node that is connected directly could imply that the node is adjacent to the directly connected node, or adjacent to an adjacent to an adjacent, etc.

So are we saying that RPT_LINKS will only show adjacents and their adjacents? Is there a way to get the whole tree directly from Asterisk?
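
If only adjacency information is available per hub, the whole tree can be rebuilt outside Asterisk with a breadth-first walk. The following is a sketch under assumptions: `reachable_nodes` is a hypothetical helper, and the adjacency map is hand-written from the topology described earlier in this thread rather than fetched from any API.

```python
from collections import deque


def reachable_nodes(adjacency, start):
    """Breadth-first walk returning every node reachable from `start`."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    seen.discard(start)  # report only the other nodes
    return seen


# Hand-written adjacency for the 2167 - 2196 - 63061 chain.
adjacency = {
    '2167': ['2196'],
    '2196': ['2167', '63061'],
    '63061': ['2196'],
}
print(sorted(reachable_nodes(adjacency, '2167')))
```

Tracking visited nodes keeps the walk from looping when two hubs list each other, which they do in any bidirectional link.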

@mkmer
Collaborator

mkmer commented Jan 27, 2025

Is my first statement correct? stats.allstarlink.org has the correct node list.

@mkmer
Collaborator

mkmer commented Jan 27, 2025

Also, can you share what is in the RPT_ALINKS and RPT_NUMALINKS registers alongside the others? Trying to pinpoint where things may be going wrong.

@Allan-N
Collaborator

Allan-N commented Jan 27, 2025

And while we're asking for info, can you share your pystrix script that subscribes to the RPT_LINKS events.

@mattmelling
Author

Is my first statement correct? stats.allstarlink.org has the correct node list.

Sure, it is correct. I was hoping to find a way of observing the network state without hammering the stats API (rate limited) and to be able to capture links through nodes that aren't reporting. If that isn't possible with RPT_LINKS, I'll figure something out, that's the fun part :)

Also, can you share what is in the RPT_ALINKS and RPT_NUMALINKS registers alongside the others? Trying to pinpoint where things may be going wrong.

Here is an example connected to one network:

allstar*CLI> rpt showvars 596200
Variable listing for node 596200:
   RPT_TXKEYED=0
   RPT_NUMLINKS=69
   RPT_LINKS=69,T63061,T59398,T63797,T63502,T462561,T615461,T548161,T62164,T596201,T596203,T62077,T59411,T1999,T537104,T537103,T596600,T63639,T60527,T61154,T62953,T537101,T2196,T618590,T48907,T578100,T634430,T1001,T1999,T1002,T43006,T50982,T54455,T62169,T63863,T61739,T550113,T521292,T615010,T494130,T620133,T54307,T56128,T546381,T623770,T632850,T52375,T555399,T424380,T602020,T550700,T60699,T606530,T635660,T605880,T601640,T547489,T547488,T602200,T547480,T605870,T614820,T286980,T555390,T601630,T567200,T547487,T286982,T605872,T547
   RPT_NUMALINKS=1
   RPT_ALINKS=1,63061TU
   RPT_AUTOPATCHUP=0
   RPT_ETXKEYED=0
   RPT_RXKEYED=0
    -- 8 variables

And connected to another network for comparison:

allstar*CLI> rpt showvars 596200
Variable listing for node 596200:
   RPT_TXKEYED=0
   RPT_NUMLINKS=73
   RPT_LINKS=73,T41223,T48574,T572060,T43915,T54162,T455601,T49838,T45567,T60966,T60733,T48593,T48957,T420663,T45914,T43845,T63582,T562221,T480641,T45563,T43891,T51288,T62821,T512513,T512510,T512511,T512514,T64141,T54186,T487292,T53844,T59988,T54889,T578821,T570771,T41689,T63295,T45873,T1111,T3003379,T1020,T1010,T1017,T1122,T1155,T1133,T1144,T56020,T1130,T48979,T54546,T51163,T49342,T57207,T51993,T53876,T61192,T41288,T41522,T47615,T63905,T62961,T608530,TDL6EAC,T47138,T59930,T41962,T61148,T47384,T473810,T47692,T561260,T47743,T54
   RPT_NUMALINKS=1
   RPT_ALINKS=1,41223TU
   RPT_ETXKEYED=0
   RPT_RXKEYED=0
   RPT_AUTOPATCHUP=0
    -- 8 variables
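
Listings like the two above can be turned into a dictionary for easier comparison. This is a minimal sketch that assumes the `NAME=value` layout shown in the output. `parse_showvars` is a hypothetical helper, and the sample text is shortened and invented rather than copied from a real node.

```python
def parse_showvars(text):
    """Parse 'rpt showvars' style output into a dict of name -> value."""
    variables = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip the header and the trailing '-- N variables' summary.
        if '=' not in line or line.startswith('--'):
            continue
        name, _, value = line.partition('=')
        variables[name] = value
    return variables


sample = """Variable listing for node 596200:
   RPT_TXKEYED=0
   RPT_NUMLINKS=2
   RPT_LINKS=2,T63061,T2196
   RPT_ALINKS=1,63061TU
    -- 4 variables"""
variables = parse_showvars(sample)
print(variables['RPT_NUMLINKS'], len(variables['RPT_LINKS']))
```

Printing the length of RPT_LINKS next to RPT_NUMLINKS makes it easy to spot values that sit right at the suspected ceiling.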

And while we're asking for info, can you share your pystrix script that subscribes to the RPT_LINKS events.

The script is wrapped up with a bunch of other stuff, below is a minimal example that pulls out the LINKS and ALINKS events.

To be clear, the info I posted above was extracted directly from a local ASL node with rpt showvars rather than through this script.

import os
import pystrix
import time

class AsteriskManager:
    def __init__(self, hostname='localhost', port=5038, username='admin', password='password'):
        self._hostname = hostname
        self._port = port
        self._username = username
        self._password = password
        self._manager = pystrix.ami.Manager()

    def start(self):
        print('AsteriskManager starting')
        self._manager.connect(self._hostname, port=self._port)
        self._do_login()
        self._manager.register_callback('RPT_LINKS', self.handle_rpt_links)
        self._manager.register_callback('RPT_NUMLINKS', self.handle_rpt_numlinks)
        self._manager.register_callback('RPT_ALINKS', self.handle_rpt_alinks)
        self._manager.register_callback('RPT_NUMALINKS', self.handle_rpt_numalinks)
        self._manager.monitor_connection()

    def _do_login(self):
        challenge_response = self._manager.send_action(pystrix.ami.core.Challenge())
        login_action = pystrix.ami.core.Login(self._username, self._password,
                                              challenge=challenge_response.result['Challenge'])
        self._manager.send_action(login_action)

    def handle_rpt_links(self, event, manager):
        print(f"RPT_LINKS = {event.get('EventValue', '')}")

    def handle_rpt_numlinks(self, event, manager):
        print(f"RPT_NUMLINKS = {event.get('EventValue', '')}")

    def handle_rpt_alinks(self, event, manager):
        print(f"RPT_ALINKS = {event.get('EventValue', '')}")

    def handle_rpt_numalinks(self, event, manager):
        print(f"RPT_NUMALINKS = {event.get('EventValue', '')}")

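# Connection settings come from the environment, falling back to local defaults.
ast = AsteriskManager(hostname=os.environ.get('ASTERISK_HOSTNAME', 'localhost'),
                      port=int(os.environ.get('ASTERISK_PORT', '5038')),
                      username=os.environ.get('ASTERISK_USERNAME', ''),
                      password=os.environ.get('ASTERISK_PASSWORD', ''))
ast.start()
# Keep the main thread alive while the registered callbacks handle events.
while True:
    time.sleep(1)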

@mkmer mkmer linked a pull request Jan 27, 2025 that will close this issue
@mkmer mkmer added the bug Something isn't working label Jan 28, 2025