BLE connection randomly hangs on Bangle.js2 #2560

Closed
nravanelli opened this issue Sep 23, 2024 · 23 comments

Comments

@nravanelli

I have noticed an intermittent issue with the Bangle.js2: during a Bluetooth connection from a separate device to the UART service, the Bangle.js2 sometimes freezes and needs a reboot (button hold).

I am connecting to the bangle.js2 every 3 minutes over bluetooth via an ESP32 (programmed in arduino/esp-idf), sending 2 commands one after the other, without need for reading the response, and then disconnecting from the device. The ESP32 executes its code successfully, and this can work fine for hours, days, even weeks.

However, at some random point during a connection, the Bangle.js2 freezes and stops functioning (e.g. the time on the clock freezes and the Bluetooth icon from widbt stays blue). The Bangle.js2 also stops advertising, so it is stuck in a BLE-connected state even though the ESP32 has disconnected.

I have been trying to unravel when in the connection this happens, and I believe it occurs before commands are sent, thus during the connection between ESP32 and Bangle.js2.

@gfwilliams
Member

Just to check, the firmware on your Bangle.js is up to date?

Please can you turn on the debug logging option on the Bangle - probably to write messages to the screen as well as log them to a file - and maybe that will help you see what's happening?

Please can you also attempt to subscribe to notifications on the UART RX characteristic and print them? Even the act of doing that might help, but also it would show you any errors/warnings the Bangle sent

There are a few options here:

  • most likely: When a Bluetooth connection is established, any print statements, or even commands sent to the Bangle that aren't sent with echo off will cause characters to be printed to the bluetooth connection. These will be queued up waiting for the point that they can be received. Espruino does check to see if notifications have been enabled I believe and clears the buffer if they're not, but if your ESP32 has enabled notifications but doesn't ack them then that could well be an issue, and the Bangle might be stuck waiting for the buffer to become empty.
  • Or your code is causing memory usage to build up over time, and eventually it gets so full nothing can execute - but that doesn't explain the disconnection issue.

With no idea what commands you're sending, what the Bangle is reporting back, or even what software is running on the Bangle, there's not really anything I can do to help solve this. If you could get me the Bangle and ESP32 code you're running I could try it on a device here and see if I can reproduce it, but realistically it'd have to occur more often than once a week for me to be sure of doing so while the Bangle was on a debugger.

In the meantime, you should be able to work around your problem by resetting the Bangle if it's not working as you expect. Based on https://www.espruino.com/Bangle.js+Button+Reset you can do:

Bangle.setOptions({btnLoadTimeout:0}); // disable app reload
setInterval(()=>E.kickWatchdog(), 2000); // kick the watchdog every 2 seconds

So now, if that interval (with E.kickWatchdog()) doesn't get called every 2 seconds, the device will do a full, hardware reboot. You could instead add the interval to some other code that you want to be sure is running, so if it ever stops a reboot will be triggered.
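
The second suggestion above, tying the kick to code you want to be sure is running, could be sketched roughly as follows. This is only an illustration: `makeHeartbeat`, `beat`, `check` and `maxSilenceMs` are hypothetical names, and the kick function is passed in so the logic can be tried off the watch.

```javascript
// Sketch only: makeHeartbeat, beat and check are hypothetical names.
// `kick` is injected; on the watch it would be function(){ E.kickWatchdog(); }.
// `now` is optional and only exists so a fake clock can be used for testing.
function makeHeartbeat(kick, maxSilenceMs, now) {
  now = now || Date.now;
  var last = now();
  return {
    // Call this from the code you want to be sure is still running.
    beat: function () { last = now(); },
    // Call this from a setInterval; it kicks only while beats keep arriving.
    check: function () {
      var alive = (now() - last) <= maxSilenceMs;
      if (alive) kick();
      return alive;
    }
  };
}

// Possible use on the watch (assumption, not tested on a device):
//   var hb = makeHeartbeat(function(){ E.kickWatchdog(); }, 90000);
//   setInterval(hb.check, 2000);
//   ...and call hb.beat() from the once-a-minute storage write.
```

With this shape, the watchdog stops being kicked either when the interval itself stops or when the monitored code stops calling `beat()`.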

@nravanelli
Author

Thanks for the feedback and plan forward for diagnosing. I will implement and report back. Your watchdog fix will work in the interim, but finding the culprit is better.

I cannot disclose the exact code at the moment as we are currently in the process of a research publication submission. However, the code just sends two functions: 1) one updates the unix time (and tz offset), and 2) one updates a cached json file on the watch with the time. Each function is sent with echo off, in the form \x03\x10(function(){})()\n, and the commands are sent in 20 byte chunks.
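
As an illustration of the framing and chunking described above (not the actual code, which is not disclosed), a minimal sketch follows. `frameCommand` and `chunkCommand` are hypothetical names, the example payload is made up, and in the real setup this logic runs on the ESP32 rather than in JavaScript.

```javascript
// Wrap JS source so Espruino runs it without echoing it back:
// \x03 is Ctrl-C (clears any partial input line), \x10 turns echo off for the line.
function frameCommand(js) {
  return "\x03\x10" + js + "\n";
}

// Split a framed command into pieces of at most `size` characters
// (20 here, as described above; assumes one byte per character).
function chunkCommand(framed, size) {
  var chunks = [];
  for (var i = 0; i < framed.length; i += size) {
    chunks.push(framed.slice(i, i + size));
  }
  return chunks;
}

// Made-up payload purely for illustration.
var framed = frameCommand("(function(){setTime(1727078400);})()");
var chunks = chunkCommand(framed, 20);
```

Each entry in `chunks` would then be written to the UART RX characteristic in order.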

I initially thought it was a writing to storage issue (my main app writes to the watch storage each minute), and so randomly the second command and the storage write within app would crash, so I implemented a 'watcher' system for storage writes, but the freezing still occurs.

The code that connects to the Bangle.js2 from the ESP32 is quite generic (so I'm happy to share it), is based on many tutorials online, and registers for notifications:

bool connectToServer(String deviceAddr) {
  Serial.print("[BLE CONN] Forming a connection to ");
  Serial.println(deviceAddr);
  uint8_t macID[6];
  unsigned int macBytes[6];
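  if (sscanf(deviceAddr.c_str(), "%2X:%2X:%2X:%2X:%2X:%2X", &macBytes[0], &macBytes[1], &macBytes[2], &macBytes[3], &macBytes[4], &macBytes[5]) != 6) {
    // Bail out rather than connecting with an uninitialised macID
    Serial.println("[BLE CONN] Could not parse device address");
    return false;
  }
  // Copy the parsed values to macID array
  for (int i = 0; i < 6; i++) {
    macID[i] = static_cast<uint8_t>(macBytes[i]);
  }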
  // Recreate pClient if necessary
  pClient = NimBLEDevice::createClient(NimBLEAddress(macID, 1));
  // Create pClientCallback if it doesn't exist
  if (pClientCallback) {
    delete pClientCallback;
  }
  pClientCallback = new MyClientCallback();
  pClient->setClientCallbacks(pClientCallback);
  // Attempt to connect with a timeout
  unsigned long startTime = millis();
  const unsigned long timeout = 5000;  // 5 seconds timeout
  while (!pClient->connect() && millis() - startTime < timeout) {
    delay(100);  // Short delay to yield CPU
  }
  if (!pClient->isConnected()) {
    Serial.println("[BLE CONN] Failed to connect to server within timeout");
    return false;
  }
  //pClient->setMTU(517);
  pRemoteService = pClient->getService(serviceUUID);
  if (pRemoteService == nullptr) {
    Serial.print("Failed to find our service UUID: ");
    Serial.println(serviceUUID.toString().c_str());
    pClient->disconnect();
    return false;
  }
  Serial.println(F("[BLE CONN] Found our service"));
  pTXCharacteristic = pRemoteService->getCharacteristic(charUUID_TX);
  if (pTXCharacteristic == nullptr) {
    Serial.print("Failed to find TX characteristic UUID: ");
    Serial.println(charUUID_TX.toString().c_str());
    pClient->disconnect();
    return false;
  }
  std::string value = pTXCharacteristic->readValue();
  Serial.print(F("[BLE CONN] The characteristic value is currently: "));
  Serial.println(value.c_str());
  pTXCharacteristic->registerForNotify(notifyCallback);
  const uint8_t notificationOn[] = { 0x1, 0x0 };
  pTXCharacteristic->getDescriptor(BLEUUID((uint16_t)0x2902))->writeValue((uint8_t *)notificationOn, 2, true);
  pRXCharacteristic = pRemoteService->getCharacteristic(charUUID_RX);
  if (pRXCharacteristic == nullptr) {
    Serial.print(F("[BLE CONN] Failed to find our characteristic UUID: "));
    Serial.println(charUUID_RX.toString().c_str());
    pClient->disconnect();  // don't leave the connection open on failure
    return false;
  }
  Serial.println(F("[BLE CONN] Remote BLE RX characteristic reference established"));
  BangleConnected = true;
  return true;
}

@gfwilliams
Member

Ok, great! So one thing I do see is that you are enabling notifications and adding a handler with pTXCharacteristic->registerForNotify(notifyCallback);

I don't know how NimBLEDevice handles it, but if notifyCallback should be responding with an acknowledgement and isn't, that could be the issue.

If you really don't care about the response you can always try removing every reference to pTXCharacteristic, which will at least ensure you're not enabling notifications, which would then hopefully mean Bangle.js wasn't even trying to send any responses.

@nravanelli
Author

One thing that came to mind is this...

  // Attempt to connect with a timeout
  unsigned long startTime = millis();
  const unsigned long timeout = 5000;  // 5 seconds timeout
  while (!pClient->connect() && millis() - startTime < timeout) {
    delay(100);  // Short delay to yield CPU
  }
  if (!pClient->isConnected()) {
    Serial.println("[BLE CONN] Failed to connect to server within timeout");
    return false;
  }

What is the 'worst case' time duration for a Bluetooth connection to a Bangle? Could there be a case where the NRF thinks it's connected, but the ESP32 has timed out and stopped trying, and so hasn't proceeded, thereby halting the Bangle.js2 from continuing the 'connection dance'? In other words, do you think 5 seconds is too short...?

Hope that makes sense.

@gfwilliams
Member

Well, it shouldn't really matter - at whatever point, if the Bangle's not receiving any packets from the connecting device for a certain time period it should flag itself as disconnected.

Out of interest, when the Bangle hangs, what happens if you just unplug or hard reset the ESP32? Does it have any effect on the Bangle?

@nravanelli
Author

I haven't tried that, but the program on the microcontroller continues to run and connect to other Bangle.js2 in proximity... so it has already cleared/recycled its Bluetooth connection (only one Bangle.js2 is connected at a time).

@nravanelli
Author

Update, after 2 days of solid connect/disconnects. We have a freeze on my dev/test watch.

I wasn't able to capture the exact log of what happened. Got home, put watch on charger, and it froze on next BLE connection. The ESP32 continued to operate as expected (disconnect after a stale connection). I did have your interim suggestion included in the codebase;

Bangle.setOptions({btnLoadTimeout:0}); // disable app reload
setInterval(()=>E.kickWatchdog(), 2000); // kick the watchdog every 2 seconds

But that didn't even restart the watch, despite the frozen clockface (I created a custom widget and just use the Anton clock).

@gfwilliams
Member

Well that's odd - so you think that the interval wasn't running? Maybe you could update your custom widget from within that setInterval so you can see if that is still working ok.

The ESP32 continued to operate as expected (disconnect after a stale connection).

Did you actually try restarting the ESP32 in case it was still holding the connection open even though the code said it wasn't?

@nravanelli
Author

nravanelli commented Sep 25, 2024

I am not sure whether the interval was running or not (although the inability to use the button for reset implies it's positioned correctly), but the Bangle.js2 accepts a connection (widbt shows blue on screen). So a connection is started.

My guess is the ESP32 attempts to get the service/characteristic of the UART, and this fails so the ESP32 disconnects as expected. But the Bangle.js2 freezes at some point here. I can tell this as my custom widget writes data every minute to local storage, and I have no data between 19h35 and 20h55 during the freeze until I conducted a hard reset (holding button for +10 seconds).
EDIT: The Bangle.js2 is also not discoverable over BLE during the 'freeze' as well.

I did not physically restart my ESP32, but if a Bangle connection stalls for over 20 seconds (e.g. connected but no notifications received for 20 seconds), it automatically restarts esp32 as a failsafe. So connection to the watch is not held. I also only allow for 1 connection to a Bangle at a given time on the ESP32.

@gfwilliams
Member

I did not physically restart my ESP32,

Please can you just try this next time it happens? It'd really help to track down what the underlying issue is

@nravanelli
Author

Please can you just try this next time it happens? It'd really help to track down what the underlying issue is

Funny enough, the freeze episode happened today. I did power reset the ESP32, but it did nothing. The BangleJS2 was frozen on the clock face.
Information I have:

  • BangleJS2 is still on, clock is stuck (Anton Clock)
  • Bluetooth widget (widbt) suggests it currently has a BLE connection
  • The watch cannot be found by scanning over Bluetooth. This makes me think the NRF stack is stuck somewhere
  • the watch will not write to storage (custom recorder widget not continuing in background)
  • HR sensor is still on and responds to on/off skin (led will go on/off)
  • watch is not responsive to button press to wake screen
  • watch will restart with a 10 second button hold

Hopefully that helps narrow it down...

gfwilliams added a commit that referenced this issue Oct 14, 2024
@gfwilliams
Member

It at least shows that something in the 'idle loop' isn't completing, but it's probably not JS because a ~2 sec press of the button would break out otherwise. Without a way for me to easily reproduce or any logs from the Bangle it's not possible to know much more though!

If you could make me a cut-down firmware for the ESP32 that could reproduce the issue on a Bangle here then I could help, but if not there's nothing I can do.

But I did just add manualWatchdog. Please can you try a cutting edge build with:

Bangle.setOptions({manualWatchdog:true}); // stop automatic watchdog kicks
setInterval(()=>E.kickWatchdog(), 2000); // kick the watchdog manually every 2 seconds

This should now reboot the Bangle if the interval isn't called, which we're pretty sure it's not - so that should fix your problem for now

@nravanelli
Author

I will try that and report back! Thanks :)

The press of the button should have unlocked the watchface at least, but the ~2 sec button press didn't/wouldn't work as I followed your previous suggestion of adding these lines of code to the widget:

Bangle.setOptions({btnLoadTimeout:0}); // disable app reload
setInterval(()=>E.kickWatchdog(), 2000); // kick the watchdog every 2 seconds

But with this cutting edge build I can ignore that first bit of code...

I think I could create a simplified version of the firmware for you to test on a standard ESP32 ... not an easy task so may take a couple of days. Will have to send through email.

I am also going to add a memory watcher too so when / if the screen freezes I have a visual of used memory too, just to confirm that is/is not the culprit.
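
A memory readout of this kind might look something like the sketch below. `formatMemory` is a hypothetical helper, and it assumes an object shaped like the one returned by Espruino's `process.memory()`, which includes `usage` and `total` counts of variable blocks.

```javascript
// formatMemory is a hypothetical helper. `mem` is expected to look like the
// object returned by Espruino's process.memory() ({usage, total, ...}).
function formatMemory(mem) {
  var pct = Math.round(100 * mem.usage / mem.total);
  return mem.usage + "/" + mem.total + " (" + pct + "%)";
}

// Possible use on the watch (assumption): inside the widget's draw(),
// draw the string formatMemory(process.memory()) so that the last value
// stays visible if the screen freezes.
```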

@gfwilliams
Member

gfwilliams commented Oct 14, 2024

Ok, great, thanks!

And yes, dumping memory use on the screen would be handy. I believe there are some widgets for that so you may not even have to write any code.

The press of the button should have unlocked the watchface at least, but the ~2 sec button press didn't/wouldn't work as I followed your previous suggestion

Ahh, ok - interesting! So you think without btnLoadTimeout it would have unlocked?

If so it could be JS - if you enabled writing the debug log on the watch it might have even been able to write a stack trace of where it got stuck into the log

@nravanelli
Author

Ahh, ok - interesting! So you think without btnLoadTimeout it would have unlocked?

Nope, it still wouldn't have unlocked (based on previous experience). Only a 10 second hold will restart the watch...

If so it could be JS - if you enabled writing the debug log on the watch it might have even been able to write a stack trace of where it got stuck into the log

Will do!

@nravanelli
Author

updates... stack log looks like it might not be much assistance, but it seems that at the point of 'freeze' it prints this:

\x1b[J-> Bluetooth
<- Terminal
>\x1b[?7l

I also had to remove the manualWatchdog option. It seems that it would trip during my file transfer code:

\x03\x10(function(file){E.showMessage('Syncing...');var f=require('Storage').open(file,'r');var d=f.read(384);while(d!==undefined){print(btoa(d));d=f.read(384);}})(file)\n;

Removing manualWatchdog and retaining this:

setInterval(()=>E.kickWatchdog(), 2000); // kick the watchdog every 2 seconds

resolved that, but I still experienced a freeze (overnight last night) during one of the Bluetooth connections (attempts are made every 3 - 5 mins). It had the blue bluetooth symbol, suggesting a connection, but it was not advertising.

@gfwilliams
Member

Thanks for the update...

Please keep manualWatchdog - the whole point is that if something locks up the device for a long time, it restarts.

If your code is taking too long to execute and you're sure it's ok, just add kickWatchdog inside it:

\x03\x10(function(file){E.showMessage('Syncing...');var f=require('Storage').open(file,'r');var d=f.read(384);while(d!==undefined){print(btoa(d));d=f.read(384);E.kickWatchdog();}})(file)\n;
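
For readability, the same loop can be written out over several lines. The sketch below is an expanded, dependency-injected version so it can be exercised off the watch; `sendFile` and the `io` bundle are hypothetical names, and on the watch `io` would wrap require('Storage').open, print, btoa and E.kickWatchdog.

```javascript
// Expanded version of the transfer loop (sketch, hypothetical names).
// io.open(file, mode) -> object with read(n), returning undefined at end of file
// io.print(line)      -> sends one line over the connection
// io.btoa(str)        -> base64-encodes a string
// io.kick()           -> kicks the watchdog
function sendFile(file, io) {
  var f = io.open(file, "r");
  var sent = 0;
  var d = f.read(384); // 384 bytes encode to exactly 512 base64 characters
  while (d !== undefined) {
    io.print(io.btoa(d));
    io.kick();           // kick the watchdog once per chunk during long transfers
    sent++;
    d = f.read(384);
  }
  return sent;           // number of lines printed
}
```

Kicking once per chunk keeps the gap between kicks bounded by the time taken to read, encode and print a single 384 byte block.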

@nravanelli
Author

Alright - more updates, but I don't think this narrows in on the problem.

Did as instructed, but I experienced another freeze during Bluetooth connection establishment. Here is a photo of what the screen logged right before freeze, and the watch never restarted, even with the manualWatchdog.

[Photo: watch screen log just before the freeze]

@gfwilliams
Member

Wow, ok, well that's odd. Was it still scrolling ->Bluetooth even when the ESP32 wasn't there, or you think each of those corresponds to a connection attempt?

And manualWatchdog was definitely set? Or might you have changed apps/restarted so it didn't take effect any more? Because if enabled it's at a hardware level - if that JS code isn't running, the watch will restart.

... unless your code is stuck executing some JS code that kicks the watchdog itself (just Storage compaction or defragmentation as far as I know)

@nravanelli
Author

nravanelli commented Dec 12, 2024

I think I have come to the conclusion that the Bluetooth stack freezes when you try to make connections too often from multiple independent devices (e.g. beacons connecting to the BangleJS2 to write a command and then immediately disconnecting). The origin of the freeze is unknown at this time because once the BangleJS2 freezes it is unresponsive, but it likely occurs during connection initiation, before services/characteristics are found.

I did observe that having only one beacon connect to the BangleJS2 (once every 5 minutes) resolved the issue. It was only when I had multiple devices connecting (2 or more) that the BangleJS2 hung, even if I set a flag to ensure beacons did not connect to the same BangleJS2 within 5 minute windows.

The timing of the freeze was irregular as well: sometimes it froze after a couple of hours, sometimes after a couple of days.

So I am marking this as closed for now and will reopen / update if I come by a fix

@gfwilliams
Member

Ok, thanks! That explains why my test setup worked for days without issues.

If you do have success reproducing (eg with two ESP32 trying to connect every few seconds) please let me know - the trick will be reproducing it here so I can do it with the watch on a debugger.

I also believe the use of the watchdog tweaks above should really have resolved almost any crashing issues you had

@nravanelli
Author

I thought so too (re: manual watchdog), but even with it I was experiencing the freeze condition, albeit less frequently so it did do something. I really did try various permutations for testing (e.g. different BLE connection intervals without flags [1,5,10 minutes], flags for ensuring the Bangle wasn't connected to more than once every 1, 5, or 10 minutes, 2 or more beacons in proximity, etc). The only resolution I experienced was to not have more than 1 ESP32 making direct communications with the watch. ~2, 5, or 10 mins intervals were all ok for days.

This, in general, makes me think that I am observing a CPU halt somewhere during the BLE connection, but I cannot pinpoint where in the stack this is happening as logging freezes. The only reset is a hardware button hold.

It isn't the end of the world - just requires a different solution for my context! But my alternative approach sparked another observation with the low-powered GPS mode provided by the gpssetup module.

@gfwilliams
Member

Well that itself is odd because the only way long-pressing on the button resets the Bangle is because of the watchdog, which should be disabled by manualWatchdog:true.

If you've got the code I gave you running on one of the recent firmwares:

Bangle.setOptions({manualWatchdog:true}); // stop automatic watchdog kicks
// and maybe "btnLoadTimeout:0" to stop pressing the button triggering a code reload
setInterval(()=>E.kickWatchdog(), 2000);

then pressing the button on the watch will have zero effect on whether it reboots or not.
