
September 11 - 2023 - SmartLink Outage

Mike-VA3MW
Mike-VA3MW Administrator, FlexRadio Employee, Community Manager, Super Elmer, Moderator admin
edited October 2023 in Message Board

We want to inform you that we are currently addressing a SmartLink issue that could potentially affect your ability to access your remote radios. Please be assured that our dedicated team at FlexRadio is actively working to resolve this matter. We will keep you updated on our progress as we work towards a solution.

Please keep an eye on this posting for any updated status.

Thank you for your patience and understanding.

Comments

  • Mike-VA3MW
    Mike-VA3MW Administrator, FlexRadio Employee, Community Manager, Super Elmer, Moderator admin

    Friday AM - Update. This was also emailed to all of our user install base.

    Our servers are still working at 100% to resolve the required TLS calculations. More and more users are returning to being online.

    Regarding our SmartLink Outage

    On 9/11 around noon CDT, we began to receive reports of trouble using SmartLink. This outage was caused by a Root CA certificate expiring that our SmartLink SSL cert depended on. In layman’s terms, this meant that a secure connection to the server could not be made.  

    The discovery of the root problem occurred around 3AM CDT 9/12, roughly 15 hours after the initial reports. Fortunately, we were already in a position to update our cert and the supporting root CA cert because of lessons learned from prior years. Actions taken earlier in the morning on 9/11 helped accelerate this work. The new certificates were in place on the SmartLink server by 4AM CDT 9/12.

    Unfortunately, one more step was required: radios were still not communicating with the server because the new root CA was not in the cert bundle in the radio. SmartSDR v3.5.9 and v2.10.1 were developed to address that issue and pushed through internal and Alpha testing. With their release around 3PM CDT 9/13, SmartLink was functional again, though the server was extremely sluggish.

    With the SmartLink service functional, our attention has turned to the sluggish response of the SmartLink server. This has caused a fair bit of confusion for folks that updated their software and still experienced timeout errors. Through some investigation, it has become clear that radios running earlier versions of SmartSDR attempting to connect to the SmartLink server are driving most of the server load.  

    This isn’t a terrible surprise as the TLS connection negotiation is a CPU intensive process and the radio will continue to attempt to connect upon a failure. We will continue to investigate ways to mitigate this problem on the server side, but the best way that you can help is to update your software (and radio) to the latest version.

    Rest assured that steps are being taken to avoid problems like this from happening in the future. We are committed to providing the best radio experience possible and we understand the frustration such an outage causes to our customers and apologize for the inconvenience.

    **Action Required:**

    If you have never used SmartLink or created a SmartLink login, there is no action required on your part.

    In order to help, we are asking you to upgrade to the latest version of SmartSDR. 

     This can be either v2.10.1 or v3.5.9.

     If you have already taken the time to update SmartSDR on your PC and also upgraded the firmware on the radio, we thank you very much.  

     For those SmartLink users, you will be required to be at v2.10.1 or v3.5.9 in order to continue to use SmartLink.

     If you are a Mac or iOS user, make sure that SmartSDR for Mac and SmartSDR for iOS have also been updated.

    1.  **Visit our Downloads Page:** Go to our official downloads page at https://www.flexradio.com/ssdr/
    2. **Select Your Software Version:** Choose either SmartSDR v2.10.1 or v3.5.9, depending on your current version in use.
    3. **Install SmartSDR:** Download and install the selected software version.

    If you need a review on how to do so, this video should help: https://youtu.be/yr1CWw9NgPM?si=rgII1fmfFqXxtSwS 
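
For anyone checking several stations at once, the version requirement above can be sketched as a small comparison. This is only an illustration: `MINIMUM_BY_MAJOR`, `parse_version`, and `meets_smartlink_minimum` are hypothetical names and not part of SmartSDR or FlexLib, and treating release lines other than v2 and v3 as unsupported is an assumption, since the announcement only names those two.

```python
import re

# Minimum versions named in the announcement, keyed by major release line.
MINIMUM_BY_MAJOR = {2: (2, 10, 1), 3: (3, 5, 9)}


def parse_version(text):
    """Turn a string such as 'v3.5.9' or 'V2.10.1' into a tuple of ints."""
    match = re.fullmatch(r"[vV]?(\d+)\.(\d+)\.(\d+)", text.strip())
    if match is None:
        raise ValueError(f"unrecognized version string: {text!r}")
    return tuple(int(part) for part in match.groups())


def meets_smartlink_minimum(text):
    """Return True if the version is at or above the minimum for its major line."""
    version = parse_version(text)
    minimum = MINIMUM_BY_MAJOR.get(version[0])
    # Assumption: release lines the announcement does not mention are rejected.
    return minimum is not None and version >= minimum
```

Comparing integer tuples rather than raw strings matters here: as text, "2.9.9" sorts after "2.10.1", while as numbers it is the older release.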

    At FlexRadio, we are committed to your security and satisfaction. By keeping your software up to date, you not only enhance your device's security but also ensure that you have access to the latest features and improvements.

  • Mike-VA3MW
    Mike-VA3MW Administrator, FlexRadio Employee, Community Manager, Super Elmer, Moderator admin

    Hi All

    We want to take a moment to address the ongoing issues with SmartLink and reassure you that we are actively working on improvements to enhance your user experience. Your feedback has been invaluable in helping us identify areas for enhancement, and we want you to know that we have not forgotten about your concerns.

    We understand that some customers have been experiencing lingering issues with SmartLink, and we want to assure you that our engineering team is diligently working to create a more robust and reliable solution.

    We are committed to resolving these issues. We appreciate your continued support and look forward to delivering a vastly improved SmartLink service in the near future.

    As more details are available that I can share, I will do so on this posting.

    73 and thanks for your patience and understanding

  • Mike-VA3MW
    Mike-VA3MW Administrator, FlexRadio Employee, Community Manager, Super Elmer, Moderator admin

    All

    We would like to inform you of an important upgrade to our SmartLink servers that will enhance your FlexRadio experience. 

    On Tuesday, October 3rd, at 3 PM CDT, 2000Z, we will be performing maintenance on our SmartLink servers. 

    During this upgrade, SmartLink will be temporarily unavailable for up to 4 hours. Should things go well, we expect this time to be much shorter.  

    We understand the importance of SmartLink to our users, and we're excited to share that we are replacing our servers with a new, more robust system.  

    Should there be any change in the schedule or some other important news, we will communicate via: FlexRadio Community, Facebook Enthusiasts Group, and X (formerly known as Twitter) from @FlexRadioSystem.

    Our engineering teams have been hard at work since the SmartLink outage was first reported, and this upgrade represents a significant step forward. Moving to a new server ensures greater reliability and performance, providing our users with a smoother and more dependable SmartLink experience. This upgrade reflects our commitment to constantly improving our services to meet our customers' needs.

    To ensure its reliability, we've been testing the new server intensely with our team internally and recently expanded the testing to our Alpha Team. This new server is optimized for faster response times and will seamlessly scale to accommodate user load, providing you with a better SmartLink experience.

    To take advantage of these changes and ensure SmartLink's correct operation, update your FLEX-6000 radio to SmartSDR v3.5.9 (or v2.10.1 if licensed for v2).   

    It is highly recommended that you update to these latest versions even if you are not a SmartLink user to take advantage of many internal performance improvements.

    SmartSDR v2.10.1

    SmartSDR v3.5.9

    If you have already upgraded your FlexRadio to v2.10.1 or v3.5.9, no further action is required on your part as the server transition will happen transparently during the scheduled maintenance.

    We value your trust in FlexRadio, and we apologize for any inconvenience this maintenance window may cause. 

    Please mark the date and time in your calendar. We appreciate your understanding as we continue to enhance our services.

    Thank you for being a part of our FlexRadio community. If you have any questions or need further assistance, please don't hesitate to reach out to us.

    FlexRadio Systems

    512-535-4713

    FlexRadio HelpDesk

     

  • Eric-KE5DTO
    Eric-KE5DTO Administrator, FlexRadio Employee admin

    Post Incident Review

    Summary

    On Monday, 2023-09-11 0000Z, users attempting to use SmartLink™ experienced timeout errors and no access to their radios.

    The event was triggered by an expired Root CA certificate on which our smartlink.flexradio.com certificate depended.  This was discovered on 2023-09-12 around 0800Z.

    The event was reported both through internal support channels and through the FlexRadio HelpDesk starting around 1700Z.

    This critical severity incident affected all SmartLink users and had secondary performance effects on internet connected radios that were being used on Local Area Networks (LAN).

    The SmartLink service was restored on the server side 2023-09-12 0900Z by installing a new smartlink.flexradio.com certificate. This addressed SmartSDR client connection failures.  However, the updated certificate depended on a new SSL.com Root CA that was not present in the FLEX-6000 radio firmware.

    New SmartSDR software (v3.5.9 and v2.10.1) was released 2023-09-13 1930Z to add the new SSL.com Root CA to the FLEX-6000 radio firmware and restore full SmartLink functionality for users.

    SmartLink server stability has been intermittent since the new software release due to load.  Downtime has been mitigated through server optimization, but ultimately it was determined that a more robust infrastructure would be required to handle the current load and scale more appropriately in the future.

    Improved SmartLink server infrastructure design began 2023-09-14, was internally pre-Alpha tested 2023-09-22, and was released to the Alpha team 2023-09-25.

    Transition to roll out the new SmartLink server infrastructure through scheduled downtime occurred 2023-10-03 2000Z.  SmartLink server stability and uptime improved dramatically.

  • Eric-KE5DTO
    Eric-KE5DTO Administrator, FlexRadio Employee admin

    Leadup

    A previous SmartLink outage due to certificate expiration on 2022-09-07 led to a number of stopgap measures in an attempt to prevent another outage.  These steps included automated expiration date reminders and monthly meetings to review any public-facing services with expirations (certificates, credit cards, etc.) that could cause downtime.

    This led to purchasing an updated smartlink.flexradio.com certificate 30 days early in order to avoid an outage.  While this did not ultimately prevent the outage, it did accelerate the server side fix once the root cause was identified.


    Fault

    While the smartlink.flexradio.com SSL certificate still had 27 days left prior to expiration, the SSL.com Root Certificate Authority (CA) certificate on which the smartlink cert depends expired 2023-09-11.  Because the Root CA is a necessary part of the certificate chain for smartlink.flexradio.com, this expiration resulted in authentication failures from both FLEX-6000 radios and SmartSDR clients.  Failure to update the certificate and the related certificate chain on the server and in the radio firmware prior to the Root CA expiration is the root cause of the incident.
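
The fault described above comes down to a chain being only as long-lived as its shortest-lived certificate. A minimal sketch of that check follows, under stated assumptions: the function names are hypothetical, the chain is modeled as plain `(name, not_after)` pairs rather than parsed certificates, and the sample expiry times are illustrative (the post gives the Root CA expiry date and the 27 days left on the leaf certificate, but no times of day).

```python
from datetime import datetime, timedelta, timezone


def effective_expiry(chain):
    """Return (name, not_after) for the certificate in the chain that expires first."""
    return min(chain, key=lambda entry: entry[1])


def days_remaining(chain, now):
    """Whole days until the chain stops validating; negative once any link has expired."""
    _, not_after = effective_expiry(chain)
    return (not_after - now).days


# Illustrative data only: midnight UTC is an assumed time of day.
root_expiry = datetime(2023, 9, 11, tzinfo=timezone.utc)
leaf_expiry = root_expiry + timedelta(days=27)
sample_chain = [
    ("smartlink.flexradio.com", leaf_expiry),
    ("SSL.com Root CA", root_expiry),
]
```

With this sample data, `effective_expiry(sample_chain)` picks out the Root CA even though the leaf certificate has weeks left, which is why an expiration reminder that tracks only the leaf certificate would not have flagged the problem.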


    Impact

    For over 2 days, all SmartLink users were unable to access their radios remotely.  Once service was ostensibly restored, it was spotty (sometimes radios would not show up on the available list).  
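
As a rough check on the "over 2 days" figure, the timestamps quoted in this thread can be converted to UTC and subtracted. The choice of endpoints is an assumption: this sketch counts from the first reports (around noon CDT on 9/11) to the public software release (2023-09-13 1930Z), and the names `CDT`, `to_zulu`, `first_reports`, and `public_release` are introduced only for illustration.

```python
from datetime import datetime, timedelta, timezone

# Central Daylight Time is five hours behind UTC.
CDT = timezone(timedelta(hours=-5), "CDT")


def to_zulu(local_time):
    """Convert a timezone-aware datetime to UTC ("Z") time."""
    return local_time.astimezone(timezone.utc)


# "Around noon CDT" on 9/11, as stated in the earlier update.
first_reports = to_zulu(datetime(2023, 9, 11, 12, 0, tzinfo=CDT))
# Public release of v3.5.9 and v2.10.1, as stated in the summary.
public_release = datetime(2023, 9, 13, 19, 30, tzinfo=timezone.utc)

outage = public_release - first_reports
```

Under those assumptions the interval works out to 2 days, 2 hours, and 30 minutes.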


    Response

    When engineering became aware of the outage 2023-09-11 1700Z, it was in the middle of installing the updated smartlink.flexradio.com certificate.  This work was stopped to begin investigation of the growing reports of SmartLink problems.  The root cause was identified 2023-09-12 0800Z.  Thanks to the earlier prep and acquisition of an updated smartlink.flexradio.com SSL certificate, the issue was addressed on the server side within an hour by 2023-09-12 0900Z.  This allowed client functionality to be restored with existing software.

    Unfortunately, the newer certificate CA chain was not included in the FLEX-6000 radio firmware.  A patch was generated and released for Alpha testing 2023-09-12 2300Z before being released publicly 2023-09-13 1930Z.

    The SmartLink server stability was impacted by high load due to FLEX-6000 radios and FlexLib based clients (SmartSDR for Windows, SmartSDR for Maestro, SmartSDR for M models) running earlier software versions with the expired certificate.  The combination of the expired certificate and poor connection handling patterns in both the radios and clients overwhelmed the server.
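
One common way to keep failed clients from overwhelming a server is capped exponential backoff with jitter, so that retries spread out instead of arriving in lockstep. The sketch below illustrates that general technique only; it is not the connection handling used in the FLEX-6000 firmware or FlexLib, and every name and default value in it (`backoff_delays`, `connect_with_backoff`, the 1 second base, the 300 second cap) is hypothetical.

```python
import random
import time


def backoff_delays(base=1.0, cap=300.0, attempts=8, rng=random.random):
    """Yield one wait time in seconds per attempt: capped exponential backoff with full jitter."""
    for attempt in range(attempts):
        ceiling = min(cap, base * 2 ** attempt)
        # Full jitter: pick a random point between zero and the current ceiling.
        yield rng() * ceiling


def connect_with_backoff(connect, sleep=time.sleep, **kwargs):
    """Call connect() until it succeeds, sleeping a jittered delay after each failure."""
    for delay in backoff_delays(**kwargs):
        try:
            return connect()
        except ConnectionError:
            sleep(delay)
    raise ConnectionError("gave up after repeated failures")
```

Here `connect` stands in for whatever function opens the TLS session. Because each failed handshake costs the server CPU time, doubling the ceiling after every failure reduces the load that a population of failing clients can generate.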

    Server scaling was initiated 2023-09-15 1830Z that helped to address load issues, but introduced new issues due to race conditions resulting in successful connections to the SmartLink server, but no data being exchanged.  While the timeout errors were mitigated due to handling the load, clients did not receive necessary information regarding available radios and thus SmartLink functionality was again not possible.

    Optimization efforts on the SmartLink server side were initiated prior to the new software release.  The largest improvements were in place by 2023-09-17 1300Z; however, performance was still unsatisfactory because excessive load caused occasional instability, resulting in timeout errors and longer latencies when logging in.

    Additional monitoring was added 2023-09-20 1800Z to mitigate SmartLink server downtime.  Since then, SmartLink server restarts/swaps have been initiated to mitigate server downtime.  This means shorter outages continue to be experienced intermittently.

    As it became clear that the existing infrastructure was insufficient for our current load (and future growth), work began 2023-09-14 1600Z on an improved infrastructure architecture and implementation.

    The new SmartLink server infrastructure moved out of development and into engineering test beginning 2023-09-18.  The pool of testers was expanded outside of engineering 2023-09-22.  Minor performance tweaks and bug fixes were applied before releasing the new SmartLink server for testing by the full Alpha team.  The changes were rolled out publicly during scheduled maintenance on 2023-10-03.

This discussion has been closed.