SmartLink Post-mortem for October 14, 2023 Outage

Mike-VA3MW Administrator, FlexRadio Employee, Community Manager, Super Elmer, Moderator admin

Background

The SmartLink system uses a secrets storage system to store sensitive information securely. These secrets might be, for example, credentials for accessing a resource such as a database. This complies with industry best practices for securing online services like SmartLink. The secrets storage system has a permissions system that allows only authorized users and applications to access the secrets. A third-party library allows the SmartLink server application to access the appropriate secrets when the server application first starts up.

There is a current issue with the SmartLink server application that causes the application to crash approximately every two days. This is not a large issue since there are systems that automatically restart the application in case of a crash, and there is no noticeable interruption to the users thanks to failover features of the server. Nevertheless, Software Engineering has been working on fixes to this issue. Since the issue only seems to happen in production and not in local test instances, Software Engineering has been working on deploying a development environment at our cloud provider.

Incident Description

On Friday, 2023-10-13, Software Engineering was working on deploying a test environment to our cloud provider. As part of this work, a new secret was added to the secrets storage system. Since the production environment did not need access to this secret, it was not granted permission to read it.

On Saturday, 2023-10-14 at approximately 11:18 UTC, the SmartLink server application restarted because of a crash. During startup, the third-party library attempted to reload the secrets from secrets storage. Because the production application did not have permission to read the newly added secret, it crashed shortly after startup. The system restarted the application repeatedly, but each restart failed the same way, so no user requests were serviced.
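
To illustrate the kind of failure described above, here is a minimal Python sketch. It is not the SmartLink code or the actual third-party library: `SecretStore`, `load_all`, `load_required`, and the secret names are all hypothetical, and the sketch assumes the loader reads every secret in the store rather than only the ones the application needs.

```python
class SecretStore:
    """Hypothetical in-memory stand-in for a secrets storage system."""

    def __init__(self):
        self._values = {}
        self._readers = {}

    def put(self, name, value, readers):
        # Store a secret along with the set of callers allowed to read it.
        self._values[name] = value
        self._readers[name] = set(readers)

    def list_names(self):
        return sorted(self._values)

    def get(self, name, caller):
        if caller not in self._readers[name]:
            raise PermissionError(f"{caller} may not read {name}")
        return self._values[name]


def load_all(store, caller):
    """Eager loader: reads every secret, so a single denial aborts startup."""
    return {name: store.get(name, caller) for name in store.list_names()}


def load_required(store, caller, required):
    """Narrow loader: reads only the secrets the caller actually needs."""
    return {name: store.get(name, caller) for name in required}


store = SecretStore()
store.put("prod/db-password", "s3cret", readers={"prod-app"})

# Before the new secret exists, the eager loader works for production.
assert load_all(store, "prod-app") == {"prod/db-password": "s3cret"}

# A secret is added for the test environment only (assumed name).
store.put("dev/db-password", "devpass", readers={"dev-app"})

try:
    load_all(store, "prod-app")
    startup_ok = True
except PermissionError:
    # The next production restart now fails, and keeps failing on every retry.
    startup_ok = False
```

Under these assumptions, `load_required(store, "prod-app", ["prod/db-password"])` still succeeds after the new secret is added, because it never touches secrets that production has no permission to read.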

At approximately 14:30 UTC, Engineering responded to reports from support indicating issues with SmartLink and began debugging. Once the issue was identified, a new version of the server application was deployed to the production environment. At approximately 16:56 UTC, the issue was resolved and normal operation was restored.

Remedial Actions

  • The third-party library used to access the secrets does so in a way that is prone to breaking. There are alternative methods of passing the secrets to the application software, and Engineering restored normal operation by switching to one of them. Some refinements to this process are still needed to improve security, and Software Engineering will continue to make these.
  • There needs to be more separation between the production and development or testing instances. Software Engineering will be creating a more isolated environment according to our cloud provider’s best practices. There are also mitigations that can be performed in the meantime to attempt to protect the production environment from similar unintended conflicts.
  • Software Engineering continues to work on the precipitating issue causing the application software to crash every few days.
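
As a rough illustration of the first remedial action, the sketch below shows one common alternative for passing secrets to an application: environment variables set by the deployment system. The post does not say which method Engineering actually used, so this is an assumption, and the variable name `SMARTLINK_DB_PASSWORD` and the function `read_secrets` are hypothetical.

```python
import os

# Hypothetical list of the secrets this application needs at startup.
REQUIRED_SECRETS = ("SMARTLINK_DB_PASSWORD",)


def read_secrets(environ=os.environ, required=REQUIRED_SECRETS):
    """Read only the named secrets and fail with a clear message if any is missing."""
    missing = [name for name in required if not environ.get(name)]
    if missing:
        raise RuntimeError("missing required secrets: " + ", ".join(missing))
    # Anything else present in the environment is ignored.
    return {name: environ[name] for name in required}
```

Because the application asks only for the names it needs, secrets added for other environments cannot affect its startup. For example, `read_secrets({"SMARTLINK_DB_PASSWORD": "x", "DEV_ONLY_SECRET": "y"})` returns just the production entry.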


Comments

  • Dan Trainor
    Dan Trainor Member ✭✭✭

    Thank you for the transparency.

  • Bob  KN4HH
    Bob KN4HH Member ✭✭

    Mike, could this ongoing issue have any bearing on the problem I have had ever since the initial outage? I only get a partial login on my Maestro when remoting my Flex 6500 (basic display, no panadapter, no audio).

    Yesterday I got four successful remote connections. Today no complete connections.


    Bob, KN4HH

  • Mike-VA3MW
    Mike-VA3MW Administrator, FlexRadio Employee, Community Manager, Super Elmer, Moderator admin

    No, not at all.

    Something is blocking the VITA49 packets, which are not controlled by SmartLink at all.

    Please open a support ticket.

    73