DTX Wednesday

Architecting for Efficiency

The panel discussed how platform teams can improve developer experience. The key takeaways were to enable fast release cycles via efficient pipelines, and to design and offer golden paths for developing and deploying the services an organisation needs. Communication is also key to the success of a platform – developer input and feedback should be gathered about platform features, and platform engineers should take time to explain the key metrics which are captured and help developers use the different tools which are provided. An internal developer portal is useful for allowing developers to spin up environments where they can run, test and deploy code – these steps should be automated and help reduce defects in production.

An interesting point was made about providing incident roles which developers can assume when things go wrong. We currently all have admin rights to our AWS accounts, so this could be an improvement to implement when we start removing admin access.

One way to improve communication between platform teams and developers is to have a platform team member occasionally sit in on stand-ups or retros and listen for pain points – developers may also have ideas on how the platform could be used in different ways.

To enable a golden-path developer experience, the platform can modularise as much as possible and abstract the deployment and resource specification.
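As a rough sketch of what an abstracted deployment specification could look like (all names and defaults here are hypothetical, not from the talk): the service team declares only what is unique to their service, and the platform expands it with golden-path defaults.

```python
from dataclasses import dataclass, field

# Hypothetical abstracted deployment spec: the platform supplies pipeline,
# runtime and sizing defaults; teams only declare what is service-specific.
@dataclass
class ServiceSpec:
    name: str
    runtime: str = "python3.12"      # assumed platform default
    memory_mb: int = 512             # assumed platform default
    env: dict = field(default_factory=dict)

    def to_deployment(self) -> dict:
        # The platform expands the minimal spec into a full deployment
        # definition (represented here as a plain dict for illustration).
        return {
            "service": self.name,
            "runtime": self.runtime,
            "memory_mb": self.memory_mb,
            "env": self.env,
            "pipeline": "golden-path/default",  # provided by the platform
        }

spec = ServiceSpec(name="orders-api", env={"LOG_LEVEL": "info"})
deployment = spec.to_deployment()
```

The point of the abstraction is that a team never writes pipeline or networking config directly – changing the golden path changes it for every service at once.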

CloudWatch Magic

A short talk on some CloudWatch features which can improve metrics and observability. The key points were:

  • Embedded Metric Format – you can produce logs in the metric format, which are picked up by CloudWatch Logs and automatically added to your metrics.
  • Metric Filters – by setting up metric filters you can create metrics using RegEx patterns which analyse your logs.
  • Subscription Filters – you can send logs to another service (Kinesis, Lambda) to be processed elsewhere – potentially a central logging account.
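As a sketch of the Embedded Metric Format, a log line shaped like the following is picked up by CloudWatch Logs and extracted into a metric automatically, with no PutMetricData call (the structure follows the EMF specification; the namespace, dimension and metric names are made up for illustration):

```python
import json
import time

# Minimal Embedded Metric Format (EMF) record. The "_aws" block tells
# CloudWatch which keys in the log line are metrics and dimensions.
emf_record = {
    "_aws": {
        "Timestamp": int(time.time() * 1000),
        "CloudWatchMetrics": [
            {
                "Namespace": "MyApp",  # illustrative namespace
                "Dimensions": [["Service"]],
                "Metrics": [{"Name": "OrdersProcessed", "Unit": "Count"}],
            }
        ],
    },
    "Service": "orders-api",   # dimension value
    "OrdersProcessed": 1,      # metric value
}

# Printing this from a Lambda sends it to CloudWatch Logs, where the
# metric is extracted automatically.
print(json.dumps(emf_record))
```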

Demystifying Accessibility in Development

A very interesting talk about setting standards and creating applications which consider accessibility first. The attached slides contain some great resources on the dos and don’ts for different types of disabilities. It would be really good to provide similar resources to developers so we know what to consider during development.

Monoliths in the 21st Century

A panel of AWS employees discussed how amazing Serverless technologies are… They also made some great points: monoliths are not just legacy codebases, and some microservice implementations are just as complicated – if the services are not independent of one another, they could be classified as a distributed monolith. Some areas where monoliths remain useful are FinTech, Big Data and Machine Learning workloads. The panel agreed that they would always reach for Serverless tools first for speed, efficiency and security when starting a project, but accepted that if performance or cost becomes an issue, a monolith deployed on servers or containers (Fargate) could be a better solution. They referenced the Amazon Prime article and said that some people had misread it – only one small part of the workflow had been switched from Lambda to a server. They also noted that starting to hit API limits can be a red flag that Serverless technologies are being misused, or that the workload may be better suited to a different stack.

The panel discussed how Serverless can make it easier for new developers to be introduced to IaC from the beginning of a product’s journey. Some of the services discussed were App Runner, Control Tower and CodeCatalyst, which can help develop different services on AWS. Amazon Verified Permissions is a new service which can verify a user is allowed to carry out different actions throughout a workflow.

DTX Thursday

Applied AI – Using in Life and Business

An introduction to AI and Large Language Models was given, followed by examples of how AI could improve your day-to-day life: small things such as your alarm waking you once it learns how your sleep cycles work, or an overview of your day’s meetings – even picking up bugs which occur and suggesting fixes which you could review in a PR created for you. A platform called Zapier was suggested, which can take output from one platform and push it to another for you. An example for us at ATG: a Slack post with certain content about a new idea could automatically trigger a ticket being created in Jira. Other tools mentioned were LangChain and Pinecone, which can improve the use of AI tools.

Building a Serverless Mindset

An AWS Senior Architect gave a talk about how to get into the Serverless-first mindset. Workloads can be run in containers on ECS, EKS, Fargate or App Runner – these are best suited to workloads which require low latency or long-running processes. Lambda, which we are mostly well aware of, abstracts this even further by just running your application logic. One of the main sticking points for Lambda is cold starts. These can be improved by keeping the bundle small and moving reusable code outside the main handler function so that it can be reused on subsequent invocations (warm starts). Another way to reduce cold starts is provisioned concurrency, which keeps environments loaded within Lambda; however, there is a cost to this. X-Ray can be a good tool for finding which parts of a Serverless system are bottlenecks taking up the most time. Another way to improve the speed of a Lambda is to tune the memory available to it. AWS offers a tool called Power Tuning, which could be useful for determining how much memory we allocate to Lambdas. I believe I have seen that on some stacks we default to 1 GB – we could potentially save costs by adjusting the amount of memory we allocate. Providing more memory also adjusts the number of vCPUs available, so in some instances increasing the memory can actually help reduce costs.
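The warm-start pattern mentioned above can be sketched like this – anything created at module level is initialised once per execution environment and reused on subsequent invocations (the init counter is just an illustrative stand-in for an expensive setup step such as creating an SDK client):

```python
import json

# Counts how many times the expensive setup ran; in a real Lambda this
# would be the cost of creating an SDK client or loading configuration.
INIT_COUNT = 0

def _build_client():
    global INIT_COUNT
    INIT_COUNT += 1
    return {"connected": True}  # stand-in for a real client object

CLIENT = _build_client()  # runs once per cold start, outside the handler

def handler(event, context):
    # Warm invocations reuse CLIENT; only this function body runs again.
    return {"statusCode": 200, "body": json.dumps({"inits": INIT_COUNT})}
```

Calling `handler` repeatedly within one environment leaves `INIT_COUNT` at 1, which is the warm-start saving: the setup cost is paid only on the cold start.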

Developers should embrace managed services where they can, so they can take advantage of all the work and testing which has gone into running those services at scale. Orchestration with Step Functions was also discussed: it can completely remove the need to write lots of client-library code for SDK calls to different AWS services, you get retries and exponential back-off built in, and you no longer have to manage package updates for client libraries. I asked a question about the testing and local development of Step Functions, as this is one of the biggest pain points – unfortunately the AWS experts did not have a great idea how to solve this, but they said they will put me in contact with their Serverless expert… Finally, event design was discussed using the metadata–data pattern which we already follow. Some useful keys shown in the metadata were version (allows breaking changes), trace ID, event ID (can be used to assist with idempotence in services), event type, published by and date.

Why Architecture Resiliency Matters

The talk highlighted some of the key benefits of resilient architecture: scalability, cost efficiency, fast recovery… Four key areas were given to focus on: anticipating, monitoring, responding and learning. By looking at these different areas, you can work out how best to manage architecture in a disaster scenario – or if someone ends up deploying something dodgy into prod! AWS obviously offer a set of tools to help here as part of the AWS Resilience Hub: by providing your workloads and some information about how you handle certain events, it will determine how you would cope with different disruption types – application errors, infrastructure failures, an availability zone or a whole region going down. Monitoring consists of metrics, traces and logs – basic logs alone would not be enough, and custom metrics and logs can be added to code to help determine errors occurring in our applications. Responding should be event driven: events can be triggered upon failures to notify other systems. Having a look into running some type of chaos engineering in a UAT environment can give an indication of how resilient our services and applications are.
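The event-driven responding idea could look roughly like this sketch – on failure, emit a structured error log (which a metric filter or alarm could match on) and publish a failure event for other systems; the publish step is simulated with a list rather than a real SNS/EventBridge call, and all names are illustrative:

```python
import json
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("orders-api")

published_events = []  # stand-in for an event bus such as EventBridge

def publish(event: dict) -> None:
    # In a real system this would push to SNS/EventBridge.
    published_events.append(event)

def process(order: dict) -> bool:
    try:
        if "orderId" not in order:
            raise ValueError("missing orderId")
        return True
    except ValueError as err:
        # Structured log line that a CloudWatch metric filter could match.
        log.error(json.dumps({"metric": "OrderFailure", "reason": str(err)}))
        # Failure event that downstream responders can subscribe to.
        publish({"eventType": "OrderProcessingFailed", "reason": str(err)})
        return False

ok = process({})  # triggers the failure path
```

The useful property is that the responder is decoupled: whatever subscribes to `OrderProcessingFailed` can page someone, retry, or open a ticket without the producer knowing about it.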

GitOps Club: The first rule of GitOps club is talk about GitOps Club (to anyone who will listen)

Fascinating talk on GitOps – a good resource on what it is all about is available on GitLab. The key point was that by following GitOps practices you can reduce risk and deliver changes faster by deploying small incremental changes to production. The core principles discussed were: sell GitOps practice to all developers; use Git as the source of truth for all code and infrastructure; and deploy little and often, with a mentality of 1 ticket = 1 PR. To enable changes to be deployed to production continuously, you can adopt feature flags, so that features can be deployed behind flags instead of piling up waiting for a release date. Automation of everything, everywhere is key to enabling developers to follow GitOps best practices, and commits should always roll forward – if a bug makes its way through the system, create another PR to revert or resolve the issue. Another idea discussed was that each developer or team could have their own environment to deploy their changes to, automated from the creation of the PR. Environment branches could also be available so that developers could push their code into Staging or UAT for tests to be run against the changes before they are deployed – one thing to note here is that as soon as a branch is merged into main, the Staging and UAT environments should honour these changes as well.
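The feature-flag idea can be sketched minimally like this – the new code path is merged and deployed dark, and the flag flips it on per environment. A real setup would use a flag service; an environment variable and the `FEATURE_*` naming are assumptions for the example:

```python
import os

# Minimal feature-flag sketch: features ship to production disabled and
# are enabled per environment via configuration.
def flag_enabled(name: str) -> bool:
    # Reads flags like FEATURE_NEW_CHECKOUT=true (naming is illustrative).
    return os.environ.get(f"FEATURE_{name.upper()}", "false").lower() == "true"

def checkout(order: dict) -> str:
    if flag_enabled("new_checkout"):
        return "new checkout flow"   # merged to main, but dark until flipped
    return "existing checkout flow"

os.environ["FEATURE_NEW_CHECKOUT"] = "true"  # e.g. enabled in Staging only
result = checkout({"orderId": "123"})
```

This keeps main always releasable: merging a half-finished feature is safe because production keeps taking the existing path until the flag flips.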