This post is the first in what I hope will be an interesting series of technology-focused blog posts. The series will look at some of the technical areas the team have been working on recently that are either super interesting or just seemed to be really poorly documented!
We start the series off in AWS land. For those of you who don't know, AWS (Amazon Web Services) is a cloud provider and one of the big hitters in the field. AWS has been around since 2006, has evolved significantly since then, and is a key infrastructure provider for thousands of companies throughout the world.
Here at ATLAS, we've been growing our AWS portfolio over the last 12-18 months. As we continue to grow our in-house portfolio of bespoke software solutions, and work more heavily with our clients on their digital transformation journeys, we've reached the point where we're creating brand new AWS accounts far more regularly than we ever thought we would. First though, some background on why this is important.
The Background
For some time now, ATLAS has owned the Superior-Networks hosting company. Superior-Networks provides VPS hosting for clients ranging from students and enthusiasts to large businesses looking for value server hosting.
Our current approach to hosting applications is what's called "Cloud First": if we can find a SaaS provider that will host the solution for us, we will. Around 12 months ago we moved our corporate Atlassian Jira, Bitbucket and Confluence into the cloud using the Atlassian offering, and we've continued to re-evaluate what we need to host outside of a cloud provider. Where ATLAS has a requirement for a virtual machine, such as Azure's Virtual Machines or AWS's EC2 instances, we use Superior-Networks VPSs.
So far we have adopted AWS as our primary cloud provider. In short, it's the technology stack our team have the most experience with, so it's where we can be most effective. We require that every project has its own isolated AWS accounts with separate access control on a per-account basis, and that every 'Environment' within a project has its own account. For example, if Project Z needs a Sandbox, Development Environment, Staging Environment, User Acceptance Environment, Reference Environment and Operational Environment, we would provision 6 separate AWS accounts for the project.
We adopted this model on the grounds that it significantly reduces the "Blast Zone" if we have to destroy one of the accounts for any reason, be that to make cost savings, because of a security breach, or because it's just no longer required. There is a near-zero cost in setting up new accounts, so it makes sense to maintain a greater number of accounts with a smaller risk of incident within each one.
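To make the account-per-environment model concrete, here is a minimal sketch that enumerates the accounts a project would need. The environment list comes from the Project Z example; the `PlannedAccount` class, the `plan_accounts` helper and the lower-case, hyphenated naming convention are assumptions made up for illustration, not something we actually run.

```python
from dataclasses import dataclass

# Environments from the Project Z example above.
ENVIRONMENTS = [
    "Sandbox",
    "Development",
    "Staging",
    "User Acceptance",
    "Reference",
    "Operational",
]


@dataclass(frozen=True)
class PlannedAccount:
    """One AWS account to be created for a single project environment."""

    project: str
    environment: str

    @property
    def account_name(self) -> str:
        # Hypothetical naming convention: lower-case, hyphen-separated.
        return f"{self.project}-{self.environment}".lower().replace(" ", "-")


def plan_accounts(project, environments=ENVIRONMENTS):
    """Return one planned account per environment for the given project."""
    return [PlannedAccount(project, env) for env in environments]


if __name__ == "__main__":
    for account in plan_accounts("Project Z"):
        print(account.account_name)
```

Six environments means six accounts, so a handful of projects in parallel quickly adds up to dozens of accounts to manage.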
The Problem With Our Model
The issue we have found with this model is how we then manage authentication and authorisation across these accounts. In theory there could be anywhere from 1-2 accounts all the way up to 100+, depending on how many projects we are working on internally and in parallel, so it's critical that we can scale in a sensible way.
We also want our users to have as few identities as possible, so we are embracing SSO wherever we can through our corporate Azure Active Directory (Azure AD) subscription. Finally, we need to be able to set up permissions based on the groups a user is in, or at least to restrict access per account: granting User X Role Y in account 1 but User Z Role N in account 1, and all sorts of other combinations, so that our team can do their work effectively without fighting against access policies.
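The access requirement boils down to a set of (user, account, role) assignments. The sketch below shows one way to model and query that; the user names, account names and helper functions are all hypothetical and only illustrate the shape of the problem, not how Azure AD stores it.

```python
from collections import defaultdict

# Hypothetical (user, account, role) assignments, mirroring the
# "User X Role Y in account 1" example above.
ASSIGNMENTS = [
    ("user.x", "account-1", "Role Y"),
    ("user.z", "account-1", "Role N"),
    ("user.x", "account-2", "Role N"),
]


def roles_by_account(assignments, user):
    """Group the roles a user holds by the account they apply to."""
    result = defaultdict(set)
    for assigned_user, account, role in assignments:
        if assigned_user == user:
            result[account].add(role)
    return dict(result)


def can_assume(assignments, user, account, role):
    """Return True only if the user was explicitly granted the role in that account."""
    return role in roles_by_account(assignments, user).get(account, set())


if __name__ == "__main__":
    print(roles_by_account(ASSIGNMENTS, "user.x"))
    print(can_assume(ASSIGNMENTS, "user.z", "account-1", "Role Y"))
```

Access is deny-by-default here: a user only gets a role in an account if there is an explicit assignment for that exact combination.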
The Solution (Well, Partially)
The best solution was clearly to integrate AWS with our Azure AD and to use the AWS app in the Azure AD Marketplace to manage the single sign-on experience for all of our users.
We adopted this model very quickly, though our original setup had a handful of the team SSO into our "Root" account and then use AWS role switching to assume a role in another account. That really just felt a bit messy and actually created quite a bit of work.
We then started to follow the model that Microsoft outline here. In our opinion it's a solid approach, and it meets all of our requirements around granting individual users specific roles, and in turn specific policies and access, on specific accounts in the estate.
Where this fell down, and what we did to fix it...
The model Microsoft outline is solid. To our frustration, however, it involved going in to every single account to set it up, and while we could probably set each account up in under 30 minutes, that gets quite time-consuming at the scale we wanted to be able to cope with longer term.
There are a few ways to tackle this. We could go back to just role switching within AWS, with a shed load of roles for users to switch into. However, that still involved a non-trivial amount of effort, broke the best practice patterns, and ultimately gave us a single point of failure: if we had to decommission the role switching account for any reason, we would be at very high risk of not being able to continue working effectively.
Microsoft do publish a way of having a single Azure AD app with multiple AWS accounts under that single application here. However, this is something they don't suggest doing, it breaks our understanding of best practice, and it doesn't reduce the overall amount of effort we would have to put in to make this work.
In the end we've stuck with our original setup: one Azure AD app for each AWS account. If a user has multiple roles within that app, AWS handles this really nicely and gives the user the choice of which role to assume (i.e. if we assign an admin both "Administrator" and "Developer" access, they can pick which role they actually need every time they log in).
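For context on how the role choice reaches AWS: the SAML assertion carries one value per role in the `https://aws.amazon.com/SAML/Attributes/Role` attribute, and each value pairs a role ARN with the ARN of the SAML identity provider. The sketch below builds those values. The account ID `123456789012` is a placeholder, the provider name is the one we use later in this post, and `role_claim` is a hypothetical helper written for illustration.

```python
# Name of the SAML attribute AWS reads the available roles from.
ROLE_ATTRIBUTE = "https://aws.amazon.com/SAML/Attributes/Role"


def role_claim(account_id, role_name, provider_name="ATLAS_CORP-AzureAD"):
    """Build one role value: the role ARN and the SAML provider ARN, comma-separated."""
    role_arn = f"arn:aws:iam::{account_id}:role/{role_name}"
    provider_arn = f"arn:aws:iam::{account_id}:saml-provider/{provider_name}"
    return f"{role_arn},{provider_arn}"


if __name__ == "__main__":
    # A user assigned both roles gets two values, and AWS shows a role picker.
    for role in ("Administrator", "Developer"):
        print(role_claim("123456789012", role))  # placeholder account ID
```

Because both ARNs include the account ID, every value is specific to one account, which fits neatly with having one Azure AD app per AWS account.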
We did however need to streamline things, so we ended up creating a CloudFormation template, stored in this GitHub project, that holds the configuration we roll out. The way this works is pretty simple: we enable StackSets with trusted access in the root account of the organisation in AWS. From this account I can specify exactly where I want the CloudFormation stack to be deployed, be that specific accounts, every account in the organisation, or specific OUs defined in Organizations.
The CloudFormation template we've created currently does a few things. It creates an Azure AD processing user that Azure AD can use to list the roles in the account, it creates 2 roles (currently Developer and Administrator) along with their associated policies, and it handles the trust relationship between the roles and our SAML authentication provider.
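To give a feel for what such a template contains, here is a minimal sketch that generates a comparable CloudFormation template as JSON. This is not the template from our GitHub project: the resource names, the choice of the AWS managed policies `AdministratorAccess` and `PowerUserAccess`, and the inline `iam:ListRoles` policy for the processing user are assumptions for illustration. The trust policy uses the standard `sts:AssumeRoleWithSAML` action and `SAML:aud` condition for console sign-in.

```python
import json

# Identity provider name taken from the setup steps in this post.
SAML_PROVIDER_NAME = "ATLAS_CORP-AzureAD"


def saml_trust_policy(provider_name=SAML_PROVIDER_NAME):
    """Trust policy letting users federated through the SAML provider assume a role."""
    provider_arn = {
        # CloudFormation substitutes the account the stack is deployed into.
        "Fn::Sub": "arn:aws:iam::${AWS::AccountId}:saml-provider/" + provider_name
    }
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Federated": provider_arn},
                "Action": "sts:AssumeRoleWithSAML",
                "Condition": {
                    "StringEquals": {"SAML:aud": "https://signin.aws.amazon.com/saml"}
                },
            }
        ],
    }


def saml_role(role_name, managed_policy_arn):
    """An IAM role resource that trusts the SAML provider."""
    return {
        "Type": "AWS::IAM::Role",
        "Properties": {
            "RoleName": role_name,
            "AssumeRolePolicyDocument": saml_trust_policy(),
            "ManagedPolicyArns": [managed_policy_arn],
        },
    }


def build_template():
    """Assemble the processing user and the two roles into one template."""
    list_roles_policy = {
        "Version": "2012-10-17",
        "Statement": [{"Effect": "Allow", "Action": ["iam:ListRoles"], "Resource": "*"}],
    }
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Resources": {
            # Assumed: the processing user only needs to list roles.
            "AzureADProcessingUser": {
                "Type": "AWS::IAM::User",
                "Properties": {
                    "Policies": [
                        {"PolicyName": "ListRolesForAzureAD", "PolicyDocument": list_roles_policy}
                    ]
                },
            },
            # Assumed policy choices; swap in whatever each role should allow.
            "AdministratorRole": saml_role(
                "Administrator", "arn:aws:iam::aws:policy/AdministratorAccess"
            ),
            "DeveloperRole": saml_role("Developer", "arn:aws:iam::aws:policy/PowerUserAccess"),
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_template(), indent=2))
```

Because the roles have explicit names, a stack built from a template like this needs the `CAPABILITY_NAMED_IAM` capability when it is deployed.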
Due to some complexities in how Azure AD and AWS have to work together over SAML, this is not currently a fully automated process. What we currently do looks something like this:
- Log in to the root account (using SSO) and create a new AWS account in the organisation.
- Reset the password for this account and then log in.
- Enable MFA for the new account's root user and set an IAM password policy (the policy is just to keep the IAM dashboard happy).
- Create the Azure AD app for the account using the guide we listed previously, using the AWS account number as a unique identifier.
- Create a new SAML identity provider in the account called "ATLAS_CORP-AzureAD" and upload the metadata that O365 gives you to that SAML identity provider.
- Log back in to the root account, add the new account to the StackSet configuration, and update it so that the CloudFormation is run against this account.
- Once complete, log back in to the account you've created, find the IAM user that was created, generate an access key and ID, and use these for the provisioning tab in Azure AD.
While that is still quite a few steps, I've been able to do the entire process in around 10 minutes. That isn't the shortest, but we've reached the point where automating it any further would be more complex than it's worth at the scale we're working at.
It is certainly possible to fully automate just about all of this. Microsoft have a range of APIs that we could interact with to create a new mailbox alias, create the app in Azure AD and all of that fun stuff. However, I think 10 minutes is a perfectly acceptable amount of time to get things working.
We've still got some improvements to make to this process. We've documented these on the GitHub project, and we welcome any feedback or contributions you may have.
We've decided to make the project open source, as it was surprisingly difficult to find anything on the web that already did this, and while it's nothing super complex, it works for what we need it to do. We did look at automating step 7 (generating the access key and ID) as well. However, you would still need to log in to the account and view the CloudFormation stack in that account to get the access key and ID, and at that point it makes sense to generate them once by hand and make sure that the credentials can't be accessed again.
I hope you found this interesting, and hopefully it might just save someone else some time when they come to need to do this themselves!