AWS in production: EC2, SSM, IAM and how they fit together

Tags: aws, ec2, ssm, iam, infrastructure

The problem

Imagine you're building a product where every user gets their own private server. Not a shared backend. A real isolated machine dedicated to them, running their own AI assistant with their own API keys and bot tokens.

Now ask the hard questions:

  1. How do you launch one server per user, on demand, in seconds?
  2. Where do you store each user's secrets so they're not pasted in code or environment files?
  3. How does the server pick up its own config when it boots, with zero manual setup?
  4. How do you push a software update to all 1,000 servers tomorrow without SSHing into each one?
  5. And how does any of this stay secure?

Those five questions map almost one-to-one onto the pieces of AWS this post covers. I'll walk through the five building blocks that answer them: EC2, UserData (a feature of EC2 rather than a separate service), IAM, SSM Parameter Store, and SSM Run Command. I'll use a real system I built as the example. The goal isn't to teach AWS in the abstract. It's to show you the moving parts of a production setup, in the order they matter.

If you're prepping for an interview, this is the mental model that interviewers want to see. If you're learning AWS, this is the path that makes the docs finally click.

The solution at a glance

Here's the picture before we zoom in:

A user signs up. Your backend tells AWS "launch a new EC2 instance for them." The instance boots, runs a startup script, pulls the user's secrets from SSM Parameter Store, and starts serving traffic. Later, when you ship a new version, you don't SSH into anything. You fire an SSM Run Command at every instance and they update themselves. None of this works without IAM roles giving each piece permission to talk to the next.

Let's break it down service by service.


EC2: the server itself

EC2 (Elastic Compute Cloud) is the simplest AWS service to explain: it's a computer you rent by the second.

You pick a size (how much CPU and RAM) and an operating system image, called an AMI (Amazon Machine Image), which is basically a snapshot of a pre-configured Linux. Click launch, and about 30 seconds later you have a Linux box on the public internet with an IP address.

That's it. That's EC2.

Sizes are called instance types. Smaller and cheaper at the top, bigger and more expensive as you go down. The naming convention is family.size: the family hints at the workload (general-purpose, memory-optimized, compute-optimized) and the size scales the resources.

| Instance type | vCPUs | RAM | Rough monthly cost (on-demand) | Good for |
|---|---|---|---|---|
| t3.micro | 2 | 1 GB | ~$8 | Hobby projects, light APIs |
| t3.small | 2 | 2 GB | ~$15 | Small services, bots, side projects |
| t3.medium | 2 | 4 GB | ~$30 | Production web apps with real traffic |
| m5.large | 2 | 8 GB | ~$70 | Memory-hungry workloads, ML inference |

The system I built uses t3.small per user: 2 vCPUs and 2 GB RAM, because each instance just runs a lightweight Node service and a Telegram bot.
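To get a feel for what "one instance per user" costs at scale, here is a small back-of-the-envelope sketch. It only multiplies the rough on-demand figures from the table above by a user count; the function name is made up for illustration, and real bills also include storage, data transfer, and any discounts.

```typescript
// Rough on-demand monthly prices in USD, copied from the table above (approximate).
const ROUGH_MONTHLY_USD: Record<string, number> = {
  "t3.micro": 8,
  "t3.small": 15,
  "t3.medium": 30,
  "m5.large": 70,
};

// Estimate the monthly compute bill for a fleet with one instance per user.
// Hypothetical helper: compute only, no EBS, bandwidth, or savings plans.
function estimateFleetCost(instanceType: string, userCount: number): number {
  const perInstance = ROUGH_MONTHLY_USD[instanceType];
  if (perInstance === undefined) {
    throw new Error(`No rough price recorded for ${instanceType}`);
  }
  return perInstance * userCount;
}

// 1,000 users on t3.small at roughly $15 each per month.
console.log(estimateFleetCost("t3.small", 1000)); // 15000
```

At that size the instance type matters a lot: moving the same fleet to t3.medium would roughly double the figure.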

The lifecycle of an instance

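When you "launch" an EC2 instance, it isn't ready instantly. It moves through phases: the instance state goes from pending to running, and then the two status checks go from initializing to ok.

In code, you observe this by polling AWS for three signals. Only when the instance is running and both status checks report "ok" is the instance truly ready to serve traffic: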

| Signal | What it tells you |
|---|---|
| instanceState | The VM exists and is running (pending → running). |
| systemStatus | The underlying AWS hardware is healthy (initializing → ok). |
| instanceStatus | The OS booted and is reachable on the network (initializing → ok). |
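The polling logic can be sketched as follows. This is a minimal illustration, not the system's actual code: `fetchSignals` is a hypothetical callback that, in a real setup, would wrap a DescribeInstanceStatus call and map its response into the three signals from the table; the attempt count and delay are arbitrary defaults.

```typescript
// Field names mirror the three signals in the table above.
interface InstanceSignals {
  instanceState: string;  // e.g. "pending", "running"
  systemStatus: string;   // e.g. "initializing", "ok"
  instanceStatus: string; // e.g. "initializing", "ok"
}

// Ready means: the VM is running AND both status checks passed.
function isInstanceReady(s: InstanceSignals): boolean {
  return (
    s.instanceState === "running" &&
    s.systemStatus === "ok" &&
    s.instanceStatus === "ok"
  );
}

// Poll until the instance is ready or we run out of attempts.
// fetchSignals is an assumed callback supplied by the caller.
async function waitUntilReady(
  fetchSignals: () => Promise<InstanceSignals>,
  maxAttempts = 40,
  delayMs = 5000,
): Promise<boolean> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    if (isInstanceReady(await fetchSignals())) return true;
    await new Promise<void>((resolve) => setTimeout(resolve, delayMs));
  }
  return false; // caller decides whether to retry or terminate the instance
}
```

Keeping the readiness rule in its own pure function makes it easy to unit test without touching AWS.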

Launching one

Here's the AWS CLI version:

aws ec2 run-instances \
  --image-id ami-0abcd1234efgh5678 \
  --instance-type t3.small \
  --iam-instance-profile Name=OpenClawInstanceProfile \
  --user-data file://bootstrap.sh \
  --tag-specifications 'ResourceType=instance,Tags=[{Key=user,Value=alice}]'

And the same thing in TypeScript using AWS SDK v3:

import { EC2Client, RunInstancesCommand } from "@aws-sdk/client-ec2";

const ec2 = new EC2Client({ region: "us-east-1" });

const result = await ec2.send(new RunInstancesCommand({
  ImageId: "ami-0abcd1234efgh5678",
  InstanceType: "t3.small",
  MinCount: 1,
  MaxCount: 1,
  IamInstanceProfile: { Name: "OpenClawInstanceProfile" },
  UserData: Buffer.from(bootstrapScript).toString("base64"), // bootstrapScript: the UserData shell script as a string
  TagSpecifications: [{
    ResourceType: "instance",
    Tags: [{ Key: "user", Value: "alice" }],
  }],
}));

const instanceId = result.Instances?.[0]?.InstanceId; // undefined if nothing was launched

A few things to notice: you pass an IamInstanceProfile (we'll get to IAM soon; this is what lets the instance talk to other AWS services), you pass UserData (a script that runs on first boot, covered in the next section), and you tag the instance so you know which user it belongs to.

Note

Why EC2 vs ECS / Fargate / Lambda?

EC2 gives you a full Linux machine you control. ECS and Fargate run containers, which are great if your workload is stateless and short-lived. Lambda runs functions, capped at 15 minutes per invocation.

I chose EC2 because each user's instance is long-running, has persistent local state, runs a daemonized Node process, and benefits from being a "real machine" rather than a container. If your workload were stateless HTTP, Fargate would be a better fit.


Security groups: the firewall around your instance

Your EC2 instance has a public IP. Without something in front of it, anyone on the internet can hit any port on it: SSH, HTTP, your dev server you forgot to shut down, the works. That "something in front" is a security group.

A security group is a stateful virtual firewall attached to your instance. Two things to remember:

  1. Default behavior is paranoid. Inbound: deny everything. Outbound: allow everything. You explicitly open ports for incoming traffic; outgoing traffic is unrestricted unless you tighten it.
  2. It's stateful. If you allow inbound traffic on port 443, the response packets are automatically allowed back out. You don't write rules for "return traffic". That is a key difference between a security group and a network ACL.

Inbound vs outbound rules

| Direction | What it controls | Typical real-world rules |
|---|---|---|
| Inbound | Traffic arriving at your instance | Port 22 (SSH) from your IP only; port 443 (HTTPS) from anywhere; port 80 (HTTP) from anywhere |
| Outbound | Traffic leaving your instance | Usually wide open (0.0.0.0/0 on all ports), which is needed for apt install, SSM, NTP, etc. |

Each rule has four parts: protocol (TCP/UDP/ICMP), port range (e.g. 443 or 8000-9000), source/destination (an IP CIDR like 10.0.0.0/16, or another security group ID), and an optional description.
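To make the "deny by default, allow-only rules" model concrete, here is a toy evaluator for inbound rules. It is a simplified sketch for IPv4 CIDR sources only (no security-group sources, no statefulness), and the type and function names are invented for illustration; it is not how AWS implements security groups.

```typescript
interface InboundRule {
  protocol: "tcp" | "udp" | "icmp";
  fromPort: number;
  toPort: number;
  cidr: string; // e.g. "203.0.113.42/32"
  description?: string;
}

// Convert a dotted IPv4 address to an unsigned 32-bit integer.
function ipToInt(ip: string): number {
  return ip.split(".").reduce((acc, octet) => ((acc << 8) | Number(octet)) >>> 0, 0);
}

// True when `ip` falls inside the CIDR block.
function cidrContains(cidr: string, ip: string): boolean {
  const parts = cidr.split("/");
  const base = parts[0] ?? "";
  const bits = Number(parts[1] ?? "32");
  // A shift by 32 is a no-op in JavaScript, so /0 needs a special case.
  const mask = bits === 0 ? 0 : (0xffffffff << (32 - bits)) >>> 0;
  return ((ipToInt(base) & mask) >>> 0) === ((ipToInt(ip) & mask) >>> 0);
}

// Default deny: traffic is allowed only if at least one rule matches.
function isInboundAllowed(
  rules: InboundRule[],
  protocol: InboundRule["protocol"],
  port: number,
  sourceIp: string,
): boolean {
  return rules.some(
    (r) =>
      r.protocol === protocol &&
      port >= r.fromPort &&
      port <= r.toPort &&
      cidrContains(r.cidr, sourceIp),
  );
}

const webRules: InboundRule[] = [
  { protocol: "tcp", fromPort: 443, toPort: 443, cidr: "0.0.0.0/0", description: "HTTPS from anywhere" },
  { protocol: "tcp", fromPort: 22, toPort: 22, cidr: "203.0.113.42/32", description: "SSH from office" },
];

console.log(isInboundAllowed(webRules, "tcp", 443, "198.51.100.7")); // true
console.log(isInboundAllowed(webRules, "tcp", 22, "198.51.100.7"));  // false
```

Note that there is no "deny" rule anywhere: a packet that matches nothing is simply dropped.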

A real example

Here's a security group for a typical web server with HTTPS open to the world, SSH locked to your office IP:

# Create the SG
aws ec2 create-security-group \
  --group-name web-server-sg \
  --description "Web server: HTTPS public, SSH restricted"

# Allow HTTPS from anywhere
aws ec2 authorize-security-group-ingress \
  --group-name web-server-sg \
  --protocol tcp --port 443 --cidr 0.0.0.0/0

# Allow SSH only from your office IP
aws ec2 authorize-security-group-ingress \
  --group-name web-server-sg \
  --protocol tcp --port 22 --cidr 203.0.113.42/32

The same setup in TypeScript using AWS SDK v3:

import {
  EC2Client,
  CreateSecurityGroupCommand,
  AuthorizeSecurityGroupIngressCommand,
} from "@aws-sdk/client-ec2";

const ec2 = new EC2Client({ region: "us-east-1" });

const { GroupId } = await ec2.send(new CreateSecurityGroupCommand({
  GroupName: "web-server-sg",
  Description: "Web server: HTTPS public, SSH restricted",
}));

await ec2.send(new AuthorizeSecurityGroupIngressCommand({
  GroupId,
  IpPermissions: [
    {
      IpProtocol: "tcp",
      FromPort: 443,
      ToPort: 443,
      IpRanges: [{ CidrIp: "0.0.0.0/0", Description: "HTTPS from anywhere" }],
    },
    {
      IpProtocol: "tcp",
      FromPort: 22,
      ToPort: 22,
      IpRanges: [{ CidrIp: "203.0.113.42/32", Description: "SSH from office" }],
    },
  ],
}));

You then attach the SG when launching the instance via --security-group-ids sg-abc123 (CLI) or SecurityGroupIds: ["sg-abc123"] (SDK).

Chaining security groups

You can use another security group as the source instead of an IP range. This is huge for multi-tier apps. Example: your web tier should only let your load balancer talk to it.

Web SG inbound rule:
  Protocol: TCP, Port: 443, Source: sg-loadbalancer (the LB's SG)

Now even if someone discovers the web server's IP, they can't connect. Only instances inside the LB's security group can. And the rule keeps working even if you scale the LB and its IPs change. This is the AWS-native way to express "service A can talk to service B" without hardcoding IPs.
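In the SDK, a security-group source is expressed with UserIdGroupPairs in place of IpRanges inside an IpPermissions entry. The helper below is a hypothetical builder that only produces that object, assuming you pass the result to AuthorizeSecurityGroupIngressCommand as in the earlier example; sg-loadbalancer is the placeholder ID from the rule above.

```typescript
// Shape of one IpPermissions entry that references another security group.
interface SgSourcePermission {
  IpProtocol: string;
  FromPort: number;
  ToPort: number;
  UserIdGroupPairs: { GroupId: string; Description?: string }[];
}

// Build "allow TCP on `port` from members of `sourceGroupId`".
// Hypothetical helper; it does not call AWS.
function allowFromSecurityGroup(
  port: number,
  sourceGroupId: string,
  description?: string,
): SgSourcePermission {
  const pair: { GroupId: string; Description?: string } = { GroupId: sourceGroupId };
  if (description !== undefined) pair.Description = description;
  return {
    IpProtocol: "tcp",
    FromPort: port,
    ToPort: port,
    UserIdGroupPairs: [pair],
  };
}

const webFromLb = allowFromSecurityGroup(443, "sg-loadbalancer", "HTTPS from the load balancer");
console.log(JSON.stringify(webFromLb, null, 2));
```

Because the rule names a group rather than addresses, nothing here changes when the load balancer's IPs do.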

Tip

Three habits that prevent most "why can't my instance reach X?" headaches.

  1. Never open port 22 to 0.0.0.0/0. Always restrict SSH to your IP. Better yet, skip SSH entirely and use SSM Session Manager (we'll get to it).
  2. Name and tag your security groups. sg-0abc123 tells you nothing in a debug session at 2am; web-prod-public does.
  3. When connectivity breaks, check the SG on both ends of the connection. Outbound on the source, inbound on the destination; both have to allow the traffic.

Note

Security Groups vs Network ACLs (NACLs)

Both are firewalls in AWS, but they operate at different layers:

| | Security Group | Network ACL |
|---|---|---|
| Scope | Per-instance (or per-ENI) | Per-subnet (affects every instance in it) |
| Stateful? | Yes, return traffic auto-allowed | No, rules must exist in both directions |
| Default | Deny all inbound, allow all outbound | Allow all (custom NACLs default to deny) |
| Rule types | Allow only | Allow and deny |

Use security groups for almost everything. Reach for NACLs only when you need a deny rule (e.g. block a specific bad IP from an entire subnet) or coarse-grained subnet-level boundaries.


UserData: bootstrapping on first boot

Renting a Linux box is one thing. Getting your software onto it is another. You don't want to SSH in and apt install and git clone every time. That is not a system, that's a hobby.

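EC2 solves this with UserData: a script you attach when launching the instance. On the instance's first boot it runs as root (via cloud-init on standard Linux AMIs), with no human involved. When it finishes, your software is installed, your config is in place, and your service is running. One caveat: EC2's own status checks don't wait for UserData, so an instance can report "ok" while the script is still running.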

Here is a real-world UserData script:

#!/bin/bash
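set -euo pipefail  # no -x: tracing would echo the secrets below into the boot logs

# Detect the AWS region from instance metadata (needed for SSM calls).
# IMDSv2: fetch a session token first; plain GETs fail when IMDSv2 is required.
IMDS_TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" \
  -H "X-aws-ec2-metadata-token-ttl-seconds: 300")
REGION=$(curl -s -H "X-aws-ec2-metadata-token: $IMDS_TOKEN" \
  http://169.254.169.254/latest/meta-data/placement/region)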

# Pull the user's bot token and OpenAI key from SSM
WALLET="0xUserWalletHere"
TELEGRAM_TOKEN=$(aws ssm get-parameter \
  --name "/openclaw/users/$WALLET/telegram-bot-token" \
  --with-decryption \
  --region "$REGION" \
  --query 'Parameter.Value' --output text)

OPENAI_KEY=$(aws ssm get-parameter \
  --name "/openclaw/users/$WALLET/openai-api-key" \
  --with-decryption \
  --region "$REGION" \
  --query 'Parameter.Value' --output text)

# Install Node, then our app
curl -fsSL https://rpm.nodesource.com/setup_20.x | bash -
yum install -y nodejs
npm install -g openclaw

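# Hand secrets to the app via env, then daemonize
cat > /etc/openclaw.env <<EOF
TELEGRAM_BOT_TOKEN=$TELEGRAM_TOKEN
OPENAI_API_KEY=$OPENAI_KEY
EOF
# Keep the secrets file private (adjust ownership if the service reads it as a non-root user)
chmod 600 /etc/openclaw.env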

systemctl enable --now openclaw

# Sentinel file so the backend knows bootstrap finished
touch /var/run/openclaw-ready

The last line is a small but important pattern: the backend polls for that sentinel file (via SSM) to know when the instance has finished setting itself up, beyond what AWS's own health checks can tell us.
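One way the backend could check for that sentinel is to send `test -f` through Run Command and look at the invocation status. The sketch below only builds the request object and interprets a status string; the helper names are hypothetical, and it assumes the SSM Agent is running on the instance and that you send the input with SendCommandCommand and read the status with GetCommandInvocation.

```typescript
// Path written by the last line of the UserData script.
const SENTINEL_PATH = "/var/run/openclaw-ready";

interface ShellCommandInput {
  DocumentName: "AWS-RunShellScript";
  InstanceIds: string[];
  Parameters: { commands: string[] };
}

// Build the SendCommand input that checks for the sentinel on one instance.
function buildSentinelCheck(instanceId: string): ShellCommandInput {
  return {
    DocumentName: "AWS-RunShellScript",
    InstanceIds: [instanceId],
    // `test -f` exits 0 if the file exists and non-zero otherwise.
    Parameters: { commands: [`test -f ${SENTINEL_PATH}`] },
  };
}

// A "Success" invocation means the command exited 0, i.e. the file exists.
// Any other status (for example "Failed" or "InProgress") is treated as "not yet".
function bootstrapFinished(invocationStatus: string): boolean {
  return invocationStatus === "Success";
}

console.log(buildSentinelCheck("i-0123456789abcdef0").Parameters.commands[0]);
// test -f /var/run/openclaw-ready
```

A "Failed" status can also mean something else went wrong, so a real backend would probably pair this with a timeout.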

Note

UserData vs baking an AMI vs Ansible

UserData runs on the first boot of every new instance. That is slow if your setup is heavy (every new instance re-downloads Node, npm-installs your package, etc.).

For faster boots you can bake an AMI: spin up an instance, install everything, take a snapshot, and use that as your ImageId. Future launches are seconds, not minutes.

Ansible is a different approach for ongoing config management. It is overkill for one-shot bootstrap. Most production systems do both: a baked AMI for the heavy stuff, plus UserData for per-instance customization (like fetching that user's specific secrets).
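Per-instance customization usually means the backend renders a slightly different script for each launch. Here is a minimal sketch of that idea. The wallet format (0x followed by 40 hex characters) is an assumption made for illustration, and the function name is hypothetical. The validation matters because the value is interpolated into a script that runs as root.

```typescript
// Assumed wallet format: "0x" + 40 hex characters. Adjust to your real identifiers.
const WALLET_PATTERN = /^0x[0-9a-fA-F]{40}$/;

// Render the per-user part of the bootstrap script.
// Rejects anything that could smuggle shell syntax into a root-run script.
function renderUserData(wallet: string): string {
  if (!WALLET_PATTERN.test(wallet)) {
    throw new Error("Refusing to render UserData for an unexpected wallet format");
  }
  return [
    "#!/bin/bash",
    "set -euo pipefail",
    `WALLET="${wallet}"`,
    "# the rest of the bootstrap script follows: fetch secrets, start the service",
  ].join("\n");
}

const script = renderUserData("0x" + "ab".repeat(20));
console.log(script.split("\n")[2]);
```

The rendered string is what you would base64-encode and pass as UserData when launching the instance.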


IAM: the permission glue

Here is the permission question behind the bootstrap script: who said the EC2 instance is allowed to read secrets from SSM?

The answer is IAM (Identity and Access Management), AWS's permission system. Almost every "why doesn't this work?" question in AWS has an IAM answer.

The three pieces:

| Concept | What it is |
|---|---|
| IAM User | A human (or a CI bot) with a username, password, and maybe API keys. |
| IAM Role | A bundle of permissions that something can temporarily assume, such as an EC2 instance, a Lambda function, or another AWS account. |
| IAM Policy | A JSON document listing what's allowed, for example: "this principal can do ssm:GetParameter on resources matching arn:aws:ssm:*:*:parameter/openclaw/*." |

For our story, the relevant one is IAM Role. When you launch an EC2 instance, you can attach a role to it via an instance profile. From that moment on, any process on the instance can use the AWS CLI or SDK and AWS will treat those calls as being made by that role.

So the aws ssm get-parameter ... calls in the script work only because we attached a role with this policy:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["ssm:GetParameter", "ssm:GetParameters"],
      "Resource": "arn:aws:ssm:us-east-1:123456789012:parameter/openclaw/users/*"
    }
  ]
}

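Read that resource line carefully. It says: this role can read parameters under /openclaw/users/* and nothing else. It can't read /admin/*. It can't read /billing/*. It can't write, delete, or list. This is the principle of least privilege: give the smallest possible permission for the job, never *:*. Note that the wildcard still covers every user's parameters, so for strict per-user isolation you would narrow the resource to that user's own prefix.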
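If you want each instance limited to a single user's parameters, one option is to generate a policy per user with a narrower resource ARN. The sketch below just builds the JSON document; the function name is hypothetical, and whether you attach it as a per-user role or an inline policy is a design choice this post does not prescribe.

```typescript
interface PolicyStatement {
  Effect: "Allow";
  Action: string[];
  Resource: string;
}

interface PolicyDocument {
  Version: "2012-10-17";
  Statement: PolicyStatement[];
}

// Build a read-only policy scoped to one user's parameter prefix.
// Hypothetical helper: it produces the document but does not create anything in IAM.
function buildUserParameterPolicy(
  region: string,
  accountId: string,
  wallet: string,
): PolicyDocument {
  return {
    Version: "2012-10-17",
    Statement: [
      {
        Effect: "Allow",
        Action: ["ssm:GetParameter", "ssm:GetParameters"],
        Resource: `arn:aws:ssm:${region}:${accountId}:parameter/openclaw/users/${wallet}/*`,
      },
    ],
  };
}

const policy = buildUserParameterPolicy("us-east-1", "123456789012", "0xUserWalletHere");
console.log(policy.Statement[0].Resource);
// arn:aws:ssm:us-east-1:123456789012:parameter/openclaw/users/0xUserWalletHere/*
```

The trade-off is more roles to manage, which is one more reason to generate them from code.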

Tip

Don't reach for AdministratorAccess to debug.

When something fails with AccessDenied, the fix is almost never "give it admin." It's almost always "open the policy, find the missing action or resource, add exactly that."

If you're tempted to attach AdministratorAccess to debug, you'll forget to remove it later. (Everyone has done it. That's how it always starts.)

The IAM mental model worth carrying into every interview: "Who is making this call, and what policy says they can?" That sentence is most of AWS security in one line.
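That question can be turned into a toy evaluator, which is a handy way to reason about AccessDenied errors. This is a deliberately simplified sketch: it handles only Allow statements with * wildcards and ignores explicit Deny, conditions, and everything else real IAM evaluation does. All names are invented for illustration.

```typescript
interface AllowStatement {
  Effect: "Allow";
  Action: string[];
  Resource: string;
}

// Match a pattern containing "*" wildcards against a value.
function globMatch(pattern: string, value: string): boolean {
  const escaped = pattern
    .replace(/[.+?^${}()|[\]\\]/g, "\\$&") // escape regex metacharacters except "*"
    .replace(/\*/g, ".*");                 // "*" matches any run of characters
  return new RegExp(`^${escaped}$`).test(value);
}

// Default deny: allowed only if some statement matches both action and resource.
function isAllowed(statements: AllowStatement[], action: string, resource: string): boolean {
  return statements.some(
    (s) => s.Action.some((a) => globMatch(a, action)) && globMatch(s.Resource, resource),
  );
}

const instanceRole: AllowStatement[] = [
  {
    Effect: "Allow",
    Action: ["ssm:GetParameter", "ssm:GetParameters"],
    Resource: "arn:aws:ssm:us-east-1:123456789012:parameter/openclaw/users/*",
  },
];

const userParam = "arn:aws:ssm:us-east-1:123456789012:parameter/openclaw/users/0xabc/openai-api-key";
const adminParam = "arn:aws:ssm:us-east-1:123456789012:parameter/admin/root-password";

console.log(isAllowed(instanceRole, "ssm:GetParameter", userParam));  // true
console.log(isAllowed(instanceRole, "ssm:PutParameter", userParam));  // false
console.log(isAllowed(instanceRole, "ssm:GetParameter", adminParam)); // false
```

When a call fails, the missing piece is almost always one of the two halves: the action or the resource.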


SSM Parameter Store: secrets without hardcoding

Now we can finally talk about SSM Parameter Store, because we have the IAM context for it to make sense.

The problem it solves is small but universal: where do you put secrets? Hardcoding them in code is obviously wrong. .env files committed to git are wrong. .env files baked into AMIs are wrong (anyone with the snapshot has the secret). Even shared S3 buckets are awkward.

SSM Parameter Store is a managed key-value store inside AWS, where:

  • Each parameter has a hierarchical path-style name like /openclaw/users/0xabc.../telegram-bot-token.
  • Values can be a plain String or an encrypted SecureString, which is encrypted with a key from KMS (AWS's key management service).
  • Access is controlled by IAM policies (we just saw one).
  • It's free for standard parameters, dirt cheap for advanced.

That hierarchical naming is the useful part. A path like /openclaw/users/{wallet}/telegram-bot-token lets you write IAM policies that scope exactly to one prefix, and it gives you a clean way to organize secrets by tenant, environment, or service.

Reading a parameter

aws ssm get-parameter \
  --name "/openclaw/users/0xabc.../telegram-bot-token" \
  --with-decryption \
  --query 'Parameter.Value' --output text

The same read in TypeScript:

import { SSMClient, GetParameterCommand } from "@aws-sdk/client-ssm";

const ssm = new SSMClient({ region: "us-east-1" });

const { Parameter } = await ssm.send(new GetParameterCommand({
  Name: `/openclaw/users/${wallet}/telegram-bot-token`,
  WithDecryption: true,
}));

const token = Parameter?.Value;

Writing one (typically from your backend, when the user provides it)

aws ssm put-parameter \
  --name "/openclaw/users/0xabc.../telegram-bot-token" \
  --value "1234:ABC..." \
  --type SecureString \
  --overwrite

Note

Parameter Store vs Secrets Manager

AWS has two services for secrets, and the choice is easy to overthink. Quick comparison:

| Feature | Parameter Store | Secrets Manager |
|---|---|---|
| Cost | Free (standard tier) | ~$0.40 per secret / month |
| Encryption | Yes, via SecureString + KMS | Yes, by default |
| Automatic rotation | No | Yes (native, schedule-driven) |
| Best for | API keys, config, static-ish values | DB credentials needing periodic rotation |

For this system I picked Parameter Store: secrets are user-provided (bot tokens, OpenAI keys) and don't need rotation. If I were storing a DB password I had to rotate every 30 days, Secrets Manager would be the right call.


SSM Run Command: remote control without SSH

The last service. Here's the scenario: you have 500 EC2 instances, one per user, and you just shipped v2.3 of your software. How do you upgrade all of them?

The bad answer: SSH into each one, run npm update -g, hope it works, repeat. That's not infrastructure, that's pain.

The good answer: SSM Run Command lets you tell AWS "run this shell command on these instances." AWS handles delivery, execution, output capture, error reporting, and audit logging (every command shows up in CloudTrail). Your servers don't need open SSH ports. You don't need SSH keys. You don't even need them to be on a public network.

How it works: every EC2 instance with the SSM Agent installed (it comes pre-installed on most modern AMIs) maintains a long-poll connection to AWS. The instance's role must also let the agent register with Systems Manager; the AmazonSSMManagedInstanceCore managed policy covers this. When you submit a command, AWS hands it to the agent, the agent runs it, and the output flows back. From your code's perspective, you send the command once and poll for the result later by command ID.

A real example: pushing an update

aws ssm send-command \
  --document-name "AWS-RunShellScript" \
  --targets "Key=tag:user,Values=alice" \
  --parameters 'commands=["npm update -g openclaw","openclaw doctor"]'

The same command in TypeScript:

import { SSMClient, SendCommandCommand } from "@aws-sdk/client-ssm";

const ssm = new SSMClient({ region: "us-east-1" });

const result = await ssm.send(new SendCommandCommand({
  DocumentName: "AWS-RunShellScript",
  Targets: [{ Key: "tag:user", Values: ["alice"] }],
  Parameters: {
    commands: ["npm update -g openclaw", "openclaw doctor"],
  },
}));

const commandId = result.Command?.CommandId;
// later: poll GetCommandInvocation with commandId + instanceId for output
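
The "poll later" step can be sketched like this. `getStatus` is a hypothetical callback that would wrap GetCommandInvocation for one command ID and instance ID and return its Status field; the attempt count and delay are arbitrary. The terminal values listed are the ones I'd stop polling on.

```typescript
// Invocation statuses that mean the command will not change state again.
const TERMINAL_STATUSES = new Set<string>(["Success", "Failed", "Cancelled", "TimedOut"]);

function isTerminal(status: string): boolean {
  return TERMINAL_STATUSES.has(status);
}

// Poll until the invocation reaches a terminal status or attempts run out.
// getStatus is an assumed callback supplied by the caller.
async function pollCommand(
  getStatus: () => Promise<string>,
  maxAttempts = 30,
  delayMs = 2000,
): Promise<string> {
  let status = "Pending";
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    status = await getStatus();
    if (isTerminal(status)) return status;
    await new Promise<void>((resolve) => setTimeout(resolve, delayMs));
  }
  return status; // still non-terminal: the caller decides whether to keep waiting
}
```

When a command targets many instances, each instance has its own invocation, so you would run this once per instance ID.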

The --targets flag is the useful part. Instead of listing instance IDs, you target by tag. "Run this command on every instance tagged user=alice." Or environment=production. Or version=2.2, to catch everything still on the old release (tag targets match exact values, so there is no version<2.3 filter). Suddenly fleet-wide operations are one API call.
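Since tag targets match exact values, selecting "everything older than 2.3" means listing the older versions explicitly. A small sketch of building such a target follows; the helper name and the version tag values are made up for illustration, and it assumes your instances carry a version tag at all.

```typescript
// One entry of the Targets array accepted by SendCommand.
interface Target {
  Key: string;
  Values: string[];
}

// Build a tag-based target: instances whose tag `key` equals any of `values`.
function tagTarget(key: string, values: string[]): Target {
  if (values.length === 0) {
    throw new Error("A tag target needs at least one value");
  }
  return { Key: `tag:${key}`, Values: values };
}

// Hypothetical list of versions older than 2.3 that are still deployed.
const outdated = tagTarget("version", ["2.0", "2.1", "2.2"]);
console.log(JSON.stringify(outdated));
// {"Key":"tag:version","Values":["2.0","2.1","2.2"]}
```

This only works if the update command also rewrites the version tag afterwards, otherwise the same instances match again next time.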

Note

Run Command vs SSH vs Ansible/SaltStack

SSH gives you the most control, but it requires open ports and key management, and it doesn't scale.

Ansible and SaltStack are real config-management tools that solve the same problem at a higher level (declarative state, inventories, playbooks). They are more powerful for complex orchestration, but heavier to operate.

Run Command is a good default when you're already on EC2: no extra infrastructure to manage, every command audited in CloudTrail, IAM-controlled access, no open SSH, and no key rotation.


Putting it together

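Now we can trace the full picture, with every service in the right place: the backend launches an EC2 instance with an IAM role attached, UserData bootstraps it and reads the user's secrets from SSM Parameter Store, and Run Command pushes updates for the rest of the instance's life.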

Each AWS service did exactly one job:

| Service | Role |
|---|---|
| EC2 | Provided the isolated server per user. |
| UserData | Bootstrapped the server on first boot, no human needed. |
| IAM Role | Gave the server scoped permission to read its own secrets and nothing else. |
| SSM Parameter Store | Stored the secrets, encrypted, hierarchically organized. |
| SSM Run Command | Pushed updates and config changes across the fleet without SSH. |

The thing I want you to take away is that no single service is doing all the work. EC2 just rents you a Linux box. SSM just stores key-value pairs. The power is in how they compose: an IAM role glues an EC2 instance to the right SSM path, UserData uses that connection on first boot, and Run Command keeps using it for the rest of the instance's life. That is production AWS: small services wired together with IAM permissions.

What I'd do differently

A few honest reflections from shipping this:

| Lesson | What I'd change |
|---|---|
| Bake more into the AMI | UserData was the right starting point, but the bootstrap script got slow (~90s of npm install per launch). Baking dependencies into a custom AMI would cut that to ~15s and make "launching..." feel instant. |
| Cache parameters at boot | Each instance hits SSM only at startup, but if you have hot paths that read parameters often, build an in-memory cache with a TTL. SSM has API rate limits, and a thousand instances polling for the same value is a foot-gun. |
| Tag everything | Past me regretted every untagged resource. Tags are free and searchable, they are how Run Command targets instances, and they make your billing breakdown legible. Tag with user ID, environment, version, owner, and any other dimension you expect to filter on. |
| Treat IAM like code from day one | Don't click through the IAM console. Write roles and policies in Terraform, CloudFormation, or CDK from the start. They multiply fast, and versioned policies are the difference between knowing your permissions and hoping they are still sane. |
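
The "in-memory cache with a TTL" from the second lesson can be as small as the sketch below. It is a generic illustration rather than code from the system: the class name is made up, the clock is injectable so it can be tested, and the caller is expected to fall back to an SSM read whenever get returns undefined.

```typescript
// Minimal in-memory cache whose entries expire after ttlMs milliseconds.
class TtlCache<T> {
  private readonly entries = new Map<string, { value: T; expiresAt: number }>();
  private readonly ttlMs: number;
  private readonly now: () => number;

  // `now` is injectable so tests can control time; it defaults to the wall clock.
  constructor(ttlMs: number, now: () => number = () => Date.now()) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  // Returns the cached value, or undefined if missing or expired.
  get(key: string): T | undefined {
    const entry = this.entries.get(key);
    if (entry === undefined) return undefined;
    if (this.now() >= entry.expiresAt) {
      this.entries.delete(key); // drop stale entries lazily
      return undefined;
    }
    return entry.value;
  }

  set(key: string, value: T): void {
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
  }
}

// Usage with a fake clock: the value is served until the TTL elapses.
let fakeTime = 0;
const cache = new TtlCache<string>(60_000, () => fakeTime);
cache.set("/openclaw/users/0xabc/telegram-bot-token", "cached-token");
fakeTime = 59_999;
console.log(cache.get("/openclaw/users/0xabc/telegram-bot-token")); // cached-token
fakeTime = 60_000;
console.log(cache.get("/openclaw/users/0xabc/telegram-bot-token")); // undefined
```

Keep the TTL short for secrets that users can change, since a cached value stays in use until it expires.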

Further reading

If this was useful, start by wiring up one small version of this: launch an instance, give it a narrow IAM role, read one parameter from SSM, and run one command without SSH. Once that works, the bigger system is just more of the same pieces.